2014-04-23

I'm using SPSS to perform two-step cluster analyses. SPSS reports the predictor importance of each variable used in an analysis. Often, a binary variable like gender (sorry, I'm just keeping it simple!) turns out to be the most important variable in the formation of the clusters, even when you don't want it to be.

Is there a way to weight variables, so that maybe I can downplay, but not eliminate, gender's role in the analysis?

Thank you for the help!


All replies
2014-4-23 02:21:44
One thing to keep in mind before turning to weights is that gender can act as a "swamping" variable in a two-step cluster analysis. Differences between the genders are often large, and thus overpower weaker, but still substantively interesting, heterogeneity in your data.

Instead of down-weighting gender, you could consider a finite mixture regression model. Finite mixture models are a model-based form of cluster analysis (clusters are usually assumed to be multivariate Gaussian), and a finite mixture regression model essentially combines a cluster analysis with a regression. In your case, you could use gender as a predictor, perform this analysis, and detect clusters while taking into account the predictive power of gender (as well as of other variables of interest). More information can be found in the documentation of the flexmix R package.
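To make the finite mixture regression idea concrete, here is a minimal hand-rolled EM sketch in Python with NumPy rather than flexmix; the toy data, two-component assumption, and random initialization are all illustrative choices, not part of the original answer:

```python
import numpy as np

def mixture_of_regressions(X, y, n_components=2, n_iter=300, seed=0):
    """Minimal EM for a finite mixture of linear regressions.

    Each component k has its own coefficients beta_k, noise variance
    sigma2_k, and mixing weight pi_k.  The E-step computes Gaussian
    responsibilities; the M-step runs weighted least squares.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    resp = rng.dirichlet(np.ones(n_components), size=n)  # soft random init
    betas = np.zeros((n_components, d))
    sigma2 = np.ones(n_components)
    pis = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # M-step: weighted least squares for each component.
        for k in range(n_components):
            w = resp[:, k]
            Xw = X * w[:, None]
            betas[k] = np.linalg.solve(X.T @ Xw, Xw.T @ y)
            r = y - X @ betas[k]
            sigma2[k] = max(float((w * r ** 2).sum() / w.sum()), 1e-8)
            pis[k] = w.mean()
        # E-step: responsibilities from Gaussian log-densities.
        logp = np.stack(
            [np.log(pis[k]) - 0.5 * np.log(2 * np.pi * sigma2[k])
             - 0.5 * (y - X @ betas[k]) ** 2 / sigma2[k]
             for k in range(n_components)], axis=1)
        logp -= logp.max(axis=1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
    return betas, pis

# Toy data: two latent groups with opposite regression lines.
rng = np.random.default_rng(42)
n = 400
x = rng.uniform(-2, 2, n)
group = rng.integers(0, 2, n)            # latent cluster label
y = np.where(group == 0, 1 + 2 * x, -1 - 2 * x) + rng.normal(0, 0.3, n)
X = np.column_stack([np.ones(n), x])     # intercept + predictor

betas, pis = mixture_of_regressions(X, y)
```

The recovered component slopes should land near +2 and -2, i.e. the clusters are found while each component's regression absorbs the predictor's effect, which is the point of using this instead of down-weighting a variable.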

2014-4-23 02:27:03
I want to assign different weights to the variables in my cluster analysis, but my program (Stata) doesn't seem to have an option for this, so I need to do it manually.

Imagine 4 variables A, B, C, D. The weights for those variables should be

w(A)=50%
w(B)=25%
w(C)=10%
w(D)=15%
I am wondering whether one of the following two approaches would actually do the trick:

1. First standardize all variables (e.g. by their range), then multiply each standardized variable by its weight, then do the cluster analysis.
2. Multiply all variables by their weights and standardize them afterwards, then do the cluster analysis.

Or are both ideas complete nonsense?

The clustering algorithms I wish to try (three different ones) are k-means, weighted-average linkage, and average linkage. I plan to use weighted-average linkage to determine a good number of clusters, which I will then plug into k-means.
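The two-stage plan above (hierarchical linkage to pick k, then k-means) can be sketched in Python with SciPy instead of Stata; the toy data and the "look for a jump in merge heights" heuristic are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(1)
# Toy stand-in for 4 standardized, weighted variables: two clear groups.
X = np.vstack([rng.normal(0.0, 0.1, (30, 4)),
               rng.normal(1.0, 0.1, (30, 4))])

# Stage 1: weighted-average linkage (WPGMA).  A large jump in the
# merge distances suggests where to cut the tree; here the final merge
# sits far above all earlier ones, pointing to k = 2.
Z = linkage(X, method="weighted")
print(Z[-3:, 2])  # last three merge heights

# Stage 2: run k-means with the k chosen from the dendrogram.
k = 2
centroids, labels = kmeans2(X, k, minit="++", seed=1)
```

With well-separated toy groups like these, the two stages agree: the dendrogram's last merge height dwarfs the rest, and k-means with k = 2 recovers the two groups.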

2014-4-23 02:28:11
One way to assign a weight to a variable is by changing its scale. The trick works for the clustering algorithms you mention, viz. k-means, weighted-average linkage and average-linkage.

Kaufman, Leonard, and Peter J. Rousseeuw. "Finding groups in data: An introduction to cluster analysis." (2005) - page 11:

The choice of measurement units gives rise to relative weights of the variables. Expressing a variable in smaller units will lead to a larger range for that variable, which will then have a large effect on the resulting structure. On the other hand, by standardizing one attempts to give all variables an equal weight, in the hope of achieving objectivity. As such, it may be used by a practitioner who possesses no prior knowledge. However, it may well be that some variables are intrinsically more important than others in a particular application, and then the assignment of weights should be based on subject-matter knowledge (see, e.g., Abrahamowicz, 1985).

On the other hand, there have been attempts to devise clustering techniques that are independent of the scale of the variables (Friedman and Rubin, 1967). The proposal of Hardy and Rasson (1982) is to search for a partition that minimizes the total volume of the convex hulls of the clusters. In principle such a method is invariant with respect to linear transformations of the data, but unfortunately no algorithm exists for its implementation (except for an approximation that is restricted to two dimensions). Therefore, the dilemma of standardization appears unavoidable at present and the programs described in this book leave the choice up to the user.

Abrahamowicz, M. (1985), The use of non-numerical a priori information for measuring dissimilarities, paper presented at the Fourth European Meeting of the Psychometric Society and the Classification Societies, 2-5 July, Cambridge (UK).

Friedman, H. P., and Rubin, J. (1967), On some invariant criteria for grouping data, J. Amer. Statist. Assoc., 62, 1159-1178.

Hardy, A., and Rasson, J. P. (1982), Une nouvelle approche des problèmes de classification automatique, Statist. Anal. Données, 7, 41-56.
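A tiny numeric illustration of the scale-as-weight trick described above: rescaling a (standardized) variable by a factor w multiplies its contribution to the squared Euclidean distance by w², which is what distance-based algorithms such as k-means and the linkage methods respond to. The two example points are made up:

```python
import numpy as np

# Two points described by two standardized variables.
a = np.array([0.0, 1.0])
b = np.array([1.0, 3.0])

# Unweighted squared Euclidean distance: 1 + 4 = 5.
d2_unweighted = float(np.sum((a - b) ** 2))

# Rescale the second variable by w = 0.5: its contribution shrinks
# from 4 to 0.25 * 4 = 1, so the distance becomes 1 + 1 = 2.
w = 0.5
aw, bw = a.copy(), b.copy()
aw[1] *= w
bw[1] *= w
d2_weighted = float(np.sum((aw - bw) ** 2))

print(d2_unweighted, d2_weighted)
```

Note the quadratic effect: halving a variable's scale quarters its share of the squared distance, so weights chosen for interpretability may need to be square-rooted before rescaling, depending on what "weight" is meant to mean.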

2014-4-23 02:29:06
Yes, approach 1 is the right one; it corresponds to what Kaufman and Rousseeuw say in the paragraphs quoted in the answer above. Approach 2 would be useless, as the standardization removes the weights :)
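This is easy to verify numerically. The sketch below (random toy data, z-score standardization as one concrete choice) shows that standardize-then-weight leaves each column's spread proportional to its weight, while weight-then-standardize cancels the weights entirely:

```python
import numpy as np

rng = np.random.default_rng(0)
# Four raw variables on deliberately different scales.
X = rng.normal(size=(100, 4)) * np.array([1.0, 10.0, 100.0, 5.0])
weights = np.array([0.50, 0.25, 0.10, 0.15])

def standardize(M):
    """Z-score each column (one possible standardization)."""
    return (M - M.mean(axis=0)) / M.std(axis=0)

# Approach 1: standardize first, then apply weights -> each column's
# standard deviation equals its weight, so the weights survive.
X1 = standardize(X) * weights

# Approach 2: weight first, then standardize -> the standardization
# divides each column by its own std, cancelling the weights; every
# column ends up with std 1.
X2 = standardize(X * weights)

print(X1.std(axis=0))   # proportional to the weights
print(X2.std(axis=0))   # all ones: weights removed
```

Any distance-based clustering run on X1 will therefore feel the intended weights, whereas on X2 all four variables contribute equally again.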

Franck Dernoncourt
