Dear All,
I am trying to run a cluster analysis on 305 cases with 44 variables. All 44 variables are binary nominal data (coded 1 or 0). Could you please suggest which cluster analysis method would be suitable for such data?

Thank you.

Kuramura

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Hi
IBM SPSS offers TwoStep, K-Means, and hierarchical cluster analysis. Of these, TwoStep or hierarchical cluster analysis would be appropriate here.

On Thu, Feb 23, 2012 at 8:15 PM, Kuramura <[hidden email]> wrote:
> Dear All, [snip]

Rajesh M S |
Administrator
|
In reply to this post by Kuramura
Note that SPSS CLUSTER provides a HUGE number of distance measures (26 of which appear in the dropdown as appropriate for binary data) and seven different clustering methods. Pretty much impossible to recommend anything with simply the information that the variables are nominal.
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
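To illustrate why the choice among binary measures matters, here is a minimal sketch in plain Python (not SPSS syntax) that compares two well-known coefficients, simple matching and Jaccard, on the same pair of cases. The function names and the two example cases are hypothetical and chosen only for illustration; they are not taken from the original poster's data.

```python
def match_counts(x, y):
    """Return (a, b, c, d): the counts of 1/1, 1/0, 0/1 and 0/0 pairs."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a, b, c, d


def simple_matching(x, y):
    """Share of variables on which the two cases agree (joint 0s count)."""
    a, b, c, d = match_counts(x, y)
    return (a + d) / (a + b + c + d)


def jaccard(x, y):
    """Share of joint 1s among variables where at least one case has a 1."""
    a, b, c, _ = match_counts(x, y)
    # Assumed convention: two all-zero cases get similarity 0.0.
    return a / (a + b + c) if (a + b + c) else 0.0


# Two hypothetical cases that share no endorsed item but many joint absences.
case_1 = [1, 0, 0, 0, 0, 0, 0, 0]
case_2 = [0, 1, 0, 0, 0, 0, 0, 0]

sm = simple_matching(case_1, case_2)  # 6 agreements out of 8 variables
jc = jaccard(case_1, case_2)          # no joint 1s among the 2 relevant variables

print(f"simple matching = {sm:.2f}, Jaccard = {jc:.2f}")
```

The same pair looks fairly similar under simple matching and completely dissimilar under Jaccard, so whether joint absences (0/0) should count as agreement is a substantive question about the data, not a software default.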
I have always seen more benefit, for my data, in using factor analysis instead of cluster analysis.

Dichotomous items raise some problems for factoring, and those problems do not disappear for clustering. In particular, how extreme an item's proportion is limits how large its correlation with another item can be. Similar limits or problems exist for other distance measures. Because of that, if you do a factor analysis with 44 correlated 0/1 variables, the factors will tend to break out according to the item means.

I have had data where I said, "That's okay. I will run a factor analysis on the 44 variables and derive 15 to 20 factors with 2 or 3 items each; score the 15-20 factors as simple totals of their items; and carry out a new factor analysis on the 15-20 totals in order to obtain definitions for 4 or 5 new totals." Then the 5 new scores would be my covariates.

If I were going to do a cluster analysis, I would take those same steps so that I could use the reduced scores for the clustering.

-- Rich Ulrich

> Date: Thu, 23 Feb 2012 10:51:08 -0800
> From: [hidden email]
> Subject: Re: Cluster analysis for binary data
> To: [hidden email]
>
> [snip, previous] |
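The remark above about extreme proportions can be made concrete. For two 0/1 items with endorsement proportions p and q, the Pearson correlation (phi) is largest when every case endorsing the rarer item also endorses the more common one. A minimal sketch of that ceiling follows; the function name `max_phi` and the example proportions are illustrative assumptions, not something taken from the thread.

```python
import math


def max_phi(p, q):
    """Largest attainable phi between two 0/1 items with proportions p and q.

    Both proportions must lie strictly between 0 and 1. The maximum is
    reached when the rarer item is endorsed only by cases that also
    endorse the more common item.
    """
    lo, hi = min(p, q), max(p, q)
    return math.sqrt(lo * (1 - hi) / (hi * (1 - lo)))


# Hypothetical item proportions, for illustration only.
for p, q in [(0.5, 0.5), (0.1, 0.5), (0.05, 0.5), (0.1, 0.9)]:
    print(f"p={p:.2f}, q={q:.2f}: max phi = {max_phi(p, q):.3f}")
```

Items with equal proportions can correlate perfectly, while an item endorsed by 10% of cases cannot correlate above roughly 0.33 with an item endorsed by 50%. That ceiling is one reason factors of 0/1 items tend to group items with similar means.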
Hector,
First, Principal Components is worth looking at for the first step, since one purpose is to include all the items. PCA generally produces more factors than PFA, and it is more likely to include all the items that have extreme proportions (and therefore, generally, smaller covariances).

As to scoring factors: I'm not sure that I follow what you are suggesting to weight, but here is my reaction. I have generally created scores as the simple sum or average of items, to take advantage of "length of scale" for creating a robust score. A scale with 10 items, but with three of them weighted heavily, will have the generally lower reliability that you would expect from a 3- or 4-item scale. A scale with 10 equally weighted items is expected to be more reliable.

On the other hand, that rule is not hard-and-fast. A binary item that is rarely endorsed will have low variance, and perhaps should count for more. So that is one exception. The other major exception is mainly for something like an overall Total composite, when the selection of items seems unbalanced relative to the sense that we are trying to derive from the scale. An example: if there turn out to be three sub-scales, with 20 items, 6 items, and 6 items, I might argue for creating the Total as the average of the 3 sub-scale average-item scores, rather than using the average of the 32 items.

-- Rich Ulrich

From: [hidden email]
To: [hidden email]; [hidden email]
Subject: RE: Cluster analysis for binary data
Date: Fri, 24 Feb 2012 00:43:59 -0300

Rich, I’ve done on occasion something similar (factor analysis of binary data, then adding the factor scores), but with a twist: I weighted the factor scores according to the contribution of each factor to total explained variance (in my case 100% of the variance was “explained” because I used Principal Components, but this is not the point here). Thus a minor factor explaining, say, 3% of the variance would receive less weight than the first factor, which explains perhaps 40%. What do you think of such an approach?
Hector [snip, previous] |
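Rich's example of three sub-scales with 20, 6, and 6 items can be worked through numerically. The sketch below is plain Python with a made-up respondent; the function names and the response pattern are assumptions for illustration only. It contrasts the average of all 32 items with the average of the three sub-scale averages.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def total_from_items(subscales):
    """Total as the average of every item, ignoring sub-scale membership."""
    items = [value for scale in subscales for value in scale]
    return mean(items)


def total_from_subscales(subscales):
    """Total as the average of the sub-scale average-item scores."""
    return mean([mean(scale) for scale in subscales])


# Hypothetical respondent: endorses all 20 items of the first sub-scale
# and none of the 6 items in each of the other two.
respondent = [[1] * 20, [0] * 6, [0] * 6]

item_total = total_from_items(respondent)          # 20 endorsed out of 32 items
subscale_total = total_from_subscales(respondent)  # (1 + 0 + 0) / 3

print(f"average of 32 items:       {item_total:.3f}")
print(f"average of 3 sub-scales:   {subscale_total:.3f}")
```

With the item average, the 20-item sub-scale dominates the Total (0.625 here); with the average of sub-scale averages, each sub-scale contributes one third (0.333 here), which is the balancing effect described in the example.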
In reply to this post by Kuramura
Kuramura,
You need to look for an appropriate dissimilarity coefficient. Jochen Bacher published a 196-page set of lecture notes on cluster analysis from the 2002 ZA Spring Seminar at Cologne University, which also explains the pros and cons of the different dissimilarity coefficients. You can find the legitimately downloadable text at http://www.clusteranalyse.net/sonstiges/zaspringseminar2002/lecturenotes.pdf

HTH

Dr Frank Thomas
FTR Internet Research
Rosny-sous-Bois
France

On 23/02/2012 15:45, Kuramura wrote:
> [snip, previous] |