Hi Listers,

I am trying to run a cluster analysis on a data set of 90+ records with 10 variables. All variables are categorical:

1 - Yes
2 - No
88 - Not sure

I've used Hierarchical Cluster Analysis with the Between-Groups Linkage method and Squared Euclidean Distance. I got only two clusters; see the frequency results below.

Is there any way to increase the number of clusters?

Thanks in advance.
Boreak

This email is intended solely for the named addressee. If you are not the addressee indicated, please delete it immediately.
Under Hierarchical Cluster Analysis, go to Statistics, then to Cluster Membership; that is where you set your preferences for the number of clusters.
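For readers working outside SPSS, the same idea (cutting the dendrogram at a requested number of clusters instead of accepting a two-cluster view) can be sketched in Python with scipy. The data here are made-up stand-ins for the 90+ records described above, and the settings mirror the ones reported in the question, not a recommendation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Hypothetical stand-in data: 90 cases x 10 items coded 1/2/88.
X = rng.choice([1, 2, 88], size=(90, 10)).astype(float)

# Between-groups (average) linkage on squared Euclidean distance,
# mirroring the SPSS settings described in the original post.
Z = linkage(X, method="average", metric="sqeuclidean")

# The analogue of SPSS's "Cluster membership" preference: cut the
# tree into a requested maximum number of clusters.
labels = fcluster(Z, t=4, criterion="maxclust")
print(len(np.unique(labels)))  # at most 4 clusters
```

Note that `fcluster` with `criterion="maxclust"` returns *up to* `t` clusters; if the tree merges two groups at a much lower height than everything else, you can still get fewer.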
Mark Webb
Line +27 (21) 786 4379
Cell +27 (72) 199 1000 [Poor reception]
Fax +27 (86) 260 1946
Skype tomarkwebb
Email [hidden email]
Client ftp http://targetlinkresearch.co.za/cftp/

On 2011/11/17 08:04 AM, Boreak Silk wrote:
In reply to this post by Boreak Silk
Squared Euclidean distance doesn't make much sense for categorical variables. It seems to me that your variables are at least ordered, but in that case "Not sure" should fall between "Yes" and "No", not be coded 88.

Dr. Paul R. Swank, Children's Learning Institute
Professor, Department of Pediatrics, Medical School
Adjunct Professor, School of Public Health
University of Texas Health Science Center-Houston

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Boreak Silk
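Paul's point about squared Euclidean distance on these codes can be made concrete. In this Python sketch (made-up respondents, scipy's distance functions standing in for SPSS's), the code 88 makes "Not sure" look enormously far from "No", while a simple matching-style dissimilarity treats every mismatch equally:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three hypothetical respondents on 3 items coded 1=Yes, 2=No, 88=Not sure.
X = np.array([
    [1, 2, 1],
    [2, 2, 1],    # differs from case 0 on one item (Yes vs No)
    [88, 2, 1],   # differs from case 0 on one item (Yes vs Not sure)
])

# Squared Euclidean distance is driven by the arbitrary numeric codes:
sq = squareform(pdist(X, metric="sqeuclidean"))
print(sq[0, 1], sq[0, 2])  # 1.0 vs 7569.0; (1-2)^2 vs (1-88)^2

# A matching-type dissimilarity counts mismatched items, so both
# pairs are equally dissimilar (one mismatch out of three items):
match = squareform(pdist(X, metric="hamming"))
print(match[0, 1], match[0, 2])
```

With squared Euclidean distance, the 88 code alone can dominate the whole solution, which is one reason a two-cluster split (roughly "has 88s" vs "doesn't") is a plausible outcome.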
Hello Andrés. When you say you tried to find the best predictors from among 10 continuous variables, does that mean you used some kind of stepwise selection method? If so, you should take a look at the following links, both of which discuss problems with stepwise selection:
http://www.stata.com/support/faqs/stat/stepwise.html
http://os1.amc.nl/mediawiki/images/Babyak_-_overfitting.pdf

Figure 3 in the article at the second link also shows that for logistic regression, one should have at least 10 or 15 (preferably 15) events per predictor to avoid over-fitting.

Regarding your question about why all cases are predicted to be in category 1, I suspect this is because of the value of the cut point used for classification (probably .5). I know that the cut point can be changed for the LOGISTIC REGRESSION command (see the Help for the /CRITERIA sub-command). This option may also exist for GENLIN, but I don't know where it is.

HTH.
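The cut-point effect is easy to demonstrate outside SPSS. In this Python sketch, the data are simulated and scikit-learn's logistic regression stands in for GENLIN / LOGISTIC REGRESSION; with a rare outcome, the default .5 cut point sends essentially every case to the majority class, while a lower cut point (what /CRITERIA CUT changes in SPSS) recovers some predicted events:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=(n, 1))

# Hypothetical rare outcome (~3% events), loosely related to x.
p = 1 / (1 + np.exp(-(-4.0 + 1.0 * x[:, 0])))
y = rng.binomial(1, p)

model = LogisticRegression().fit(x, y)
prob = model.predict_proba(x)[:, 1]

# Default cut point of .5: nearly every case lands in the majority class.
default_pred = (prob >= 0.5).astype(int)

# Lowering the cut point (here to the observed event rate) changes that.
low_pred = (prob >= y.mean()).astype(int)
print(default_pred.sum(), low_pred.sum())
```

The model itself is unchanged between the two lines; only the rule that converts predicted probabilities into predicted categories differs.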
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/
"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).
I just noticed that I received this response off-list. (Andrés, notice what it says in my signature file about not checking my hotmail account regularly.)
--- Start of off-list response ---
Thank you very much. I didn't use any stepwise method. When I put all the variables in the model, only three are significant. My sample size is big enough; I have more than 1000 cases. I'm not sure if this is affecting the model, but my DV is 1 for 97% of the cases and 0 for 3% of the cases.
Kindly,
Andrés
--- End of off-list response ---

How many more than 1000 cases? If n = 1000, the number of events you have is 3% x 1000 = 30. The Babyak article on over-fitting will tell you that for logistic regression (I'm assuming you chose a binomial error distribution, so you have a logistic regression model), the number of events per variable (EPV) should be 15-20. That would mean you can accommodate at most 2 explanatory variables. If you include 10, you are severely over-fitting your model. (It's like fitting a linear regression model with only 3 or 4 data points: you can do it, but you won't have much confidence in the line that is fitted.)

Second, it's not clear to me whether your final model had 3 or 10 variables in it. In either case, deleting predictors that are not significant is a bad practice. Look for "Lack of insignificant variables in the final model" here: http://biostat.mc.vanderbilt.edu/wiki/Main/ManuscriptChecklist

Finally, I'm not surprised that all cases are predicted to be in category 1, given that 97% of them *are* in category 1. I think you'd need an awfully strong predictor variable to see anything else.

HTH.
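The events-per-variable arithmetic above can be written out as a quick check. The 15 EPV figure is the conservative end of the 15-20 guideline cited from Babyak; everything else follows from the numbers in the post:

```python
# Events-per-variable (EPV) check for a rare binary outcome.
n_cases = 1000        # "more than 1000 cases" -- lower bound from the post
event_rate = 0.03     # DV = 1 for 97%, so events (DV = 0) are 3%
epv_target = 15       # conservative end of the 15-20 EPV guideline

n_events = round(n_cases * event_rate)    # 30 events
max_predictors = n_events // epv_target   # at most 2 predictors
print(n_events, max_predictors)
```

With 10 candidate predictors against roughly 30 events, the model is fitting about 3 events per variable, far below the guideline.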
A technical point: eliminating insignificant predictors will bias the estimates of the remaining predictors if the eliminated variables are correlated with those that remain, and the degree of bias is a function of the size of those correlations.
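This omitted-variable bias is easy to see in a small simulation. The Python sketch below uses ordinary least squares with made-up coefficients purely to show the direction and size of the effect: with true coefficients of 1 on both predictors and corr-inducing weight 0.7, dropping the second predictor inflates the first coefficient toward 1 + 0.7 = 1.7.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Two correlated predictors, both with true coefficient 1 (hypothetical setup).
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)   # x2 is correlated with x1
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

# Full model: the estimate for x1 is close to its true value, 1.0.
X_full = np.column_stack([np.ones(n), x1, x2])
b_full = np.linalg.lstsq(X_full, y, rcond=None)[0]

# Drop x2: x1's coefficient absorbs the omitted, correlated predictor,
# landing near 1 + 0.7 * 1 = 1.7 rather than 1.0.
X_red = np.column_stack([np.ones(n), x1])
b_red = np.linalg.lstsq(X_red, y, rcond=None)[0]

print(b_full[1], b_red[1])
```

If x1 and x2 were uncorrelated (weight 0 instead of 0.7), dropping x2 would leave x1's estimate unbiased, which is exactly the dependence on correlation described above.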
Paul

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Bruce Weaver
Sent: Tuesday, November 22, 2011 5:01 PM
To: [hidden email]
Subject: Re: Significant parameters in GLM but all predicted categories are the same

ANDRES ALBERTO BURGA LEON wrote:
>> Hello to all:
>>
>> I am trying to find the best predictors (from among 10 continuous variables) for a dichotomous criterion.
>>
>> I have run a GLM model using a logit link function, and I get 3 significant predictors.
>>
>> When I save the predicted category, all cases get 1 (no case is predicted to be 0 using these three variables).
>>
>> I was trying to make sense of these results; I mean, how to interpret that all cases are predicted to the same category.
>>
>> Is it something like: the model is useful for discriminating among positive values of ln(p/q), but not negative values?
>>
>> Kindly,
>> Andrés
>>
>> PS: Sorry for my English
>>
>> Mg. Andrés Burga León
>> Coordinador de Análisis e Informática
>> Unidad de Medición de la Calidad Educativa (UMC)
>> Ministerio de Educación del Perú
>> Av. de la Arqueología s/n (cuadra 2)
>> Lima 41
>> Perú
>> Teléfono 615-5840 / 615-5800 anexo 1212
>> http://www2.minedu.gob.pe/umc/

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L. For a list of commands to manage subscriptions, send the command INFO REFCARD.
=====================