Hi Listers,

I am trying to run a cluster analysis on a data set of 90+ records with 10 variables. All variables are categorical:

1 - Yes
2 - No
88 - Not sure

I've used Hierarchical Cluster Analysis with the Between-Groups Linkage method and Squared Euclidean Distance. I got only two clusters; see the frequency results below.

Is there any way to increase the number of clusters?

Thanks in advance.
Boreak

This email is intended solely for the named addressee. If you are not the addressee indicated, please delete it immediately.
Under Hierarchical Cluster Analysis, go to Statistics, then to Cluster Membership; that is where you set your preferences for the number of clusters.
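For readers working outside SPSS, the same idea (cutting the dendrogram at a requested number of clusters instead of accepting a two-cluster view) can be sketched in Python with scipy. The data here are made-up stand-ins for the 90+ records described above, and the settings mirror the ones reported in the question, not a recommendation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Hypothetical stand-in data: 90 cases x 10 items coded 1/2/88.
X = rng.choice([1, 2, 88], size=(90, 10)).astype(float)

# Between-groups (average) linkage on squared Euclidean distance,
# mirroring the SPSS settings described in the original post.
Z = linkage(X, method="average", metric="sqeuclidean")

# The analogue of SPSS's "Cluster membership" preference: cut the
# tree into a requested maximum number of clusters.
labels = fcluster(Z, t=4, criterion="maxclust")
print(len(np.unique(labels)))  # at most 4 clusters
```

Note that `fcluster` with `criterion="maxclust"` returns *up to* `t` clusters; if the tree merges two groups at a much lower height than everything else, you can still get fewer.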
Mark Webb
Line +27 (21) 786 4379
Cell +27 (72) 199 1000 [Poor reception]
Fax +27 (86) 260 1946
Skype tomarkwebb
Email [hidden email]
Client ftp http://targetlinkresearch.co.za/cftp/

On 2011/11/17 08:04 AM, Boreak Silk wrote:
In reply to this post by Boreak Silk
Squared Euclidean distance doesn't make much sense for categorical variables. It seems to me that your variables are at least ordered, but in that case "Not sure" should fall between "Yes" and "No", not be coded 88.

Dr. Paul R. Swank, Children's Learning Institute
Professor, Department of Pediatrics, Medical School
Adjunct Professor, School of Public Health
University of Texas Health Science Center-Houston

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Boreak Silk
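Paul's point about squared Euclidean distance on these codes can be made concrete. In this Python sketch (made-up respondents, scipy's distance functions standing in for SPSS's), the code 88 makes "Not sure" look enormously far from "No", while a simple matching-style dissimilarity treats every mismatch equally:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three hypothetical respondents on 3 items coded 1=Yes, 2=No, 88=Not sure.
X = np.array([
    [1, 2, 1],
    [2, 2, 1],    # differs from case 0 on one item (Yes vs No)
    [88, 2, 1],   # differs from case 0 on one item (Yes vs Not sure)
])

# Squared Euclidean distance is driven by the arbitrary numeric codes:
sq = squareform(pdist(X, metric="sqeuclidean"))
print(sq[0, 1], sq[0, 2])  # 1.0 vs 7569.0; (1-2)^2 vs (1-88)^2

# A matching-type dissimilarity counts mismatched items, so both
# pairs are equally dissimilar (one mismatch out of three items):
match = squareform(pdist(X, metric="hamming"))
print(match[0, 1], match[0, 2])
```

With squared Euclidean distance, the 88 code alone can dominate the whole solution, which is one reason a two-cluster split (roughly "has 88s" vs "doesn't") is a plausible outcome.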
Hello Andrés. When you say you tried to find the best predictors from among 10 continuous variables, does that mean you used some kind of stepwise selection method? If so, you should take a look at the following links, both of which discuss problems with stepwise selection:
http://www.stata.com/support/faqs/stat/stepwise.html
http://os1.amc.nl/mediawiki/images/Babyak_-_overfitting.pdf

Figure 3 in the article at the second link also shows that for logistic regression, one should have at least 10 or 15 (preferably 15) events per predictor to avoid over-fitting.

Regarding your question about why all cases are predicted to be in category 1, I suspect this is because of the value of the cut point used for classification (probably .5). I know that the cut point can be changed for the LOGISTIC REGRESSION command (see the Help for the /CRITERIA sub-command). This option may also exist for GENLIN, but I don't know where it is.

HTH.
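The cut-point effect is easy to demonstrate outside SPSS. In this Python sketch, the data are simulated and scikit-learn's logistic regression stands in for GENLIN / LOGISTIC REGRESSION; with a rare outcome, the default .5 cut point sends essentially every case to the majority class, while a lower cut point (what /CRITERIA CUT changes in SPSS) recovers some predicted events:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=(n, 1))

# Hypothetical rare outcome (~3% events), loosely related to x.
p = 1 / (1 + np.exp(-(-4.0 + 1.0 * x[:, 0])))
y = rng.binomial(1, p)

model = LogisticRegression().fit(x, y)
prob = model.predict_proba(x)[:, 1]

# Default cut point of .5: nearly every case lands in the majority class.
default_pred = (prob >= 0.5).astype(int)

# Lowering the cut point (here to the observed event rate) changes that.
low_pred = (prob >= y.mean()).astype(int)
print(default_pred.sum(), low_pred.sum())
```

The model itself is unchanged between the two lines; only the rule that converts predicted probabilities into predicted categories differs.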
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/
"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).
I just noticed that I received this response off-list. (Andrés, notice what it says in my signature file about not checking my hotmail account regularly.)
--- Start of off-list response ---
Thank you very much. I didn't use any stepwise method. When I put all the variables in the model, only three are significant. My sample size is big enough; I have more than 1000 cases. I'm not sure if this is affecting the model, but my DV is 1 for 97% of the cases and 0 for 3% of the cases.
Kindly,
Andrés
--- End of off-list response ---

How many more than 1000 cases? If n = 1000, the number of events you have is 3% x 1000 = 30. The Babyak article on over-fitting will tell you that for logistic regression (I'm assuming you chose a binomial error distribution, so you have a logistic regression model), the number of events per variable (EPV) should be 15-20. That would mean you can accommodate at most 2 explanatory variables. If you include 10, you are severely over-fitting your model. (It's like fitting a linear regression model with only 3 or 4 data points: you can do it, but you won't have much confidence in the line that is fitted.)

Second, it's not clear to me whether your final model had 3 or 10 variables in it. In either case, deleting predictors that are not significant is a bad practice. Look for "Lack of insignificant variables in the final model" here: http://biostat.mc.vanderbilt.edu/wiki/Main/ManuscriptChecklist

Finally, I'm not surprised that all cases are predicted to be in category 1, given that 97% of them *are* in category 1. I think you'd need an awfully strong predictor variable to see anything else.

HTH.
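The events-per-variable arithmetic above can be written out as a quick check. The 15 EPV figure is the conservative end of the 15-20 guideline cited from Babyak; everything else follows from the numbers in the post:

```python
# Events-per-variable (EPV) check for a rare binary outcome.
n_cases = 1000        # "more than 1000 cases" -- lower bound from the post
event_rate = 0.03     # DV = 1 for 97%, so events (DV = 0) are 3%
epv_target = 15       # conservative end of the 15-20 EPV guideline

n_events = round(n_cases * event_rate)    # 30 events
max_predictors = n_events // epv_target   # at most 2 predictors
print(n_events, max_predictors)
```

With 10 candidate predictors against roughly 30 events, the model is fitting about 3 events per variable, far below the guideline.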
A technical point: eliminating insignificant predictors will bias the estimates of the remaining predictors if the eliminated variables are correlated with those that remain, and the degree of bias is a function of the size of those correlations.
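This omitted-variable bias is easy to see in a small simulation. The Python sketch below uses ordinary least squares with made-up coefficients purely to show the direction and size of the effect: with true coefficients of 1 on both predictors and corr-inducing weight 0.7, dropping the second predictor inflates the first coefficient toward 1 + 0.7 = 1.7.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Two correlated predictors, both with true coefficient 1 (hypothetical setup).
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)   # x2 is correlated with x1
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

# Full model: the estimate for x1 is close to its true value, 1.0.
X_full = np.column_stack([np.ones(n), x1, x2])
b_full = np.linalg.lstsq(X_full, y, rcond=None)[0]

# Drop x2: x1's coefficient absorbs the omitted, correlated predictor,
# landing near 1 + 0.7 * 1 = 1.7 rather than 1.0.
X_red = np.column_stack([np.ones(n), x1])
b_red = np.linalg.lstsq(X_red, y, rcond=None)[0]

print(b_full[1], b_red[1])
```

If x1 and x2 were uncorrelated (weight 0 instead of 0.7), dropping x2 would leave x1's estimate unbiased, which is exactly the dependence on correlation described above.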
Paul

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Bruce Weaver
Sent: Tuesday, November 22, 2011 5:01 PM
To: [hidden email]
Subject: Re: Significant parameters in GLM but all predicted categories are the same

ANDRES ALBERTO BURGA LEON wrote:
>> Hello to all:
>>
>> I am trying to find the best predictors (from among 10 continuous variables) for a dichotomous criterion.
>>
>> I have run a GLM model using a logit link function, and I get 3 significant predictors.
>>
>> When I save the predicted category, all cases get 1 (no case is predicted to be 0 using these three variables).
>>
>> I was trying to make sense of these results; I mean, how to interpret that all cases are predicted to the same category.
>>
>> Is it something like: the model is useful for discriminating among positive values of ln(p/q), but not negative values?
>>
>> Kindly,
>> Andrés
>>
>> PS: Sorry for my English
>>
>> Mg. Andrés Burga León
>> Coordinador de Análisis e Informática
>> Unidad de Medición de la Calidad Educativa (UMC)
>> Ministerio de Educación del Perú
>> Av. de la Arqueología s/n (cuadra 2)
>> Lima 41
>> Perú
>> Teléfono 615-5840 / 615-5800 anexo 1212
>> http://www2.minedu.gob.pe/umc/

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L. For a list of commands to manage subscriptions, send the command INFO REFCARD.
=====================