SPSS k-means empty cluster strategy

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

SPSS k-means empty cluster strategy

Kirill Orlov
I have a question to someone who knows for sure; maybe an SPSS team developer come down to respond...

In k-means clustering, there can be numerous ways to cope with the case when an empty cluster appears on an iteration. QUICK CLUSTER in spss presumable does either (1) or (2):
1) Discard the empty cluster and reset K to the number of nonempty clusters for further iterations.
2) Leave the empty cluster its previous centroid so far in hope that it will pick points on future iterations.

SPSS Algorithms document does not describe it, the empty cluster management, and the output does not allow to induce whether (1) or (2) is used unless the empty cluster could indeed pick points later and you know it - the data situation which is very difficult to simulate, in order to be able to check.

That's why I'm asking directly somebody who knows QUICK CLUSTER internals: how is it there?

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: SPSS k-means empty cluster strategy

Jon Peck
Kirill,

Here is an answer from the Statistics development team.

It sticks with the number of groups you tell it to use, even if one winds up with no cases (so #2). This is implicit in the algorithms description.  You can verify this by creating a data set with the values 1,2,3,8,9,10 and an initial centers file with values 2,5,9 and running it with three clusters specified. It produces a solution with three clusters but one is empty.

On Fri, Jan 20, 2017 at 6:33 AM, Kirill Orlov <[hidden email]> wrote:
I have a question to someone who knows for sure; maybe an SPSS team developer come down to respond...

In k-means clustering, there can be numerous ways to cope with the case when an empty cluster appears on an iteration. QUICK CLUSTER in spss presumable does either (1) or (2):
1) Discard the empty cluster and reset K to the number of nonempty clusters for further iterations.
2) Leave the empty cluster its previous centroid so far in hope that it will pick points on future iterations.

SPSS Algorithms document does not describe it, the empty cluster management, and the output does not allow to induce whether (1) or (2) is used unless the empty cluster could indeed pick points later and you know it - the data situation which is very difficult to simulate, in order to be able to check.

That's why I'm asking directly somebody who knows QUICK CLUSTER internals: how is it there?

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD



--
Jon K Peck
[hidden email]

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: SPSS k-means empty cluster strategy

Kirill Orlov
Jon, Thank you very much for asking the team.
I believe that what the command does is #2 startegy.
However, the only reliable way to prove it is to have en example data where an empty cluster ceases be empty later.

Does the team have such an example?


21.01.2017 1:37, Jon Peck пишет:
Kirill,

Here is an answer from the Statistics development team.

It sticks with the number of groups you tell it to use, even if one winds up with no cases (so #2). This is implicit in the algorithms description.  You can verify this by creating a data set with the values 1,2,3,8,9,10 and an initial centers file with values 2,5,9 and running it with three clusters specified. It produces a solution with three clusters but one is empty.

On Fri, Jan 20, 2017 at 6:33 AM, Kirill Orlov <[hidden email]> wrote:
I have a question to someone who knows for sure; maybe an SPSS team developer come down to respond...

In k-means clustering, there can be numerous ways to cope with the case when an empty cluster appears on an iteration. QUICK CLUSTER in spss presumable does either (1) or (2):
1) Discard the empty cluster and reset K to the number of nonempty clusters for further iterations.
2) Leave the empty cluster its previous centroid so far in hope that it will pick points on future iterations.

SPSS Algorithms document does not describe it, the empty cluster management, and the output does not allow to induce whether (1) or (2) is used unless the empty cluster could indeed pick points later and you know it - the data situation which is very difficult to simulate, in order to be able to check.

That's why I'm asking directly somebody who knows QUICK CLUSTER internals: how is it there?

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD



--
Jon K Peck
[hidden email]

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD


===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: SPSS k-means empty cluster strategy

Kirill Orlov
In reply to this post by Jon Peck
My investigation of the matter conclusion:

SPSS QUICK CLUSTER goes way #1, not way #2. It closes empty cluster and reduces k. In the table of results, SPSS "displays" the empty cluster as if it lingers on. It however, will remain empty forever: it is imaginary.

To demonstrate both strategies #1 and #2 I use my k-means clustering function which has an option to select either one or the other. The function is given in the end of the mail.

*Data.

data list list /v1 v2 v3 v4.
begin data
 -5.3719   6.8275  -2.4715  -2.9269
   .2489    .9947   1.6518   -.7582
  -.6734   6.2760   2.1313  -1.4891
  7.0904   4.9132   4.3320    .6370
 -4.8910   3.3054    .6933  -2.2904
 -5.2235   4.4024  -4.7882   -.9235
 -6.6057   3.1523  -3.5771   1.9346
  1.1405   3.8492   3.9094   1.9281
 -2.0834   4.0908  -3.8790   -.9108
 -5.3272  -5.8114  -4.0793  -1.8701
  -.5324   5.0289  -1.5181   2.1175
  -.1765   5.5228    .1666  -2.1917
   .5143   4.3835   -.7475  -1.1784
 -5.0602   5.4136  -2.0116  -1.1643
 -5.6256  -2.8274    .3236  -2.1148
 -7.4521  -1.1892  -8.7782  -1.4719
 -2.8088   5.1122   1.4035  -3.3286
  -.9397   2.6754    .4425    .9759
 -2.1202   2.2543    .0631   8.0546
 -5.0650  -3.1739  -4.4108   2.3357
 -2.2947   7.4541   -.6578  -1.1082
  -.7374   4.6378   2.6841   6.6423
 -3.2455   2.2238    .7113  -2.2192
  -.8054   2.8557   2.1528   2.9120
   .6183   1.2614   1.1844   2.4262
  1.4211   5.4323   3.2233   -.6277
 -2.0253   6.4139  -2.9089   -.7994
 -1.9517   4.8544  -2.0280   3.6886
 -5.3311   5.1372  -2.2791  -2.0547
 -6.5082   1.4453   3.6264   -.2660
 -2.6869   3.8685   3.8642    .3965
 -6.1176   -.0562  -7.1008  -1.9151
   .5911   4.9496   5.7457  -2.9711
 -1.8626   1.7992    .9262   -.5677
 -3.7868   2.6043  -3.1591   1.4017
 -3.7373   2.3847  -4.8227  -3.0691
 -5.8313  -1.8897  -3.8046  -1.1344
 -4.2784   5.5208   -.8193  -3.4886
 -3.5518   1.2854   1.8671   8.8924
 -2.4243   2.8768  -4.4541  -2.3716
 -6.4837   2.0786  -4.1860    .4368
 -1.1392   -.4814   3.6461   -.1291
 -5.2734   3.1431  -3.1340  -2.3779
   .1394   6.8628   9.5687   -.2306
  -.1108   1.2527   4.7212   -.2919
 -5.8650   2.8461  -3.6460  -2.9794
 -5.5111   1.5783   -.7719   -.6472
 -5.4563   3.5221  -3.1965  -4.0272
 -6.6606    .3149  -6.0475  -1.8230
 11.4975   7.2059   9.0181  -1.6396
end data.


*Do k-means with strategy #1 (deletion of the empty cluster).
*The results will correspond to what SPSS does. Cluster 3 is deleted (= it will stay empty forever).

set mxloops 1000.
matrix.
get data /vari= v1 to v4.
comp ini= {1,2,3,3;
           2,2,8,3;
           4,2,0,15}. /*Bad initial centres (produce empty cluster immediately)
!kmeans(data%ini%5%1%dummy%fin%pre%dpre).
print /title 'RESULTS'.
print pre /title 'Classification centres'.
print fin /title 'Recalculated centroids'.
PRINT dummy.
save {dummy*t(1:ncol(dummy)),sqrt(dpre)} /out= * /vari= clu dpre.
end matrix.


*Do k-means with strategy #2 (leave empty cluster its latest centre in hope the cluster will pick cases in future).
*The results differ from what the previous does. Here, cluster 3, empty from the 1st classification, gains cases on from iteration 4.

set mxloops 1000.
matrix.
get data /vari= v1 to v4.
comp ini= {1,2,3,3;
           2,2,8,3;
           4,2,0,15}. /*Bad initial centres (produce empty cluster immediately)
!kmeans(data%ini%5%0%dummy%fin%pre%dpre).
print /title 'RESULTS'.
print pre /title 'Classification centres'.
print fin /title 'Recalculated centroids'.
PRINT dummy.
save {dummy*t(1:ncol(dummy)),sqrt(dpre)} /out= * /vari= clu dpre.
end matrix.


*------------------------------------------------.

*The function (macro, to run from within MATRIX).

******** K-MEANS CLUSTERING ********.
*/*!kmeans(data%ini%iter%empty%name1%name2%name3%name4)*/*.
*Version 1.
*Cluster analysis by k-means method.
*Takes scale data DATA (n cases x p variables) and clusters the cases in specified number of clusters k.
*Initial centres must be offered by the user.
*INI - matrix of initial centres (means) sized k clusters x p variables (columns in DATA).
*ITER - do this number of iterations (nonnegative integer scalar, e.g. 10). ITER=0 corresponds to
*classification without iterations, i.e. the assignment of cases to initial centres without improving of
*the latter.
*EMPTY - digit (not name or expression, you may optionally quote or apostrophe the digit) 0 or 1.
*It is specification of the way the problem of empty clusters is to be solved if such clusters emerge:
*"1" - get rid of empty cluster by reducing k; k will not be able to return to the initial value.
*"0" - leave empty cluster its previous centre, in hope that later it will pick cases (which may
*sometimes not happen).
*Empty clusters take place under bad initial centres and also with strongly discrete data.
*Results:
*NAME1 - the obtained cluster membership, in the form n x k binary matrix, i.e. k dummy variables
*corresponding to the clusters. Each case is assigned to one cluster: 1 stands in that column.
*You can create single categorical variable of cluster membership as: NAME1*t(1:ncol(NAME1)).
*NAME2 - k x p matrix of final, recalculated means: these are the centroids of NAME1 clusters.
*NAME3 - k x p  matrix of pre-final means: these are the latest centres to which cases were
*classified to.
*NAME4 - n x 1 column of squared euclidean distances from cases to centres NAME3, to which cases
*were last time assigned. For example, if a case according to NAME1 appeared to be
*put in cluster 3, then its distance NAME4 to the centre of that cluster is the distance to the centre
*which is the 3rd row in NAME3, for just based on that distance the case was assigned to it.
*This function may require prior setting of limit for number of cycles to enough big value
*by command SET MXLOOPS.

define !kmeans(!pos= !token(1) /!pos= !charend('%') /!pos= !charend('%') /!pos= !charend('%') /!pos= !charend('%')
              /!pos= !charend('%') /!pos= !charend('%') /!pos= !charend('%') /!pos= !charend(')'))
comp @data= !2.
comp !7= !3.
comp @k= nrow(!7).
comp @iter= !4.
!let !empty= !unquote(!5)
comp @ind= 0.
comp @cones= make(nrow(@data),1,1).
comp @rones= make(1,ncol(@data),1).
comp !6= make(nrow(@data),@k,0).
loop @it= 0 to @iter.
-comp !8= !7.
-loop @i= 1 to @k.
- comp !6(:,@i)= rssq(@data-@cones*!8(@i,:)).
-end loop.
-comp !9= rmin(!6).
-comp @ass= make(nrow(@data),1,0).
-loop @i= 1 to @k.
- comp !6(:,@i)= (!6(:,@i)=!9)&*(not @ass).
- comp @ass= @ass+!6(:,@i).
-end loop.
-comp @n= t(csum(!6)).
-do if all(@n).
- comp !7= t(!6)*@data/(@n*@rones).
-else.
  !if (!empty) !then
-  comp @ass= @n>0.
-  comp @nnz= csum(@ass).
-  comp @ind= make(1,@nnz+1,0).
-  comp @c= 1.
-  loop @i= 1 to @k.
-   comp @ind(@c)= @i.
-   comp @c= @c+@ass(@i).
-  end loop.
-  comp @ind= @ind(1:@nnz).
-  comp @n= @n(@ind).
-  comp !6= !6(:,@ind).
-  comp @k= @nnz.
-  comp !7= t(!6)*@data/(@n*@rones).
-  release @nnz,@c.
  !else
-  comp @ass= @n=0.
-  comp @ind= @ass*@rones&*!8.
-  comp @n= @n+@ass.
-  comp !7= t(!6)*@data/(@n*@rones).
-  comp !7= !7+@ind.
  !ifend
-end if.
end loop.
release @data,@k,@iter,@cones,@rones,@it,@i,@ass,@n,@ind.
!enddefine.

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: SPSS k-means empty cluster strategy

Kirill Orlov
In reply to this post by Jon Peck
P.S.
For better tracking what is is going on the iterations I've incerted some print-out commands into the function body, for your convenience.
So use this with the examples I've given in the previous email:.

define !kmeans(!pos= !token(1) /!pos= !charend('%') /!pos= !charend('%') /!pos= !charend('%') /!pos= !charend('%')
              /!pos= !charend('%') /!pos= !charend('%') /!pos= !charend('%') /!pos= !charend(')'))
comp @data= !2.
comp !7= !3.
comp @k= nrow(!7).
comp @iter= !4.
!let !empty= !unquote(!5)
comp @ind= 0.
comp @cones= make(nrow(@data),1,1).
comp @rones= make(1,ncol(@data),1).
comp !6= make(nrow(@data),@k,0).
loop @it= 0 to @iter.
-comp !8= !7.
-loop @i= 1 to @k.
- comp !6(:,@i)= rssq(@data-@cones*!8(@i,:)).
-end loop.
-comp !9= rmin(!6).
-comp @ass= make(nrow(@data),1,0).
-loop @i= 1 to @k.
- comp !6(:,@i)= (!6(:,@i)=!9)&*(not @ass).
- comp @ass= @ass+!6(:,@i).
-end loop.
-comp @n= t(csum(!6)).
PRINT @it.
PRINT @n /title 'Within cluster frequencies'.
-do if all(@n).
- comp !7= t(!6)*@data/(@n*@rones).
-else.
  !if (!empty) !then
-  comp @ass= @n>0.
-  comp @nnz= csum(@ass).
-  comp @ind= make(1,@nnz+1,0).
-  comp @c= 1.
-  loop @i= 1 to @k.
-   comp @ind(@c)= @i.
-   comp @c= @c+@ass(@i).
-  end loop.
-  comp @ind= @ind(1:@nnz).
-  comp @n= @n(@ind).
-  comp !6= !6(:,@ind).
-  comp @k= @nnz.
-  comp !7= t(!6)*@data/(@n*@rones).
-  release @nnz,@c.
  !else
-  comp @ass= @n=0.
-  comp @ind= @ass*@rones&*!8.
-  comp @n= @n+@ass.
-  comp !7= t(!6)*@data/(@n*@rones).
-  comp !7= !7+@ind.
  !ifend
-end if.
PRINT !8 /title 'Classification centres'.
PRINT !7 /title 'Recalculated centroids'.
end loop.
release @data,@k,@iter,@cones,@rones,@it,@i,@ass,@n,@ind.
!enddefine.

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: SPSS k-means empty cluster strategy

Kirill Orlov
In reply to this post by Jon Peck
I have just received another letter from Jon Peck who informs there that an SPSS developer confirms that QUICK CLUSTER indeed follows what I called "strategy #1": an empty cluster is aborted and won't be filled from that step (the cluster label is not erased from the output cluster centres table, however, it is shown as empty there).

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD