I have a problem with my two step cluster results. It appears that they are highly dependent on the way the cases are sorted. With the same variables I have two different cluster solutions. The same happened with other datasets with the same variables.
Looking at the literature, I found the following: “The results may depend on the input order of cases. Therefore, SPSS (2001: 2) recommends to use a random order.” (https://lists.sunysb.edu/index.cgi?A3=ind0607&L=CLASS-L&E=base64&P=597803&B=------%3D_NextPart_000_00FA_01C6B495.146555B0&T=application%2Fpdf;%20name=%22bacher-wenzig-vogler_spss-twostep.pdf%22&N=bacher-wenzig-vogler_spss-twostep.pdf&attachment=q)
And: “Warning: The final solution may depend on the order of the cases in the file. To minimize the effect, arrange the cases in random order. Sort them by the last digit of their ID numbers or something similar” (http://www.norusis.com/pdf/SPC_v19.pdf)
So I created a random variable, but the same problem occurred, because I can only sort the cases in ascending or descending order, either by the case identification number or by the random variable I created. Could anyone advise me how to sort the cases randomly, or in any other way as suggested above? Thank you |
If you create a variable that contains
random values, sorting in ascending or descending order based on those
random values results in essentially random sort order.
Note: To replicate results, set the seed (SET SEED) prior to generating the random values.
Rick Oliver Senior Information Developer IBM Business Analytics (SPSS) E-mail: [hidden email]
From: silvsoriano <[hidden email]> To: [hidden email] Date: 11/17/2014 12:51 PM Subject: How to sort cases randomly instead of ascending or descending order? - two step cluster Sent by: "SPSSX(r) Discussion" <[hidden email]>
-- View this message in context: http://spssx-discussion.1045642.n5.nabble.com/How-to-sort-cases-randomly-instead-of-ascending-or-descending-order-two-step-cluster-tp5727968.html Sent from the SPSSX Discussion mailing list archive at Nabble.com.
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Thank you for the reply.
The problem I still find in my two step cluster exploration is that sorting the cases in ascending versus descending order (by the random variable) produces two different results, as you can see here: for_spss_list.docx What would explain why sorting the cases in ascending or descending order by a random variable produces these two different results? Is there any way to sort cases, other than ascending or descending, that avoids this problem? Thank you again. |
Yes, the sort order will always affect
the result, even if it's random. The purpose of the random sort is to eliminate
any effects that might be caused by some meaningful sort order, particularly
data sorted by the value of any of the inputs used in the model.
Rick Oliver Senior Information Developer IBM Business Analytics (SPSS) E-mail: [hidden email] |
Thank you. So what about the suggestion to "Sort them by the last digit of their ID numbers or something similar"? Is it possible to do that? Is any syntax available? Are there only two ways of sorting cases (ascending or descending)? Thank you. |
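For readers wondering what the quoted suggestion would look like in syntax, here is a minimal sketch. It assumes a numeric identifier variable called id, which is a hypothetical name; lastdigit is likewise just an illustrative name.

```spss
* Sketch only: id is a hypothetical numeric case identifier.
COMPUTE lastdigit = MOD(id, 10).
SORT CASES BY lastdigit (A).
```

Note that this yields only ten distinct sort values, so it gives an arbitrary order rather than a truly random one, and the sort itself is still ascending or descending.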
Any arbitrary sorting order should be sufficient.
There is no particular advantage to sorting by the last two (or first two, or middle two, or whatever) digits of an ID variable, and you'll still have to sort in ascending or descending order. This should remove any unwanted effects of sort order:
set rng mc seed 123456789. /*if you want to replicate later.
compute randvar=rv.uniform(1,1000).
sort cases by randvar.
delete variables randvar. /*optional, you can keep it if you want.
twostep cluster...
Regardless of what scheme you use to randomly reorder the cases, the Twostep cluster results will always differ when the order of cases changes, even when the order has no meaning. It's the nature of the algorithm.
Rick Oliver Senior Information Developer IBM Business Analytics (SPSS) E-mail: [hidden email]
From: silvsoriano <[hidden email]> To: [hidden email] Date: 11/17/2014 01:57 PM Subject: Re: How to sort cases randomly instead of ascending or descending order? - two step cluster Sent by: "SPSSX(r) Discussion" <[hidden email]> |
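One way to see how much the solution moves with case order is to cluster under two different random orderings and cross-tabulate the saved memberships. The following is a sketch only: the seeds, the names randA, randB, clusA and clusB, and the input variables cat1, cat2, num1 and num2 are all hypothetical, and it assumes the SAVE subcommand of TWOSTEP CLUSTER for storing cluster membership.

```spss
* Sketch only: seeds and all variable names below are hypothetical.
SET RNG=MC SEED=123456789.
COMPUTE randA = RV.UNIFORM(0, 1).
SORT CASES BY randA.
TWOSTEP CLUSTER
  /CATEGORICAL VARIABLES=cat1 cat2
  /CONTINUOUS VARIABLES=num1 num2
  /DISTANCE LIKELIHOOD
  /NUMCLUSTERS AUTO 15 BIC
  /SAVE VARIABLE=clusA.

SET RNG=MC SEED=987654321.
COMPUTE randB = RV.UNIFORM(0, 1).
SORT CASES BY randB.
TWOSTEP CLUSTER
  /CATEGORICAL VARIABLES=cat1 cat2
  /CONTINUOUS VARIABLES=num1 num2
  /DISTANCE LIKELIHOOD
  /NUMCLUSTERS AUTO 15 BIC
  /SAVE VARIABLE=clusB.

* Compare the two solutions case by case.
CROSSTABS /TABLES=clusA BY clusB /CELLS=COUNT.
```

If most cases fall into a few large cells of the cross-tabulation, the two solutions broadly agree; if cases scatter across many cells, the clustering is sensitive to case order.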
Thank you Rick!
Silvia |
In reply to this post by SSiSSa
First, are you sure you are using the exact same syntax?
Secondly, results can differ depending on the order in which the cases are presented to TWOSTEP. The less actual clustering there is among the cases, the more likely the results are to be simply the product of random noise. Please describe what your variables are, what your cases are, how they were selected, how many cases you have, etc. Are your variables reasonably uncorrelated with each other? What are their levels of measurement? What are you trying to achieve by clustering? I do not have access to SPSS at the moment, but doesn't TWOSTEP still produce AIC or BIC for different numbers of clusters?
Art Kendall
Social Research Consultants |
Hi there,
Some time ago I asked about appropriate clustering techniques for my exploration (I am using a UK survey and selecting the group I am exploring). We decided to use only variables that reflect my theoretical approach to this specific group. My variables are shown in the model here: http://spssx-discussion.1045642.n5.nabble.com/file/n5727970/for_spss_list.docx E.g. age of respondent, housing tenure, marital status, age when becoming a mother, etc. There are nine in total. How would you stop the random noise? I am using BIC. At the moment I am not using syntax, so the problem is not explained by it. Thank you |
It is surprising that you are using advanced techniques without learning about syntax.
The GUI (menus etc.) starts off as great "training wheels". As you get beyond a brief intro and into the basics, the GUI is a real help in drafting your syntax. When you use the GUI it generates syntax that you can <paste> into your syntax window. Then you can tell whether you are giving the same commands. Does the BIC for different numbers of clusters look meaningfully different when you sort the data differently? What are the values that your variables can take? How many cases do you have? How was the survey done? How were these cases selected? Are you implicitly trying to compare/contrast some respondents with other respondents? Without a clear understanding of what you have and what you are trying to do, it is impossible to give valid advice on what would help you reach your goals. "Noise" refers to the degree to which your model (in this instance the cluster solution) does NOT fit the variability in the data.
Art Kendall
Social Research Consultants |
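To look at the BIC for each candidate number of clusters, one option is to request the information criterion table in the output and run the same command under each sort order. This is a sketch only: it assumes the IC keyword of the PRINT subcommand of TWOSTEP CLUSTER (which applies when NUMCLUSTERS uses AUTO), and cat1, cat2, num1 and num2 are hypothetical variable names.

```spss
* Sketch only: cat1, cat2, num1 and num2 are hypothetical variable names.
TWOSTEP CLUSTER
  /CATEGORICAL VARIABLES=cat1 cat2
  /CONTINUOUS VARIABLES=num1 num2
  /DISTANCE LIKELIHOOD
  /NUMCLUSTERS AUTO 15 BIC
  /PRINT IC COUNT SUMMARY.
```

If the BIC values for, say, 2, 3 and 4 clusters are close to one another, a small change in case order may be enough to tip the automatic choice from one number of clusters to another.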
Hello, my syntax is below, but I prefer to explore by clicking because this syntax is provisional and is not the problem.
TWOSTEP CLUSTER
  /CATEGORICAL VARIABLES=WHITELONEMOTHERS THREECHILDRENORMORE SINGLENEVERMARRIED NSECMJ10LESS3 OWNEROCCUPIERS BINARYAGEYOUNGESTCHILD HIGHEREDUCATIONORDEGREE
  /CONTINUOUS VARIABLES=AGE AGEWHENBECOMINGMOTHER
  /DISTANCE LIKELIHOOD
  /NUMCLUSTERS AUTO 15 BIC
  /HANDLENOISE 0
  /MEMALLOCATE 64
  /CRITERIA INITHRESHOLD(0) MXBRANCH(8) MXLEVEL(3)
  /VIEWMODEL DISPLAY=YES.
I want to understand why the number of clusters differs so much depending on how the cases are sorted (which was my main reason for asking this question), e.g. when sorting in ascending order by the random variable the model gives 4 clusters, but when I sort in descending order the model gives 2 clusters. The results are here (http://spssx-discussion.1045642.n5.nabble.com/file/n5727970/for_spss_list.docx). I am using the UK Labour Force Survey, which has a representative sample of the UK population. My group is lone mothers and I want to classify them with an intersectional approach (i.e. multiple inequalities). There is a lot of literature on the choice of variables, so each variable was carefully selected. Thank you. |
Or maybe the noise handling in the syntax is the problem....
What do you recommend for handling the noise in this model? THANK YOU |
In reply to this post by SSiSSa
If you are sure that your syntax for generating a variable with random numbers is correct, that you sorted on that variable, and that you used exactly the same syntax both times, then the only conclusion is that there are no clear patterns (clusters, profiles) in your dataset.
As it stands, your data do not appear to have more pattern than if you had generated a set of random numbers and then tried to cluster them. There are no definitive modes in multivariate space as defined by your variables. Some of your variables appear to be dichotomies. Do you have much variation in each of your variables? Please explain in more detail what you mean by "an intersectional approach (i.e. multiple inequalities)".
Art Kendall
Social Research Consultants |
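The questions about variation and correlation can be checked with a few descriptive commands. This is a minimal sketch using the variable names from the TWOSTEP CLUSTER syntax posted earlier in this thread; which statistics are most informative is a judgment call.

```spss
* Sketch only: variable names are copied from the syntax posted in this thread.
FREQUENCIES VARIABLES=WHITELONEMOTHERS THREECHILDRENORMORE SINGLENEVERMARRIED
    NSECMJ10LESS3 OWNEROCCUPIERS BINARYAGEYOUNGESTCHILD HIGHEREDUCATIONORDEGREE.
DESCRIPTIVES VARIABLES=AGE AGEWHENBECOMINGMOTHER
  /STATISTICS=MEAN STDDEV MIN MAX.
CORRELATIONS /VARIABLES=AGE AGEWHENBECOMINGMOTHER.
```

A dichotomy where nearly all cases fall in one category contributes little to separating clusters, and two strongly correlated continuous inputs carry largely the same information.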