Hello,
I have a data set with some 52,000 cases and three continuous variables to cluster. How can I calculate (or are there empirical data on) how much working memory to allocate? Of course, one answer I already know is: as much as possible. The special workspace memory limit is set to 6,148 KB; I doubled the workspace to 12,296 KB. Same negative result.

To get around the obstacle I reduced the span of the data by (a) recoding the data into deciles and (b) rounding the data, which are very fine-grained (they are the dimensions of a preceding factor analysis). Same negative result. I then excluded all cases with missing data before starting the procedure, which reduces the total sample size to some 47,000 cases. Same result. Finally I reduced the sample size to 20% of the original. This worked.

But are there other ways to exploit the whole diversity of the data, in particular ways that avoid sampling? And if there is no other way than to reduce the sample size by sampling, how do I translate the types found by the cluster procedure to all the non-sampled cases? TIA -ftr
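For concreteness, the steps described above look roughly like this in syntax. This is only an untested sketch; v1, v2 and v3 stand in for the three factor dimensions, and the decile variable names are made up.

* Raise the workspace limit (only honoured by procedures that use SET WORKSPACE).
SET WORKSPACE=12296.

* Exclude cases with missing values on the clustering variables up front.
SELECT IF (NMISS(v1, v2, v3) = 0).
EXECUTE.

* Coarsen the fine-grained factor dimensions into deciles.
RANK VARIABLES=v1 v2 v3
  /NTILES(10) INTO v1dec v2dec v3dec
  /PRINT=NO.

* Last resort: keep only a 20% random sample.
SET SEED=12345.
SAMPLE .20.
EXECUTE.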
The workspace setting only affects some procedures. With a large dataset, consider using TWOSTEP CLUSTER, which is designed for large datasets and does not use the WORKSPACE setting.
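A minimal TWOSTEP CLUSTER sketch along those lines, assuming three continuous variables named v1 to v3 (substitute your own names) and letting the procedure pick the number of clusters automatically with BIC. It is untested; check the Command Syntax Reference for your release, especially the SAVE subcommand.

* Two-step clustering; memory use is governed by MEMALLOCATE (in MB), not WORKSPACE.
TWOSTEP CLUSTER
  /CONTINUOUS VARIABLES=v1 v2 v3
  /DISTANCE LIKELIHOOD
  /NUMCLUSTERS AUTO 15 BIC
  /HANDLENOISE 0
  /MEMALLOCATE 64
  /PRINT IC COUNT SUMMARY
  /SAVE VARIABLE=TSC_cluster.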
In reply to this post by ftr public
Note that with 52,000 cases the program will attempt to build a proximity matrix with approximately 1,351,974,000 elements (52,000 × 51,999 / 2), and that is assuming it stores only the lower (or upper) triangle. At 8 bytes per proximity that would be on the order of 10 GB. Good luck with that!
In reply to this post by Jon Peck
I forgot to mention that I tried TWOSTEP CLUSTER, which resulted in two types: one small type representing under a tenth of the cases, and a gigantic second type with all the other cases. So that did not help me much, and I tried CLUSTER. What about drawing a 10% sample, running CLUSTER on it, and using the types found as starting points for a k-means clustering?
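In syntax, the sampled Ward's run might look something like the following untested sketch (v1 to v3 and the 4-cluster cut are placeholders):

* Work on a copy so the full file stays intact, and draw a 10% sample.
DATASET COPY ward_sample.
DATASET ACTIVATE ward_sample.
SET SEED=20170124.
SAMPLE .10.
EXECUTE.

* Ward's method on squared Euclidean distances; save, e.g., the 4-cluster solution.
* The saved variable is CLU4_1; the default agglomeration-schedule output will be long.
CLUSTER v1 v2 v3
  /METHOD WARD
  /MEASURE SEUCLID
  /SAVE CLUSTER(4).

* Cluster means on the sample, to serve as candidate starting centres for k-means.
MEANS TABLES=v1 v2 v3 BY CLU4_1
  /CELLS=MEAN COUNT.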
That is a popular strategy. Ward's method in CLUSTER is often used for this. Another method to consider is KNN (Analyze > Classify > Nearest Neighbor), at least if you have a target variable.
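Carrying that over to the full file, a hedged sketch of the k-means step: the numbers on INITIAL are placeholders to be replaced by the Ward cluster means from the sample, listed cluster by cluster in the order of the clustering variables, and "full_data" is whatever the full dataset is called.

* k-means on the full data set, seeded with the Ward centroids from the sample.
DATASET ACTIVATE full_data.
QUICK CLUSTER v1 v2 v3
  /INITIAL=(-0.8  0.2  1.1
             0.5 -0.4  0.3
             1.9  0.7 -1.2
            -0.3  1.4  0.6)
  /CRITERIA=CLUSTER(4) MXITER(20)
  /METHOD=KMEANS(NOUPDATE)
  /PRINT=INITIAL
  /SAVE=CLUSTER(km_cluster).

That classifies all 47,000-odd cases while the starting points still reflect the structure found by Ward's method on the sample.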
In reply to this post by ftr public
Coarsening your continuous variables will not alter the number of similarity/proximity/distance coefficients. However, if you temporarily coarsen your variables to five or six values (perhaps using RANK), do CROSSTABS show lumpiness?

Did you use the AUTO option in TWOSTEP? Please post your syntax. Are the variables reasonably uncorrelated?

To get a handle on your data I suggest that you (1) generate a series of purely random variables, (2) select a few randomly chosen sets of 500 or so cases using one of those variables, and (3) do 3-D scatterplots of each set. Do there appear to be some lumps and clear spaces when you rotate the plots and look at them from different perspectives? Also try TWOSTEP with the cases sorted on the completely random variables, and compare AIC vs. BIC as the criterion. If you do histograms of the continuous variables, what do the univariate distributions look like? (A rough syntax sketch of this exploration follows below.)

It would also help list members help you if you were to say what kind of entities the cases are, and what your continuous variables are.
Art Kendall
Social Research Consultants
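A rough, untested sketch of that exploration (v1 to v3 are placeholder names, and the FROM value on SAMPLE should be at least the actual case count):

* A purely random variable to serve as a shuffling/selection key; sorting on it
* also randomizes case order, which TWOSTEP can be sensitive to.
SET SEED=98765.
COMPUTE rnd1 = RV.UNIFORM(0,1).
EXECUTE.
SORT CASES BY rnd1.

* Temporarily draw roughly 500 cases and look at them in a 3-D scatterplot.
TEMPORARY.
SAMPLE 500 FROM 47000.
GRAPH
  /SCATTERPLOT(XYZ)=v1 WITH v2 WITH v3.

* Univariate distributions.
FREQUENCIES VARIABLES=v1 v2 v3
  /FORMAT=NOTABLE
  /HISTOGRAM.

* Coarsened versions for a lumpiness check in CROSSTABS.
RANK VARIABLES=v1 v2 v3
  /NTILES(6) INTO c1 c2 c3
  /PRINT=NO.
CROSSTABS /TABLES=c1 BY c2 BY c3.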
"What kind of entities" and why? - I have seen very few cluster results that were worth doing, judging by the results. In particular, it is usually necessary to /describe/ the clusters in terms of ANOVA between clusters, in terms of their variables, and then, for their association with some criterion. Criterion-wise, the
clusters themselves seem to add just about nothing to what you have if you start out with the MANOVA in
the first place. - For a criterion that is categorical, the MANOVA can be a simple discriminant function.
"Reasonably uncorrelated" is ambiguous. With zero /association/ (not just linear correlation), you have no basis for clusters. With too much linear correlation between two or three variables, you might as well
start out with a composite score and chop the result into Low, Middle, High.
Factor Analysis is what I found to be useful as the general tool. Available variance is partitioned into "factors", which may be individually scored, like, into Low, Middle, High; and cross-tab two or three factors to see if the groupings make sense (by whatever sense). What is more obvious for Factoring than for
Clustering is that you are very much better off if you start out with "entities" (variables) that are decently scaled -- That is, you probably want to use transformations to remove extreme skewing, and you further want to do something about "outliers" that remain. (Drop them? Or, do they form a "cluster" that is useful?)
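A minimal sketch of that composite-score alternative, with placeholder names, standardizing first since the variables may be on different scales:

* DESCRIPTIVES /SAVE creates standardized copies Zv1, Zv2, Zv3.
DESCRIPTIVES VARIABLES=v1 v2 v3 /SAVE.
COMPUTE composite = MEAN(Zv1, Zv2, Zv3).
RANK VARIABLES=composite
  /NTILES(3) INTO comp3
  /PRINT=NO.
VALUE LABELS comp3 1 'Low' 2 'Middle' 3 'High'.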
Perhaps I read the OP wrong and was wrong that the intent is to cluster cases.

Many clusterings of cases have used variables that were scores on scales whose keys were based on a factor analysis of variables using varimax rotation. Often an item would be retained on a new variable if it loaded cleanly and exceeded a rotated loading of abs(.4) or better, and unit weights were often used. As a result, any observed correlation among the score variables should be small. The distances in Euclidean space are more useful if the three axes are orthogonal.

I asked about the three continuous variables because their meaning (labels) would be useful in thinking about the whole effort. For example, it may be that z-scores would be just as informative. It would also clarify why there would be missing data. I asked what the entities were that constituted the cases, again because understanding what the whole effort is for has often helped me help clients.

Cluster analysis techniques can be thought of as creating a new nominal-level variable that describes bunches/heaps/piles of cases that have profiles in common. The most common kind of clustering is of cases. The general idea is to find groupings of cases such that each case is close to the centroid of its own cluster and far from the centroids of other clusters. Q factor analysis of doubly centered raw data was an old-time way of clustering cases. Whereas an R factor analysis is done on a matrix that is number-of-variables by number-of-variables, Q factor analysis and other forms of cluster analysis are done on a matrix that is number-of-cases by number-of-cases.

Discriminant function classification phases can be useful in cleaning up cluster assignments (a syntax sketch follows below). Rarely would I rely on a single run of a single heuristic approach; I tended to use the consensus of several similarity measures among profiles and several agglomeration methods.
Art Kendall
Social Research Consultants
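For what it is worth, a hedged sketch of that classification phase, which also speaks to the earlier question of carrying cluster membership over to the non-sampled cases. It assumes the sampled cases carry a cluster code ward4 (values 1 to 4) and all other cases are missing on it; every name here is a placeholder.

* Linear discriminant functions estimated from the clustered (sampled) cases.
* CLASSIFY=NONMISSING scores every case with valid v1-v3, including the unsampled
* ones, and SAVE=CLASS writes the predicted cluster for all of them.
DISCRIMINANT
  /GROUPS=ward4(1,4)
  /VARIABLES=v1 v2 v3
  /METHOD=DIRECT
  /PRIORS=SIZE
  /STATISTICS=TABLE
  /CLASSIFY=NONMISSING POOLED
  /SAVE=CLASS(pred_cluster).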