Hello everyone,
I have a question about using a cluster method with binary data. I have more than 3000 cases and around 8 binary variables (I recoded them as binary, based on my theoretical approach, so that I could use a hierarchical cluster method). My variables are: age, ethnicity, number of children, highest qualification, marital status, housing tenure, age of youngest child and economic activity. My methodological approach is based on exploring cases as having multiple inequalities. As a result, I am not interested in the strength or correlation of each variable in explaining the classification, but in discovering types of cases as a whole (e.g. a person of a given age, marital status and ethnicity is in a specific situation, whereas another person with different characteristics is in a different situation).
That will be the first stage of my analysis. The second stage covers 26 variables (the previous ones plus employment-related variables) and the same sample size. The variables in this stage are mixed (continuous, ordinal and categorical) and I have been advised to use the two-step cluster method. Would it be possible to use other cluster methods for the second stage? The problem I find with my variables is that, because many of them are based on employment characteristics, there is a strong cleavage between cases that work and cases that do not work. I am reluctant to use latent class analysis or factor analysis because my theoretical and methodological approach (complex/critical realism) is based on exploring the case as a whole rather than differentiating between the variables. Would you have any advice for me? Thank you in advance! |
As in many other situations, reducing variables to dichotomies throws away a lot of information.
Most clustering methods assume that the variables are reasonably uncorrelated. TWOSTEP is able to handle both categorical and continuous variables. Do you already have "situation" or is that what you are trying to determine? I have been doing clustering since 1972, but I have never seen explanatory and explained variables mixed in the same clustering. Or am I misreading your post? By "cases that work" do you mean employed vs. not employed?
Art Kendall
Social Research Consultants |
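A minimal sketch of the point about dichotomies, using made-up ages and an arbitrary cutoff of 45 (both are assumptions for illustration, not values from the thread). It shows how a binary recode can make two respondents who are one year apart look maximally different, while two who are fourteen years apart become identical.

```python
# Hypothetical ages for four respondents (illustrative values only).
ages = [31, 44, 45, 59]

def dichotomize(age, cutoff=45):
    """Recode age as 0 (below the cutoff) or 1 (cutoff and above)."""
    return 1 if age >= cutoff else 0

def abs_diff(values, i, j):
    """Absolute difference between respondents i and j on one variable."""
    return abs(values[i] - values[j])

binary = [dichotomize(a) for a in ages]

# Respondents 1 and 2 (ages 44 and 45): 1 year apart, yet opposite codes.
print(abs_diff(ages, 1, 2), abs_diff(binary, 1, 2))
# Respondents 2 and 3 (ages 45 and 59): 14 years apart, yet identical codes.
print(abs_diff(ages, 2, 3), abs_diff(binary, 2, 3))
```

Any clustering run on the recoded variable only sees the second number in each pair, so the choice of cutoff partly determines which cases can end up together.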
Thank you for your response,
I was planning to use binary variables so that I could use hierarchical cluster analysis, and I was also going to use the two-step cluster method to give more robust support for the choice of the number of groups. Based on that number of groups, I was then planning to analyse differences in employment characteristics and benefit claims, again using those two methods, so that the demographic features help me to understand those differences. Yes, I mean cases in employment and not in employment. I am not sure whether this is feasible; should I choose other methods? I was inclined toward clustering techniques because they differentiate groups of cases as a whole (considering all the variables at once) rather than dismembering variables and their explanatory strength in the analysis. Would this be feasible? Thank you for your help |
Please remember that Two-step cluster treats categorical variables,
including binary ones, as specifically nominal (read more at
http://stats.stackexchange.com/q/116856/3277). A hierarchical method
(or, say, a medoid method) is much more versatile with respect
to binary data.
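One way to read "more versatile" is that a hierarchical or medoid method works from a dissimilarity matrix, so you can choose how binary variables are compared. The sketch below contrasts two common choices, simple matching and Jaccard, on two hypothetical 8-variable profiles. The function names and the data are assumptions for illustration; which measure is appropriate depends on whether a shared 0 should count as similarity.

```python
def simple_matching_distance(a, b):
    """Share of variables on which two binary profiles disagree.
    Joint absences (0, 0) count as agreement, as with a nominal variable."""
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches / len(a)

def jaccard_distance(a, b):
    """Disagreement share after discarding joint absences (0, 0)."""
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return 0.0 if either == 0 else 1 - both / either

# Two hypothetical cases that share six "absent" attributes
# and differ on the only two attributes that are present.
p = [1, 0, 0, 0, 0, 0, 0, 0]
q = [0, 1, 0, 0, 0, 0, 0, 0]

print(simple_matching_distance(p, q))  # shared zeros make the cases look close
print(jaccard_distance(p, q))          # shared zeros ignored: maximally distant
```

If the 0 category is substantively meaningful (e.g. "not renting"), the symmetric measure is the natural one; if 0 only means "attribute absent", an asymmetric measure such as Jaccard may fit better.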
Also, the option to automatically determine the number of clusters in two-step is not a magic wand. This option attempts to derive the AIC or BIC clustering criterion halfway through the clustering process. That is really convenient in terms of speed and amount of work when the number of objects is very large, but it is less precise than assessing the best number of clusters after clustering is done, by comparing many solutions with different numbers of clusters.
(In reply to silvsoriano's post of 15.10.2014 9:49.)
=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD |
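A rough sketch of what "comparing many solutions with different numbers of clusters" can look like once clustering is done. It is not the two-step AIC/BIC procedure: it uses a small average-linkage agglomerative routine and the mean silhouette width as the comparison criterion, both chosen here only for illustration, on nine made-up points in three visible groups.

```python
import math
from itertools import combinations

def agglomerate(points, k):
    """Average-linkage agglomerative clustering, stopped at k clusters."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Average distance between all cross-cluster pairs of points.
            d = sum(math.dist(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            d /= len(clusters[a]) * len(clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]   # merge the closest pair (a < b)
        del clusters[b]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

def mean_silhouette(points, labels):
    """Mean silhouette width; singleton clusters score 0 by convention."""
    scores = []
    for i, p in enumerate(points):
        own = [j for j in range(len(points))
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(math.dist(p, points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(p, points[j])
                for j in range(len(points)) if labels[j] == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical data: three well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]

scores = {k: mean_silhouette(points, agglomerate(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

With real survey data the criterion will rarely point to one solution this cleanly, so the candidate solutions should also be compared for interpretability.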
The OP on that link is a little ambiguous. It sounds as if (s)he wants to cluster variables, at least at first.
Do you know if that is so?
Art Kendall
Social Research Consultants |
Thank you for the pieces of advice!
My cases belong to what is usually regarded as a 'disadvantage group' of people. The idea behind this analysis is to show the existence of different (multiple) disadvantages within this 'disadvantage group'. I have:
around 8 demographic variables (continuous, categorical (nominal) and binary, e.g. renting/not renting);
10 variables related to employment (continuous, categorical and binary); and
8 variables related to claiming benefits (nominal with two options: claiming or not claiming a particular benefit).
Does cluster analysis make sense for my project? Thank you Silvia |
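For a mix of continuous, nominal and binary variables like the one described above, one commonly used option (not something proposed in the thread itself) is a Gower-type dissimilarity, which can then be fed to a hierarchical or medoid method. The sketch below is a simplified version that treats binary variables as symmetric nominal ones. The variable names, values and the age range of 40 are all hypothetical.

```python
def gower_distance(a, b, kinds, ranges):
    """Simplified Gower dissimilarity between two cases.

    kinds[i]  -- "num" for a continuous variable, "cat" for nominal/binary
    ranges[i] -- observed range of a continuous variable (ignored for "cat")
    """
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "num":
            total += abs(x - y) / rng        # range-scaled absolute difference
        else:
            total += 0.0 if x == y else 1.0  # simple mismatch
    return total / len(a)

# Hypothetical variables: age, housing tenure, claims a given benefit (0/1).
kinds = ["num", "cat", "cat"]
ranges = [40, None, None]   # assumed: ages in the sample span 40 years

case_a = (30, "renting", 1)
case_b = (50, "renting", 0)

print(gower_distance(case_a, case_b, kinds, ranges))
```

Each variable contributes a value between 0 and 1, so no single variable type dominates purely because of its scale, although a block of many employment variables can still dominate by sheer number.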
silvsoriano,
You could try submitting your query to the Google Group MedStats. This is the home of the most notable statisticians, including Professor Martin Bland. Or you could google Martin Bland to find his website; I remember it having very useful entries, including one on cluster analysis. Whilst anyone (and I mean anyone) can view the content of MedStats with no strings attached, to post your query you will need to join MedStats (it's easy). One selling point of MedStats is that all of the discussion relating to your post is visible. And if you join, you will have access to all the archived material, which is a huge advantage. I can say all this because I founded MedStats :).
Best Wishes,
Martin
Martin P. Holt
Freelance Medical Statistician and Quality Expert
[hidden email]
Calvin Coolidge: "Persistence and Determination Alone are Omnipotent"
Albert Einstein: "My sense of God is my sense of wonder about the Universe"
Linked In: https://www.linkedin.com/profile/edit?trk=nav_responsive_sub_nav_edit_profile
(In reply to silvsoriano's post of Wednesday, 15 October 2014, 16:32.)
|
In reply to this post by SSiSSa
If I read your post correctly, you would not be able to compare/contrast the disadvantaged group with other groups.
Cluster analysis might be used to find out whether there are patterns in applying for benefits. How was the set of 8 benefits selected? Are the applications independent of each other, or do people go to one source where they could apply for any or all of the 8? Are there systematic differences in eligibility for the benefits, e.g., gender-linked, child-free vs. has children of certain ages, means-tested, etc.? Without a fuller understanding of your project, much else would be shooting from the hip.
Art Kendall
Social Research Consultants |
In reply to this post by SSiSSa
I'm not sure what your *ideal* results would look like, but it sounds to me as if it would take some luck for a clustering output to provide it. If Employed/Unemployed is inevitably going to dominate the results, one simple expedient is to perform separate analyses for those two groups. "Correspondence analysis" is what comes to mind if you want a set of features from outcomes (on the left side of the equation) to emerge which are associated with a set of features on the right side. If you already have a specified set of outcomes, then the Discriminant Function procedure might provide some nice plots with centroids on the first two or three functions.
--
Rich Ulrich
(In reply to silvsoriano's post of Wed, 15 Oct 2014 08:27:39 -0700, subject "Re: Cluster analysis expert needed".) |
In reply to this post by Art Kendall
Hi there,
My first research question is "Are there different groups within this disadvantage group?". Previous research has found different typologies, but without using a cluster analysis technique and without my current theoretical approach (intersectionality). I expect to use demographic variables in this stage (e.g. marital status, age, ethnicity, number of dependent children, etc.).
My second research question is "What are the different employment patterns and claims on UK benefits found in these different groups?". I expect to use employment variables (e.g. income, type of employment, reason for part-time job, underemployment, overtime) and benefit variables (e.g. child benefits, income support, housing benefits) in this stage.
To make things more complex, I am planning to analyse a time frame (2007-2013) to explore possible changes due to the context of recession/austerity. I am using a UK national survey that has big representative samples each year, so at some point I want to compare these types within my 'disadvantage' group with another group that is relevant for my research.
Does this make more sense? Is cluster analysis still appropriate? I really find cluster analysis the most appealing approach because of what it entails (exploring cases with multiple variables rather than the strength of specific variables). Thank you! |
It seems you have 3 sets of variables. Why not try to find patterns separately within each of the three?
It may be possible to find groups of cases separately in each set using the techniques that have been suggested, and then work with a variable representing agreement among the various techniques. In 1976, while at the US Census Bureau, I published a description of what I called "core clusters". However, I am in the midst of moving for my second retirement and have no idea where my files are. Since then, more exploratory/heuristic techniques such as TWOSTEP, correspondence analysis, CATPCA, etc. have been developed. Back then, clustering was not available in SPSS and was done in separate ad hoc programs; SPSS was used in developing the core cluster memberships.
At the most compact way of looking at your data, you could end up with 3 new variables operationalizing membership in groups with common profiles within each set. Other possibilities include questions such as "given a variable summarizing group membership on set A, how well do I know set B?"
I suggest that you work out your questions and post queries on the Classification Society mailing list (see http://www.classification-society.org/clsoc/clsoc.php). The people there specialize in multivariate methods of class (pattern) detection, class (pattern) recognition, multidimensional scaling, correspondence analysis (aka dual scaling), etc. There are many people from the British Classification Society on the list. They may be able to offer some local guidance.
Art Kendall
Social Research Consultants |
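A small sketch of how memberships from separate clusterings of two variable sets might be compared. This is not the 1976 "core clusters" procedure, whose details are not given in the thread. It simply cross-tabulates two hypothetical membership variables, computes the Rand index as one possible agreement measure, and builds a combined profile variable. All labels are made up.

```python
from collections import Counter
from itertools import combinations

def crosstab(labels_a, labels_b):
    """Count cases for each combination of the two cluster memberships."""
    return Counter(zip(labels_a, labels_b))

def rand_index(labels_a, labels_b):
    """Share of case pairs on which two partitions agree: the pair is
    either together in both partitions or apart in both."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical memberships for six cases from two separate clusterings,
# e.g. one on the demographic set and one on the benefits set.
demographic = [0, 0, 1, 1, 2, 2]
benefits = [0, 0, 1, 1, 1, 1]

table = crosstab(demographic, benefits)
agreement = rand_index(demographic, benefits)

# One combined variable holding each case's pair of memberships.
combined = list(zip(demographic, benefits))

print(dict(table))
print(agreement)
```

A cross-tabulation with most cases concentrated in a few cells suggests the two variable sets tell a similar story; a diffuse table suggests the memberships carry largely separate information.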