Hi!
I have 8 dummy variables (two possible values each: 0 or 1). The value 1 means the presence of a certain characteristic measured by the variable. I would like to know which combinations of characteristics are the most common in my dataset, without having to create all possible combinations.
Any help will be very welcome.
Thanks in advance. And Merry Christmas
Eduard Bonet
__________________ Linux user n.444814 (K)Ubuntu user 12888 |
How about expanding the list of dummies as powers of two, as in

COMPUTE combovar = (v1*1) + (v2*2) + (v3*4) + (v4*8) + (v5*16) + (v6*32) + (v7*64) + (v8*128).

Values will range from 0 (all negative) to 255 (all positive). With a list of frequencies, pick off the top cases and decode them into the component responses.
... Mark Miller
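The powers-of-two suggestion can be sketched end to end as follows. This is only an illustration, assuming the dummies are numeric 0/1, named v1 to v8, and free of missing values; d1 to d8 are hypothetical names introduced here to check the decoding and are not part of the original post.

```spss
* Sketch only: assumes v1 to v8 are numeric 0/1 dummies with no missing values.
COMPUTE combovar = v1*1 + v2*2 + v3*4 + v4*8 + v5*16 + v6*32 + v7*64 + v8*128.
FORMATS combovar (F3.0).
* Frequency table ordered from the most to the least common code.
FREQUENCIES VARIABLES=combovar /FORMAT=DFREQ.
* Decode each code back into its components (d1 to d8 are hypothetical names).
* The component with weight p is MOD(TRUNC(combovar/p),2).
DO REPEAT d = d1 TO d8 / p = 1 2 4 8 16 32 64 128.
  COMPUTE d = MOD(TRUNC(combovar/p), 2).
END REPEAT.
EXECUTE.
```

For instance, a code of 137 is 128 + 8 + 1, so it decodes to v1, v4 and v8 present and the rest absent. Only codes that actually occur show up in the frequency table, so the 256 possible combinations never have to be generated.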
|
Alternatively multiply each successive variable by a power of 10 and then add them all together. This will result in a series of combinations for which you can then produce a frequency count. Something like [untested]:

compute comb = v1 * 10000000 + v2 * 1000000 ~~~ + v8.
format comb (n8).
freq comb.

Anyway, play around with it and see what you get.
John F Hall (Mr)
[retired academic survey researcher]
Email: [hidden email]
Website: www.surveyresearch.weebly.com
|
In reply to this post by Eduard Bonet
I think the more common approach to this is a person-centered analytic approach such as cluster analysis. Select binary as the within group linkage.
Matt
Sent from my iPad
=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD |
In reply to this post by Eduard Bonet
For the exact information --
I would sort by all 8, aggregate with all 8 as Break, and collect the count. Sort in descending order and list.

For approximate information that might be more useful, I would do a simple factor analysis. Be aware that for dichotomies, variables with similar skews will show bias for higher intercorrelations with each other. (Thus: If there is only one latent factor but different means, the analysis will reflect those biases by breaking out 2 or 3 factors which are distinguished by their item means.)
-- Rich Ulrich |
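The sort-and-aggregate suggestion above might look like the sketch below. It assumes the dummies are named v1 to v8 and sit next to each other in the file (so that v1 TO v8 picks up exactly those eight); n_cases is a hypothetical name for the count variable.

```spss
* Sketch only: OUTFILE=* replaces the active dataset, so save the data first.
* n_cases is a hypothetical name for the number of cases per combination.
AGGREGATE OUTFILE=*
  /BREAK=v1 TO v8
  /n_cases=N.
* Most common combinations first.
SORT CASES BY n_cases (D).
LIST.
```

The result has one row per combination that actually occurs, so combinations with no cases are never created. As far as I know, AGGREGATE needs the file sorted beforehand only when the PRESORTED subcommand is used, which is why the initial sort is left out of this sketch.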
In reply to this post by Eduard Bonet
CONTENTS DELETED
The author has deleted this message.
|
Administrator
|
In a spirit similar to several others, I would build a concatenation.
Doesn't need to be a string. More readable as base 10 rather than base 2.
Also some sort of cluster or FA might be suitable. Caveat Emptor.
--
DATA LIST / v1 to v8 1-8.
BEGIN DATA
10011001
11100001
10011011
11100101
00110011
00011100
11110000
11000100
END DATA.
COUNT #=v1 TO v8 (LO THRU HI,MISSING).
DO REPEAT V=v1 to v8 .
COMPUTE #=#-1.
COMPUTE HASH=SUM(HASH, V*10**# ).
END REPEAT.
FORMATS HASH (N8).
LIST.

V1 V2 V3 V4 V5 V6 V7 V8     HASH

 1  0  0  1  1  0  0  1 10011001
 1  1  1  0  0  0  0  1 11100001
 1  0  0  1  1  0  1  1 10011011
 1  1  1  0  0  1  0  1 11100101
 0  0  1  1  0  0  1  1 00110011
 0  0  0  1  1  1  0  0 00011100
 1  1  1  1  0  0  0  0 11110000
 1  1  0  0  0  1  0  0 11000100

Number of cases read: 8 Number of cases listed: 8
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
In reply to this post by Poes, Matthew Joseph
I recommended factoring, and I pointed to the potential problem of artificial associations of variables with the same amount of skew. The cluster procedure gives you options for measuring "distances", and several of those options avoid that handicap. So I agree that clustering could be more robust than doing a factor analysis on these binary data.
-- Rich Ulrich |
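For readers who want to try the clustering route discussed above, here is a hedged sketch using the CLUSTER procedure with a binary similarity measure. The choice of Jaccard and of between-groups average linkage is an assumption made for illustration only; neither poster named a specific measure or method, and other binary measures may suit the data better.

```spss
* Sketch only: hierarchical clustering of cases on the eight dummies.
* JACCARD(1,0) declares that 1 means present and 0 means absent.
* Joint absences are ignored by this measure (an illustrative choice).
CLUSTER v1 TO v8
  /METHOD=BAVERAGE
  /MEASURE=JACCARD(1,0)
  /PRINT=SCHEDULE
  /PLOT=DENDROGRAM.
```

Two caveats under these assumptions: hierarchical clustering compares every pair of cases, so it becomes slow on large files, and cases with no characteristic present at all have no defined Jaccard similarity to each other, so it may be worth checking how many such cases exist first.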
Administrator
|
In reply to this post by David Marso
While we are at it. I have appended a version which uses a string variable.
Note use of SUBSTR on the left hand side. A lot of people don't realize this is kosher and of course is much better than CONCAT for numerous reasons.

DATA LIST / v1 to v8 1-8.
BEGIN DATA
10011001
11100001
10011011
11100101
00110011
00011100
11110000
11000100
END DATA.
COUNT #=v1 TO v8 (LO THRU HI,MISSING).
DO REPEAT V=v1 to v8 .
COMPUTE #=#-1.
COMPUTE HASH=SUM(HASH, V*10**# ).
END REPEAT.
STRING strhash (A8).
DO REPEAT v=v1 TO v8 /#INDEX=1 TO 8.
+ COMPUTE SUBSTR(strhash,#INDEX,1)=STRING(v,F1).
END REPEAT.
FORMATS HASH (N8).
LIST.

V1 V2 V3 V4 V5 V6 V7 V8     HASH STRHASH

 1  0  0  1  1  0  0  1 10011001 10011001
 1  1  1  0  0  0  0  1 11100001 11100001
 1  0  0  1  1  0  1  1 10011011 10011011
 1  1  1  0  0  1  0  1 11100101 11100101
 0  0  1  1  0  0  1  1 00110011 00110011
 0  0  0  1  1  1  0  0 00011100 00011100
 1  1  1  1  0  0  0  0 11110000 11110000
 1  1  0  0  0  1  0  0 11000100 11000100

Number of cases read: 8 Number of cases listed: 8
|
A note of warning about substr on the left hand side. With simple ASCII text or western code pages, where 1 byte = 1 character, lhs substr is straightforward. With Asian multibyte character sets or Unicode (UTF-8), the number of bytes in a character varies, so it would be difficult to get this right. Statistics does not allow char.substr to be used on the left for this reason.

So, in short, do not use substr on the left if there is any possibility of variable width characters or you are planning to move to Unicode (which is the default mode in V21).

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621

Rose, Miriam wrote
> Another approach to consider would be to create a "pattern" variable. This is done by converting the numeric data (0 or 1) into string data ('0' or '1') and then concatenating the 8 new variables into a pattern variable that represents all of the dummy variables in sequence. Then you can examine the frequency of the resulting patterns, which would be all the combinations present in the dataset.
>
> Miriam Rose
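Putting the warning about left-hand SUBSTR together with the quoted "pattern" variable idea, one way to build the string without assigning into a substring is a single CONCAT, sketched below. It assumes v1 to v8 are numeric 0/1 with no missing values (a system-missing value would show up as a period in the string); pattern is a hypothetical variable name.

```spss
* Sketch only: build the pattern string with CONCAT instead of lhs SUBSTR.
* pattern is a hypothetical name and v1 to v8 are assumed to be 0/1.
STRING pattern (A8).
COMPUTE pattern = CONCAT(STRING(v1,F1), STRING(v2,F1), STRING(v3,F1), STRING(v4,F1),
                         STRING(v5,F1), STRING(v6,F1), STRING(v7,F1), STRING(v8,F1)).
* Frequency table ordered from the most to the least common pattern.
FREQUENCIES VARIABLES=pattern /FORMAT=DFREQ.
```

Because each piece is a whole one-character string and nothing is written by byte position, this form should behave the same in code page and Unicode mode. The trade-off is that the eight variables have to be listed out by hand.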
-- View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Most-common-combinations-of-dummy-variables-tp5717128p5717146.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com. |