Most common combinations of dummy variables

Most common combinations of dummy variables

Eduard Bonet
Hi! 

I have 8 dummy variables (two possible values each: 0 or 1).  The value 1 means the presence of a certain characteristic measured by the variable.  I would like to know what the most common combinations of characteristics in my dataset are, while avoiding the creation of all possible combinations.

Any help will be very welcome.

Thanks in advance.  And Merry Christmas


Eduard Bonet
__________________
Linux user n.444814  
(K)Ubuntu user  12888

Re: Most common combinations of dummy variables

Mark Miller
How about expanding the list of dummies as powers of two, as in

COMPUTE    combovar =  (v1*1)  +  (v2*2)  + (v3*4) + (v4*8) +(v5*16) + (v6*32) + (v7*64) + (v8*128).

Values will range from 0 (all negative) to 255 (all positive).
Run a frequency table on combovar, pick off the most frequent values, and decode them into the component responses.
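
The tabulating and decoding steps might look something like the untested sketch below. It assumes combovar has already been computed as above; the names d1 to d8 are hypothetical and only serve to show the decoded bits.

```spss
* Frequency table sorted from the most to the least common code.
FREQUENCIES VARIABLES=combovar /FORMAT=DFREQ.
* Decode each code back into its components (d1 to d8 are hypothetical names).
* Bit p of combovar is MOD(TRUNC(combovar/2**p),2) and v1 is the lowest bit.
DO REPEAT d=d1 TO d8 /p=0 TO 7.
COMPUTE d=MOD(TRUNC(combovar/2**p),2).
END REPEAT.
EXECUTE.
```

For instance, a combovar of 153 is 128 + 16 + 8 + 1, which decodes to v1, v4, v5 and v8 present and the rest absent.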

... Mark Miller

Re: Most common combinations of dummy variables

John F Hall

Alternatively, multiply each successive variable by a power of 10 and then add them all together.  This will result in a series of combinations for which you can then produce a frequency count.

Something like [untested]:

compute comb = v1 * 10000000 + v2 * 1000000  ~~~ + v8.
format comb (n8).
freq comb.

Anyway, play around with it and see what you get.

John F Hall (Mr)
[retired academic survey researcher]

Email:     [hidden email]
Website: www.surveyresearch.weebly.com


Re: Most common combinations of dummy variables

Poes, Matthew Joseph
In reply to this post by Eduard Bonet
I think the more common approach to this is to use a person-centered analytic approach such as cluster analysis.  Select Binary as the measure and within-groups linkage as the cluster method.

Matt

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: Most common combinations of dummy variables

Rich Ulrich
In reply to this post by Eduard Bonet
For the exact information --
I would sort by all 8, aggregate with all 8 as Break, and collect the count.
Sort in descending order and list.
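
As an untested sketch of the exact-count route: the variable name n_cases is a hypothetical choice, and OUTFILE=* replaces the working file with the aggregated one, so save the original data first. AGGREGATE does not need the file sorted beforehand unless PRESORTED is specified.

```spss
* Collapse the file to one row per combination that actually occurs.
AGGREGATE OUTFILE=*
  /BREAK=v1 TO v8
  /n_cases=N.
* Most frequent combinations first.
SORT CASES BY n_cases (D).
LIST.
```

With 8 dummies the result has at most 256 rows, and only combinations present in the data appear, so nothing has to be enumerated by hand.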

For approximate information that might be more useful,
I would do a simple factor analysis. 

Be aware that for dichotomies, variables with similar skews will
show bias for higher intercorrelations with each other. (Thus:
If there is only one latent factor but different means, the
analysis will reflect those biases by breaking out 2 or 3 factors
which are distinguished by their item means.)

--
Rich Ulrich

Re: Most common combinations of dummy variables

David Marso
Administrator
In a spirit similar to several others, I would build a concatenation.
It doesn't need to be a string, and it is more readable as base 10 than as base 2.
Some sort of cluster or factor analysis (FA) might also be suitable. Caveat emptor.
--
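* Read eight one-column dummy variables from inline data.
DATA LIST  / v1 to v8 1-8.
BEGIN DATA
10011001
11100001
10011011
11100101
00110011
00011100
11110000
11000100
END DATA.
* Set the scratch variable # to the number of variables (always 8 here).
COUNT #=v1 TO v8 (LO THRU HI,MISSING).
* Add each dummy times a descending power of 10 so that v1 is the leftmost digit.
DO REPEAT V=v1 to v8 .
COMPUTE #=#-1.
COMPUTE HASH=SUM(HASH, V*10**# ).
END REPEAT.
* The N8 format displays leading zeros.
FORMATS HASH (N8).
LIST.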




V1 V2 V3 V4 V5 V6 V7 V8     HASH

 1  0  0  1  1  0  0  1 10011001
 1  1  1  0  0  0  0  1 11100001
 1  0  0  1  1  0  1  1 10011011
 1  1  1  0  0  1  0  1 11100101
 0  0  1  1  0  0  1  1 00110011
 0  0  0  1  1  1  0  0 00011100
 1  1  1  1  0  0  0  0 11110000
 1  1  0  0  0  1  0  0 11000100


Number of cases read:  8    Number of cases listed:  8

Rose, Miriam wrote
Another approach to consider would be to create a "pattern" variable.
This is done by converting the numeric data (0 or 1) into string data ('0'
or '1') and then concatenating the 8 new variables into a pattern variable
that represents all of the dummy variables in sequence.  Then you can
examine the frequency of the resulting patterns, which would be all the
combinations present in the dataset.
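
A minimal, untested sketch of this pattern approach follows. The variable name pattern is an assumption, and the dummies are assumed to be named v1 to v8 as elsewhere in the thread.

```spss
* Build an 8-character pattern such as 10011001 from the numeric dummies.
STRING pattern (A8).
COMPUTE pattern = CONCAT(STRING(v1,F1), STRING(v2,F1), STRING(v3,F1), STRING(v4,F1),
                         STRING(v5,F1), STRING(v6,F1), STRING(v7,F1), STRING(v8,F1)).
* Patterns listed from the most to the least frequent.
FREQUENCIES VARIABLES=pattern /FORMAT=DFREQ.
```

A missing dummy should show up as a period in its position of the pattern, which keeps incomplete cases visible in the table.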

Miriam Rose


Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Re: Most common combinations of dummy variables

Rich Ulrich
In reply to this post by Poes, Matthew Joseph
I recommended factoring, and I pointed to the potential problem of
artificial associations of variables with the same amount of skew.

The cluster procedure gives you options for measuring "distances",
and several of those options avoid that handicap. 

So I agree that clustering could be more robust than doing a factor
analysis on these binary data.
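
For anyone who wants to try the clustering route, here is an untested sketch. The choice of JACCARD as the binary measure and WAVERAGE (within-groups linkage) as the method are assumptions made for illustration, not a recommendation from the thread.

```spss
* Hierarchical clustering of cases on the eight dummies.
* JACCARD ignores joint absences, while SM (simple matching) would count them.
CLUSTER v1 TO v8
  /MEASURE=JACCARD
  /METHOD=WAVERAGE
  /PRINT=SCHEDULE
  /PLOT=DENDROGRAM.
```

Hierarchical clustering computes a case-by-case proximity matrix, so it can be slow on large files.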

--
Rich Ulrich


Re: Most common combinations of dummy variables

David Marso
Administrator
In reply to this post by David Marso
While we are at it, I have appended a version which uses a string variable.
Note the use of SUBSTR on the left-hand side.  A lot of people don't realize this is kosher, and of course it is much better than CONCAT for numerous reasons.
DATA LIST  / v1 to v8 1-8.
BEGIN DATA
10011001
11100001
10011011
11100101
00110011
00011100
11110000
11000100
END DATA.
COUNT #=v1 TO v8 (LO THRU HI,MISSING).
DO REPEAT V=v1 to v8 .
COMPUTE #=#-1.
COMPUTE HASH=SUM(HASH, V*10**# ).
END REPEAT.

* Write each dummy as one character into position #INDEX of strhash.
STRING strhash (A8).
DO REPEAT v=v1 TO v8 /#INDEX=1 TO 8.
+  COMPUTE SUBSTR(strhash,#INDEX,1)=STRING(v,F1).
END REPEAT.
FORMATS HASH (N8).
LIST.




V1 V2 V3 V4 V5 V6 V7 V8     HASH STRHASH

 1  0  0  1  1  0  0  1 10011001 10011001
 1  1  1  0  0  0  0  1 11100001 11100001
 1  0  0  1  1  0  1  1 10011011 10011011
 1  1  1  0  0  1  0  1 11100101 11100101
 0  0  1  1  0  0  1  1 00110011 00110011
 0  0  0  1  1  1  0  0 00011100 00011100
 1  1  1  1  0  0  0  0 11110000 11110000
 1  1  0  0  0  1  0  0 11000100 11000100


Number of cases read:  8    Number of cases listed:  8


Re: Most common combinations of dummy variables

Jon K Peck
A note of warning about SUBSTR on the left-hand side.  With simple ASCII text or Western code pages, where 1 byte = 1 character, left-hand-side SUBSTR is straightforward.  With Asian multibyte character sets or Unicode (UTF-8), the number of bytes in a character varies, so it would be difficult to get this right.  Statistics does not allow CHAR.SUBSTR to be used on the left for this reason.

So, in short, do not use SUBSTR on the left if there is any possibility of variable-width characters or if you are planning to move to Unicode (which is the default mode in V21).
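
One way to build the same string without SUBSTR on the left is to append with CONCAT, as in this untested sketch. It assumes the dummies are named v1 to v8 and that strhash does not exist yet. RTRIM strips the trailing blanks that pad the string before each digit is appended.

```spss
* Append one character per dummy without SUBSTR on the left-hand side.
STRING strhash (A8).
DO REPEAT v=v1 TO v8.
COMPUTE strhash = CONCAT(RTRIM(strhash), STRING(v,F1)).
END REPEAT.
LIST.
```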


Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621



