SPSSX Discussion

Most common combinations of dummy variables


Eduard Bonet

Most common combinations of dummy variables

Hi!

I have 8 dummy variables (two possible values each: 0 or 1). The value 1 means the presence of a certain characteristic measured by the variable. I would like to know which combinations of characteristics are most common in my dataset, without having to create all possible combinations.

Any help will be very welcome.

Thanks in advance. And Merry Christmas

Eduard Bonet
__________________
Linux user n.444814
(K)Ubuntu user 12888

Mark Miller

Re: Most common combinations of dummy variables

How about expanding the list of dummies as powers of two, as in

COMPUTE combovar = (v1*1) + (v2*2) + (v3*4) + (v4*8) +(v5*16) + (v6*32) + (v7*64) + (v8*128).

Values will range from 0 (all negative) to 255 (all positive), and each value corresponds to exactly one combination of the eight dummies.

With a list of frequencies, pick off the top cases and decode them into the component responses.
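
A minimal sketch of the counting and decoding steps, assuming the dummies are named v1 to v8 as in the COMPUTE above. The names d1 to d8 are hypothetical and only illustrate the arithmetic; this has not been run against real data.

```spss
* Encode the eight dummies as one number between 0 and 255.
COMPUTE combovar = v1 + v2*2 + v3*4 + v4*8 + v5*16 + v6*32 + v7*64 + v8*128.
* Show the codes from most to least frequent.
FREQUENCIES VARIABLES=combovar /FORMAT=DFREQ.
* Decode each code back into its components (d1 recovers v1, d2 recovers v2, and so on).
DO REPEAT d = d1 TO d8 / p = 1 2 4 8 16 32 64 128.
COMPUTE d = MOD(TRUNC(combovar/p), 2).
END REPEAT.
EXECUTE.
```

For instance, a code of 153 = 128 + 16 + 8 + 1 means that v1, v4, v5 and v8 are present and the rest are absent.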

... Mark Miller

John F Hall

Re: Most common combinations of dummy variables

Alternatively multiply each successive variable by a power of 10 and then add them all together. This will result in a series of combinations for which you can then produce a frequency count.

Something like [untested]:

compute comb = v1 * 10000000 + v2 * 1000000 ~~~ + v8.

formats comb (n8).

freq comb.

Anyway, play around with it and see what you get.
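
Written out in full, the idea might look like the sketch below. It assumes the elided middle terms follow the same pattern (each variable one power of 10 lower than the one before), which the post implies but does not spell out.

```spss
* Place v1 in the ten-millions digit and v8 in the units digit.
COMPUTE comb = v1*10000000 + v2*1000000 + v3*100000 + v4*10000
             + v5*1000 + v6*100 + v7*10 + v8.
* N8 keeps leading zeros, so 00110011 is not shown as 110011.
FORMATS comb (N8).
* Show the combinations from most to least frequent.
FREQUENCIES VARIABLES=comb /FORMAT=DFREQ.
```

Unlike the powers-of-two code, each digit of the result can be read directly as the value of one dummy.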

John F Hall (Mr)

[retired academic survey researcher]

Email: [hidden email]

Website: www.surveyresearch.weebly.com

Poes, Matthew Joseph

Re: Most common combinations of dummy variables

In reply to this post by Eduard Bonet

I think the more common approach to this is to use a person-centered analytic approach such as cluster analysis. Select a binary measure, with within-groups linkage as the cluster method.

Matt

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Rich Ulrich

Re: Most common combinations of dummy variables

In reply to this post by Eduard Bonet

For the exact information --
I would sort by all 8, aggregate with all 8 as Break, and collect the count.
Sort in descending order and list.
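
The exact count described above could be sketched as follows, assuming the dummies are named v1 to v8; n_cases is a hypothetical name for the count. As far as I know, AGGREGATE without the PRESORTED keyword does not need the file sorted beforehand, so the initial sort is left out here.

```spss
* Collapse the file to one row per observed combination and count the cases in each.
* OUTFILE=* replaces the active dataset, so save the original data first.
AGGREGATE OUTFILE=*
  /BREAK=v1 TO v8
  /n_cases=N.
* Most common combinations first.
SORT CASES BY n_cases (D).
LIST.
```

Only combinations that actually occur in the data produce a row, so nothing has to be enumerated in advance.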

For approximate information that might be more useful,
I would do a simple factor analysis.

Be aware that for dichotomies, variables with similar skews will
show bias for higher intercorrelations with each other. (Thus:
If there is only one latent factor but different means, the
analysis will reflect those biases by breaking out 2 or 3 factors
which are distinguished by their item means.)

--
Rich Ulrich

Date: Wed, 26 Dec 2012 19:33:00 +0300
From: [hidden email]
Subject: Most common combinations of dummy variables
To: [hidden email]

Hi!

mrose@benrose.org

Re: Most common combinations of dummy variables

In reply to this post by Eduard Bonet

CONTENTS DELETED

The author has deleted this message.

David Marso

Re: Most common combinations of dummy variables

Administrator

In a spirit similar to several others, I would build a concatenation.
Doesn't need to be a string. More readable as base 10 rather than base 2.
Also some sort of cluster or FA might be suitable. Caveat Emptor.
--
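* Read eight one-column dummies from columns 1 to 8.
DATA LIST / v1 to v8 1-8.
BEGIN DATA
10011001
11100001
10011011
11100101
00110011
00011100
11110000
11000100
END DATA.
* Set the scratch variable # to the number of variables, here 8.
COUNT #=v1 TO v8 (LO THRU HI,MISSING).
* Each pass lowers the exponent, so v1 lands in the 10**7 place and v8 in the units place.
DO REPEAT V=v1 to v8 .
COMPUTE #=#-1.
COMPUTE HASH=SUM(HASH, V*10**# ).
END REPEAT.
* N8 displays the result with leading zeros.
FORMATS HASH (N8).
LIST.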

V1 V2 V3 V4 V5 V6 V7 V8 HASH

1 0 0 1 1 0 0 1 10011001
1 1 1 0 0 0 0 1 11100001
1 0 0 1 1 0 1 1 10011011
1 1 1 0 0 1 0 1 11100101
0 0 1 1 0 0 1 1 00110011
0 0 0 1 1 1 0 0 00011100
1 1 1 1 0 0 0 0 11110000
1 1 0 0 0 1 0 0 11000100

Number of cases read: 8 Number of cases listed: 8

Rose, Miriam wrote

Another approach to consider would be to create a "pattern" variable.
This is done by converting the numeric data (0 or 1) into string data ('0'
or '1') and then concatenating the 8 new variables into a pattern variable
that represents all of the dummy variables in sequence. Then you can
examine the frequency of the resulting patterns, which would be all the
combinations present in the dataset.

Miriam Rose
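
A short sketch of the pattern-variable idea, assuming the dummies are named v1 to v8; the name pattern is hypothetical. It uses CONCAT directly, so no intermediate string variables are created.

```spss
* Turn each 0/1 value into a one-character string and join the eight pieces.
STRING pattern (A8).
COMPUTE pattern = CONCAT(STRING(v1,F1), STRING(v2,F1), STRING(v3,F1), STRING(v4,F1),
                         STRING(v5,F1), STRING(v6,F1), STRING(v7,F1), STRING(v8,F1)).
* Show the patterns from most to least frequent.
FREQUENCIES VARIABLES=pattern /FORMAT=DFREQ.
```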

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Rich Ulrich

Re: Most common combinations of dummy variables

In reply to this post by Poes, Matthew Joseph

I recommended factoring, and I pointed to the potential problem of
artificial associations of variables with the same amount of skew.

The cluster procedure gives you options for measuring "distances",
and several of those options avoid that handicap.

So I agree that clustering could be more robust than doing a factor
analysis on these binary data.

--
Rich Ulrich

David Marso

Re: Most common combinations of dummy variables

Administrator

In reply to this post by David Marso

Jon K Peck

Re: Most common combinations of dummy variables

A note of warning about SUBSTR on the left hand side. With simple ASCII text or western code pages, where 1 byte = 1 character, SUBSTR on the left hand side is straightforward. With Asian multibyte character sets or Unicode (UTF-8), the number of bytes in a character varies, so it would be difficult to get this right. Statistics does not allow CHAR.SUBSTR to be used on the left for this reason.

So, in short, do not use substr on the left if there is any possibility of variable width characters or you are planning to move to Unicode (which is the default mode in V21).
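
One way to build the same string without SUBSTR on the left, offered as a sketch rather than a recommendation from the thread. It assumes the variables v1 to v8 and the name strhash used in the quoted job.

```spss
* Start from a blank string and append one character per dummy.
STRING strhash (A8).
DO REPEAT v = v1 TO v8.
COMPUTE strhash = CONCAT(RTRIM(strhash), STRING(v, F1)).
END REPEAT.
LIST.
```

RTRIM drops the trailing blanks before each append, so every new character lands right after the previous one.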

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621

From: David Marso <[hidden email]>
To: [hidden email],
Date: 12/27/2012 02:50 PM
Subject: Re: [SPSSX-L] Most common combinations of dummy variables
Sent by: "SPSSX(r) Discussion" <[hidden email]>

While we are at it, I have appended a version which uses a string variable. Note the use of SUBSTR on the left hand side. A lot of people don't realize this is kosher, and of course it is much better than CONCAT for numerous reasons.

DATA LIST / v1 to v8 1-8.
BEGIN DATA
10011001
11100001
10011011
11100101
00110011
00011100
11110000
11000100
END DATA.
COUNT #=v1 TO v8 (LO THRU HI,MISSING).
DO REPEAT V=v1 to v8 .
COMPUTE #=#-1.
COMPUTE HASH=SUM(HASH, V*10**# ).
END REPEAT.
STRING strhash (A8).
DO REPEAT v=v1 TO v8 /#INDEX=1 TO 8.
+ COMPUTE SUBSTR(strhash,#INDEX,1)=STRING(v,F1).
END REPEAT.
FORMATS HASH (N8).
LIST.

V1 V2 V3 V4 V5 V6 V7 V8 HASH STRHASH

1 0 0 1 1 0 0 1 10011001 10011001
1 1 1 0 0 0 0 1 11100001 11100001
1 0 0 1 1 0 1 1 10011011 10011011
1 1 1 0 0 1 0 1 11100101 11100101
0 0 1 1 0 0 1 1 00110011 00110011
0 0 0 1 1 1 0 0 00011100 00011100
1 1 1 1 0 0 0 0 11110000 11110000
1 1 0 0 0 1 0 0 11000100 11000100

Number of cases read: 8 Number of cases listed: 8