SPSSX Discussion

10 most frequent occurring values of a multiple response set


Edward Boadi

10 most frequent occurring values of a multiple response set

Dear List,
I have a data file with the variables
X, y1, y2, z1, z2 and z3.

I want to accomplish the following tasks:
1. Create a multiple response set z from z1, z2 and z3.
2. Rank z and select cases where rank z <= 10.
3. Select cases from my original data file where z = z1, z2 or z3.

My objective is to create a new dataset restricted to the 10 most frequently occurring values of a multiple response set created from z1, z2 and z3.

Any ideas on how to accomplish this will be most welcome.

Regards to all .

Beadle, ViAnn

Re: 10 most frequent occurring values of a multiple response set

I'm not quite sure what it means to rank z since it is a set of 3 values. Are you looking for the most frequently occurring combinations of the three variables?

Is this some sort of RFM analysis?

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Edward Boadi
Sent: Thursday, July 20, 2006 2:03 PM
To: [hidden email]
Subject: 10 most frequent occurring values of a multiple response set


Edward Boadi

Re: 10 most frequent occurring values of a multiple response set

In reply to this post by Edward Boadi

This is not RFM analysis.

Yes, I am looking for the 10 most frequently occurring combinations of the three variables as my initial step.
Then select X, y1, y2, z1, z2 and z3 where (z1,z2,z3) = z, i.e. where z1, z2 and z3 correspond to the 10 most
frequently occurring combinations of z1, z2 and z3.

Regards.

-----Original Message-----
From: Beadle, ViAnn [mailto:[hidden email]]
Sent: Thursday, July 20, 2006 3:15 PM
To: Edward Boadi; [hidden email]
Subject: RE: 10 most frequent occurring values of a multiple response
set


Beadle, ViAnn

Re: 10 most frequent occurring values of a multiple response set

In reply to this post by Edward Boadi

Then ignore the whole concept of a multiple response set and just compute some variable which is a combination of all three values. For example, if z1, z2, and z3 take on two-digit values, you'll need something like:

Compute z=z1 + z2*1000 + z3*100000.

The second step is to rank occurrences, not values.

You need to use AGGREGATE to capture the occurrences in a variable, using the N function and z as your break variable. This will give you a dataset with one row for each unique value of z, along with its count N. Sort that dataset in descending order on N and then compute nrank = $casenum after the sort.
So your aggregated dataset has z, N, and nrank. You have to get nrank onto your original dataset through a table match. To do so, you need to sort both the aggregated dataset and the original dataset on z and use z as the matching key. Once nrank is on your dataset, you can either filter or select cases with nrank less than or equal to 10.

Here's some syntax that I pasted from SPSS, release 14+, that might do the trick:
GET
  FILE='C:\Program Files\SPSS\originaldata.sav'.
DATASET NAME DataSet1 WINDOW=FRONT.
* Combine the three values into a single key.
COMPUTE z=z1+z2*1000+z3*100000.
* Count the occurrences of each key.
DATASET DECLARE ranked_data.
AGGREGATE
  /OUTFILE='ranked_data'
  /BREAK=z
  /N=N.
* Number the keys in descending order of count.
DATASET ACTIVATE ranked_data.
SORT CASES BY
  N (D) .
COMPUTE nrank = $casenum .
EXECUTE .
* Sort both datasets on the matching key.
DATASET ACTIVATE DataSet1.
SORT CASES BY
  z (A) .
DATASET ACTIVATE ranked_data.
SORT CASES BY
  z (A) .
DATASET ACTIVATE DataSet1.
SAVE OUTFILE='C:\Program Files\SPSS\originaldata.sav'
  /COMPRESSED.
* Add nrank to the original cases and filter on it.
MATCH FILES /FILE=*
  /TABLE='ranked_data'
  /BY z.
EXECUTE.
USE ALL.
COMPUTE filter_$=(nrank <= 10).
VARIABLE LABEL filter_$ 'nrank <= 10 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMAT filter_$ (f1.0).
FILTER BY filter_$.
* ... rest of analysis goes here.

I think the big issue here is what to do about ties. In my example, the 10th-highest frequency was shared by 5 values, and this code simply takes the first 10 rows after the sort; among the tied values, those happen to be ordered by z.
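One possible way to make the treatment of ties explicit, sketched here as an assumption rather than something taken from the syntax above, is to let the RANK procedure assign nrank in place of SORT CASES plus $casenum. With /TIES=LOW every tied count receives the best rank in its group, so a cutoff of nrank <= 10 keeps all values tied for tenth place; /TIES=HIGH would drop them all instead. The sketch assumes ranked_data has just been created by AGGREGATE and does not yet contain a variable named nrank.

```spss
* Sketch only: use in place of the SORT CASES and COMPUTE nrank = $casenum steps.
DATASET ACTIVATE ranked_data.
* Rank the counts in descending order, so the largest count gets rank 1.
* TIES=LOW gives every tied count the lowest rank number in its group.
RANK VARIABLES=N (D)
  /RANK INTO nrank
  /TIES=LOW
  /PRINT=NO.
```

The remaining steps (sorting both datasets on z, the table match, and the filter) stay as they are. Note that with /TIES=LOW the selection can contain more than 10 distinct values.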

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Edward Boadi
Sent: Thursday, July 20, 2006 2:25 PM
To: [hidden email]
Subject: Re: 10 most frequent occurring values of a multiple response set


Richard Ristow

Re: 10 most frequent occurring values of a multiple response set

At 04:28 PM 7/20/2006, Beadle, ViAnn wrote:

>Compute some variable which is a combination of all three values. For
>example if z1, z2, and z3 take on two[-digit] values you'll need some
>thing like:
>
>Compute z=z1 + z2*1000 + z3*100000.
>
>The second step is to rank occurrences, not values.
>
>You need to use aggregate to capture the occurrences into a variable,
>using the N function and z as your break variable.

Etc. I think this is exactly right, except why "compute some variable
which is a combination of all three values"? AGGREGATE is perfectly
happy with BREAKing on multiple variables. I'd suggest

DATASET DECLARE ranked_data.
AGGREGATE
/OUTFILE='ranked_data'
/BREAK=z1 z2 z3
/N=N.

instead of

COMPUTE Z=z1+z2*1000+z3*100000.
DATASET DECLARE ranked_data.
AGGREGATE
/OUTFILE='ranked_data'
/BREAK=z
/N=N.
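If the break is on all three variables, the later sort and match steps need the same three keys in place of z. A rough sketch of how the whole sequence might then look, assuming the original data are already the active dataset and reusing the dataset names DataSet1 and ranked_data from the earlier posting (ties are still broken arbitrarily by $casenum):

```spss
* Sketch, assuming the original data are the active dataset.
DATASET NAME DataSet1.
DATASET DECLARE ranked_data.
* One row per observed (z1, z2, z3) combination, with its count N.
AGGREGATE
  /OUTFILE='ranked_data'
  /BREAK=z1 z2 z3
  /N=N.
* Number the combinations from most to least frequent.
DATASET ACTIVATE ranked_data.
SORT CASES BY N (D).
COMPUTE nrank = $casenum.
EXECUTE.
* Both datasets must be sorted on all three keys before the table match.
SORT CASES BY z1 z2 z3.
DATASET ACTIVATE DataSet1.
SORT CASES BY z1 z2 z3.
MATCH FILES /FILE=*
  /TABLE='ranked_data'
  /BY z1 z2 z3.
* Keep only cases whose combination is among the 10 most frequent.
SELECT IF (nrank <= 10).
EXECUTE.
```

Breaking on the variables directly avoids having to choose multipliers wide enough to keep the combined key unique.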

Edward Boadi

Re: 10 most frequent occurring values of a multiple response set

In reply to this post by Edward Boadi

Thanks Richard + Beadle for your syntax on the above subject.

I have a couple of questions.

Consider the syntax :
.
.
.
GET FILE='C:\Program Files\SPSS\originaldata.sav'.
MATCH FILES /FILE=*
/TABLE='ranked_data'
/BY z.
SELECT IF (nrank <= 10).
EXECUTE.
.
.
.

Where
1. ranked_data contains aggregated data with variables z and nrank
2. originaldata.sav is the original data file with variables x,y1,y2,z1,z2 and z3
3. z variable was created from the aggregation of z1,z2 and z3

The syntax above is supposed to keep only cases whose z1, z2 and z3 are in the ranked data file (nrank <= 10).
But after doing my analysis I still get values of z1, z2 and z3 that are not in the ranked data file.

Please advise.

Regards to all.

-----Original Message-----
From: Richard Ristow [mailto:[hidden email]]
Sent: Friday, July 21, 2006 12:46 AM
To: [hidden email]
Cc: Edward Boadi; Beadle, ViAnn
Subject: Re: 10 most frequent occurring values of a multiple response
set


Edward Boadi

Re: 10 most frequent occurring values of a multiple response set

In reply to this post by Edward Boadi

The following syntax creates two data files, 'originaldata.sav' and 'ranked_data.sav'. 'ranked_data.sav' contains z and RankN, the rank of the aggregated values of z1, z2, z3 and z4.

Now I want to perform the following task: set the values of z1, z2, z3 and z4 to system-missing when they are not in 'ranked_data.sav' with RankN <= 3, say.

That is, I want to exclude from my analysis the values of z1, z2, z3 and z4 with rank greater than 3.

But after running the syntax below, I still have values of z1, z2, z3 and z4 with rank greater than 3.

DATA LIST FREE/x y z1 z2 z3 z4.
BEGIN DATA
2 1 1 3 3 5
1 2 4 1 4 3
1 3 4 5 9 1
1 4 5 2 5 4
2 2 5 1 3 1
1 3 2 2 2 5
1 2 1 2 1 1
1 1 9 4 1 1
1 1 2 4 5 1
1 3 1 5 1 1
1 1 2 4 4 4
1 2 2 9 4 4
2 4 5 1 2 3
1 1 1 1 9 2
1 2 5 1 1 3
1 4 5 1 2 1
1 3 1 2 4 4
END DATA.
VECTOR r=z1 to z4.
LOOP cnt=1 TO 4.
- COMPUTE z=r(cnt).
- XSAVE OUTFILE ='C:\Temp\originaldata.sav'.
END LOOP.
EXECUTE.
SAVE OUTFILE='C:\Temp\originaldata.sav'.
GET FILE ='C:\Temp\originaldata.sav'.
***Aggregate by Z .
AGGREGATE
  /OUTFILE='c:\Temp\ranked_data.sav'
  /BREAK=z
  /N_BREAK=N.
****Rank z.
GET FILE= 'c:\Temp\ranked_data.sav'.
SORT CASES BY N_BREAK (D) .
RANK VARIABLES=N_BREAK (D)
  /RANK INTO RankN.
EXECUTE .
SORT CASES BY Z(A).
SAVE OUTFILE ='C:\Temp\ranked_data.sav' .
GET FILE='C:\Temp\originaldata.sav'.
SORT CASES BY Z(A).
MATCH FILES /FILE=*
  /TABLE='c:\Temp\ranked_data.sav'
  /BY z.
EXECUTE.
SELECT IF (RankN <= 3).

-----Original Message-----
From: Beadle, ViAnn [mailto:[hidden email]]
Sent: Friday, July 21, 2006 1:12 PM
To: Edward Boadi; Richard Ristow
Subject: RE: 10 most frequent occurring values of a multiple response set

Kinda hard to say what's wrong here without lots more information such as the complete syntax you actually ran, and a case listing that demonstrates the problem. And I'm not sure what you mean "I still get values of z1, z2, and z3 that are not in the ranked data file."

_____

From: Edward Boadi [mailto:[hidden email]]
Sent: Fri 7/21/2006 11:19 AM
To: Richard Ristow; Beadle, ViAnn; [hidden email]
Subject: RE: 10 most frequent occurring values of a multiple response set


Richard Ristow

Re: 10 most frequent occurring values of a multiple response set

In reply to this post by Edward Boadi

At 12:19 PM 7/21/2006, Edward Boadi wrote:

>I have some couple of questions.
[...]

>GET FILE='C:\Program Files\SPSS\originaldata.sav'.
>MATCH FILES /FILE=*
> /TABLE='ranked_data'
> /BY z.
>SELECT IF (nrank <= 10).
>.
>
>Where
>1. ranked_data contains aggregated data with variables z and nrank
>2. originaldata.sav is the original data file with
>variables x,y1,y2,z1,z2 and z3
>3. z variable was created from the aggregation of z1,z2 and z3
>
>The syntax above is suppose to keep only cases with z1 , z2 and Z3
>are in the ranked data file (nrank <= 10). But after donig my analysis
>I still get values of z 1, z2 and z3 that are not in the ranked data
>file. Please advise.

OK. The syntax that ViAnn Beadle suggested, and I modified, looks for
the top ten COMBINATIONS of the three values z1, z2, z3. (The code is
sensitive to order - the combination z1=A,z2=B,z3=C is counted as
different from, say z1=C,z2=B,z3=A.)

I wasn't sure from your postings, but it sounds like you want to keep
the 10 values that occur most often, over all three variables. OK, that
can be done, though let me know, first, if I've got that right. And
what do you do, if, say,
z1=A,z2=B,z3=C, A is among the top 10, and B and C are not? Keep the
combination, or make z2 and z3 system-missing, or what?

It's manageable; I'd just like to understand the problem better.

Good luck,
Richard


Edward Boadi

Re: 10 most frequent occurring values of a multiple response set

In reply to this post by Edward Boadi

Thanks Richard,
I want to keep the 10 values that occur most often over all three variables (z1, z2 and z3).

Thus:
1. If z1=A, z2=B, z3=C, and A is among the top 10 but B and C are not, make z2 and z3 system-missing for that record (case).
2. If z1=A, z2=B, z3=C, and A and B are among the top 10 but C is not, make z3 system-missing for that record (case), etc.

My objective is to set z1, z2 and z3 to system-missing wherever their values are not in the top 10.

Thanks .

Edward
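A possible sketch of this objective in SPSS syntax follows. It is not taken from any posting in the thread and rests on several assumptions: the original data are the active dataset; the dataset names stacked and value_ranks and the variables rank1 to rank3 and n1 to n3 are made up for illustration; a value is counted once for each variable it appears in; and ties at the cutoff are broken arbitrarily by $casenum.

```spss
* Sketch, assuming the original data (x, y1, y2, z1, z2, z3) are the active dataset.
DATASET NAME DataSet1.

* 1. Stack z1-z3 into a single column z in a working copy.
*    Missing values are dropped by default, so they are not counted.
DATASET COPY stacked.
DATASET ACTIVATE stacked.
VARSTOCASES /MAKE z FROM z1 z2 z3.

* 2. Count each distinct value and number the values by descending count.
DATASET DECLARE value_ranks.
AGGREGATE
  /OUTFILE='value_ranks'
  /BREAK=z
  /N=N.
DATASET ACTIVATE value_ranks.
SORT CASES BY N (D).
COMPUTE nrank = $casenum.
EXECUTE.
SORT CASES BY z.

* 3. Look up the rank of z1, then z2, then z3, renaming the key each time.
DATASET ACTIVATE DataSet1.
SORT CASES BY z1.
MATCH FILES /FILE=*
  /TABLE='value_ranks'
  /RENAME=(z nrank N = z1 rank1 n1)
  /BY z1.
SORT CASES BY z2.
MATCH FILES /FILE=*
  /TABLE='value_ranks'
  /RENAME=(z nrank N = z2 rank2 n2)
  /BY z2.
SORT CASES BY z3.
MATCH FILES /FILE=*
  /TABLE='value_ranks'
  /RENAME=(z nrank N = z3 rank3 n3)
  /BY z3.

* 4. Blank out every value that is not among the 10 most frequent.
IF (rank1 > 10) z1 = $SYSMIS.
IF (rank2 > 10) z2 = $SYSMIS.
IF (rank3 > 10) z3 = $SYSMIS.
EXECUTE.
```

The three sorts leave DataSet1 ordered by z3, so the original case order would need a saved sequence variable if it matters. No cases are dropped; only the individual values outside the top 10 become system-missing.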

-----Original Message-----
From: Richard Ristow [mailto:[hidden email]]
Sent: Friday, July 21, 2006 6:36 PM
To: Edward Boadi; [hidden email]
Subject: Re: 10 most frequent occurring values of a multiple response
set
