10 most frequent occurring values of a multiple response set

Edward Boadi
Dear List,
I have a data file with the variables:
X, y1, y2, z1, z2 and z3

I want to accomplish the following tasks:
        1.      Create a multiple response set z from z1, z2 and z3.
        2.      Rank z and select cases for rank z <= 10.
        3.      Select cases from my original data file where z = z1, z2 or z3.

My objective is to create a new dataset restricted to the 10 most frequently occurring values of a multiple response set created from z1, z2 and z3.

Any ideas on how to accomplish this will be most welcome.

Regards to all.

Re: 10 most frequent occurring values of a multiple response set

Beadle, ViAnn
I'm not quite sure what it means to rank z since it is a set of 3 values. Are you looking for the most frequently occurring combinations of the three variables?

Is this some sort of RFM analysis?


Re: 10 most frequent occurring values of a multiple response set

Edward Boadi
In reply to this post by Edward Boadi
This is not RFM analysis.

Yes, I am looking for the 10 most frequently occurring combinations of the three variables as my initial step.
Then I want to select X, y1, y2, z1, z2 and z3 where (z1, z2, z3) = z, i.e. where z1, z2 and z3 correspond to one of the 10 most
frequently occurring combinations of z1, z2 and z3.

Regards.




Re: 10 most frequent occurring values of a multiple response set

Beadle, ViAnn
In reply to this post by Edward Boadi
Then ignore the whole concept of a multiple response set and just compute some variable which is a combination of all three values. For example, if z1, z2, and z3 take on two-digit values you'll need something like:

Compute z=z1 + z2*1000 + z3*100000.
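
For anyone who wants to check the arithmetic outside SPSS, here is a minimal Python sketch of why a key like this is unique. It is only an illustration: the function names `encode` and `decode` are made up, and it assumes non-negative integer codes with z1 below 1000 and z2 below 100.

```python
# Hypothetical helpers illustrating the combined-key arithmetic; this is not SPSS syntax.

def encode(z1, z2, z3):
    """Pack three codes into one number, as in z = z1 + z2*1000 + z3*100000."""
    return z1 + z2 * 1000 + z3 * 100000


def decode(z):
    """Recover (z1, z2, z3), assuming 0 <= z1 < 1000 and 0 <= z2 < 100."""
    z3, rest = divmod(z, 100000)
    z2, z1 = divmod(rest, 1000)
    return z1, z2, z3


key = encode(12, 34, 56)
print(key)          # 5634012
print(decode(key))  # (12, 34, 56)
```

Because the key can be decoded back into the three original values, two different combinations can never collide, as long as the codes stay within the assumed ranges.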

The second step is to rank occurrences, not values.

You need to use AGGREGATE to capture the occurrences into a variable, using the N function and z as your break variable. This will give you a dataset with one row for each unique value of z, along with N. Sort that dataset in descending order on N and then compute nrank = $casenum after the sort.
So your aggregated dataset has z, N, and nrank. You have to get nrank onto your original dataset through a table match. But to do so, you need to sort both the aggregated dataset and the original dataset on z and use z as the matching key. Once nrank is on your dataset, you can either filter or select cases with nrank less than or equal to 10.

Here's some syntax that I pasted from SPSS, release 14+, that might do the trick:
* Open the original data and build the combined key z.
GET
  FILE='C:\Program Files\SPSS\originaldata.sav'.
DATASET NAME DataSet1 WINDOW=FRONT.
COMPUTE Z=z1+z2*1000+z3*100000.
* Count the occurrences of each z value into a new dataset.
DATASET DECLARE ranked_data.
AGGREGATE
  /OUTFILE='ranked_data'
  /BREAK=z
  /N=N.
* Rank the z values by descending frequency.
DATASET ACTIVATE ranked_data.
SORT CASES BY
  N (D) .
COMPUTE nrank = $casenum .
EXECUTE .
* Sort both datasets on z so they can be matched.
DATASET ACTIVATE DataSet1.
SORT CASES BY
  z (A) .
DATASET ACTIVATE ranked_data.
SORT CASES BY
  z (A) .
DATASET ACTIVATE DataSet1.
SAVE OUTFILE='C:\Program Files\SPSS\originaldata.sav'
 /COMPRESSED.
* Bring nrank onto the original cases with a table match.
MATCH FILES /FILE=*
 /TABLE='ranked_data'
 /BY z.
EXECUTE.
* Filter on the 10 highest-ranked z values.
USE ALL.
COMPUTE filter_$=(nrank <= 10).
VARIABLE LABEL filter_$ 'nrank <= 10 (FILTER)'.
VALUE LABELS filter_$  0 'Not Selected' 1 'Selected'.
FORMAT filter_$ (f1.0).
FILTER BY filter_$.
* ... rest of analysis goes here.

I think the big issue here is what to do about ties. In my example, the 10th-highest frequency was shared by 5 values, and this code simply takes the first 10 rows, which happen to be sorted on the z variable.
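
To make the tie question concrete, here is a small Python sketch (not SPSS syntax) that contrasts a strict cutoff at n rows with a cutoff that keeps every key tied at the boundary. The function name `top_keys`, the `keep_ties` switch and the sample data are all invented for illustration; which behaviour is right depends on the analysis.

```python
from collections import Counter


def top_keys(keys, n, keep_ties=False):
    """Return the n most frequent keys; optionally keep all keys tied at the cutoff."""
    counts = Counter(keys)
    # Sort by descending count, then by ascending key so ties break predictably.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    if len(ranked) <= n:
        return [k for k, _ in ranked]
    if not keep_ties:
        return [k for k, _ in ranked[:n]]
    cutoff = ranked[n - 1][1]  # count of the n-th ranked key
    return [k for k, c in ranked if c >= cutoff]


# Made-up keys: 1 occurs three times, 2 and 3 twice each, 4 once.
keys = [1, 1, 1, 2, 2, 3, 3, 4]
print(top_keys(keys, 2))                  # [1, 2]
print(top_keys(keys, 2, keep_ties=True))  # [1, 2, 3]
```

With the strict cutoff, key 3 is dropped only because it sorts after key 2, which is the same kind of arbitrary choice the $casenum ranking makes.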


Re: 10 most frequent occurring values of a multiple response set

Richard Ristow
At 04:28 PM 7/20/2006, Beadle, ViAnn wrote:

>Compute some variable which is a combination of all three values. For
>example if z1, z2, and z3 take on two[-digit] values you'll need some
>thing like:
>
>Compute z=z1 + z2*1000 + z3*100000.
>
>The second step is to rank occurrences, not values.
>
>You need to use aggregate to capture the occurrences into a variable,
>using the N function and z as your break variable.

Etc. I think this is exactly right, except why "compute some variable
which is a combination of all three values"? AGGREGATE is perfectly
happy with BREAKing on multiple variables. I'd suggest

DATASET DECLARE ranked_data.
AGGREGATE
   /OUTFILE='ranked_data'
   /BREAK=z1 z2 z3
   /N=N.

instead of

COMPUTE Z=z1+z2*1000+z3*100000.
DATASET DECLARE ranked_data.
AGGREGATE
   /OUTFILE='ranked_data'
   /BREAK=z
   /N=N.
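
The same idea, counting each (z1, z2, z3) combination directly instead of building a combined key first, can be sketched in Python as follows. This is an illustration only, with made-up rows; it is not a substitute for the AGGREGATE syntax above.

```python
from collections import Counter

# Made-up (z1, z2, z3) rows for illustration.
rows = [
    (1, 2, 3), (1, 2, 3), (3, 2, 1),
    (4, 4, 4), (1, 2, 3), (4, 4, 4),
]

# Counting tuples is the analogue of breaking on several variables at once.
combo_counts = Counter(rows)

for combo, n in combo_counts.most_common():
    print(combo, n)
```

No arithmetic on the values is needed, so this also works when the codes are not small non-negative integers.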

Re: 10 most frequent occurring values of a multiple response set

Edward Boadi
In reply to this post by Edward Boadi
Thanks Richard + Beadle for your syntax on the above subject.

I have a couple of questions.

Consider the syntax:
.
.
.
GET FILE='C:\Program Files\SPSS\originaldata.sav'.
MATCH FILES /FILE=*
 /TABLE='ranked_data'
 /BY z.
SELECT IF (nrank <= 10).
EXECUTE.
.
.
.

Where:
1. ranked_data contains the aggregated data, with variables z and nrank
2. originaldata.sav is the original data file, with variables x, y1, y2, z1, z2 and z3
3. the z variable was created from the aggregation of z1, z2 and z3

The syntax above is supposed to keep only cases whose z1, z2 and z3 values are in the ranked data file (nrank <= 10).
But after doing my analysis I still get values of z1, z2 and z3 that are not in the ranked data file.

Please advise.


Regards to all.




Re: 10 most frequent occurring values of a multiple response set

Edward Boadi
In reply to this post by Edward Boadi
The following syntax creates the data sets 'originaldata.sav' and 'ranked_data.sav', with 'ranked_data.sav' containing Z and RankN, the rank of the aggregated values of z1, z2, z3 and z4.

Now I want to perform the following task:

Set the values of z1, z2, z3 and z4 to system-missing when they are not in 'ranked_data.sav' with RankN <= 3, say.

Thus I want to exclude from my analysis values of z1, z2, z3 and z4 with rank greater than 3.

But after running the syntax below, I still have values of z1, z2, z3 and z4 with rank greater than 3.

DATA LIST FREE/x y z1 z2 z3 z4.
BEGIN DATA
2 1 1 3 3 5
1 2 4 1 4 3
1 3 4 5 9 1
1 4 5 2 5 4
2 2 5 1 3 1
1 3 2 2 2 5
1 2 1 2 1 1
1 1 9 4 1 1
1 1 2 4 5 1
1 3 1 5 1 1
1 1 2 4 4 4
1 2 2 9 4 4
2 4 5 1 2 3
1 1 1 1 9 2
1 2 5 1 1 3
1 4 5 1 2 1
1 3 1 2 4 4
END DATA.
VECTOR r=z1 to z4.
LOOP cnt=1 TO 4.
- COMPUTE z=r(cnt).
- XSAVE OUTFILE ='C:\Temp\originaldata.sav'.
END LOOP.
EXECUTE.

SAVE OUTFILE='C:\Temp\originaldata.sav'.
GET FILE ='C:\Temp\originaldata.sav'.
***Aggregate by Z .
AGGREGATE
 /OUTFILE='c:\Temp\ranked_data.sav'
 /BREAK=z
 /N_BREAK=N.
****Rank z.
GET FILE= 'c:\Temp\ranked_data.sav'.
SORT CASES BY N_BREAK (D) .
RANK VARIABLES=N_BREAK (D)
 /RANK INTO RankN.
EXECUTE .
SORT CASES BY Z(A).
SAVE OUTFILE ='C:\Temp\ranked_data.sav' .
GET FILE='C:\Temp\originaldata.sav'.
SORT CASES BY Z(A).
MATCH FILES /FILE=*
 /TABLE='c:\Temp\ranked_data.sav'
 /BY z.
EXECUTE.
SELECT IF (RankN <= 3).

-----Original Message-----
From: Beadle, ViAnn [mailto:[hidden email]]
Sent: Friday, July 21, 2006 1:12 PM
To: Edward Boadi; Richard Ristow
Subject: RE: 10 most frequent occurring values of a multiple response set


Kinda hard to say what's wrong here without lots more information, such as the complete syntax you actually ran and a case listing that demonstrates the problem. And I'm not sure what you mean by "I still get values of z1, z2, and z3 that are not in the ranked data file."


Re: 10 most frequent occurring values of a multiple response set

Richard Ristow
In reply to this post by Edward Boadi
At 12:19 PM 7/21/2006, Edward Boadi wrote:

>I have some couple of questions.
[...]

>GET FILE='C:\Program Files\SPSS\originaldata.sav'.
>MATCH FILES /FILE=*
>  /TABLE='ranked_data'
>  /BY z.
>SELECT IF (nrank <= 10).
>.
>
>Where
>1. ranked_data contains aggregated data with variables z and nrank
>2. originaldata.sav is the original data file with
>variables  x,y1,y2,z1,z2 and z3
>3. z variable was created from the aggregation of z1,z2 and z3
>
>The syntax above is supposed to keep only cases whose z1, z2 and z3
>values are in the ranked data file (nrank <= 10). But after doing my analysis
>I still get values of z1, z2 and z3 that are not in the ranked data
>file. Please advise.

OK. The syntax that ViAnn Beadle suggested, and I modified, looks for
the top ten COMBINATIONS of the three values z1, z2, z3. (The code is
sensitive to order - the combination z1=A,z2=B,z3=C is counted as
different from, say z1=C,z2=B,z3=A.)
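
The difference between counting ordered combinations and counting values pooled over all three variables can be seen in a short Python sketch. The rows are made up and the code is only an illustration of the two ways of counting, not of any SPSS procedure.

```python
from collections import Counter

# Made-up (z1, z2, z3) rows; the second row holds the same values in reverse order.
rows = [("A", "B", "C"), ("C", "B", "A"), ("A", "B", "C")]

# Ordered combinations: position matters, so the reversed row is a separate key.
ordered = Counter(rows)

# Pooled values: every value counts once per appearance, whatever its position.
pooled = Counter(value for row in rows for value in row)

print(ordered[("A", "B", "C")])  # 2
print(ordered[("C", "B", "A")])  # 1
print(pooled["A"])               # 3
```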

I wasn't sure from your postings, but it sounds like you want to keep
the 10 values that occur most often, over all three variables. OK, that
can be done, though let me know, first, if I've got that right. And
what do you do, if, say,
z1=A,z2=B,z3=C, A is among the top 10, and B and C are not? Keep the
combination, or make z2 and z3 system-missing, or what?

It's manageable; I'd just like to understand the problem better.

Good luck,
Richard


Re: 10 most frequent occurring values of a multiple response set

Edward Boadi
In reply to this post by Edward Boadi
Thanks Richard,
I want to keep the 10 values that occur most often over all three variables (z1, z2 and z3).

Thus:
1. If z1=A, z2=B, z3=C, and A is among the top 10 but B and C are not, make z2 and z3 system-missing for that record (case).
2. If z1=A, z2=B, z3=C, and A and B are among the top 10 but C is not, make z3 system-missing for that record (case), etc.

My objective is to set z1, z2 and z3 to system-missing wherever their values are not in the top 10.

Thanks.

Edward
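
The rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not SPSS syntax and not a solution posted in the thread: the function name `keep_top_values` and the sample rows are invented, `None` stands in for system-missing, and ties at the cutoff are broken arbitrarily by value.

```python
from collections import Counter


def keep_top_values(rows, n):
    """Blank out (set to None) every value that is not among the n most frequent
    values pooled over all columns."""
    pooled = Counter(v for row in rows for v in row if v is not None)
    # Sort by descending count, then ascending value, so the cutoff is deterministic.
    ranked = sorted(pooled.items(), key=lambda kv: (-kv[1], kv[0]))
    top = {value for value, _ in ranked[:n]}
    return [tuple(v if v in top else None for v in row) for row in rows]


# Made-up (z1, z2, z3) rows; pooled counts are 2 -> 4, 1 -> 3, 3 -> 2, others -> 1.
rows = [(1, 2, 3), (1, 2, 4), (1, 5, 3), (2, 2, 6)]
for row in keep_top_values(rows, 2):
    print(row)
# (1, 2, None)
# (1, 2, None)
# (1, None, None)
# (2, 2, None)
```

Every case is kept; only the individual values outside the top n are blanked, which matches points 1 and 2 above.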






