Dear List,
I have a data file with variables X, y1, y2, z1, z2 and z3. I want to accomplish the following task:

1. Create a multiple response set z from z1, z2 and z3.
2. Rank z and select cases for rank z <= 10.
3. Select cases from my original data file where z = z1, z2 or z3.

My objective is to create a new dataset restricted to the 10 most frequently occurring values of a multiple response set created from z1, z2 and z3. Any ideas on how to accomplish this will be most welcome.

Regards to all.
I'm not quite sure what it means to rank z since it is a set of 3 values. Are you looking for the most frequently occurring combinations of the three variables?
Is this some sort of RFM analysis?

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Edward Boadi
Sent: Thursday, July 20, 2006 2:03 PM
To: [hidden email]
Subject: 10 most frequent occurring values of a multiple response set
In reply to this post by Edward Boadi
This is not RFM analysis.
Yes, I am looking for the 10 most frequently occurring combinations of the three variables as my initial step. Then I want to select X, y1, y2, z1, z2 and z3 where (z1, z2, z3) = z, i.e. where z1, z2 and z3 correspond to one of the 10 most frequently occurring combinations of z1, z2 and z3.

Regards.

-----Original Message-----
From: Beadle, ViAnn [mailto:[hidden email]]
Sent: Thursday, July 20, 2006 3:15 PM
To: Edward Boadi; [hidden email]
Subject: RE: 10 most frequent occurring values of a multiple response set
In reply to this post by Edward Boadi
Then ignore the whole concept of a multiple response set and just compute some variable which is a combination of all three values. For example, if z1, z2, and z3 take on two-digit values, you'll need something like:
Compute z=z1 + z2*1000 + z3*100000.

The second step is to rank occurrences, not values.

You need to use AGGREGATE to capture the occurrences into a variable, using the N function and z as your break variable. This will give you a dataset with one row for each unique value of z, holding z and N. Sort that dataset in descending order on N and then compute nrank = $casenum after the sort. So your aggregated dataset has z, N, and nrank.

You have to get nrank onto your original dataset through a table match. To do so, you need to sort both the aggregated dataset and the original dataset on z and use z as the matching key.

Once nrank is on your dataset, you can either filter or select cases with nrank less than or equal to 10.

Here's some syntax that I pasted from SPSS, release 14+, that might do the trick:

GET FILE='C:\Program Files\SPSS\originaldata.sav'.
DATASET NAME DataSet1 WINDOW=FRONT.
COMPUTE Z=z1+z2*1000+z3*100000.
DATASET DECLARE ranked_data.
AGGREGATE
  /OUTFILE='ranked_data'
  /BREAK=z
  /N=N.
DATASET ACTIVATE ranked_data.
SORT CASES BY N (D) .
COMPUTE nrank = $casenum .
EXECUTE .
DATASET ACTIVATE DataSet1.
SORT CASES BY z (A) .
DATASET ACTIVATE ranked_data.
SORT CASES BY z (A) .
DATASET ACTIVATE DataSet1.
SAVE OUTFILE='C:\Program Files\SPSS\originaldata.sav' /COMPRESSED.
MATCH FILES /FILE=*
  /TABLE='ranked_data'
  /BY z.
EXECUTE.
USE ALL.
COMPUTE filter_$=(nrank <= 10).
VARIABLE LABEL filter_$ 'nrank <= 10 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMAT filter_$ (f1.0).
FILTER BY filter_$.
... rest of analysis goes here

I think the big issue here is what to do about ties. In my example, the 10th most frequent count was shared by 5 values, and this code takes the first 10 frequencies, which happen to be sorted on the z variable.
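For readers who want to check the logic outside SPSS, the steps described above (build a combined key, count occurrences per key, rank by count, match the rank back, select rank <= n) can be sketched in Python. This is only an illustration, not the syntax from the post: the function names and the sample rows are made up, and breaking ties by the smaller key value is an assumption chosen to give a deterministic order.

```python
from collections import Counter

def composite_key(z1, z2, z3):
    # Same arithmetic as COMPUTE z = z1 + z2*1000 + z3*100000.
    return z1 + z2 * 1000 + z3 * 100000

def rank_keys(keys):
    """Map each key to its rank (1 = most frequent).

    Ties are broken by the smaller key value (an assumption for this sketch).
    """
    counts = Counter(keys)
    ordered = sorted(counts, key=lambda k: (-counts[k], k))
    return {k: i for i, k in enumerate(ordered, start=1)}

def select_top(rows, n):
    """Keep rows whose (z1, z2, z3) combination is among the n most frequent."""
    keys = [composite_key(r["z1"], r["z2"], r["z3"]) for r in rows]
    nrank = rank_keys(keys)  # plays the role of the table match on z
    return [r for r, k in zip(rows, keys) if nrank[k] <= n]

# Hypothetical sample data: one combination occurs 3 times, one twice, one once.
rows = [
    {"x": 1, "z1": 10, "z2": 20, "z3": 30},
    {"x": 2, "z1": 10, "z2": 20, "z3": 30},
    {"x": 3, "z1": 11, "z2": 20, "z3": 30},
    {"x": 4, "z1": 12, "z2": 21, "z3": 31},
    {"x": 5, "z1": 11, "z2": 20, "z3": 30},
    {"x": 6, "z1": 10, "z2": 20, "z3": 30},
]
top = select_top(rows, 2)
print([r["x"] for r in top])  # [1, 2, 3, 5, 6]
```

With a cut-off of 2, only the case with the once-occurring combination (x = 4) is dropped. A different tie rule, such as keeping every key tied at the cut-off, would change which cases survive when counts are equal.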
-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Edward Boadi
Sent: Thursday, July 20, 2006 2:25 PM
To: [hidden email]
Subject: Re: 10 most frequent occurring values of a multiple response set
At 04:28 PM 7/20/2006, Beadle, ViAnn wrote:
>Compute some variable which is a combination of all three values. For
>example if z1, z2, and z3 take on two[-digit] values you'll need some
>thing like:
>
>Compute z=z1 + z2*1000 + z3*100000.
>
>The second step is to rank occurrences, not values.
>
>You need to use aggregate to capture the occurrences into a variable,
>using the N function and z as your break variable.

Etc. I think this is exactly right, except why "compute some variable which is a combination of all three values"? AGGREGATE is perfectly happy with BREAKing on multiple variables. I'd suggest

DATASET DECLARE ranked_data.
AGGREGATE
  /OUTFILE='ranked_data'
  /BREAK=z1 z2 z3
  /N=N.

instead of

COMPUTE Z=z1+z2*1000+z3*100000.
DATASET DECLARE ranked_data.
AGGREGATE
  /OUTFILE='ranked_data'
  /BREAK=z
  /N=N.
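A small Python sketch may help show what breaking on several variables buys over the computed key. The collision shown here is not raised in the thread; it is just arithmetic on the formula z1 + z2*1000 + z3*100000, which stops being unique once z2 reaches 100 (or z1 reaches 1000). The function names and sample rows are hypothetical.

```python
from collections import Counter

def composite_key(z1, z2, z3):
    # Same arithmetic as COMPUTE z = z1 + z2*1000 + z3*100000.
    return z1 + z2 * 1000 + z3 * 100000

# Two different combinations that map to the same key once z2 reaches 100.
a = (0, 100, 0)
b = (0, 0, 1)
collide = composite_key(*a) == composite_key(*b)  # both give 100000

def count_combinations(rows, break_vars):
    """One count per distinct combination of break_vars.

    Analogous to AGGREGATE with several BREAK variables and the N function.
    """
    return Counter(tuple(r[v] for v in break_vars) for r in rows)

rows = [
    {"z1": 0, "z2": 100, "z3": 0},
    {"z1": 0, "z2": 0, "z3": 1},
    {"z1": 0, "z2": 0, "z3": 1},
]
counts = count_combinations(rows, ["z1", "z2", "z3"])
print(collide)      # True
print(len(counts))  # 2
```

Keying on the tuple of values keeps the two combinations apart whatever their magnitude, which is the same effect as listing z1 z2 z3 on BREAK.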
In reply to this post by Edward Boadi
Thanks Richard + Beadle for your syntax on the above subject.
I have a couple of questions. Consider the syntax:

. . .
GET FILE='C:\Program Files\SPSS\originaldata.sav'.
MATCH FILES /FILE=*
  /TABLE='ranked_data'
  /BY z.
SELECT IF (nrank <= 10).
EXECUTE.
. . .

where

1. ranked_data contains aggregated data with variables z and nrank
2. originaldata.sav is the original data file with variables x, y1, y2, z1, z2 and z3
3. the z variable was created from the aggregation of z1, z2 and z3

The syntax above is supposed to keep only cases whose z1, z2 and z3 are in the ranked data file (nrank <= 10). But after doing my analysis I still get values of z1, z2 and z3 that are not in the ranked data file. Please advise.

Regards to all.

-----Original Message-----
From: Richard Ristow [mailto:[hidden email]]
Sent: Friday, July 21, 2006 12:46 AM
To: [hidden email]
Cc: Edward Boadi; Beadle, ViAnn
Subject: Re: 10 most frequent occurring values of a multiple response set
In reply to this post by Edward Boadi
The following syntax creates the data sets 'originaldata.sav' and 'ranked_data.sav', with 'ranked_data.sav' containing Z and RankN, the rank of the aggregated values of z1, z2, z3 and z4. Now I want to perform the following task: set the values of z1, z2, z3 and z4 to system-missing when they are not in 'ranked_data.sav' with RankN <= 3, say. Thus I want to exclude from my analysis values of z1, z2, z3 and z4 with rank greater than 3. But after running the syntax below, I still have values of z1, z2, z3 and z4 with rank greater than 3.

DATA LIST FREE/x y z1 z2 z3 z4.
BEGIN DATA
2 1 1 3 3 5
1 2 4 1 4 3
1 3 4 5 9 1
1 4 5 2 5 4
2 2 5 1 3 1
1 3 2 2 2 5
1 2 1 2 1 1
1 1 9 4 1 1
1 1 2 4 5 1
1 3 1 5 1 1
1 1 2 4 4 4
1 2 2 9 4 4
2 4 5 1 2 3
1 1 1 1 9 2
1 2 5 1 1 3
1 4 5 1 2 1
1 3 1 2 4 4
END DATA.

VECTOR r=z1 to z4.
LOOP cnt=1 TO 4.
- COMPUTE z=r(cnt).
- XSAVE OUTFILE ='C:\Temp\originaldata.sav'.
END LOOP.
EXECUTE.
SAVE OUTFILE='C:\Temp\originaldata.sav'.
GET FILE ='C:\Temp\originaldata.sav'.

***Aggregate by Z .
AGGREGATE
  /OUTFILE='c:\Temp\ranked_data.sav'
  /BREAK=z
  /N_BREAK=N.

****Rank z.
GET FILE= 'c:\Temp\ranked_data.sav'.
SORT CASES BY N_BREAK (D) .
RANK VARIABLES=N_BREAK (D) /RANK INTO RankN.
EXECUTE .
SORT CASES BY Z(A).
SAVE OUTFILE ='C:\Temp\ranked_data.sav' .

GET FILE='C:\Temp\originaldata.sav'.
SORT CASES BY Z(A).
MATCH FILES /FILE=*
  /TABLE='c:\Temp\ranked_data.sav'
  /BY z.
EXECUTE.
SELECT IF (RankN <= 3).

-----Original Message-----
From: Beadle, ViAnn [mailto:[hidden email]]
Sent: Friday, July 21, 2006 1:12 PM
To: Edward Boadi; Richard Ristow
Subject: RE: 10 most frequent occurring values of a multiple response set

Kinda hard to say what's wrong here without lots more information, such as the complete syntax you actually ran and a case listing that demonstrates the problem. And I'm not sure what you mean by "I still get values of z1, z2, and z3 that are not in the ranked data file."
_____
From: Edward Boadi [mailto:[hidden email]]
Sent: Fri 7/21/2006 11:19 AM
To: Richard Ristow; Beadle, ViAnn; [hidden email]
Subject: RE: 10 most frequent occurring values of a multiple response set
In reply to this post by Edward Boadi
At 12:19 PM 7/21/2006, Edward Boadi wrote:
>I have some couple of questions. [...]
>GET FILE='C:\Program Files\SPSS\originaldata.sav'.
>MATCH FILES /FILE=*
> /TABLE='ranked_data'
> /BY z.
>SELECT IF (nrank <= 10).
>.
>
>Where
>1. ranked_data contains aggregated data with variables z and nrank
>2. originaldata.sav is the original data file with
>variables x,y1,y2,z1,z2 and z3
>3. z variable was created from the aggregation of z1,z2 and z3
>
>The syntax above is supposed to keep only cases with z1, z2 and z3
>are in the ranked data file (nrank <= 10). But after doing my analysis
>I still get values of z1, z2 and z3 that are not in the ranked data
>file. Please advise.

OK. The syntax that ViAnn Beadle suggested, and I modified, looks for the top ten COMBINATIONS of the three values z1, z2, z3. (The code is sensitive to order - the combination z1=A,z2=B,z3=C is counted as different from, say, z1=C,z2=B,z3=A.)

I wasn't sure from your postings, but it sounds like you want to keep the 10 values that occur most often, over all three variables. OK, that can be done, though let me know, first, if I've got that right. And what do you do if, say, z1=A,z2=B,z3=C, A is among the top 10, and B and C are not? Keep the combination, or make z2 and z3 system-missing, or what?

It's manageable; I'd just like to understand the problem better.

Good luck,
Richard
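The distinction drawn above, counting ordered combinations versus counting individual values pooled over all three variables, can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the helper names are invented, None stands in for system-missing, the letters A to D are placeholder values, and ties are broken alphabetically.

```python
from collections import Counter

def pooled_value_counts(rows, variables):
    """Count how often each value occurs across all listed variables together."""
    counts = Counter()
    for r in rows:
        for v in variables:
            if r[v] is not None:  # None stands in for system-missing
                counts[r[v]] += 1
    return counts

def top_values(rows, variables, n):
    """The n most frequent pooled values; ties broken by value (an assumption)."""
    counts = pooled_value_counts(rows, variables)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return ordered[:n]

z_vars = ["z1", "z2", "z3"]
rows = [
    {"z1": "A", "z2": "B", "z3": "C"},
    {"z1": "C", "z2": "B", "z3": "A"},  # same values as above, different order
    {"z1": "A", "z2": "D", "z3": None},
]

# Order-sensitive combinations: every row is its own combination here.
combos = Counter(tuple(r[v] for v in z_vars) for r in rows)
# Pooled values: A occurs three times regardless of which variable holds it.
pooled = pooled_value_counts(rows, z_vars)

print(len(combos))                  # 3
print(pooled["A"])                  # 3
print(top_values(rows, z_vars, 2))  # ['A', 'B']
```

In this toy data no combination occurs more than once, yet the value A occurs three times, so the two definitions of "most frequent" select quite different things.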
In reply to this post by Edward Boadi
Thanks Richard,
I want to keep the 10 values that occur most often over all three variables (z1, z2 and z3). Thus:

1. If z1=A, z2=B, z3=C, and A is among the top 10 but B and C are not, make z2 and z3 system-missing for that record (case).
2. If z1=A, z2=B, z3=C, and A and B are among the top 10 but C is not, make z3 system-missing for that record (case).

etc.

My objective is to set z1, z2 and z3 to system-missing wherever their values are not in the top 10.

Thanks.
Edward

-----Original Message-----
From: Richard Ristow [mailto:[hidden email]]
Sent: Friday, July 21, 2006 6:36 PM
To: Edward Boadi; [hidden email]
Subject: Re: 10 most frequent occurring values of a multiple response set
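The rule Edward describes (keep the n values that occur most often across z1 to z3, and blank out everything else case by case) can be sketched in Python as follows. This is not SPSS syntax and not a solution posted in the thread; it is a hedged illustration in which the function name and sample rows are hypothetical, None stands in for system-missing, and ties at the cut-off are broken by the smaller value.

```python
from collections import Counter

def mask_rare_values(rows, variables, n):
    """Return copies of rows with values outside the top n set to None.

    Frequencies are pooled over all listed variables. None stands in for
    system-missing; ties are broken by the smaller value (an assumption).
    """
    counts = Counter(r[v] for r in rows for v in variables if r[v] is not None)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    keep = set(ordered[:n])
    masked = []
    for r in rows:
        new_r = dict(r)  # leave the original row untouched
        for v in variables:
            if new_r[v] not in keep:
                new_r[v] = None
        masked.append(new_r)
    return masked

# Hypothetical sample: pooled counts are 4 -> 3, 5 -> 3, 1 -> 2, 3 -> 2, 2 -> 1, 9 -> 1.
rows = [
    {"x": 1, "z1": 1, "z2": 3, "z3": 3},
    {"x": 2, "z1": 4, "z2": 1, "z3": 4},
    {"x": 3, "z1": 4, "z2": 5, "z3": 9},
    {"x": 4, "z1": 5, "z2": 2, "z3": 5},
]
masked = mask_rare_values(rows, ["z1", "z2", "z3"], 2)
print(masked[1])  # {'x': 2, 'z1': 4, 'z2': None, 'z3': 4}
```

With a cut-off of 2, only the values 4 and 5 survive. Other variables such as x are left alone, and a case is kept even when all of its z values end up missing, which matches the stated objective of blanking values rather than dropping cases.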