On Fri, 4 Aug 2006 15:41:33 -0700, GREENE Teresa <[hidden email]>
wrote: >Greetings all-- > >Is it possible to filter a dataset based on a file that contains a list >of identifiers? If I need to isolate 1000 subjects that have no common >variable from a dataset of over 800,000, what is the most expedient way >to do that? First, both file should have a common variable like "ID" and their variable names have to be the same, if they are string, their variable lengths have to be the same, too. No duplicated ID should exist in each file. 1. Open the bigger file, use Data > Sort cases to sort the ID in ascending order, save. 2. Open the smaller file that just contains one variable: ID of the 1000 people you want, sort that too and save. 3. With the smaller file opened, go to Data> Merge files> Add bariables 4. Select the bigger file, click Open 5. Highlight the ID variable in the "Excluded Variables" panel 6. Check the box "Match cases on key variables in sorted files" 7. Check the box "External file is keyed table" 8. Click the little [>] button next to the Key Variables panel and bring the ID variable over there 9. Click OK to submit the command, you can ignore the warning about the sorting of files, cause we did it in steps 1 and 2 And that should do the trick. |
Hi Kenneth--
Thank you for your response. Unfortunately, this does not work for this problem. I have two files that have multiple records for each ID. In one file there is an incident date. In the other file there is a date range indicating duration of enrollment in a program. I have to isolate the one record for each ID from one file that has the date range that is inclusive of the incident date for that ID in the other file. I have tried different ways of matching the files, but because of the multiple records, the match never works correctly for all records. Every time, I get records that are either mismatched, or didn't match when they should have. I have worked with systems where you can take a list of identifiers, create a text file and then indicate that file as the filter for selecting cases from a database. I am trying to figure out if there is a way to do this in SPSS. Thanks for trying!!! Teresa :) ------------------------------ Date: Fri, 18 Aug 2006 16:54:35 -0400 From: Ken Chui <[hidden email]> Subject: Re: Filter file possible? On Fri, 4 Aug 2006 15:41:33 -0700, GREENE Teresa <[hidden email]> wrote: >Greetings all-- > >Is it possible to filter a dataset based on a file that contains a list >of identifiers? If I need to isolate 1000 subjects that have no common >variable from a dataset of over 800,000, what is the most expedient way >to do that? First, both file should have a common variable like "ID" and their variable names have to be the same, if they are string, their variable lengths have to be the same, too. No duplicated ID should exist in each file. 1. Open the bigger file, use Data > Sort cases to sort the ID in ascending order, save. 2. Open the smaller file that just contains one variable: ID of the 1000 people you want, sort that too and save. 3. With the smaller file opened, go to Data> Merge files> Add bariables 4. Select the bigger file, click Open 5. Highlight the ID variable in the "Excluded Variables" panel 6. Check the box "Match cases on key variables in sorted files" 7. Check the box "External file is keyed table" 8. Click the little [>] button next to the Key Variables panel and bring the ID variable over there 9. Click OK to submit the command, you can ignore the warning about the sorting of files, cause we did it in steps 1 and 2 And that should do the trick. |
At 12:58 PM 8/21/2006, GREENE Teresa wrote:
>I have two files that have multiple records for each ID. In one file >there is an incident date. In the other file there is a date range >indicating duration of enrollment in a program. I have to isolate the >one record for each ID from one file that has the date range that is >inclusive of the incident date for that ID in the other file. OK. No tested code this time, but if this description isn't enough, and you have some test data, I'll try it. Don't merge the files; interleave them.(*) Assume . Your files are accessible by file handles INCIDENT and ENROLL, the former with one record per incident, the latter with one record per enrollment period. . ENROLL contains, among other variables, STDT_ID identifying the individual, and ENTER_DT and EXIT_DT, SPSS date variables giving the beginning and end of the enrollment period. No two enrollment periods overlap - that is, every EXIT_DT is earlier than the following ENTER_DT. . INCIDENT contains STDT_ID, and SPSS date variable INCID_DT, giving the incident date. . There are no variable names in common between the two files. (That can be relaxed, of course, if you're careful to keep them straight.) . Finally, you can write and read a temporary file to handle SCRATCH. Something like this: * I. Prepare the 'enrollment' file by adding variable . * KEY_DATE, which will be used for the interleave. . * Save the result. . * (KEY_DATE has the same value is ENTER_DT. It has . * to be a separate variable for the interleave, . * though.) . GET FILE=ENROLL. SORT CASES BY STDT_ID ENTER_DT /* If necessary */. NUMERIC KEY_DATE (DATE11). COMPUTE KEY_DATE = ENTER_DT. SAVE OUTFILE=SCRATCH /KEEP=KEY_DATE ALL. * II. Prepare the 'incident' file by adding variable . * KEY_DATE, for the interleave, in the active file. * (Here, KEY_DATE has the same value as INCID_DT, . * but, again, it needs to be a separate variable.) . GET FILE=INCIDENT. SORT CASES BY STDT_ID INCID_DT /* If necessary */. NUMERIC KEY_DATE (DATE11). COMPUTE KEY_DATE = ENTER_DT. * III. Now, interleave, attaching to each incident the . * data from the enrollment period WITH THE LATEST . * PRECEDING START DATE. . ADD FILES /FILE=SCRATCH /* Enrollments */ /FILE=* /IN=INCID /* Incidents */ /BY KEY_DATE /* Interleave */. LEAVE ENTER_DT EXIT_DT. LEAVE <any other variables you want from ENROLLMENT>. * IV. The following checks that the incident happened . * before the preceding EXIT_DT, and that the . * enrollment record belonged to this student. . * If either is not the case, the student was not . * enrolled at all when the incident occurred. . NUMERIC ENROLLED (F2). VAR LABEL ENROLLED 'Enrolled at time of incident?'. VAL LABEL ENROLLED 1 'Enrolled' 2 'Not enrolled'. DO IF MISSING(LAG(STDT_ID)). . COMPUTE ENROLLED = 2. ELSE IF STDT_ID NE LAG(STDT_ID). . COMPUTE ENROLLED = 2. ELSE IF INCID_DT GT EXIT_DT. . COMPUTE ENROLLED = 2. ELSE. . COMPUTE ENROLLED = 1. END IF. DO IF ENROLLED = 2. . COMPUTE ENTER_DT = $SYSMIS. . COMPUTE EXIT_DT = $SYSMIS. *..Etc., for all 'enrollment' variables . * you're keeping in the incident records . END IF. SELECT IF INCID EQ 1. * You may want to drop variables INCID and KEY_DATE . * from the final file. . .................................... (*) I wrote something like this a while ago, posted as "Re: Assigning prices by dates", Mon, 6 Mar 2006 01:49:29 -0500. That program looks more complicated than it is. A lot of the code is creating a list of months, analogous to your INCIDENT file. ; for my reference, an |
In reply to this post by GREENE Teresa
Shalom
Here is a simple example of interleaving dates as Richard Ristow also suggest . The program generate random dates so each time you will run it there will be deferent matches . title interleave dates . input program . loop subjectnum =1 to 20 . compute secday=60*60*24 . compute startenroll= date.dmy(1,1,2005)+trunc(unifrom(360)*secday). compute endenroll= startenroll+trunc(unifrom(60)*secday)+1. end case . end loop . end file . end input program . formats startenroll endenroll(edate10) . save outfile='tmp.sav' / drop=secday . execute . input program . loop i=1 to 40 . compute secday=60*60*24 . compute subjectnum=trunc(i/5)+1 . compute incidentdate= date.dmy(1,6,2004)+trunc(unifrom(560)*secday). end case . end loop . end file . end input program . formats incidentdate(edate10) . save outfile='tmp1.sav' / drop=secday i . execute . add files file='tmp.sav' / file='tmp1.sav' /in=inincid. if inincid eq 1 tmpdate=incidentdate . if inincid eq 0 tmpdate=startenroll . formats tmpdate(edate10) . sort cases by subjectnum tmpdate. numeric instartdate inenddate(edate10). leave instartdate inenddate. if inincid eq 0 instartdate=startenroll . if inincid eq 0 inenddate=endenroll . if (inincid eq 1 ) and ( incidentdate ge instartdate)and ( incidentdate le inenddate) match=1 . execute . Hillel Vardi Ben Gurion U Israel GREENE Teresa wrote: > Hi Kenneth-- > > Thank you for your response. Unfortunately, this does not work for this > problem. I have two files that have multiple records for each ID. In > one file there is an incident date. In the other file there is a date > range indicating duration of enrollment in a program. I have to isolate > the one record for each ID from one file that has the date range that is > inclusive of the incident date for that ID in the other file. I have > tried different ways of matching the files, but because of the multiple > records, the match never works correctly for all records. Every time, I > get records that are either mismatched, or didn't match when they should > have. > > I have worked with systems where you can take a list of identifiers, > create a text file and then indicate that file as the filter for > selecting cases from a database. I am trying to figure out if there is > a way to do this in SPSS. > > Thanks for trying!!! > > Teresa :) > > ------------------------------ > > Date: Fri, 18 Aug 2006 16:54:35 -0400 > From: Ken Chui <[hidden email]> > Subject: Re: Filter file possible? > > On Fri, 4 Aug 2006 15:41:33 -0700, GREENE Teresa > <[hidden email]> > wrote: > > >> Greetings all-- >> >> Is it possible to filter a dataset based on a file that contains a list >> of identifiers? If I need to isolate 1000 subjects that have no common >> variable from a dataset of over 800,000, what is the most expedient way >> to do that? >> > > First, both file should have a common variable like "ID" and their > variable > names have to be the same, if they are string, their variable lengths > have > to be the same, too. No duplicated ID should exist in each file. > > 1. Open the bigger file, use Data > Sort cases to sort the ID in > ascending > order, save. > 2. Open the smaller file that just contains one variable: ID of the 1000 > people you want, sort that too and save. > 3. With the smaller file opened, go to Data> Merge files> Add bariables > 4. Select the bigger file, click Open > 5. Highlight the ID variable in the "Excluded Variables" panel > 6. Check the box "Match cases on key variables in sorted files" > 7. Check the box "External file is keyed table" > 8. Click the little [>] button next to the Key Variables panel and bring > the > ID variable over there > 9. Click OK to submit the command, you can ignore the warning about the > sorting of files, cause we did it in steps 1 and 2 > > And that should do the trick. > > |
In reply to this post by Ken Chui
Teresa:
Another solution would be to use CASESTOVARS to restructure your file of incidents. This will give 1 row per ID, which you can join to the enrollment file: ** sample ENROLLMENT data-- sorted on id. DATA LIST FREE /id (f8.0) start_enroll end_enroll (2ADATE10). BEGIN DATA 1 12/12/2002 2/15/2003 1 6/1/2003 6/25/2003 1 12/12/1997 12/12/1998 2 12/20/2004 1/5/2005 2 4/1/1999 4/12/1999 2 4/1/2002 6/1/2002 END DATA. DATASET name enroll window = front. ** sample INCINDENT data-- sorted on id. DATA LIST FREE /id (f8.0) incident_date (adate10). BEGIN DATA 1 1/1/2002 1 6/12/2003 2 1/1/2005 2 4/4/1999 2 5/25/2002 END DATA. DATASET NAME incident WINDOW = front. ** "flatten" file to 1 row per id. SORT CASES BY id incident_date . CASESTOVARS /ID = id . DATASET ACTIVATE enroll. MATCH FILES FILE = enroll /TABLE = incident /BY id. EXECUTE. ** flag cases were incident_date.n is between enrollment start and end. DO REPEAT x = incident_date.1 TO incident_date.3 /y = match_1 TO match_3. COMPUTE y = start_enroll LE x AND end_enroll GE x. END REPEAT. EXECUTE. You will need to deal with incidents that have multiple matches, and incidents that have no matches. Cheers --jim Jim Marks Senior Market Analyst x 1616 -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Richard Ristow Sent: Monday, August 21, 2006 10:04 PM To: [hidden email] Subject: Re: Filter file possible? At 12:58 PM 8/21/2006, GREENE Teresa wrote: >I have two files that have multiple records for each ID. In one file >there is an incident date. In the other file there is a date range >indicating duration of enrollment in a program. I have to isolate the >one record for each ID from one file that has the date range that is >inclusive of the incident date for that ID in the other file. OK. No tested code this time, but if this description isn't enough, and you have some test data, I'll try it. Don't merge the files; interleave them.(*) Assume . Your files are accessible by file handles INCIDENT and ENROLL, the former with one record per incident, the latter with one record per enrollment period. . ENROLL contains, among other variables, STDT_ID identifying the individual, and ENTER_DT and EXIT_DT, SPSS date variables giving the beginning and end of the enrollment period. No two enrollment periods overlap - that is, every EXIT_DT is earlier than the following ENTER_DT. . INCIDENT contains STDT_ID, and SPSS date variable INCID_DT, giving the incident date. . There are no variable names in common between the two files. (That can be relaxed, of course, if you're careful to keep them straight.) . Finally, you can write and read a temporary file to handle SCRATCH. Something like this: * I. Prepare the 'enrollment' file by adding variable . * KEY_DATE, which will be used for the interleave. . * Save the result. . * (KEY_DATE has the same value is ENTER_DT. It has . * to be a separate variable for the interleave, . * though.) . GET FILE=ENROLL. SORT CASES BY STDT_ID ENTER_DT /* If necessary */. NUMERIC KEY_DATE (DATE11). COMPUTE KEY_DATE = ENTER_DT. SAVE OUTFILE=SCRATCH /KEEP=KEY_DATE ALL. * II. Prepare the 'incident' file by adding variable . * KEY_DATE, for the interleave, in the active file. * (Here, KEY_DATE has the same value as INCID_DT, . * but, again, it needs to be a separate variable.) . GET FILE=INCIDENT. SORT CASES BY STDT_ID INCID_DT /* If necessary */. NUMERIC KEY_DATE (DATE11). COMPUTE KEY_DATE = ENTER_DT. * III. Now, interleave, attaching to each incident the . * data from the enrollment period WITH THE LATEST . * PRECEDING START DATE. . ADD FILES /FILE=SCRATCH /* Enrollments */ /FILE=* /IN=INCID /* Incidents */ /BY KEY_DATE /* Interleave */. LEAVE ENTER_DT EXIT_DT. LEAVE <any other variables you want from ENROLLMENT>. * IV. The following checks that the incident happened . * before the preceding EXIT_DT, and that the . * enrollment record belonged to this student. . * If either is not the case, the student was not . * enrolled at all when the incident occurred. . NUMERIC ENROLLED (F2). VAR LABEL ENROLLED 'Enrolled at time of incident?'. VAL LABEL ENROLLED 1 'Enrolled' 2 'Not enrolled'. DO IF MISSING(LAG(STDT_ID)). . COMPUTE ENROLLED = 2. ELSE IF STDT_ID NE LAG(STDT_ID). . COMPUTE ENROLLED = 2. ELSE IF INCID_DT GT EXIT_DT. . COMPUTE ENROLLED = 2. ELSE. . COMPUTE ENROLLED = 1. END IF. DO IF ENROLLED = 2. . COMPUTE ENTER_DT = $SYSMIS. . COMPUTE EXIT_DT = $SYSMIS. *..Etc., for all 'enrollment' variables . * you're keeping in the incident records . END IF. SELECT IF INCID EQ 1. * You may want to drop variables INCID and KEY_DATE . * from the final file. . .................................... (*) I wrote something like this a while ago, posted as "Re: Assigning prices by dates", Mon, 6 Mar 2006 01:49:29 -0500. That program looks more complicated than it is. A lot of the code is creating a list of months, analogous to your INCIDENT file. ; for my reference, an |
At 08:42 AM 8/22/2006, Marks, Jim wrote:
>Another solution would be to use CASESTOVARS to restructure your file >of incidents. This will give 1 row per ID, which you can join to the >enrollment file: Exactly. That's the "long to wide to long" method. It's much the best way to do a true many-to-many match in SPSS. ("Much the best" unless I've missed something very drastically.) It's an important tool to know, and to use. If you do it, use CASESTOVARS on whichever file has the fewer cases per matching key. (Your matching key is the student: 'id' or 'STDT_ID', in posted examples.) If that file has many variables you want to include in the matched output, you may want to CASESTOVARS only the ones you need for matching; VARSTOCASES back; and MATCH FILES to attach the full set of variables from the file you made 'wide' with CASESTOVARS. For this problem, I still recommend 'interleave' logic, which I think is simpler. It works in this case because the possible matches are ordered by an index variable - the date. Specific comments: - The example doesn't include the final VARSTOCASES. The code >** flag cases were incident_date.n is between enrollment start and >end. >DO REPEAT x = incident_date.1 TO incident_date.3 > /y = match_1 TO match_3. >COMPUTE y = start_enroll LE x AND end_enroll GE x. >END REPEAT. correctly marks *enrollment periods* in which any incident occurred, and which of the incidents occurred in that period. If you want one record per incident, with the enrollment period in which it occurred, you need a VARSTOCASES, something like this (untested): VARSTOCASES MAKE incident_date FROM incident_date.1 TO incident_date.3 /MAKE match FROM match_1 TO match3. SELECT IF match EQ 1. * Remark: This loses incidents that do not fall . * into any enrollment period. Keeping them takes a . * little more complicated selection, probably using . * AGGREGATE. . - As an alternative to VARSTOCASES, make "incident_date.1 TO incident_date.3" a VECTOR, replace the DO REPEAT by a LOOP, and use XSAVE to write a record for each incident, with its matching enrollment period. - The 'dataset' statements like >DATASET name enroll window = front. were introduced in SPSS 14. In earlier releases, use SAVE OUTFILE instead of DATASET NAME, and GET FILE instead of DATASET ACTIVATE. - Finally (I suppose you were all waiting for this one), the EXECUTE statements aren't needed. Good luck to all! Onward, and maybe upward, Richard |
Shalom
> At 08:42 AM 8/22/2006, Richard Ristow wrote: > > Exactly. That's the "long to wide to long" method. It's much the best > way to do a true many-to-many match in SPSS. ("Much the best" unless > I've missed something very drastically.) It's an important tool to > know, and to use. I still believe that a long method is better in these situations and here are some reasoning for that . 1. In a wide setting you have to do more loops because all lines have the same (max) cases ( the INCIDENT cases in Teresa example) . 2. In a true many-to-many situation you will have to check all combination witch is loop I by loop J , (loop ENROLL group by loop INCIDENT cases) . in a long setting you will only have to do one pass of the data on a sorted file . 3. In general the command in a wide setting is more complex recode a1 to a7(1 2=3) in wide setting . recode a1(1 2=3) in a long setting . 4. most database today keep data in the long (normalize) setting . On the other hand I am aware that many users are not comfortable with long setting commands like leave , first , last , temporary variables and even aggregate . Hillel Vardi Ben Gurion U Israel |
Free forum by Nabble | Edit this page |