Re: Filter file possible?

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Re: Filter file possible?

Ken Chui
On Fri, 4 Aug 2006 15:41:33 -0700, GREENE Teresa <[hidden email]>
wrote:

>Greetings all--
>
>Is it possible to filter a dataset based on a file that contains a list
>of identifiers?  If I need to isolate 1000 subjects that have no common
>variable from a dataset of over 800,000, what is the most expedient way
>to do that?

First, both file should have a common variable like "ID" and their variable
names have to be the same, if they are string, their variable lengths have
to be the same, too.  No duplicated ID should exist in each file.

1. Open the bigger file, use Data > Sort cases to sort the ID in ascending
order, save.
2. Open the smaller file that just contains one variable: ID of the 1000
people you want, sort that too and save.
3. With the smaller file opened, go to Data> Merge files> Add bariables
4. Select the bigger file, click Open
5. Highlight the ID variable in the "Excluded Variables" panel
6. Check the box "Match cases on key variables in sorted files"
7. Check the box "External file is keyed table"
8. Click the little [>] button next to the Key Variables panel and bring the
ID variable over there
9. Click OK to submit the command, you can ignore the warning about the
sorting of files, cause we did it in steps 1 and 2

And that should do the trick.
Reply | Threaded
Open this post in threaded view
|

Re: Filter file possible?

GREENE Teresa
 Hi Kenneth--

Thank you for your response.  Unfortunately, this does not work for this
problem.  I have two files that have multiple records for each ID.  In
one file there is an incident date. In the other file there is a date
range indicating duration of enrollment in a program.  I have to isolate
the one record for each ID from one file that has the date range that is
inclusive of the incident date for that ID in the other file.  I have
tried different ways of matching the files, but because of the multiple
records, the match never works correctly for all records.  Every time, I
get records that are either mismatched, or didn't match when they should
have.

I have worked with systems where you can take a list of identifiers,
create a text file and then indicate that file as the filter for
selecting cases from a database.  I am trying to figure out if there is
a way to do this in SPSS.

Thanks for trying!!!

Teresa :)

------------------------------

Date:    Fri, 18 Aug 2006 16:54:35 -0400
From:    Ken Chui <[hidden email]>
Subject: Re: Filter file possible?

On Fri, 4 Aug 2006 15:41:33 -0700, GREENE Teresa
<[hidden email]>
wrote:

>Greetings all--
>
>Is it possible to filter a dataset based on a file that contains a list
>of identifiers?  If I need to isolate 1000 subjects that have no common
>variable from a dataset of over 800,000, what is the most expedient way
>to do that?

First, both file should have a common variable like "ID" and their
variable
names have to be the same, if they are string, their variable lengths
have
to be the same, too.  No duplicated ID should exist in each file.

1. Open the bigger file, use Data > Sort cases to sort the ID in
ascending
order, save.
2. Open the smaller file that just contains one variable: ID of the 1000
people you want, sort that too and save.
3. With the smaller file opened, go to Data> Merge files> Add bariables
4. Select the bigger file, click Open
5. Highlight the ID variable in the "Excluded Variables" panel
6. Check the box "Match cases on key variables in sorted files"
7. Check the box "External file is keyed table"
8. Click the little [>] button next to the Key Variables panel and bring
the
ID variable over there
9. Click OK to submit the command, you can ignore the warning about the
sorting of files, cause we did it in steps 1 and 2

And that should do the trick.
Reply | Threaded
Open this post in threaded view
|

Re: Filter file possible?

Richard Ristow
At 12:58 PM 8/21/2006, GREENE Teresa wrote:

>I have two files that have multiple records for each ID.  In one file
>there is an incident date. In the other file there is a date range
>indicating duration of enrollment in a program.  I have to isolate the
>one record for each ID from one file that has the date range that is
>inclusive of the incident date for that ID in the other file.

OK. No tested code this time, but if this description isn't enough, and
you have some test data, I'll try it.

Don't merge the files; interleave them.(*)

Assume
. Your files are accessible by file handles INCIDENT and ENROLL, the
former with one record per incident, the latter with one record per
enrollment period.
. ENROLL contains, among other variables, STDT_ID identifying the
individual, and ENTER_DT and EXIT_DT, SPSS date variables giving the
beginning and end of the enrollment period. No two enrollment periods
overlap - that is, every EXIT_DT is earlier than the following
ENTER_DT.
. INCIDENT contains STDT_ID, and SPSS date variable INCID_DT, giving
the incident date.
. There are no variable names in common between the two files. (That
can be relaxed, of course, if you're careful to keep them straight.)
. Finally, you can write and read a temporary file to handle SCRATCH.

Something like this:

*  I.    Prepare the 'enrollment' file by adding variable .
*        KEY_DATE, which will be used for the interleave. .
*        Save the result.                                 .
*        (KEY_DATE has the same value is ENTER_DT. It has .
*        to be a separate variable for the interleave,    .
*        though.)                                         .

GET FILE=ENROLL.
SORT CASES BY STDT_ID ENTER_DT /* If necessary */.

NUMERIC KEY_DATE (DATE11).
COMPUTE KEY_DATE = ENTER_DT.
SAVE OUTFILE=SCRATCH
       /KEEP=KEY_DATE ALL.


*  II.   Prepare the 'incident' file by adding variable   .
*        KEY_DATE, for the interleave, in the active file.
*        (Here, KEY_DATE has the same value as INCID_DT,  .
*        but, again, it needs to be a separate variable.) .

GET FILE=INCIDENT.
SORT CASES BY STDT_ID INCID_DT /* If necessary */.

NUMERIC KEY_DATE (DATE11).
COMPUTE KEY_DATE = ENTER_DT.


*  III.  Now, interleave, attaching to each incident the  .
*        data from the enrollment period WITH THE LATEST  .
*        PRECEDING START DATE.                            .

ADD FILES
   /FILE=SCRATCH      /* Enrollments */
   /FILE=* /IN=INCID  /* Incidents   */
   /BY KEY_DATE       /* Interleave  */.

LEAVE ENTER_DT EXIT_DT.
LEAVE <any other variables you want from ENROLLMENT>.

*  IV.   The following checks that the incident happened  .
*        before the preceding EXIT_DT, and that the       .
*        enrollment record belonged to this student.      .
*        If either is not the case, the student was not   .
*        enrolled at all when the incident occurred.      .

NUMERIC    ENROLLED (F2).
VAR LABEL  ENROLLED
            'Enrolled at time of incident?'.
VAL LABEL  ENROLLED  1 'Enrolled'
                      2 'Not enrolled'.
DO IF   MISSING(LAG(STDT_ID)).
.  COMPUTE ENROLLED = 2.
ELSE IF STDT_ID NE LAG(STDT_ID).
.  COMPUTE ENROLLED = 2.
ELSE IF INCID_DT GT EXIT_DT.
.  COMPUTE ENROLLED = 2.
ELSE.
.  COMPUTE ENROLLED = 1.
END IF.

DO IF ENROLLED = 2.
.  COMPUTE ENTER_DT = $SYSMIS.
.  COMPUTE EXIT_DT  = $SYSMIS.
*..Etc., for all 'enrollment' variables   .
*  you're keeping in the incident records .
END IF.

SELECT IF INCID EQ 1.

*  You may want to drop variables INCID and KEY_DATE .
*  from the final file.                              .

....................................
(*) I wrote something like this a while ago, posted as "Re: Assigning
prices by dates", Mon, 6 Mar 2006 01:49:29 -0500. That program looks
more complicated than it is. A lot of the code is creating a list of
months, analogous to your INCIDENT file.

; for my reference, an
Reply | Threaded
Open this post in threaded view
|

Re: Filter file possible? - interleaving dates

hillel vardi
In reply to this post by GREENE Teresa
Shalom

Here is a simple example of  interleaving dates as Richard Ristow also
suggest .
The program generate random dates so each time you will run it there
will be deferent matches  .

title     interleave  dates .
input program .
loop      subjectnum =1 to 20 .
compute    secday=60*60*24 .
compute    startenroll= date.dmy(1,1,2005)+trunc(unifrom(360)*secday).
compute    endenroll= startenroll+trunc(unifrom(60)*secday)+1.
end case .
end loop .
end file .
end input program .
formats   startenroll endenroll(edate10) .
save      outfile='tmp.sav' / drop=secday .
execute .

input program .
loop      i=1 to 40 .
compute    secday=60*60*24 .
compute    subjectnum=trunc(i/5)+1  .
compute    incidentdate= date.dmy(1,6,2004)+trunc(unifrom(560)*secday).
end case .
end loop .
end file .
end input program .
formats   incidentdate(edate10) .
save      outfile='tmp1.sav' / drop=secday i .
execute .

add files   file='tmp.sav' / file='tmp1.sav' /in=inincid.
if          inincid eq 1 tmpdate=incidentdate .
if          inincid eq 0 tmpdate=startenroll .
formats     tmpdate(edate10) .
sort cases  by subjectnum tmpdate.
numeric     instartdate inenddate(edate10).
leave        instartdate inenddate.
if          inincid eq 0 instartdate=startenroll .
if          inincid eq 0 inenddate=endenroll .
if          (inincid eq 1 ) and ( incidentdate ge instartdate)and (
incidentdate le inenddate) match=1 .
execute .


Hillel Vardi
Ben Gurion U
Israel


GREENE Teresa wrote:

>  Hi Kenneth--
>
> Thank you for your response.  Unfortunately, this does not work for this
> problem.  I have two files that have multiple records for each ID.  In
> one file there is an incident date. In the other file there is a date
> range indicating duration of enrollment in a program.  I have to isolate
> the one record for each ID from one file that has the date range that is
> inclusive of the incident date for that ID in the other file.  I have
> tried different ways of matching the files, but because of the multiple
> records, the match never works correctly for all records.  Every time, I
> get records that are either mismatched, or didn't match when they should
> have.
>
> I have worked with systems where you can take a list of identifiers,
> create a text file and then indicate that file as the filter for
> selecting cases from a database.  I am trying to figure out if there is
> a way to do this in SPSS.
>
> Thanks for trying!!!
>
> Teresa :)
>
> ------------------------------
>
> Date:    Fri, 18 Aug 2006 16:54:35 -0400
> From:    Ken Chui <[hidden email]>
> Subject: Re: Filter file possible?
>
> On Fri, 4 Aug 2006 15:41:33 -0700, GREENE Teresa
> <[hidden email]>
> wrote:
>
>
>> Greetings all--
>>
>> Is it possible to filter a dataset based on a file that contains a list
>> of identifiers?  If I need to isolate 1000 subjects that have no common
>> variable from a dataset of over 800,000, what is the most expedient way
>> to do that?
>>
>
> First, both file should have a common variable like "ID" and their
> variable
> names have to be the same, if they are string, their variable lengths
> have
> to be the same, too.  No duplicated ID should exist in each file.
>
> 1. Open the bigger file, use Data > Sort cases to sort the ID in
> ascending
> order, save.
> 2. Open the smaller file that just contains one variable: ID of the 1000
> people you want, sort that too and save.
> 3. With the smaller file opened, go to Data> Merge files> Add bariables
> 4. Select the bigger file, click Open
> 5. Highlight the ID variable in the "Excluded Variables" panel
> 6. Check the box "Match cases on key variables in sorted files"
> 7. Check the box "External file is keyed table"
> 8. Click the little [>] button next to the Key Variables panel and bring
> the
> ID variable over there
> 9. Click OK to submit the command, you can ignore the warning about the
> sorting of files, cause we did it in steps 1 and 2
>
> And that should do the trick.
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Filter file possible?

Marks, Jim
In reply to this post by Ken Chui
Teresa:

Another solution would be to use CASESTOVARS to restructure your file of
incidents. This will give 1 row per ID, which you can join to the
enrollment file:


** sample ENROLLMENT data-- sorted on id.
DATA LIST FREE /id (f8.0) start_enroll end_enroll (2ADATE10).
BEGIN DATA
1 12/12/2002 2/15/2003 1 6/1/2003 6/25/2003 1 12/12/1997 12/12/1998
2 12/20/2004 1/5/2005  2 4/1/1999 4/12/1999  2 4/1/2002 6/1/2002 END
DATA.

DATASET name enroll window = front.

** sample INCINDENT data-- sorted on id.
DATA LIST FREE /id (f8.0) incident_date (adate10).
BEGIN DATA
1 1/1/2002 1 6/12/2003
2 1/1/2005 2 4/4/1999 2 5/25/2002
END DATA.

DATASET NAME incident WINDOW = front.

** "flatten" file to 1 row per id.
SORT CASES BY id incident_date .
CASESTOVARS
 /ID = id
.
DATASET ACTIVATE enroll.
MATCH FILES FILE = enroll /TABLE = incident
  /BY id.
EXECUTE.

** flag cases were incident_date.n is between enrollment start and end.
DO REPEAT x = incident_date.1 TO incident_date.3
  /y = match_1 TO match_3.

COMPUTE y = start_enroll LE x AND end_enroll GE x.
END REPEAT.
EXECUTE.

You will need to deal with incidents that have multiple matches, and
incidents that have no matches.

Cheers
--jim


Jim Marks
Senior Market Analyst
x 1616

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Richard Ristow
Sent: Monday, August 21, 2006 10:04 PM
To: [hidden email]
Subject: Re: Filter file possible?

At 12:58 PM 8/21/2006, GREENE Teresa wrote:

>I have two files that have multiple records for each ID.  In one file
>there is an incident date. In the other file there is a date range
>indicating duration of enrollment in a program.  I have to isolate the
>one record for each ID from one file that has the date range that is
>inclusive of the incident date for that ID in the other file.

OK. No tested code this time, but if this description isn't enough, and
you have some test data, I'll try it.

Don't merge the files; interleave them.(*)

Assume
. Your files are accessible by file handles INCIDENT and ENROLL, the
former with one record per incident, the latter with one record per
enrollment period.
. ENROLL contains, among other variables, STDT_ID identifying the
individual, and ENTER_DT and EXIT_DT, SPSS date variables giving the
beginning and end of the enrollment period. No two enrollment periods
overlap - that is, every EXIT_DT is earlier than the following ENTER_DT.
. INCIDENT contains STDT_ID, and SPSS date variable INCID_DT, giving the
incident date.
. There are no variable names in common between the two files. (That can
be relaxed, of course, if you're careful to keep them straight.) .
Finally, you can write and read a temporary file to handle SCRATCH.

Something like this:

*  I.    Prepare the 'enrollment' file by adding variable .
*        KEY_DATE, which will be used for the interleave. .
*        Save the result.                                 .
*        (KEY_DATE has the same value is ENTER_DT. It has .
*        to be a separate variable for the interleave,    .
*        though.)                                         .

GET FILE=ENROLL.
SORT CASES BY STDT_ID ENTER_DT /* If necessary */.

NUMERIC KEY_DATE (DATE11).
COMPUTE KEY_DATE = ENTER_DT.
SAVE OUTFILE=SCRATCH
       /KEEP=KEY_DATE ALL.


*  II.   Prepare the 'incident' file by adding variable   .
*        KEY_DATE, for the interleave, in the active file.
*        (Here, KEY_DATE has the same value as INCID_DT,  .
*        but, again, it needs to be a separate variable.) .

GET FILE=INCIDENT.
SORT CASES BY STDT_ID INCID_DT /* If necessary */.

NUMERIC KEY_DATE (DATE11).
COMPUTE KEY_DATE = ENTER_DT.


*  III.  Now, interleave, attaching to each incident the  .
*        data from the enrollment period WITH THE LATEST  .
*        PRECEDING START DATE.                            .

ADD FILES
   /FILE=SCRATCH      /* Enrollments */
   /FILE=* /IN=INCID  /* Incidents   */
   /BY KEY_DATE       /* Interleave  */.

LEAVE ENTER_DT EXIT_DT.
LEAVE <any other variables you want from ENROLLMENT>.

*  IV.   The following checks that the incident happened  .
*        before the preceding EXIT_DT, and that the       .
*        enrollment record belonged to this student.      .
*        If either is not the case, the student was not   .
*        enrolled at all when the incident occurred.      .

NUMERIC    ENROLLED (F2).
VAR LABEL  ENROLLED
            'Enrolled at time of incident?'.
VAL LABEL  ENROLLED  1 'Enrolled'
                      2 'Not enrolled'.
DO IF   MISSING(LAG(STDT_ID)).
.  COMPUTE ENROLLED = 2.
ELSE IF STDT_ID NE LAG(STDT_ID).
.  COMPUTE ENROLLED = 2.
ELSE IF INCID_DT GT EXIT_DT.
.  COMPUTE ENROLLED = 2.
ELSE.
.  COMPUTE ENROLLED = 1.
END IF.

DO IF ENROLLED = 2.
.  COMPUTE ENTER_DT = $SYSMIS.
.  COMPUTE EXIT_DT  = $SYSMIS.
*..Etc., for all 'enrollment' variables   .
*  you're keeping in the incident records .
END IF.

SELECT IF INCID EQ 1.

*  You may want to drop variables INCID and KEY_DATE .
*  from the final file.                              .

....................................
(*) I wrote something like this a while ago, posted as "Re: Assigning
prices by dates", Mon, 6 Mar 2006 01:49:29 -0500. That program looks
more complicated than it is. A lot of the code is creating a list of
months, analogous to your INCIDENT file.

; for my reference, an
Reply | Threaded
Open this post in threaded view
|

Re: Filter file possible?

Richard Ristow
At 08:42 AM 8/22/2006, Marks, Jim wrote:

>Another solution would be to use CASESTOVARS to restructure your file
>of incidents. This will give 1 row per ID, which you can join to the
>enrollment file:

Exactly. That's the "long to wide to long" method. It's much the best
way to do a true many-to-many match in SPSS. ("Much the best" unless
I've missed something very drastically.) It's an important tool to
know, and to use.

If you do it, use CASESTOVARS on whichever file has the fewer cases per
matching key. (Your matching key is the student: 'id' or 'STDT_ID', in
posted examples.) If that file has many variables you want to include
in the matched output, you may want to CASESTOVARS only the ones you
need for matching; VARSTOCASES back; and MATCH FILES to attach the full
set of variables from the file you made 'wide' with CASESTOVARS.

For this problem, I still recommend 'interleave' logic, which I think
is simpler. It works in this case because the possible matches are
ordered by an index variable - the date.

Specific comments:
- The example doesn't include the final VARSTOCASES. The code

>** flag cases were incident_date.n is between enrollment start and
>end.
>DO REPEAT x = incident_date.1 TO incident_date.3
>   /y = match_1 TO match_3.
>COMPUTE y = start_enroll LE x AND end_enroll GE x.
>END REPEAT.

correctly marks *enrollment periods* in which any incident occurred,
and which of the incidents occurred in that period. If you want one
record per incident, with the enrollment period in which it occurred,
you need a VARSTOCASES, something like this (untested):

VARSTOCASES
    MAKE   incident_date
      FROM incident_date.1 TO incident_date.3
   /MAKE   match
      FROM match_1 TO match3.
SELECT IF match EQ 1.
* Remark: This loses incidents that do not fall     .
* into any enrollment period. Keeping them takes a  .
* little more complicated selection, probably using .
* AGGREGATE.                                        .

- As an alternative to VARSTOCASES, make "incident_date.1 TO
incident_date.3" a VECTOR, replace the DO REPEAT by a LOOP, and use
XSAVE to write a record for each incident, with its matching enrollment
period.

- The 'dataset' statements like

>DATASET name enroll window = front.

were introduced in SPSS 14. In earlier releases, use SAVE OUTFILE
instead of DATASET NAME, and GET FILE instead of DATASET ACTIVATE.

- Finally (I suppose you were all waiting for this one), the EXECUTE
statements aren't needed.

Good luck to all!

Onward, and maybe upward,
Richard
Reply | Threaded
Open this post in threaded view
|

Re: Filter file possible?

hillel vardi
Shalom
> At 08:42 AM 8/22/2006, Richard Ristow wrote:
>
> Exactly. That's the "long to wide to long" method. It's much the best
> way to do a true many-to-many match in SPSS. ("Much the best" unless
> I've missed something very drastically.) It's an important tool to
> know, and to use.
I still believe that a long method is better in these situations and
here are some reasoning for that .

 1. In a wide setting you have to do more loops because all lines have
the same (max) cases ( the INCIDENT cases in Teresa example) .
 2. In a true many-to-many situation you will have to check all
combination witch is  loop I by loop J  , (loop ENROLL group by loop
INCIDENT cases) .
      in a long setting you will only have to do one pass of the data on
a sorted file .
 3. In general  the  command in  a  wide setting  is more complex
     recode a1 to a7(1 2=3)            in wide setting .
     recode a1(1 2=3)                     in a long setting .
 4. most database today keep data in the long (normalize) setting .

On the other hand I am aware that many users are not comfortable  with
long setting commands like leave , first , last , temporary variables
and even aggregate .

Hillel Vardi
Ben Gurion U
Israel