working with multiple files without merge

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

working with multiple files without merge

Alina Sheyman-3
Here's what I'm trying to do. I have two SPSS files with multiple records of
each class name with start dates. I want to search the second file for the
class name and start date and if it matches to a class name-start date +/- 2
days in the first file to create a "match" variable. Does anyone know how I
can do this ?

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: working with multiple files without merge

Maguin, Eugene
Alina,

The short answer that this can not be done without a merge. You will have to
do a match files. Using syntax only, spss can not move back and forth
between two files in the way that you would need to be able to do. You might
be able to do this with python or script. I don't know. Others can comment
and offer examples, please.

This is not very easy to do even with matching. So let me make sure I
understand some things. First, your language is not so clear about which
file is to be matched to which. So, let file 1 be

Classname startdate
D9edmf     9/10/08
Wru39nv    9/18/08
Io3489     9/18/08
Admm394    9/15/08
Ppo098     8/31/08

Let file 2 by
D9edmf     9/12/08
Wru39nv    9/18/08
Io3489     9/13/08
Admm394    9/11/08

As I read your message you want to mark a new variable in file 2, call it
'match', as 'yes' if a record from file 2 matches a record from file 1
exactly for classname and the file 1 startdate is +/- 2 days of the file 2
startdate. Is this exactly true?

So in the above example file 2, when finished would look like:

Classname  startdate  match
D9edmf     9/12/08     yes
Wru39nv    9/18/08     yes
Io3489     9/13/08
Admm394    9/11/08

Is this exactly correct?

Lastly (for the moment), does either file 1 or file 2 have duplicate
combinations of classname and startdate? For example:

Admm394    9/15/08
Ppo098     8/31/08
Admm394    9/15/08

Has this been checked?

Gene Maguin

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: working with multiple files without merge

Richard Ristow
In reply to this post by Alina Sheyman-3
At 12:38 PM 10/1/2008, Alina Sheyman wrote:

>I have two SPSS files with multiple records of each class name with
>start dates. I want to search the second file for the class name and
>start date and if it matches to a class name-start date +/- 2 days
>in the first file to create a "match" variable.

I wonder if this is the same problem you're trying to solve in thread
'"do repeat" code'? That looks similar, except with 'wide'
organization -- all dates in one record -- as if you used CASESTOVARS
on this problem. Pardon my guessing.

Anyway, I think this is a solution to your problem, as posted in this thread.

The trick is to compare dates from File2 with the 4-day 'windows'
from File1 by >interleaving<, not merging, the files; and setting the
dates in File1 two days back, so they'll come before any File2 date
in their 'window'.

The data here avoids multiple matches by having all dates in File1
more than 4 days apart. Unless there is one file in which the 4-day
'windows' don't overlap, there can be multiple-to-multiple matches,
and the logic musts be more complicated.

*  .................   Show the test data      ..................... .
DATASET ACTIVATE File1    WINDOW=FRONT.
LIST.
List
|-----------------------------|---------------------------|
|Output Created               |04-OCT-2008 01:26:27       |
|-----------------------------|---------------------------|
[File1]

Class     StartDate Sequence1

Class_A  09/15/2008 a
Class_A  09/21/2008 b
Class_A  10/01/2008 c
Class_A  10/11/2008 d
Class_A  10/16/2008 e
Class_B  09/15/2008 A
Class_B  09/21/2008 B
Class_B  10/01/2008 C
Class_B  10/11/2008 D
Class_B  10/16/2008 E
Class_C  09/15/2008 a
Class_C  09/21/2008 b
Class_C  10/01/2008 c
Class_C  10/11/2008 d
Class_C  10/16/2008 e

Number of cases read:  15    Number of cases listed:  15


DATASET ACTIVATE File2    WINDOW=FRONT.
LIST.
List
|-----------------------------|---------------------------|
|Output Created               |04-OCT-2008 01:26:27       |
|-----------------------------|---------------------------|
[File2]

Class     StartDate Sequence2

Class_A  09/01/2008     01
Class_A  09/10/2008     02
Class_A  09/19/2008     03
Class_A  09/28/2008     04
Class_A  10/07/2008     05
Class_B  09/15/2008     01
Class_B  09/19/2008     02
Class_B  09/24/2008     03
Class_B  09/29/2008     04
Class_B  10/04/2008     05
Class_B  10/09/2008     06
Class_B  10/12/2008     07
Class_B  10/17/2008     08
Class_C  10/01/2008     01
Class_C  10/11/2008     02
Class_C  10/19/2008     03
Class_C  10/24/2008     04
Class_C  10/31/2008     05
Class_C  11/09/2008     06

Number of cases read:  19    Number of cases listed:  19


*  ................................................................. .
*  .................   Computations            ..................... .

*  I.    Copy of File1 records keeping only the key variables, ..... .
*        with real start date as 'F1Start', and 'StartDate'    ..... .
*        set back two days.                                    ..... .
*        (The latter is the 'cute' trick. When the files are   ..... .
*        interleaved, it makes File 1 come before File 2       ..... .
*        records that are as much as two days earlier.)        ..... .
*

*  ----- Load File1:                                          ..... .

ADD FILES
    /FILE=File1
    /KEEP=Class StartDate.

DATASET NAME     IntrLeav WINDOW=FRONT.
*  ----- 'Remember' the real starting date, and set variable   ..... .
*        "StartDate" back by two days                          ..... .
COMPUTE   F1StDate = StartDate.
FORMATS   F1StDate (ADATE10).
VAR LABEL F1StDate 'Matching start date, from File 1'.

COMPUTE   StartDate = StartDate - TIME.DAYS(2).

.  /**/ LIST  /*-*/.
[output suppressed]


*  II.   >Interleave< (not merge) with File 2.                 ..... .
*        The result has all data from File 2,                  ..... .
*        and the key variables from File 1.                    ..... .

ADD FILES
    /FILE=*     /IN=File1Rcd
    /FILE=File2
    /BY   Class StartDate
    /KEEP=Class StartDate ALL.

.  /**/ LIST  /*-*/.
[output suppressed]


*  III.  If a date in File 2 matches one from File 1 within    ..... .
*        two days, keep it as a matching date in the File 2    ..... .
*        record                                                ..... .

*  III.A From File 1 records, 'remember' the starting date     ..... .

DO IF   File1Rcd.
.  COMPUTE #LtstF1Dt = F1StDate.
.  FORMATS #LtstF1Dt  (ADATE10).


*  III.B For  File 2 records,                                  ..... .

ELSE  /* (Records not from File 1 are from File 2)             */.

*  ----- 'StartDate' becomes the File 2 start date             ..... .
.  COMPUTE F2StDate = StartDate.
.  VAR LAB F2StDate 'Starting date of this File 2 record'.
.  FORMATS F2StDAte (ADATE10).

*  ----- If the last File 1 date seen is not from the          ..... .
*        current class, 'forget' it.                           ..... .
.  IF    Class NE LAG(CLASS)
           #LtstF1Dt = $SYSMIS.

*  ----- If the last  File 1 date seen is within two days      ..... .
*        the date in  File 2, store it as a matching date.     ..... .
*        (Remember: "F1StDate" is initially missing in File 2  ..... .
*        records.)                                             ..... .

.  IF   NOT MISSING(#LtstF1Dt)
         AND ABS(#LtstF1Dt - F2StDate) LE TIME.DAYS(2)
           F1StDate = #LtstF1Dt.
END IF.


*  IV.   Add File 1 data to matching File 2 records            ..... .

*  ----- For non-matching File 2 records -- those without a    ..... .
*        start date from File 1 -- fill in File 1 start date   ..... .
*        with an artificial value that can't match anything,   ..... .
*        but is in the correct sort order.                     ..... .

DO IF   MISSING(F1StDate).
.  DO IF   Class EQ LAG(Class).
.     COMPUTE F1StDate = LAG(F1StDate)        + TIME.HMS(0,0,1).
.  ELSE.
.     COMPUTE F1StDate = DATE.MDY(01,01,1900) + TIME.HMS(0,0,1).
.  END IF.
END IF.

*  ----- Merge in the data from matching File 1                ..... .

MATCH FILES
    /TABLE=File1   /RENAME=(StartDate=F1StDate)
    /FILE =*
    /BY    Class    F1StDate
    /DROP =File1Rcd StartDate
    /KEEP =Class
           F1StDate F2StDate
           ALL.

*  ----- Remove the artifical 'File 1 start dates'             ..... .
IF XDATE.TIME(F1StDate) GT 0
           F1StDate = $SYSMIS.

*  ----- Drop the File 1 records                               ..... .
SELECT IF NOT MISSING(F2StDate) /* File 1 records have no File 2 date   */.

LIST.
List
|-----------------------------|---------------------------|
|Output Created               |04-OCT-2008 01:26:34       |
|-----------------------------|---------------------------|
[IntrLeav]

Class      F1StDate   F2StDate Sequence1 Sequence2

Class_A           . 09/01/2008               01
Class_A           . 09/10/2008               02
Class_A  09/21/2008 09/19/2008 b             03
Class_A           . 09/28/2008               04
Class_A           . 10/07/2008               05
Class_B  09/15/2008 09/15/2008 A             01
Class_B  09/21/2008 09/19/2008 B             02
Class_B           . 09/24/2008               03
Class_B  10/01/2008 09/29/2008 C             04
Class_B           . 10/04/2008               05
Class_B  10/11/2008 10/09/2008 D             06
Class_B  10/11/2008 10/12/2008 D             07
Class_B  10/16/2008 10/17/2008 E             08
Class_C  10/01/2008 10/01/2008 c             01
Class_C  10/11/2008 10/11/2008 d             02
Class_C           . 10/19/2008               03
Class_C           . 10/24/2008               04
Class_C           . 10/31/2008               05
Class_C           . 11/09/2008               06

Number of cases read:  19    Number of cases listed:  19

=============================
APPENDIX: Test data, and code
=============================
*  C:\Documents and Settings\Richard\My Documents                       .
*    \Technical\spssx-l\Z-2008d                                         .
*    \2008-10-10 Sheyman - working with multiple files without merge.SPS.


*  In response to posting                                            .
*  Date:    Wed, 1 Oct 2008 12:38:55 -0400                           .
*  From:    Alina Sheyman <[hidden email]>                       .
*  Subject: working with multiple files without merge                .
*  To:      [hidden email]                                 .

*  I have two SPSS files with multiple records of each class name    .
*  with start dates. I want to search the second file for the class  .
*  name and start date and if it matches to a class name-start date  .
*  +/- 2 days in the first file to create a "match" variable.        .


*  ................................................................. .
*  .................   Test data               ..................... .
SET RNG = MT       /* 'Mersenne twister' random number generator  */ .
SET MTINDEX = 0039 /*  Providence, RI telephone book              */ .

*  .....  File 1:  In this file,                               ..... .
*         All start dates are 15 Sept to 15 Oct                ..... .
*         They're spaced 5, 6, or 10 days apart                ..... .
*         They're the same for all three 'Classes'             ..... .

NEW FILE.
INPUT PROGRAM.
.  STRING   Class      (A8).
.  NUMERIC  StartDate  (ADATE10).
.  STRING   Sequence1  (A1).
.  LEAVE    Class,
             StartDate
             Sequence1.

.  COMPUTE  StartDate = DATE.MDY(09,15,2008).
.  LOOP #Day = 01 TO 30 /* Much more than needed */.
.     DO IF #Day GT 01.
.        COMPUTE #Select = TRUNC(RV.UNIFORM(1,4)).
.        RECODE  #Select
                  (1 =  5)
                  (2 =  6)
                  (3 = 10) INTO #Interval.
.        COMPUTE StartDate = StartDate + TIME.DAYS(#Interval).
.     END IF.
.     COMPUTE   Sequence1 = SUBSTR('abcdefghijklmnopqrstuvwxyz',#Day,1).
.     COMPUTE   Class = 'Class_A'.
.     END CASE.
.     COMPUTE   Class = 'Class_C'.
.     END CASE.
.     COMPUTE   Class = 'Class_B'.
.     COMPUTE   Sequence1 = UPCASE(Sequence1).
.     END CASE.
.  END LOOP IF StartDate GT DATE.MDY(10,15,2008).
END FILE.
END INPUT PROGRAM.
DATASET NAME File1.
SORT CASES BY Class StartDate.
.  /*-- LIST  /*-*/.

*  .....  File 2:  In this file,                               ..... .
*         Class A:  Dates in September,    spaced 08-10 days   ..... .
*         Class B:  Dates 15 Sept-15 Oct., spaced 01-05 days   ..... .
*         Class C:  Dates in October,      spaced 04-10 days   ..... .

NEW FILE.
INPUT PROGRAM.
.  STRING   Class      (A8).
.  NUMERIC  StartDate  (ADATE10)
             Sequence2  (N2).
.  LEAVE    Class,
             StartDate
             Sequence2.

*         Class A:  Dates in September,    spaced 08-10 days   ..... .
.  COMPUTE  Class     = 'Class_A'.
.  COMPUTE  StartDate = DATE.MDY(09,01,2008).
.  LOOP #Day = 01 TO 30 /* Much more than needed */.
.     DO IF #Day GT 01.
.        COMPUTE #Interval = TRUNC(RV.UNIFORM(08,11)).
.        COMPUTE StartDate = StartDate + TIME.DAYS(#Interval).
.     END IF.
.     COMPUTE Sequence2 = #Day.
.     END CASE.
.  END LOOP IF StartDate GT DATE.MDY(09,30,2008).

*         Class B:  Dates 15 Sept-15 Oct., spaced 01-05 days   ..... .
.  COMPUTE  Class     = 'Class_B'.
.  COMPUTE  StartDate = DATE.MDY(09,15,2008).
.  LOOP #Day = 01 TO 30 /* Many more than needed */.
.     DO IF #Day GT 01.
.        COMPUTE #Interval = TRUNC(RV.UNIFORM(01,06)).
.        COMPUTE StartDate = StartDate + TIME.DAYS(#Interval).
.     END IF.
.     COMPUTE Sequence2 = #Day.
.     END CASE.
.  END LOOP IF StartDate GT DATE.MDY(10,15,2008).

*         Class C:  Dates in October,      spaced 04-10 days   ..... .
.  COMPUTE  Class     = 'Class_C'.
.  COMPUTE  StartDate = DATE.MDY(10,01,2008).
.  LOOP #Day = 01 TO 30 /* Many more than needed */.
.     DO IF #Day GT 01.
.        COMPUTE #Interval = TRUNC(RV.UNIFORM(03,11)).
.        COMPUTE StartDate = StartDate + TIME.DAYS(#Interval).
.     END IF.
.     COMPUTE Sequence2 = #Day.
.     END CASE.
.  END LOOP IF StartDate GT DATE.MDY(10,31,2008).

END FILE.
END INPUT PROGRAM.
DATASET NAME File2.
.  /*-- LIST  /*-*/.

*  .................   Post after this point   ..................... .
*  ................................................................. .

*  .................   Show the test data      ..................... .
DATASET ACTIVATE File1    WINDOW=FRONT.
LIST.
DATASET ACTIVATE File2    WINDOW=FRONT.
LIST.

*  ................................................................. .
*  .................   Computations            ..................... .

*  I.    Copy of File1 records keeping only the key variables, ..... .
*        with real start date as 'F1Start', and 'StartDate'    ..... .
*        set back two days.                                    ..... .
*        (The latter is the 'cute' trick. When the files are   ..... .
*        interleaved, it makes File 1 come before File 2       ..... .
*        records that are as much as two days earlier.)        ..... .
*

*  ----- Load File1:                                          ..... .

ADD FILES
    /FILE=File1
    /KEEP=Class StartDate.

DATASET NAME     IntrLeav WINDOW=FRONT.


*  ----- 'Remember' the real starting date, and set variable   ..... .
*        "StartDate" back by two days                          ..... .
COMPUTE   F1StDate = StartDate.
FORMATS   F1StDate (ADATE10).
VAR LABEL F1StDate 'Matching start date, from File 1'.

COMPUTE   StartDate = StartDate - TIME.DAYS(2).

.  /**/ LIST  /*-*/.

*  II.   >Interleave< (not merge) with File 2.                 ..... .
*        The result has all data from File 2,                  ..... .
*        and the key variables from File 1.                    ..... .

ADD FILES
    /FILE=*     /IN=File1Rcd
    /FILE=File2
    /BY   Class StartDate
    /KEEP=Class StartDate ALL.

.  /**/ LIST  /*-*/.


*  III.  If a date in File 2 matches one from File 1 within    ..... .
*        two days, keep it as a matching date in the File 2    ..... .
*        record                                                ..... .

*  III.A From File 1 records, 'remember' the starting date     ..... .

DO IF   File1Rcd.
.  COMPUTE #LtstF1Dt = F1StDate.
.  FORMATS #LtstF1Dt  (ADATE10).


*  III.B For  File 2 records,                                  ..... .

ELSE  /* (Records not from File 1 are from File 2)             */.

*  ----- 'StartDate' becomes the File 2 start date             ..... .
.  COMPUTE F2StDate = StartDate.
.  VAR LAB F2StDate 'Starting date of this File 2 record'.
.  FORMATS F2StDAte (ADATE10).

*  ----- If the last File 1 date seen is not from the          ..... .
*        current class, 'forget' it.                           ..... .
.  IF    Class NE LAG(CLASS)
           #LtstF1Dt = $SYSMIS.

*  ----- If the last  File 1 date seen is within two days      ..... .
*        the date in  File 2, store it as a matching date.     ..... .
*        (Remember: "F1StDate" is initially missing in File 2  ..... .
*        records.)                                             ..... .

.  IF   NOT MISSING(#LtstF1Dt)
         AND ABS(#LtstF1Dt - F2StDate) LE TIME.DAYS(2)
           F1StDate = #LtstF1Dt.
END IF.


*  IV.   Add File 1 data to matching File 2 records            ..... .

*  ----- For non-matching File 2 records -- those without a    ..... .
*        start date from File 1 -- fill in File 1 start date   ..... .
*        with an artificial value that can't match anything,   ..... .
*        but is in the correct sort order.                     ..... .

DO IF   MISSING(F1StDate).
.  DO IF   Class EQ LAG(Class).
.     COMPUTE F1StDate = LAG(F1StDate)        + TIME.HMS(0,0,1).
.  ELSE.
.     COMPUTE F1StDate = DATE.MDY(01,01,1900) + TIME.HMS(0,0,1).
.  END IF.
END IF.

*  ----- Merge in the data from matching File 1                ..... .

MATCH FILES
    /TABLE=File1   /RENAME=(StartDate=F1StDate)
    /FILE =*
    /BY    Class    F1StDate
    /DROP =File1Rcd StartDate
    /KEEP =Class
           F1StDate F2StDate
           ALL.

*  ----- Remove the artifical 'File 1 start dates'             ..... .
IF XDATE.TIME(F1StDate) GT 0
           F1StDate = $SYSMIS.

*  ----- Drop the File 1 records                               ..... .
SELECT IF NOT MISSING(F2StDate) /* File 1 records have no File 2 date   */.

LIST.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD