Many-to-many merge in SPSS

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

Many-to-many merge in SPSS

Richard Ristow
From time to time, the question has come up, how to do a many-to-many merge in SPSS: given two files with key variables (which I'll call the 'group key') and, potentially, several records per group-key value in each file, join every record in each file with every record having the same group-key value in the other file. Up to now, no satisfactory solution had been found (see Appendix II at the end of this posting).

Here, I give code and a test run for what I think is a complete solution. It's a version of the long-wide-long method (see Appendix II), using
. David Marso's insight that only record keys from the file being converted to 'wide' need be kept in the 'wide' file(*), and
. my own, that if the record keys are integers consecutive from 1, a set of keys is characterized by its largest value.

(*)  David Marso is <[hidden email]>. His code, for the special case of taking the Cartesian product of two files, is in SPSSX-L posting
     Date:    Thu, 19 Nov 2009 11:27:59 -0500              
     From:    Lakshmikanth <[hidden email]>
     Subject: Cartesian product in SPSS
     To:      [hidden email]
     A reformatted version, with test run, is in                 
     Date:    Tue, 24 Nov 2009 00:46:30 -0500
     From:    Richard Ristow <[hidden email]>
     Subject: Re: Cartesian product in SPSS
     To:      [hidden email]


Test and demonstration:

In the following, 'Unrolled' cannot be a dataset; it must be a disk file or a file handle to a disk file.

Test data - File 1:
[File1]
Category      School

Art           RISD
Art           Metropolitan Museum
Liberal Arts  Oberlin
Liberal Arts  Antioch
Liberal Arts  Swarthmore
Technical     MIT
Technical     ETH

Number of cases read:  7    Number of cases listed:  7

File 2:
[File2]
Category      Subject

Art           Painting
Art           Drawing
Art           Sculpture
Liberal Arts  Literature
Liberal Arts  Philosophy
Technical     Calculus
Technical     Physics
Technical     Chemistry
Technical     Engineering

Number of cases read:  9    Number of cases listed:  9

Demonstration run, with code, intermediate files, and output:


*  ...   I.       Identify File1 records, within categories, by      .
*                 integers sequentially from 1.                      .
*        (Because the category variable is a string, this is done in .
*        transformation code; CREATE and RANK wouldn't work.)        .

DATASET ACTIVATE File1     WINDOW=ASIS.
DATASET COPY     File1_IDd WINDOW=MINIMIZED.
DATASET ACTIVATE File1_IDd WINDOW=FRONT.
NUMERIC    ID1 (F3).
DO IF    $CASENUM EQ 1.
.  COMPUTE ID1 = 1.
ELSE IF  Category NE LAG(Category).
.  COMPUTE ID1 = 1.
ELSE.
.  COMPUTE ID1 = LAG(ID1) + 1.
END IF.


*  ...   II.      Restructure File1 from 'long' to 'wide', except    .
*               . Keep only the category, and identifying number ID1 .
*               . Represent the set of values of ID1 by the largest  .
*                 value. Since the values of ID1 are consecutive     .
*                 integers starting with 1, that's unambiguous.      .

DATASET DECLARE   File1_Wide.
DATASET ACTIVATE  File1_IDd  WINDOW=ASIS.
AGGREGATE OUTFILE=File1_Wide
   /BREAK=Category
   /All_IDs 'Representing all values of ID1' = MAX(ID1).


.  /**/  DATASET ACTIVATE File1_Wide WINDOW=FRONT               /*-*/.
*  -**-  This, believe it or not, is the 'wide' form of File1   -***-.
.  /**/  LIST                                                   /*-*/.

List
|-----------------------------|---------------------------|
|Output Created               |26-NOV-2009 14:13:19       |
|-----------------------------|---------------------------|
[File1_Wide]
 
Category      All_IDs

Art               2
Liberal Arts      3
Technical         2

Number of cases read:  3    Number of cases listed:  3

 
*  ...   III.     Merge 'wide' File1 with File2                      .
*                 This is a one-to-many merge.                       .

MATCH FILES
   /TABLE=File1_Wide
   /FILE =File2
   /BY Category.

.  /**/  LIST                                                   /*-*/.

List
|-----------------------------|---------------------------|
|Output Created               |26-NOV-2009 14:13:20       |
|-----------------------------|---------------------------|
Category      All_IDs Subject

Art               2   Painting
Art               2   Drawing
Art               2   Sculpture
Liberal Arts      3   Literature
Liberal Arts      3   Philosophy
Technical         2   Calculus
Technical         2   Physics
Technical         2   Chemistry
Technical         2   Engineering

Number of cases read:  9    Number of cases listed:  9

 
*  ...   IV.A.    Return to a 'long' structure, with all File1       .
*                 records joined with all File2 records.             .

*       The result is the complete many-to-many merged file, except  .
*       for (oh, yes) the *data* from File1.                         .

*       (This is where XSAVE will work, and VARSTOCASES won't.)      .

NUMERIC ID1 (F3).
LOOP    ID1 = 1 TO ALL_IDs.
.  XSAVE OUTFILE=Unrolled
   /KEEP=Category ID1 ALL.
END LOOP.
EXECUTE /* required */.

*  ...   IV.B.    ... and go to 'file 1 first' order.                .

GET FILE = Unrolled.
DATASET NAME     ManyToMany WINDOW=FRONT.
SORT CASES BY    Category ID1.


*  ...   V.       Re-merge with File1, adding the data, completing   .
*                 the job.                                           .

MATCH FILES
   /TABLE = File1_IDd
   /FILE  = *
   /BY      Category ID1
   /KEEP  = Category ID1 ALL.

LISt.

List
|-----------------------------|---------------------------|
|Output Created               |26-NOV-2009 14:13:22       |
|-----------------------------|---------------------------|
[ManyToMany]
C:\Documents and Settings\Richard\My Documents
  \Temporary\SPSS\
  2009-11-24 Ristow - Many to many merge in SPSS - UNROLL.SAV
 
Category      ID1 School                 All_IDs Subject

Art             1 RISD                       2   Painting
Art             1 RISD                       2   Drawing
Art             1 RISD                       2   Sculpture
Art             2 Metropolitan Museum        2   Painting
Art             2 Metropolitan Museum        2   Drawing
Art             2 Metropolitan Museum        2   Sculpture
Liberal Arts    1 Oberlin                    3   Literature
Liberal Arts    1 Oberlin                    3   Philosophy
Liberal Arts    2 Antioch                    3   Literature
Liberal Arts    2 Antioch                    3   Philosophy
Liberal Arts    3 Swarthmore                 3   Literature
Liberal Arts    3 Swarthmore                 3   Philosophy
Technical       1 MIT                        2   Calculus
Technical       1 MIT                        2   Physics
Technical       1 MIT                        2   Chemistry
Technical       1 MIT                        2   Engineering
Technical       2 ETH                        2   Calculus
Technical       2 ETH                        2   Physics
Technical       2 ETH                        2   Chemistry
Technical       2 ETH                        2   Engineering

Number of cases read:  20    Number of cases listed:  20
===============================
APPENDIX I:  Test data and code
===============================
*  C:\Documents and Settings\Richard\My Documents                    .
*    \Technical\spssx-l\Z-2009d\                                     .
*    2009-11-26  Ristow - Many-to-many merge in SPSS.SPS             .
 
*  To support a posting with title "Many-to-many merge in SPSS"      .


*  The question of doing many-to-many merges in SPSS, comes up from  .
*  time to time. No fully satisfactory solution has been posted.     .
*                                                                    .
*  The most promising method suggested has been "long-wide-long".    .
*  That is, all records in, say, file 1 with the same group          .
*  identifier are brought into a single record, typically by         .
*  CASESTOVARS. That changes the problem to a one-to-many merge,     .
*  using MATCH FILES with the wide version as a TABLE file. The      .
*  result, which is 'long' in file 2 and 'wide' in file 1, is then   .
*  unrolled to restore file 1, typically using VARSTOCASES. That     .
*  gives, for each file 2 record, a complete restored set of file 1  .
*  records, completing the match.                                    .
*                                                                    .
*  There are two distinct difficulties. First, if the file has many  .
*  variables, its wide version may have a very large number. And,    .
*  the VARSTOCASES is data-dependent: the largest number of cases    .
*  in any file 2 group, and the list of variables to be restored,    .
*  must be in the VARSTOCASES code. (A solution, of course, is to    .
*  have Python look at the data dictionary and generate the          .
*  VARSTOCASES -- but that's probably not trivial, as Python has to  .
*  infer which sets of variables were generated from CASESTOVARS.)   .
*                                                                    .
*  A variant is to include only the identifiers of the file 1        .
*  records in the 'wide' file 1, and re-attach the data variables    .
*  later. That avoids creating a file with a great many variables,   .
*  and does not require a VARSTOCASES naming the full list of        .
*  variables to be restored. The Cartesian-product code by David     .
*  Marso(*) works this way. However, it still  requires that the     .
*  number of records in one of the files, or rather the largest      .
*  number in any group in that file, be in the code, as the bound of .
*  the vector of identifiers.                                        .
*                                                                    .
*  (*)  David Marso's code is in SPSSX-L posting                     .
*       Date:    Thu, 19 Nov 2009 11:27:59 -0500                     .
*       From:    Lakshmikanth <[hidden email]>      .
*       Subject: Cartesian product in SPSS                           .
*       To:      [hidden email]                            .
*       A version reformatted by R. Ristow is in                     .
*       Date:    Tue, 24 Nov 2009 00:46:30 -0500                     .
*       From:    Richard Ristow <[hidden email]>            .
*       Subject: Re: Cartesian product in SPSS                       .
*       To:      [hidden email]                            .
*                                                                    .
*  However, if the keys within groups are successive positive        .
*  integers, the 'wide' form of the list of IDs can be a single      .
*  number: the largest key value in the group. ('Unrolling' this     .
*  requires XSAVE logic, instead of VARSTOCASES.)                    .
*                                                                    .
*  That's what's done in the code below, assigning integer record    .
*  IDs and applying 'log-wide-long' to File 1.                       .



*  ................................................................. .
*  .................   Scratch file            ..................... .

*  (Because it's written by XSAVE, it cannot be a dataset.)          .

FILE HANDLE Unrolled
 /NAME='C:\Documents and Settings\Richard\My Documents'              +
         '\Temporary\SPSS\'                                          +
         '2009-11-24 Ristow - Many to many merge in SPSS'            +
         ' - '                                                       +
         'UNROLL.SAV'.


*  .................   Test data               ..................... .


DATA LIST FIXED
   /Category    04-16 (A)
    School      17-38 (A).
BEGIN DATA
   Art          RISD
   Art          Metropolitan Museum
   Liberal Arts Oberlin
   Liberal Arts Antioch
   Liberal Arts Swarthmore
   Technical    MIT
   Technical    ETH
END DATA.
DATASET NAME     File1.
.  /*-- LIST /*-*/.

DATA LIST FIXED
   /Category    04-16 (A)
    Subject     17-28 (A).
BEGIN DATA
   Art          Painting
   Art          Drawing
   Art          Sculpture
   Liberal Arts Literature
   Liberal Arts Philosophy
   Technical    Calculus
   Technical    Physics
   Technical    Chemistry
   Technical    Engineering
END DATA.
DATASET NAME     File2.
.  /*-- LIST /*-*/.

*  .................   Post after this point   ..................... .
*  ................................................................. .

DATASET ACTIVATE File1     WINDOW=ASIS.
LIST.

DATASET ACTIVATE File2     WINDOW=ASIS.
LIST.

*  ...   I.       Identify File1 records, within categories, by      .
*                 integers sequentially from 1.                      .
*        (Because the category variable is a string, this is done in .
*        transformation code; CREATE and RANK wouldn't work.)        .

DATASET ACTIVATE File1     WINDOW=ASIS.
DATASET COPY     File1_IDd WINDOW=MINIMIZED.
DATASET ACTIVATE File1_IDd WINDOW=FRONT.


NUMERIC    ID1 (F3).
DO IF    $CASENUM EQ 1.
.  COMPUTE ID1 = 1.
ELSE IF  Category NE LAG(Category).
.  COMPUTE ID1 = 1.
ELSE.
.  COMPUTE ID1 = LAG(ID1) + 1.
END IF.


*  ...   II.      Restructure File1 from 'long' to 'wide', except    .
*               . Keep only the category, and identifying number ID1 .
*               . Represent the set of values of ID1 by the largest  .
*                 value. Since the values of ID1 are consecutive     .
*                 integers starting with 1, that's unambiguous.      .

DATASET DECLARE   File1_Wide.
DATASET ACTIVATE  File1_IDd  WINDOW=ASIS.

AGGREGATE OUTFILE=File1_Wide
   /BREAK=Category
   /All_IDs 'Representing all values of ID1' = MAX(ID1).

  
.  /**/  DATASET ACTIVATE File1_Wide WINDOW=FRONT               /*-*/.
*  -**-  This, believe it or not, is the 'wide' form of File1   -***-.
.  /**/  LIST                                                   /*-*/.

 
*  ...   III.     Merge 'wide' File1 with File2                      .
*                 This is a one-to-many merge.                       .

MATCH FILES
   /TABLE=File1_Wide
   /FILE =File2
   /BY Category.
  
.  /**/  LIST                                                   /*-*/.


*  ...   IV.A.    Return to a 'long' structure, with all File1       .
*                 records joined with all File2 records.             .

*       The result is the complete many-to-many merged file, except  .
*       for (oh, yes) the *data* from File1.                         .

*       (This is where XSAVE will work, and VARSTOCASES won't.)      .

NUMERIC ID1 (F3).
LOOP    ID1 = 1 TO ALL_IDs.
.  XSAVE OUTFILE=Unrolled
   /KEEP=Category ID1 ALL.
END LOOP.
EXECUTE /* required */.

*  ...   IV.B.    ... and go to 'file 1 first' order.                .

GET FILE = Unrolled.
DATASET NAME     ManyToMany WINDOW=FRONT.
SORT CASES BY    Category ID1.

*  ...   V.       Re-merge with File1, adding the data, completing   .
*                 the job.                                           .

MATCH FILES
   /TABLE = File1_IDd
   /FILE  = *
   /BY      Category ID1
   /KEEP  = Category ID1 ALL.
  
LISt.
================================
APPENDIX II: History and methods
================================

The best ideas have been some version of what I'll call the 'long-wide-long' method. In that method, in one of the files, the records for every key value (i'll cal are combined into a single record ('wide' form), using CASESTOVARS or an equivalent. Then there's only one record per key value in that file, and the merge can be done as one-to-many, using /TABLE in MATCH FILES. Finally, the data from the first file is re-converted ('unrolled') to 'long' form, using VARSTOCASES or some equivalent.

I don't think anybody's got that working satisfactorily. First, the 'wide' File 1 may have so many variables that SPSS can't handle it, or can't handle it efficiently. Second, and more crucial, both the list of variables in File 1, and the maximum number of records in File 1 for any key value, must be coded in the VARSTOCASES statement. One might write a Python program to analyze the data dictionary and generate the VARSTOCASES statement, but I think even that would be awkward.

David Marso of SPSS, Inc., wrote an interesting variant (see citations above). If the file to be made 'wide', say File 1, has a record-key variable that uniquely identifies records within each (group) key value, only the record-key values need be combined into a 'wide' record. After the one-to-many merge with File 2, and unrolling back to long form, the File 1 data can then be re-attached by a MATCH FILE by group-key and record-key, with File 1 as a TABLE.

David Marso's code still requires a vector to hold the record-key values from File 1; and the length of that vector, i.e. the maximum number of records in File 1 for any group-key value, must be given in the code.

However, if the record key consists of integers sequentially from 1 within each group, the set of keys may be represented by a single integer, namely the largest value. Then, the set of variables that need be named is not data-dependent at all, and general-purpose code is possible.
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Many-to-many merge in SPSS

fuigb
Has anyone had luck with this code?  I understand what it is designed to do, but it does not run and too far short of being a real coder to debug it.  If anyone charitable has debugged this then please post.  

If this can actually be made to function the original author deserves a medal!
Reply | Threaded
Open this post in threaded view
|

Re: Many-to-many merge in SPSS

David Marso
Administrator
In reply to this post by Richard Ristow
These days I'd likely approach it with MATRIX (hint KRON).
Note BIG matrix jobs can be partitioned with SPLIT FILE in batch mode (in case XG of RAM doesn't suffice).
It's really late here.  I shall scrutinize 2mrow.
Thanks for the mention. All too often people omit that courtesy!
-----
Richard Ristow wrote
From time to time, the question has come up, how to do a many-to-many
merge in SPSS: given two files with key variables (which I'll call the
'group key') and, potentially, several records per group-key value in
each file, join every record in each file with every record having the
same group-key value in the other file. Up to now, no satisfactory
solution had been found (see Appendix II at the end of this
posting).
Here, I give code and a test run for what I think is a complete solution.
It's a version of the long-wide-long method (see Appendix II), using

. David Marso's insight that only record keys from the file being
converted to 'wide' need be kept in the 'wide' file(*), and
. my own, that if the record keys are integers consecutive from 1, a set
of keys is characterized by its largest value.
(*)  David Marso is <[hidden email]>. His code, for the
special case of taking the Cartesian product of two files, is in SPSSX-L
posting
     Date:    Thu, 19 Nov 2009
11:27:59
-0500              

     From:    Lakshmikanth
<[hidden email]>
     Subject: Cartesian product in SPSS
     To:     
[hidden email] 
     A reformatted version, with test run, is
in                 

     Date:    Tue, 24 Nov 2009
00:46:30 -0500
     From:    Richard Ristow
<[hidden email]>
     Subject: Re: Cartesian product in SPSS
     To:     
[hidden email] 

Test and demonstration:
In the following, 'Unrolled' cannot be a dataset; it must be a disk
file or a file handle to a disk file.
Test data - File 1:
[File1]
Category      School
Art           RISD
Art          
Metropolitan Museum
Liberal Arts  Oberlin
Liberal Arts  Antioch
Liberal Arts  Swarthmore
Technical     MIT
Technical     ETH
Number of cases read:  7    Number of cases
listed:  7
File 2:
[File2]
Category      Subject
Art          
Painting
Art          
Drawing
Art          
Sculpture
Liberal Arts  Literature
Liberal Arts  Philosophy
Technical     Calculus
Technical     Physics
Technical     Chemistry
Technical     Engineering
Number of cases read:  9    Number of cases
listed:  9
Demonstration run, with code, intermediate files, and
output:

*  ...  
I.       Identify File1 records, within
categories, by      .
*                
integers sequentially from
1.                     
.
*        (Because the category
variable is a string, this is done in .
*        transformation code; CREATE
and RANK wouldn't work.)       
.
DATASET ACTIVATE File1     WINDOW=ASIS.
DATASET COPY     File1_IDd WINDOW=MINIMIZED.
DATASET ACTIVATE File1_IDd WINDOW=FRONT.
NUMERIC    ID1 (F3).
DO IF    $CASENUM EQ 1.
.  COMPUTE ID1 = 1.
ELSE IF  Category NE LAG(Category).
.  COMPUTE ID1 = 1.
ELSE.
.  COMPUTE ID1 = LAG(ID1) + 1.
END IF.

*  ...   II.      Restructure
File1 from 'long' to 'wide', except    .
*              
. Keep only the category, and identifying number ID1 .
*              
. Represent the set of values of ID1 by the largest  .
*                
value. Since the values of ID1 are consecutive    
.
*                
integers starting with 1, that's
unambiguous.      .
DATASET DECLARE   File1_Wide.
DATASET ACTIVATE  File1_IDd  WINDOW=ASIS.
AGGREGATE OUTFILE=File1_Wide
   /BREAK=Category
   /All_IDs 'Representing all values of ID1' =
MAX(ID1).

.  /**/  DATASET ACTIVATE File1_Wide
WINDOW=FRONT              
/*-*/.
*  -**-  This, believe it or not, is the 'wide' form of
File1   -***-.
.  /**/ 
LIST                                                  
/*-*/.
List
|-----------------------------|---------------------------|
|Output
Created              
|26-NOV-2009 14:13:19       |
|-----------------------------|---------------------------|
[File1_Wide]
 
Category      All_IDs
Art              
2
Liberal Arts      3
Technical         2
Number of cases read:  3    Number of cases
listed:  3
 
*  ...   III.     Merge 'wide' File1
with
File2                     
.
*                
This is a one-to-many
merge.                      
.
MATCH FILES
   /TABLE=File1_Wide
   /FILE =File2
   /BY Category.
.  /**/ 
LIST                                                  
/*-*/.
List
|-----------------------------|---------------------------|
|Output
Created              
|26-NOV-2009 14:13:20       |
|-----------------------------|---------------------------|
Category      All_IDs Subject
Art              
2   Painting
Art              
2   Drawing
Art              
2   Sculpture
Liberal Arts      3   Literature
Liberal Arts      3   Philosophy
Technical         2  
Calculus
Technical         2  
Physics
Technical         2  
Chemistry
Technical         2  
Engineering
Number of cases read:  9    Number of cases
listed:  9
 
*  ...   IV.A.    Return to a 'long'
structure, with all File1       .
*                
records joined with all File2
records.            
.
*       The result is the complete
many-to-many merged file, except  .
*       for (oh, yes) the *data* from
File1.                        
.
*       (This is where XSAVE will work, and
VARSTOCASES won't.)      .
NUMERIC ID1 (F3).
LOOP    ID1 = 1 TO ALL_IDs.
.  XSAVE OUTFILE=Unrolled
   /KEEP=Category ID1 ALL.
END LOOP.
EXECUTE /* required */.
*  ...   IV.B.    ... and go to 'file 1
first'
order.               
.
GET FILE = Unrolled.
DATASET NAME     ManyToMany WINDOW=FRONT.
SORT CASES BY    Category ID1.

*  ...   V.       Re-merge
with File1, adding the data, completing   .
*                
the
job.                                          
.
MATCH FILES
   /TABLE = File1_IDd
   /FILE  = *
   /BY      Category ID1
   /KEEP  = Category ID1 ALL.
LISt.
List
|-----------------------------|---------------------------|
|Output
Created              
|26-NOV-2009 14:13:22       |
|-----------------------------|---------------------------|
[ManyToMany]
C:\Documents and Settings\Richard\My Documents
  \Temporary\SPSS\
  2009-11-24 Ristow - Many to many merge in SPSS - UNROLL.SAV
 
Category      ID1
School                
All_IDs Subject
Art            
1
RISD                      
2   Painting
Art            
1
RISD                      
2   Drawing
Art            
1
RISD                      
2   Sculpture
Art            
2 Metropolitan Museum       
2   Painting
Art            
2 Metropolitan Museum       
2   Drawing
Art            
2 Metropolitan Museum       
2   Sculpture
Liberal Arts    1
Oberlin                   
3   Literature
Liberal Arts    1
Oberlin                   
3   Philosophy
Liberal Arts    2
Antioch                   
3   Literature
Liberal Arts    2
Antioch                   
3   Philosophy
Liberal Arts    3
Swarthmore                
3   Literature
Liberal Arts    3
Swarthmore                
3   Philosophy
Technical       1
MIT                       
2   Calculus
Technical       1
MIT                       
2   Physics
Technical       1
MIT                       
2   Chemistry
Technical       1
MIT                       
2   Engineering
Technical       2
ETH                       
2   Calculus
Technical       2
ETH                       
2   Physics
Technical       2
ETH                       
2   Chemistry
Technical       2
ETH                       
2   Engineering
Number of cases read:  20    Number of cases
listed:  20
===============================
APPENDIX I:  Test data and code
===============================
*  C:\Documents and Settings\Richard\My
Documents                   
.
*   
\Technical\spssx-l\Z-2009d\                                    
.
*    2009-11-26  Ristow - Many-to-many merge in
SPSS.SPS            
.
 
*  To support a posting with title "Many-to-many merge in
SPSS"      .

*  The question of doing many-to-many merges in SPSS, comes up
from  .
*  time to time. No fully satisfactory solution has been
posted.     .
*                                                                   
.
*  The most promising method suggested has been
"long-wide-long".    .
*  That is, all records in, say, file 1 with the same
group          .
*  identifier are brought into a single record, typically
by         .
*  CASESTOVARS. That changes the problem to a one-to-many
merge,     .
*  using MATCH FILES with the wide version as a TABLE file.
The      .
*  result, which is 'long' in file 2 and 'wide' in file 1, is
then   .
*  unrolled to restore file 1, typically using VARSTOCASES.
That     .
*  gives, for each file 2 record, a complete restored set of file
1  .
*  records, completing the
match.                                   
.
*                                                                   
.
*  There are two distinct difficulties. First, if the file has
many  .
*  variables, its wide version may have a very large number.
And,    .
*  the VARSTOCASES is data-dependent: the largest number of
cases    .
*  in any file 2 group, and the list of variables to be
restored,    .
*  must be in the VARSTOCASES code. (A solution, of course, is
to    .
*  have Python look at the data dictionary and generate
the          .
*  VARSTOCASES -- but that's probably not trivial, as Python has
to  .
*  infer which sets of variables were generated from
CASESTOVARS.)   .
*                                                                   
.
*  A variant is to include only the identifiers of the file
1        .
*  records in the 'wide' file 1, and re-attach the data
variables    .
*  later. That avoids creating a file with a great many
variables,   .
*  and does not require a VARSTOCASES naming the full list
of        .
*  variables to be restored. The Cartesian-product code by
David     .
*  Marso(*) works this way. However, it still  requires that
the     .
*  number of records in one of the files, or rather the
largest      .
*  number in any group in that file, be in the code, as the bound of
.
*  the vector of
identifiers.                                       
.
*                                                                   
.
*  (*)  David Marso's code is in SPSSX-L
posting                    
.
*       Date:    Thu, 19 Nov
2009 11:27:59
-0500                    
.
*       From:   
Lakshmikanth
<[hidden email]>     
.
*       Subject: Cartesian product in
SPSS                          
.
*       To:     
[hidden email]                           
.
*       A version reformatted by R. Ristow
is
in                    
.
*       Date:    Tue, 24 Nov
2009 00:46:30
-0500                    
.
*       From:    Richard
Ristow
<[hidden email]>           
.
*       Subject: Re: Cartesian product in
SPSS                      
.
*       To:     
[hidden email]                           
.
*                                                                   
.
*  However, if the keys within groups are successive
positive        .
*  integers, the 'wide' form of the list of IDs can be a
single      .
*  number: the largest key value in the group. ('Unrolling'
this     .
*  requires XSAVE logic, instead of
VARSTOCASES.)                   
.
*                                                                   
.
*  That's what's done in the code below, assigning integer
record    .
*  IDs and applying 'log-wide-long' to File
1.                      
.

*  .................................................................
.
*  .................   Scratch
file           
..................... .
*  (Because it's written by XSAVE, it cannot be a
dataset.)         
.
FILE HANDLE Unrolled
 /NAME='C:\Documents and Settings\Richard\My
Documents'             
+
        
'\Temporary\SPSS\'                                         
+
         '2009-11-24 Ristow -
Many to many merge in
SPSS'           
+
         ' -
'                                                      
+
         'UNROLL.SAV'.

*  .................   Test
data              
..................... .

DATA LIST FIXED
   /Category    04-16 (A)
    School      17-38 (A).
BEGIN DATA
   Art         
RISD
   Art         
Metropolitan Museum
   Liberal Arts Oberlin
   Liberal Arts Antioch
   Liberal Arts Swarthmore
   Technical    MIT
   Technical    ETH
END DATA.
DATASET NAME     File1.
.  /*-- LIST /*-*/.
DATA LIST FIXED
   /Category    04-16 (A)
    Subject     17-28 (A).
BEGIN DATA
   Art         
Painting
   Art         
Drawing
   Art         
Sculpture
   Liberal Arts Literature
   Liberal Arts Philosophy
   Technical    Calculus
   Technical    Physics
   Technical    Chemistry
   Technical    Engineering
END DATA.
DATASET NAME     File2.
.  /*-- LIST /*-*/.
*  .................   Post after this point  
..................... .
*  .................................................................
.
DATASET ACTIVATE File1     WINDOW=ASIS.
LIST.
DATASET ACTIVATE File2     WINDOW=ASIS.
LIST.
*  ...   I.       Identify
File1 records, within categories, by      .
*                
integers sequentially from
1.                     
.
*        (Because the category
variable is a string, this is done in .
*        transformation code; CREATE
and RANK wouldn't work.)       
.
DATASET ACTIVATE File1     WINDOW=ASIS.
DATASET COPY     File1_IDd WINDOW=MINIMIZED.
DATASET ACTIVATE File1_IDd WINDOW=FRONT.

NUMERIC    ID1 (F3).
DO IF    $CASENUM EQ 1.
.  COMPUTE ID1 = 1.
ELSE IF  Category NE LAG(Category).
.  COMPUTE ID1 = 1.
ELSE.
.  COMPUTE ID1 = LAG(ID1) + 1.
END IF.

*  ...   II.      Restructure
File1 from 'long' to 'wide', except    .
*              
. Keep only the category, and identifying number ID1 .
*              
. Represent the set of values of ID1 by the largest  .
*                
value. Since the values of ID1 are consecutive    
.
*                
integers starting with 1, that's
unambiguous.      .
DATASET DECLARE   File1_Wide.
DATASET ACTIVATE  File1_IDd  WINDOW=ASIS.
AGGREGATE OUTFILE=File1_Wide
   /BREAK=Category
   /All_IDs 'Representing all values of ID1' =
MAX(ID1).
  
.  /**/  DATASET ACTIVATE File1_Wide
WINDOW=FRONT              
/*-*/.
*  -**-  This, believe it or not, is the 'wide' form of
File1   -***-.
.  /**/ 
LIST                                                  
/*-*/.
 
*  ...   III.     Merge 'wide' File1
with
File2                     
.
*                
This is a one-to-many
merge.                      
.
MATCH FILES
   /TABLE=File1_Wide
   /FILE =File2
   /BY Category.
  
.  /**/ 
LIST                                                  
/*-*/.

*  ...   IV.A.    Return to a 'long'
structure, with all File1       .
*                
records joined with all File2
records.            
.
*       The result is the complete
many-to-many merged file, except  .
*       for (oh, yes) the *data* from
File1.                        
.
*       (This is where XSAVE will work, and
VARSTOCASES won't.)      .
NUMERIC ID1 (F3).
LOOP    ID1 = 1 TO ALL_IDs.
.  XSAVE OUTFILE=Unrolled
   /KEEP=Category ID1 ALL.
END LOOP.
EXECUTE /* required */.
*  ...   IV.B.    ... and go to 'file 1
first'
order.               
.
GET FILE = Unrolled.
DATASET NAME     ManyToMany WINDOW=FRONT.
SORT CASES BY    Category ID1.
*  ...   V.       Re-merge
with File1, adding the data, completing   .
*                
the
job.                                          
.
MATCH FILES
   /TABLE = File1_IDd
   /FILE  = *
   /BY      Category ID1
   /KEEP  = Category ID1 ALL.
  
LISt.
================================
APPENDIX II: History and methods
================================
The best ideas have been some version of what I'll call the
'long-wide-long' method. In that method, in one of the files, the records
for every key value (i'll cal are combined into a single record ('wide'
form), using CASESTOVARS or an equivalent. Then there's only one record
per key value in that file, and the merge can be done as one-to-many,
using /TABLE in MATCH FILES. Finally, the data from the first file is
re-converted ('unrolled') to 'long' form, using VARSTOCASES or some
equivalent.
I don't think anybody's got that working satisfactorily. First, the
'wide' File 1 may have so many variables that SPSS can't handle it, or
can't handle it efficiently. Second, and more crucial, both the list of
variables in File 1, and the maximum number of records in File 1 for any
key value, must be coded in the VARSTOCASES statement. One might write a
Python program to analyze the data dictionary and generate the
VARSTOCASES statement, but I think even that would be awkward.
David Marso of SPSS, Inc., wrote an interesting variant (see citations
above). If the file to be made 'wide', say File 1, has a record-key
variable that uniquely identifies records within each (group) key value,
only the record-key values need be combined into a 'wide' record. After
the one-to-many merge with File 2, and unrolling back to long form, the
File 1 data can then be re-attached by a MATCH FILE by group-key and
record-key, with File 1 as a TABLE.
David Marso's code still requires a vector to hold the record-key values
from File 1; and the length of that vector, i.e. the maximum number of
records in File 1 for any group-key value, must be given in the
code.
However, if the record key consists of integers sequentially from 1
within each group, the set of keys may be represented by a single
integer, namely the largest value. Then, the set of variables that need
be named is not data-dependent at all, and general-purpose code is
possible.



=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Many-to-many merge in SPSS

fuigb
After a night's sleep I realized what I needed to tweak and so SUCCESS!  You guys are geniuses.  For years and at various jobs I've fought the good fight for SPSS against the true-believers of SAS: "SPSS cannot accomplish what is easy in SAS."  You're helping me to roll back the naysayers!  Again, thank you.  
Reply | Threaded
Open this post in threaded view
|

Re: Many-to-many merge in SPSS

David Marso
Administrator
In reply to this post by David Marso
Here is a solution to the many to many merge using 4 lines of MATRIX code ;-)
Note that this only works as is with numeric data!!!
To work with strings, best is to merge the ids and numeric using the following (save strings and ids to separate files or datasets, then do sort and table matches to insert the string data).
Note the Cartesian dataset is sorted by ID02 ID01, so MATCH stringtable2 BY ID02,
SORT result by ID01 then MATCH stringtable1 BT ID01.
I only ran this against two 1000 row tables -> 1,000,000 resulting cases.
Larger data files will likely work fine given sufficient RAM and WORKSPACE settings.
For HUGE files?  Contact me for a con$ultation ;-)

*Data simulation here *.
DATASET DECLARE file1.
MATRIX.
SAVE TRUNC(UNIFORM(1000,10)*10)/ OUTFILE file1/ VARIABLES f1x01 TO f1x10.
END MATRIX.
DATASET ACTIVATE file1.
COMPUTE ID01=$CASENUM.
SAVE OUTFILE 'C:\TEMP\file1.sav'.

DATASET DECLARE file2.
MATRIX.
SAVE TRUNC(UNIFORM(1000,10)*10)/ OUTFILE file2/ VARIABLES f2x01 TO f2x10.
END MATRIX.
DATASET ACTIVATE file2.
COMPUTE ID02=$CASENUM.
SAVE OUTFILE 'C:\TEMP\file2.sav'.
/*DATA Simulation  DONE */.

/*Actual cartesian match */.
DATASET DECLARE cartesian.
MATRIX .
GET data1 / FILE 'C:\TEMP\file1.sav' / VARIABLES ID01 f1x01 TO f1x10.
GET data2 / FILE 'C:\TEMP\file2.sav' / VARIABLES ID02 f2x01 TO f2x10.
COMPUTE d1d2={KRONEKER(MAKE(NROW(data2),1,1),data1),
              KRONEKER(data2,MAKE(NROW(data1),1,1))}.
SAVE d1d2
  / OUTFILE cartesian
  / VARIABLES ID01  f1x01 TO f1x10 ID02 f2x01 TO f2x10 .
END MATRIX.



David Marso wrote
These days I'd likely approach it with MATRIX (hint KRON).
Note BIG matrix jobs can be partitioned with SPLIT FILE in batch mode (in case XG of RAM doesn't suffice).
It's really late here.  I shall scrutinize 2mrow.
Thanks for the mention. All too often people omit that courtesy!
-----
Richard Ristow wrote
From time to time, the question has come up, how to do a many-to-many
merge in SPSS: given two files with key variables (which I'll call the
'group key') and, potentially, several records per group-key value in
each file, join every record in each file with every record having the
same group-key value in the other file. Up to now, no satisfactory
solution had been found (see Appendix II at the end of this
posting).
Here, I give code and a test run for what I think is a complete solution.
It's a version of the long-wide-long method (see Appendix II), using

. David Marso's insight that only record keys from the file being
converted to 'wide' need be kept in the 'wide' file(*), and
. my own, that if the record keys are integers consecutive from 1, a set
of keys is characterized by its largest value.
(*)  David Marso is <[hidden email]>. His code, for the
special case of taking the Cartesian product of two files, is in SPSSX-L
posting
     Date:    Thu, 19 Nov 2009
11:27:59
-0500              

     From:    Lakshmikanth
<[hidden email]>
     Subject: Cartesian product in SPSS
     To:     
[hidden email] 
     A reformatted version, with test run, is
in                 

     Date:    Tue, 24 Nov 2009
00:46:30 -0500
     From:    Richard Ristow
<[hidden email]>
     Subject: Re: Cartesian product in SPSS
     To:     
[hidden email] 

Test and demonstration:
In the following, 'Unrolled' cannot be a dataset; it must be a disk
file or a file handle to a disk file.
Test data - File 1:
[File1]
Category      School
Art           RISD
Art          
Metropolitan Museum
Liberal Arts  Oberlin
Liberal Arts  Antioch
Liberal Arts  Swarthmore
Technical     MIT
Technical     ETH
Number of cases read:  7    Number of cases
listed:  7
File 2:
[File2]
Category      Subject
Art          
Painting
Art          
Drawing
Art          
Sculpture
Liberal Arts  Literature
Liberal Arts  Philosophy
Technical     Calculus
Technical     Physics
Technical     Chemistry
Technical     Engineering
Number of cases read:  9    Number of cases
listed:  9
Demonstration run, with code, intermediate files, and
output:

*  ...  
I.       Identify File1 records, within
categories, by      .
*                
integers sequentially from
1.                     
.
*        (Because the category
variable is a string, this is done in .
*        transformation code; CREATE
and RANK wouldn't work.)       
.
DATASET ACTIVATE File1     WINDOW=ASIS.
DATASET COPY     File1_IDd WINDOW=MINIMIZED.
DATASET ACTIVATE File1_IDd WINDOW=FRONT.
NUMERIC    ID1 (F3).
DO IF    $CASENUM EQ 1.
.  COMPUTE ID1 = 1.
ELSE IF  Category NE LAG(Category).
.  COMPUTE ID1 = 1.
ELSE.
.  COMPUTE ID1 = LAG(ID1) + 1.
END IF.

*  ...   II.      Restructure
File1 from 'long' to 'wide', except    .
*              
. Keep only the category, and identifying number ID1 .
*              
. Represent the set of values of ID1 by the largest  .
*                
value. Since the values of ID1 are consecutive    
.
*                
integers starting with 1, that's
unambiguous.      .
DATASET DECLARE   File1_Wide.
DATASET ACTIVATE  File1_IDd  WINDOW=ASIS.
AGGREGATE OUTFILE=File1_Wide
   /BREAK=Category
   /All_IDs 'Representing all values of ID1' =
MAX(ID1).

.  /**/  DATASET ACTIVATE File1_Wide
WINDOW=FRONT              
/*-*/.
*  -**-  This, believe it or not, is the 'wide' form of
File1   -***-.
.  /**/ 
LIST                                                  
/*-*/.
List
|-----------------------------|---------------------------|
|Output
Created              
|26-NOV-2009 14:13:19       |
|-----------------------------|---------------------------|
[File1_Wide]
 
Category      All_IDs
Art              
2
Liberal Arts      3
Technical         2
Number of cases read:  3    Number of cases
listed:  3
 
*  ...   III.     Merge 'wide' File1
with
File2                     
.
*                
This is a one-to-many
merge.                      
.
MATCH FILES
   /TABLE=File1_Wide
   /FILE =File2
   /BY Category.
.  /**/ 
LIST                                                  
/*-*/.
List
|-----------------------------|---------------------------|
|Output
Created              
|26-NOV-2009 14:13:20       |
|-----------------------------|---------------------------|
Category      All_IDs Subject
Art              
2   Painting
Art              
2   Drawing
Art              
2   Sculpture
Liberal Arts      3   Literature
Liberal Arts      3   Philosophy
Technical         2  
Calculus
Technical         2  
Physics
Technical         2  
Chemistry
Technical         2  
Engineering
Number of cases read:  9    Number of cases
listed:  9
 
*  ...   IV.A.    Return to a 'long'
structure, with all File1       .
*                
records joined with all File2
records.            
.
*       The result is the complete
many-to-many merged file, except  .
*       for (oh, yes) the *data* from
File1.                        
.
*       (This is where XSAVE will work, and
VARSTOCASES won't.)      .
NUMERIC ID1 (F3).
LOOP    ID1 = 1 TO ALL_IDs.
.  XSAVE OUTFILE=Unrolled
   /KEEP=Category ID1 ALL.
END LOOP.
EXECUTE /* required */.
*  ...   IV.B.    ... and go to 'file 1
first'
order.               
.
GET FILE = Unrolled.
DATASET NAME     ManyToMany WINDOW=FRONT.
SORT CASES BY    Category ID1.

*  ...   V.       Re-merge
with File1, adding the data, completing   .
*                
the
job.                                          
.
MATCH FILES
   /TABLE = File1_IDd
   /FILE  = *
   /BY      Category ID1
   /KEEP  = Category ID1 ALL.
LISt.
List
|-----------------------------|---------------------------|
|Output
Created              
|26-NOV-2009 14:13:22       |
|-----------------------------|---------------------------|
[ManyToMany]
C:\Documents and Settings\Richard\My Documents
  \Temporary\SPSS\
  2009-11-24 Ristow - Many to many merge in SPSS - UNROLL.SAV
 
Category      ID1
School                
All_IDs Subject
Art            
1
RISD                      
2   Painting
Art            
1
RISD                      
2   Drawing
Art            
1
RISD                      
2   Sculpture
Art            
2 Metropolitan Museum       
2   Painting
Art            
2 Metropolitan Museum       
2   Drawing
Art            
2 Metropolitan Museum       
2   Sculpture
Liberal Arts    1
Oberlin                   
3   Literature
Liberal Arts    1
Oberlin                   
3   Philosophy
Liberal Arts    2
Antioch                   
3   Literature
Liberal Arts    2
Antioch                   
3   Philosophy
Liberal Arts    3
Swarthmore                
3   Literature
Liberal Arts    3
Swarthmore                
3   Philosophy
Technical       1
MIT                       
2   Calculus
Technical       1
MIT                       
2   Physics
Technical       1
MIT                       
2   Chemistry
Technical       1
MIT                       
2   Engineering
Technical       2
ETH                       
2   Calculus
Technical       2
ETH                       
2   Physics
Technical       2
ETH                       
2   Chemistry
Technical       2
ETH                       
2   Engineering
Number of cases read:  20    Number of cases
listed:  20
===============================
APPENDIX I:  Test data and code
===============================
*  C:\Documents and Settings\Richard\My
Documents                   
.
*   
\Technical\spssx-l\Z-2009d\                                    
.
*    2009-11-26  Ristow - Many-to-many merge in
SPSS.SPS            
.
 
*  To support a posting with title "Many-to-many merge in
SPSS"      .

*  The question of doing many-to-many merges in SPSS, comes up
from  .
*  time to time. No fully satisfactory solution has been
posted.     .
*                                                                   
.
*  The most promising method suggested has been
"long-wide-long".    .
*  That is, all records in, say, file 1 with the same
group          .
*  identifier are brought into a single record, typically
by         .
*  CASESTOVARS. That changes the problem to a one-to-many
merge,     .
*  using MATCH FILES with the wide version as a TABLE file.
The      .
*  result, which is 'long' in file 2 and 'wide' in file 1, is
then   .
*  unrolled to restore file 1, typically using VARSTOCASES.
That     .
*  gives, for each file 2 record, a complete restored set of file
1  .
*  records, completing the
match.                                   
.
*                                                                   
.
*  There are two distinct difficulties. First, if the file has
many  .
*  variables, its wide version may have a very large number.
And,    .
*  the VARSTOCASES is data-dependent: the largest number of
cases    .
*  in any file 2 group, and the list of variables to be
restored,    .
*  must be in the VARSTOCASES code. (A solution, of course, is
to    .
*  have Python look at the data dictionary and generate
the          .
*  VARSTOCASES -- but that's probably not trivial, as Python has
to  .
*  infer which sets of variables were generated from
CASESTOVARS.)   .
*                                                                   
.
*  A variant is to include only the identifiers of the file
1        .
*  records in the 'wide' file 1, and re-attach the data
variables    .
*  later. That avoids creating a file with a great many
variables,   .
*  and does not require a VARSTOCASES naming the full list
of        .
*  variables to be restored. The Cartesian-product code by
David     .
*  Marso(*) works this way. However, it still  requires that
the     .
*  number of records in one of the files, or rather the
largest      .
*  number in any group in that file, be in the code, as the bound of
.
*  the vector of
identifiers.                                       
.
*                                                                   
.
*  (*)  David Marso's code is in SPSSX-L
posting                    
.
*       Date:    Thu, 19 Nov
2009 11:27:59
-0500                    
.
*       From:   
Lakshmikanth
<[hidden email]>     
.
*       Subject: Cartesian product in
SPSS                          
.
*       To:     
[hidden email]                           
.
*       A version reformatted by R. Ristow
is
in                    
.
*       Date:    Tue, 24 Nov
2009 00:46:30
-0500                    
.
*       From:    Richard
Ristow
<[hidden email]>           
.
*       Subject: Re: Cartesian product in
SPSS                      
.
*       To:     
[hidden email]                           
.
*                                                                   
.
*  However, if the keys within groups are successive
positive        .
*  integers, the 'wide' form of the list of IDs can be a
single      .
*  number: the largest key value in the group. ('Unrolling'
this     .
*  requires XSAVE logic, instead of
VARSTOCASES.)                   
.
*                                                                   
.
*  That's what's done in the code below, assigning integer
record    .
*  IDs and applying 'log-wide-long' to File
1.                      
.

*  .................................................................
.
*  .................   Scratch
file           
..................... .
*  (Because it's written by XSAVE, it cannot be a
dataset.)         
.
FILE HANDLE Unrolled
 /NAME='C:\Documents and Settings\Richard\My
Documents'             
+
        
'\Temporary\SPSS\'                                         
+
         '2009-11-24 Ristow -
Many to many merge in
SPSS'           
+
         ' -
'                                                      
+
         'UNROLL.SAV'.

*  .................   Test
data              
..................... .

DATA LIST FIXED
   /Category    04-16 (A)
    School      17-38 (A).
BEGIN DATA
   Art         
RISD
   Art         
Metropolitan Museum
   Liberal Arts Oberlin
   Liberal Arts Antioch
   Liberal Arts Swarthmore
   Technical    MIT
   Technical    ETH
END DATA.
DATASET NAME     File1.
.  /*-- LIST /*-*/.
DATA LIST FIXED
   /Category    04-16 (A)
    Subject     17-28 (A).
BEGIN DATA
   Art         
Painting
   Art         
Drawing
   Art         
Sculpture
   Liberal Arts Literature
   Liberal Arts Philosophy
   Technical    Calculus
   Technical    Physics
   Technical    Chemistry
   Technical    Engineering
END DATA.
DATASET NAME     File2.
.  /*-- LIST /*-*/.
*  .................   Post after this point  
..................... .
*  .................................................................
.
DATASET ACTIVATE File1     WINDOW=ASIS.
LIST.
DATASET ACTIVATE File2     WINDOW=ASIS.
LIST.
*  ...   I.       Identify
File1 records, within categories, by      .
*                
integers sequentially from
1.                     
.
*        (Because the category
variable is a string, this is done in .
*        transformation code; CREATE
and RANK wouldn't work.)       
.
DATASET ACTIVATE File1     WINDOW=ASIS.
DATASET COPY     File1_IDd WINDOW=MINIMIZED.
DATASET ACTIVATE File1_IDd WINDOW=FRONT.

NUMERIC    ID1 (F3).
DO IF    $CASENUM EQ 1.
.  COMPUTE ID1 = 1.
ELSE IF  Category NE LAG(Category).
.  COMPUTE ID1 = 1.
ELSE.
.  COMPUTE ID1 = LAG(ID1) + 1.
END IF.

*  ...   II.      Restructure
File1 from 'long' to 'wide', except    .
*              
. Keep only the category, and identifying number ID1 .
*              
. Represent the set of values of ID1 by the largest  .
*                
value. Since the values of ID1 are consecutive    
.
*                
integers starting with 1, that's
unambiguous.      .
DATASET DECLARE   File1_Wide.
DATASET ACTIVATE  File1_IDd  WINDOW=ASIS.
AGGREGATE OUTFILE=File1_Wide
   /BREAK=Category
   /All_IDs 'Representing all values of ID1' =
MAX(ID1).
  
.  /**/  DATASET ACTIVATE File1_Wide
WINDOW=FRONT              
/*-*/.
*  -**-  This, believe it or not, is the 'wide' form of
File1   -***-.
.  /**/ 
LIST                                                  
/*-*/.
 
*  ...   III.     Merge 'wide' File1
with
File2                     
.
*                
This is a one-to-many
merge.                      
.
MATCH FILES
   /TABLE=File1_Wide
   /FILE =File2
   /BY Category.
  
.  /**/ 
LIST                                                  
/*-*/.

*  ...   IV.A.    Return to a 'long'
structure, with all File1       .
*                
records joined with all File2
records.            
.
*       The result is the complete
many-to-many merged file, except  .
*       for (oh, yes) the *data* from
File1.                        
.
*       (This is where XSAVE will work, and
VARSTOCASES won't.)      .
NUMERIC ID1 (F3).
LOOP    ID1 = 1 TO ALL_IDs.
.  XSAVE OUTFILE=Unrolled
   /KEEP=Category ID1 ALL.
END LOOP.
EXECUTE /* required */.
*  ...   IV.B.    ... and go to 'file 1
first'
order.               
.
GET FILE = Unrolled.
DATASET NAME     ManyToMany WINDOW=FRONT.
SORT CASES BY    Category ID1.
*  ...   V.       Re-merge
with File1, adding the data, completing   .
*                
the
job.                                          
.
MATCH FILES
   /TABLE = File1_IDd
   /FILE  = *
   /BY      Category ID1
   /KEEP  = Category ID1 ALL.
  
LISt.
================================
APPENDIX II: History and methods
================================
The best ideas have been some version of what I'll call the
'long-wide-long' method. In that method, in one of the files, the records
for every key value (i'll cal are combined into a single record ('wide'
form), using CASESTOVARS or an equivalent. Then there's only one record
per key value in that file, and the merge can be done as one-to-many,
using /TABLE in MATCH FILES. Finally, the data from the first file is
re-converted ('unrolled') to 'long' form, using VARSTOCASES or some
equivalent.
I don't think anybody's got that working satisfactorily. First, the
'wide' File 1 may have so many variables that SPSS can't handle it, or
can't handle it efficiently. Second, and more crucial, both the list of
variables in File 1, and the maximum number of records in File 1 for any
key value, must be coded in the VARSTOCASES statement. One might write a
Python program to analyze the data dictionary and generate the
VARSTOCASES statement, but I think even that would be awkward.
David Marso of SPSS, Inc., wrote an interesting variant (see citations
above). If the file to be made 'wide', say File 1, has a record-key
variable that uniquely identifies records within each (group) key value,
only the record-key values need be combined into a 'wide' record. After
the one-to-many merge with File 2, and unrolling back to long form, the
File 1 data can then be re-attached by a MATCH FILE by group-key and
record-key, with File 1 as a TABLE.
David Marso's code still requires a vector to hold the record-key values
from File 1; and the length of that vector, i.e. the maximum number of
records in File 1 for any group-key value, must be given in the
code.
However, if the record key consists of integers sequentially from 1
within each group, the set of keys may be represented by a single
integer, namely the largest value. Then, the set of variables that need
be named is not data-dependent at all, and general-purpose code is
possible.



=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Many-to-many merge in SPSS

David Marso
Administrator
From Richard's appendix:
"David Marso's code still requires a vector to hold the record-key values
from File 1; and the length of that vector, i.e. the maximum number of
records in File 1 for any group-key value, must be given in the
code.
However, if the record key consists of integers sequentially from 1
within each group, the set of keys may be represented by a single
integer, namely the largest value. Then, the set of variables that need
be named is not data-dependent at all, and general-purpose code is
possible. "

SWEET that the matrix solution requires nothing but the two data files ;-)
I later tested the code on 2 10K row files using only IDs for the KRON op and later did the 'fact' dat merges using 2 TABLE merges with an intermediate SORT. Code on other PC so you'll need to just imagine it until later. Hmm those 2 10K files end up creating a 100M row dataset! Do the math.
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Many-to-many merge in SPSS

Albert-Jan Roskam
Never tried this (and no SPSS here), but this might also be a way:

DEFINE !connect_string ()
'DSN=MS Access Database;DBQ=c:\temp\temp.mdb;DriverId=25;FIL=MS Access;MaxBufferSize=2048;PageTimeout=5;'
!ENDDEFINE.
SAVE TRANSLATE /TYPE=ODBC /CONNECT=!connect_string /TABLE=tbl1 /REPLACE.
SAVE TRANSLATE /TYPE=ODBC /CONNECT=!connect_string /TABLE=tbl2 /REPLACE
  /SQL = "SELECT * FROM tbl1, tbl2 INTO tblCartesian".
GET DATA...
 
Regards,

Albert-Jan



~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a

fresh water system, and public health, what have the Romans ever done for us?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



From: David Marso <[hidden email]>
To: [hidden email]
Sent: Saturday, February 1, 2014 9:08 AM
Subject: Re: [SPSSX-L] Many-to-many merge in SPSS

From Richard's appendix:
"David Marso's code still requires a vector to hold the record-key values
from File 1; and the length of that vector, i.e. the maximum number of
records in File 1 for any group-key value, must be given in the
code.
However, if the record key consists of integers sequentially from 1
within each group, the set of keys may be represented by a single
integer, namely the largest value. Then, the set of variables that need
be named is not data-dependent at all, and general-purpose code is
possible. "

SWEET that the matrix solution requires nothing but the two data files ;-)
I later tested the code on 2 10K row files using only IDs for the KRON op
and later did the 'fact' dat merges using 2 TABLE merges with an
intermediate SORT. Code on other PC so you'll need to just imagine it until
later. Hmm those 2 10K files end up creating a 100M row dataset! Do the
math.



-----
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Many-to-many-merge-in-SPSS-tp1081978p5724262.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD


Reply | Threaded
Open this post in threaded view
|

Re: Many-to-many merge in SPSS

Richard Ristow
In reply to this post by fuigb
There've been a couple of solutions posted -- see David Marso's
MATRIX logic(1), and Albert-Jan Roskam's suggestion of using an
external SQL database(2).

I think the long-wide-long approach will still prove useful. Among
other things, it can be made to work very much like MATCH FILES. And
the output retains labels, formats, and missing values of all
variables; I don't think MATRIX keeps those, and external SQL certainly won't.

(1) For MATRIX logic, see
Date:    Fri, 31 Jan 2014 10:06:28 -0800
From:    David Marso <[hidden email]>
Subject: Re: Many-to-many merge in SPSS
To:      [hidden email]

(2) For SQL logic, see
Date:    Sat, 1 Feb 2014 11:40:31 -0800
From:    Albert-Jan Roskam <[hidden email]>
Subject: Re: Many-to-many merge in SPSS
To:      [hidden email]

Here is the long-wide-long method cleaned up and wrapped in a macro.
I'm testing it, before posting, but I'd be grateful for anybody else
trying it as well.

Call:

!M2MnyX Left   =<left input>/
         Right  =<right input>/
         BYvar  =<variables>/
         WrkFile=<file>.

(The 'X' suffix on the name indicates that this version uses XSAVE logic.)

Parameters (all are required) are,

Left            Name of an existing SPSS dataset;
                or, name or file handle of an SPSS .SAV file
Right           Name of a second SPSS dataset or .SAV file, as above

BYvar           Names of one or more variables occurring in both files.
                Requirements are as for BY variables in MATCH FILES,
                except that it need not be a unique key in either input.
                Every record in each input will be joined to every
                record having the same BYvar value in the other.
WrkFile         Name or file handle of a disk file (NOT a dataset) for
                working storage.

Other requirements:

- Inputs 'Left' and 'Right', and variable(s) 'BYvar' must be compatible
        as for MATCH FILES:
   .    The BYvar variable or variables must occur in both inputs, and
        be of the same type and length in both.
   .    Both inputs must be in ascending order on the BYvar variable(s).
        (However, group key values need not be unique record identifiers
        in either input.)
   .    If any other variable occurs in both, it must have the same type and
        length in both, AND ITS VALUE FROM THE RIGHT INPUT WILL BE LOST.

- SPSS must be able to write to the work file.

Result:

+ The left and right inputs will be unaltered.

+ A new dataset 'Joined' will be created, and will be the active
dataset. It will have one record for every pair of records from the
'left' and 'right' inputs that match on the group key. Order of
variables will be,

.       The 'BYvar' variable(s)
.       Variable 'Left#'. (This variable numbers the records
        in the left input, sequentially within each BYvar value.)
.       The non-key variables from the left input
.       The non-key variables from the right input

+ The records will be in ascending order by the BYvar variables.
Within a BYvar value, they will be in the same order as in the
left-hand input. Records from the same left-hand input record will be
in the same order as in their right-hand input records.

+ No value of the group key will occur in this file unless it occurs
in both inputs. (In database terminology, it is an 'inner join'.)

+ The macro will print certain messages, including echoing its parameters.

Side effects:
.       The working file will be left on disk. It may be erased if desired.
        A pre-an existing file by that name will be lost.
.       Datasets '@LftIDd' and '@LftWide' will be created; they may be erased.
        Pre-existing datasets named '@LftIDd', '@LftWide',  or 'Joined' will
.       be lost.
.       Pre-existing variables named 'Left#', '@NewGrp', and '@MxLkey' will
        be lost.
Glitch:
.       The dataset 'Joined' will be tagged as being an image of the working
        file; on closing SPSS, you'll be asked if you want to save it back
        to the working-file name and location.


(Later, code to erase the working file and scratch datasets may be
added, and the ECHO messages dropped.)

APPENDICES FOLLOW:
I.      Definition of macro !M2MnyX, with instructions in comments
II.     Output from demonstration run
III.    Test data and code, for demonstration run
================================================
Appendix I: Definition of macro !M2MnyX
================================================
*  ****************************************************************  .
*  Macro for many-to-many merge of two datasets or .SAV files,       .
*  using XSAVE logic.                                                .
*  Takes four arguments, all required, all terminated by '/':        .
*  Left     Name of one of the datasets, or saved file, to be joined .
*  Right    Name of the other  dataset,  or saved file, to be joined .
*  BYvar    Variable or variables on which  the two are to be joined..
*           Requirements are as for BY variables in MATCH FILES:     .
*           all variables must be present and of the same type in    .
*           both inputs; and both must be sorted in ascending order  .
*           on the BYvar variables.                                  .
*  WrkFile  Name or handle of disk file in which to 'unroll' the     .
*           keys in Left, joined to the data in Right.               .
*  ****************************************************************  .

DEFINE !M2MnyX(Left    = !CHAREND('/')
               /Right   = !CHAREND('/')
               /BYvar   = !CHAREND('/')
               /WrkFile = !CHAREND('/'))

.  ECHO ' '.
.  ECHO               'Arguments to  <!>M2MnyX1:'.
.  ECHO !QUOTE(!CONCAT('"Left"  input:   ',!Left)).
.  ECHO !QUOTE(!CONCAT('"Right" input:   ',!Right)).
.  ECHO !QUOTE(!CONCAT('Key variable(s): ',!BYvar)).
.  ECHO !QUOTE(!CONCAT('Working file:    ',!WrkFile)).
.  ECHO 'After the macro finishes execution, datasets'.
.  ECHO '"@LftIDd" and "@LftWide" may be erased.'.


*  ...   I.       Add 'Left#', record number within <!>BYvar groups, .
*                 to Left-hand file.                                 .

ADD FILES
     /FILE =!Left
     /BY    !BYvar
     /FIRST= @NewGrp.

NUMERIC    Left#   (F3).

DO IF      @NewGrp.
.  COMPUTE Left# = 1.
ELSE.
.  COMPUTE Left# = LAG(Left#) + 1.
END IF.

DATASET NAME @LftIDd    WINDOW=FRONT.


*  ...   II.      Restructure Left input from 'long' to 'wide',      .
*        /*     . Keep only the group key (!BYvar) and record        .
*                 sequence number within group (Left#).              .
*               . Represent the set of values of Left# by the        .
*                 largest value. Since the values of Left# are       .
*                 consecutive integers starting with 1, the          .
*                 maximum value characterizes the set.               .

DATASET ACTIVATE  @LftIDd  WINDOW=ASIS.
DATASET DECLARE   @LftWide.

AGGREGATE OUTFILE=@LftWide
    /BREAK = !BYvar
    /@MxLkey 'Largest value of Left# for this key' = MAX(Left#).


*  ...   III.     Merge 'wide' Left input with Right input           .
*                 This is a one-to-many merge.                       .

MATCH FILES
    /TABLE=@LftWide
    /FILE =!Right
    /BY    !BYvar.

*  ...   IV.A      Return to a 'long' structure, with all Left       .
*                  records joined with all Right records.            .

*       The result is the complete many-to-many merged file, except  .
*       for the substantive data from the Left input.                .

NUMERIC Left# (F3).
LOOP    Left# = 1 TO @MxLkey.
.  XSAVE OUTFILE=!WrkFile
    /DROP=@MxLKey
    /KEEP=!BYvar Left# ALL.
END LOOP.
EXECUTE /* required */.

*  ...   IV.B.    ... and go to 'Left file first' order.             .

GET  FILE=!WrkFile.
SORT CASES BY !BYvar Left#.


*  ...   V.       Re-merge with !Left, adding the data, completing    .
*                 the join.                                           .

MATCH FILES
    /TABLE= @LftIDd
    /FILE = *
    /BY     !BYvar Left#
    /DROP = @NewGrp
    /KEEP = !BYvar Left# ALL.

DATASET NAME Joined WINDOW=FRONT.

!ENDDEFINE.
================================================
Appendix II.    Output from demonstration run
================================================
DATASET ACTIVATE Schools  WINDOW=FRONT.
LIST.
List
|-----------------------------|---------------------------|
|Output Created               |03-FEB-2014 21:32:26       |
|-----------------------------|---------------------------|
  [Schools]
Category      School

Art           RI School of Design
Art           Metropolitan Museum
Culinary      Johnson & Wales
Liberal Arts  Oberlin
Liberal Arts  Antioch
Liberal Arts  Swarthmore
Technical     MIT
Technical     Cal Tech
Technical     ETH

Number of cases read:  9    Number of cases listed:  9

DATASET ACTIVATE Subjects WINDOW=FRONT.
LIST.
List
|-----------------------------|---------------------------|
|Output Created               |03-FEB-2014 21:32:26       |
|-----------------------------|---------------------------|
  [Subjects]
Category      Subject

Art           Painting
Art           Drawing
Art           Sculpture
Liberal Arts  Literature
Liberal Arts  Philosophy
Technical     Calculus
Technical     Physics
Technical     Chemistry
Technical     Engineering

Number of cases read:  9    Number of cases listed:  9


!M2MnyX  Left    = Schools/
          Right   = Subjects/
          BYvar   = Category/
          WrkFile = Unroll.

Arguments to  <!>M2MnyX1:
"Left"  input:   Schools
"Right" input:   Subjects
Key variable(s): Category
Working file:    Unroll
After the macro finishes execution, datasets
"@LftIDd" and "@LftWide" may be erased.

LIST.
List
|-----------------------------|---------------------------|
|Output Created               |03-FEB-2014 21:32:27       |
|-----------------------------|---------------------------|
  [Joined] C:\Documents and Settings\Richard\My Documents
             \Temporary\SPSS\
             2014-xx-xx Ristow - Many-to-many merge in SPSS - Scratch1.SAV

Category      Left# School                 Subject

Art              1  RI School of Design    Painting
Art              1  RI School of Design    Drawing
Art              1  RI School of Design    Sculpture
Art              2  Metropolitan Museum    Painting
Art              2  Metropolitan Museum    Drawing
Art              2  Metropolitan Museum    Sculpture
Liberal Arts     1  Oberlin                Literature
Liberal Arts     1  Oberlin                Philosophy
Liberal Arts     2  Antioch                Literature
Liberal Arts     2  Antioch                Philosophy
Liberal Arts     3  Swarthmore             Literature
Liberal Arts     3  Swarthmore             Philosophy
Technical        1  MIT                    Calculus
Technical        1  MIT                    Physics
Technical        1  MIT                    Chemistry
Technical        1  MIT                    Engineering
Technical        2  Cal Tech               Calculus
Technical        2  Cal Tech               Physics
Technical        2  Cal Tech               Chemistry
Technical        2  Cal Tech               Engineering
Technical        3  ETH                    Calculus
Technical        3  ETH                    Physics
Technical        3  ETH                    Chemistry
Technical        3  ETH                    Engineering

Number of cases read:  24    Number of cases listed:  24
================================================
Appendix III.   Test data and code, for demonstration run
          III.A Test data and driver code
================================================
*  C:\Documents and Settings\Richard\My Documents                    .
*    \Technical\spssx-l\Z-2014\                                      .
*    2014-02-03 Ristow - Many-to-many merge in SPSS.SPS              .


*  Under name 2014-xx-xx Ristow-Many-to-many merge in SPSS.SPS:      .
*  31 Jan 2014 - Create, from file                                   .
*        C:\Documents and Settings\Richard\My Documents              .
*          \Technical\spssx-l\Z-2009d\                               .
*          2009-11-24 Ristow - Many to many merge in SPSS.SPS        .
*     which was prepared for, and included in, posting               .
*        Date:    Thu, 26 Nov 2009 14:34:54 -0500                    .
*        From:    Richard Ristow <[hidden email]>           .
*        Subject: Many-to-many merge in SPSS                         .
*        To:      [hidden email]                           .
*     Projected clean-up ('+' = done, '.' = to do):                  .
*     +  Name input datasets '@Left' and '@Right', instead of        .
*        'File1' and 'File2'                                         .
*     +  Use ADD FILES when creating record sequence numbers for     .
*        '@Left'; this allows group keys with several variables.     .
*     +  Don't add sequence numbers for file @'Right'; they aren't   .
*        needed                                                      .
*     +  Wrap in a macro, initially un-parameterized.                .
*   1 Feb 2014 - Complete the renaming of files and variables,       .
*      and create first version of the macro, with no parameters.    .
*      Remove it to file  M2MnyX0.SPS, put the test data here and    .
*      set up for test.                                              .
*   Under name 2014-xx-xx Ristow - Many-to-many merge in SPSS.SPS:   .
*      Save under this name, with spaces around the hyphen following .
*      the author's name.  Looks better, matches earlier practice    .
*      Test with macro <!>M2MnyX0, taking no arguments.              .
*    3 Feb 2014 - Drop creation of datasets @Left and @Right, which  .
*      were for testing open code or the macro without parameters.   .
*      Set up to work with production macro <!>M2MnyX.               .
*   Under name 2014-02-03 Ristow - Many-to-many merge in SPSS.SPS:   .

*  .................   Directory holding code  ..................... .

FILE HANDLE CodePlace
      NAME='C:\Documents and Settings\Richard\My Documents' +
             '\Technical\spssx-l\Z-2014'.

* .................    Working file, required for XSAVE logic  ..... .

FILE HANDLE Unroll
  /NAME='C:\Documents and Settings\Richard\My Documents'              +
          '\Temporary\SPSS\'                                          +
          '2014-xx-xx Ristow - Many-to-many merge in SPSS'            +
          ' - '                                                       +
          'Scratch1.SAV'.

*  .................   Test data               ..................... .


DATA LIST FIXED
    /Category    04-16 (A)
     School      17-38 (A).
BEGIN DATA
    Art          RI School of Design
    Art          Metropolitan Museum
    Culinary     Johnson & Wales
    Liberal Arts Oberlin
    Liberal Arts Antioch
    Liberal Arts Swarthmore
    Technical    MIT
    Technical    Cal Tech
    Technical    ETH
END DATA.
DATASET NAME     Schools.
.  /*-- LIST /*-*/.

DATA LIST FIXED
    /Category    04-16 (A)
     Subject     17-28 (A).
BEGIN DATA
    Art          Painting
    Art          Drawing
    Art          Sculpture
    Liberal Arts Literature
    Liberal Arts Philosophy
    Technical    Calculus
    Technical    Physics
    Technical    Chemistry
    Technical    Engineering
END DATA.
DATASET NAME    Subjects.
.  /*-- LIST /*-*/.

*  Load definition of <!>M2MnyX:                                    .
INSERT FILE='CodePlace\M2MnyX1.SPS'.

DATASET ACTIVATE Schools  WINDOW=FRONT.
LIST.

DATASET ACTIVATE Subjects WINDOW=FRONT.
LIST.
!M2MnyX  Left    = Schools/
          Right   = Subjects/
          BYvar   = Category/
          WrkFile = Unroll.

LIST.
================================================
          III.B Macro definition file
================================================
*  C:\Documents and Settings\Richard\My Documents  .
*    \Technical\spssx-l\Z-2014\                    .
*    M2MnyX1.SPS                                   .


*  For insertion in 2014-xx-xx Ristow-Many-to-many merge in SPSS.SPS .
*  or successor (2014-02-03 Ristow-Many-to-many merge in SPSS.SPS)   .

*  This is a production form of the macro with XSAVE logic.          .

*  ****************************************************************  .
*  Macro for many-to-many merge of two datasets or .SAV files,       .
*  using XSAVE logic.                                                .
*  Takes four arguments, all required, all terminated by '/':        .
*  Left     Name of one of the datasets, or saved file, to be joined .
*  Right    Name of the other  dataset,  or saved file, to be joined .
*  BYvar    Variable or variables on which  the two are to be joined..
*           Requirements are as for BY variables in MATCH FILES:     .
*           all variables must be present and of the same type in    .
*           both inputs; and both must be sorted in ascending order  .
*           on the BYvar variables.                                  .
*  WrkFile  Name or handle of disk file in which to 'unroll' the     .
*           keys in Left, joined to the data in Right.               .
*  ****************************************************************  .

DEFINE !M2MnyX(Left    = !CHAREND('/')
               /Right   = !CHAREND('/')
               /BYvar   = !CHAREND('/')
               /WrkFile = !CHAREND('/'))

.  ECHO ' '.
.  ECHO               'Arguments to  <!>M2MnyX1:'.
.  ECHO !QUOTE(!CONCAT('"Left"  input:   ',!Left)).
.  ECHO !QUOTE(!CONCAT('"Right" input:   ',!Right)).
.  ECHO !QUOTE(!CONCAT('Key variable(s): ',!BYvar)).
.  ECHO !QUOTE(!CONCAT('Working file:    ',!WrkFile)).
.  ECHO 'After the macro finishes execution, datasets'.
.  ECHO '"@LftIDd" and "@LftWide" may be erased.'.


*  ...   I.       Add 'Left#', record number within <!>BYvar groups, .
*                 to Left-hand file.                                 .

ADD FILES
     /FILE =!Left
     /BY    !BYvar
     /FIRST= @NewGrp.

NUMERIC    Left#   (F3).

DO IF      @NewGrp.
.  COMPUTE Left# = 1.
ELSE.
.  COMPUTE Left# = LAG(Left#) + 1.
END IF.

DATASET NAME @LftIDd    WINDOW=FRONT.


*  ...   II.      Restructure Left input from 'long' to 'wide',      .
*        /*     . Keep only the group key (!BYvar) and record        .
*                 sequence number within group (Left#).              .
*               . Represent the set of values of Left# by the        .
*                 largest value. Since the values of Left# are       .
*                 consecutive integers starting with 1, the          .
*                 maximum value characterizes the set.               .

DATASET ACTIVATE  @LftIDd  WINDOW=ASIS.
DATASET DECLARE   @LftWide.

AGGREGATE OUTFILE=@LftWide
    /BREAK = !BYvar
    /@MxLkey 'Largest value of Left# for this key' = MAX(Left#).


*  ...   III.     Merge 'wide' Left input with Right input           .
*                 This is a one-to-many merge.                       .

MATCH FILES
    /TABLE=@LftWide
    /FILE =!Right
    /BY    !BYvar.

*  ...   IV.A      Return to a 'long' structure, with all Left       .
*                  records joined with all Right records.            .

*       The result is the complete many-to-many merged file, except  .
*       for the substantive data from the Left input.                .

NUMERIC Left# (F3).
LOOP    Left# = 1 TO @MxLkey.
.  XSAVE OUTFILE=!WrkFile
    /DROP=@MxLKey
    /KEEP=!BYvar Left# ALL.
END LOOP.
EXECUTE /* required */.

*  ...   IV.B.    ... and go to 'Left file first' order.             .

GET  FILE=!WrkFile.
SORT CASES BY !BYvar Left#.


*  ...   V.       Re-merge with !Left, adding the data, completing    .
*                 the join.                                           .

MATCH FILES
    /TABLE= @LftIDd
    /FILE = *
    /BY     !BYvar Left#
    /DROP = @NewGrp
    /KEEP = !BYvar Left# ALL.

DATASET NAME Joined WINDOW=FRONT.

!ENDDEFINE.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Many-to-many merge in SPSS

Andy W
Thanks for posting this Richard, although in my opinion saving the metadata is not a big deal - APPLY DICTIONARY could rectify that pretty easily.
Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/