SPSSX Discussion

Many-to-many merge in SPSS

Classic

List

Threaded

9 messages Options

Richard Ristow

Many-to-many merge in SPSS

From time to time, the question has come up, how to do a many-to-many merge in SPSS: given two files with key variables (which I'll call the 'group key') and, potentially, several records per group-key value in each file, join every record in each file with every record having the same group-key value in the other file. Up to now, no satisfactory solution had been found (see Appendix II at the end of this posting).

Here, I give code and a test run for what I think is a complete solution. It's a version of the long-wide-long method (see Appendix II), using
. David Marso's insight that only record keys from the file being converted to 'wide' need be kept in the 'wide' file(*), and
. my own, that if the record keys are integers consecutive from 1, a set of keys is characterized by its largest value.

(*) David Marso is <[hidden email]>. His code, for the special case of taking the Cartesian product of two files, is in SPSSX-L posting
     Date:    Thu, 19 Nov 2009 11:27:59 -0500                    From:    Lakshmikanth <[hidden email]>      Subject: Cartesian product in SPSS      To:      [hidden email]     A reformatted version, with test run, is in
     Date:    Tue, 24 Nov 2009 00:46:30 -0500      From:    Richard Ristow <[hidden email]>      Subject: Re: Cartesian product in SPSS      To:      [hidden email] Test and demonstration: In the following, 'Unrolled' cannot be a dataset; it must be a disk file or a file handle to a disk file. Test data - File 1: [File1] Category      School Art           RISD Art           Metropolitan Museum Liberal Arts Oberlin Liberal Arts Antioch Liberal Arts Swarthmore Technical     MIT Technical     ETH Number of cases read: 7    Number of cases listed: 7File 2:
[File2] Category      Subject Art           Painting Art           Drawing Art           Sculpture Liberal Arts Literature Liberal Arts Philosophy Technical     Calculus Technical     Physics Technical     Chemistry Technical     Engineering Number of cases read: 9    Number of cases listed: 9Demonstration run, with code, intermediate files, and output:

* ...   I.       Identify File1 records, within categories, by      . *                 integers sequentially from 1.                      . *        (Because the category variable is a string, this is done in . *        transformation code; CREATE and RANK wouldn't work.)        . DATASET ACTIVATE File1     WINDOW=ASIS. DATASET COPY     File1_IDd WINDOW=MINIMIZED. DATASET ACTIVATE File1_IDd WINDOW=FRONT. NUMERIC    ID1 (F3). DO IF    $CASENUM EQ 1. . COMPUTE ID1 = 1. ELSE IF Category NE LAG(Category). . COMPUTE ID1 = 1. ELSE. . COMPUTE ID1 = LAG(ID1) + 1. END IF. * ...   II.      Restructure File1 from 'long' to 'wide', except    . *               . Keep only the category, and identifying number ID1 . *               . Represent the set of values of ID1 by the largest . *                 value. Since the values of ID1 are consecutive     . *                 integers starting with 1, that's unambiguous.      . DATASET DECLARE   File1_Wide. DATASET ACTIVATE File1_IDd WINDOW=ASIS. AGGREGATE OUTFILE=File1_Wide    /BREAK=Category    /All_IDs 'Representing all values of ID1' = MAX(ID1). . /**/ DATASET ACTIVATE File1_Wide WINDOW=FRONT               /*-*/. * -**- This, believe it or not, is the 'wide' form of File1   -***-. . /**/ LIST                                                   /*-*/. List |-----------------------------|---------------------------| |Output Created               |26-NOV-2009 14:13:19       | |-----------------------------|---------------------------| [File1_Wide] Category      All_IDs Art               2 Liberal Arts      3 Technical         2 Number of cases read: 3    Number of cases listed: 3 * ...   III.     Merge 'wide' File1 with File2                      . *                 This is a one-to-many merge.                       . MATCH FILES    /TABLE=File1_Wide    /FILE =File2    /BY Category. . /**/ LIST                                                   /*-*/. List |-----------------------------|---------------------------| |Output Created               |26-NOV-2009 14:13:20       | |-----------------------------|---------------------------| Category      All_IDs Subject Art               2   Painting Art               2   Drawing Art               2   Sculpture Liberal Arts      3   Literature Liberal Arts      3   Philosophy Technical         2   Calculus Technical         2   Physics Technical         2   Chemistry Technical         2   Engineering Number of cases read: 9    Number of cases listed: 9 * ...   IV.A.    Return to a 'long' structure, with all File1       . *                 records joined with all File2 records.             . *       The result is the complete many-to-many merged file, except . *       for (oh, yes) the *data* from File1.                         . *       (This is where XSAVE will work, and VARSTOCASES won't.)      . NUMERIC ID1 (F3). LOOP    ID1 = 1 TO ALL_IDs. . XSAVE OUTFILE=Unrolled    /KEEP=Category ID1 ALL. END LOOP. EXECUTE /* required */. * ...   IV.B.    ... and go to 'file 1 first' order.                . GET FILE = Unrolled. DATASET NAME     ManyToMany WINDOW=FRONT. SORT CASES BY    Category ID1. * ...   V.       Re-merge with File1, adding the data, completing   . *                 the job.                                           . MATCH FILES    /TABLE = File1_IDd    /FILE = *    /BY      Category ID1    /KEEP = Category ID1 ALL. LISt. List |-----------------------------|---------------------------| |Output Created               |26-NOV-2009 14:13:22       | |-----------------------------|---------------------------| [ManyToMany] C:\Documents and Settings\Richard\My Documents \Temporary\SPSS\ 2009-11-24 Ristow - Many to many merge in SPSS - UNROLL.SAV Category      ID1 School                 All_IDs Subject Art             1 RISD                       2   Painting Art             1 RISD                       2   Drawing Art             1 RISD                       2   Sculpture Art             2 Metropolitan Museum        2   Painting Art             2 Metropolitan Museum        2   Drawing Art             2 Metropolitan Museum        2   Sculpture Liberal Arts    1 Oberlin                    3   Literature Liberal Arts    1 Oberlin                    3   Philosophy Liberal Arts    2 Antioch                    3   Literature Liberal Arts    2 Antioch                    3   Philosophy Liberal Arts    3 Swarthmore                 3   Literature Liberal Arts    3 Swarthmore                 3   Philosophy Technical       1 MIT                        2   Calculus Technical       1 MIT                        2   Physics Technical       1 MIT                        2   Chemistry Technical       1 MIT                        2   Engineering Technical       2 ETH                        2   Calculus Technical       2 ETH                        2   Physics Technical       2 ETH                        2   Chemistry Technical       2 ETH                        2   Engineering Number of cases read: 20    Number of cases listed: 20 =============================== APPENDIX I: Test data and code =============================== * C:\Documents and Settings\Richard\My Documents                    . *    \Technical\spssx-l\Z-2009d\                                     . *    2009-11-26 Ristow - Many-to-many merge in SPSS.SPS             . * To support a posting with title "Many-to-many merge in SPSS"      . * The question of doing many-to-many merges in SPSS, comes up from . * time to time. No fully satisfactory solution has been posted.     . *                                                                    . * The most promising method suggested has been "long-wide-long".    . * That is, all records in, say, file 1 with the same group          . * identifier are brought into a single record, typically by         . * CASESTOVARS. That changes the problem to a one-to-many merge,     . * using MATCH FILES with the wide version as a TABLE file. The      . * result, which is 'long' in file 2 and 'wide' in file 1, is then   . * unrolled to restore file 1, typically using VARSTOCASES. That     . * gives, for each file 2 record, a complete restored set of file 1 . * records, completing the match.                                    . *                                                                    . * There are two distinct difficulties. First, if the file has many . * variables, its wide version may have a very large number. And,    . * the VARSTOCASES is data-dependent: the largest number of cases    . * in any file 2 group, and the list of variables to be restored,    . * must be in the VARSTOCASES code. (A solution, of course, is to    . * have Python look at the data dictionary and generate the          . * VARSTOCASES -- but that's probably not trivial, as Python has to . * infer which sets of variables were generated from CASESTOVARS.)   . *                                                                    . * A variant is to include only the identifiers of the file 1        . * records in the 'wide' file 1, and re-attach the data variables    . * later. That avoids creating a file with a great many variables,   . * and does not require a VARSTOCASES naming the full list of        . * variables to be restored. The Cartesian-product code by David     . * Marso(*) works this way. However, it still requires that the     . * number of records in one of the files, or rather the largest      . * number in any group in that file, be in the code, as the bound of . * the vector of identifiers.                                        . *                                                                    . * (*) David Marso's code is in SPSSX-L posting                     . *       Date:    Thu, 19 Nov 2009 11:27:59 -0500                     . *       From:    Lakshmikanth <[hidden email]>      . *       Subject: Cartesian product in SPSS                           . *       To:      [hidden email]                            . *       A version reformatted by R. Ristow is in                     . *       Date:    Tue, 24 Nov 2009 00:46:30 -0500                     . *       From:    Richard Ristow <[hidden email]>            . *       Subject: Re: Cartesian product in SPSS                       . *       To:      [hidden email]                            . *                                                                    . * However, if the keys within groups are successive positive        . * integers, the 'wide' form of the list of IDs can be a single      . * number: the largest key value in the group. ('Unrolling' this     . * requires XSAVE logic, instead of VARSTOCASES.)                    . *                                                                    . * That's what's done in the code below, assigning integer record    . * IDs and applying 'log-wide-long' to File 1.                       . * ................................................................. . * .................   Scratch file            ..................... . * (Because it's written by XSAVE, it cannot be a dataset.)          . FILE HANDLE Unrolled /NAME='C:\Documents and Settings\Richard\My Documents'              +          '\Temporary\SPSS\'                                          +          '2009-11-24 Ristow - Many to many merge in SPSS'            +          ' - '                                                       +          'UNROLL.SAV'. * .................   Test data               ..................... . DATA LIST FIXED    /Category    04-16 (A)     School      17-38 (A). BEGIN DATA    Art          RISD    Art          Metropolitan Museum    Liberal Arts Oberlin    Liberal Arts Antioch    Liberal Arts Swarthmore    Technical    MIT    Technical    ETH END DATA. DATASET NAME     File1. . /*-- LIST /*-*/. DATA LIST FIXED    /Category    04-16 (A)     Subject     17-28 (A). BEGIN DATA    Art          Painting    Art          Drawing    Art          Sculpture    Liberal Arts Literature    Liberal Arts Philosophy    Technical    Calculus    Technical    Physics    Technical    Chemistry    Technical    Engineering END DATA. DATASET NAME     File2. . /*-- LIST /*-*/. * .................   Post after this point   ..................... . * ................................................................. . DATASET ACTIVATE File1     WINDOW=ASIS. LIST. DATASET ACTIVATE File2     WINDOW=ASIS. LIST. * ...   I.       Identify File1 records, within categories, by      . *                 integers sequentially from 1.                      . *        (Because the category variable is a string, this is done in . *        transformation code; CREATE and RANK wouldn't work.)        . DATASET ACTIVATE File1     WINDOW=ASIS. DATASET COPY     File1_IDd WINDOW=MINIMIZED. DATASET ACTIVATE File1_IDd WINDOW=FRONT. NUMERIC    ID1 (F3). DO IF    $CASENUM EQ 1. . COMPUTE ID1 = 1. ELSE IF Category NE LAG(Category). . COMPUTE ID1 = 1. ELSE. . COMPUTE ID1 = LAG(ID1) + 1. END IF. * ...   II.      Restructure File1 from 'long' to 'wide', except    . *               . Keep only the category, and identifying number ID1 . *               . Represent the set of values of ID1 by the largest . *                 value. Since the values of ID1 are consecutive     . *                 integers starting with 1, that's unambiguous.      . DATASET DECLARE   File1_Wide. DATASET ACTIVATE File1_IDd WINDOW=ASIS. AGGREGATE OUTFILE=File1_Wide    /BREAK=Category    /All_IDs 'Representing all values of ID1' = MAX(ID1).    . /**/ DATASET ACTIVATE File1_Wide WINDOW=FRONT               /*-*/. * -**- This, believe it or not, is the 'wide' form of File1   -***-. . /**/ LIST                                                   /*-*/. * ...   III.     Merge 'wide' File1 with File2                      . *                 This is a one-to-many merge.                       . MATCH FILES    /TABLE=File1_Wide    /FILE =File2    /BY Category.    . /**/ LIST                                                   /*-*/. * ...   IV.A.    Return to a 'long' structure, with all File1       . *                 records joined with all File2 records.             . *       The result is the complete many-to-many merged file, except . *       for (oh, yes) the *data* from File1.                         . *       (This is where XSAVE will work, and VARSTOCASES won't.)      . NUMERIC ID1 (F3). LOOP    ID1 = 1 TO ALL_IDs. . XSAVE OUTFILE=Unrolled    /KEEP=Category ID1 ALL. END LOOP. EXECUTE /* required */. * ...   IV.B.    ... and go to 'file 1 first' order.                . GET FILE = Unrolled. DATASET NAME     ManyToMany WINDOW=FRONT. SORT CASES BY    Category ID1. * ...   V.       Re-merge with File1, adding the data, completing   . *                 the job.                                           . MATCH FILES    /TABLE = File1_IDd    /FILE = *    /BY      Category ID1    /KEEP = Category ID1 ALL.    LISt. ================================ APPENDIX II: History and methods ================================The best ideas have been some version of what I'll call the 'long-wide-long' method. In that method, in one of the files, the records for every key value (i'll cal are combined into a single record ('wide' form), using CASESTOVARS or an equivalent. Then there's only one record per key value in that file, and the merge can be done as one-to-many, using /TABLE in MATCH FILES. Finally, the data from the first file is re-converted ('unrolled') to 'long' form, using VARSTOCASES or some equivalent.

I don't think anybody's got that working satisfactorily. First, the 'wide' File 1 may have so many variables that SPSS can't handle it, or can't handle it efficiently. Second, and more crucial, both the list of variables in File 1, and the maximum number of records in File 1 for any key value, must be coded in the VARSTOCASES statement. One might write a Python program to analyze the data dictionary and generate the VARSTOCASES statement, but I think even that would be awkward.

David Marso of SPSS, Inc., wrote an interesting variant (see citations above). If the file to be made 'wide', say File 1, has a record-key variable that uniquely identifies records within each (group) key value, only the record-key values need be combined into a 'wide' record. After the one-to-many merge with File 2, and unrolling back to long form, the File 1 data can then be re-attached by a MATCH FILE by group-key and record-key, with File 1 as a TABLE.

David Marso's code still requires a vector to hold the record-key values from File 1; and the length of that vector, i.e. the maximum number of records in File 1 for any group-key value, must be given in the code.

However, if the record key consists of integers sequentially from 1 within each group, the set of keys may be represented by a single integer, namely the largest value. Then, the set of variables that need be named is not data-dependent at all, and general-purpose code is possible.
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

fuigb

Re: Many-to-many merge in SPSS

Has anyone had luck with this code? I understand what it is designed to do, but it does not run and too far short of being a real coder to debug it. If anyone charitable has debugged this then please post.

If this can actually be made to function the original author deserves a medal!

David Marso

Re: Many-to-many merge in SPSS

Administrator

In reply to this post by Richard Ristow

These days I'd likely approach it with MATRIX (hint KRON).
Note BIG matrix jobs can be partitioned with SPLIT FILE in batch mode (in case XG of RAM doesn't suffice).
It's really late here. I shall scrutinize 2mrow.
Thanks for the mention. All too often people omit that courtesy!
-----

Richard Ristow wrote

From time to time, the question has come up, how to do a many-to-many
merge in SPSS: given two files with key variables (which I'll call the
'group key') and, potentially, several records per group-key value in
each file, join every record in each file with every record having the
same group-key value in the other file. Up to now, no satisfactory
solution had been found (see Appendix II at the end of this
posting).
Here, I give code and a test run for what I think is a complete solution.
It's a version of the long-wide-long method (see Appendix II), using

. David Marso's insight that only record keys from the file being
converted to 'wide' need be kept in the 'wide' file(*), and
. my own, that if the record keys are integers consecutive from 1, a set
of keys is characterized by its largest value.
(*) David Marso is <[hidden email]>. His code, for the
special case of taking the Cartesian product of two files, is in SPSSX-L
posting
     Date:    Thu, 19 Nov 2009
11:27:59
-0500

     From:    Lakshmikanth
<[hidden email]>
     Subject: Cartesian product in SPSS
     To:
[hidden email]
     A reformatted version, with test run, is
in

     Date:    Tue, 24 Nov 2009
00:46:30 -0500
     From:    Richard Ristow
<[hidden email]>
     Subject: Re: Cartesian product in SPSS
     To:
[hidden email]

Test and demonstration:
In the following, 'Unrolled' cannot be a dataset; it must be a disk
file or a file handle to a disk file.
Test data - File 1:
[File1]
Category      School
Art           RISD
Art
Metropolitan Museum
Liberal Arts Oberlin
Liberal Arts Antioch
Liberal Arts Swarthmore
Technical     MIT
Technical     ETH
Number of cases read: 7    Number of cases
listed: 7
File 2:
[File2]
Category      Subject
Art
Painting
Art
Drawing
Art
Sculpture
Liberal Arts Literature
Liberal Arts Philosophy
Technical     Calculus
Technical     Physics
Technical     Chemistry
Technical     Engineering
Number of cases read: 9    Number of cases
listed: 9
Demonstration run, with code, intermediate files, and
output:

* ...
I.       Identify File1 records, within
categories, by      .
*
integers sequentially from
1.
.
*        (Because the category
variable is a string, this is done in .
*        transformation code; CREATE
and RANK wouldn't work.)
.
DATASET ACTIVATE File1     WINDOW=ASIS.
DATASET COPY     File1_IDd WINDOW=MINIMIZED.
DATASET ACTIVATE File1_IDd WINDOW=FRONT.
NUMERIC    ID1 (F3).
DO IF    $CASENUM EQ 1.
. COMPUTE ID1 = 1.
ELSE IF Category NE LAG(Category).
. COMPUTE ID1 = 1.
ELSE.
. COMPUTE ID1 = LAG(ID1) + 1.
END IF.

* ...   II.      Restructure
File1 from 'long' to 'wide', except    .
*
. Keep only the category, and identifying number ID1 .
*
. Represent the set of values of ID1 by the largest .
*
value. Since the values of ID1 are consecutive
.
*
integers starting with 1, that's
unambiguous.      .
DATASET DECLARE   File1_Wide.
DATASET ACTIVATE File1_IDd WINDOW=ASIS.
AGGREGATE OUTFILE=File1_Wide
   /BREAK=Category
   /All_IDs 'Representing all values of ID1' =
MAX(ID1).

. /**/ DATASET ACTIVATE File1_Wide
WINDOW=FRONT
/*-*/.
* -**- This, believe it or not, is the 'wide' form of
File1   -***-.
. /**/
LIST
/*-*/.
List
|-----------------------------|---------------------------|
|Output
Created
|26-NOV-2009 14:13:19       |
|-----------------------------|---------------------------|
[File1_Wide]

Category      All_IDs
Art
2
Liberal Arts      3
Technical         2
Number of cases read: 3    Number of cases
listed: 3

* ...   III.     Merge 'wide' File1
with
File2
.
*
This is a one-to-many
merge.
.
MATCH FILES
   /TABLE=File1_Wide
   /FILE =File2
   /BY Category.
. /**/
LIST
/*-*/.
List
|-----------------------------|---------------------------|
|Output
Created
|26-NOV-2009 14:13:20       |
|-----------------------------|---------------------------|
Category      All_IDs Subject
Art
2   Painting
Art
2   Drawing
Art
2   Sculpture
Liberal Arts      3   Literature
Liberal Arts      3   Philosophy
Technical         2
Calculus
Technical         2
Physics
Technical         2
Chemistry
Technical         2
Engineering
Number of cases read: 9    Number of cases
listed: 9

* ...   IV.A.    Return to a 'long'
structure, with all File1       .
*
records joined with all File2
records.
.
*       The result is the complete
many-to-many merged file, except .
*       for (oh, yes) the *data* from
File1.
.
*       (This is where XSAVE will work, and
VARSTOCASES won't.)      .
NUMERIC ID1 (F3).
LOOP    ID1 = 1 TO ALL_IDs.
. XSAVE OUTFILE=Unrolled
   /KEEP=Category ID1 ALL.
END LOOP.
EXECUTE /* required */.
* ...   IV.B.    ... and go to 'file 1
first'
order.
.
GET FILE = Unrolled.
DATASET NAME     ManyToMany WINDOW=FRONT.
SORT CASES BY    Category ID1.

* ...   V.       Re-merge
with File1, adding the data, completing   .
*
the
job.
.
MATCH FILES
   /TABLE = File1_IDd
   /FILE = *
   /BY      Category ID1
   /KEEP = Category ID1 ALL.
LISt.
List
|-----------------------------|---------------------------|
|Output
Created
|26-NOV-2009 14:13:22       |
|-----------------------------|---------------------------|
[ManyToMany]
C:\Documents and Settings\Richard\My Documents
\Temporary\SPSS\
2009-11-24 Ristow - Many to many merge in SPSS - UNROLL.SAV

Category      ID1
School
All_IDs Subject
Art
1
RISD
2   Painting
Art
1
RISD
2   Drawing
Art
1
RISD
2   Sculpture
Art
2 Metropolitan Museum
2   Painting
Art
2 Metropolitan Museum
2   Drawing
Art
2 Metropolitan Museum
2   Sculpture
Liberal Arts    1
Oberlin
3   Literature
Liberal Arts    1
Oberlin
3   Philosophy
Liberal Arts    2
Antioch
3   Literature
Liberal Arts    2
Antioch
3   Philosophy
Liberal Arts    3
Swarthmore
3   Literature
Liberal Arts    3
Swarthmore
3   Philosophy
Technical       1
MIT
2   Calculus
Technical       1
MIT
2   Physics
Technical       1
MIT
2   Chemistry
Technical       1
MIT
2   Engineering
Technical       2
ETH
2   Calculus
Technical       2
ETH
2   Physics
Technical       2
ETH
2   Chemistry
Technical       2
ETH
2   Engineering
Number of cases read: 20    Number of cases
listed: 20
===============================
APPENDIX I: Test data and code
===============================
* C:\Documents and Settings\Richard\My
Documents
.
*
\Technical\spssx-l\Z-2009d\
.
*    2009-11-26 Ristow - Many-to-many merge in
SPSS.SPS
.

* To support a posting with title "Many-to-many merge in
SPSS"      .

* The question of doing many-to-many merges in SPSS, comes up
from .
* time to time. No fully satisfactory solution has been
posted.     .
*
.
* The most promising method suggested has been
"long-wide-long".    .
* That is, all records in, say, file 1 with the same
group          .
* identifier are brought into a single record, typically
by         .
* CASESTOVARS. That changes the problem to a one-to-many
merge,     .
* using MATCH FILES with the wide version as a TABLE file.
The      .
* result, which is 'long' in file 2 and 'wide' in file 1, is
then   .
* unrolled to restore file 1, typically using VARSTOCASES.
That     .
* gives, for each file 2 record, a complete restored set of file
1 .
* records, completing the
match.
.
*
.
* There are two distinct difficulties. First, if the file has
many .
* variables, its wide version may have a very large number.
And,    .
* the VARSTOCASES is data-dependent: the largest number of
cases    .
* in any file 2 group, and the list of variables to be
restored,    .
* must be in the VARSTOCASES code. (A solution, of course, is
to    .
* have Python look at the data dictionary and generate
the          .
* VARSTOCASES -- but that's probably not trivial, as Python has
to .
* infer which sets of variables were generated from
CASESTOVARS.)   .
*
.
* A variant is to include only the identifiers of the file
1        .
* records in the 'wide' file 1, and re-attach the data
variables    .
* later. That avoids creating a file with a great many
variables,   .
* and does not require a VARSTOCASES naming the full list
of        .
* variables to be restored. The Cartesian-product code by
David     .
* Marso(*) works this way. However, it still requires that
the     .
* number of records in one of the files, or rather the
largest      .
* number in any group in that file, be in the code, as the bound of
.
* the vector of
identifiers.
.
*
.
* (*) David Marso's code is in SPSSX-L
posting
.
*       Date:    Thu, 19 Nov
2009 11:27:59
-0500
.
*       From:
Lakshmikanth
<[hidden email]>
.
*       Subject: Cartesian product in
SPSS
.
*       To:
[hidden email]
.
*       A version reformatted by R. Ristow
is
in
.
*       Date:    Tue, 24 Nov
2009 00:46:30
-0500
.
*       From:    Richard
Ristow
<[hidden email]>
.
*       Subject: Re: Cartesian product in
SPSS
.
*       To:
[hidden email]
.
*
.
* However, if the keys within groups are successive
positive        .
* integers, the 'wide' form of the list of IDs can be a
single      .
* number: the largest key value in the group. ('Unrolling'
this     .
* requires XSAVE logic, instead of
VARSTOCASES.)
.
*
.
* That's what's done in the code below, assigning integer
record    .
* IDs and applying 'log-wide-long' to File
1.
.

* .................................................................
.
* .................   Scratch
file
..................... .
* (Because it's written by XSAVE, it cannot be a
dataset.)
.
FILE HANDLE Unrolled
/NAME='C:\Documents and Settings\Richard\My
Documents'
+

'\Temporary\SPSS\'
+
         '2009-11-24 Ristow -
Many to many merge in
SPSS'
+
         ' -
'
+
         'UNROLL.SAV'.

* .................   Test
data
..................... .

DATA LIST FIXED
   /Category    04-16 (A)
    School      17-38 (A).
BEGIN DATA
   Art
RISD
   Art
Metropolitan Museum
   Liberal Arts Oberlin
   Liberal Arts Antioch
   Liberal Arts Swarthmore
   Technical    MIT
   Technical    ETH
END DATA.
DATASET NAME     File1.
. /*-- LIST /*-*/.
DATA LIST FIXED
   /Category    04-16 (A)
    Subject     17-28 (A).
BEGIN DATA
   Art
Painting
   Art
Drawing
   Art
Sculpture
   Liberal Arts Literature
   Liberal Arts Philosophy
   Technical    Calculus
   Technical    Physics
   Technical    Chemistry
   Technical    Engineering
END DATA.
DATASET NAME     File2.
. /*-- LIST /*-*/.
* .................   Post after this point
..................... .
* .................................................................
.
DATASET ACTIVATE File1     WINDOW=ASIS.
LIST.
DATASET ACTIVATE File2     WINDOW=ASIS.
LIST.
* ...   I.       Identify
File1 records, within categories, by      .
*
integers sequentially from
1.
.
*        (Because the category
variable is a string, this is done in .
*        transformation code; CREATE
and RANK wouldn't work.)
.
DATASET ACTIVATE File1     WINDOW=ASIS.
DATASET COPY     File1_IDd WINDOW=MINIMIZED.
DATASET ACTIVATE File1_IDd WINDOW=FRONT.

NUMERIC    ID1 (F3).
DO IF    $CASENUM EQ 1.
. COMPUTE ID1 = 1.
ELSE IF Category NE LAG(Category).
. COMPUTE ID1 = 1.
ELSE.
. COMPUTE ID1 = LAG(ID1) + 1.
END IF.

* ...   II.      Restructure
File1 from 'long' to 'wide', except    .
*
. Keep only the category, and identifying number ID1 .
*
. Represent the set of values of ID1 by the largest .
*
value. Since the values of ID1 are consecutive
.
*
integers starting with 1, that's
unambiguous.      .
DATASET DECLARE   File1_Wide.
DATASET ACTIVATE File1_IDd WINDOW=ASIS.
AGGREGATE OUTFILE=File1_Wide
   /BREAK=Category
   /All_IDs 'Representing all values of ID1' =
MAX(ID1).

. /**/ DATASET ACTIVATE File1_Wide
WINDOW=FRONT
/*-*/.
* -**- This, believe it or not, is the 'wide' form of
File1   -***-.
. /**/
LIST
/*-*/.

* ...   III.     Merge 'wide' File1
with
File2
.
*
This is a one-to-many
merge.
.
MATCH FILES
   /TABLE=File1_Wide
   /FILE =File2
   /BY Category.

. /**/
LIST
/*-*/.

* ...   IV.A.    Return to a 'long'
structure, with all File1       .
*
records joined with all File2
records.
.
*       The result is the complete
many-to-many merged file, except .
*       for (oh, yes) the *data* from
File1.
.
*       (This is where XSAVE will work, and
VARSTOCASES won't.)      .
NUMERIC ID1 (F3).
LOOP    ID1 = 1 TO ALL_IDs.
. XSAVE OUTFILE=Unrolled
   /KEEP=Category ID1 ALL.
END LOOP.
EXECUTE /* required */.
* ...   IV.B.    ... and go to 'file 1
first'
order.
.
GET FILE = Unrolled.
DATASET NAME     ManyToMany WINDOW=FRONT.
SORT CASES BY    Category ID1.
* ...   V.       Re-merge
with File1, adding the data, completing   .
*
the
job.
.
MATCH FILES
   /TABLE = File1_IDd
   /FILE = *
   /BY      Category ID1
   /KEEP = Category ID1 ALL.

LISt.
================================
APPENDIX II: History and methods
================================
The best ideas have been some version of what I'll call the
'long-wide-long' method. In that method, in one of the files, the records
for every key value (i'll cal are combined into a single record ('wide'
form), using CASESTOVARS or an equivalent. Then there's only one record
per key value in that file, and the merge can be done as one-to-many,
using /TABLE in MATCH FILES. Finally, the data from the first file is
re-converted ('unrolled') to 'long' form, using VARSTOCASES or some
equivalent.
I don't think anybody's got that working satisfactorily. First, the
'wide' File 1 may have so many variables that SPSS can't handle it, or
can't handle it efficiently. Second, and more crucial, both the list of
variables in File 1, and the maximum number of records in File 1 for any
key value, must be coded in the VARSTOCASES statement. One might write a
Python program to analyze the data dictionary and generate the
VARSTOCASES statement, but I think even that would be awkward.
David Marso of SPSS, Inc., wrote an interesting variant (see citations
above). If the file to be made 'wide', say File 1, has a record-key
variable that uniquely identifies records within each (group) key value,
only the record-key values need be combined into a 'wide' record. After
the one-to-many merge with File 2, and unrolling back to long form, the
File 1 data can then be re-attached by a MATCH FILE by group-key and
record-key, with File 1 as a TABLE.
David Marso's code still requires a vector to hold the record-key values
from File 1; and the length of that vector, i.e. the maximum number of
records in File 1 for any group-key value, must be given in the
code.
However, if the record key consists of integers sequentially from 1
within each group, the set of keys may be represented by a single
integer, namely the largest value. Then, the set of variables that need
be named is not data-dependent at all, and general-purpose code is
possible.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

fuigb

Re: Many-to-many merge in SPSS

After a night's sleep I realized what I needed to tweak and so SUCCESS! You guys are geniuses. For years and at various jobs I've fought the good fight for SPSS against the true-believers of SAS: "SPSS cannot accomplish what is easy in SAS." You're helping me to roll back the naysayers! Again, thank you.

David Marso

Re: Many-to-many merge in SPSS

Administrator

In reply to this post by David Marso

Here is a solution to the many to many merge using 4 lines of MATRIX code ;-)
Note that this only works as is with numeric data!!!
To work with strings, best is to merge the ids and numeric using the following (save strings and ids to separate files or datasets, then do sort and table matches to insert the string data).
Note the Cartesian dataset is sorted by ID02 ID01, so MATCH stringtable2 BY ID02,
SORT result by ID01 then MATCH stringtable1 BT ID01.
I only ran this against two 1000 row tables -> 1,000,000 resulting cases.
Larger data files will likely work fine given sufficient RAM and WORKSPACE settings.
For HUGE files? Contact me for a con$ultation ;-)

*Data simulation here *.
DATASET DECLARE file1.
MATRIX.
SAVE TRUNC(UNIFORM(1000,10)*10)/ OUTFILE file1/ VARIABLES f1x01 TO f1x10.
END MATRIX.
DATASET ACTIVATE file1.
COMPUTE ID01=$CASENUM.
SAVE OUTFILE 'C:\TEMP\file1.sav'.

DATASET DECLARE file2.
MATRIX.
SAVE TRUNC(UNIFORM(1000,10)*10)/ OUTFILE file2/ VARIABLES f2x01 TO f2x10.
END MATRIX.
DATASET ACTIVATE file2.
COMPUTE ID02=$CASENUM.
SAVE OUTFILE 'C:\TEMP\file2.sav'.
/*DATA Simulation DONE */.

/*Actual cartesian match */.
DATASET DECLARE cartesian.
MATRIX .
GET data1 / FILE 'C:\TEMP\file1.sav' / VARIABLES ID01 f1x01 TO f1x10.
GET data2 / FILE 'C:\TEMP\file2.sav' / VARIABLES ID02 f2x01 TO f2x10.
COMPUTE d1d2={KRONEKER(MAKE(NROW(data2),1,1),data1),
KRONEKER(data2,MAKE(NROW(data1),1,1))}.
SAVE d1d2
/ OUTFILE cartesian
/ VARIABLES ID01 f1x01 TO f1x10 ID02 f2x01 TO f2x10 .
END MATRIX.

David Marso wrote

These days I'd likely approach it with MATRIX (hint KRON).
Note BIG matrix jobs can be partitioned with SPLIT FILE in batch mode (in case XG of RAM doesn't suffice).
It's really late here. I shall scrutinize 2mrow.
Thanks for the mention. All too often people omit that courtesy!
-----

Richard Ristow wrote

From time to time, the question has come up, how to do a many-to-many
merge in SPSS: given two files with key variables (which I'll call the
'group key') and, potentially, several records per group-key value in
each file, join every record in each file with every record having the
same group-key value in the other file. Up to now, no satisfactory
solution had been found (see Appendix II at the end of this
posting).
Here, I give code and a test run for what I think is a complete solution.
It's a version of the long-wide-long method (see Appendix II), using

. David Marso's insight that only record keys from the file being
converted to 'wide' need be kept in the 'wide' file(*), and
. my own, that if the record keys are integers consecutive from 1, a set
of keys is characterized by its largest value.
(*) David Marso is <[hidden email]>. His code, for the
special case of taking the Cartesian product of two files, is in SPSSX-L
posting
     Date:    Thu, 19 Nov 2009
11:27:59
-0500

     From:    Lakshmikanth
<[hidden email]>
     Subject: Cartesian product in SPSS
     To:
[hidden email]
     A reformatted version, with test run, is
in

     Date:    Tue, 24 Nov 2009
00:46:30 -0500
     From:    Richard Ristow
<[hidden email]>
     Subject: Re: Cartesian product in SPSS
     To:
[hidden email]

Test and demonstration:
In the following, 'Unrolled' cannot be a dataset; it must be a disk
file or a file handle to a disk file.
Test data - File 1:
[File1]
Category      School
Art           RISD
Art
Metropolitan Museum
Liberal Arts Oberlin
Liberal Arts Antioch
Liberal Arts Swarthmore
Technical     MIT
Technical     ETH
Number of cases read: 7    Number of cases
listed: 7
File 2:
[File2]
Category      Subject
Art
Painting
Art
Drawing
Art
Sculpture
Liberal Arts Literature
Liberal Arts Philosophy
Technical     Calculus
Technical     Physics
Technical     Chemistry
Technical     Engineering
Number of cases read: 9    Number of cases
listed: 9
Demonstration run, with code, intermediate files, and
output:

* ...
I.       Identify File1 records, within
categories, by      .
*
integers sequentially from
1.
.
*        (Because the category
variable is a string, this is done in .
*        transformation code; CREATE
and RANK wouldn't work.)
.
DATASET ACTIVATE File1     WINDOW=ASIS.
DATASET COPY     File1_IDd WINDOW=MINIMIZED.
DATASET ACTIVATE File1_IDd WINDOW=FRONT.
NUMERIC    ID1 (F3).
DO IF    $CASENUM EQ 1.
. COMPUTE ID1 = 1.
ELSE IF Category NE LAG(Category).
. COMPUTE ID1 = 1.
ELSE.
. COMPUTE ID1 = LAG(ID1) + 1.
END IF.

* ...   II.      Restructure
File1 from 'long' to 'wide', except    .
*
. Keep only the category, and identifying number ID1 .
*
. Represent the set of values of ID1 by the largest .
*
value. Since the values of ID1 are consecutive
.
*
integers starting with 1, that's
unambiguous.      .
DATASET DECLARE   File1_Wide.
DATASET ACTIVATE File1_IDd WINDOW=ASIS.
AGGREGATE OUTFILE=File1_Wide
   /BREAK=Category
   /All_IDs 'Representing all values of ID1' =
MAX(ID1).

. /**/ DATASET ACTIVATE File1_Wide
WINDOW=FRONT
/*-*/.
* -**- This, believe it or not, is the 'wide' form of
File1   -***-.
. /**/
LIST
/*-*/.
List
|-----------------------------|---------------------------|
|Output
Created
|26-NOV-2009 14:13:19       |
|-----------------------------|---------------------------|
[File1_Wide]

Category      All_IDs
Art
2
Liberal Arts      3
Technical         2
Number of cases read: 3    Number of cases
listed: 3

* ...   III.     Merge 'wide' File1
with
File2
.
*
This is a one-to-many
merge.
.
MATCH FILES
   /TABLE=File1_Wide
   /FILE =File2
   /BY Category.
. /**/
LIST
/*-*/.
List
|-----------------------------|---------------------------|
|Output
Created
|26-NOV-2009 14:13:20       |
|-----------------------------|---------------------------|
Category      All_IDs Subject
Art
2   Painting
Art
2   Drawing
Art
2   Sculpture
Liberal Arts      3   Literature
Liberal Arts      3   Philosophy
Technical         2
Calculus
Technical         2
Physics
Technical         2
Chemistry
Technical         2
Engineering
Number of cases read: 9    Number of cases
listed: 9

* ...   IV.A.    Return to a 'long'
structure, with all File1       .
*
records joined with all File2
records.
.
*       The result is the complete
many-to-many merged file, except .
*       for (oh, yes) the *data* from
File1.
.
*       (This is where XSAVE will work, and
VARSTOCASES won't.)      .
NUMERIC ID1 (F3).
LOOP    ID1 = 1 TO ALL_IDs.
. XSAVE OUTFILE=Unrolled
   /KEEP=Category ID1 ALL.
END LOOP.
EXECUTE /* required */.
* ...   IV.B.    ... and go to 'file 1
first'
order.
.
GET FILE = Unrolled.
DATASET NAME     ManyToMany WINDOW=FRONT.
SORT CASES BY    Category ID1.

* ...   V.       Re-merge
with File1, adding the data, completing   .
*
the
job.
.
MATCH FILES
   /TABLE = File1_IDd
   /FILE = *
   /BY      Category ID1
   /KEEP = Category ID1 ALL.
LISt.
List
|-----------------------------|---------------------------|
|Output
Created
|26-NOV-2009 14:13:22       |
|-----------------------------|---------------------------|
[ManyToMany]
C:\Documents and Settings\Richard\My Documents
\Temporary\SPSS\
2009-11-24 Ristow - Many to many merge in SPSS - UNROLL.SAV

Category      ID1
School
All_IDs Subject
Art
1
RISD
2   Painting
Art
1
RISD
2   Drawing
Art
1
RISD
2   Sculpture
Art
2 Metropolitan Museum
2   Painting
Art
2 Metropolitan Museum
2   Drawing
Art
2 Metropolitan Museum
2   Sculpture
Liberal Arts    1
Oberlin
3   Literature
Liberal Arts    1
Oberlin
3   Philosophy
Liberal Arts    2
Antioch
3   Literature
Liberal Arts    2
Antioch
3   Philosophy
Liberal Arts    3
Swarthmore
3   Literature
Liberal Arts    3
Swarthmore
3   Philosophy
Technical       1
MIT
2   Calculus
Technical       1
MIT
2   Physics
Technical       1
MIT
2   Chemistry
Technical       1
MIT
2   Engineering
Technical       2
ETH
2   Calculus
Technical       2
ETH
2   Physics
Technical       2
ETH
2   Chemistry
Technical       2
ETH
2   Engineering
Number of cases read: 20    Number of cases
listed: 20
===============================
APPENDIX I: Test data and code
===============================
* C:\Documents and Settings\Richard\My
Documents
.
*
\Technical\spssx-l\Z-2009d\
.
*    2009-11-26 Ristow - Many-to-many merge in
SPSS.SPS
.

* To support a posting with title "Many-to-many merge in
SPSS"      .

* The question of doing many-to-many merges in SPSS, comes up
from .
* time to time. No fully satisfactory solution has been
posted.     .
*
.
* The most promising method suggested has been
"long-wide-long".    .
* That is, all records in, say, file 1 with the same
group          .
* identifier are brought into a single record, typically
by         .
* CASESTOVARS. That changes the problem to a one-to-many
merge,     .
* using MATCH FILES with the wide version as a TABLE file.
The      .
* result, which is 'long' in file 2 and 'wide' in file 1, is
then   .
* unrolled to restore file 1, typically using VARSTOCASES.
That     .
* gives, for each file 2 record, a complete restored set of file
1 .
* records, completing the
match.
.
*
.
* There are two distinct difficulties. First, if the file has
many .
* variables, its wide version may have a very large number.
And,    .
* the VARSTOCASES is data-dependent: the largest number of
cases    .
* in any file 2 group, and the list of variables to be
restored,    .
* must be in the VARSTOCASES code. (A solution, of course, is
to    .
* have Python look at the data dictionary and generate
the          .
* VARSTOCASES -- but that's probably not trivial, as Python has
to .
* infer which sets of variables were generated from
CASESTOVARS.)   .
*
.
* A variant is to include only the identifiers of the file
1        .
* records in the 'wide' file 1, and re-attach the data
variables    .
* later. That avoids creating a file with a great many
variables,   .
* and does not require a VARSTOCASES naming the full list
of        .
* variables to be restored. The Cartesian-product code by
David     .
* Marso(*) works this way. However, it still requires that
the     .
* number of records in one of the files, or rather the
largest      .
* number in any group in that file, be in the code, as the bound of
.
* the vector of
identifiers.
.
*
.
* (*) David Marso's code is in SPSSX-L
posting
.
*       Date:    Thu, 19 Nov
2009 11:27:59
-0500
.
*       From:
Lakshmikanth
<[hidden email]>
.
*       Subject: Cartesian product in
SPSS
.
*       To:
[hidden email]
.
*       A version reformatted by R. Ristow
is
in
.
*       Date:    Tue, 24 Nov
2009 00:46:30
-0500
.
*       From:    Richard
Ristow
<[hidden email]>
.
*       Subject: Re: Cartesian product in
SPSS
.
*       To:
[hidden email]
.
*
.
* However, if the keys within groups are successive
positive        .
* integers, the 'wide' form of the list of IDs can be a
single      .
* number: the largest key value in the group. ('Unrolling'
this     .
* requires XSAVE logic, instead of
VARSTOCASES.)
.
*
.
* That's what's done in the code below, assigning integer
record    .
* IDs and applying 'log-wide-long' to File
1.
.

* .................................................................
.
* .................   Scratch
file
..................... .
* (Because it's written by XSAVE, it cannot be a
dataset.)
.
FILE HANDLE Unrolled
/NAME='C:\Documents and Settings\Richard\My
Documents'
+

'\Temporary\SPSS\'
+
         '2009-11-24 Ristow -
Many to many merge in
SPSS'
+
         ' -
'
+
         'UNROLL.SAV'.

* .................   Test
data
..................... .

DATA LIST FIXED
   /Category    04-16 (A)
    School      17-38 (A).
BEGIN DATA
   Art
RISD
   Art
Metropolitan Museum
   Liberal Arts Oberlin
   Liberal Arts Antioch
   Liberal Arts Swarthmore
   Technical    MIT
   Technical    ETH
END DATA.
DATASET NAME     File1.
. /*-- LIST /*-*/.
DATA LIST FIXED
   /Category    04-16 (A)
    Subject     17-28 (A).
BEGIN DATA
   Art
Painting
   Art
Drawing
   Art
Sculpture
   Liberal Arts Literature
   Liberal Arts Philosophy
   Technical    Calculus
   Technical    Physics
   Technical    Chemistry
   Technical    Engineering
END DATA.
DATASET NAME     File2.
. /*-- LIST /*-*/.
* .................   Post after this point
..................... .
* .................................................................
.
DATASET ACTIVATE File1     WINDOW=ASIS.
LIST.
DATASET ACTIVATE File2     WINDOW=ASIS.
LIST.
* ...   I.       Identify
File1 records, within categories, by      .
*
integers sequentially from
1.
.
*        (Because the category
variable is a string, this is done in .
*        transformation code; CREATE
and RANK wouldn't work.)
.
DATASET ACTIVATE File1     WINDOW=ASIS.
DATASET COPY     File1_IDd WINDOW=MINIMIZED.
DATASET ACTIVATE File1_IDd WINDOW=FRONT.

NUMERIC    ID1 (F3).
DO IF    $CASENUM EQ 1.
. COMPUTE ID1 = 1.
ELSE IF Category NE LAG(Category).
. COMPUTE ID1 = 1.
ELSE.
. COMPUTE ID1 = LAG(ID1) + 1.
END IF.

* ...   II.      Restructure
File1 from 'long' to 'wide', except    .
*
. Keep only the category, and identifying number ID1 .
*
. Represent the set of values of ID1 by the largest .
*
value. Since the values of ID1 are consecutive
.
*
integers starting with 1, that's
unambiguous.      .
DATASET DECLARE   File1_Wide.
DATASET ACTIVATE File1_IDd WINDOW=ASIS.
AGGREGATE OUTFILE=File1_Wide
   /BREAK=Category
   /All_IDs 'Representing all values of ID1' =
MAX(ID1).

. /**/ DATASET ACTIVATE File1_Wide
WINDOW=FRONT
/*-*/.
* -**- This, believe it or not, is the 'wide' form of
File1   -***-.
. /**/
LIST
/*-*/.

* ...   III.     Merge 'wide' File1
with
File2
.
*
This is a one-to-many
merge.
.
MATCH FILES
   /TABLE=File1_Wide
   /FILE =File2
   /BY Category.

. /**/
LIST
/*-*/.

* ...   IV.A.    Return to a 'long'
structure, with all File1       .
*
records joined with all File2
records.
.
*       The result is the complete
many-to-many merged file, except .
*       for (oh, yes) the *data* from
File1.
.
*       (This is where XSAVE will work, and
VARSTOCASES won't.)      .
NUMERIC ID1 (F3).
LOOP    ID1 = 1 TO ALL_IDs.
. XSAVE OUTFILE=Unrolled
   /KEEP=Category ID1 ALL.
END LOOP.
EXECUTE /* required */.
* ...   IV.B.    ... and go to 'file 1
first'
order.
.
GET FILE = Unrolled.
DATASET NAME     ManyToMany WINDOW=FRONT.
SORT CASES BY    Category ID1.
* ...   V.       Re-merge
with File1, adding the data, completing   .
*
the
job.
.
MATCH FILES
   /TABLE = File1_IDd
   /FILE = *
   /BY      Category ID1
   /KEEP = Category ID1 ALL.

LISt.
================================
APPENDIX II: History and methods
================================
The best ideas have been some version of what I'll call the
'long-wide-long' method. In that method, in one of the files, the records
for every key value (i'll cal are combined into a single record ('wide'
form), using CASESTOVARS or an equivalent. Then there's only one record
per key value in that file, and the merge can be done as one-to-many,
using /TABLE in MATCH FILES. Finally, the data from the first file is
re-converted ('unrolled') to 'long' form, using VARSTOCASES or some
equivalent.
I don't think anybody's got that working satisfactorily. First, the
'wide' File 1 may have so many variables that SPSS can't handle it, or
can't handle it efficiently. Second, and more crucial, both the list of
variables in File 1, and the maximum number of records in File 1 for any
key value, must be coded in the VARSTOCASES statement. One might write a
Python program to analyze the data dictionary and generate the
VARSTOCASES statement, but I think even that would be awkward.
David Marso of SPSS, Inc., wrote an interesting variant (see citations
above). If the file to be made 'wide', say File 1, has a record-key
variable that uniquely identifies records within each (group) key value,
only the record-key values need be combined into a 'wide' record. After
the one-to-many merge with File 2, and unrolling back to long form, the
File 1 data can then be re-attached by a MATCH FILE by group-key and
record-key, with File 1 as a TABLE.
David Marso's code still requires a vector to hold the record-key values
from File 1; and the length of that vector, i.e. the maximum number of
records in File 1 for any group-key value, must be given in the
code.
However, if the record key consists of integers sequentially from 1
within each group, the set of keys may be represented by a single
integer, namely the largest value. Then, the set of variables that need
be named is not data-dependent at all, and general-purpose code is
possible.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

David Marso

Re: Many-to-many merge in SPSS

Administrator

From Richard's appendix:
"David Marso's code still requires a vector to hold the record-key values
from File 1; and the length of that vector, i.e. the maximum number of
records in File 1 for any group-key value, must be given in the
code.
However, if the record key consists of integers sequentially from 1
within each group, the set of keys may be represented by a single
integer, namely the largest value. Then, the set of variables that need
be named is not data-dependent at all, and general-purpose code is
possible. "

SWEET that the matrix solution requires nothing but the two data files ;-)
I later tested the code on 2 10K row files using only IDs for the KRON op and later did the 'fact' dat merges using 2 TABLE merges with an intermediate SORT. Code on other PC so you'll need to just imagine it until later. Hmm those 2 10K files end up creating a 100M row dataset! Do the math.

Albert-Jan Roskam

Re: Many-to-many merge in SPSS

Never tried this (and no SPSS here), but this might also be a way:

DEFINE !connect_string ()
'DSN=MS Access Database;DBQ=c:\temp\temp.mdb;DriverId=25;FIL=MS Access;MaxBufferSize=2048;PageTimeout=5;'
!ENDDEFINE.
SAVE TRANSLATE /TYPE=ODBC /CONNECT=!connect_string /TABLE=tbl1 /REPLACE.
SAVE TRANSLATE /TYPE=ODBC /CONNECT=!connect_string /TABLE=tbl2 /REPLACE
/SQL = "SELECT * FROM tbl1, tbl2 INTO tblCartesian".

GET DATA...

Regards,

Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a

fresh water system, and public health, what have the Romans ever done for us?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

From: David Marso <[hidden email]>
To: [hidden email]
Sent: Saturday, February 1, 2014 9:08 AM
Subject: Re: [SPSSX-L] Many-to-many merge in SPSS

From Richard's appendix:
"David Marso's code still requires a vector to hold the record-key values
from File 1; and the length of that vector, i.e. the maximum number of
records in File 1 for any group-key value, must be given in the
code.
However, if the record key consists of integers sequentially from 1
within each group, the set of keys may be represented by a single
integer, namely the largest value. Then, the set of variables that need
be named is not data-dependent at all, and general-purpose code is
possible. "

SWEET that the matrix solution requires nothing but the two data files ;-)
I later tested the code on 2 10K row files using only IDs for the KRON op
and later did the 'fact' dat merges using 2 TABLE merges with an
intermediate SORT. Code on other PC so you'll need to just imagine it until
later. Hmm those 2 10K files end up creating a 100M row dataset! Do the
math.

-----
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Many-to-many-merge-in-SPSS-tp1081978p5724262.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Richard Ristow

Re: Many-to-many merge in SPSS

In reply to this post by fuigb

There've been a couple of solutions posted -- see David Marso's
MATRIX logic(1), and Albert-Jan Roskam's suggestion of using an
external SQL database(2).

I think the long-wide-long approach will still prove useful. Among
other things, it can be made to work very much like MATCH FILES. And
the output retains labels, formats, and missing values of all
variables; I don't think MATRIX keeps those, and external SQL certainly won't.

(1) For MATRIX logic, see
Date: Fri, 31 Jan 2014 10:06:28 -0800
From: David Marso <[hidden email]>
Subject: Re: Many-to-many merge in SPSS
To: [hidden email]

(2) For SQL logic, see
Date: Sat, 1 Feb 2014 11:40:31 -0800
From: Albert-Jan Roskam <[hidden email]>
Subject: Re: Many-to-many merge in SPSS
To: [hidden email]

Here is the long-wide-long method cleaned up and wrapped in a macro.
I'm testing it, before posting, but I'd be grateful for anybody else
trying it as well.

Call:

!M2MnyX Left =<left input>/
Right =<right input>/
BYvar =<variables>/
WrkFile=<file>.

(The 'X' suffix on the name indicates that this version uses XSAVE logic.)

Parameters (all are required) are,

Left Name of an existing SPSS dataset;
or, name or file handle of an SPSS .SAV file
Right Name of a second SPSS dataset or .SAV file, as above

BYvar Names of one or more variables occurring in both files.
Requirements are as for BY variables in MATCH FILES,
except that it need not be a unique key in either input.
Every record in each input will be joined to every
record having the same BYvar value in the other.
WrkFile Name or file handle of a disk file (NOT a dataset) for
working storage.

Other requirements:

- Inputs 'Left' and 'Right', and variable(s) 'BYvar' must be compatible
as for MATCH FILES:
. The BYvar variable or variables must occur in both inputs, and
be of the same type and length in both.
. Both inputs must be in ascending order on the BYvar variable(s).
(However, group key values need not be unique record identifiers
in either input.)
. If any other variable occurs in both, it must have the same type and
length in both, AND ITS VALUE FROM THE RIGHT INPUT WILL BE LOST.

- SPSS must be able to write to the work file.

Result:

+ The left and right inputs will be unaltered.

+ A new dataset 'Joined' will be created, and will be the active
dataset. It will have one record for every pair of records from the
'left' and 'right' inputs that match on the group key. Order of
variables will be,

. The 'BYvar' variable(s)
. Variable 'Left#'. (This variable numbers the records
in the left input, sequentially within each BYvar value.)
. The non-key variables from the left input
. The non-key variables from the right input

+ The records will be in ascending order by the BYvar variables.
Within a BYvar value, they will be in the same order as in the
left-hand input. Records from the same left-hand input record will be
in the same order as in their right-hand input records.

+ No value of the group key will occur in this file unless it occurs
in both inputs. (In database terminology, it is an 'inner join'.)

+ The macro will print certain messages, including echoing its parameters.

Side effects:
. The working file will be left on disk. It may be erased if desired.
A pre-an existing file by that name will be lost.
. Datasets '@LftIDd' and '@LftWide' will be created; they may be erased.
Pre-existing datasets named '@LftIDd', '@LftWide', or 'Joined' will
. be lost.
. Pre-existing variables named 'Left#', '@NewGrp', and '@MxLkey' will
be lost.
Glitch:
. The dataset 'Joined' will be tagged as being an image of the working
file; on closing SPSS, you'll be asked if you want to save it back
to the working-file name and location.

(Later, code to erase the working file and scratch datasets may be
added, and the ECHO messages dropped.)

APPENDICES FOLLOW:
I. Definition of macro !M2MnyX, with instructions in comments
II. Output from demonstration run
III. Test data and code, for demonstration run
================================================
Appendix I: Definition of macro !M2MnyX
================================================
* **************************************************************** .
* Macro for many-to-many merge of two datasets or .SAV files, .
* using XSAVE logic. .
* Takes four arguments, all required, all terminated by '/': .
* Left Name of one of the datasets, or saved file, to be joined .
* Right Name of the other dataset, or saved file, to be joined .
* BYvar Variable or variables on which the two are to be joined..
* Requirements are as for BY variables in MATCH FILES: .
* all variables must be present and of the same type in .
* both inputs; and both must be sorted in ascending order .
* on the BYvar variables. .
* WrkFile Name or handle of disk file in which to 'unroll' the .
* keys in Left, joined to the data in Right. .
* **************************************************************** .

DEFINE !M2MnyX(Left = !CHAREND('/')
/Right = !CHAREND('/')
/BYvar = !CHAREND('/')
/WrkFile = !CHAREND('/'))

. ECHO ' '.
. ECHO 'Arguments to <!>M2MnyX1:'.
. ECHO !QUOTE(!CONCAT('"Left" input: ',!Left)).
. ECHO !QUOTE(!CONCAT('"Right" input: ',!Right)).
. ECHO !QUOTE(!CONCAT('Key variable(s): ',!BYvar)).
. ECHO !QUOTE(!CONCAT('Working file: ',!WrkFile)).
. ECHO 'After the macro finishes execution, datasets'.
. ECHO '"@LftIDd" and "@LftWide" may be erased.'.

* ... I. Add 'Left#', record number within <!>BYvar groups, .
* to Left-hand file. .

ADD FILES
/FILE =!Left
/BY !BYvar
/FIRST= @NewGrp.

NUMERIC Left# (F3).

DO IF @NewGrp.
. COMPUTE Left# = 1.
ELSE.
. COMPUTE Left# = LAG(Left#) + 1.
END IF.

DATASET NAME @LftIDd WINDOW=FRONT.

* ... II. Restructure Left input from 'long' to 'wide', .
* /* . Keep only the group key (!BYvar) and record .
* sequence number within group (Left#). .
* . Represent the set of values of Left# by the .
* largest value. Since the values of Left# are .
* consecutive integers starting with 1, the .
* maximum value characterizes the set. .

DATASET ACTIVATE @LftIDd WINDOW=ASIS.
DATASET DECLARE @LftWide.

AGGREGATE OUTFILE=@LftWide
/BREAK = !BYvar
/@MxLkey 'Largest value of Left# for this key' = MAX(Left#).

* ... III. Merge 'wide' Left input with Right input .
* This is a one-to-many merge. .

MATCH FILES
/TABLE=@LftWide
/FILE =!Right
/BY !BYvar.

* ... IV.A Return to a 'long' structure, with all Left .
* records joined with all Right records. .

* The result is the complete many-to-many merged file, except .
* for the substantive data from the Left input. .

NUMERIC Left# (F3).
LOOP Left# = 1 TO @MxLkey.
. XSAVE OUTFILE=!WrkFile
/DROP=@MxLKey
/KEEP=!BYvar Left# ALL.
END LOOP.
EXECUTE /* required */.

* ... IV.B. ... and go to 'Left file first' order. .

GET FILE=!WrkFile.
SORT CASES BY !BYvar Left#.

* ... V. Re-merge with !Left, adding the data, completing .
* the join. .

MATCH FILES
/TABLE= @LftIDd
/FILE = *
/BY !BYvar Left#
/DROP = @NewGrp
/KEEP = !BYvar Left# ALL.

DATASET NAME Joined WINDOW=FRONT.

!ENDDEFINE.
================================================
Appendix II. Output from demonstration run
================================================
DATASET ACTIVATE Schools WINDOW=FRONT.
LIST.
List
|-----------------------------|---------------------------|
|Output Created |03-FEB-2014 21:32:26 |
|-----------------------------|---------------------------|
[Schools]
Category School

Art RI School of Design
Art Metropolitan Museum
Culinary Johnson & Wales
Liberal Arts Oberlin
Liberal Arts Antioch
Liberal Arts Swarthmore
Technical MIT
Technical Cal Tech
Technical ETH

Number of cases read: 9 Number of cases listed: 9

DATASET ACTIVATE Subjects WINDOW=FRONT.
LIST.
List
|-----------------------------|---------------------------|
|Output Created |03-FEB-2014 21:32:26 |
|-----------------------------|---------------------------|
[Subjects]
Category Subject

Art Painting
Art Drawing
Art Sculpture
Liberal Arts Literature
Liberal Arts Philosophy
Technical Calculus
Technical Physics
Technical Chemistry
Technical Engineering

Number of cases read: 9 Number of cases listed: 9

!M2MnyX Left = Schools/
Right = Subjects/
BYvar = Category/
WrkFile = Unroll.

Arguments to <!>M2MnyX1:
"Left" input: Schools
"Right" input: Subjects
Key variable(s): Category
Working file: Unroll
After the macro finishes execution, datasets
"@LftIDd" and "@LftWide" may be erased.

LIST.
List
|-----------------------------|---------------------------|
|Output Created |03-FEB-2014 21:32:27 |
|-----------------------------|---------------------------|
[Joined] C:\Documents and Settings\Richard\My Documents
\Temporary\SPSS\
2014-xx-xx Ristow - Many-to-many merge in SPSS - Scratch1.SAV

Category Left# School Subject

Art 1 RI School of Design Painting
Art 1 RI School of Design Drawing
Art 1 RI School of Design Sculpture
Art 2 Metropolitan Museum Painting
Art 2 Metropolitan Museum Drawing
Art 2 Metropolitan Museum Sculpture
Liberal Arts 1 Oberlin Literature
Liberal Arts 1 Oberlin Philosophy
Liberal Arts 2 Antioch Literature
Liberal Arts 2 Antioch Philosophy
Liberal Arts 3 Swarthmore Literature
Liberal Arts 3 Swarthmore Philosophy
Technical 1 MIT Calculus
Technical 1 MIT Physics
Technical 1 MIT Chemistry
Technical 1 MIT Engineering
Technical 2 Cal Tech Calculus
Technical 2 Cal Tech Physics
Technical 2 Cal Tech Chemistry
Technical 2 Cal Tech Engineering
Technical 3 ETH Calculus
Technical 3 ETH Physics
Technical 3 ETH Chemistry
Technical 3 ETH Engineering

Number of cases read: 24 Number of cases listed: 24
================================================
Appendix III. Test data and code, for demonstration run
III.A Test data and driver code
================================================
* C:\Documents and Settings\Richard\My Documents .
* \Technical\spssx-l\Z-2014\ .
* 2014-02-03 Ristow - Many-to-many merge in SPSS.SPS .

* Under name 2014-xx-xx Ristow-Many-to-many merge in SPSS.SPS: .
* 31 Jan 2014 - Create, from file .
* C:\Documents and Settings\Richard\My Documents .
* \Technical\spssx-l\Z-2009d\ .
* 2009-11-24 Ristow - Many to many merge in SPSS.SPS .
* which was prepared for, and included in, posting .
* Date: Thu, 26 Nov 2009 14:34:54 -0500 .
* From: Richard Ristow <[hidden email]> .
* Subject: Many-to-many merge in SPSS .
* To: [hidden email] .
* Projected clean-up ('+' = done, '.' = to do): .
* + Name input datasets '@Left' and '@Right', instead of .
* 'File1' and 'File2' .
* + Use ADD FILES when creating record sequence numbers for .
* '@Left'; this allows group keys with several variables. .
* + Don't add sequence numbers for file @'Right'; they aren't .
* needed .
* + Wrap in a macro, initially un-parameterized. .
* 1 Feb 2014 - Complete the renaming of files and variables, .
* and create first version of the macro, with no parameters. .
* Remove it to file M2MnyX0.SPS, put the test data here and .
* set up for test. .
* Under name 2014-xx-xx Ristow - Many-to-many merge in SPSS.SPS: .
* Save under this name, with spaces around the hyphen following .
* the author's name. Looks better, matches earlier practice .
* Test with macro <!>M2MnyX0, taking no arguments. .
* 3 Feb 2014 - Drop creation of datasets @Left and @Right, which .
* were for testing open code or the macro without parameters. .
* Set up to work with production macro <!>M2MnyX. .
* Under name 2014-02-03 Ristow - Many-to-many merge in SPSS.SPS: .

* ................. Directory holding code ..................... .

FILE HANDLE CodePlace
NAME='C:\Documents and Settings\Richard\My Documents' +
'\Technical\spssx-l\Z-2014'.

* ................. Working file, required for XSAVE logic ..... .

FILE HANDLE Unroll
/NAME='C:\Documents and Settings\Richard\My Documents' +
'\Temporary\SPSS\' +
'2014-xx-xx Ristow - Many-to-many merge in SPSS' +
' - ' +
'Scratch1.SAV'.

* ................. Test data ..................... .

DATA LIST FIXED
/Category 04-16 (A)
School 17-38 (A).
BEGIN DATA
Art RI School of Design
Art Metropolitan Museum
Culinary Johnson & Wales
Liberal Arts Oberlin
Liberal Arts Antioch
Liberal Arts Swarthmore
Technical MIT
Technical Cal Tech
Technical ETH
END DATA.
DATASET NAME Schools.
. /*-- LIST /*-*/.

DATA LIST FIXED
/Category 04-16 (A)
Subject 17-28 (A).
BEGIN DATA
Art Painting
Art Drawing
Art Sculpture
Liberal Arts Literature
Liberal Arts Philosophy
Technical Calculus
Technical Physics
Technical Chemistry
Technical Engineering
END DATA.
DATASET NAME Subjects.
. /*-- LIST /*-*/.

* Load definition of <!>M2MnyX: .
INSERT FILE='CodePlace\M2MnyX1.SPS'.

DATASET ACTIVATE Schools WINDOW=FRONT.
LIST.

DATASET ACTIVATE Subjects WINDOW=FRONT.
LIST.
!M2MnyX Left = Schools/
Right = Subjects/
BYvar = Category/
WrkFile = Unroll.

LIST.
================================================
III.B Macro definition file
================================================
* C:\Documents and Settings\Richard\My Documents .
* \Technical\spssx-l\Z-2014\ .
* M2MnyX1.SPS .

* For insertion in 2014-xx-xx Ristow-Many-to-many merge in SPSS.SPS .
* or successor (2014-02-03 Ristow-Many-to-many merge in SPSS.SPS) .

* This is a production form of the macro with XSAVE logic. .

* **************************************************************** .
* Macro for many-to-many merge of two datasets or .SAV files, .
* using XSAVE logic. .
* Takes four arguments, all required, all terminated by '/': .
* Left Name of one of the datasets, or saved file, to be joined .
* Right Name of the other dataset, or saved file, to be joined .
* BYvar Variable or variables on which the two are to be joined..
* Requirements are as for BY variables in MATCH FILES: .
* all variables must be present and of the same type in .
* both inputs; and both must be sorted in ascending order .
* on the BYvar variables. .
* WrkFile Name or handle of disk file in which to 'unroll' the .
* keys in Left, joined to the data in Right. .
* **************************************************************** .

DEFINE !M2MnyX(Left = !CHAREND('/')
/Right = !CHAREND('/')
/BYvar = !CHAREND('/')
/WrkFile = !CHAREND('/'))

. ECHO ' '.
. ECHO 'Arguments to <!>M2MnyX1:'.
. ECHO !QUOTE(!CONCAT('"Left" input: ',!Left)).
. ECHO !QUOTE(!CONCAT('"Right" input: ',!Right)).
. ECHO !QUOTE(!CONCAT('Key variable(s): ',!BYvar)).
. ECHO !QUOTE(!CONCAT('Working file: ',!WrkFile)).
. ECHO 'After the macro finishes execution, datasets'.
. ECHO '"@LftIDd" and "@LftWide" may be erased.'.

* ... I. Add 'Left#', record number within <!>BYvar groups, .
* to Left-hand file. .

ADD FILES
/FILE =!Left
/BY !BYvar
/FIRST= @NewGrp.

NUMERIC Left# (F3).

DO IF @NewGrp.
. COMPUTE Left# = 1.
ELSE.
. COMPUTE Left# = LAG(Left#) + 1.
END IF.

DATASET NAME @LftIDd WINDOW=FRONT.

* ... II. Restructure Left input from 'long' to 'wide', .
* /* . Keep only the group key (!BYvar) and record .
* sequence number within group (Left#). .
* . Represent the set of values of Left# by the .
* largest value. Since the values of Left# are .
* consecutive integers starting with 1, the .
* maximum value characterizes the set. .

DATASET ACTIVATE @LftIDd WINDOW=ASIS.
DATASET DECLARE @LftWide.

AGGREGATE OUTFILE=@LftWide
/BREAK = !BYvar
/@MxLkey 'Largest value of Left# for this key' = MAX(Left#).

* ... III. Merge 'wide' Left input with Right input .
* This is a one-to-many merge. .

MATCH FILES
/TABLE=@LftWide
/FILE =!Right
/BY !BYvar.

* ... IV.A Return to a 'long' structure, with all Left .
* records joined with all Right records. .

* The result is the complete many-to-many merged file, except .
* for the substantive data from the Left input. .

NUMERIC Left# (F3).
LOOP Left# = 1 TO @MxLkey.
. XSAVE OUTFILE=!WrkFile
/DROP=@MxLKey
/KEEP=!BYvar Left# ALL.
END LOOP.
EXECUTE /* required */.

* ... IV.B. ... and go to 'Left file first' order. .

GET FILE=!WrkFile.
SORT CASES BY !BYvar Left#.

* ... V. Re-merge with !Left, adding the data, completing .
* the join. .

MATCH FILES
/TABLE= @LftIDd
/FILE = *
/BY !BYvar Left#
/DROP = @NewGrp
/KEEP = !BYvar Left# ALL.

DATASET NAME Joined WINDOW=FRONT.

!ENDDEFINE.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Andy W

Re: Many-to-many merge in SPSS

Thanks for posting this Richard, although in my opinion saving the metadata is not a big deal - APPLY DICTIONARY could rectify that pretty easily.

Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/