From time to time, the question has come up, how to do a many-to-many
merge in SPSS: given two files with key variables (which I'll call the
'group key') and, potentially, several records per group-key value in
each file, join every record in each file with every record having the
same group-key value in the other file. Up to now, no satisfactory
solution had been found (see Appendix II at the end of this
posting).
Here, I give code and a test run for what I think is a complete solution. It's a version of the long-wide-long method (see Appendix II), using . David Marso's insight that only record keys from the file being converted to 'wide' need be kept in the 'wide' file(*), and . my own, that if the record keys are integers consecutive from 1, a set of keys is characterized by its largest value. (*) David Marso is <[hidden email]>. His code, for the special case of taking the Cartesian product of two files, is in SPSSX-L posting Date: Thu, 19 Nov 2009 11:27:59 -0500 From: Lakshmikanth <[hidden email]> Subject: Cartesian product in SPSS To: [hidden email] A reformatted version, with test run, is in Date: Tue, 24 Nov 2009 00:46:30 -0500 From: Richard Ristow <[hidden email]> Subject: Re: Cartesian product in SPSS To: [hidden email] Test and demonstration: In the following, 'Unrolled' cannot be a dataset; it must be a disk file or a file handle to a disk file. Test data - File 1: [File1] Category School Art RISD Art Metropolitan Museum Liberal Arts Oberlin Liberal Arts Antioch Liberal Arts Swarthmore Technical MIT Technical ETH Number of cases read: 7 Number of cases listed: 7 File 2: [File2] Category Subject Art Painting Art Drawing Art Sculpture Liberal Arts Literature Liberal Arts Philosophy Technical Calculus Technical Physics Technical Chemistry Technical Engineering Number of cases read: 9 Number of cases listed: 9 Demonstration run, with code, intermediate files, and output: * ... I. Identify File1 records, within categories, by . * integers sequentially from 1. . * (Because the category variable is a string, this is done in . * transformation code; CREATE and RANK wouldn't work.) . DATASET ACTIVATE File1 WINDOW=ASIS. DATASET COPY File1_IDd WINDOW=MINIMIZED. DATASET ACTIVATE File1_IDd WINDOW=FRONT. NUMERIC ID1 (F3). DO IF $CASENUM EQ 1. . COMPUTE ID1 = 1. ELSE IF Category NE LAG(Category). . COMPUTE ID1 = 1. ELSE. . COMPUTE ID1 = LAG(ID1) + 1. END IF. * ... II. Restructure File1 from 'long' to 'wide', except . * . Keep only the category, and identifying number ID1 . * . Represent the set of values of ID1 by the largest . * value. Since the values of ID1 are consecutive . * integers starting with 1, that's unambiguous. . DATASET DECLARE File1_Wide. DATASET ACTIVATE File1_IDd WINDOW=ASIS. AGGREGATE OUTFILE=File1_Wide /BREAK=Category /All_IDs 'Representing all values of ID1' = MAX(ID1). . /**/ DATASET ACTIVATE File1_Wide WINDOW=FRONT /*-*/. * -**- This, believe it or not, is the 'wide' form of File1 -***-. . /**/ LIST /*-*/. List |-----------------------------|---------------------------| |Output Created |26-NOV-2009 14:13:19 | |-----------------------------|---------------------------| [File1_Wide] Category All_IDs Art 2 Liberal Arts 3 Technical 2 Number of cases read: 3 Number of cases listed: 3 * ... III. Merge 'wide' File1 with File2 . * This is a one-to-many merge. . MATCH FILES /TABLE=File1_Wide /FILE =File2 /BY Category. . /**/ LIST /*-*/. List |-----------------------------|---------------------------| |Output Created |26-NOV-2009 14:13:20 | |-----------------------------|---------------------------| Category All_IDs Subject Art 2 Painting Art 2 Drawing Art 2 Sculpture Liberal Arts 3 Literature Liberal Arts 3 Philosophy Technical 2 Calculus Technical 2 Physics Technical 2 Chemistry Technical 2 Engineering Number of cases read: 9 Number of cases listed: 9 * ... IV.A. Return to a 'long' structure, with all File1 . * records joined with all File2 records. . * The result is the complete many-to-many merged file, except . * for (oh, yes) the *data* from File1. . * (This is where XSAVE will work, and VARSTOCASES won't.) . NUMERIC ID1 (F3). LOOP ID1 = 1 TO ALL_IDs. . XSAVE OUTFILE=Unrolled /KEEP=Category ID1 ALL. END LOOP. EXECUTE /* required */. * ... IV.B. ... and go to 'file 1 first' order. . GET FILE = Unrolled. DATASET NAME ManyToMany WINDOW=FRONT. SORT CASES BY Category ID1. * ... V. Re-merge with File1, adding the data, completing . * the job. . MATCH FILES /TABLE = File1_IDd /FILE = * /BY Category ID1 /KEEP = Category ID1 ALL. LISt. List |-----------------------------|---------------------------| |Output Created |26-NOV-2009 14:13:22 | |-----------------------------|---------------------------| [ManyToMany] C:\Documents and Settings\Richard\My Documents \Temporary\SPSS\ 2009-11-24 Ristow - Many to many merge in SPSS - UNROLL.SAV Category ID1 School All_IDs Subject Art 1 RISD 2 Painting Art 1 RISD 2 Drawing Art 1 RISD 2 Sculpture Art 2 Metropolitan Museum 2 Painting Art 2 Metropolitan Museum 2 Drawing Art 2 Metropolitan Museum 2 Sculpture Liberal Arts 1 Oberlin 3 Literature Liberal Arts 1 Oberlin 3 Philosophy Liberal Arts 2 Antioch 3 Literature Liberal Arts 2 Antioch 3 Philosophy Liberal Arts 3 Swarthmore 3 Literature Liberal Arts 3 Swarthmore 3 Philosophy Technical 1 MIT 2 Calculus Technical 1 MIT 2 Physics Technical 1 MIT 2 Chemistry Technical 1 MIT 2 Engineering Technical 2 ETH 2 Calculus Technical 2 ETH 2 Physics Technical 2 ETH 2 Chemistry Technical 2 ETH 2 Engineering Number of cases read: 20 Number of cases listed: 20 =============================== APPENDIX I: Test data and code =============================== * C:\Documents and Settings\Richard\My Documents . * \Technical\spssx-l\Z-2009d\ . * 2009-11-26 Ristow - Many-to-many merge in SPSS.SPS . * To support a posting with title "Many-to-many merge in SPSS" . * The question of doing many-to-many merges in SPSS, comes up from . * time to time. No fully satisfactory solution has been posted. . * . * The most promising method suggested has been "long-wide-long". . * That is, all records in, say, file 1 with the same group . * identifier are brought into a single record, typically by . * CASESTOVARS. That changes the problem to a one-to-many merge, . * using MATCH FILES with the wide version as a TABLE file. The . * result, which is 'long' in file 2 and 'wide' in file 1, is then . * unrolled to restore file 1, typically using VARSTOCASES. That . * gives, for each file 2 record, a complete restored set of file 1 . * records, completing the match. . * . * There are two distinct difficulties. First, if the file has many . * variables, its wide version may have a very large number. And, . * the VARSTOCASES is data-dependent: the largest number of cases . * in any file 2 group, and the list of variables to be restored, . * must be in the VARSTOCASES code. (A solution, of course, is to . * have Python look at the data dictionary and generate the . * VARSTOCASES -- but that's probably not trivial, as Python has to . * infer which sets of variables were generated from CASESTOVARS.) . * . * A variant is to include only the identifiers of the file 1 . * records in the 'wide' file 1, and re-attach the data variables . * later. That avoids creating a file with a great many variables, . * and does not require a VARSTOCASES naming the full list of . * variables to be restored. The Cartesian-product code by David . * Marso(*) works this way. However, it still requires that the . * number of records in one of the files, or rather the largest . * number in any group in that file, be in the code, as the bound of . * the vector of identifiers. . * . * (*) David Marso's code is in SPSSX-L posting . * Date: Thu, 19 Nov 2009 11:27:59 -0500 . * From: Lakshmikanth <[hidden email]> . * Subject: Cartesian product in SPSS . * To: [hidden email] . * A version reformatted by R. Ristow is in . * Date: Tue, 24 Nov 2009 00:46:30 -0500 . * From: Richard Ristow <[hidden email]> . * Subject: Re: Cartesian product in SPSS . * To: [hidden email] . * . * However, if the keys within groups are successive positive . * integers, the 'wide' form of the list of IDs can be a single . * number: the largest key value in the group. ('Unrolling' this . * requires XSAVE logic, instead of VARSTOCASES.) . * . * That's what's done in the code below, assigning integer record . * IDs and applying 'log-wide-long' to File 1. . * ................................................................. . * ................. Scratch file ..................... . * (Because it's written by XSAVE, it cannot be a dataset.) . FILE HANDLE Unrolled /NAME='C:\Documents and Settings\Richard\My Documents' + '\Temporary\SPSS\' + '2009-11-24 Ristow - Many to many merge in SPSS' + ' - ' + 'UNROLL.SAV'. * ................. Test data ..................... . DATA LIST FIXED /Category 04-16 (A) School 17-38 (A). BEGIN DATA Art RISD Art Metropolitan Museum Liberal Arts Oberlin Liberal Arts Antioch Liberal Arts Swarthmore Technical MIT Technical ETH END DATA. DATASET NAME File1. . /*-- LIST /*-*/. DATA LIST FIXED /Category 04-16 (A) Subject 17-28 (A). BEGIN DATA Art Painting Art Drawing Art Sculpture Liberal Arts Literature Liberal Arts Philosophy Technical Calculus Technical Physics Technical Chemistry Technical Engineering END DATA. DATASET NAME File2. . /*-- LIST /*-*/. * ................. Post after this point ..................... . * ................................................................. . DATASET ACTIVATE File1 WINDOW=ASIS. LIST. DATASET ACTIVATE File2 WINDOW=ASIS. LIST. * ... I. Identify File1 records, within categories, by . * integers sequentially from 1. . * (Because the category variable is a string, this is done in . * transformation code; CREATE and RANK wouldn't work.) . DATASET ACTIVATE File1 WINDOW=ASIS. DATASET COPY File1_IDd WINDOW=MINIMIZED. DATASET ACTIVATE File1_IDd WINDOW=FRONT. NUMERIC ID1 (F3). DO IF $CASENUM EQ 1. . COMPUTE ID1 = 1. ELSE IF Category NE LAG(Category). . COMPUTE ID1 = 1. ELSE. . COMPUTE ID1 = LAG(ID1) + 1. END IF. * ... II. Restructure File1 from 'long' to 'wide', except . * . Keep only the category, and identifying number ID1 . * . Represent the set of values of ID1 by the largest . * value. Since the values of ID1 are consecutive . * integers starting with 1, that's unambiguous. . DATASET DECLARE File1_Wide. DATASET ACTIVATE File1_IDd WINDOW=ASIS. AGGREGATE OUTFILE=File1_Wide /BREAK=Category /All_IDs 'Representing all values of ID1' = MAX(ID1). . /**/ DATASET ACTIVATE File1_Wide WINDOW=FRONT /*-*/. * -**- This, believe it or not, is the 'wide' form of File1 -***-. . /**/ LIST /*-*/. * ... III. Merge 'wide' File1 with File2 . * This is a one-to-many merge. . MATCH FILES /TABLE=File1_Wide /FILE =File2 /BY Category. . /**/ LIST /*-*/. * ... IV.A. Return to a 'long' structure, with all File1 . * records joined with all File2 records. . * The result is the complete many-to-many merged file, except . * for (oh, yes) the *data* from File1. . * (This is where XSAVE will work, and VARSTOCASES won't.) . NUMERIC ID1 (F3). LOOP ID1 = 1 TO ALL_IDs. . XSAVE OUTFILE=Unrolled /KEEP=Category ID1 ALL. END LOOP. EXECUTE /* required */. * ... IV.B. ... and go to 'file 1 first' order. . GET FILE = Unrolled. DATASET NAME ManyToMany WINDOW=FRONT. SORT CASES BY Category ID1. * ... V. Re-merge with File1, adding the data, completing . * the job. . MATCH FILES /TABLE = File1_IDd /FILE = * /BY Category ID1 /KEEP = Category ID1 ALL. LISt. ================================ APPENDIX II: History and methods ================================ The best ideas have been some version of what I'll call the 'long-wide-long' method. In that method, in one of the files, the records for every key value (i'll cal are combined into a single record ('wide' form), using CASESTOVARS or an equivalent. Then there's only one record per key value in that file, and the merge can be done as one-to-many, using /TABLE in MATCH FILES. Finally, the data from the first file is re-converted ('unrolled') to 'long' form, using VARSTOCASES or some equivalent. I don't think anybody's got that working satisfactorily. First, the 'wide' File 1 may have so many variables that SPSS can't handle it, or can't handle it efficiently. Second, and more crucial, both the list of variables in File 1, and the maximum number of records in File 1 for any key value, must be coded in the VARSTOCASES statement. One might write a Python program to analyze the data dictionary and generate the VARSTOCASES statement, but I think even that would be awkward. David Marso of SPSS, Inc., wrote an interesting variant (see citations above). If the file to be made 'wide', say File 1, has a record-key variable that uniquely identifies records within each (group) key value, only the record-key values need be combined into a 'wide' record. After the one-to-many merge with File 2, and unrolling back to long form, the File 1 data can then be re-attached by a MATCH FILE by group-key and record-key, with File 1 as a TABLE. David Marso's code still requires a vector to hold the record-key values from File 1; and the length of that vector, i.e. the maximum number of records in File 1 for any group-key value, must be given in the code. However, if the record key consists of integers sequentially from 1 within each group, the set of keys may be represented by a single integer, namely the largest value. Then, the set of variables that need be named is not data-dependent at all, and general-purpose code is possible. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Has anyone had luck with this code? I understand what it is designed to do, but it does not run and too far short of being a real coder to debug it. If anyone charitable has debugged this then please post.
If this can actually be made to function the original author deserves a medal! |
Administrator
|
In reply to this post by Richard Ristow
These days I'd likely approach it with MATRIX (hint KRON).
Note BIG matrix jobs can be partitioned with SPLIT FILE in batch mode (in case XG of RAM doesn't suffice). It's really late here. I shall scrutinize 2mrow. Thanks for the mention. All too often people omit that courtesy! -----
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
After a night's sleep I realized what I needed to tweak and so SUCCESS! You guys are geniuses. For years and at various jobs I've fought the good fight for SPSS against the true-believers of SAS: "SPSS cannot accomplish what is easy in SAS." You're helping me to roll back the naysayers! Again, thank you.
|
Administrator
|
In reply to this post by David Marso
Here is a solution to the many to many merge using 4 lines of MATRIX code ;-)
Note that this only works as is with numeric data!!! To work with strings, best is to merge the ids and numeric using the following (save strings and ids to separate files or datasets, then do sort and table matches to insert the string data). Note the Cartesian dataset is sorted by ID02 ID01, so MATCH stringtable2 BY ID02, SORT result by ID01 then MATCH stringtable1 BT ID01. I only ran this against two 1000 row tables -> 1,000,000 resulting cases. Larger data files will likely work fine given sufficient RAM and WORKSPACE settings. For HUGE files? Contact me for a con$ultation ;-) *Data simulation here *. DATASET DECLARE file1. MATRIX. SAVE TRUNC(UNIFORM(1000,10)*10)/ OUTFILE file1/ VARIABLES f1x01 TO f1x10. END MATRIX. DATASET ACTIVATE file1. COMPUTE ID01=$CASENUM. SAVE OUTFILE 'C:\TEMP\file1.sav'. DATASET DECLARE file2. MATRIX. SAVE TRUNC(UNIFORM(1000,10)*10)/ OUTFILE file2/ VARIABLES f2x01 TO f2x10. END MATRIX. DATASET ACTIVATE file2. COMPUTE ID02=$CASENUM. SAVE OUTFILE 'C:\TEMP\file2.sav'. /*DATA Simulation DONE */. /*Actual cartesian match */. DATASET DECLARE cartesian. MATRIX . GET data1 / FILE 'C:\TEMP\file1.sav' / VARIABLES ID01 f1x01 TO f1x10. GET data2 / FILE 'C:\TEMP\file2.sav' / VARIABLES ID02 f2x01 TO f2x10. COMPUTE d1d2={KRONEKER(MAKE(NROW(data2),1,1),data1), KRONEKER(data2,MAKE(NROW(data1),1,1))}. SAVE d1d2 / OUTFILE cartesian / VARIABLES ID01 f1x01 TO f1x10 ID02 f2x01 TO f2x10 . END MATRIX.
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
Administrator
|
From Richard's appendix:
"David Marso's code still requires a vector to hold the record-key values from File 1; and the length of that vector, i.e. the maximum number of records in File 1 for any group-key value, must be given in the code. However, if the record key consists of integers sequentially from 1 within each group, the set of keys may be represented by a single integer, namely the largest value. Then, the set of variables that need be named is not data-dependent at all, and general-purpose code is possible. " SWEET that the matrix solution requires nothing but the two data files ;-) I later tested the code on 2 10K row files using only IDs for the KRON op and later did the 'fact' dat merges using 2 TABLE merges with an intermediate SORT. Code on other PC so you'll need to just imagine it until later. Hmm those 2 10K files end up creating a 100M row dataset! Do the math.
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
Never tried this (and no SPSS here), but this might also be a way:
DEFINE !connect_string () 'DSN=MS Access Database;DBQ=c:\temp\temp.mdb;DriverId=25;FIL=MS Access;MaxBufferSize=2048;PageTimeout=5;' !ENDDEFINE. SAVE TRANSLATE /TYPE=ODBC /CONNECT=!connect_string /TABLE=tbl1 /REPLACE. SAVE TRANSLATE /TYPE=ODBC /CONNECT=!connect_string /TABLE=tbl2 /REPLACE /SQL = "SELECT * FROM tbl1, tbl2 INTO tblCartesian". GET DATA... Regards, Albert-Jan ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
In reply to this post by fuigb
There've been a couple of solutions posted -- see David Marso's
MATRIX logic(1), and Albert-Jan Roskam's suggestion of using an external SQL database(2). I think the long-wide-long approach will still prove useful. Among other things, it can be made to work very much like MATCH FILES. And the output retains labels, formats, and missing values of all variables; I don't think MATRIX keeps those, and external SQL certainly won't. (1) For MATRIX logic, see Date: Fri, 31 Jan 2014 10:06:28 -0800 From: David Marso <[hidden email]> Subject: Re: Many-to-many merge in SPSS To: [hidden email] (2) For SQL logic, see Date: Sat, 1 Feb 2014 11:40:31 -0800 From: Albert-Jan Roskam <[hidden email]> Subject: Re: Many-to-many merge in SPSS To: [hidden email] Here is the long-wide-long method cleaned up and wrapped in a macro. I'm testing it, before posting, but I'd be grateful for anybody else trying it as well. Call: !M2MnyX Left =<left input>/ Right =<right input>/ BYvar =<variables>/ WrkFile=<file>. (The 'X' suffix on the name indicates that this version uses XSAVE logic.) Parameters (all are required) are, Left Name of an existing SPSS dataset; or, name or file handle of an SPSS .SAV file Right Name of a second SPSS dataset or .SAV file, as above BYvar Names of one or more variables occurring in both files. Requirements are as for BY variables in MATCH FILES, except that it need not be a unique key in either input. Every record in each input will be joined to every record having the same BYvar value in the other. WrkFile Name or file handle of a disk file (NOT a dataset) for working storage. Other requirements: - Inputs 'Left' and 'Right', and variable(s) 'BYvar' must be compatible as for MATCH FILES: . The BYvar variable or variables must occur in both inputs, and be of the same type and length in both. . Both inputs must be in ascending order on the BYvar variable(s). (However, group key values need not be unique record identifiers in either input.) . If any other variable occurs in both, it must have the same type and length in both, AND ITS VALUE FROM THE RIGHT INPUT WILL BE LOST. - SPSS must be able to write to the work file. Result: + The left and right inputs will be unaltered. + A new dataset 'Joined' will be created, and will be the active dataset. It will have one record for every pair of records from the 'left' and 'right' inputs that match on the group key. Order of variables will be, . The 'BYvar' variable(s) . Variable 'Left#'. (This variable numbers the records in the left input, sequentially within each BYvar value.) . The non-key variables from the left input . The non-key variables from the right input + The records will be in ascending order by the BYvar variables. Within a BYvar value, they will be in the same order as in the left-hand input. Records from the same left-hand input record will be in the same order as in their right-hand input records. + No value of the group key will occur in this file unless it occurs in both inputs. (In database terminology, it is an 'inner join'.) + The macro will print certain messages, including echoing its parameters. Side effects: . The working file will be left on disk. It may be erased if desired. A pre-an existing file by that name will be lost. . Datasets '@LftIDd' and '@LftWide' will be created; they may be erased. Pre-existing datasets named '@LftIDd', '@LftWide', or 'Joined' will . be lost. . Pre-existing variables named 'Left#', '@NewGrp', and '@MxLkey' will be lost. Glitch: . The dataset 'Joined' will be tagged as being an image of the working file; on closing SPSS, you'll be asked if you want to save it back to the working-file name and location. (Later, code to erase the working file and scratch datasets may be added, and the ECHO messages dropped.) APPENDICES FOLLOW: I. Definition of macro !M2MnyX, with instructions in comments II. Output from demonstration run III. Test data and code, for demonstration run ================================================ Appendix I: Definition of macro !M2MnyX ================================================ * **************************************************************** . * Macro for many-to-many merge of two datasets or .SAV files, . * using XSAVE logic. . * Takes four arguments, all required, all terminated by '/': . * Left Name of one of the datasets, or saved file, to be joined . * Right Name of the other dataset, or saved file, to be joined . * BYvar Variable or variables on which the two are to be joined.. * Requirements are as for BY variables in MATCH FILES: . * all variables must be present and of the same type in . * both inputs; and both must be sorted in ascending order . * on the BYvar variables. . * WrkFile Name or handle of disk file in which to 'unroll' the . * keys in Left, joined to the data in Right. . * **************************************************************** . DEFINE !M2MnyX(Left = !CHAREND('/') /Right = !CHAREND('/') /BYvar = !CHAREND('/') /WrkFile = !CHAREND('/')) . ECHO ' '. . ECHO 'Arguments to <!>M2MnyX1:'. . ECHO !QUOTE(!CONCAT('"Left" input: ',!Left)). . ECHO !QUOTE(!CONCAT('"Right" input: ',!Right)). . ECHO !QUOTE(!CONCAT('Key variable(s): ',!BYvar)). . ECHO !QUOTE(!CONCAT('Working file: ',!WrkFile)). . ECHO 'After the macro finishes execution, datasets'. . ECHO '"@LftIDd" and "@LftWide" may be erased.'. * ... I. Add 'Left#', record number within <!>BYvar groups, . * to Left-hand file. . ADD FILES /FILE =!Left /BY !BYvar /FIRST= @NewGrp. NUMERIC Left# (F3). DO IF @NewGrp. . COMPUTE Left# = 1. ELSE. . COMPUTE Left# = LAG(Left#) + 1. END IF. DATASET NAME @LftIDd WINDOW=FRONT. * ... II. Restructure Left input from 'long' to 'wide', . * /* . Keep only the group key (!BYvar) and record . * sequence number within group (Left#). . * . Represent the set of values of Left# by the . * largest value. Since the values of Left# are . * consecutive integers starting with 1, the . * maximum value characterizes the set. . DATASET ACTIVATE @LftIDd WINDOW=ASIS. DATASET DECLARE @LftWide. AGGREGATE OUTFILE=@LftWide /BREAK = !BYvar /@MxLkey 'Largest value of Left# for this key' = MAX(Left#). * ... III. Merge 'wide' Left input with Right input . * This is a one-to-many merge. . MATCH FILES /TABLE=@LftWide /FILE =!Right /BY !BYvar. * ... IV.A Return to a 'long' structure, with all Left . * records joined with all Right records. . * The result is the complete many-to-many merged file, except . * for the substantive data from the Left input. . NUMERIC Left# (F3). LOOP Left# = 1 TO @MxLkey. . XSAVE OUTFILE=!WrkFile /DROP=@MxLKey /KEEP=!BYvar Left# ALL. END LOOP. EXECUTE /* required */. * ... IV.B. ... and go to 'Left file first' order. . GET FILE=!WrkFile. SORT CASES BY !BYvar Left#. * ... V. Re-merge with !Left, adding the data, completing . * the join. . MATCH FILES /TABLE= @LftIDd /FILE = * /BY !BYvar Left# /DROP = @NewGrp /KEEP = !BYvar Left# ALL. DATASET NAME Joined WINDOW=FRONT. !ENDDEFINE. ================================================ Appendix II. Output from demonstration run ================================================ DATASET ACTIVATE Schools WINDOW=FRONT. LIST. List |-----------------------------|---------------------------| |Output Created |03-FEB-2014 21:32:26 | |-----------------------------|---------------------------| [Schools] Category School Art RI School of Design Art Metropolitan Museum Culinary Johnson & Wales Liberal Arts Oberlin Liberal Arts Antioch Liberal Arts Swarthmore Technical MIT Technical Cal Tech Technical ETH Number of cases read: 9 Number of cases listed: 9 DATASET ACTIVATE Subjects WINDOW=FRONT. LIST. List |-----------------------------|---------------------------| |Output Created |03-FEB-2014 21:32:26 | |-----------------------------|---------------------------| [Subjects] Category Subject Art Painting Art Drawing Art Sculpture Liberal Arts Literature Liberal Arts Philosophy Technical Calculus Technical Physics Technical Chemistry Technical Engineering Number of cases read: 9 Number of cases listed: 9 !M2MnyX Left = Schools/ Right = Subjects/ BYvar = Category/ WrkFile = Unroll. Arguments to <!>M2MnyX1: "Left" input: Schools "Right" input: Subjects Key variable(s): Category Working file: Unroll After the macro finishes execution, datasets "@LftIDd" and "@LftWide" may be erased. LIST. List |-----------------------------|---------------------------| |Output Created |03-FEB-2014 21:32:27 | |-----------------------------|---------------------------| [Joined] C:\Documents and Settings\Richard\My Documents \Temporary\SPSS\ 2014-xx-xx Ristow - Many-to-many merge in SPSS - Scratch1.SAV Category Left# School Subject Art 1 RI School of Design Painting Art 1 RI School of Design Drawing Art 1 RI School of Design Sculpture Art 2 Metropolitan Museum Painting Art 2 Metropolitan Museum Drawing Art 2 Metropolitan Museum Sculpture Liberal Arts 1 Oberlin Literature Liberal Arts 1 Oberlin Philosophy Liberal Arts 2 Antioch Literature Liberal Arts 2 Antioch Philosophy Liberal Arts 3 Swarthmore Literature Liberal Arts 3 Swarthmore Philosophy Technical 1 MIT Calculus Technical 1 MIT Physics Technical 1 MIT Chemistry Technical 1 MIT Engineering Technical 2 Cal Tech Calculus Technical 2 Cal Tech Physics Technical 2 Cal Tech Chemistry Technical 2 Cal Tech Engineering Technical 3 ETH Calculus Technical 3 ETH Physics Technical 3 ETH Chemistry Technical 3 ETH Engineering Number of cases read: 24 Number of cases listed: 24 ================================================ Appendix III. Test data and code, for demonstration run III.A Test data and driver code ================================================ * C:\Documents and Settings\Richard\My Documents . * \Technical\spssx-l\Z-2014\ . * 2014-02-03 Ristow - Many-to-many merge in SPSS.SPS . * Under name 2014-xx-xx Ristow-Many-to-many merge in SPSS.SPS: . * 31 Jan 2014 - Create, from file . * C:\Documents and Settings\Richard\My Documents . * \Technical\spssx-l\Z-2009d\ . * 2009-11-24 Ristow - Many to many merge in SPSS.SPS . * which was prepared for, and included in, posting . * Date: Thu, 26 Nov 2009 14:34:54 -0500 . * From: Richard Ristow <[hidden email]> . * Subject: Many-to-many merge in SPSS . * To: [hidden email] . * Projected clean-up ('+' = done, '.' = to do): . * + Name input datasets '@Left' and '@Right', instead of . * 'File1' and 'File2' . * + Use ADD FILES when creating record sequence numbers for . * '@Left'; this allows group keys with several variables. . * + Don't add sequence numbers for file @'Right'; they aren't . * needed . * + Wrap in a macro, initially un-parameterized. . * 1 Feb 2014 - Complete the renaming of files and variables, . * and create first version of the macro, with no parameters. . * Remove it to file M2MnyX0.SPS, put the test data here and . * set up for test. . * Under name 2014-xx-xx Ristow - Many-to-many merge in SPSS.SPS: . * Save under this name, with spaces around the hyphen following . * the author's name. Looks better, matches earlier practice . * Test with macro <!>M2MnyX0, taking no arguments. . * 3 Feb 2014 - Drop creation of datasets @Left and @Right, which . * were for testing open code or the macro without parameters. . * Set up to work with production macro <!>M2MnyX. . * Under name 2014-02-03 Ristow - Many-to-many merge in SPSS.SPS: . * ................. Directory holding code ..................... . FILE HANDLE CodePlace NAME='C:\Documents and Settings\Richard\My Documents' + '\Technical\spssx-l\Z-2014'. * ................. Working file, required for XSAVE logic ..... . FILE HANDLE Unroll /NAME='C:\Documents and Settings\Richard\My Documents' + '\Temporary\SPSS\' + '2014-xx-xx Ristow - Many-to-many merge in SPSS' + ' - ' + 'Scratch1.SAV'. * ................. Test data ..................... . DATA LIST FIXED /Category 04-16 (A) School 17-38 (A). BEGIN DATA Art RI School of Design Art Metropolitan Museum Culinary Johnson & Wales Liberal Arts Oberlin Liberal Arts Antioch Liberal Arts Swarthmore Technical MIT Technical Cal Tech Technical ETH END DATA. DATASET NAME Schools. . /*-- LIST /*-*/. DATA LIST FIXED /Category 04-16 (A) Subject 17-28 (A). BEGIN DATA Art Painting Art Drawing Art Sculpture Liberal Arts Literature Liberal Arts Philosophy Technical Calculus Technical Physics Technical Chemistry Technical Engineering END DATA. DATASET NAME Subjects. . /*-- LIST /*-*/. * Load definition of <!>M2MnyX: . INSERT FILE='CodePlace\M2MnyX1.SPS'. DATASET ACTIVATE Schools WINDOW=FRONT. LIST. DATASET ACTIVATE Subjects WINDOW=FRONT. LIST. !M2MnyX Left = Schools/ Right = Subjects/ BYvar = Category/ WrkFile = Unroll. LIST. ================================================ III.B Macro definition file ================================================ * C:\Documents and Settings\Richard\My Documents . * \Technical\spssx-l\Z-2014\ . * M2MnyX1.SPS . * For insertion in 2014-xx-xx Ristow-Many-to-many merge in SPSS.SPS . * or successor (2014-02-03 Ristow-Many-to-many merge in SPSS.SPS) . * This is a production form of the macro with XSAVE logic. . * **************************************************************** . * Macro for many-to-many merge of two datasets or .SAV files, . * using XSAVE logic. . * Takes four arguments, all required, all terminated by '/': . * Left Name of one of the datasets, or saved file, to be joined . * Right Name of the other dataset, or saved file, to be joined . * BYvar Variable or variables on which the two are to be joined.. * Requirements are as for BY variables in MATCH FILES: . * all variables must be present and of the same type in . * both inputs; and both must be sorted in ascending order . * on the BYvar variables. . * WrkFile Name or handle of disk file in which to 'unroll' the . * keys in Left, joined to the data in Right. . * **************************************************************** . DEFINE !M2MnyX(Left = !CHAREND('/') /Right = !CHAREND('/') /BYvar = !CHAREND('/') /WrkFile = !CHAREND('/')) . ECHO ' '. . ECHO 'Arguments to <!>M2MnyX1:'. . ECHO !QUOTE(!CONCAT('"Left" input: ',!Left)). . ECHO !QUOTE(!CONCAT('"Right" input: ',!Right)). . ECHO !QUOTE(!CONCAT('Key variable(s): ',!BYvar)). . ECHO !QUOTE(!CONCAT('Working file: ',!WrkFile)). . ECHO 'After the macro finishes execution, datasets'. . ECHO '"@LftIDd" and "@LftWide" may be erased.'. * ... I. Add 'Left#', record number within <!>BYvar groups, . * to Left-hand file. . ADD FILES /FILE =!Left /BY !BYvar /FIRST= @NewGrp. NUMERIC Left# (F3). DO IF @NewGrp. . COMPUTE Left# = 1. ELSE. . COMPUTE Left# = LAG(Left#) + 1. END IF. DATASET NAME @LftIDd WINDOW=FRONT. * ... II. Restructure Left input from 'long' to 'wide', . * /* . Keep only the group key (!BYvar) and record . * sequence number within group (Left#). . * . Represent the set of values of Left# by the . * largest value. Since the values of Left# are . * consecutive integers starting with 1, the . * maximum value characterizes the set. . DATASET ACTIVATE @LftIDd WINDOW=ASIS. DATASET DECLARE @LftWide. AGGREGATE OUTFILE=@LftWide /BREAK = !BYvar /@MxLkey 'Largest value of Left# for this key' = MAX(Left#). * ... III. Merge 'wide' Left input with Right input . * This is a one-to-many merge. . MATCH FILES /TABLE=@LftWide /FILE =!Right /BY !BYvar. * ... IV.A Return to a 'long' structure, with all Left . * records joined with all Right records. . * The result is the complete many-to-many merged file, except . * for the substantive data from the Left input. . NUMERIC Left# (F3). LOOP Left# = 1 TO @MxLkey. . XSAVE OUTFILE=!WrkFile /DROP=@MxLKey /KEEP=!BYvar Left# ALL. END LOOP. EXECUTE /* required */. * ... IV.B. ... and go to 'Left file first' order. . GET FILE=!WrkFile. SORT CASES BY !BYvar Left#. * ... V. Re-merge with !Left, adding the data, completing . * the join. . MATCH FILES /TABLE= @LftIDd /FILE = * /BY !BYvar Left# /DROP = @NewGrp /KEEP = !BYvar Left# ALL. DATASET NAME Joined WINDOW=FRONT. !ENDDEFINE. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Thanks for posting this Richard, although in my opinion saving the metadata is not a big deal - APPLY DICTIONARY could rectify that pretty easily.
|
Free forum by Nabble | Edit this page |