Since I just wrote to Peter Spangler that I'd posted a macro that can
do Cartesian product, I thought I'd better post the latest revision, though it isn't fully tested. This version incorporates David Marso's code using MATRIX to 'unroll' the records containing pairs of maximum sequence values from the two inputs, to the set of all pairs of sequence values thereby represented. It will, therefore, handle a much larger Cartesian product than the earlier version I posted, though I don't give it much chance of handling the 1.5 giga-record file Peter Spangler seems to be needing. * **************************************************************** . * Macro for many-to-many merge of two datasets or .SAV files. . * Richard Ristow, March 2014 . * Contact: [hidden email] . * . * The technique of representing a set of consecutive records in a . * file by the largest value of a sequence number was introducted to . * the author by David Marso. David also wrote code using MATRIX to . * 'unroll' a record containing a pair of such numbers into the set . * of all pairs of sequence numbers represented, which has been . * adapted here. . * ================================================================= . * . * All arguments are terminated by '/'. . * . * Required arguments: . * LeftIn Name of one of the datasets, or saved files, to be joined. * (referred to as the "Left input") . * RightIn Name of the other dataset, or saved file, to be joined. * (referred to as the "Right input") . * . * Optional arguments for the join operation: . * BYvar Variable or variables on which the inputs are to be . * joined; if not specified, the result will be a Cartesian . * product, matching all pairs of records from the inputs. . * Requirements are as for BY variables in MATCH FILES: . * all variables must be present and of the same type in . * both inputs; and both inputs must be in ascending order . * on the BYvar variable(s). . * Join Type of join (relevant only if BYvar is specified): . * Outer (default), LeftOuter, RightOuter, or Inner; . * Outer Output includes contents of all records . * from both inputs . * Left Outer Output includes contents of all records . * or LeftOuter from the Left input, but only records from . * or LOuter the Right input that match some left-input . * record . * Right Outer Output includes contents of all records . * or RightOuter from the Right input, but only records from . * or ROuter the Left input that match some right-input . * record . * Inner Output contains only records from each . * input that match some record in the other . * Left# Name of variable to hold sequence numbers of Left input . * records, within values of BYvar. Name should not occur . * in either input file. (default: Left#) . * Right# Name of variable to hold sequence numbers of Right input . * records, within values of BYvar. Name should not occur . * in either input file. (default: Right#) . * . * Optional arguments for the inputs: . * LRename 'RENAME' to be applied to the Left input . * LDrop 'DROP' to be applied to the Left input . * LKeep 'KEEP' to be applied to the Left input (default: ALL) . * RRename 'RENAME' to be applied to the Right input . * RDrop 'DROP' to be applied to the Right input . * RKeep 'KEEP' to be applied to the Right input (default: ALL) . * . * Optional argument for the output: . * DSname Name of the output, joined dataset. The joined datasets . * will always be active, but will be unnamed unless . * DSname is specified. . * ................................................................ . * Reserved names: . * . * By default, variables Left# and Right# are created, and retained . * in the output. (These variables may be given other names by . * specifying macro arguments Left#= and Right#=) . * . * Variables @Mny2Mny_NoBreak, @Mny2Mny_NewGrp . * @Mny2Mny_LMaxRec @Mny2Mny_RMaxRec . * @Mny2Mny_Grp# @Mny2Mny_GrpRecs . * are created, used, and discarded. . * . * Datasets @Mny2Mny_Left#d @Mny2Mny_Right#d . * @Mny2Mny_Left## @Mny2Mny_GrpSmry . * are created, used, and discarded. . * **************************************************************** . DEFINE !Mny2Mny(LeftIn = !CHAREND('/') /RightIn = !CHAREND('/') /BYvar = !CHAREND('/') /LRename = !CHAREND('/') /LDrop = !CHAREND('/') /LKeep = !CHAREND('/') !DEFAULT(ALL) /RRename = !CHAREND('/') /RDrop = !CHAREND('/') /RKeep = !CHAREND('/') !DEFAULT(ALL) /Join = !CHAREND('/') !DEFAULT(Outer) /Left# = !CHAREND('/') !DEFAULT(Left#) /Right# = !CHAREND('/') !DEFAULT(Right#) /DSname = !CHAREND('/')) ECHO " ". ECHO !QUOTE(!CONCAT('Macro !','Mny2Mny:')). * ....... Section I: Input parsing and error checking ....... . !LET !Error = No * ....... I.A: Required inputs ....... . !IF ( !LeftIn !EQ !NULL !OR !RightIn !EQ !NULL) !THEN !LET !Error = Yes ECHO "Arguments LeftIn and RightIn (the two inputs)". ECHO " are required.". !IF (!LeftIn !NE !NULL) !THEN ECHO !QUOTE(!CONCAT(" (Left input specified: ", !LeftIn,')')). !IFEND !IF (!RightIn !NE !NULL) !THEN ECHO !QUOTE(!CONCAT(" (Right input specified: ", !RightIn,')')). !IFEND !IFEND !IF ( !RightIn !EQ '*') !THEN !LET !Error = Yes ECHO "The Right input may not be '*'". !IFEND * ....... I.B: One-token inputs ....... . !IF (!TAIL(!Left#) !NE !NULL) !THEN !LET !Error = Yes ECHO "'Left#', if specified, must be a single token". ECHO !QUOTE(!CONCAT('(Entered: ','Left#,')')). !IFEND !IF (!TAIL(!Right) !NE !NULL) !THEN !LET !Error = Yes ECHO "'Right#', if specified, must be a single token". ECHO !QUOTE(!CONCAT('(Entered: ','Right#,')')). !IFEND !IF (!TAIL(!DSname) !NE !NULL) !THEN !LET !Error = Yes ECHO "'DSname', if specified, must be a single token". ECHO !QUOTE(!CONCAT('(Entered: ',!DSname,')')). !IFEND * ....... I.C: Join type ....... . !IF (!UPCASE(!Join) !EQ 'OUTER') !THEN !LET !JType = 'OUTER' !LET !JMsg = 'Outer join' !IFEND !IF (!UPCASE(!Join) !EQ 'LEFT OUTER') !THEN !LET !JType = 'LEFTOUT' !LET !JMsg = 'Left outer join' !IFEND !IF (!UPCASE(!Join) !EQ 'LEFTOUTER') !THEN !LET !JType = 'LEFTOUT' !LET !JMsg = 'Left outer join' !IFEND !IF (!UPCASE(!Join) !EQ 'LOUTER') !THEN !LET !JType = 'LEFTOUT' !LET !JMsg = 'Left outer join' !IFEND !IF (!UPCASE(!Join) !EQ 'RIGHT OUTER') !THEN !LET !JType = 'RIGHTOUT' !LET !JMsg = 'Right outer join' !IFEND !IF (!UPCASE(!Join) !EQ 'RIGHTOUTER') !THEN !LET !JType = 'RIGHTOUT' !LET !JMsg = 'Right outer join' !IFEND !IF (!UPCASE(!Join) !EQ 'ROUTER') !THEN !LET !JType = 'RIGHTOUT' !LET !JMsg = 'Right outer join' !IFEND !IF (!UPCASE(!Join) !EQ 'INNER') !THEN !LET !JType = 'INNER' !LET !JMsg = 'Inner join' !IFEND !IF (!BYvar !EQ !NULL) !THEN !LET !JType = 'CARTPROD' !LET !JMsg = 'Cartesian product' !IFEND !IF (!JType !EQ !NULL) !THEN !LET !Error = Yes ECHO "Invalid join type:". ECHO !QUOTE(!CONCAT(' Join=',!Join)). !IFEND * ....... Section II: Informational messages ........ . !IF (!Error !EQ No) !THEN ECHO !QUOTE(!CONCAT(!JMsg,' of')). ECHO !QUOTE(!CONCAT(' ',!LeftIn)). !IF ( !LRename !NE !NULL) !THEN ECHO !QUOTE(!CONCAT(' /RENAME=',!LRename)) !IFEND !IF ( !LDrop !NE !NULL) !THEN ECHO !QUOTE(!CONCAT(' /DROP =',!LDrop)) !IFEND !IF (!UPCASE(!LKeep) !NE 'ALL') !THEN ECHO !QUOTE(!CONCAT(' /KEEP =',!BYVar,' ',!LKeep)) !IFEND ECHO !QUOTE(!CONCAT('to ',!RightIn)). !IF ( !RRename !NE !NULL) !THEN ECHO !QUOTE(!CONCAT(' /RENAME=',!RRename)) !IFEND !IF ( !RDrop !NE !NULL) !THEN ECHO !QUOTE(!CONCAT(' /DROP =',!RDrop)) !IFEND !IF (!UPCASE(!RKeep) !NE 'ALL') !THEN ECHO !QUOTE(!CONCAT(' /KEEP =',!BYVar,' ',!RKeep)) !IFEND !IF (!BYvar !NE !NULL) !THEN ECHO !QUOTE(!CONCAT('by ',!BYvar)). !IFEND !IF (!DSname !NE !NULL) !THEN ECHO !QUOTE(!CONCAT('Output to dataset ',!DSname)). !IFEND !IFEND * ....... Section III: Processing ....... . * ....... III.A Sequence-numbered copies of the inputs ....... . * ....... III.A.1 Left input -> dataset @Mny2Mny_Left#d ....... . !IF (!Error !EQ No) !THEN !IF (!BYvar !EQ !NULL) !THEN * ... Left input, when there are no BYvar vars ... . ADD FILES /FILE=!Leftin !IF (!LRename !NE !NULL) !THEN /RENAME = !LRename !IFEND !IF (!LDrop !NE !NULL) !THEN /DROP = !LDrop !IFEND /KEEP = !BYvar !LKeep. COMPUTE @Mny2Mny_NewGrp = ($CASENUM EQ 1). COMPUTE !Left# = $CASENUM. !ELSE * ... Left input, when there are BYvar vars ... . ADD FILES /FILE=!Leftin !IF (!LRename !NE !NULL) !THEN /RENAME = !LRename !IFEND !IF (!LDrop !NE !NULL) !THEN /DROP = !LDrop !IFEND /KEEP = !BYvar !LKeep /BY !BYvar /FIRST = @Mny2Mny_NewGrp. DO IF @Mny2Mny_NewGrp. . COMPUTE !Left# = 1. ELSE. . COMPUTE !Left# = LAG(!Left#) + 1. END IF. !IFEND * ... Left input, code common to both cases ... . COMPUTE @Mny2Mny_NoBreak = 1. FORMATS @Mny2Mny_NoBreak @Mny2Mny_NewGrp (F2). FORMATS !Left# (F4). VARIABLE WIDTH !Left# (6). DATASET NAME @Mny2Mny_Left#d WINDOW=ASIS. * ... End of processing Left input . /*-- LIST /*-*/. * ....... III.A.2 Right input -> dataset @Mny2Mny_Right#d ....... . !IF (!BYvar !EQ !NULL) !THEN * ... Right input, when there are no BYvar vars ... . ADD FILES /FILE=!RightIn !IF (!RRename !NE !NULL) !THEN /RENAME = !RRename !IFEND !IF (!RDrop !NE !NULL) !THEN /DROP = !RDrop !IFEND /KEEP = !BYvar !RKeep. COMPUTE @Mny2Mny_NewGrp = ($CASENUM EQ 1). COMPUTE !Right# = $CASENUM. !ELSE * ... Right input, when there are BYvar vars ... . ADD FILES /FILE=!RightIn !IF (!RRename !NE !NULL) !THEN /RENAME = !RRename !IFEND !IF (!RDrop !NE !NULL) !THEN /DROP = !RDrop !IFEND /KEEP = !BYvar !RKeep /BY !BYvar /FIRST = @Mny2Mny_NewGrp. DO IF @Mny2Mny_NewGrp. . COMPUTE !Right# = 1. ELSE. . COMPUTE !Right# = LAG(!Right#) + 1. END IF. !IFEND * ... Right input, code common to both cases ... . COMPUTE @Mny2Mny_NoBreak = 1. FORMATS @Mny2Mny_NoBreak @Mny2Mny_NewGrp (F2). FORMATS !Right# (F4). VARIABLE WIDTH !Right# (6). DATASET NAME @Mny2Mny_Right#d WINDOW=ASIS. * ... End of processing Right input . /*-- LIST /*-*/. * ....... III.B Build 'spine' of all keys for all records ....... . * ....... III.B.1 Combine record counts from both inputs ....... . DATASET ACTIVATE @Mny2Mny_Left#d WINDOW=ASIS. DATASET DECLARE @Mny2Mny_Left##. AGGREGATE OUTFILE=@Mny2Mny_Left## /BREAK = @Mny2Mny_NoBreak !BYvar /@Mny2Mny_LMaxRec = MAX(!Left#). DATASET ACTIVATE @Mny2Mny_Right#d WINDOW=ASIS. AGGREGATE OUTFILE=* /BREAK = @Mny2Mny_NoBreak !BYvar /@Mny2Mny_RMaxRec = MAX(!Right#). MATCH FILES /FILE = @Mny2Mny_Left## /FILE = * /BY @Mny2Mny_NoBreak !BYvar. RECODE @Mny2Mny_LMaxRec @Mny2Mny_RMaxRec (MISSING=0). DATASET NAME @Mny2Mny_GrpSmry. /*-- LIST /*-*/. * ....... III.B.2 Select, based on join type ....... . !IF (!Jtype !EQ 'INNER') !THEN SELECT IF (@Mny2Mny_LMaxRec GT 0 AND @Mny2Mny_RMaxRec GT 0). !IFEND !IF (!Jtype !EQ 'LEFTOUT') !THEN SELECT IF (@Mny2Mny_RMaxRec GT 0). !IFEND !IF (!Jtype !EQ 'RIGHTOUT') !THEN SELECT IF (@Mny2Mny_LMaxRec GT 0). !IFEND * ..................................... . * For OUTER join and Cartesian product, . * no selection is necessary. . * ..................................... . /*-- LIST /*-*/. * ....... III.B.3 Expand the counts to get list of ........ . * pairs of record numbers from the ........ . * groups, by group number ........ . COMPUTE @Mny2Mny_Grp# = $CASENUM. COMPUTE @Mny2Mny_GrpRecs = MAX(@Mny2Mny_LMaxRec,1) *MAX(@Mny2Mny_RMaxRec,1). * ....... MATRIX code by David Marso, adapted here. ........ . * ... OMS /SELECT TABLES /IF COMMANDS = ["Matrix"] SUBTYPES = ["Notes"] /DESTINATION VIEWER = NO /TAG = "ZapMATRIX". MATRIX. + GET GrpSmry /FILE=* /VARIABLES=@Mny2Mny_Grp# @Mny2Mny_LMaxRec @Mny2Mny_RMaxRec @Mny2Mny_GrpRecs /NAMES=GSnames. + LOOP #=1 TO NROW(GrpSmry). + COMPUTE LeftNos=T({1:MMAX({GrpSmry(#,2),1})}). + DO IF (GrpSmry(#,2) EQ 0). + COMPUTE LeftNos = 0*LeftNos. + END IF. + COMPUTE RightNos=T({1:MMAX({GrpSmry(#,3),1})}). + DO IF (GrpSmry(#,3) EQ 0). + COMPUTE RightNos = 0*RightNos. + END IF. + SAVE ({MAKE(GrpSmry(#,4),1,GrpSmry(#,1)), KRONEKER(MAKE(NROW(RightNos),1,1),LeftNos), KRONEKER(RightNos,MAKE(NROW(LeftNos),1,1)) }) / OUTFILE * / VARIABLES @Mny2Mny_Grp# !Left# !Right#. + END LOOP. END MATRIX. FORMATS @Mny2Mny_Grp# !Left# !Right# (F4). VARIABLE WIDTH !Left# !Right# (6). /*-- LIST /*-*/. * ....... III.B.4 Combine with summary file by group key ........ . * to get the 'spine': list of all pairs ........ . * of record numbers by group key. ........ . MATCH FILES /TABLE=@Mny2Mny_GrpSmry /FILE =* /BY @Mny2Mny_Grp#. /*-- DATASET COPY Spine /*-*/. * ....... III.C Add the substantive data ....... . MATCH FILES /FILE =* /TABLE=@Mny2Mny_Right#d /BY !BYvar !Right# /DROP @Mny2Mny_Grp# @Mny2Mny_GrpRecs @Mny2Mny_LMaxRec @Mny2Mny_RMaxRec @Mny2Mny_NoBreak @Mny2Mny_NewGrp. SORT CASES BY !BYvar !Left# !Right#. MATCH FILES /TABLE=@Mny2Mny_Left#d /FILE =* /BY !BYvar !Left# /DROP @Mny2Mny_NoBreak @Mny2Mny_NewGrp /KEEP !BYvar !Left# !Right# ALL. EXECUTE /* so input datasets can be dropped */. !IF (!DSname !NE !NULL) !THEN DATASET NAME !DSname WINDOW=FRONT. !IFEND DATASET CLOSE @Mny2Mny_Left#d. DATASET CLOSE @Mny2Mny_Right#d. DATASET CLOSE @Mny2Mny_Left##. DATASET CLOSE @Mny2Mny_GrpSmry. !IFEND !ENDDEFINE. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
IMO posting code here on Nabble is not a great solution. Ignoring whether Nabble will be around forever or not. I would suggest posting a link to the code somewhere, and there you can just update the code so people following the same link will always get the most up to date version.
There are alot of options for posting the code online. I frequently go the easiest route and just share a link to my dropbox account, but I've used google code previous. The SPSS developerworks site you can upload code chunks. Although most (all?) examples are SPD and SPE files, you can upload anything. See here I just uploaded a chart template to my account. |
There are traditional code tools both from
IBM and independent authors posted on the SPSS Community site (www.ibm.com/developerworks/spssdevcentral)
mostly in the Statistical Tools collection (https://www.ibm.com/developerworks/community/files/app?lang=en#/collection/bbe88aaf-f3cd-466a-83fb-592d48eecb1c).
Sorry for the long url, but you can easily navigate to the collection
from the first one.
We welcome user contributions. See the Sharing topic on the first page of the site for details on how to do this. Contributors retain full ownership and control of their submissions subject to the developerWorks Terms of Use. Jon Peck (no "h") aka Kim Senior Software Engineer, IBM [hidden email] phone: 720-342-5621 From: Andy W <[hidden email]> To: [hidden email], Date: 03/20/2014 06:05 AM Subject: Re: [SPSSX-L] Macro !Mny2Mny (joins and Cartesian products), v.2 Sent by: "SPSSX(r) Discussion" <[hidden email]> IMO posting code here on Nabble is not a great solution. Ignoring whether Nabble will be around forever or not. I would suggest posting a link to the code somewhere, and there you can just update the code so people following the same link will always get the most up to date version. There are alot of options for posting the code online. I frequently go the easiest route and just share a link to my dropbox account, but I've used google code previous. The SPSS developerworks site you can upload code chunks. Although most (all?) examples are SPD and SPE files, you can upload anything. See here <https://www.ibm.com/developerworks/community/files/app?lang=en#/file/8937d380-f86e-4790-b871-dd14ce6e50d7> I just uploaded a chart template to my account. ----- Andy W [hidden email] http://andrewpwheeler.wordpress.com/ -- View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Macro-Mny2Mny-joins-and-Cartesian-products-v-2-tp5724939p5724943.html Sent from the SPSSX Discussion mailing list archive at Nabble.com. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Administrator
|
In reply to this post by Richard Ristow
Rather than create a Cartesian product of 1800 x 800000, consider that
1. the 800,000 may not be unique values so substantial reduction here via aggregation. 2. Of the reduced set consider the number of unique words in the entire set as well as the 1800. 3. If one treats a word as the unit of analysis then one needs only consider cases in which one or more words found in the 1800 match with those found in the reduced (aggregated) 800000. 4. Since the MATRIX code builds the KP for a given row (word) of the input data and saves it within the loop, one will never have to hold the entire possibly huge KP of 800000 x 1800. Leaving it to P. Spangler to ponder and write necessary code. <SNIP macro code>.
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
Thank you all for this valuable conversation and suggestions. I am beginning to see there are better ways of conceptualizing the analysis and much to learn!
Sent from my iPhone > On Mar 20, 2014, at 6:38 AM, David Marso <[hidden email]> wrote: > > Rather than create a Cartesian product of 1800 x 800000, consider that > 1. the 800,000 may not be unique values so substantial reduction here via > aggregation. > 2. Of the reduced set consider the number of unique words in the entire set > as well as the 1800. > 3. If one treats a word as the unit of analysis then one needs only consider > cases in which one or more words found in the 1800 match with those found in > the reduced (aggregated) 800000. > 4. Since the MATRIX code builds the KP for a given row (word) of the input > data and saves it within the loop, one will never have to hold the entire > possibly huge KP of 800000 x 1800. > Leaving it to P. Spangler to ponder and write necessary code. > > > Richard Ristow wrote >> Since I just wrote to Peter Spangler that I'd posted a macro that can >> do Cartesian product, I thought I'd better post the latest revision, >> though it isn't fully tested. >> >> This version incorporates David Marso's code using MATRIX to 'unroll' >> the records containing pairs of maximum sequence values from the two >> inputs, to the set of all pairs of sequence values thereby >> represented. It will, therefore, handle a much larger Cartesian >> product than the earlier version I posted, > * >> though I don't give it >> much chance of handling the 1.5 giga-record file Peter Spangler seems >> to be needing. > * > <SNIP macro code>. > > > > > ----- > Please reply to the list and not to my personal email. > Those desiring my consulting or training services please feel free to email me. > --- > "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." > Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" > -- > View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Macro-Mny2Mny-joins-and-Cartesian-products-v-2-tp5724939p5724946.html > Sent from the SPSSX Discussion mailing list archive at Nabble.com. > > ===================== > To manage your subscription to SPSSX-L, send a message to > [hidden email] (not to SPSSX-L), with no body text except the > command. To leave the list, send the command > SIGNOFF SPSSX-L > For a list of commands to manage subscriptions, send the command > INFO REFCARD -- This email may contain confidential information for the sole use of the intended recipient(s). If you are not an intended recipient, please notify the sender and delete all copies immediately. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Free forum by Nabble | Edit this page |