Macro !Mny2Mny, for many-to-many merge and Cartesian product

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Macro !Mny2Mny, for many-to-many merge and Cartesian product

Richard Ristow
A following posting will give the definition of macro !Mny2Mny, an
improved general-use macro for many-to-many merges ('joins', in
database terms) and Cartesian products. Use:

!Mny2Mny LeftIn={'file'/}
                 {dataset/}
                 {*/}
         [LRename=(old varnames=new varnames).../]
         [LDrop=varlist/]
         [LKeep={ALL** }/]
          RightIn={'file'/}
                  {dataset/}
         [RRename=(old varnames=new varnames).../]
         [RDrop=varlist/]
         [RKeep={ALL** }/]
         [BYvar=varlist/]
         [Join= {Outer**}/]
         [MaxGrp={50**}/]

!Mny2Mny combines variables from two SPSS-format files or datasets,
called respectively the 'left input' (keyword 'LeftIn') and the
'right input' (keyword 'RightIn'). The left input, but not the right
input, may be specified as '*'. As for MATCH FILES, you may specify
RENAME (keywords LRename, RRename), DROP (keywords LDROP, RDrop), and
KEEP (keywords LKeep, RKeep) for either or both inputs. (Unlike MATCH
FILES, !Mny2Mny takes separate DROP and KEEP lists for each input.)

In general, !Mny2Mny matches each record from each file with several
records from the other. If BYvar is not specified, every record from
each file matches every record from the other; i.e., the result is a
Cartesian product of the files. If BYvar is specified, records from
each file match those records from the other with the same value of
the BYvar variables. As for MATCH FILES, if BYvar is specified, all
variables named must have the same name, data type, and (if strings)
length in both inputs. (If necessary, use LRename or RRename to make
variable names match). When BYvar is used, both inputs must be sorted
in ascending order of the key variables named on BYvar.

The combined file will be the active dataset, and will be unnamed.
Order of variables will be
. The variables named on BYvar, if any
. New variables 'Left#' and 'Right#'.
. Remaining variables from the left input
. Remaining variables from the right input
Variables 'Left#' and 'Right#' identify which records from the inputs
were the sources for the output record. They are sequence numbers,
respectively, for the left and right inputs, within the given value
of the BYvar variables. (If BYvar is not specified, 'Left#' and
'Right#' are sequence numbers for all records in the inputs.)

The combined file will by in ascending order of the key variables
named on BYvar; within a value of the BYvar variables, in ascending
order of Left#, then Right#.

If BYvar is specified, you may use subcommand JOIN= to control
matching. It may be one of,
. Inner:  Records from either input that do not match any records in
the other, are excluded
. Outer:  All records from both inputs are included, whether the
match any record in the othe or not. (DEFAULT)
. Left Outer: All records from the left input, but only matching
records from the right input, are included
. Right outer: All records from the right input, but only matching
records from the left input, are included.
When an output record has information from only one input, the
sequence number (Left# or Right#) from the input that is present will
be as usual; the sequence number for the absent input will be 0.

Option MaxGrp (default: 50) gives the largest number of records that
can be handled from either input, for one value of the BYvar
variables. (If BYvar is not specified, MaxGrp must be at least as
large as the number of records in the larger input.)

Restrictions and side effects:
. Variables Left#, Right#; and @Mny2Mny_NoBreak, @Mny2Mny_NewGrp,
@Mny2Mny_LMaxRec, and @Mny2Mny_RMaxRec, should not occur in either
input. (However, ones that do, e.g. Left# and Right# because the
input is the result of a prior run of !Mny2Mny, may be renamed using
LRename or RRename; or dropped, using LDrop or RDrop.)
. Datasets @Mny2Mny_Left#d, @Mny2Mny_Right#d, and @Mny2Mny_Left##
should not exist, and will be lost if they do.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Macro !Mny2Mny, for many-to-many merge and Cartesian product

Richard Ristow
Here's the definition. It's in a late stage of testing, rather than
finished; I sent the documentation and discussion to the list earlier
than I intended, so I don't want to hold up the code.

*  ****************************************************************  .
*  Macro for many-to-many merge of two datasets or .SAV files,       .
*  by Richard Ristow, March 2014                                     .
*  Contact: [hidden email]                                  .
*                                                                    .
*  Required arguments:                                               .
*  LeftIn   Name of one of the datasets, or saved files, to be joined.
*           (referred to as the "Left input")                        .
*  RightIn  Name of the other  dataset,  or saved file,  to be joined.
*           (referred to as the "Right input")                       .
*                                                                    .
*  Optional arguments for the inputs:                                .
*  LRename  'RENAME' to be applied to the Left  input                .
*  LDrop    'DROP'   to be applied to the Left  input                .
*  LKeep    'KEEP'   to be applied to the Left  input (default: ALL) .
*  RRename  'RENAME' to be applied to the Right input                .
*  RDrop    'DROP'   to be applied to the Right input                .
*  RKeep    'KEEP'   to be applied to the Right input (default: ALL) .
*                                                                    .
*  Optional arguments for the join operation:                        .
*  Join     Outer (default), LeftOuter, RightOuter, or Inner:        .
*           Outer        Output includes contents of all records     .
*                        from both inputs                            .
*           Left Outer    Output includes contents of all records     .
*        or LeftOuter    from the Left input, but only records from  .
*        or LOuter       the Right input that match some left-input  .
*                        record                                      .
*           Right Outer  Output includes contents of all records     .
*        or RightOuter   from the Right input, but only records from .
*        or ROuter       the Left input that match some right-input  .
*                        record                                      .
*           Inner        Output contains only records from each      .
*                        input that match some record in the other   .
*  BYvar    Variable or variables on which  the two are to be joined;.
*           if not specified, a Cartesian product will be performed. .
*           Requirements are as for BY variables in MATCH FILES:     .
*           all variables must be present and of the same type in    .
*           both inputs; and both inputs must be in ascending order  .
*           on the BYvar variable(s).                                .
*                                                                    .
*  MaxGrp   Largest number of records allowed for any value of the   .
*           BYvar variable(s) in either input; for Cartesian         .
*           product, larget number of records in either intput.      .
*           Default: 50.                                             .
*  ................................................................  .
*  Reserved names:                                                   .
*  Variables Left# and Right# are created, and retained in the output.
*  Variables @Mny2Mny_NoBreak @Mny2Mny_NewGrp                        .
*        and @Mny2Mny_LMaxRec @Mny2Mny_RMaxRec                       .
*        are used, and discarded.                                    .
*  Datasets  @Mny2Mny_Left#d  @Mny2Mny_Right#d                       .
*        and  @Mny2Mny_Left##                                        .
*        are created, used, and discarded.
*  ****************************************************************  .

DEFINE !Mny2Mny(LeftIn  = !CHAREND('/')
                /RightIn = !CHAREND('/')
                /BYvar   = !CHAREND('/')
                /LRename = !CHAREND('/')
                /LDrop   = !CHAREND('/')
                /LKeep   = !CHAREND('/') !DEFAULT(ALL)
                /RRename = !CHAREND('/')
                /RDrop   = !CHAREND('/')
                /RKeep   = !CHAREND('/') !DEFAULT(ALL)
                /Join    = !CHAREND('/') !DEFAULT(Outer)
                /MaxGrp  = !TOKENS(1)    !DEFAULT(50))

    ECHO " ".
    ECHO !QUOTE(!CONCAT('Macro !','Mny2Mny:')).

*  .......     Section I:   Input parsing and error checking   ....... .
    !LET !Error = No

    !IF (   !LeftIn  !EQ !NULL
        !OR !RightIn !EQ !NULL) !THEN
        !LET !Error = Yes
        ECHO "Arguments LeftIn and RightIn (the two inputs)".
        ECHO "      are required.".
    !IFEND

    !IF (   !RightIn !EQ '*') !THEN
        !LET !Error = Yes
        ECHO "The Right input may not be '*'".
    !IFEND

    !IF (!UPCASE(!Join) !EQ 'OUTER') !THEN
       !LET !JType = 'OUTER'
       !LET !JMsg  = 'Outer join'
    !IFEND
    !IF (!UPCASE(!Join) !EQ 'LEFT OUTER') !THEN
       !LET !JType = 'LEFTOUT'
       !LET !JMsg  = 'Left outer join'
    !IFEND
    !IF (!UPCASE(!Join) !EQ 'LEFTOUTER') !THEN
       !LET !JType = 'LEFTOUT'
       !LET !JMsg  = 'Left outer join'
    !IFEND
    !IF (!UPCASE(!Join) !EQ 'LOUTER') !THEN
       !LET !JType = 'LEFTOUT'
       !LET !JMsg  = 'Left outer join'
    !IFEND

    !IF (!UPCASE(!Join) !EQ 'RIGHT OUTER') !THEN
       !LET !JType = 'RIGHTOUT'
       !LET !JMsg  = 'Right outer join'
    !IFEND
    !IF (!UPCASE(!Join) !EQ 'RIGHTOUTER') !THEN
       !LET !JType = 'RIGHTOUT'
       !LET !JMsg  = 'Right outer join'
    !IFEND
    !IF (!UPCASE(!Join) !EQ 'ROUTER') !THEN
       !LET !JType = 'RIGHTOUT'
       !LET !JMsg  = 'Right outer join'
    !IFEND

    !IF (!UPCASE(!Join) !EQ 'INNER') !THEN
       !LET !JType = 'INNER'
       !LET !JMsg  = 'Inner join'
    !IFEND

    !IF (!BYvar         !EQ !NULL)   !THEN
       !LET !JType = 'CARTPROD'
       !LET !JMsg   = 'Cartesian product'
    !IFEND

    !IF (!JType !EQ !NULL) !THEN
        !LET !Error = Yes
        ECHO "Invalid join type:".
        ECHO !QUOTE(!CONCAT('    Join=',!Join)).
    !IFEND


*  .......     Section II:  Informational messages            ........ .
    !IF (!Error !EQ No) !THEN
        ECHO !QUOTE(!CONCAT(!JMsg,' of')).
        ECHO !QUOTE(!CONCAT('   ',!LeftIn)).
        !IF (        !LRename !NE !NULL) !THEN
           ECHO !QUOTE(!CONCAT('      /RENAME=',!LRename))
        !IFEND
        !IF (        !LDrop   !NE !NULL) !THEN
           ECHO !QUOTE(!CONCAT('      /DROP  =',!LDrop))
        !IFEND
        !IF (!UPCASE(!LKeep)  !NE 'ALL') !THEN
           ECHO !QUOTE(!CONCAT('      /KEEP  =',!BYVar,' ',!LKeep))
        !IFEND

        ECHO !QUOTE(!CONCAT('to ',!RightIn)).
        !IF (        !RRename !NE !NULL) !THEN
           ECHO !QUOTE(!CONCAT('      /RENAME=',!RRename))
        !IFEND
        !IF (        !RDrop   !NE !NULL) !THEN
           ECHO !QUOTE(!CONCAT('      /DROP  =',!RDrop))
        !IFEND
        !IF (!UPCASE(!RKeep)  !NE 'ALL') !THEN
           ECHO !QUOTE(!CONCAT('      /KEEP  =',!BYVar,' ',!RKeep))
        !IFEND
        !IF (!BYvar  !NE !NULL)          !THEN
           ECHO !QUOTE(!CONCAT('by ',!BYvar)).
        !IFEND
      !IFEND

*  .......     Section III: Processing                          ....... .
*  .......     III.A  Sequence-numbered copies of the inputs    ....... .
    !IF (!Error !EQ No) !THEN
        ADD FILES
           /FILE=!Leftin.
        NUMERIC @Mny2Mny_NoBreak (F2).
        COMPUTE @Mny2Mny_NoBreak = 1.
        ADD FILES
           /FILE=*
        !IF (!LRename !NE !NULL) !THEN
           /RENAME = !LRename
        !IFEND
           /BY      @Mny2Mny_NoBreak !BYvar
           /FIRST = @Mny2Mny_NewGrp
        !IF (!LDrop   !NE !NULL) !THEN
           /DROP   = !LDrop
        !IFEND
           /KEEP   = @Mny2Mny_NoBreak !BYvar !LKeep.
        DO IF    @Mny2Mny_NewGrp.
        .  COMPUTE Left# = 1.
        ELSE.
        .  COMPUTE Left# = LAG(Left#) + 1.
        END IF.
        FORMATS    Left# (F4).
        DATASET NAME      @Mny2Mny_Left#d WINDOW=ASIS.
        /*-- LIST /*-*/.

        DATASET DECLARE   @Mny2Mny_Left##.
        AGGREGATE OUTFILE=@Mny2Mny_Left##
           /BREAK = @Mny2Mny_NoBreak !BYvar
           /@Mny2Mny_LMaxRec = MAX(Left#).


        ADD FILES
           /FILE=!RightIn.
        NUMERIC @Mny2Mny_NoBreak(F2).
        COMPUTE @Mny2Mny_NoBreak = 1.
        ADD FILES
           /FILE=*
        !IF (!LRename !NE !NULL) !THEN
           /RENAME = !RRename
        !IFEND
           /BY @Mny2Mny_NoBreak !BYvar
           /FIRST = @Mny2Mny_NewGrp
        !IF (!RDrop   !NE !NULL) !THEN
           /DROP   = !RDrop
        !IFEND
           /KEEP   = @Mny2Mny_NoBreak !BYvar !RKeep.
        DO IF    @Mny2Mny_NewGrp.
        .  COMPUTE Right# = 1.
        ELSE.
        .  COMPUTE Right# = LAG(Right#) + 1.
        END IF.
        FORMATS    Right# (F4).
        DATASET NAME      @Mny2Mny_Right#d WINDOW=ASIS.
        /*-- LIST /*-*/.

*  .......     III.B  Build 'spine' of all keys for all records ....... .
*  .......     III.B.1  Combine record counts from the inputs   ....... .
        AGGREGATE OUTFILE=*
           /BREAK   = @Mny2Mny_NoBreak !BYvar
           /@Mny2Mny_RMaxRec = MAX(Right#).
        /*-- LIST /*-*/.

        MATCH FILES
           /FILE    = @Mny2Mny_Left##
           /FILE    = *
           /BY @Mny2Mny_NoBreak !BYvar.

        RECODE @Mny2Mny_LMaxRec
               @Mny2Mny_RMaxRec (MISSING=0).

*  .......     III.B.2  Select, based on join type              ....... .
       !IF  (!Jtype !EQ 'INNER') !THEN
        SELECT IF (@Mny2Mny_LMaxRec GT 0
               AND @Mny2Mny_RMaxRec GT 0).
       !IFEND

       !IF  (!Jtype !EQ 'LEFTOUT') !THEN
        SELECT IF (@Mny2Mny_RMaxRec GT 0).
       !IFEND

       !IF  (!Jtype !EQ 'RIGHTOUT') !THEN
        SELECT IF (@Mny2Mny_LMaxRec GT 0).
       !IFEND

        * ..................................... .
        * For OUTER join and Cartesian product, .
        * no selection is necessary.            .
        * ..................................... .

*  .......     III.B.3  Error message, if record count exceeds  ....... .
*                       the allowed maximum                     ....... .

        IF $CASENUM EQ 1 #Overflow = 0.

        DO IF    @Mny2Mny_LMaxRec GT !MaxGrp.
        .  DO IF #Overflow EQ 0.
        .     PRINT / 'Left  input: ' @Mny2Mny_LMaxRec
                !IF  (!Jtype !EQ 'CARTPROD') !THEN
                      ' records '
                !ELSE
                      ' records for key ', !BYvar
                !IFEND
                      !QUOTE(!CONCAT(' (max is ',!MaxGrp,')')).
        .     COMPUTE #Overflow = 1.
        .  ELSE.
        .     PRINT / 'Left  input: ' @Mny2Mny_LMaxRec
                !IF  (!Jtype !EQ 'CARTPROD') !THEN
                      ' records '.
                !ELSE
                      ' records for key ', !BYvar.
                !IFEND
        .  END IF.
        END IF.

        DO IF    @Mny2Mny_RMaxRec GT !MaxGrp.
        .  DO IF #Overflow EQ 0.
        .     PRINT / 'Right input: ' @Mny2Mny_RMaxRec
                !IF  (!Jtype !EQ 'CARTPROD') !THEN
                      ' records '
                !ELSE
                      ' records for key ', !BYvar
                !IFEND
                      !QUOTE(!CONCAT(' (max is ',!MaxGrp,')')).
        .     COMPUTE #Overflow = 1.
        .  ELSE.
        .     PRINT / 'Right input: ' @Mny2Mny_RMaxRec
                !IF  (!Jtype !EQ 'CARTPROD') !THEN
                      ' records '.
                !ELSE
                      ' records for key ', !BYvar.
                !IFEND
        .  END IF.
        END IF.

*  .......     III.B.4  'Unroll' to build file of all record    ....... .
*                       keys                                    ....... .
        * 'Unroll' the Right record numbers.
        VECTOR  Rec#(!MaxGrp,F4).

        DO IF   @Mny2Mny_RMaxRec EQ 0.
        .     COMPUTE Rec#(1)     = 0.
        ELSE.
        .  LOOP #Idx = 1 TO MIN(@Mny2Mny_RMaxRec,!MaxGrp).
        .     COMPUTE Rec#(#Idx) = #Idx.
        .  END LOOP.
        END IF.
        EXECUTE  /* apparently necessary */.

        !LET !Sfx = !LENGTH (!BLANKS (!MaxGrp))

        OMS /SELECT ALL
            /IF COMMANDS = ["Variables to Cases"]
            /DESTINATION   VIEWER = NO
            /TAG = "@MNY2MNY_V2C_ZAP".

        VARSTOCASES
           /MAKE Right# FROM Rec#1 TO !CONCAT('Rec#',!Sfx).

        * 'Unroll' the Left  record numbers.
        VECTOR  Rec#(!MaxGrp,F4).

        DO IF   @Mny2Mny_LMaxRec EQ 0.
        .     COMPUTE Rec#(1)     = 0.
        ELSE.
        .  LOOP #Idx = 1 TO MIN(@Mny2Mny_LMaxRec,!MaxGrp).
        .     COMPUTE Rec#(#Idx) = #Idx.
        .  END LOOP.
        END IF.
        EXECUTE  /* apparently necessary */.

        VARSTOCASES
           /MAKE Left#  FROM Rec#1 TO !CONCAT('Rec#',!Sfx).

        OMSEND TAG=["@MNY2MNY_V2C_ZAP"].

*  .......     III.C  Add the substantive data                  ....... .
        MATCH FILES
           /FILE=*
           /TABLE=@Mny2Mny_Right#d
           /BY !BYvar Right#
           /DROP @Mny2Mny_NoBreak @Mny2Mny_NewGrp
                 @Mny2Mny_LMaxRec @Mny2Mny_RMaxRec.

        SORT CASES BY !BYvar Left# Right#.

        MATCH FILES
           /TABLE=@Mny2Mny_Left#d
           /FILE=*
           /BY   !BYvar Left#
           /DROP  @Mny2Mny_NoBreak @Mny2Mny_NewGrp
           /KEEP !BYvar Left# Right# ALL.

        EXECUTE /* so input datasets can be dropped */.

        DATASET CLOSE @Mny2Mny_Left#d.
        DATASET CLOSE @Mny2Mny_Right#d.
        DATASET CLOSE @Mny2Mny_Left##.
    !IFEND
!ENDDEFINE.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD