Many-too-many demo

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Many-too-many demo

Richard Ristow
I'm preparing a new version of macro !Mny2Mny. Like its predecessor,
it works by first building a file in with one record for each value
of the key on which the files are to be joined, relying on the
insight I got from David Marco: If the records in each file are
sequence-numbered within each key value, the set of all sequence
numbers can be represented by the largest value.

This summary file is then 'unrolled' into what I've called the
'spine': it has all the records that will be in the final file, but
represented only by their key values and the sequence numbers of the
two input records that will be combined there. (By definition, those
two records share the same value of the key.) That done, the
many-to-many merge can be completed by two MATCH FILES where the two
inputs are TABLE files. This does require that copies of both inputs
be made with sequence numbers included; that imposes some overhead,
but generally the size of the output is considerably greater than
that of the two inputs combined, so the burden should be a small part
of the total time required. (I'm judging that this process is almost
entirely input-output bound.)

The version I previously posted used a pair of VARSTOCASES operations
to unroll the summary file into the 'spine'. The new version
incorporates David Marso's technique of unrolling the records in
MATRIX code. That has the immense advantage that the code need not
contain an explicit upper limit for the number of records for a
single key value, as was necessary with the VARSTOCASES logic.

I've stuck to the goal of having !Mny2Mny accept any key, single- or
multi-variabled, of any datatype, that would be accepted by MATCH
FILES. Such keys cannot readily be passed through MATRIX, if at all.
The method works because of David's insight that the key values can
be sequence-numbered, and the summary file passed in that form to MATRIX.

The result is code that is remarkably little changed from the
VARSTOCASES version. The key-value sequence numbers exist only in
scratch dataset @Mny2Mny_GrpSmry, in the MATRIX code that reads it,
and in the unnamed dataset which MATRIX creates and which is then
merged with @Mny2Mny_GrpSmry to form the 'spine'.

The MATRIX code here differs from what I received from David in two
respects. First, I've unloaded some of the table-building to
transformation code; the matrix which I call Grp#Rcds, and which is C
in the version I received from David, is read into MATRIX in its
completed form, rather than being built there. Second, I develop the
vectors of record sequence numbers in two separate MATRIX variables,
LeftNos and RightNos, to permit passing record-number values of 0 to
indicate non-matches.

It is always my goal to write readable code; and, in the case of
!Mny2Mny, to write robust and versatile code. I've found those goals
conflicting. So, here is a demonstration of the workings of the new
!Mny2Mny as I'm drafting it, but in open code rather than as part of
a macro. I think that without interspersed macro variables and macro
logic it is considerably more readable. Of course some features are
omitted, notably the logic to calculate file joins if there are
matching record keys, and Cartesian products if there are not. (The
variable @Mny2Mny_NoBreak in the example code has no function there,
but is part of the provision for Cartesian products.)

Here's the example. Test data is the "Schools" and "Subjects" files
I've used in other examples; they're in the Appendix to this posting.
And this is an outer join, with unmatched records from both inputs
included in the output. (And it starts with "Section III", because
earlier sections of the macro for input checking and messages, are
omitted here.) Test run in SPSS 14 under Windows XP.

*  .......     Section III: Processing                          ....... .
*  .......     III.A    Sequence-numbered copies of the input   ....... .
*  .......     III.A.1  Left  input -> dataset @Mny2Mny_Left#d  ....... .

ADD FILES
    /FILE=Schools
    /BY      Category
    /KEEP  = Category ALL
    /FIRST = @Mny2Mny_NewGrp.

NUMERIC @Mny2Mny_NoBreak (F2).
COMPUTE @Mny2Mny_NoBreak = 1.

DO IF    @Mny2Mny_NewGrp.
.  COMPUTE Left# = 1.
ELSE.
.  COMPUTE Left# = LAG(Left#) + 1.
END IF.

FORMATS    @Mny2Mny_NewGrp (F2).
FORMATS    Left#          (F4).
DATASET NAME      @Mny2Mny_Left#d WINDOW=ASIS.
*  ... End of processing Left  input .
/*-- LIST /*-*/.

DATASET DECLARE   @Mny2Mny_Left##.
AGGREGATE OUTFILE=@Mny2Mny_Left##
    /BREAK = @Mny2Mny_NoBreak Category
    /@Mny2Mny_LMaxRec = MAX(Left#).

*  .......     III.A.2  Right input -> dataset @Mny2Mny_Right#d ....... .

ADD FILES
    /FILE=Subjects
    /KEEP   = Category ALL
    /BY Category
    /FIRST = @Mny2Mny_NewGrp.

NUMERIC @Mny2Mny_NoBreak(F2).
COMPUTE @Mny2Mny_NoBreak = 1.
DO IF    @Mny2Mny_NewGrp.
.  COMPUTE Right# = 1.
ELSE.
.  COMPUTE Right# = LAG(Right#) + 1.
END IF.

FORMATS    @Mny2Mny_NewGrp (F2).
FORMATS    Right#         (F4).
DATASET NAME      @Mny2Mny_Right#d WINDOW=ASIS.
/*-- LIST /*-*/.

*  .......     III.B  Build 'spine' of all keys for all records ....... .
*  .......     III.B.1  Combine record counts from both inputs  ....... .

AGGREGATE OUTFILE=*
    /BREAK   = @Mny2Mny_NoBreak Category
    /@Mny2Mny_RMaxRec = MAX(Right#).

/*-- LIST /*-*/.

MATCH FILES
    /FILE    = @Mny2Mny_Left##
    /FILE    = *
    /BY @Mny2Mny_NoBreak Category.

DATASET NAME    @Mny2Mny_GrpSmry.
RECODE @Mny2Mny_LMaxRec
        @Mny2Mny_RMaxRec (MISSING=0).

*  .......     III.B.2  Select, based on join type              ....... .
* e.g. !IF  (!Jtype !EQ 'INNER') !THEN
*      SELECT IF (@Mny2Mny_LMaxRec GT 0
*             AND @Mny2Mny_RMaxRec GT 0).
*     !IFEND

/*--  LIST /*-*/.

*  .......     III.B.3  Expand the counts to get list of        ........ .
*                       pairs of record numbers from the        ........ .
*                       groups, by group number                 ........ .

COMPUTE @Mny2Mny_Grp#    = $CASENUM.
COMPUTE @Mny2Mny_GrpRecs = MAX(@Mny2Mny_LMaxRec*@Mny2Mny_RMaxRec,
                                @Mny2Mny_LMaxRec,
                                @Mny2Mny_RMaxRec).

*  .......     MATRIX code by David Marso, adapted here.        ........ .
*  ... OMS  /SELECT TABLES
      /IF COMMANDS = ["Matrix"]
          SUBTYPES = ["Notes"]
      /DESTINATION   VIEWER = NO
      /TAG = "ZapMATRIX".

MATRIX.
+  GET Grp#Rcds
    /FILE=*
    /VARIABLES=@Mny2Mny_Grp#,
               @Mny2Mny_LMaxRec @Mny2Mny_RMaxRec,
               @Mny2Mny_GrpRecs
    /NAMES=CVnames.

+  LOOP #=1 TO NROW(Grp#Rcds).

+     COMPUTE LeftNos=T({1:MMAX({Grp#Rcds(#,2),1})}).
+     DO IF (Grp#Rcds(#,2) EQ 0).
+        COMPUTE LeftNos = 0*LeftNos.
+     END IF.

+     COMPUTE RightNos=T({1:MMAX({Grp#Rcds(#,3),1})}).
+     DO IF (Grp#Rcds(#,3) EQ 0).
+        COMPUTE RightNos = 0*RightNos.
+     END IF.

+     SAVE ({MAKE(Grp#Rcds(#,4),1,Grp#Rcds(#,1)),
              KRONEKER(MAKE(NROW(RightNos),1,1),LeftNos),
              KRONEKER(RightNos,MAKE(NROW(LeftNos),1,1)) })
        / OUTFILE *
        / VARIABLES @Mny2Mny_Grp# Left# Right#.

+  END LOOP.
END MATRIX.


Matrix
Notes [suppressed]
------ END MATRIX -----

FORMATS @Mny2Mny_Grp# Left# Right# (F4).
VARIABLE WIDTH        Left# Right# (6).
/*--  LIST /*-*/.

*  .......     III.B.4  Combine, with summary file by group key ........ .
*                       to get the 'spine': list of all pairs   ........ .
*                       of record numbers by group key.         ........ .

MATCH FILES
    /TABLE=@Mny2Mny_GrpSmry
    /FILE =*
    /BY @Mny2Mny_Grp#.

/*-- DATASET COPY Spine      /*-*/.

*  .......     III.C  Add the substantive data                  ....... .
MATCH FILES
    /FILE=*
    /TABLE=@Mny2Mny_Right#d
    /BY Category Right#
    /DROP @Mny2Mny_NoBreak
          @Mny2Mny_NewGrp  @Mny2Mny_Grp#   @Mny2Mny_GrpRecs
          @Mny2Mny_LMaxRec @Mny2Mny_RMaxRec.

SORT CASES BY Category Left# Right#.
MATCH FILES
    /TABLE=@Mny2Mny_Left#d
    /FILE=*
    /BY   Category Left#
    /DROP  @Mny2Mny_NoBreak @Mny2Mny_NewGrp
    /KEEP Category Left# Right# ALL.

EXECUTE /* so input datasets can be dropped */.

DATASET CLOSE @Mny2Mny_Left#d.
DATASET CLOSE @Mny2Mny_Right#d.
DATASET CLOSE @Mny2Mny_Left##.
DATASET CLOSE @Mny2Mny_GrpSmry.
LIST.

List

Notes
|-----------------------------|---------------------------|
|Output Created               |14-MAR-2014 20:13:44       |
|-----------------------------|---------------------------|
Category      Left# Right# School                 Subject

Art               1     1  RI School of Design    Painting
Art               1     2  RI School of Design    Drawing
Art               1     3  RI School of Design    Sculpture
Art               2     1  Metropolitan Museum    Painting
Art               2     2  Metropolitan Museum    Drawing
Art               2     3  Metropolitan Museum    Sculpture
Culinary          1     0  Johnson & Wales
Liberal Arts      1     1  Oberlin                Literature
Liberal Arts      1     2  Oberlin                Philosophy
Liberal Arts      2     1  Antioch                Literature
Liberal Arts      2     2  Antioch                Philosophy
Liberal Arts      3     1  Swarthmore             Literature
Liberal Arts      3     2  Swarthmore             Philosophy
Technical         1     1  MIT                    Calculus
Technical         1     2  MIT                    Physics
Technical         1     3  MIT                    Chemistry
Technical         1     4  MIT                    Engineering
Technical         2     1  Cal Tech               Calculus
Technical         2     2  Cal Tech               Physics
Technical         2     3  Cal Tech               Chemistry
Technical         2     4  Cal Tech               Engineering
Technical         3     1  ETH                    Calculus
Technical         3     2  ETH                    Physics
Technical         3     3  ETH                    Chemistry
Technical         3     4  ETH                    Engineering

Number of cases read:  25    Number of cases listed:  25
======================================
APPENDIX: Test data, and code repeated
======================================
*  C:\Documents and Settings\Richard\My Documents  .
*    \Technical\spssx-l\Z-2014\                    .
*    2014-03-15 Ristow-Mny2Mny demo.SPS            .


DATA LIST FIXED
    /Category    04-16 (A)
     School      17-38 (A).
BEGIN DATA
    Art          RI School of Design
    Art          Metropolitan Museum
    Culinary     Johnson & Wales
    Liberal Arts Oberlin
    Liberal Arts Antioch
    Liberal Arts Swarthmore
    Technical    MIT
    Technical    Cal Tech
    Technical    ETH
END DATA.
SORT CASES BY Category.
DATASET NAME     Schools.
.  /*-- LIST /*-*/.

DATA LIST FIXED
    /Category    04-16 (A)
     Subject     17-28 (A).
BEGIN DATA
    Art          Painting
    Art          Drawing
    Art          Sculpture
    Liberal Arts Literature
    Liberal Arts Philosophy
    Technical    Calculus
    Technical    Physics
    Technical    Chemistry
    Technical    Engineering
END DATA.
SORT CASES BY Category.
DATASET NAME    Subjects.
.  /*-- LIST /*-*/.
*  ...................................................... .


*  .......     Section III: Processing                          ....... .
*  .......     III.A    Sequence-numbered copies of the input   ....... .
*  .......     III.A.1  Left  input -> dataset @Mny2Mny_Left#d  ....... .

ADD FILES
    /FILE=Schools
    /BY      Category
    /KEEP  = Category ALL
    /FIRST = @Mny2Mny_NewGrp.

NUMERIC @Mny2Mny_NoBreak (F2).
COMPUTE @Mny2Mny_NoBreak = 1.

DO IF    @Mny2Mny_NewGrp.
.  COMPUTE Left# = 1.
ELSE.
.  COMPUTE Left# = LAG(Left#) + 1.
END IF.

FORMATS    @Mny2Mny_NewGrp (F2).
FORMATS    Left#          (F4).
DATASET NAME      @Mny2Mny_Left#d WINDOW=ASIS.
*  ... End of processing Left  input .
/*-- LIST /*-*/.

DATASET DECLARE   @Mny2Mny_Left##.
AGGREGATE OUTFILE=@Mny2Mny_Left##
    /BREAK = @Mny2Mny_NoBreak Category
    /@Mny2Mny_LMaxRec = MAX(Left#).

*  .......     III.A.2  Right input -> dataset @Mny2Mny_Right#d ....... .

ADD FILES
    /FILE=Subjects
    /KEEP   = Category ALL
    /BY Category
    /FIRST = @Mny2Mny_NewGrp.

NUMERIC @Mny2Mny_NoBreak(F2).
COMPUTE @Mny2Mny_NoBreak = 1.
DO IF    @Mny2Mny_NewGrp.
.  COMPUTE Right# = 1.
ELSE.
.  COMPUTE Right# = LAG(Right#) + 1.
END IF.

FORMATS    @Mny2Mny_NewGrp (F2).
FORMATS    Right#         (F4).
DATASET NAME      @Mny2Mny_Right#d WINDOW=ASIS.
/*-- LIST /*-*/.

*  .......     III.B  Build 'spine' of all keys for all records ....... .
*  .......     III.B.1  Combine record counts from both inputs  ....... .

AGGREGATE OUTFILE=*
    /BREAK   = @Mny2Mny_NoBreak Category
    /@Mny2Mny_RMaxRec = MAX(Right#).

/*-- LIST /*-*/.

MATCH FILES
    /FILE    = @Mny2Mny_Left##
    /FILE    = *
    /BY @Mny2Mny_NoBreak Category.

DATASET NAME    @Mny2Mny_GrpSmry.

RECODE @Mny2Mny_LMaxRec
        @Mny2Mny_RMaxRec (MISSING=0).

*  .......     III.B.2  Select, based on join type              ....... .
* e.g. !IF  (!Jtype !EQ 'INNER') !THEN
*      SELECT IF (@Mny2Mny_LMaxRec GT 0
*             AND @Mny2Mny_RMaxRec GT 0).
*     !IFEND

/*--  LIST /*-*/.

*  .......     III.B.3  Expand the counts to get list of        ........ .
*                       pairs of record numbers from the        ........ .
*                       groups, by group number                 ........ .

COMPUTE @Mny2Mny_Grp#    = $CASENUM.
COMPUTE @Mny2Mny_GrpRecs = MAX(@Mny2Mny_LMaxRec*@Mny2Mny_RMaxRec,
                                @Mny2Mny_LMaxRec,
                                @Mny2Mny_RMaxRec).

*  .......     MATRIX code by David Marso, adapted here.        ........ .
*  ... OMS  /SELECT TABLES
      /IF COMMANDS = ["Matrix"]
          SUBTYPES = ["Notes"]
      /DESTINATION   VIEWER = NO
      /TAG = "ZapMATRIX".

MATRIX.
+  GET Grp#Rcds
    /FILE=*
    /VARIABLES=@Mny2Mny_Grp#,
               @Mny2Mny_LMaxRec @Mny2Mny_RMaxRec,
               @Mny2Mny_GrpRecs
    /NAMES=CVnames.

+  LOOP #=1 TO NROW(Grp#Rcds).

+     COMPUTE LeftNos=T({1:MMAX({Grp#Rcds(#,2),1})}).
+     DO IF (Grp#Rcds(#,2) EQ 0).
+        COMPUTE LeftNos = 0*LeftNos.
+     END IF.

+     COMPUTE RightNos=T({1:MMAX({Grp#Rcds(#,3),1})}).
+     DO IF (Grp#Rcds(#,3) EQ 0).
+        COMPUTE RightNos = 0*RightNos.
+     END IF.

+     SAVE ({MAKE(Grp#Rcds(#,4),1,Grp#Rcds(#,1)),
              KRONEKER(MAKE(NROW(RightNos),1,1),LeftNos),
              KRONEKER(RightNos,MAKE(NROW(LeftNos),1,1)) })
        / OUTFILE *
        / VARIABLES @Mny2Mny_Grp# Left# Right#.

+  END LOOP.
END MATRIX.

FORMATS @Mny2Mny_Grp# Left# Right# (F4).
VARIABLE WIDTH        Left# Right# (6).
/*--  LIST /*-*/.

*  .......     III.B.4  Combine, with summary file by group key ........ .
*                       to get the 'spine': list of all pairs   ........ .
*                       of record numbers by group key.         ........ .

MATCH FILES
    /TABLE=@Mny2Mny_GrpSmry
    /FILE =*
    /BY @Mny2Mny_Grp#.

/*-- DATASET COPY Spine      /*-*/.

*  .......     III.C  Add the substantive data                  ....... .
MATCH FILES
    /FILE=*
    /TABLE=@Mny2Mny_Right#d
    /BY Category Right#
    /DROP @Mny2Mny_NoBreak
          @Mny2Mny_NewGrp  @Mny2Mny_Grp#   @Mny2Mny_GrpRecs
          @Mny2Mny_LMaxRec @Mny2Mny_RMaxRec.

SORT CASES BY Category Left# Right#.

MATCH FILES
    /TABLE=@Mny2Mny_Left#d
    /FILE=*
    /BY   Category Left#
    /DROP  @Mny2Mny_NoBreak @Mny2Mny_NewGrp
    /KEEP Category Left# Right# ALL.

EXECUTE /* so input datasets can be dropped */.

DATASET CLOSE @Mny2Mny_Left#d.
DATASET CLOSE @Mny2Mny_Right#d.
DATASET CLOSE @Mny2Mny_Left##.
DATASET CLOSE @Mny2Mny_GrpSmry.

LIST.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Many-too-many demo

Richard Ristow
Pun not intended, though perhaps pertinent!

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Many-too-many demo

bdates
Sorry, but the clinical psychologist in me thinks that Freud would have a ball with this one...your Freudian slip is showing.

B
________________________________________
From: SPSSX(r) Discussion [[hidden email]] on behalf of Richard Ristow [[hidden email]]
Sent: Friday, March 14, 2014 8:59 PM
To: [hidden email]
Subject: Re: Many-too-many demo

Pun not intended, though perhaps pertinent!

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD