"Match files" with duplicates in table

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

"Match files" with duplicates in table

Catherine Kubitschek
I want to something like a MATCH FILES FILE=x /TABLE=y /BY z . Except the
"y" file is not unduplicated on "z".

This is much easier to explain with an example:

/* First file - id & name, unduplicated (pre-sorted) .
data list /id 1 name 3-10 (a) .
begin data
1 Ann
2 Bill
3 Chris
4 David
5 Ethel
6 Frank
end data .
save outfile=UndupComplete .

/* Second file - id & major, duplicated but subset of id (pre-sorted) .
data list /id 1 Major 3-6 (a) .
begin data
2 ENGL
3 HIST
3 THEO
5 ENGL
end data .
save outfile=DupPartial .

The result I want is a dataset that looks like:

id name     Major

  1 Ann
  2 Bill     ENGL
  3 Chris    HIST
  3 Chris    THEO
  4 David
  5 Ethel    ENGL
  6 Frank

Is there a good way to do this?  (I'm using SPSS 14).
Thanks.

Catherine
Reply | Threaded
Open this post in threaded view
|

Re: "Match files" with duplicates in table

Hector Maletta
         Catherine,
         no problem as far as the ID is not duplicated.
         MATCH FILES /FILE= UNDUPCOMPLETE/FILE= DupPartial/by ID.

         This produces exactly what you want.

         Hector

         -----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Catherine Kubitschek
Sent: 27 September 2007 22:34
To: [hidden email]
Subject: "Match files" with duplicates in table

         I want to something like a MATCH FILES FILE=x /TABLE=y /BY z .
Except the
         "y" file is not unduplicated on "z".

         This is much easier to explain with an example:

         /* First file - id & name, unduplicated (pre-sorted) .
         data list /id 1 name 3-10 (a) .
         begin data
         1 Ann
         2 Bill
         3 Chris
         4 David
         5 Ethel
         6 Frank
         end data .
         save outfile=UndupComplete .

         /* Second file - id & major, duplicated but subset of id
(pre-sorted) .
         data list /id 1 Major 3-6 (a) .
         begin data
         2 ENGL
         3 HIST
         3 THEO
         5 ENGL
         end data .
         save outfile=DupPartial .

         The result I want is a dataset that looks like:

         id name     Major

           1 Ann
           2 Bill     ENGL
           3 Chris    HIST
           3 Chris    THEO
           4 David
           5 Ethel    ENGL
           6 Frank

         Is there a good way to do this?  (I'm using SPSS 14).
         Thanks.

         Catherine
Reply | Threaded
Open this post in threaded view
|

Re: "Match files" with duplicates in table

Catherine Kubitschek
I'm sorry.  I put a double negative into my message.  My problem is that
the id IS duplicated in what would be the table. -Catherine

At 9/27/2007 10:01 PM, Hector Maletta wrote:

>          Catherine,
>          no problem as far as the ID is not duplicated.
>          MATCH FILES /FILE= UNDUPCOMPLETE/FILE= DupPartial/by ID.
>
>          This produces exactly what you want.
>
>          Hector
>
>          -----Original Message-----
>From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
>Catherine Kubitschek
>Sent: 27 September 2007 22:34
>To: [hidden email]
>Subject: "Match files" with duplicates in table
>
>          I want to something like a MATCH FILES FILE=x /TABLE=y /BY z .
>Except the
>          "y" file is not unduplicated on "z".
>
>          This is much easier to explain with an example:
>
>          /* First file - id & name, unduplicated (pre-sorted) .
>          data list /id 1 name 3-10 (a) .
>          begin data
>          1 Ann
>          2 Bill
>          3 Chris
>          4 David
>          5 Ethel
>          6 Frank
>          end data .
>          save outfile=UndupComplete .
>
>          /* Second file - id & major, duplicated but subset of id
>(pre-sorted) .
>          data list /id 1 Major 3-6 (a) .
>          begin data
>          2 ENGL
>          3 HIST
>          3 THEO
>          5 ENGL
>          end data .
>          save outfile=DupPartial .
>
>          The result I want is a dataset that looks like:
>
>          id name     Major
>
>            1 Ann
>            2 Bill     ENGL
>            3 Chris    HIST
>            3 Chris    THEO
>            4 David
>            5 Ethel    ENGL
>            6 Frank
>
>          Is there a good way to do this?  (I'm using SPSS 14).
>          Thanks.
>
>          Catherine
Reply | Threaded
Open this post in threaded view
|

Re: "Match files" with duplicates in table

Richard Ristow
In reply to this post by Catherine Kubitschek
Bless you for really good test data! And exactly the right kind of
subject line: Stating the nature of the problem, succinctly, with no
text except that statement.

At 09:34 PM 9/27/2007, Catherine Kubitschek wrote:

>I want to something like a MATCH FILES FILE=x /TABLE=y /BY z . Except
>the
>"y" file is not unduplicated on "z". The result I want is a dataset
>that looks like:
>
>id name     Major
>
>  1 Ann
>  2 Bill     ENGL
>  3 Chris    HIST
>  3 Chris    THEO
>  4 David
>  5 Ethel    ENGL
>  6 Frank

One the way to do this is with MATCH FILES, ADD FILES, and SELECT. (I
tried a couple of variant solutions, but I don't like them.) Following
is SPSS 14 draft output. I'm using datasets instead of your scratch
.SAV files.
|-----------------------------|---------------------------|
|Output Created               |28-SEP-2007 12:37:31       |
|-----------------------------|---------------------------|
[UndupComplete]
id name

  1 Ann
  2 Bill
  3 Chris
  4 David
  5 Ethel
  6 Frank

Number of cases read:  6    Number of cases listed:  6

|-----------------------------|---------------------------|
|Output Created               |28-SEP-2007 12:37:31       |
|-----------------------------|---------------------------|
[DupPartial]

id Major

  2 ENGL
  3 HIST
  3 THEO
  5 ENGL

Number of cases read:  4    Number of cases listed:  4


*  ...  I.   Double-match. May not give desired results if "MAJOR"   .
*  ...  is ever blank in 'DupPartial'                                .

MATCH FILES
    /TABLE=UndupComplete
    /FILE =DupPartial
    /BY    ID.

DATASET NAME     I_DoubleMatch WINDOW=FRONT.
.  /*--  LIST /*-*/.

ADD  FILES
    /FILE = *
    /FILE = UndupComplete
    /BY     ID.

.  /*--  LIST /*-*/.

SELECT IF MAJOR NE ' '
        OR MISSING(LAG(ID))
        OR ID   NE LAG(ID) .

LIST.

List
|-----------------------------|---------------------------|
|Output Created               |28-SEP-2007 12:37:33       |
|-----------------------------|---------------------------|
[I_DoubleMatch]

id name     Major

  1 Ann
  2 Bill     ENGL
  3 Chris    HIST
  3 Chris    THEO
  4 David
  5 Ethel    ENGL
  6 Frank

Number of cases read:  7    Number of cases listed:  7
===================
APPENDIX: Test data
===================
*  ................................................................. .
*  .................   Test data, from posting ..................... .
*  (Excellent. I didn't have to touch it, except replacing SAVE by   .
*  DATASET commands.)                                                .

/* First file - id & name, unduplicated (pre-sorted) .
data list /id 1 name 3-10 (a) .
begin data
1 Ann
2 Bill
3 Chris
4 David
5 Ethel
6 Frank
end data .
* ... save outfile=UndupComplete .
DATASET NAME       UndupComplete WINDOW=FRONT.


/* Second file - id & major, duplicated but subset of id (pre-sorted) .
data list /id 1 Major 3-6 (a) .
begin data
2 ENGL
3 HIST
3 THEO
5 ENGL
end data .
* ... save outfile=DupPartial .
DATASET NAME       DupPartial      WINDOW=FRONT.

DATASET ACTIVATE   UndupComplete   WINDOW=FRONT.
LIST.

DATASET ACTIVATE   DupPartial      WINDOW=FRONT.
LIST.
Reply | Threaded
Open this post in threaded view
|

Re: "Match files" with duplicates in table

Hal 9000
In reply to this post by Catherine Kubitschek
Catherine, Hector's solution produces a result of:

     ID        NAME      MAJOR
-----------------------------------------------
       1        Ann
       2        Bill            Engl
       3        Chris           Hist
       3                        Theo
       4        David
       5        Ethel           Engl
       6        Frank

...even though you get error message:

File #2
     KEY:        3

>Warning # 5132
>Duplicate key in a file.  The BY variables do not uniquely identify each
>case on the indicated file.  Please check the results carefully.

If I were you, I'd do it this way, sort by id (a) and name (d) and
compute name as equal to lag(name) if id = lag(id).

Maybe another approach would be to flatten out the file with
duplicates using casestovars:

casestovars
        /id = id.

Then do the match file and you get:

      ID      NAME      MAJOR.1   MAJOR.2
-----------------------------------------------------------------
       1        Ann
       2        Bill            Engl
       3        Chris           Hist            Theo
       4        David
       5        Ethel           Engl
       6        Frank

Hope that helped!
-Gary
Reply | Threaded
Open this post in threaded view
|

Re: "Match files" with duplicates in table

Hector Maletta
         I have pursued privately my dialog with Catherine and the final
result is thus:

         MATCH FILES
                     /TABLE = UNDUPCOMPLETE
                     /FILE = DupPartial/by ID.
         MATCH FILES /FILE = * /FILE= UNDUPCOMPLETE /BY ID.

         SAVE OUTFILE =NEWFILE.

         The first MATCH assigns data from the list of students to each
student-course record in the second file. The second MATCH will have the
following effect: each student-course already included in the active file
(resulting from the first step) will be included; each student not included
in the active file (omitted in the results of the first step because not
registered in any course) will be included too, but with a blank course.

          There might be a more elegant solution, but it is late here and I
cannot think of anything better right now.



         Hector



         -----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Hal
9000
Sent: 28 September 2007 14:30
To: [hidden email]
Subject: Re: "Match files" with duplicates in table

         Catherine, Hector's solution produces a result of:

              ID        NAME      MAJOR
         -----------------------------------------------
                1        Ann
                2        Bill            Engl
                3        Chris           Hist
                3                        Theo
                4        David
                5        Ethel           Engl
                6        Frank

         ...even though you get error message:

         File #2
              KEY:        3

         >Warning # 5132
         >Duplicate key in a file.  The BY variables do not uniquely
identify each
         >case on the indicated file.  Please check the results carefully.

         If I were you, I'd do it this way, sort by id (a) and name (d) and
         compute name as equal to lag(name) if id = lag(id).

         Maybe another approach would be to flatten out the file with
         duplicates using casestovars:

         casestovars
                 /id = id.

         Then do the match file and you get:

               ID      NAME      MAJOR.1   MAJOR.2
         -----------------------------------------------------------------
                1        Ann
                2        Bill            Engl
                3        Chris           Hist            Theo
                4        David
                5        Ethel           Engl
                6        Frank

         Hope that helped!
         -Gary
Reply | Threaded
Open this post in threaded view
|

Re: "Match files" with duplicates in table

Hal 9000
Cool!

On 9/28/07, Hector Maletta <[hidden email]> wrote:

>         I have pursued privately my dialog with Catherine and the final
> result is thus:
>
>         MATCH FILES
>                     /TABLE = UNDUPCOMPLETE
>                     /FILE = DupPartial/by ID.
>         MATCH FILES /FILE = * /FILE= UNDUPCOMPLETE /BY ID.
>
>         SAVE OUTFILE =NEWFILE.
>
>         The first MATCH assigns data from the list of students to each
> student-course record in the second file. The second MATCH will have the
> following effect: each student-course already included in the active file
> (resulting from the first step) will be included; each student not included
> in the active file (omitted in the results of the first step because not
> registered in any course) will be included too, but with a blank course.
>
>          There might be a more elegant solution, but it is late here and I
> cannot think of anything better right now.
>
>
>
>         Hector
>
>
>
>         -----Original Message-----
> From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Hal
> 9000
> Sent: 28 September 2007 14:30
> To: [hidden email]
> Subject: Re: "Match files" with duplicates in table
>
>         Catherine, Hector's solution produces a result of:
>
>              ID        NAME      MAJOR
>         -----------------------------------------------
>                1        Ann
>                2        Bill            Engl
>                3        Chris           Hist
>                3                        Theo
>                4        David
>                5        Ethel           Engl
>                6        Frank
>
>         ...even though you get error message:
>
>         File #2
>              KEY:        3
>
>         >Warning # 5132
>         >Duplicate key in a file.  The BY variables do not uniquely
> identify each
>         >case on the indicated file.  Please check the results carefully.
>
>         If I were you, I'd do it this way, sort by id (a) and name (d) and
>         compute name as equal to lag(name) if id = lag(id).
>
>         Maybe another approach would be to flatten out the file with
>         duplicates using casestovars:
>
>         casestovars
>                 /id = id.
>
>         Then do the match file and you get:
>
>               ID      NAME      MAJOR.1   MAJOR.2
>         -----------------------------------------------------------------
>                1        Ann
>                2        Bill            Engl
>                3        Chris           Hist            Theo
>                4        David
>                5        Ethel           Engl
>                6        Frank
>
>         Hope that helped!
>         -Gary
>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: "Match files" with duplicates in table

Albert-Jan Roskam
In reply to this post by Hal 9000
Yep, provided that there are not too many duplicates I
usually also use:
sort cases by id.
casestovars /id=id.
match files...
Many duplicates + many records sometimes make spss
crash when you attempt to run a casestovars command.

Cheers!!
Albert-Jan

--- Hal 9000 <[hidden email]> wrote:

> Catherine, Hector's solution produces a result of:
>
>      ID        NAME      MAJOR
> -----------------------------------------------
>        1        Ann
>        2        Bill            Engl
>        3        Chris           Hist
>        3                        Theo
>        4        David
>        5        Ethel           Engl
>        6        Frank
>
> ...even though you get error message:
>
> File #2
>      KEY:        3
>
> >Warning # 5132
> >Duplicate key in a file.  The BY variables do not
> uniquely identify each
> >case on the indicated file.  Please check the
> results carefully.
>
> If I were you, I'd do it this way, sort by id (a)
> and name (d) and
> compute name as equal to lag(name) if id = lag(id).
>
> Maybe another approach would be to flatten out the
> file with
> duplicates using casestovars:
>
> casestovars
>         /id = id.
>
> Then do the match file and you get:
>
>       ID      NAME      MAJOR.1   MAJOR.2
>
-----------------------------------------------------------------

>        1        Ann
>        2        Bill            Engl
>        3        Chris           Hist            Theo
>        4        David
>        5        Ethel           Engl
>        6        Frank
>
> Hope that helped!
> -Gary
>


Cheers!
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Did you know that 87.166253% of all statistics claim a precision of results that is not justified by the method employed? [HELMUT RICHTER]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



____________________________________________________________________________________
Be a better Globetrotter. Get better travel answers from someone who knows. Yahoo! Answers - Check it out.
http://answers.yahoo.com/dir/?link=list&sid=396545469
Reply | Threaded
Open this post in threaded view
|

Re: "Match files" with duplicates in table

Richard Ristow
In reply to this post by Hector Maletta
At 02:30 PM 9/28/2007, Hector Maletta wrote:

>My final result is thus:

SPSS 14 draft output, test data as before:

MATCH FILES
            /TABLE = UNDUPCOMPLETE
            /FILE = DupPartial/by ID.
MATCH FILES /FILE = * /FILE= UNDUPCOMPLETE /BY ID.

LIST.

List
|-----------------------------|---------------------------|
|Output Created               |30-SEP-2007 17:47:58       |
|-----------------------------|---------------------------|
id name     Major

  1 Ann
  2 Bill     ENGL
File #1
      KEY:        3

 >Warning # 5132
 >Duplicate key in a file.  The BY variables do
not uniquely identify each
 >case on the indicated file.  Please check the results carefully.

  3 Chris    HIST
  3 Chris    THEO
  4 David
  5 Ethel    ENGL
  6 Frank

Number of cases read:  7    Number of cases listed:  7

(Gary Moser, <[hidden email]>, tested
earlier, and reported the same error message that's here.)

Hector wrote,

>There might be a more elegant solution, but it
>is late here and I cannot think of anything better right now.

THERE'S a question of aesthetics. I expected,
myself, that a neat double MATCH FILES would do
it. I rejected it without writing the code
because of the problem of non-unique keys, as
reported in the warning message above. But
Hector's code gives the right result, and I think will do so reliably.

I don't like to put code in production that gives
warning messages. The last thing you want is to
leave users either blasè or confused about
whether to pay attention to warning messages. But
Hector's code does work, and it's simpler and
clearer than the MATCH FILES/ADD FILES logic I posted.

Now we can be like one of those TV shows where
viewers vote on which singer, or something, is
the best. Or maybe, simply, de gustibus non disputandum est.
Reply | Threaded
Open this post in threaded view
|

Re: "Match files" with duplicates in table

ViAnn Beadle
That "warning" message is more of a note and usually can be ignored for a
FILE file. It is however, a fatal error for a TABLE file.

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Richard Ristow
Sent: Sunday, September 30, 2007 4:40 PM
To: [hidden email]
Subject: Re: "Match files" with duplicates in table

At 02:30 PM 9/28/2007, Hector Maletta wrote:

>My final result is thus:

SPSS 14 draft output, test data as before:

MATCH FILES
            /TABLE = UNDUPCOMPLETE
            /FILE = DupPartial/by ID.
MATCH FILES /FILE = * /FILE= UNDUPCOMPLETE /BY ID.

LIST.

List
|-----------------------------|---------------------------|
|Output Created               |30-SEP-2007 17:47:58       |
|-----------------------------|---------------------------|
id name     Major

  1 Ann
  2 Bill     ENGL
File #1
      KEY:        3

 >Warning # 5132
 >Duplicate key in a file.  The BY variables do
not uniquely identify each
 >case on the indicated file.  Please check the results carefully.

  3 Chris    HIST
  3 Chris    THEO
  4 David
  5 Ethel    ENGL
  6 Frank

Number of cases read:  7    Number of cases listed:  7

(Gary Moser, <[hidden email]>, tested
earlier, and reported the same error message that's here.)

Hector wrote,

>There might be a more elegant solution, but it
>is late here and I cannot think of anything better right now.

THERE'S a question of aesthetics. I expected,
myself, that a neat double MATCH FILES would do
it. I rejected it without writing the code
because of the problem of non-unique keys, as
reported in the warning message above. But
Hector's code gives the right result, and I think will do so reliably.

I don't like to put code in production that gives
warning messages. The last thing you want is to
leave users either blasè or confused about
whether to pay attention to warning messages. But
Hector's code does work, and it's simpler and
clearer than the MATCH FILES/ADD FILES logic I posted.

Now we can be like one of those TV shows where
viewers vote on which singer, or something, is
the best. Or maybe, simply, de gustibus non disputandum est.
Reply | Threaded
Open this post in threaded view
|

Re: "Match files" with duplicates in table

Hector Maletta
         The problem with Catherine's situation is that she wants first to
assign student's data to all course registrations, and then she wants to add
student data for all students not registered in any course.
         The first problem is well solved with a TABLE subcommand, using the
general student list as a TABLE and the course registration list as a file.
Of course, the latter may contain duplicate ID records since the same
students may have registered in more than one course.
         For the second problem, I responded in haste and late at night,
sorry. Before matching or adding, one needs to exclude from the student list
all students already included before, i.e. exclude all students registered
for courses, and then ADD to the course file the remaining students from the
student list. This complicates things a bit.
         The complete process may look like this:
         *Assign student data to course registration records.
         MATCH FILES/TABLE='UNDUPCOMPLETE.SAV'/FILE='DupPartial.SAV'/by ID.
         SAVE OUTFILE 'COURSEFILE.SAV'.
         *Flag records with course registration.
         COMPUTE REGCOURSE=1.
         *Aggregate registration records by student.
         AGGREGATE OUTFILE=*/PRESORTED/BREAK ID
         /REGCOURSE=MAX(REGCOURSE).
         *Match records of student registered in courses, with student list.
         MATCH FILES /FILE 'UNDUPCOMPLETE.SAV'/FILE=*/BY ID.
         *Exclude from list all student registered in courses.
         SELECT IF SYSMIS(REGCOURSE) OR REGCOURSE=0.
         *Add registered and non-registered students in a single list.
         ADD FILES /FILE 'COURSEFILE.SAV'/FILE *.
         SAVE OUTFILE 'FINAL.SAV'.

         This is untested. Hope it works.

         Hector

         -----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
ViAnn Beadle
Sent: 01 October 2007 10:08
To: [hidden email]
Subject: Re: "Match files" with duplicates in table

         That "warning" message is more of a note and usually can be ignored
for a
         FILE file. It is however, a fatal error for a TABLE file.

         -----Original Message-----
         From: SPSSX(r) Discussion [mailto:[hidden email]] On
Behalf Of
         Richard Ristow
         Sent: Sunday, September 30, 2007 4:40 PM
         To: [hidden email]
         Subject: Re: "Match files" with duplicates in table

         At 02:30 PM 9/28/2007, Hector Maletta wrote:

         >My final result is thus:

         SPSS 14 draft output, test data as before:

         MATCH FILES
                     /TABLE = UNDUPCOMPLETE
                     /FILE = DupPartial/by ID.
         MATCH FILES /FILE = * /FILE= UNDUPCOMPLETE /BY ID.

         LIST.

         List
         |-----------------------------|---------------------------|
         |Output Created               |30-SEP-2007 17:47:58       |
         |-----------------------------|---------------------------|
         id name     Major

           1 Ann
           2 Bill     ENGL
         File #1
               KEY:        3

          >Warning # 5132
          >Duplicate key in a file.  The BY variables do
         not uniquely identify each
          >case on the indicated file.  Please check the results carefully.

           3 Chris    HIST
           3 Chris    THEO
           4 David
           5 Ethel    ENGL
           6 Frank

         Number of cases read:  7    Number of cases listed:  7

         (Gary Moser, <[hidden email]>, tested
         earlier, and reported the same error message that's here.)

         Hector wrote,

         >There might be a more elegant solution, but it
         >is late here and I cannot think of anything better right now.

         THERE'S a question of aesthetics. I expected,
         myself, that a neat double MATCH FILES would do
         it. I rejected it without writing the code
         because of the problem of non-unique keys, as
         reported in the warning message above. But
         Hector's code gives the right result, and I think will do so
reliably.

         I don't like to put code in production that gives
         warning messages. The last thing you want is to
         leave users either blasè or confused about
         whether to pay attention to warning messages. But
         Hector's code does work, and it's simpler and
         clearer than the MATCH FILES/ADD FILES logic I posted.

         Now we can be like one of those TV shows where
         viewers vote on which singer, or something, is
         the best. Or maybe, simply, de gustibus non disputandum est.