SPSSX Discussion

"Match files" with duplicates in table

Classic

List

Threaded

11 messages Options

Catherine Kubitschek

"Match files" with duplicates in table

I want to something like a MATCH FILES FILE=x /TABLE=y /BY z . Except the
"y" file is not unduplicated on "z".

This is much easier to explain with an example:

/* First file - id & name, unduplicated (pre-sorted) .
data list /id 1 name 3-10 (a) .
begin data
1 Ann
2 Bill
3 Chris
4 David
5 Ethel
6 Frank
end data .
save outfile=UndupComplete .

/* Second file - id & major, duplicated but subset of id (pre-sorted) .
data list /id 1 Major 3-6 (a) .
begin data
2 ENGL
3 HIST
3 THEO
5 ENGL
end data .
save outfile=DupPartial .

The result I want is a dataset that looks like:

id name Major

1 Ann
2 Bill ENGL
3 Chris HIST
3 Chris THEO
4 David
5 Ethel ENGL
6 Frank

Is there a good way to do this? (I'm using SPSS 14).
Thanks.

Catherine

Hector Maletta

Re: "Match files" with duplicates in table

Catherine,
no problem as far as the ID is not duplicated.
MATCH FILES /FILE= UNDUPCOMPLETE/FILE= DupPartial/by ID.

This produces exactly what you want.

Hector

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Catherine Kubitschek
Sent: 27 September 2007 22:34
To: [hidden email]
Subject: "Match files" with duplicates in table

I want to something like a MATCH FILES FILE=x /TABLE=y /BY z .
Except the
"y" file is not unduplicated on "z".

This is much easier to explain with an example:

/* First file - id & name, unduplicated (pre-sorted) .
data list /id 1 name 3-10 (a) .
begin data
1 Ann
2 Bill
3 Chris
4 David
5 Ethel
6 Frank
end data .
save outfile=UndupComplete .

/* Second file - id & major, duplicated but subset of id
(pre-sorted) .
data list /id 1 Major 3-6 (a) .
begin data
2 ENGL
3 HIST
3 THEO
5 ENGL
end data .
save outfile=DupPartial .

The result I want is a dataset that looks like:

id name Major

1 Ann
2 Bill ENGL
3 Chris HIST
3 Chris THEO
4 David
5 Ethel ENGL
6 Frank

Is there a good way to do this? (I'm using SPSS 14).
Thanks.

Catherine

Catherine Kubitschek

Re: "Match files" with duplicates in table

I'm sorry. I put a double negative into my message. My problem is that
the id IS duplicated in what would be the table. -Catherine

At 9/27/2007 10:01 PM, Hector Maletta wrote:

> Catherine,
> no problem as far as the ID is not duplicated.
> MATCH FILES /FILE= UNDUPCOMPLETE/FILE= DupPartial/by ID.
>
> This produces exactly what you want.
>
> Hector
>
> -----Original Message-----
>From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
>Catherine Kubitschek
>Sent: 27 September 2007 22:34
>To: [hidden email]
>Subject: "Match files" with duplicates in table
>
> I want to something like a MATCH FILES FILE=x /TABLE=y /BY z .
>Except the
> "y" file is not unduplicated on "z".
>
> This is much easier to explain with an example:
>
> /* First file - id & name, unduplicated (pre-sorted) .
> data list /id 1 name 3-10 (a) .
> begin data
> 1 Ann
> 2 Bill
> 3 Chris
> 4 David
> 5 Ethel
> 6 Frank
> end data .
> save outfile=UndupComplete .
>
> /* Second file - id & major, duplicated but subset of id
>(pre-sorted) .
> data list /id 1 Major 3-6 (a) .
> begin data
> 2 ENGL
> 3 HIST
> 3 THEO
> 5 ENGL
> end data .
> save outfile=DupPartial .
>
> The result I want is a dataset that looks like:
>
> id name Major
>
> 1 Ann
> 2 Bill ENGL
> 3 Chris HIST
> 3 Chris THEO
> 4 David
> 5 Ethel ENGL
> 6 Frank
>
> Is there a good way to do this? (I'm using SPSS 14).
> Thanks.
>
> Catherine

Richard Ristow

Re: "Match files" with duplicates in table

In reply to this post by Catherine Kubitschek

Bless you for really good test data! And exactly the right kind of
subject line: Stating the nature of the problem, succinctly, with no
text except that statement.

At 09:34 PM 9/27/2007, Catherine Kubitschek wrote:

>I want to something like a MATCH FILES FILE=x /TABLE=y /BY z . Except
>the
>"y" file is not unduplicated on "z". The result I want is a dataset
>that looks like:
>
>id name Major
>
> 1 Ann
> 2 Bill ENGL
> 3 Chris HIST
> 3 Chris THEO
> 4 David
> 5 Ethel ENGL
> 6 Frank

One the way to do this is with MATCH FILES, ADD FILES, and SELECT. (I
tried a couple of variant solutions, but I don't like them.) Following
is SPSS 14 draft output. I'm using datasets instead of your scratch
.SAV files.
|-----------------------------|---------------------------|
|Output Created |28-SEP-2007 12:37:31 |
|-----------------------------|---------------------------|
[UndupComplete]
id name

1 Ann
2 Bill
3 Chris
4 David
5 Ethel
6 Frank

Number of cases read: 6 Number of cases listed: 6

|-----------------------------|---------------------------|
|Output Created |28-SEP-2007 12:37:31 |
|-----------------------------|---------------------------|
[DupPartial]

id Major

2 ENGL
3 HIST
3 THEO
5 ENGL

Number of cases read: 4 Number of cases listed: 4

* ... I. Double-match. May not give desired results if "MAJOR" .
* ... is ever blank in 'DupPartial' .

MATCH FILES
/TABLE=UndupComplete
/FILE =DupPartial
/BY ID.

DATASET NAME I_DoubleMatch WINDOW=FRONT.
. /*-- LIST /*-*/.

ADD FILES
/FILE = *
/FILE = UndupComplete
/BY ID.

. /*-- LIST /*-*/.

SELECT IF MAJOR NE ' '
OR MISSING(LAG(ID))
OR ID NE LAG(ID) .

LIST.

List
|-----------------------------|---------------------------|
|Output Created |28-SEP-2007 12:37:33 |
|-----------------------------|---------------------------|
[I_DoubleMatch]

id name Major

1 Ann
2 Bill ENGL
3 Chris HIST
3 Chris THEO
4 David
5 Ethel ENGL
6 Frank

Number of cases read: 7 Number of cases listed: 7
===================
APPENDIX: Test data
===================
* ................................................................. .
* ................. Test data, from posting ..................... .
* (Excellent. I didn't have to touch it, except replacing SAVE by .
* DATASET commands.) .

/* First file - id & name, unduplicated (pre-sorted) .
data list /id 1 name 3-10 (a) .
begin data
1 Ann
2 Bill
3 Chris
4 David
5 Ethel
6 Frank
end data .
* ... save outfile=UndupComplete .
DATASET NAME UndupComplete WINDOW=FRONT.

/* Second file - id & major, duplicated but subset of id (pre-sorted) .
data list /id 1 Major 3-6 (a) .
begin data
2 ENGL
3 HIST
3 THEO
5 ENGL
end data .
* ... save outfile=DupPartial .
DATASET NAME DupPartial WINDOW=FRONT.

DATASET ACTIVATE UndupComplete WINDOW=FRONT.
LIST.

DATASET ACTIVATE DupPartial WINDOW=FRONT.
LIST.

Hal 9000

Re: "Match files" with duplicates in table

In reply to this post by Catherine Kubitschek

Catherine, Hector's solution produces a result of:

ID NAME MAJOR
-----------------------------------------------
1 Ann
2 Bill Engl
3 Chris Hist
3 Theo
4 David
5 Ethel Engl
6 Frank

...even though you get error message:

File #2
KEY: 3

>Warning # 5132
>Duplicate key in a file. The BY variables do not uniquely identify each
>case on the indicated file. Please check the results carefully.

If I were you, I'd do it this way, sort by id (a) and name (d) and
compute name as equal to lag(name) if id = lag(id).

Maybe another approach would be to flatten out the file with
duplicates using casestovars:

casestovars
/id = id.

Then do the match file and you get:

ID NAME MAJOR.1 MAJOR.2
-----------------------------------------------------------------
1 Ann
2 Bill Engl
3 Chris Hist Theo
4 David
5 Ethel Engl
6 Frank

Hope that helped!
-Gary

Hector Maletta

Re: "Match files" with duplicates in table

I have pursued privately my dialog with Catherine and the final
result is thus:

MATCH FILES
/TABLE = UNDUPCOMPLETE
/FILE = DupPartial/by ID.
MATCH FILES /FILE = * /FILE= UNDUPCOMPLETE /BY ID.

SAVE OUTFILE =NEWFILE.

The first MATCH assigns data from the list of students to each
student-course record in the second file. The second MATCH will have the
following effect: each student-course already included in the active file
(resulting from the first step) will be included; each student not included
in the active file (omitted in the results of the first step because not
registered in any course) will be included too, but with a blank course.

There might be a more elegant solution, but it is late here and I
cannot think of anything better right now.

Hector

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Hal
9000
Sent: 28 September 2007 14:30
To: [hidden email]
Subject: Re: "Match files" with duplicates in table

Catherine, Hector's solution produces a result of:

ID NAME MAJOR
-----------------------------------------------
1 Ann
2 Bill Engl
3 Chris Hist
3 Theo
4 David
5 Ethel Engl
6 Frank

...even though you get error message:

File #2
KEY: 3

>Warning # 5132
>Duplicate key in a file. The BY variables do not uniquely
identify each
>case on the indicated file. Please check the results carefully.

If I were you, I'd do it this way, sort by id (a) and name (d) and
compute name as equal to lag(name) if id = lag(id).

Maybe another approach would be to flatten out the file with
duplicates using casestovars:

casestovars
/id = id.

Then do the match file and you get:

ID NAME MAJOR.1 MAJOR.2
-----------------------------------------------------------------
1 Ann
2 Bill Engl
3 Chris Hist Theo
4 David
5 Ethel Engl
6 Frank

Hope that helped!
-Gary

Hal 9000

Re: "Match files" with duplicates in table

Cool!

On 9/28/07, Hector Maletta <[hidden email]> wrote:

> I have pursued privately my dialog with Catherine and the final
> result is thus:
>
> MATCH FILES
> /TABLE = UNDUPCOMPLETE
> /FILE = DupPartial/by ID.
> MATCH FILES /FILE = * /FILE= UNDUPCOMPLETE /BY ID.
>
> SAVE OUTFILE =NEWFILE.
>
> The first MATCH assigns data from the list of students to each
> student-course record in the second file. The second MATCH will have the
> following effect: each student-course already included in the active file
> (resulting from the first step) will be included; each student not included
> in the active file (omitted in the results of the first step because not
> registered in any course) will be included too, but with a blank course.
>
> There might be a more elegant solution, but it is late here and I
> cannot think of anything better right now.
>
>
>
> Hector
>
>
>
> -----Original Message-----
> From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Hal
> 9000
> Sent: 28 September 2007 14:30
> To: [hidden email]
> Subject: Re: "Match files" with duplicates in table
>
> Catherine, Hector's solution produces a result of:
>
> ID NAME MAJOR
> -----------------------------------------------
> 1 Ann
> 2 Bill Engl
> 3 Chris Hist
> 3 Theo
> 4 David
> 5 Ethel Engl
> 6 Frank
>
> ...even though you get error message:
>
> File #2
> KEY: 3
>
> >Warning # 5132
> >Duplicate key in a file. The BY variables do not uniquely
> identify each
> >case on the indicated file. Please check the results carefully.
>
> If I were you, I'd do it this way, sort by id (a) and name (d) and
> compute name as equal to lag(name) if id = lag(id).
>
> Maybe another approach would be to flatten out the file with
> duplicates using casestovars:
>
> casestovars
> /id = id.
>
> Then do the match file and you get:
>
> ID NAME MAJOR.1 MAJOR.2
> -----------------------------------------------------------------
> 1 Ann
> 2 Bill Engl
> 3 Chris Hist Theo
> 4 David
> 5 Ethel Engl
> 6 Frank
>
> Hope that helped!
> -Gary
>
>
>

Albert-Jan Roskam

Re: "Match files" with duplicates in table

In reply to this post by Hal 9000

Yep, provided that there are not too many duplicates I
usually also use:
sort cases by id.
casestovars /id=id.
match files...
Many duplicates + many records sometimes make spss
crash when you attempt to run a casestovars command.

Cheers!!
Albert-Jan

--- Hal 9000 <[hidden email]> wrote:

> Catherine, Hector's solution produces a result of:
>
> ID NAME MAJOR
> -----------------------------------------------
> 1 Ann
> 2 Bill Engl
> 3 Chris Hist
> 3 Theo
> 4 David
> 5 Ethel Engl
> 6 Frank
>
> ...even though you get error message:
>
> File #2
> KEY: 3
>
> >Warning # 5132
> >Duplicate key in a file. The BY variables do not
> uniquely identify each
> >case on the indicated file. Please check the
> results carefully.
>
> If I were you, I'd do it this way, sort by id (a)
> and name (d) and
> compute name as equal to lag(name) if id = lag(id).
>
> Maybe another approach would be to flatten out the
> file with
> duplicates using casestovars:
>
> casestovars
> /id = id.
>
> Then do the match file and you get:
>
> ID NAME MAJOR.1 MAJOR.2
>

-----------------------------------------------------------------

> 1 Ann
> 2 Bill Engl
> 3 Chris Hist Theo
> 4 David
> 5 Ethel Engl
> 6 Frank
>
> Hope that helped!
> -Gary
>

Cheers!
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Did you know that 87.166253% of all statistics claim a precision of results that is not justified by the method employed? [HELMUT RICHTER]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

____________________________________________________________________________________
Be a better Globetrotter. Get better travel answers from someone who knows. Yahoo! Answers - Check it out.
http://answers.yahoo.com/dir/?link=list&sid=396545469

Richard Ristow

Re: "Match files" with duplicates in table

In reply to this post by Hector Maletta

At 02:30 PM 9/28/2007, Hector Maletta wrote:

>My final result is thus:

SPSS 14 draft output, test data as before:

MATCH FILES
/TABLE = UNDUPCOMPLETE
/FILE = DupPartial/by ID.
MATCH FILES /FILE = * /FILE= UNDUPCOMPLETE /BY ID.

LIST.

List
|-----------------------------|---------------------------|
|Output Created |30-SEP-2007 17:47:58 |
|-----------------------------|---------------------------|
id name Major

1 Ann
2 Bill ENGL
File #1
KEY: 3

>Warning # 5132
>Duplicate key in a file. The BY variables do
not uniquely identify each
>case on the indicated file. Please check the results carefully.

3 Chris HIST
3 Chris THEO
4 David
5 Ethel ENGL
6 Frank

Number of cases read: 7 Number of cases listed: 7

(Gary Moser, <[hidden email]>, tested
earlier, and reported the same error message that's here.)

Hector wrote,

>There might be a more elegant solution, but it
>is late here and I cannot think of anything better right now.

THERE'S a question of aesthetics. I expected,
myself, that a neat double MATCH FILES would do
it. I rejected it without writing the code
because of the problem of non-unique keys, as
reported in the warning message above. But
Hector's code gives the right result, and I think will do so reliably.

I don't like to put code in production that gives
warning messages. The last thing you want is to
leave users either blasè or confused about
whether to pay attention to warning messages. But
Hector's code does work, and it's simpler and
clearer than the MATCH FILES/ADD FILES logic I posted.

Now we can be like one of those TV shows where
viewers vote on which singer, or something, is
the best. Or maybe, simply, de gustibus non disputandum est.

ViAnn Beadle

Re: "Match files" with duplicates in table

That "warning" message is more of a note and usually can be ignored for a
FILE file. It is however, a fatal error for a TABLE file.

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Richard Ristow
Sent: Sunday, September 30, 2007 4:40 PM
To: [hidden email]
Subject: Re: "Match files" with duplicates in table

At 02:30 PM 9/28/2007, Hector Maletta wrote:

>My final result is thus:

SPSS 14 draft output, test data as before:

MATCH FILES
/TABLE = UNDUPCOMPLETE
/FILE = DupPartial/by ID.
MATCH FILES /FILE = * /FILE= UNDUPCOMPLETE /BY ID.

LIST.

List
|-----------------------------|---------------------------|
|Output Created |30-SEP-2007 17:47:58 |
|-----------------------------|---------------------------|
id name Major

1 Ann
2 Bill ENGL
File #1
KEY: 3

>Warning # 5132
>Duplicate key in a file. The BY variables do
not uniquely identify each
>case on the indicated file. Please check the results carefully.

3 Chris HIST
3 Chris THEO
4 David
5 Ethel ENGL
6 Frank

Number of cases read: 7 Number of cases listed: 7

(Gary Moser, <[hidden email]>, tested
earlier, and reported the same error message that's here.)

Hector wrote,

>There might be a more elegant solution, but it
>is late here and I cannot think of anything better right now.

THERE'S a question of aesthetics. I expected,
myself, that a neat double MATCH FILES would do
it. I rejected it without writing the code
because of the problem of non-unique keys, as
reported in the warning message above. But
Hector's code gives the right result, and I think will do so reliably.

I don't like to put code in production that gives
warning messages. The last thing you want is to
leave users either blasè or confused about
whether to pay attention to warning messages. But
Hector's code does work, and it's simpler and
clearer than the MATCH FILES/ADD FILES logic I posted.

Now we can be like one of those TV shows where
viewers vote on which singer, or something, is
the best. Or maybe, simply, de gustibus non disputandum est.

Hector Maletta

Re: "Match files" with duplicates in table

The problem with Catherine's situation is that she wants first to
assign student's data to all course registrations, and then she wants to add
student data for all students not registered in any course.
The first problem is well solved with a TABLE subcommand, using the
general student list as a TABLE and the course registration list as a file.
Of course, the latter may contain duplicate ID records since the same
students may have registered in more than one course.
For the second problem, I responded in haste and late at night,
sorry. Before matching or adding, one needs to exclude from the student list
all students already included before, i.e. exclude all students registered
for courses, and then ADD to the course file the remaining students from the
student list. This complicates things a bit.
The complete process may look like this:
*Assign student data to course registration records.
MATCH FILES/TABLE='UNDUPCOMPLETE.SAV'/FILE='DupPartial.SAV'/by ID.
SAVE OUTFILE 'COURSEFILE.SAV'.
*Flag records with course registration.
COMPUTE REGCOURSE=1.
*Aggregate registration records by student.
AGGREGATE OUTFILE=*/PRESORTED/BREAK ID
/REGCOURSE=MAX(REGCOURSE).
*Match records of student registered in courses, with student list.
MATCH FILES /FILE 'UNDUPCOMPLETE.SAV'/FILE=*/BY ID.
*Exclude from list all student registered in courses.
SELECT IF SYSMIS(REGCOURSE) OR REGCOURSE=0.
*Add registered and non-registered students in a single list.
ADD FILES /FILE 'COURSEFILE.SAV'/FILE *.
SAVE OUTFILE 'FINAL.SAV'.

This is untested. Hope it works.

Hector

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
ViAnn Beadle
Sent: 01 October 2007 10:08
To: [hidden email]
Subject: Re: "Match files" with duplicates in table

That "warning" message is more of a note and usually can be ignored
for a
FILE file. It is however, a fatal error for a TABLE file.

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On
Behalf Of
Richard Ristow
Sent: Sunday, September 30, 2007 4:40 PM
To: [hidden email]
Subject: Re: "Match files" with duplicates in table

At 02:30 PM 9/28/2007, Hector Maletta wrote:

>My final result is thus:

SPSS 14 draft output, test data as before:

MATCH FILES
/TABLE = UNDUPCOMPLETE
/FILE = DupPartial/by ID.
MATCH FILES /FILE = * /FILE= UNDUPCOMPLETE /BY ID.

LIST.

List
|-----------------------------|---------------------------|
|Output Created |30-SEP-2007 17:47:58 |
|-----------------------------|---------------------------|
id name Major

1 Ann
2 Bill ENGL
File #1
KEY: 3

>Warning # 5132
>Duplicate key in a file. The BY variables do
not uniquely identify each
>case on the indicated file. Please check the results carefully.

3 Chris HIST
3 Chris THEO
4 David
5 Ethel ENGL
6 Frank

Number of cases read: 7 Number of cases listed: 7

(Gary Moser, <[hidden email]>, tested
earlier, and reported the same error message that's here.)

Hector wrote,

>There might be a more elegant solution, but it
>is late here and I cannot think of anything better right now.

THERE'S a question of aesthetics. I expected,
myself, that a neat double MATCH FILES would do
it. I rejected it without writing the code
because of the problem of non-unique keys, as
reported in the warning message above. But
Hector's code gives the right result, and I think will do so
reliably.

I don't like to put code in production that gives
warning messages. The last thing you want is to
leave users either blasè or confused about
whether to pay attention to warning messages. But
Hector's code does work, and it's simpler and
clearer than the MATCH FILES/ADD FILES logic I posted.

Now we can be like one of those TV shows where
viewers vote on which singer, or something, is
the best. Or maybe, simply, de gustibus non disputandum est.