SPSSX Discussion

How to reconcile two files in SPSS 19?

Classic

List

Threaded

9 messages Options

Danielle Boyce

How to reconcile two files in SPSS 19?

Hello friends,

I have two files:

File 1 has patient ID, test date, test result 1, test result 2, test result 3, etc. (IDs and all test results are numeric) for some 2,000 records. Not all rows are unique patients, as each patient can have more than one test date.

File 2 has identical variable names to File 1 for about 900 records.

Both files contribute some unique records and some duplicate records. File 1 contains data from 2003-present; File 2 has data from 2007-present. The overlapping records are from 2007-present.

Some of the duplicate records overlap, but contribute different information for the same test. In other words, File 1 has patient #1234, test date 1/1/2008 with test result 1 as 343 and test results 2 and 3 are blank (because data collectors didn't realize that they were supposed to enter test results 2 and 3). File 2 has patient #1234, test date 1/1/2008 with test result 1 as blank, test result 2 as 565 and test result 3 as 678.

There is no predictability as to which fields are blank or completed for these duplicate cases. Sometimes, both files contain all of the completed fields for the records that are duplicated, and all I would have to do is delete one of the duplicate rows.

I want one new file which represents both the unique cases that each file contributes plus a single row for each duplicate record. For the example above, I would have one row: patient #1234, test date 1/1/2008 with test result 1 as 343, test result 2 as 565 and test result 3 as 678. Is this even possible?

I have tried flagging duplicates by patient ID and date, but it doesn't merge the rows together.

Thanks,

Danielle

John F Hall

Re: How to reconcile two files in SPSS 19?

Danielle

Are your variable names the same in each file? If so, you might be able to use MATCH FILES.

Have a look at the syntax reference guide pp 1141 – 1149

Help > Command Syntax Reference

You need to make sure both files are sorted in ID order. With both datasets open:

File > New > Syntax:

SORT CASES BY ID . [Run this on each data set]

MATCH FILES FILE <dataset1> /FILE <dataset2> / BY ID .

This will pick up variables from the first file specified and ignore them in the second, but I’m not sure what it will do with duplicate cases. It will ignore them, but it looks as if you want to keep part of the duplicate records.

Here’s an extract from the manual:

BY Subcommand

BY specifies one or more identification, or key, variables that determine which cases are to be

combined. When BY is specified, cases from one file are matched only with cases from other files

that have the same values for the key variables. BY is required unless all input files are to be

matched sequentially according to the order of cases.

􀂄 BY must follow the FILE and TABLE subcommands and any associated RENAME and IN

subcommands.

􀂄 BY specifies the names of one or more key variables. The key variables must exist in all input

files. The key variables can be numeric or long or short strings.

􀂄 All input files must be sorted in ascending order of the key variables. If necessary, use SORT

CASES before MATCH FILES.

􀂄 Missing values for key variables are handled like any other values.

􀂄 Unmatched cases are assigned system-missing values (for numeric variables) or blanks (for

string variables) for variables from files that do not contain a match.

Duplicate Cases

Duplicate cases are those with the same values for the key variables named on the BY subcommand.

􀂄 Duplicate cases are permitted in any input files except table files.

􀂄 When there is no table file, the first duplicate case in each file is matched with the first

matching case (if any) from the other files; the second duplicate case is matched with a second

matching duplicate, if any; and so on. (In effect, a parallel match is performed within groups

of duplicate cases.) Unmatched cases are assigned system-missing values (for numeric

variables) or blanks (for string variables) for variables from files that do not contain a match.

􀂄 The program displays a warning if it encounters duplicate keys in one or more of the files

being matched..

Hope this helps

John Hall

Email: [hidden email]

Website: www.surveyresearch.weebly.com

Skype: surveyresearcher1

Phone: (+33) (0) 2.33.45.91.47

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Danielle Boyce
Sent: 29 January 2012 04:26
To: [hidden email]
Subject: How to reconcile two files in SPSS 19?

Hello friends,

I have two files:

File 2 has identical variable names to File 1 for about 900 records.

I have tried flagging duplicates by patient ID and date, but it doesn't merge the rows together.

Thanks,

Danielle

Rich Ulrich

Re: How to reconcile two files in SPSS 19?

In reply to this post by Danielle Boyce

It looks to me like you want to
ADD FILES BY ID Date
and then use AGGREGATE, taking the function FIRST( ) for each
variable. I'm pretty sure that this ignores the Missings.

If there are no duplicated fields with different actual values,
LAST( ) would work similarly, with the obvious variation in choice,
when there are real values in both records.

Or, I think you could use UPDATE with the same effect as using
LAST( ) when using aggregation... if UPDATE does allow for
entirely new records, which I think it does.

--
Rich Ulrich

Date: Sat, 28 Jan 2012 19:26:21 -0800
From: [hidden email]
Subject: How to reconcile two files in SPSS 19?
To: [hidden email]

Hello friends,

I have two files:

File 2 has identical variable names to File 1 for about 900 records.

I have tried flagging duplicates by patient ID and date, but it doesn't merge the rows together.

Thanks,

Danielle

Deepanshu Bhalla

Re: How to reconcile two files in SPSS 19?

In reply to this post by John F Hall

CONTENTS DELETED

The author has deleted this message.

Danielle Boyce

Re: How to reconcile two files in SPSS 19?

In reply to this post by Rich Ulrich

Thanks so much to everyone! I will try these approaches today!

Danielle

From: Rich Ulrich <[hidden email]>

Sender: "SPSSX(r) Discussion" <[hidden email]>

Date: Sun, 29 Jan 2012 02:44:34 -0500

To: <[hidden email]>

ReplyTo: Rich Ulrich <[hidden email]>

Subject: Re: How to reconcile two files in SPSS 19?

Date: Sat, 28 Jan 2012 19:26:21 -0800
From: [hidden email]
Subject: How to reconcile two files in SPSS 19?
To: [hidden email]

Hello friends,

I have two files:

File 2 has identical variable names to File 1 for about 900 records.

I have tried flagging duplicates by patient ID and date, but it doesn't merge the rows together.

Thanks,

Danielle

David Marso

Re: How to reconcile two files in SPSS 19?

Administrator

In reply to this post by Danielle Boyce

"I want one new file which represents both the unique cases that each file contributes plus a single row for each duplicate record. For the example above, I would have one row: patient #1234, test date 1/1/2008 with test result 1 as 343, test result 2 as 565 and test result 3 as
678. Is this even possible?"
See below.

I am confused by the following "plus a single row for each duplicate record. "
WHY would you wish to retain these? Simply adds much cause for later confusion!

data list / id date test1 to test3 (f4,x,adate,3(x,f3)) .
begin data
1234 01/01/2008 343
4567 02/01/2009 123 456
end data.
SORT CASES BY id date.
save outfile 'file1.sav'.
list.
data list / id date test1 to test3 (f4,x,adate,3(x,f3)) .
begin data
1234 01/01/2008 565 678
5678 03/04/2001 345 345 222
end data.
SORT CASES BY id date.
save outfile 'file2.sav'.
list.
UPDATE FILE 'file1.sav' / FILE 'file2.sav' /BY ID date.
list.

ID DATE TEST1 TEST2 TEST3

1234 01/01/2008 343 565 678
4567 02/01/2009 123 . 456
5678 03/04/2001 345 345 222

Number of cases read: 3 Number of cases listed: 3

Danielle Boyce wrote

Hello friends,

I have two files:

File 1 has patient ID, test date, test result 1, test result 2, test result 3, etc. (IDs and all test results are numeric) for some 2,000 records. Not all rows are unique patients, as each patient can have more than one test date.

File 2 has identical variable names to File 1 for about 900 records.

Both files contribute some unique records and some duplicate records. File 1 contains data from 2003-present; File 2 has data from 2007-present. The overlapping records are from 2007-present.

Some of the duplicate records overlap, but contribute different information for the same test. In other words, File 1 has patient #1234, test date 1/1/2008 with test result 1 as 343 and test results 2 and 3 are blank (because data collectors didn't realize that they were supposed to enter test results 2 and 3). File 2 has patient #1234, test date 1/1/2008 with test result 1 as blank, test result 2 as 565 and test result 3 as 678.

There is no predictability as to which fields are blank or completed for these duplicate cases. Sometimes, both files contain all of the completed fields for the records that are duplicated, and all I would have to do is delete one of the duplicate rows.

I want one new file which represents both the unique cases that each file contributes plus a single row for each duplicate record. For the example above, I would have one row: patient #1234, test date 1/1/2008 with test result 1 as 343, test result 2 as 565 and test result 3 as
678. Is this even possible?

I have tried flagging duplicates by patient ID and date, but it doesn't merge the rows together.

Thanks,
Danielle

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

David Marso

Re: How to reconcile two files in SPSS 19?

Administrator

In reply to this post by Danielle Boyce

This advice is simply wrong! The correct approach is to use UPDATE!!!
-----------------------------------------------------------

On Sun, 29 Jan 2012 16:33:41 +0530, deepanshu bhalla
<[hidden email]> wrote:

>Hi Danielle,

I agree with John F Hall approach to handle your requirement .IÂ want to add
one more point in case of duplicates that delete all the variables that you
want to be updated before matching and sorting . For example : if you want
records to be updated from file 1 to file 2 so delete all the variables
except ID in file 2.

To find duplicates use Aggregate function with N_Break and then apply select
if command to remove duplicates.

Thanks and Regards
Deepanshu BhallaÂ

________________________________
From: John F Hall <[hidden email]>
To: [hidden email]
Sent: Sunday, 29 January 2012 12:33 PM
Subject: Re: How to reconcile two files in SPSS 19?

Danielle
Â
Are your variable names the same in each file?Â If so, you might be able to
use MATCH FILES.Â
Have a look at the syntax reference guide pp 1141 â 1149
Â
Help > Command Syntax Reference
Â
You need to make sure both files are sorted in ID order.Â With both datasets
open:
Â
File > New > Syntax:
Â
SORT CASES BY ID .Â Â Â Â Â Â Â Â Â [Run this on each data set]
Â
MATCH FILES FILE <dataset1> /FILE <dataset2> / BY ID .
Â
This will pick up variables from the first file specified and ignore them in
the second, but Iâm not sure what it will do with duplicate cases.Â It will
ignore them, but it looks as if you want to keep part of the duplicate records.
Â
Hereâs an extract from the manual:
Â
BY Subcommand
BY specifies one or more identification, or key, variables that determine
which cases are to be
combined. When BY is specified, cases from one file are matched only with
cases from other files
that have the same values for the key variables. BY is required unless all
input files are to be
matched sequentially according to the order of cases.
ô BY must follow the FILE and TABLE subcommands and any associated RENAME
and IN
subcommands.
ô BY specifies the names of one or more key variables. The key variables
must exist in all input
files. The key variables can be numeric or long or short strings.
ô All input files must be sorted in ascending order of the key variables.
If necessary, use SORT
CASES before MATCH FILES.
ô Missing values for key variables are handled like any other values.
ô Unmatched cases are assigned system-missing values (for numeric
variables) or blanks (for
string variables) for variables from files that do not contain a match.
Duplicate Cases
Duplicate cases are those with the same values for the key variables named
on the BY subcommand.
ô Duplicate cases are permitted in any input files except table files.
ô When there is no table file, the first duplicate case in each file is
matched with the first
matching case (if any) from the other files; the second duplicate case is
matched with a second
matching duplicate, if any; and so on. (In effect, a parallel match is
performed within groups
of duplicate cases.) Unmatched cases are assigned system-missing values (for
numeric
variables) or blanks (for string variables) for variables from files that do
not contain a match.
ô The program displays a warning if it encounters duplicate keys in one or
more of the files
being matched..
Â
Hope this helps
Â
John Hall
Â
Email: Â Â Â [hidden email]
Website: www.surveyresearch.weebly.com
Skype: Â Â surveyresearcher1
Phone:Â Â Â (+33) (0) 2.33.45.91.47
Â
Â
Â
From:SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Danielle Boyce
Sent: 29 January 2012 04:26
To: [hidden email]
Subject: How to reconcile two files in SPSS 19?
Â
Hello friends,
Â
I have two files:
Â
File 1 has patient ID, test date, test result 1, test result 2, test result
3, etc. (IDs and all test results are numeric) for some 2,000 records.Â Not
all rows are unique patients, as each patient can have more than one test date.
Â
File 2 has identical variable names to File 1 for about 900 records.Â
Â
Both files contribute some unique records and some duplicate records.Â File
1 contains data from 2003-present; File 2 has data from 2007-present.Â The
overlapping records are from 2007-present.Â

Some of the duplicate records overlap, but contribute different information
for the same test.Â In other words, File 1 has patient #1234, test date
1/1/2008 with test result 1 as 343 and test results 2 and 3 are blank
(because data collectors didn't realize that they were supposed to enter
test results 2 and 3). File 2 has patient #1234, test date 1/1/2008 with
test result 1 as blank, test result 2 as 565 and test result 3 as 678.Â

There is no predictability as to which fields are blank or completed for
these duplicate cases.Â Sometimes, both files contain all of the completed
fields for the records that are duplicated, and all I would have to do is
delete one of the duplicate rows.

I want one new file which represents both the unique cases that each file
contributes plus a single row for each duplicate record.Â For the example
above, I would have one row:Â patient #1234, test date 1/1/2008 with test
result 1 as 343, test result 2 as 565 and test result 3 as 678.Â Is this
even possible?

I have tried flagging duplicates by patient ID and date, but it doesn't
merge the rows together.
Â
Thanks,
Danielle

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

David Marso

Re: How to reconcile two files in SPSS 19?

Administrator

In reply to this post by Rich Ulrich

"Or, I think you could use UPDATE with the same effect as using
LAST( ) when using aggregation... if UPDATE does allow for
entirely new records, which I think it does."
It *DOES* indeed!
---

Rich Ulrich-2 wrote

It looks to me like you want to
ADD FILES BY ID Date
and then use AGGREGATE, taking the function FIRST( ) for each
variable. I'm pretty sure that this ignores the Missings.

If there are no duplicated fields with different actual values,
LAST( ) would work similarly, with the obvious variation in choice,
when there are real values in both records.

Or, I think you could use UPDATE with the same effect as using
LAST( ) when using aggregation... if UPDATE does allow for
entirely new records, which I think it does.

--
Rich Ulrich

Date: Sat, 28 Jan 2012 19:26:21 -0800
From: [hidden email]
Subject: How to reconcile two files in SPSS 19?
To: [hidden email]

Hello friends,
I have two files:
File 1 has patient ID, test date, test result 1, test result 2, test result 3, etc. (IDs and all test results are numeric) for some 2,000 records. Not all rows are unique patients, as each patient can have more than one test date.
File 2 has identical variable names to File 1 for about 900 records.

Both files contribute some unique records and some duplicate
records. File 1 contains data from 2003-present; File 2 has data from 2007-present. The overlapping records are from 2007-present.

Some of the duplicate records overlap, but contribute different information for the same test. In other words, File 1 has patient #1234, test date 1/1/2008 with test result 1 as 343 and test results 2 and 3 are blank (because data collectors didn't realize that they were supposed to enter test results 2 and 3). File 2 has patient #1234, test date 1/1/2008 with test result 1 as blank, test result 2 as 565 and test result 3 as 678.

There is no predictability as to which fields are blank or completed for these duplicate cases. Sometimes, both files contain all of the completed fields for the records that are duplicated, and all I would have to do is delete one of the duplicate rows.

I want one new file which represents both the unique cases that each file contributes
plus a single row for each duplicate record. For the example above, I would have one row: patient #1234, test date 1/1/2008 with test result 1 as 343, test result 2 as 565 and test result 3 as
678. Is this even possible?
I have tried flagging duplicates by patient ID and date, but it doesn't merge the rows together.

Thanks,Danielle

Melissa Ives

Re: How to reconcile two files in SPSS 19?

In reply to this post by David Marso

I agree that UPDATE seems best.

However, the statement "Some of the duplicate records overlap, but contribute different information for the same test." raises concerns for me.
One key caveat with UPDATE is that if there should be different responses for the same test result in file 1 and file 2, only the value in the file listed last (i.e. the transaction file, file2 in the example below) will be kept.

I believe what Danielle means by "plus a single row for each duplicate record" is that rows with a record in both file 1 and file 2 (duplicate records) would result in only one record (single row) in the combined file.

Below is from the syntax help for UPDATE. I added the text in square brackets [].

• Cases [in the first listed file (master file)] are updated [with data from the second listed file (transaction file)] when they are matched on the BY variable(s). If the master and transaction files contain common variables for matched cases, the values for those variables are taken from the transaction file, provided that the values are not missing or blanks. Missing or blank values in the transaction files are not used to update values in the master file.

• When UPDATE encounters duplicate keys within a transaction file, it applies each transaction sequentially to that case to produce one case per key value in the resulting file. If more than one transaction file is specified, the value for a variable comes from the last transaction file with a nonmissing value for that variable. [If there are two records for the same ID and date in the transaction file, the first record is used to update the master file, then the second record (overwriting the first if there are values in the same variable.]

• Variables that are in the transaction files but not in the master file are ADDED [caps added] to the master file. Cases that do not contain those variables are assigned the system-missing value (for numerics) or blanks (for strings).

• Cases that are in the transaction files but not in the master file are ADDED [caps added] to the master file and are interleaved according to their values for the key variables.

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of David Marso
Sent: Sunday, January 29, 2012 11:11 AM
To: [hidden email]
Subject: Re: [SPSSX-L] How to reconcile two files in SPSS 19?

"I want one new file which represents both the unique cases that each file contributes plus a single row for each duplicate record. For the example above, I would have one row: patient #1234, test date 1/1/2008 with test result 1 as 343, test result 2 as 565 and test result 3 as 678. Is this even possible?"
See below.

I am confused by the following "plus a single row for each duplicate record.
"
WHY would you wish to retain these? Simply adds much cause for later confusion!

data list / id date test1 to test3 (f4,x,adate,3(x,f3)) .
begin data
1234 01/01/2008 343
4567 02/01/2009 123 456
end data.
SORT CASES BY id date.
save outfile 'file1.sav'.
list.
data list / id date test1 to test3 (f4,x,adate,3(x,f3)) .
begin data
1234 01/01/2008 565 678
5678 03/04/2001 345 345 222
end data.
SORT CASES BY id date.
save outfile 'file2.sav'.
list.
UPDATE FILE 'file1.sav' / FILE 'file2.sav' /BY ID date.
list.

ID DATE TEST1 TEST2 TEST3

1234 01/01/2008 343 565 678
4567 02/01/2009 123 . 456
5678 03/04/2001 345 345 222

Number of cases read: 3 Number of cases listed: 3

Danielle Boyce wrote

>
> Hello friends,
>
>
> I have two files:
>
> File 1 has patient ID, test date, test result 1, test result 2, test
> result 3, etc. (IDs and all test results are numeric) for some 2,000
> records. Not all rows are unique patients, as each patient can have
> more than one test date.
>
> File 2 has identical variable names to File 1 for about 900 records.
>
>
> Both files contribute some unique records and some duplicate records.
> File 1 contains data from 2003-present; File 2 has data from
> 2007-present. The overlapping records are from 2007-present.
>
> Some of the duplicate records overlap, but contribute different
> information for the same test. In other words, File 1 has patient
> #1234, test date 1/1/2008 with test result 1 as 343 and test results 2
> and 3 are blank (because data collectors didn't realize that they were
> supposed to enter test results 2 and 3). File 2 has patient #1234,
> test date 1/1/2008 with test result 1 as blank, test result 2 as 565
> and test result 3 as 678.
>
> There is no predictability as to which fields are blank or completed
> for these duplicate cases. Sometimes, both files contain all of the
> completed fields for the records that are duplicated, and all I would
> have to do is delete one of the duplicate rows.
>
> I want one new file which represents both the unique cases that each
> file contributes plus a single row for each duplicate record. For
> the example above, I would have one row: patient #1234, test date
> 1/1/2008 with test result 1 as 343, test result 2 as 565 and test
> result 3 as 678. Is this even possible?
>
> I have tried flagging duplicates by patient ID and date, but it
> doesn't merge the rows together.
>
>
> Thanks,
> Danielle
>

--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/How-to-reconcile-two-files-in-SPSS-19-tp5438950p5439667.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

PRIVILEGED AND CONFIDENTIAL INFORMATION
This transmittal and any attachments may contain PRIVILEGED AND
CONFIDENTIAL information and is intended only for the use of the
addressee. If you are not the designated recipient, or an employee
or agent authorized to deliver such transmittals to the designated
recipient, you are hereby notified that any dissemination,
copying or publication of this transmittal is strictly prohibited. If
you have received this transmittal in error, please notify us
immediately by replying to the sender and delete this copy from your
system. You may also call us at (309) 827-6026 for assistance.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD