Hello friends, I have two files: File 1 has patient ID, test date, test result 1, test result 2, test result 3, etc. (IDs and all test results are numeric) for some 2,000 records. Not all rows are unique patients, as each patient can have more than one test date. File 2 has identical variable names to File 1 for about 900 records. Both files contribute some unique records and some duplicate
records. File 1 contains data from 2003-present; File 2 has data from 2007-present. The overlapping records are from 2007-present. Some of the duplicate records overlap, but contribute different information for the same test. In other words, File 1 has patient #1234, test date 1/1/2008 with test result 1 as 343 and test results 2 and 3 are blank (because data collectors didn't realize that they were supposed to enter test results 2 and 3). File 2 has patient #1234, test date 1/1/2008 with test result 1 as blank, test result 2 as 565 and test result 3 as 678. There is no predictability as to which fields are blank or completed for these duplicate cases. Sometimes, both files contain all of the completed fields for the records that are duplicated, and all I would have to do is delete one of the duplicate rows. I want one new file which represents both the unique cases that each file contributes plus a single row for each duplicate record. For the example above, I would have one row: patient #1234, test date 1/1/2008 with test result 1 as 343, test result 2 as 565 and test result 3 as 678. Is this even possible? I have tried flagging duplicates by patient ID and date, but it doesn't merge the rows together. Thanks, Danielle |
Danielle Are your variable names the same in each file? If so, you might be able to use MATCH FILES. Have a look at the syntax reference guide pp 1141 – 1149 Help > Command Syntax Reference You need to make sure both files are sorted in ID order. With both datasets open: File > New > Syntax: SORT CASES BY ID . [Run this on each data set] MATCH FILES FILE <dataset1> /FILE <dataset2> / BY ID . This will pick up variables from the first file specified and ignore them in the second, but I’m not sure what it will do with duplicate cases. It will ignore them, but it looks as if you want to keep part of the duplicate records. Here’s an extract from the manual: BY Subcommand BY specifies one or more identification, or key, variables that determine which cases are to be combined. When BY is specified, cases from one file are matched only with cases from other files that have the same values for the key variables. BY is required unless all input files are to be matched sequentially according to the order of cases. BY must follow the FILE and TABLE subcommands and any associated RENAME and IN subcommands. BY specifies the names of one or more key variables. The key variables must exist in all input files. The key variables can be numeric or long or short strings. All input files must be sorted in ascending order of the key variables. If necessary, use SORT CASES before MATCH FILES. Missing values for key variables are handled like any other values. Unmatched cases are assigned system-missing values (for numeric variables) or blanks (for string variables) for variables from files that do not contain a match. Duplicate Cases Duplicate cases are those with the same values for the key variables named on the BY subcommand. Duplicate cases are permitted in any input files except table files. When there is no table file, the first duplicate case in each file is matched with the first matching case (if any) from the other files; the second duplicate case is matched with a second matching duplicate, if any; and so on. (In effect, a parallel match is performed within groups of duplicate cases.) Unmatched cases are assigned system-missing values (for numeric variables) or blanks (for string variables) for variables from files that do not contain a match. The program displays a warning if it encounters duplicate keys in one or more of the files being matched.. Hope this helps John Hall Email: [hidden email] Website: www.surveyresearch.weebly.com Skype: surveyresearcher1 Phone: (+33) (0) 2.33.45.91.47 From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Danielle Boyce Hello friends, I have two files: File 1 has patient ID, test date, test result 1, test result 2, test result 3, etc. (IDs and all test results are numeric) for some 2,000 records. Not all rows are unique patients, as each patient can have more than one test date. File 2 has identical variable names to File 1 for about 900 records. Both files contribute some unique records and some duplicate records. File 1 contains data from 2003-present; File 2 has data from 2007-present. The overlapping records are from 2007-present.
Thanks, Danielle |
In reply to this post by Danielle Boyce
It looks to me like you want to
ADD FILES BY ID Date and then use AGGREGATE, taking the function FIRST( ) for each variable. I'm pretty sure that this ignores the Missings. If there are no duplicated fields with different actual values, LAST( ) would work similarly, with the obvious variation in choice, when there are real values in both records. Or, I think you could use UPDATE with the same effect as using LAST( ) when using aggregation... if UPDATE does allow for entirely new records, which I think it does. -- Rich Ulrich Date: Sat, 28 Jan 2012 19:26:21 -0800 From: [hidden email] Subject: How to reconcile two files in SPSS 19? To: [hidden email] Hello friends, I have two files: File 1 has patient ID, test date, test result 1, test result 2, test result 3, etc. (IDs and all test results are numeric) for some 2,000 records. Not all rows are unique patients, as each patient can have more than one test date. File 2 has identical variable names to File 1 for about 900 records. Both files contribute some unique records and some duplicate
records. File 1 contains data from 2003-present; File 2 has data from 2007-present. The overlapping records are from 2007-present. Some of the duplicate records overlap, but contribute different information for the same test. In other words, File 1 has patient #1234, test date 1/1/2008 with test result 1 as 343 and test results 2 and 3 are blank (because data collectors didn't realize that they were supposed to enter test results 2 and 3). File 2 has patient #1234, test date 1/1/2008 with test result 1 as blank, test result 2 as 565 and test result 3 as 678. There is no predictability as to which fields are blank or completed for these duplicate cases. Sometimes, both files contain all of the completed fields for the records that are duplicated, and all I would have to do is delete one of the duplicate rows. I want one new file which represents both the unique cases that each file contributes plus a single row for each duplicate record. For the example above, I would have one row: patient #1234, test date 1/1/2008 with test result 1 as 343, test result 2 as 565 and test result 3 as 678. Is this even possible? I have tried flagging duplicates by patient ID and date, but it doesn't merge the rows together. Thanks, Danielle |
In reply to this post by John F Hall
CONTENTS DELETED
The author has deleted this message.
|
In reply to this post by Rich Ulrich
Thanks so much to everyone! I will try these approaches today!
Danielle From: Rich Ulrich <[hidden email]>
Sender: "SPSSX(r) Discussion" <[hidden email]>
Date: Sun, 29 Jan 2012 02:44:34 -0500 To: <[hidden email]> ReplyTo: Rich Ulrich <[hidden email]>
Subject: Re: How to reconcile two files in SPSS 19?
It looks to me like you want to
ADD FILES BY ID Date and then use AGGREGATE, taking the function FIRST( ) for each variable. I'm pretty sure that this ignores the Missings. If there are no duplicated fields with different actual values, LAST( ) would work similarly, with the obvious variation in choice, when there are real values in both records. Or, I think you could use UPDATE with the same effect as using LAST( ) when using aggregation... if UPDATE does allow for entirely new records, which I think it does. -- Rich Ulrich Date: Sat, 28 Jan 2012 19:26:21 -0800 From: [hidden email] Subject: How to reconcile two files in SPSS 19? To: [hidden email] Hello friends, I have two files: File 1 has patient ID, test date, test result 1, test result 2, test result 3, etc. (IDs and all test results are numeric) for some 2,000 records. Not all rows are unique patients, as each patient can have more than one test date. File 2 has identical variable names to File 1 for about 900 records. Both files contribute some unique records and some duplicate
records. File 1 contains data from 2003-present; File 2 has data from 2007-present. The overlapping records are from 2007-present. Some of the duplicate records overlap, but contribute different information for the same test. In other words, File 1 has patient #1234, test date 1/1/2008 with test result 1 as 343 and test results 2 and 3 are blank (because data collectors didn't realize that they were supposed to enter test results 2 and 3). File 2 has patient #1234, test date 1/1/2008 with test result 1 as blank, test result 2 as 565 and test result 3 as 678. There is no predictability as to which fields are blank or completed for these duplicate cases. Sometimes, both files contain all of the completed fields for the records that are duplicated, and all I would have to do is delete one of the duplicate rows. I want one new file which represents both the unique cases that each file contributes plus a single row for each duplicate record. For the example above, I would have one row: patient #1234, test date 1/1/2008 with test result 1 as 343, test result 2 as 565 and test result 3 as 678. Is this even possible? I have tried flagging duplicates by patient ID and date, but it doesn't merge the rows together. Thanks, Danielle |
Administrator
|
In reply to this post by Danielle Boyce
"I want one new file which represents both the unique cases that each file contributes plus a single row for each duplicate record. For the example above, I would have one row: patient #1234, test date 1/1/2008 with test result 1 as 343, test result 2 as 565 and test result 3 as
678. Is this even possible?" See below. I am confused by the following "plus a single row for each duplicate record. " WHY would you wish to retain these? Simply adds much cause for later confusion! data list / id date test1 to test3 (f4,x,adate,3(x,f3)) . begin data 1234 01/01/2008 343 4567 02/01/2009 123 456 end data. SORT CASES BY id date. save outfile 'file1.sav'. list. data list / id date test1 to test3 (f4,x,adate,3(x,f3)) . begin data 1234 01/01/2008 565 678 5678 03/04/2001 345 345 222 end data. SORT CASES BY id date. save outfile 'file2.sav'. list. UPDATE FILE 'file1.sav' / FILE 'file2.sav' /BY ID date. list. ID DATE TEST1 TEST2 TEST3 1234 01/01/2008 343 565 678 4567 02/01/2009 123 . 456 5678 03/04/2001 345 345 222 Number of cases read: 3 Number of cases listed: 3
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
Administrator
|
In reply to this post by Danielle Boyce
This advice is simply wrong! The correct approach is to use UPDATE!!!
----------------------------------------------------------- On Sun, 29 Jan 2012 16:33:41 +0530, deepanshu bhalla <[hidden email]> wrote: >Hi Danielle, I agree with John F Hall approach to handle your requirement .I want to add one more point in case of duplicates that delete all the variables that you want to be updated before matching and sorting . For example : if you want records to be updated from file 1 to file 2 so delete all the variables except ID in file 2. To find duplicates use Aggregate function with N_Break and then apply select if command to remove duplicates. Thanks and Regards Deepanshu Bhalla ________________________________ From: John F Hall <[hidden email]> To: [hidden email] Sent: Sunday, 29 January 2012 12:33 PM Subject: Re: How to reconcile two files in SPSS 19? Danielle  Are your variable names the same in each file? If so, you might be able to use MATCH FILES. Have a look at the syntax reference guide pp 1141 â 1149  Help > Command Syntax Reference  You need to make sure both files are sorted in ID order. With both datasets open:  File > New > Syntax:  SORT CASES BY ID .         [Run this on each data set]  MATCH FILES FILE <dataset1> /FILE <dataset2> / BY ID .  This will pick up variables from the first file specified and ignore them in the second, but Iâm not sure what it will do with duplicate cases. It will ignore them, but it looks as if you want to keep part of the duplicate records.  Hereâs an extract from the manual:  BY Subcommand BY specifies one or more identification, or key, variables that determine which cases are to be combined. When BY is specified, cases from one file are matched only with cases from other files that have the same values for the key variables. BY is required unless all input files are to be matched sequentially according to the order of cases. ô BY must follow the FILE and TABLE subcommands and any associated RENAME and IN subcommands. ô BY specifies the names of one or more key variables. The key variables must exist in all input files. The key variables can be numeric or long or short strings. ô All input files must be sorted in ascending order of the key variables. If necessary, use SORT CASES before MATCH FILES. ô Missing values for key variables are handled like any other values. ô Unmatched cases are assigned system-missing values (for numeric variables) or blanks (for string variables) for variables from files that do not contain a match. Duplicate Cases Duplicate cases are those with the same values for the key variables named on the BY subcommand. ô Duplicate cases are permitted in any input files except table files. ô When there is no table file, the first duplicate case in each file is matched with the first matching case (if any) from the other files; the second duplicate case is matched with a second matching duplicate, if any; and so on. (In effect, a parallel match is performed within groups of duplicate cases.) Unmatched cases are assigned system-missing values (for numeric variables) or blanks (for string variables) for variables from files that do not contain a match. ô The program displays a warning if it encounters duplicate keys in one or more of the files being matched..  Hope this helps  John Hall  Email:    [hidden email] Website: www.surveyresearch.weebly.com Skype:   surveyresearcher1 Phone:   (+33) (0) 2.33.45.91.47    From:SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Danielle Boyce Sent: 29 January 2012 04:26 To: [hidden email] Subject: How to reconcile two files in SPSS 19?  Hello friends,  I have two files:  File 1 has patient ID, test date, test result 1, test result 2, test result 3, etc. (IDs and all test results are numeric) for some 2,000 records. Not all rows are unique patients, as each patient can have more than one test date.  File 2 has identical variable names to File 1 for about 900 records.  Both files contribute some unique records and some duplicate records. File 1 contains data from 2003-present; File 2 has data from 2007-present. The overlapping records are from 2007-present. Some of the duplicate records overlap, but contribute different information for the same test. In other words, File 1 has patient #1234, test date 1/1/2008 with test result 1 as 343 and test results 2 and 3 are blank (because data collectors didn't realize that they were supposed to enter test results 2 and 3). File 2 has patient #1234, test date 1/1/2008 with test result 1 as blank, test result 2 as 565 and test result 3 as 678. There is no predictability as to which fields are blank or completed for these duplicate cases. Sometimes, both files contain all of the completed fields for the records that are duplicated, and all I would have to do is delete one of the duplicate rows. I want one new file which represents both the unique cases that each file contributes plus a single row for each duplicate record. For the example above, I would have one row: patient #1234, test date 1/1/2008 with test result 1 as 343, test result 2 as 565 and test result 3 as 678. Is this even possible? I have tried flagging duplicates by patient ID and date, but it doesn't merge the rows together.  Thanks, Danielle ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
Administrator
|
In reply to this post by Rich Ulrich
"Or, I think you could use UPDATE with the same effect as using
LAST( ) when using aggregation... if UPDATE does allow for entirely new records, which I think it does." It *DOES* indeed! ---
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
In reply to this post by David Marso
I agree that UPDATE seems best.
However, the statement "Some of the duplicate records overlap, but contribute different information for the same test." raises concerns for me. One key caveat with UPDATE is that if there should be different responses for the same test result in file 1 and file 2, only the value in the file listed last (i.e. the transaction file, file2 in the example below) will be kept. I believe what Danielle means by "plus a single row for each duplicate record" is that rows with a record in both file 1 and file 2 (duplicate records) would result in only one record (single row) in the combined file. Below is from the syntax help for UPDATE. I added the text in square brackets []. • Cases [in the first listed file (master file)] are updated [with data from the second listed file (transaction file)] when they are matched on the BY variable(s). If the master and transaction files contain common variables for matched cases, the values for those variables are taken from the transaction file, provided that the values are not missing or blanks. Missing or blank values in the transaction files are not used to update values in the master file. • When UPDATE encounters duplicate keys within a transaction file, it applies each transaction sequentially to that case to produce one case per key value in the resulting file. If more than one transaction file is specified, the value for a variable comes from the last transaction file with a nonmissing value for that variable. [If there are two records for the same ID and date in the transaction file, the first record is used to update the master file, then the second record (overwriting the first if there are values in the same variable.] • Variables that are in the transaction files but not in the master file are ADDED [caps added] to the master file. Cases that do not contain those variables are assigned the system-missing value (for numerics) or blanks (for strings). • Cases that are in the transaction files but not in the master file are ADDED [caps added] to the master file and are interleaved according to their values for the key variables. -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of David Marso Sent: Sunday, January 29, 2012 11:11 AM To: [hidden email] Subject: Re: [SPSSX-L] How to reconcile two files in SPSS 19? "I want one new file which represents both the unique cases that each file contributes plus a single row for each duplicate record. For the example above, I would have one row: patient #1234, test date 1/1/2008 with test result 1 as 343, test result 2 as 565 and test result 3 as 678. Is this even possible?" See below. I am confused by the following "plus a single row for each duplicate record. " WHY would you wish to retain these? Simply adds much cause for later confusion! data list / id date test1 to test3 (f4,x,adate,3(x,f3)) . begin data 1234 01/01/2008 343 4567 02/01/2009 123 456 end data. SORT CASES BY id date. save outfile 'file1.sav'. list. data list / id date test1 to test3 (f4,x,adate,3(x,f3)) . begin data 1234 01/01/2008 565 678 5678 03/04/2001 345 345 222 end data. SORT CASES BY id date. save outfile 'file2.sav'. list. UPDATE FILE 'file1.sav' / FILE 'file2.sav' /BY ID date. list. ID DATE TEST1 TEST2 TEST3 1234 01/01/2008 343 565 678 4567 02/01/2009 123 . 456 5678 03/04/2001 345 345 222 Number of cases read: 3 Number of cases listed: 3 Danielle Boyce wrote > > Hello friends, > > > I have two files: > > File 1 has patient ID, test date, test result 1, test result 2, test > result 3, etc. (IDs and all test results are numeric) for some 2,000 > records. Not all rows are unique patients, as each patient can have > more than one test date. > > File 2 has identical variable names to File 1 for about 900 records. > > > Both files contribute some unique records and some duplicate records. > File 1 contains data from 2003-present; File 2 has data from > 2007-present. The overlapping records are from 2007-present. > > Some of the duplicate records overlap, but contribute different > information for the same test. In other words, File 1 has patient > #1234, test date 1/1/2008 with test result 1 as 343 and test results 2 > and 3 are blank (because data collectors didn't realize that they were > supposed to enter test results 2 and 3). File 2 has patient #1234, > test date 1/1/2008 with test result 1 as blank, test result 2 as 565 > and test result 3 as 678. > > There is no predictability as to which fields are blank or completed > for these duplicate cases. Sometimes, both files contain all of the > completed fields for the records that are duplicated, and all I would > have to do is delete one of the duplicate rows. > > I want one new file which represents both the unique cases that each > file contributes plus a single row for each duplicate record. For > the example above, I would have one row: patient #1234, test date > 1/1/2008 with test result 1 as 343, test result 2 as 565 and test > result 3 as 678. Is this even possible? > > I have tried flagging duplicates by patient ID and date, but it > doesn't merge the rows together. > > > Thanks, > Danielle > -- View this message in context: http://spssx-discussion.1045642.n5.nabble.com/How-to-reconcile-two-files-in-SPSS-19-tp5438950p5439667.html Sent from the SPSSX Discussion mailing list archive at Nabble.com. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD PRIVILEGED AND CONFIDENTIAL INFORMATION This transmittal and any attachments may contain PRIVILEGED AND CONFIDENTIAL information and is intended only for the use of the addressee. If you are not the designated recipient, or an employee or agent authorized to deliver such transmittals to the designated recipient, you are hereby notified that any dissemination, copying or publication of this transmittal is strictly prohibited. If you have received this transmittal in error, please notify us immediately by replying to the sender and delete this copy from your system. You may also call us at (309) 827-6026 for assistance. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Free forum by Nabble | Edit this page |