Hello,
I was given two datasets that I will be merging by ID. In the second dataset, there are duplicates for many (but not all) of the IDs. However, unlike other duplicate problems that I see arising, I do not want to delete this data. In the second dataset, each duplicate line represents an additional administration of a test. Some patients were only administered the test once while others were administered the test numerous times (up to 11 administrations). In the merged file, I would like to create one line for each subject but retain all of the additional test administration data. The second file is sorted by date so that each administration is in chronological time. For ease of explanation in the sample datasets below, I will use 1 as responses for first administrations, 2 for second administrations, and 3 for third administrations, as well as using easy dates. There are over 15,000 lines of data in Dataset 2 so doing this manually is clearly out of the question. Dataset1: ID VarX 1 100 2 100 3 100 Dataset 2: ID Date Var1 Var2 1 1/1/01 1 1 1 2/2/02 2 2 2 1/1/01 1 1 3 1/1/01 1 1 3 2/2/02 2 2 3 3/3/03 3 3 Merged dataset that I would like: ID VarX Date Var1 Var2 Date_2 Var1_2 Var2_2 Date_3 Var1_3 Var2_3 1 100 1/1/01 1 1 2/2/02 2 2 2 100 1/1/01 1 1 3 100 1/1/01 1 1 2/2/02 2 2 3/3/03 3 3 Obviously, I would need to rename Date, Var1 and Var2 during each duplicate process to represent the number of times this ID has been duplicated (i.e., Var1 is renamed Var1_2 for the first duplicate and renamed Var1_3 for the second duplicate). I have successfully been able to use the lag and do if functions in the second dataset to create new variables and give them the correct data (but only on the existing duplicate line). However, Im at a loss at how to move these duplicate variables onto the same line as the original ID. Is this even possible? Any suggestions or even pointing me to the proper SPSS function would be greatly appreciated. Thanks so much in advance. David Elkins ************************************************ David Elkins, MS Department of Psychiatry and Behavioral Sciences Eastern Virginia Medical School Norfolk, VA 23507 (757) 446-8985 [hidden email] |
Looks to me like you should use casetovars to convert dataset2 into a
'wide' or horizontal file with one row per client--it will have missing data (as you indicate below) for any administrations that were not conducted. Do you have a variable that indicates which administration it is? If so use this as your index variable in the casetovars syntax. If not you could create it with a lag function to establish a counter of the records by client. CASESTOVARS /ID = identifiervarname /INDEX = adminstrationindicatorvarname /separator="_" /GROUPBY = INDEX . Then do a simple match files to add the data from dataset 1. Melissa -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of David Elkins Sent: Wednesday, October 11, 2006 3:09 PM To: [hidden email] Subject: [SPSSX-L] Duplicate data - aggregate not delete Hello, I was given two datasets that I will be merging by ID. In the second dataset, there are duplicates for many (but not all) of the IDs. However, unlike other duplicate problems that I see arising, I do not want to delete this data. In the second dataset, each duplicate line represents an additional administration of a test. Some patients were only administered the test once while others were administered the test numerous times (up to 11 administrations). In the merged file, I would like to create one line for each subject but retain all of the additional test administration data. The second file is sorted by date so that each administration is in chronological time. For ease of explanation in the sample datasets below, I will use "1" as responses for first administrations, "2" for second administrations, and "3" for third administrations, as well as using easy dates. There are over 15,000 lines of data in Dataset 2 so doing this manually is clearly out of the question. Dataset1: ID VarX 1 100 2 100 3 100 Dataset 2: ID Date Var1 Var2 1 1/1/01 1 1 1 2/2/02 2 2 2 1/1/01 1 1 3 1/1/01 1 1 3 2/2/02 2 2 3 3/3/03 3 3 Merged dataset that I would like: ID VarX Date Var1 Var2 Date_2 Var1_2 Var2_2 Date_3 Var1_3 Var2_3 1 100 1/1/01 1 1 2/2/02 2 2 2 100 1/1/01 1 1 3 100 1/1/01 1 1 2/2/02 2 2 3/3/03 3 3 Obviously, I would need to rename Date, Var1 and Var2 during each duplicate process to represent the number of times this ID has been duplicated (i.e., Var1 is renamed Var1_2 for the first duplicate and renamed Var1_3 for the second duplicate). I have successfully been able to use the 'lag' and 'do if' functions in the second dataset to create new variables and give them the correct data (but only on the existing duplicate line). However, I'm at a loss at how to move these "duplicate" variables onto the same line as the original ID. Is this even possible? Any suggestions or even pointing me to the proper SPSS function would be greatly appreciated. Thanks so much in advance. David Elkins ************************************************ David Elkins, MS Department of Psychiatry and Behavioral Sciences Eastern Virginia Medical School Norfolk, VA 23507 (757) 446-8985 [hidden email] PRIVILEGED AND CONFIDENTIAL INFORMATION This transmittal and any attachments may contain PRIVILEGED AND CONFIDENTIAL information and is intended only for the use of the addressee. If you are not the designated recipient, or an employee or agent authorized to deliver such transmittals to the designated recipient, you are hereby notified that any dissemination, copying or publication of this transmittal is strictly prohibited. If you have received this transmittal in error, please notify us immediately by replying to the sender and delete this copy from your system. You may also call us at (309) 827-6026 for assistance. |
In reply to this post by David Elkins
At 04:09 PM 10/11/2006, David Elkins wrote:
>I was given two datasets that I will be merging by ID. In the second >dataset, there are duplicates for many (but not all) of the IDs. >However, >unlike other duplicate problems that I see arising, I do not want to >delete this data. You have a very common case. I would not call these 'duplicates'; they are (as you've said) separate records, with separate information, though with the same ID value because they have information for the same person. >In the second dataset, each duplicate line represents an additional >administration of a test. Some patients were administered the test >numerous times (up to 11 administrations). In the merged file, I >would like to create one line for each subject but retain all of the >additional test administration data. You have two steps here. The first is combining Dataset 1 with Dataset 2, to attach your Varx (one instance per patient) to all test-administration records for that patient. Then, >In the merged file, I would like to create one line for each subject >but retain all of the additional test administration data. That's the second step. It's called going from 'long' organization (multiple records per patient, one per test administration) to 'wide' (one record per patient, one set of variables per test administration). It's not very difficult. But first, to repeat myself: think at least twice, before you restructure 'long' to 'wide'. Go over what you plan to do with your data, and how best to do it. You may want to post, asking about that. Many, many operations are easier with data in 'long' form, particularly when there's an indefinite number of test administrations per person. All that said, it's not very difficult. The following is SPSS draft output. To illustrate "not very difficult", the data manipulation commands are clicked up from the menus, with the one hand addition noted. LIST. List |-----------------------------|---------------------------| |Output Created |11-OCT-2006 18:07:17 | |-----------------------------|---------------------------| [Dataset2] ID Date Var1 Var2 1 01/01/01 1 1 1 02/02/02 2 2 2 01/01/01 1 1 3 01/01/01 1 1 3 02/02/02 2 2 3 03/03/03 3 3 Number of cases read: 6 Number of cases listed: 6 * Before SPSS 14, make the following a scratch file and use . * "GET FILE" instead of "DATASET ACTIVATE". . DATASET ACTIVATE Dataset1. LIST. List |-----------------------------|---------------------------| |Output Created |11-OCT-2006 18:07:17 | |-----------------------------|---------------------------| [Dataset1] ID VarX 1 100 2 100 3 100 Number of cases read: 3 Number of cases listed: 3 * "Merged dataset that I would like:" . * . * ID VarX Date Var1 Var2 Date_2 Var1_2 Var2_2 Date_3 Var1_3 Var2_3. * 1 100 1/1/01 1 1 2/2/02 2 2 . * 2 100 1/1/01 1 1 . * 3 100 1/1/01 1 1 2/2/02 2 2 3/3/03 3 3 . * WRR: "The first [step] is combining Dataset 1 with Dataset 2" . * Menu: Data>Merge files>Add Variables . MATCH FILES /TABLE=* /FILE='Dataset2' /BY ID. * WRR: "[The second step is] going from 'long' organization . * to 'wide'" . * Menu: Data>Restucture>Restructure selected cases into variables: . * (Subcommand 'SEPARATOR' added in syntax editor.) . CASESTOVARS /ID = ID VarX /GROUPBY = INDEX /SEPARATOR='_'. Cases to Variables |----------------------------|---------------------------| |Output Created |11-OCT-2006 18:15:51 | |----------------------------|---------------------------| [Dataset1] Generated Variables |---------|------| |Original |Result| |Variabl |------| |e |Name | |-------|-|------| |Date |1|Date_1| | |2|Date_2| | |3|Date_3| |-------|-|------| |Var1 |1|Var1_1| | |2|Var1_2| | |3|Var1_3| |-------|-|------| |Var2 |1|Var2_1| | |2|Var2_2| | |3|Var2_3| |-------|-|------| Processing Statistics |---------------|---| |Cases In |6 | |Cases Out |3 | |Cases In/Cases |2.0| |Out | | |---------------|---| |Variables In |5 | |Variables Out |11 | |Index Values |3 | |---------------|---| LIST. List |-----------------------------|---------------------------| |Output Created |11-OCT-2006 18:16:37 | |-----------------------------|---------------------------| [Dataset1] ID VarX Date_1 Var1_1 Var2_1 Date_2 Var1_2 Var2_2 Date_3 Var1_3 Var2_3 1 100 01/01/01 1 1 02/02/02 2 2 . . . 2 100 01/01/01 1 1 . . . . . . 3 100 01/01/01 1 1 02/02/02 2 2 03/03/03 3 3 Number of cases read: 3 Number of cases listed: 3 |
In reply to this post by Melissa Ives
Postscript:
At 04:18 PM 10/11/2006, Melissa Ives wrote: >Looks to me like you should use casetovars to convert dataset2 into a >'wide' or horizontal file with one row per client [...] Then do a >simple match files to add the data from dataset 1. The order Melissa suggests, CASESTOVARS before MATCH FILES, is easier than what I did, the opposite order. (I kept the latter order to emphasize I'm doubtful about the CASESTOVARS.) In the Restructure Wizard, I got VarX to the front of the output by a trick: selecting VarX as an 'ID' variable, even though it isn't meant as such. |
In reply to this post by David Elkins
I just wanted to personally thank Melissa Ives, _wrristow@mindspring.com_
(mailto:[hidden email]) , and [hidden email].us_ (mailto:[hidden email]) for such quick and helpful responses on this subject. I've tried the CASESTOVARS function and that worked perfectly in this case and the vector/aggregate functions also gave me ideas on how to improve other programming code. Thank you all so much for the kind help. It's truly appreciated. David Elkins P.S. I noticed that when using the casestovars function, two string variables in the dataset had many missing cases and did not produce secondary/tertiary/etc. administrations like the others (while other variables, including strings, had some missing variables and created secondary administrations). However, when converting sysmis to "No", these variables worked perfectly as well. I'll look further into this to find out why but you've already saved me a tremendous amount of time! Thank you again. In a message dated 10/11/2006 4:21:37 P.M. Eastern Standard Time, [hidden email] writes: Looks to me like you should use casetovars to convert dataset2 into a 'wide' or horizontal file with one row per client--it will have missing data (as you indicate below) for any administrations that were not conducted. Do you have a variable that indicates which administration it is? If so use this as your index variable in the casetovars syntax. If not you could create it with a lag function to establish a counter of the records by client. CASESTOVARS /ID = identifiervarname /INDEX = adminstrationindicatorvarname /separator="_" /GROUPBY = INDEX . Then do a simple match files to add the data from dataset 1. Melissa -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of David Elkins Sent: Wednesday, October 11, 2006 3:09 PM To: [hidden email] Subject: [SPSSX-L] Duplicate data - aggregate not delete Hello, I was given two datasets that I will be merging by ID. In the second dataset, there are duplicates for many (but not all) of the IDs. However, unlike other duplicate problems that I see arising, I do not want to delete this data. In the second dataset, each duplicate line represents an additional administration of a test. Some patients were only administered the test once while others were administered the test numerous times (up to 11 administrations). In the merged file, I would like to create one line for each subject but retain all of the additional test administration data. The second file is sorted by date so that each administration is in chronological time. For ease of explanation in the sample datasets below, I will use "1" as responses for first administrations, "2" for second administrations, and "3" for third administrations, as well as using easy dates. There are over 15,000 lines of data in Dataset 2 so doing this manually is clearly out of the question. Dataset1: ID VarX 1 100 2 100 3 100 Dataset 2: ID Date Var1 Var2 1 1/1/01 1 1 1 2/2/02 2 2 2 1/1/01 1 1 3 1/1/01 1 1 3 2/2/02 2 2 3 3/3/03 3 3 Merged dataset that I would like: ID VarX Date Var1 Var2 Date_2 Var1_2 Var2_2 Date_3 Var1_3 Var2_3 1 100 1/1/01 1 1 2/2/02 2 2 2 100 1/1/01 1 1 3 100 1/1/01 1 1 2/2/02 2 2 3/3/03 3 3 Obviously, I would need to rename Date, Var1 and Var2 during each duplicate process to represent the number of times this ID has been duplicated (i.e., Var1 is renamed Var1_2 for the first duplicate and renamed Var1_3 for the second duplicate). I have successfully been able to use the 'lag' and 'do if' functions in the second dataset to create new variables and give them the correct data (but only on the existing duplicate line). However, I'm at a loss at how to move these "duplicate" variables onto the same line as the original ID. Is this even possible? Any suggestions or even pointing me to the proper SPSS function would be greatly appreciated. Thanks so much in advance. David Elkins ************************************************ David Elkins, MS Department of Psychiatry and Behavioral Sciences Eastern Virginia Medical School Norfolk, VA 23507 (757) 446-8985 [hidden email] PRIVILEGED AND CONFIDENTIAL INFORMATION This transmittal and any attachments may contain PRIVILEGED AND CONFIDENTIAL information and is intended only for the use of the addressee. If you are not the designated recipient, or an employee or agent authorized to deliver such transmittals to the designated recipient, you are hereby notified that any dissemination, copying or publication of this transmittal is strictly prohibited. If you have received this transmittal in error, please notify us immediately by replying to the sender and delete this copy from your system. You may also call us at (309) 827-6026 for assistance. |
Free forum by Nabble | Edit this page |