Duplicate data - aggregate not delete

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Duplicate data - aggregate not delete

David Elkins
Hello,

I was given two datasets that I will be merging by ID.  In the second
dataset, there are duplicates for many (but not all) of the IDs.  However,
unlike other duplicate problems that I see arising, I do not want to delete
this data.  In the second dataset, each duplicate line represents an
additional administration of a test.  Some patients were only administered
the test once while others were administered the test numerous times (up to
11 administrations).  In the merged file, I would like to create one line
for each subject but retain all of the additional test administration
data.  The second file is sorted by date so that each administration is in
chronological time.  For ease of explanation in the sample datasets below,
I will use “1” as responses for first administrations, “2” for second
administrations, and “3” for third administrations, as well as using easy
dates.  There are over 15,000 lines of data in Dataset 2 so doing this
manually is clearly out of the question.


Dataset1:
ID     VarX
1      100
2      100
3      100



Dataset 2:
ID     Date          Var1     Var2
1      1/1/01         1         1
1      2/2/02         2         2
2      1/1/01         1         1
3      1/1/01         1         1
3      2/2/02         2         2
3      3/3/03         3         3



Merged dataset that I would like:

ID   VarX    Date   Var1   Var2    Date_2   Var1_2    Var2_2   Date_3
Var1_3    Var2_3
1    100    1/1/01   1      1      2/2/02      2        2
2    100    1/1/01   1      1
3    100    1/1/01   1      1      2/2/02      2        2      3/3/03
3         3


Obviously, I would need to rename Date, Var1 and Var2 during each duplicate
process to represent the number of times this ID has been duplicated (i.e.,
Var1 is renamed Var1_2 for the first duplicate and renamed Var1_3 for the
second duplicate).  I have successfully been able to use the ‘lag’ and ‘do
if’ functions in the second dataset to create new variables and give them
the correct data (but only on the existing duplicate line).  However, I’m
at a loss at how to move these “duplicate” variables onto the same line as
the original ID.  Is this even possible?  Any suggestions or even pointing
me to the proper SPSS function would be greatly appreciated.  Thanks so
much in advance.

David Elkins



************************************************
David Elkins, MS
Department of Psychiatry and Behavioral Sciences
Eastern Virginia Medical School
Norfolk, VA  23507
(757) 446-8985
[hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate data - aggregate not delete

Melissa Ives
Looks to me like you should use casetovars to convert dataset2 into a
'wide' or horizontal file with one row per client--it will have missing
data (as you indicate below) for any administrations that were not
conducted.  Do you have a variable that indicates which administration
it is?  If so use this as your index variable in the casetovars syntax.
If not you could create it with a lag function to establish a counter of
the records by client.

CASESTOVARS
 /ID = identifiervarname
 /INDEX = adminstrationindicatorvarname
  /separator="_"
 /GROUPBY = INDEX .

Then do a simple match files to add the data from dataset 1.

Melissa

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
David Elkins
Sent: Wednesday, October 11, 2006 3:09 PM
To: [hidden email]
Subject: [SPSSX-L] Duplicate data - aggregate not delete

Hello,

I was given two datasets that I will be merging by ID.  In the second
dataset, there are duplicates for many (but not all) of the IDs.
However, unlike other duplicate problems that I see arising, I do not
want to delete this data.  In the second dataset, each duplicate line
represents an additional administration of a test.  Some patients were
only administered the test once while others were administered the test
numerous times (up to
11 administrations).  In the merged file, I would like to create one
line for each subject but retain all of the additional test
administration data.  The second file is sorted by date so that each
administration is in chronological time.  For ease of explanation in the
sample datasets below, I will use "1" as responses for first
administrations, "2" for second administrations, and "3" for third
administrations, as well as using easy dates.  There are over 15,000
lines of data in Dataset 2 so doing this manually is clearly out of the
question.


Dataset1:
ID     VarX
1      100
2      100
3      100



Dataset 2:
ID     Date          Var1     Var2
1      1/1/01         1         1
1      2/2/02         2         2
2      1/1/01         1         1
3      1/1/01         1         1
3      2/2/02         2         2
3      3/3/03         3         3



Merged dataset that I would like:

ID   VarX    Date   Var1   Var2    Date_2   Var1_2    Var2_2   Date_3
Var1_3    Var2_3
1    100    1/1/01   1      1      2/2/02      2        2
2    100    1/1/01   1      1
3    100    1/1/01   1      1      2/2/02      2        2      3/3/03
3         3


Obviously, I would need to rename Date, Var1 and Var2 during each
duplicate process to represent the number of times this ID has been
duplicated (i.e.,
Var1 is renamed Var1_2 for the first duplicate and renamed Var1_3 for
the second duplicate).  I have successfully been able to use the 'lag'
and 'do if' functions in the second dataset to create new variables and
give them the correct data (but only on the existing duplicate line).
However, I'm at a loss at how to move these "duplicate" variables onto
the same line as the original ID.  Is this even possible?  Any
suggestions or even pointing me to the proper SPSS function would be
greatly appreciated.  Thanks so much in advance.

David Elkins



************************************************
David Elkins, MS
Department of Psychiatry and Behavioral Sciences Eastern Virginia
Medical School Norfolk, VA  23507
(757) 446-8985
[hidden email]



PRIVILEGED AND CONFIDENTIAL INFORMATION
This transmittal and any attachments may contain PRIVILEGED AND
CONFIDENTIAL information and is intended only for the use of the
addressee. If you are not the designated recipient, or an employee
or agent authorized to deliver such transmittals to the designated
recipient, you are hereby notified that any dissemination,
copying or publication of this transmittal is strictly prohibited. If
you have received this transmittal in error, please notify us
immediately by replying to the sender and delete this copy from your
system. You may also call us at (309) 827-6026 for assistance.
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate data - aggregate not delete

Richard Ristow
In reply to this post by David Elkins
At 04:09 PM 10/11/2006, David Elkins wrote:

>I was given two datasets that I will be merging by ID.  In the second
>dataset, there are duplicates for many (but not all) of the IDs.
>However,
>unlike other duplicate problems that I see arising, I do not want to
>delete this data.

You have a very common case. I would not call these 'duplicates'; they
are (as you've said) separate records, with separate information,
though with the same ID value because they have information for the
same person.

>In the second dataset, each duplicate line represents an additional
>administration of a test.  Some patients were administered the test
>numerous times (up to 11 administrations).  In the merged file, I
>would like to create one line for each subject but retain all of the
>additional test administration data.

You have two steps here. The first is combining Dataset 1 with Dataset
2, to attach your Varx (one instance per patient) to all
test-administration records for that patient. Then,

>In the merged file, I would like to create one line for each subject
>but retain all of the additional test administration data.

That's the second step. It's called going from 'long' organization
(multiple records per patient, one per test administration) to 'wide'
(one record per patient, one set of variables per test administration).
It's not very difficult.

But first, to repeat myself: think at least twice, before you
restructure 'long' to 'wide'. Go over what you plan to do with your
data, and how best to do it. You may want to post, asking about that.
Many, many operations are easier with data in 'long' form, particularly
when there's an indefinite number of test administrations per person.

All that said, it's not very difficult. The following is SPSS draft
output. To illustrate "not very difficult", the data manipulation
commands are clicked up from the menus, with the one hand addition
noted.

LIST.

List
|-----------------------------|---------------------------|
|Output Created               |11-OCT-2006 18:07:17       |
|-----------------------------|---------------------------|
[Dataset2]

ID     Date Var1 Var2

  1 01/01/01   1    1
  1 02/02/02   2    2
  2 01/01/01   1    1
  3 01/01/01   1    1
  3 02/02/02   2    2
  3 03/03/03   3    3

Number of cases read:  6    Number of cases listed:  6


*  Before SPSS 14, make the following a scratch file and use .
*  "GET FILE" instead of "DATASET ACTIVATE".                 .
DATASET ACTIVATE Dataset1.
LIST.

List
|-----------------------------|---------------------------|
|Output Created               |11-OCT-2006 18:07:17       |
|-----------------------------|---------------------------|
[Dataset1]

ID VarX

  1  100
  2  100
  3  100

Number of cases read:  3    Number of cases listed:  3


*  "Merged dataset that I would like:"                               .
*                                                                    .
*  ID VarX Date   Var1 Var2 Date_2 Var1_2 Var2_2 Date_3 Var1_3 Var2_3.
*  1  100  1/1/01   1    1  2/2/02   2      2                        .
*  2  100  1/1/01   1    1                                           .
*  3  100  1/1/01   1    1  2/2/02   2      2    3/3/03   3      3   .


*  WRR: "The first [step] is combining Dataset 1 with Dataset 2"     .
*  Menu: Data>Merge files>Add Variables                              .

MATCH FILES /TABLE=*
  /FILE='Dataset2'
  /BY ID.

*  WRR: "[The second step is] going from 'long' organization         .
*       to 'wide'"                                                   .
*  Menu: Data>Restucture>Restructure selected cases into variables:  .
*  (Subcommand 'SEPARATOR' added in syntax editor.)                  .

CASESTOVARS
  /ID = ID VarX
  /GROUPBY = INDEX
  /SEPARATOR='_'.

Cases to Variables
|----------------------------|---------------------------|
|Output Created              |11-OCT-2006 18:15:51       |
|----------------------------|---------------------------|
[Dataset1]

Generated Variables
|---------|------|
|Original |Result|
|Variabl  |------|
|e        |Name  |
|-------|-|------|
|Date   |1|Date_1|
|       |2|Date_2|
|       |3|Date_3|
|-------|-|------|
|Var1   |1|Var1_1|
|       |2|Var1_2|
|       |3|Var1_3|
|-------|-|------|
|Var2   |1|Var2_1|
|       |2|Var2_2|
|       |3|Var2_3|
|-------|-|------|

Processing Statistics
|---------------|---|
|Cases In       |6  |
|Cases Out      |3  |
|Cases In/Cases |2.0|
|Out            |   |
|---------------|---|
|Variables In   |5  |
|Variables Out  |11 |
|Index Values   |3  |
|---------------|---|


LIST.

List
|-----------------------------|---------------------------|
|Output Created               |11-OCT-2006 18:16:37       |
|-----------------------------|---------------------------|
[Dataset1]

ID VarX   Date_1 Var1_1 Var2_1   Date_2 Var1_2 Var2_2   Date_3 Var1_3
Var2_3

  1  100
01/01/01    1      1   02/02/02    2      2          .    .      .
  2  100
01/01/01    1      1          .    .      .          .    .      .
  3  100
01/01/01    1      1   02/02/02    2      2   03/03/03    3      3

Number of cases read:  3    Number of cases listed:  3
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate data - aggregate not delete

Richard Ristow
In reply to this post by Melissa Ives
Postscript:

At 04:18 PM 10/11/2006, Melissa Ives wrote:

>Looks to me like you should use casetovars to convert dataset2 into a
>'wide' or horizontal file with one row per client [...] Then do a
>simple match files to add the data from dataset 1.

The order Melissa suggests, CASESTOVARS before MATCH FILES, is easier
than what I did, the opposite order. (I kept the latter order to
emphasize I'm doubtful about the CASESTOVARS.)

In the Restructure Wizard, I got VarX to the front of the output by a
trick: selecting VarX as an 'ID' variable, even though it isn't meant
as such.
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate data - aggregate not delete

David Elkins
In reply to this post by David Elkins
I just wanted to personally thank Melissa Ives, _wrristow@mindspring.com_
(mailto:[hidden email]) , and [hidden email].us_
(mailto:[hidden email])   for such quick and helpful responses on this
subject.  I've tried the  CASESTOVARS function and that worked perfectly in
this case and the  vector/aggregate functions also gave me ideas on how to
improve other  programming code.  Thank you all so much for the kind help.  It's
truly appreciated.

David Elkins


P.S.  I noticed that when using the casestovars function, two string
variables in the dataset had many missing cases and did not produce
secondary/tertiary/etc. administrations like the others (while other variables,  including
strings, had some missing variables and created secondary  administrations).
However, when converting sysmis to "No", these variables  worked perfectly as
well.  I'll look further into this to find out why but  you've already saved me a
tremendous amount of time!  Thank you  again.




In a message dated 10/11/2006 4:21:37 P.M. Eastern Standard Time,
[hidden email] writes:

Looks to  me like you should use casetovars to convert dataset2 into a
'wide' or  horizontal file with one row per client--it will have missing
data (as you  indicate below) for any administrations that were not
conducted.  Do  you have a variable that indicates which administration
it is?  If so  use this as your index variable in the casetovars syntax.
If not you could  create it with a lag function to establish a counter of
the records by  client.

CASESTOVARS
/ID = identifiervarname
/INDEX =  adminstrationindicatorvarname
/separator="_"
/GROUPBY = INDEX  .

Then do a simple match files to add the data from dataset  1.

Melissa

-----Original Message-----
From: SPSSX(r)  Discussion [mailto:[hidden email]] On Behalf Of
David  Elkins
Sent: Wednesday, October 11, 2006 3:09 PM
To:  [hidden email]
Subject: [SPSSX-L] Duplicate data - aggregate not  delete

Hello,

I was given two datasets that I will be merging by  ID.  In the second
dataset, there are duplicates for many (but not  all) of the IDs.
However, unlike other duplicate problems that I see  arising, I do not
want to delete this data.  In the second dataset,  each duplicate line
represents an additional administration of a  test.  Some patients were
only administered the test once while others  were administered the test
numerous times (up to
11  administrations).  In the merged file, I would like to create one
line  for each subject but retain all of the additional test
administration  data.  The second file is sorted by date so that each
administration  is in chronological time.  For ease of explanation in the
sample  datasets below, I will use "1" as responses for first
administrations, "2"  for second administrations, and "3" for third
administrations, as well as  using easy dates.  There are over 15,000
lines of data in Dataset 2 so  doing this manually is clearly out of  the
question.


Dataset1:
ID     VarX
1   100
2      100
3       100



Dataset 2:
ID     Date     Var1     Var2
1       1/1/01         1          1
1      2/2/02         2   2
2      1/1/01     1         1
3     1/1/01         1       1
3      2/2/02       2         2
3       3/3/03         3          3



Merged dataset that I would like:

ID    VarX    Date   Var1   Var2     Date_2   Var1_2    Var2_2    Date_3
Var1_3    Var2_3
1    100     1/1/01   1      1      2/2/02   2        2
2    100   1/1/01   1      1
3    100   1/1/01   1      1       2/2/02      2        2       3/3/03
3         3


Obviously, I would  need to rename Date, Var1 and Var2 during each
duplicate process to  represent the number of times this ID has been
duplicated (i.e.,
Var1 is  renamed Var1_2 for the first duplicate and renamed Var1_3 for
the second  duplicate).  I have successfully been able to use the 'lag'
and 'do  if' functions in the second dataset to create new variables and
give them  the correct data (but only on the existing duplicate line).
However, I'm at  a loss at how to move these "duplicate" variables onto
the same line as the  original ID.  Is this even possible?  Any
suggestions or even  pointing me to the proper SPSS function would be
greatly appreciated.   Thanks so much in advance.

David  Elkins



************************************************
David  Elkins, MS
Department of Psychiatry and Behavioral Sciences Eastern  Virginia
Medical School Norfolk, VA  23507
(757)  446-8985
[hidden email]



PRIVILEGED AND CONFIDENTIAL  INFORMATION
This transmittal and any attachments may contain PRIVILEGED  AND
CONFIDENTIAL information and is intended only for the use of  the
addressee. If you are not the designated recipient, or an  employee
or agent authorized to deliver such transmittals to the  designated
recipient, you are hereby notified that any  dissemination,
copying or publication of this transmittal is strictly  prohibited. If
you have received this transmittal in error, please notify  us
immediately by replying to the sender and delete this copy from  your
system. You may also call us at (309) 827-6026 for  assistance.