SPSSX Discussion

Removing Duplicate Records but Retaining all Data

Classic

List

Threaded

11 messages Options

JosephYoungblood

Removing Duplicate Records but Retaining all Data

My apologies beforehand, this is a lengthy post. I tried to respond to a
previous note from Eugene requesting clarity and data examples.

Thank you to all who read and respond.

The data I have are stacked by ID number, with numerous entries (cases) per
ID, the original data were medical claims, thus each record or entry
represented a separate claim.

I wish to delete all duplicate records, but I want to retain the data that
may exist in the separate cases.

For example, the variable "PrimDxGrpr" has 23 possible values (1-23), each
representing a diagnostic group, every member ID (person) includes data
regarding the diagnostic group. Regardless of the number of duplicate cases
per ID, some IDs have only a SINGLE diagnostic group number in all entries,
whereas other IDs have multiple diagnostic codes spread across all entries.

Again, I wish to have only single entries per ID, but I also wish to retain
any unique data that may exist in the duplicate entries.

DATA EXAMPLE
ID=0818, has 131 entries, for the variable "PrimDxGrpr" this ID has data for
9 diagnostic groups.

I recoded PrimDxGrpr into 23 new variables (PrimDxGrpr_01 through
PrimDxGrpr_23), and the appropriate data for each diagnostic group carried
over to its New corresponding variable.

In order to retain all data after I have deleted duplicate records, I need
the values in New variable to represent the total number or SUM of all cases
with that diagnostic group code. In the case of ID=0818, DxGrpr_01 should
reflect the total number of cases that included that diagnostic group code.

I have tried TRANSFORM, Count Values within cases, with no success.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Maguin, Eugene

Re: Removing Duplicate Records but Retaining all Data

You have to provide a specific data example showing before and after the transformation you wish to do. Your description implies multiple possible meanings. I noticed that you included a screen shot on your reply to me but I can't access it. And, please reply to the list rather than to me because there others on the list that are very skilled analysts.
Gene maguin

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Joseph Youngblood
Sent: Tuesday, April 15, 2014 2:29 PM
To: [hidden email]
Subject: Removing Duplicate Records but Retaining all Data

My apologies beforehand, this is a lengthy post. I tried to respond to a previous note from Eugene requesting clarity and data examples.

Thank you to all who read and respond.

The data I have are stacked by ID number, with numerous entries (cases) per ID, the original data were medical claims, thus each record or entry represented a separate claim.

I wish to delete all duplicate records, but I want to retain the data that may exist in the separate cases.

For example, the variable "PrimDxGrpr" has 23 possible values (1-23), each representing a diagnostic group, every member ID (person) includes data regarding the diagnostic group. Regardless of the number of duplicate cases per ID, some IDs have only a SINGLE diagnostic group number in all entries, whereas other IDs have multiple diagnostic codes spread across all entries.

Again, I wish to have only single entries per ID, but I also wish to retain any unique data that may exist in the duplicate entries.

DATA EXAMPLE
ID=0818, has 131 entries, for the variable "PrimDxGrpr" this ID has data for
9 diagnostic groups.

I recoded PrimDxGrpr into 23 new variables (PrimDxGrpr_01 through PrimDxGrpr_23), and the appropriate data for each diagnostic group carried over to its New corresponding variable.

In order to retain all data after I have deleted duplicate records, I need the values in New variable to represent the total number or SUM of all cases with that diagnostic group code. In the case of ID=0818, DxGrpr_01 should reflect the total number of cases that included that diagnostic group code.

I have tried TRANSFORM, Count Values within cases, with no success.

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

JosephYoungblood

Re: Removing Duplicate Records but Retaining all Data

In reply to this post by JosephYoungblood

The first example shows multiple records for the same ID, and the diagnostic
group code assigned.

Line # ID PrimDxGrpr
1 0818 1
2 0818 1
3 0818 2
4 0818 2
5 0818 2
6 0818 3
7 0818 3
8 0742 2
9 0742 1
10 0742 3
11 0742 3
12 0742 3

This shows the ideal: New Variables show cell count for each diagnostic
group by ID.

Line # ID P_DxGp_1 P_DxGp_3 P_DxGp_3
1 0818 2 3 2
2 0742 1 1 3

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Maguin, Eugene

Re: Removing Duplicate Records but Retaining all Data

Oh .... Ok. This:
Aggregate outfile=*/break=id primdxgrpr/ P_DxGp=nu.

Yields
ID PrimDxGrpr P_DxGp
0818 1 2
0818 2 3
0818 3 2
0742 1 1
0742 2 2
0742 3 3

Then:
Casestovars id=id.

But before you do that, run a frequencies on primdxgrpr and check that you have cases for every one of your 13 or 23 dx groups because you are using the resulting dx group variable names to also indicate the dx groups themselves.

So suppose that's true. Then you get

ID PrimDxGrpr.1 PrimDxGrpr.2 PrimDxGrpr.3 P_DxGp.1 P_DxGp.2 P_DxGp.3.
0818 1 2 3 2 3 2
0742 1 2 3 1 2 3

There's some tidying up to do but that is trivial, I think.
But, if that is not true, then you will need to do a loop and vector chunk of code to shift the dx counts to the variable name.

Gene Maguin

-----Original Message-----
From: Joseph Youngblood [mailto:[hidden email]]
Sent: Tuesday, April 15, 2014 3:18 PM
To: [hidden email]; Maguin, Eugene
Cc: Joseph Youngblood
Subject: Re: Removing Duplicate Records but Retaining all Data

The first example shows multiple records for the same ID, and the diagnostic group code assigned.

Line # ID PrimDxGrpr
1 0818 1
2 0818 1
3 0818 2
4 0818 2
5 0818 2
6 0818 3
7 0818 3
8 0742 2
9 0742 1
10 0742 3
11 0742 3
12 0742 3

This shows the ideal: New Variables show cell count for each diagnostic group by ID.

Line # ID P_DxGp_1 P_DxGp_3 P_DxGp_3
1 0818 2 3 2
2 0742 1 1 3

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Bruce Weaver

Re: Removing Duplicate Records but Retaining all Data

Administrator

In reply to this post by JosephYoungblood

For that example, you could use AGGREGATE followed by CASESTOVARS. E.g.,

* Create a sample data set.

DATA LIST list / Line(f5.0) ID(N4.0) PrimDxGrpr (f1).
BEGIN DATA
1 0818 1
2 0818 1
3 0818 2
4 0818 2
5 0818 2
6 0818 3
7 0818 3
8 0742 2
9 0742 1
10 0742 3
11 0742 3
12 0742 3
END DATA.
FORMATS ID(N4).

* Use AGGREGATE to get the counts.

AGGREGATE
/OUTFILE=*
/BREAK=ID PrimDxGrpr
/P_DxGp_=N.

* Now restructure from LONG to WIDE.

CASESTOVARS
/ID=ID
/INDEX=PrimDxGrpr
/SEPARATOR = ""
/GROUPBY=VARIABLE.

LIST.

OUTPUT:

ID P_DxGp_1 P_DxGp_2 P_DxGp_3

0742 1 1 3
0818 2 3 2

Number of cases read: 2 Number of cases listed: 2

Joseph Youngblood wrote

The first example shows multiple records for the same ID, and the diagnostic
group code assigned.

Line # ID PrimDxGrpr
1 0818 1
2 0818 1
3 0818 2
4 0818 2
5 0818 2
6 0818 3
7 0818 3
8 0742 2
9 0742 1
10 0742 3
11 0742 3
12 0742 3

This shows the ideal: New Variables show cell count for each diagnostic
group by ID.

Line # ID P_DxGp_1 P_DxGp_3 P_DxGp_3
1 0818 2 3 2
2 0742 1 1 3

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

Richard Ristow

Re: Removing Duplicate Records but Retaining all Data

In reply to this post by JosephYoungblood

At 02:28 PM 4/15/2014, Joseph Youngblood wrote:

>The data I have are stacked by ID number, with numerous entries
>(cases) per ID. Each record or entry represents a separate [medical] claim.
>
>I wish to delete all duplicate records, but I want to retain the
>data that may exist in the separate cases.

I'll come back to this, but the first question is, what makes a
record a 'duplicate'? It sounds like every record has information not
duplicated on any other record -- for example, presumably, the date
of the medical claim. So, you aren't removing duplicates; you're
summarizing. That's fine; it just needs to be recognized as such.

>For example, the variable "PrimDxGrpr" has 23 possible values
>(1-23), each representing a diagnostic group ...

Here, I presume each diagnostic group is a set of diagnoses,
presumably ICD codes.

>Every member ID (person) includes data regarding the diagnostic group.

I presume, by "Every member ID", you mean every medical claim record.
What data is there, regarding the diagnostic group Date of diagnosis?
Amount of the medical claim?

You say this is data regarding the diagnostic group, but you say each
record is about an individual claim, so you could have multiple
records for the same person and same diagnostic group, but with
different information.

Or, am I guessing wrong what is in each of the records?

>Regardless of the number of duplicate cases per ID,

OK, we're not going to say 'duplicate' anymore. These are not
duplicates of the same information; they're separate records with
different information. Starting again,

>Regardless of the number of xxxxxxxxx cases per ID, some IDs have
>only a SINGLE diagnostic group number in all entries, whereas other
>IDs have multiple diagnostic codes spread across all entries.

In other words, some people have had claims (possibly more than one
claim) within only one diagnostic group, others have had claims in
several diagnostic groups, perhaps with more than one claim in one or
more of these groups.

>Again, I wish to have only single entries per ID, but I also wish to retain
>any unique data that may exist in the duplicate entries.
>
>I recoded PrimDxGrpr into 23 new variables (PrimDxGrpr_01 through
>PrimDxGrpr_23), and the appropriate data for each diagnostic group
>carried over to its New corresponding variable.

Now we're getting somewhere. What "appropriate data" are you talking
about, that becomes the value of whichever new variable? Date? Cost?
Do you really only have one variable's worth of information you want to keep?

And, I remind you again: at *this* point, you don't have information
about the diagnostic group. You have information about the claim,
which falls into one of the diagnostic groups.

>In order to retain all data after I have deleted duplicate records,
>I need the values in New variable to represent the total number or
>SUM of all cases with that diagnostic group code.

OK, I'm going to say that you have one quantity you want to add up;
I'll call it MedicalQuantity.

Start with your original data; forget the 23 new variables. Use code like this:

DATASET DECLARE Summary.
AGGREGATE OUTFILE=Summary
/BREAK=ID PrimDxGrpr
/TotQuant 'Sum of whatever "MedicalQuantity" represents'
=SUM(MedicalQuantity).

DATASET ACTIVATE Summary WINDOW=FRONT.

Now you have a long-form summary file. It has one record for each ID
for each diagnostic group in which that person had *any* claim. If
you really want one record per ID, then this (but not tested, and
CASESTOVARS is sometimes hard to get right):

CASESTOVARS
/ID=ID
/INDEX=PrimDxGrpr.

And if, as seems likely to me, you need to keep more than one summary
quantity per diagnositic group (like, say, the number of claims), say
that, and we'll tell you how to refine this l0gic.

-Good luck,
Richard Ristow

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

David Marso

Re: Removing Duplicate Records but Retaining all Data

Administrator

In reply to this post by JosephYoungblood

Joseph,
Please study the AGGREGATE and CASESTOVARS commands!
2 lines of code!
David

Joseph Youngblood wrote

The first example shows multiple records for the same ID, and the diagnostic
group code assigned.

Line # ID PrimDxGrpr
1 0818 1
2 0818 1
3 0818 2
4 0818 2
5 0818 2
6 0818 3
7 0818 3
8 0742 2
9 0742 1
10 0742 3
11 0742 3
12 0742 3

This shows the ideal: New Variables show cell count for each diagnostic
group by ID.

Line # ID P_DxGp_1 P_DxGp_3 P_DxGp_3
1 0818 2 3 2
2 0742 1 1 3

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Richard Ristow

Re: Removing Duplicate Records but Retaining all Data

In reply to this post by Maguin, Eugene

At 02:45 PM 4/15/2014, Maguin, Eugene wrote:

>You have to provide a specific data example showing before and after
>the transformation you wish to do.

It does help, doesn't it? Four answers, you got! Looks like all of us
thinking pretty much the same thing.

Again: the truly confusing word was 'duplicate'. 'Duplicate' records
repeat the same information; your 'duplicates' actually have
different information, which you want to keep in summary form.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

David Marso

Re: Removing Duplicate Records but Retaining all Data

Administrator

In reply to this post by David Marso

Here's the code I had in mind.
DATA LIST LIST /Line# ID PrimDxGrpr .
BEGIN DATA
1 0818 1
2 0818 1
3 0818 2
4 0818 2
5 0818 2
6 0818 3
7 0818 3
8 0742 2
9 0742 1
10 0742 3
11 0742 3
12 0742 3
END DATA.
DATASET DECLARE agg.

AGGREGATE OUTFILE agg/BREAK ID PrimDxGrpr /P_DxGp_=N.
DATASET ACTIVATE agg.
CASESTOVARS ID=ID /INDEX=PrimDxGrpr.

David Marso wrote

Joseph,
Please study the AGGREGATE and CASESTOVARS commands!
2 lines of code!
David

Joseph Youngblood wrote

The first example shows multiple records for the same ID, and the diagnostic
group code assigned.

Line # ID PrimDxGrpr
1 0818 1
2 0818 1
3 0818 2
4 0818 2
5 0818 2
6 0818 3
7 0818 3
8 0742 2
9 0742 1
10 0742 3
11 0742 3
12 0742 3

This shows the ideal: New Variables show cell count for each diagnostic
group by ID.

Line # ID P_DxGp_1 P_DxGp_3 P_DxGp_3
1 0818 2 3 2
2 0742 1 1 3

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Maguin, Eugene

Re: Removing Duplicate Records but Retaining all Data

In reply to this post by Maguin, Eugene

Joe,

When you did the frequencies after the aggregate was every one of your dx groups represented?

If all of them were, then I don’t understand this, ‘…but the cell values are not the quantities of each Dx code summed for cases.’

Again a specific example is needed.

If all of them were not represented, then please re-run the casestovars using either Bruce’s syntax or David’s syntax (I think both are the same but I’ve deleted Bruce’s yesterday) because theirs is better thought out than mine was, specifically with respect to the index function.

Gene Maguin

From: Joseph A. Youngblood [mailto:[hidden email]]
Sent: Tuesday, April 15, 2014 6:49 PM
To: Maguin, Eugene
Subject: Re: Removing Duplicate Records but Retaining all Data

Gene,

Your advice was spot on, thank you.

I may have done something wrong.

I executed the AGGREGATE command using the main data set. The output was a separate file, like your example. When I run the CASESTOVARS using the new file it creates a series of new variables that match the data in cases, but the cell values are not the quantities of each Dx code summed for cases.

Sincerely,

Joseph A. Youngblood

On Apr 15, 2014, at 12:44 PM, "Maguin, Eugene" <[hidden email]> wrote:

Oh .... Ok. This:
Aggregate outfile=*/break=id primdxgrpr/ P_DxGp=nu.

Yields
ID PrimDxGrpr P_DxGp
0818 1 2
0818 2 3
0818 3 2
0742 1 1
0742 2 2
0742 3 3

Then:
Casestovars id=id.

But before you do that, run a frequencies on primdxgrpr and check that you have cases for every one of your 13 or 23 dx groups because you are using the resulting dx group variable names to also indicate the dx groups themselves.

So suppose that's true. Then you get

ID PrimDxGrpr.1 PrimDxGrpr.2 PrimDxGrpr.3 P_DxGp.1 P_DxGp.2 P_DxGp.3.
0818 1 2 3 2 3 2
0742 1 2 3 1 2 3

There's some tidying up to do but that is trivial, I think.
But, if that is not true, then you will need to do a loop and vector chunk of code to shift the dx counts to the variable name.

Gene Maguin

-----Original Message-----
From: Joseph Youngblood [mailto:[hidden email]]
Sent: Tuesday, April 15, 2014 3:18 PM
To: [hidden email]; Maguin, Eugene
Cc: Joseph Youngblood
Subject: Re: Removing Duplicate Records but Retaining all Data

The first example shows multiple records for the same ID, and the diagnostic group code assigned.

Line # ID PrimDxGrpr
1 0818 1
2 0818 1
3 0818 2
4 0818 2
5 0818 2
6 0818 3
7 0818 3
8 0742 2
9 0742 1
10 0742 3
11 0742 3
12 0742 3

This shows the ideal: New Variables show cell count for each diagnostic group by ID.

Line # ID P_DxGp_1 P_DxGp_3 P_DxGp_3
1 0818 2 3 2
2 0742 1 1 3

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

statcat

Re: Removing Duplicate Records but Retaining all Data

In reply to this post by David Marso

Funny note to that syntax, my old SPSS (15) I had to add
FORMAT PrimDxGrpr (F2.0).
before aggregating to ensure SPSS could use the column for var names afterwards: the data shows with 2 decimal places and without the FORMAT an error appears and the new var names get replaced with v1 to v3 (as the decimal sign, both "." and "," - for us Europeans - are not valid for use in variable names).