Combining Files with Duplicate Records

classic Classic list List threaded Threaded
18 messages Options
Reply | Threaded
Open this post in threaded view
|

Combining Files with Duplicate Records

JosephYoungblood
I would appreciate any advice / guidance here.  I am very familiar combining
files matched on a single unifying variable.  Unlike a typical merge, the two
files I need to combine do not have an equal number of cases.  For example,
file1, every ID number (unifying variable) has multiple cases, whereas in the
the second file there may be only one case per ID.

I want all data from file2 brought in, but only the exact number of cases
contained in that file.  For example, in the combined file the ID# 001 has 31
total cases (occurrences) of which 10 cases have all data from both files and
21 cases with data from the original file.

Thank you,

Joseph

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Combining Files with Duplicate Records

Maguin, Eugene
Clarification. File2 has exactly one record per id while file 1 has 1 or more records per id AND file 1 has sets of records whose id does not appear in file 2 BUT every id value in file 2 appears in file 1. Is this all true?

I think a match files command with file=file1 and table=file2 and the IN keyword specified for file2 would be right then delete records where the IN variable=0. Rationale. Table will spread file2 data across those file1 records matching on the BY variable and set the IN variable to 1. Records in file 1 but not file 2 will be sysmis for the added variables from file 2 and the IN variable will equal 0. And those records you delete. Because Table is used for file 2, records in file 2 are not added to resulting file.
Gene Maguin

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Joseph Youngblood
Sent: Tuesday, April 08, 2014 2:15 PM
To: [hidden email]
Subject: Combining Files with Duplicate Records

I would appreciate any advice / guidance here.  I am very familiar combining files matched on a single unifying variable.  Unlike a typical merge, the two files I need to combine do not have an equal number of cases.  For example, file1, every ID number (unifying variable) has multiple cases, whereas in the the second file there may be only one case per ID.

I want all data from file2 brought in, but only the exact number of cases contained in that file.  For example, in the combined file the ID# 001 has 31 total cases (occurrences) of which 10 cases have all data from both files and
21 cases with data from the original file.

Thank you,

Joseph

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Combining Files with Duplicate Records

Jignesh Sutar
In reply to this post by JosephYoungblood
Sounds like you need something like the below (untested). The IDs in file2 must be unique but those records can be mapped to multiple records (or occurrences of the same ID in file1)

DATASET ACTIVATE file2.
SORT CASES BY ID.
DATASET ACTIVATE file1.
SORT CASES BY ID.
MATCH FILES FILE=file1 /TABLE=file2 /by ID.


On 8 April 2014 19:15, Joseph Youngblood <[hidden email]> wrote:
I would appreciate any advice / guidance here.  I am very familiar combining
files matched on a single unifying variable.  Unlike a typical merge, the two
files I need to combine do not have an equal number of cases.  For example,
file1, every ID number (unifying variable) has multiple cases, whereas in the
the second file there may be only one case per ID.

I want all data from file2 brought in, but only the exact number of cases
contained in that file.  For example, in the combined file the ID# 001 has 31
total cases (occurrences) of which 10 cases have all data from both files and
21 cases with data from the original file.

Thank you,

Joseph

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Reply | Threaded
Open this post in threaded view
|

Re: Combining Files with Duplicate Records

Jeanne Eidex
In reply to this post by JosephYoungblood
Hi Joseph...

1) File 1 (with multiple cases) - sort by your ID number

2) File 2 (with single cases)  - sort by your ID number

3) File 2 - make sure the id variable name is the same as in file 1

4) you need to use the match files command with a table match & this will
only add information from the 2nd file that has a match to the first...

MATCH FILES FILE='C:\MATCHPROJECT\FILE_WITH_MULTIPLE_ID_CASES.SAV'
 /TABLE='C:\MATCHPROJECT\FILE_WITH_SINGLE_ID_CASES.SAV'
 /BY ID
 /MAP.

Hopefully that will work, I didn't test this but the key is to make sure you
assign the file and table properly.

Jeanne






-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Joseph Youngblood
Sent: Tuesday, April 08, 2014 2:15 PM
To: [hidden email]
Subject: Combining Files with Duplicate Records

I would appreciate any advice / guidance here.  I am very familiar combining
files matched on a single unifying variable.  Unlike a typical merge, the
two files I need to combine do not have an equal number of cases.  For
example, file1, every ID number (unifying variable) has multiple cases,
whereas in the the second file there may be only one case per ID.

I want all data from file2 brought in, but only the exact number of cases
contained in that file.  For example, in the combined file the ID# 001 has
31 total cases (occurrences) of which 10 cases have all data from both files
and
21 cases with data from the original file.

Thank you,

Joseph

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command SIGNOFF SPSSX-L For a list of
commands to manage subscriptions, send the command INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Combining Files with Duplicate Records

David Marso
Administrator
In reply to this post by JosephYoungblood
Joseph,
  If the files are structured as Gene has surmised then his and JSutar's solutions will fly.
However, I find the following sentence to be alarmingly confusing:
In particular in light of " in the the second file there may be only one case per ID"

"For example, in the combined file the ID# 001 has 31 total cases (occurrences)
of which 10 cases have all data from both files and  21 cases with data from the original file."
These two statements are in contradiction!
Please clarify with a simplified example (inputs and desired output).


Joseph Youngblood wrote
I would appreciate any advice / guidance here.  I am very familiar combining
files matched on a single unifying variable.  Unlike a typical merge, the two
files I need to combine do not have an equal number of cases.  For example,
file1, every ID number (unifying variable) has multiple cases, whereas in the
the second file there may be only one case per ID.

I want all data from file2 brought in, but only the exact number of cases
contained in that file.  For example, in the combined file the ID# 001 has 31
total cases (occurrences) of which 10 cases have all data from both files and
21 cases with data from the original file.

Thank you,

Joseph

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Combining Files with Duplicate Records

JosephYoungblood
In reply to this post by JosephYoungblood
Thank you David, yes, not quite a non sequitur, but my example didn't follow
logically...allow me to clarify.

I have two files with seven variables in common, with "ID" being the
matching variable.

File 1 = Medical claims data, 64 variables.  Some IDs have only 1 case, but
most IDs have multiple cases.  There is no a priori pattern or logic to the
frequency of cases.

File 2 = Pharmacy claims data, 28 variables.  As with File1, some IDs have
only one case, most IDs have multiple cases.

NOTE:

The number of cases per ID between the two files is NOT equal.  File1 has
the greater number of cases per ID, however, there are a few instances when
File2 has more cases per ID than File1.  The relationship between the number
of cases per ID and the differential between the two files is not of
interest presently.

CHALLENGE:

I want the new combined file to bring in the 21 variables unique to File2,
but only the number of cases per ID contained in File2.

The merge/combining process is straightforward when the number of IDs and
cases per ID are identical for File1 and File2.

The challenge is when there are differential cases per ID between the two
files.

The combined file (File3) could look like the following: (note the numbers
used are arbitrary and used for illustrative purposes).

File3, ID = 001 has 30 cases from File1 (medical claims) AND one case from
File2.

OR

File3, ID = 001 has 15 cases from File1 AND 22 cases from File2

OR

File3, ID = 001 has zero cases from File1 AND 12 cases from File2


Thank you,

Joseph A. Youngblood

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Combining Files with Duplicate Records

Maguin, Eugene
Ok, so suppose file 1 id=304 has 5 records and file 2 id=304 has 9 records. Which 5 records from file 2 do you want to keep? Or, suppose id=908 has 3 record in both files, so do it matter how you match up records from file 1 and file 2? Is 1-2, 2-3, 3-1 as good as 1-1, 2-2, 3-3?

Gene Maguin

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Joseph Youngblood
Sent: Tuesday, April 08, 2014 4:08 PM
To: [hidden email]
Subject: Re: Combining Files with Duplicate Records

Thank you David, yes, not quite a non sequitur, but my example didn't follow logically...allow me to clarify.

I have two files with seven variables in common, with "ID" being the matching variable.

File 1 = Medical claims data, 64 variables.  Some IDs have only 1 case, but most IDs have multiple cases.  There is no a priori pattern or logic to the frequency of cases.

File 2 = Pharmacy claims data, 28 variables.  As with File1, some IDs have only one case, most IDs have multiple cases.

NOTE:

The number of cases per ID between the two files is NOT equal.  File1 has the greater number of cases per ID, however, there are a few instances when
File2 has more cases per ID than File1.  The relationship between the number of cases per ID and the differential between the two files is not of interest presently.

CHALLENGE:

I want the new combined file to bring in the 21 variables unique to File2, but only the number of cases per ID contained in File2.

The merge/combining process is straightforward when the number of IDs and cases per ID are identical for File1 and File2.

The challenge is when there are differential cases per ID between the two files.

The combined file (File3) could look like the following: (note the numbers used are arbitrary and used for illustrative purposes).

File3, ID = 001 has 30 cases from File1 (medical claims) AND one case from File2.

OR

File3, ID = 001 has 15 cases from File1 AND 22 cases from File2

OR

File3, ID = 001 has zero cases from File1 AND 12 cases from File2


Thank you,

Joseph A. Youngblood

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Combining Files with Duplicate Records

Rich Ulrich
In reply to this post by David Marso
Unless the OP now knows the way, I agree with the need for
"simplified example".  I want to point out two things.

You are asking for chaos and confusion if you don't have unique
identification for every line.  Thus, if ID# 001, you want another
variable such as Visit that runs from 1 to 31 for the lines for 001.
This should help to promote both the discussion and the eventual
syntax. 

The second thing is the statement, "in the second file there may
be only one case per ID."  Is that "shall" or "might" as the reading
of "may"?  One case per ID is easy enough to deal with, such as -
matching to each of the others. 

The request for help is unclear, though.  One idea I get is that the
OP might want this file 2 record to only match (say) Visit=1  in file 1.

--
Rich Ulrich



> Date: Tue, 8 Apr 2014 12:07:18 -0700

> From: [hidden email]
> Subject: Re: Combining Files with Duplicate Records
> To: [hidden email]
>
> Joseph,
> If the files are structured as Gene has surmised then his and JSutar's
> solutions will fly.
> However, I find the following sentence to be alarmingly confusing:
> In particular in light of " in the the second file there may be only *one
> case per ID*"
>
> "For example, in the combined file the ID# 001 has *31 total cases*
> (occurrences)
> of which *10 cases *have all data from both files and *21 cases* with data
> from the original file."
> These two statements are in contradiction!
> Please clarify with a simplified example (inputs and desired output).
>
>
>
> Joseph Youngblood wrote
> > I would appreciate any advice / guidance here. I am very familiar
> > combining
> > files matched on a single unifying variable. Unlike a typical merge, the
> > two
> > files I need to combine do not have an equal number of cases. For
> > example,
> > file1, every ID number (unifying variable) has multiple cases, whereas in
> > the
> > the second file there may be only one case per ID.
> >
> > I want all data from file2 brought in, but only the exact number of cases
> > contained in that file. For example, in the combined file the ID# 001 has
> > 31
> > total cases (occurrences) of which 10 cases have all data from both files
> > and
> > 21 cases with data from the original file.
> >
> > Thank you,
> >
> > Joseph
> >
> > =====================
> > To manage your subscription to SPSSX-L, send a message to
>
> > LISTSERV@.UGA
>
> > (not to SPSSX-L), with no body text except the
> > command. To leave the list, send the command
> > SIGNOFF SPSSX-L
> > For a list of commands to manage subscriptions, send the command
> > INFO REFCARD
>
>
>
>
>
> -----
> Please reply to the list and not to my personal email.
> Those desiring my consulting or training services please feel free to email me.
> ---
> "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
> Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
> --
> View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Combining-Files-with-Duplicate-Records-tp5725352p5725355.html
> Sent from the SPSSX Discussion mailing list archive at Nabble.com.
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> [hidden email] (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Combining Files with Duplicate Records

David Marso
Administrator
In reply to this post by JosephYoungblood
Please cobble together a simple example of how such a situation might appear.
ie, a very small example of two input files and what the actual output data should look like.
In the mismatched number of records instances what dictates which record should be joined with any other record?  This is clearly not a simple TABLE merge as Gene and others in this thread assumed from the initial description, nor does it sound like a many to many (Cartesian product which has been discussed frequently in recent months -mostly between Richard Ristow and myself and Jon Peck providing an Extension at one point-).

Joseph Youngblood wrote
Thank you David, yes, not quite a non sequitur, but my example didn't follow
logically...allow me to clarify.

I have two files with seven variables in common, with "ID" being the
matching variable.

File 1 = Medical claims data, 64 variables.  Some IDs have only 1 case, but
most IDs have multiple cases.  There is no a priori pattern or logic to the
frequency of cases.

File 2 = Pharmacy claims data, 28 variables.  As with File1, some IDs have
only one case, most IDs have multiple cases.

NOTE:

The number of cases per ID between the two files is NOT equal.  File1 has
the greater number of cases per ID, however, there are a few instances when
File2 has more cases per ID than File1.  The relationship between the number
of cases per ID and the differential between the two files is not of
interest presently.

CHALLENGE:

I want the new combined file to bring in the 21 variables unique to File2,
but only the number of cases per ID contained in File2.

The merge/combining process is straightforward when the number of IDs and
cases per ID are identical for File1 and File2.

The challenge is when there are differential cases per ID between the two
files.

The combined file (File3) could look like the following: (note the numbers
used are arbitrary and used for illustrative purposes).

File3, ID = 001 has 30 cases from File1 (medical claims) AND one case from
File2.

OR

File3, ID = 001 has 15 cases from File1 AND 22 cases from File2

OR

File3, ID = 001 has zero cases from File1 AND 12 cases from File2


Thank you,

Joseph A. Youngblood

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Combining Files with Duplicate Records

Rich Ulrich
In reply to this post by JosephYoungblood
Okay, you have the awkward case of multiple records per ID in each file.
AND you haven't said how those records can be indexed within ID.

Also, I'm afraid that your explanation does not yet match your example.
  "I want the new combined file to bring in the 21 variables unique to File2,
  but only the number of cases per ID contained in File2."
The first example given contradicts this prescription:
  "File3, ID = 001 has 30 cases from File1 (medical claims) AND one case from
   File2."

Here's a thought-experiment.  Suppose that you create a file that has 
64+21 variables, sorted by date-time.  Each line has either the 64 or 28
variables (7 overlapping) with useful values, and the other values Missing.

Does this file have lines that you can collapse together?  Does this file,
as it is, serve whatever purpose you have in mind?

--
Rich Ulrich


> Date: Tue, 8 Apr 2014 16:07:35 -0400

> From: [hidden email]
> Subject: Re: Combining Files with Duplicate Records
> To: [hidden email]
>
> Thank you David, yes, not quite a non sequitur, but my example didn't follow
> logically...allow me to clarify.
>
> I have two files with seven variables in common, with "ID" being the
> matching variable.
>
> File 1 = Medical claims data, 64 variables. Some IDs have only 1 case, but
> most IDs have multiple cases. There is no a priori pattern or logic to the
> frequency of cases.
>
> File 2 = Pharmacy claims data, 28 variables. As with File1, some IDs have
> only one case, most IDs have multiple cases.
>
> NOTE:
>
> The number of cases per ID between the two files is NOT equal. File1 has
> the greater number of cases per ID, however, there are a few instances when
> File2 has more cases per ID than File1. The relationship between the number
> of cases per ID and the differential between the two files is not of
> interest presently.
>
> CHALLENGE:
>
> I want the new combined file to bring in the 21 variables unique to File2,
> but only the number of cases per ID contained in File2.
>
> The merge/combining process is straightforward when the number of IDs and
> cases per ID are identical for File1 and File2.
>
> The challenge is when there are differential cases per ID between the two
> files.
>
> The combined file (File3) could look like the following: (note the numbers
> used are arbitrary and used for illustrative purposes).
>
> File3, ID = 001 has 30 cases from File1 (medical claims) AND one case from
> File2.
>
> OR
>
> File3, ID = 001 has 15 cases from File1 AND 22 cases from File2
>
> OR
>
> File3, ID = 001 has zero cases from File1 AND 12 cases from File2
>
>
> Thank you,
>
> Joseph A. Youngblood
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> [hidden email] (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Combining Files with Duplicate Records

JosephYoungblood
In reply to this post by JosephYoungblood
I have run attempted this several times, each with a margin of success, but
nothing fully arriving at a solution.

I really appreciate all who have contributed, thank you.

Simply put, I need to combine two files that contain the following
information into a New file that contains only those variables unique to
each other, and MATCHED on the ID number, while retaining the number of
cases per ID.

BACKGROUND:
Among the many research questions we aim to answer, we need to construct a
complete picture or profile of each individual as it relates to his/her
medical history.  Relying on the medical claims information in isolation of
the pharmacy claims information would not render a complete picture.

DATA SOURCE
FIle1 (Medical Claims, 64 variables) includes; ID, age, gender, relationship
to the insured, primary diagnosis, secondary diagnosis, place of service,
procedure, type of visit, duration of visit (in the case of a
hospitalization), dates of service, provider name, provider city, bio-metric
data, whether the individual participates in a wellness program, and if so
type, frequency etc, and several other variables pertaining to billing.

File2 (Pharmacy Claims, 28 variables) includes; ID, age, gender,
relationship to the insured, drug class, drug name, drug brand name, dosage,
generic status, refill status and quantity, and several other variables
related to provider, and billing.

The NEW Combined/Merged file (File3) should have 85 variables, 64 from
File1, and 21 from File2 (28 variables less the seven shared with File1).


"MATCHING CASES ON KEY VARIABLE(S)"

Each member has a unique ID number, and, to the extent that a member filed
both a medical claim and pharmacy claim, the member ID is present in both
files.


SCENARIO LEADING TO UNMATCHED NUMBER OF CASES PER FILE:

Member ID 001 had 25 medical claims in the time period of reference, each
claim is filed as a unique case, hence 25 cases per member ID 001 from File1
(medical claims).

Member 001 also had 4 pharmacy claims in the time period of reference, each
pharmacy claim is filed as a unique case.

Technically, Member ID 001 has had 29 claims filed with his/her insurance
company for services rendered (25 medical, 4 pharmacy).  However, because I
am interested in the presumed richness of the combined file, I wish to merge
the two files.

It is important to retain the "number of claims" (cases) per ID.  So, the
data from the 4 pharmacy claims (Cases) from File2 will be subsumed into the
25 medical claims (Cases).

For member 001, the NEW FILE should have 25 total cases, of which four cases
will have data for all 85 variables, and 21 cases with data for only 64
variables.

In SPSS 22, the GUI steps are the following:  Data -> Merge Files -> Add
Variables.  The user is required, if necessary for the merge, to select the
option "Match Cases on Key Variables" and then select the variable(s) common
in both files and that will be used as the matching variable.  In my case
that would be ID.

Following this out results in complete data for all variables for each case
of a Member ID.  Using member ID 001 again as an example, I get 25 cases
with data for all 85 variables.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Combining Files with Duplicate Records

David Marso
Administrator
This seems to me a path towards utter disaster!
Consider rolling everything for a given ID into a single record for each file using CASESTOVARS,
then a simple MATCH BY ID.
As much as I despise wide data format, it seems doing otherwise is an even greater invitation to chaos!

Joseph Youngblood wrote
I have run attempted this several times, each with a margin of success, but
nothing fully arriving at a solution.

I really appreciate all who have contributed, thank you.

Simply put, I need to combine two files that contain the following
information into a New file that contains only those variables unique to
each other, and MATCHED on the ID number, while retaining the number of
cases per ID.

BACKGROUND:
Among the many research questions we aim to answer, we need to construct a
complete picture or profile of each individual as it relates to his/her
medical history.  Relying on the medical claims information in isolation of
the pharmacy claims information would not render a complete picture.

DATA SOURCE
FIle1 (Medical Claims, 64 variables) includes; ID, age, gender, relationship
to the insured, primary diagnosis, secondary diagnosis, place of service,
procedure, type of visit, duration of visit (in the case of a
hospitalization), dates of service, provider name, provider city, bio-metric
data, whether the individual participates in a wellness program, and if so
type, frequency etc, and several other variables pertaining to billing.

File2 (Pharmacy Claims, 28 variables) includes; ID, age, gender,
relationship to the insured, drug class, drug name, drug brand name, dosage,
generic status, refill status and quantity, and several other variables
related to provider, and billing.

The NEW Combined/Merged file (File3) should have 85 variables, 64 from
File1, and 21 from File2 (28 variables less the seven shared with File1).


"MATCHING CASES ON KEY VARIABLE(S)"

Each member has a unique ID number, and, to the extent that a member filed
both a medical claim and pharmacy claim, the member ID is present in both
files.


SCENARIO LEADING TO UNMATCHED NUMBER OF CASES PER FILE:

Member ID 001 had 25 medical claims in the time period of reference, each
claim is filed as a unique case, hence 25 cases per member ID 001 from File1
(medical claims).

Member 001 also had 4 pharmacy claims in the time period of reference, each
pharmacy claim is filed as a unique case.

Technically, Member ID 001 has had 29 claims filed with his/her insurance
company for services rendered (25 medical, 4 pharmacy).  However, because I
am interested in the presumed richness of the combined file, I wish to merge
the two files.

It is important to retain the "number of claims" (cases) per ID.  So, the
data from the 4 pharmacy claims (Cases) from File2 will be subsumed into the
25 medical claims (Cases).

For member 001, the NEW FILE should have 25 total cases, of which four cases
will have data for all 85 variables, and 21 cases with data for only 64
variables.

In SPSS 22, the GUI steps are the following:  Data -> Merge Files -> Add
Variables.  The user is required, if necessary for the merge, to select the
option "Match Cases on Key Variables" and then select the variable(s) common
in both files and that will be used as the matching variable.  In my case
that would be ID.

Following this out results in complete data for all variables for each case
of a Member ID.  Using member ID 001 again as an example, I get 25 cases
with data for all 85 variables.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Combining Files with Duplicate Records

Rich Ulrich
In reply to this post by JosephYoungblood
I think that my earlier post gives you, pretty directly, two different solutions.

As I read it, you have (say) 25 records in file 1 and 4 in the file 2, and you
want to arbitrarily jamb those 4 from 2 onto the end of the first 4 lines in 1

If you do as I suggested and create a CLAIM#  that runs from 1-25 in the
first and 1-4 in the second, then you can do a simple MATCH FILES  using
ID and CLAIM#  to create that desired outcome.

On the other hand, I consider that a horrible file organization -- even if it
did not leave you a further quandary of what to do when there are more
Pharm claims than doctor visits.

There seems to be no essential reason to put those data into one file, unless
you simply want to be able to name one file that can be used for statistics.

What I suggested as a thought experiment seems to be a more elegant solution
for having something available with all the data: merged lines, with multiple MISS.

If you want to be able to get a simple count of the number of visits (that "25"),
it is easy enough to count the non-Missing on something, or use an indicator
to show, for each CLAIM#, whether it is DOCTOR or PHARM.

--
Rich Ulrich



> Date: Tue, 8 Apr 2014 18:44:14 -0400

> From: [hidden email]
> Subject: Re: Combining Files with Duplicate Records
> To: [hidden email]
>
> I have run attempted this several times, each with a margin of success, but
> nothing fully arriving at a solution.
>
> I really appreciate all who have contributed, thank you.
>
> Simply put, I need to combine two files that contain the following
> information into a New file that contains only those variables unique to
> each other, and MATCHED on the ID number, while retaining the number of
> cases per ID.
>
> BACKGROUND:
> Among the many research questions we aim to answer, we need to construct a
> complete picture or profile of each individual as it relates to his/her
> medical history. Relying on the medical claims information in isolation of
> the pharmacy claims information would not render a complete picture.
>
> DATA SOURCE
> FIle1 (Medical Claims, 64 variables) includes; ID, age, gender, relationship
> to the insured, primary diagnosis, secondary diagnosis, place of service,
> procedure, type of visit, duration of visit (in the case of a
> hospitalization), dates of service, provider name, provider city, bio-metric
> data, whether the individual participates in a wellness program, and if so
> type, frequency etc, and several other variables pertaining to billing.
>
> File2 (Pharmacy Claims, 28 variables) includes; ID, age, gender,
> relationship to the insured, drug class, drug name, drug brand name, dosage,
> generic status, refill status and quantity, and several other variables
> related to provider, and billing.
>
> The NEW Combined/Merged file (File3) should have 85 variables, 64 from
> File1, and 21 from File2 (28 variables less the seven shared with File1).
>
>
> "MATCHING CASES ON KEY VARIABLE(S)"
>
> Each member has a unique ID number, and, to the extent that a member filed
> both a medical claim and pharmacy claim, the member ID is present in both
> files.
>
>
> SCENARIO LEADING TO UNMATCHED NUMBER OF CASES PER FILE:
>
> Member ID 001 had 25 medical claims in the time period of reference, each
> claim is filed as a unique case, hence 25 cases per member ID 001 from File1
> (medical claims).
>
> Member 001 also had 4 pharmacy claims in the time period of reference, each
> pharmacy claim is filed as a unique case.
>
> Technically, Member ID 001 has had 29 claims filed with his/her insurance
> company for services rendered (25 medical, 4 pharmacy). However, because I
> am interested in the presumed richness of the combined file, I wish to merge
> the two files.
>
> It is important to retain the "number of claims" (cases) per ID. So, the
> data from the 4 pharmacy claims (Cases) from File2 will be subsumed into the
> 25 medical claims (Cases).
>
> For member 001, the NEW FILE should have 25 total cases, of which four cases
> will have data for all 85 variables, and 21 cases with data for only 64
> variables.
>
> In SPSS 22, the GUI steps are the following: Data -> Merge Files -> Add
> Variables. The user is required, if necessary for the merge, to select the
> option "Match Cases on Key Variables" and then select the variable(s) common
> in both files and that will be used as the matching variable. In my case
> that would be ID.
>
> Following this out results in complete data for all variables for each case
> of a Member ID. Using member ID 001 again as an example, I get 25 cases
> with data for all 85 variables.
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> [hidden email] (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Combining Files with Duplicate Records

Richard Ristow
In reply to this post by JosephYoungblood
At 04:07 PM 4/8/2014, Joseph Youngblood wrote:

>I have two files with seven variables in common, with "ID" being the
>matching variable.
>
>File 1 = Medical claims data, 64 variables.  Some IDs have only 1
>case, but most IDs have multiple cases.
>
>File 2 = Pharmacy claims data, 28 variables.  As with File1, some
>IDs have only one case, most IDs have multiple cases.
>
>The number of cases per ID between the two files is NOT equal.
>
>I want the new combined file to bring in the 21 variables unique to
>File2, but only the number of cases per ID contained in File2.

This isn't a problem of SPSS coding; it's a problem of being clear
what you mean.

If you merge File 1 and File 2, it looks like the resulting records will have,

. Variable ID
. The other variables the two files have in common. (Careful! What if
the values of the 'common' variables conflict between the two file?
Which version do you want, and do you want error checking so you know
you have the problem?)
. The variables from ONE medical  claim
. The variables from ONE pharmacy claim

Now: this doesn't make sense unless the pharmacy claim is connected
to the medical claim -- say, is part of the same disease episode. Is
there a meaningful way of saying which pharmacy claims go with which
medical claims? If ther isn't, then putting medical and pharmacy
claims in the same records is unwise: it asserts there's a
connection, when there isn't. If there is, then some form of
two-variable key may work, to put pharmacy claims with their
corresponding medical claims.

>The merge/combining process is straightforward when the number of
>IDs and cases per ID are identical for File1 and File2. The
>challenge is when there are differential cases per ID between the two files.

It's maybe straightforward to do, but the result may be wrong, as
I've said: you're connecting pharmacy claims to medical claims to
which they don't necessarily belong.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Combining Files with Duplicate Records

JosephYoungblood
In reply to this post by JosephYoungblood
I really appreciate everyone's contribution to this quandary.  After several
hours of head scratching the solution became depressingly obvious, my data
were stacked by ID.

Previously
For example, ID 001 had 8 separate entries for DxGrp_05

The goal
To have one entry per ID, and I need the new variable to be the sum of the
separate values, i.e., 8

I recoded DxGrp into 13 separate variables, (DxGrpr_01 thru DxGrpr_13), and
all data populated correctly.  However, I still have multiple entries for
each unique ID, and need to delete (clear) all duplicate IDs, thus I need to
RECODE the new vars such that the cell represents the sum of the values for
that particular DxGrpr.

Please advise.

Joseph A. Youngblood

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Combining Files with Duplicate Records

Maguin, Eugene
Is this a continuation of the last couple of weeks? Either way, how about giving some example data so that the incoming data structure and the desired data structure are clear to all. If this is a continuation, I can appreciate that this has been a frustrating problem for you but for us, probably not. Thus reengage us.
Gene Maguin

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Joseph Youngblood
Sent: Tuesday, April 15, 2014 12:16 PM
To: [hidden email]
Subject: Re: Combining Files with Duplicate Records

I really appreciate everyone's contribution to this quandary.  After several hours of head scratching the solution became depressingly obvious, my data were stacked by ID.

Previously
For example, ID 001 had 8 separate entries for DxGrp_05

The goal
To have one entry per ID, and I need the new variable to be the sum of the separate values, i.e., 8

I recoded DxGrp into 13 separate variables, (DxGrpr_01 thru DxGrpr_13), and all data populated correctly.  However, I still have multiple entries for each unique ID, and need to delete (clear) all duplicate IDs, thus I need to RECODE the new vars such that the cell represents the sum of the values for that particular DxGrpr.

Please advise.

Joseph A. Youngblood

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Combining Files with Duplicate Records

David Marso
Administrator
I believe I made this same request about a week ago!
Joseph, please save everyone some time and post the requested example!!!

Maguin, Eugene wrote
Is this a continuation of the last couple of weeks? Either way, how about giving some example data so that the incoming data structure and the desired data structure are clear to all. If this is a continuation, I can appreciate that this has been a frustrating problem for you but for us, probably not. Thus reengage us.
Gene Maguin

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Joseph Youngblood
Sent: Tuesday, April 15, 2014 12:16 PM
To: [hidden email]
Subject: Re: Combining Files with Duplicate Records

I really appreciate everyone's contribution to this quandary.  After several hours of head scratching the solution became depressingly obvious, my data were stacked by ID.

Previously
For example, ID 001 had 8 separate entries for DxGrp_05

The goal
To have one entry per ID, and I need the new variable to be the sum of the separate values, i.e., 8

I recoded DxGrp into 13 separate variables, (DxGrpr_01 thru DxGrpr_13), and all data populated correctly.  However, I still have multiple entries for each unique ID, and need to delete (clear) all duplicate IDs, thus I need to RECODE the new vars such that the cell represents the sum of the values for that particular DxGrpr.

Please advise.

Joseph A. Youngblood

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Combining Files with Duplicate Records

JosephYoungblood
In reply to this post by JosephYoungblood
Thank you David, and indeed you did suggest this.

Per a suggstion, I posted a refined post under the subject "Removing Duplicate
Records but Retaining ALl Data.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD