SPSSX Discussion

how to merge/compare across multiple datasets with duplicate IDs

Classic

List

Threaded

7 messages Options

Nico Peruzzi

how to merge/compare across multiple datasets with duplicate IDs

Hi listers,

I've got a pile of data that came in 3 datasets. They all have an ID
variable, however here's the trick.

dataset #1 has ID + some demographic variables (there is 1 case per ID)
dataset #2 has ID + some variables related to someone's account (note that
an ID can have more than 1 acocunt) (there is 1 case per account #)
dataset #3 has ID + vehicle variables (note that an ID can have more than 1
vehicle) (there is 1 case per vehicle ID)

here's an example of one deceptively simple-sounding chart I need to create:

Show frequencies of age ranges (comes from dataset #1) based on type of
account (comes from dataset #2)

here's another example:

Show frequencies for each vehicle type (comes from dataset #3) based on
account type (from #2) and ownership status (from #1)

All would be great if I could just merge on ID, but as I mentioned above the
only variable across all 3 datasets is ID, and there will be multiple
occurences of ID in datasets #2 and #3.

Any thoughts on how to work through or around this?

Thanks in advance, Nico

--
Nico Peruzzi, Ph.D.

Melissa Ives

Re: how to merge/compare across multiple datasets with duplicate IDs

Does dataset 3 have an indicator of which account type from dataset 2
would go with which car?
You could use a match files with a lookup table e.g.

Match files file=table2/table=table 1.

That will associate each (retained) value of table 1 with each record
from table 2. However matching 2 and 3 is more questionable since it is
now a many to many relationship.

Melissa

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Nico Peruzzi
Sent: Thursday, October 05, 2006 1:07 PM
To: [hidden email]
Subject: [SPSSX-L] how to merge/compare across multiple datasets with
duplicate IDs

Hi listers,

I've got a pile of data that came in 3 datasets. They all have an ID
variable, however here's the trick.

dataset #1 has ID + some demographic variables (there is 1 case per ID)
dataset #2 has ID + some variables related to someone's account (note
that an ID can have more than 1 acocunt) (there is 1 case per account #)
dataset #3 has ID + vehicle variables (note that an ID can have more
than 1
vehicle) (there is 1 case per vehicle ID)

here's an example of one deceptively simple-sounding chart I need to
create:

Show frequencies of age ranges (comes from dataset #1) based on type of
account (comes from dataset #2)

here's another example:

Show frequencies for each vehicle type (comes from dataset #3) based on
account type (from #2) and ownership status (from #1)

All would be great if I could just merge on ID, but as I mentioned above
the only variable across all 3 datasets is ID, and there will be
multiple occurences of ID in datasets #2 and #3.

Any thoughts on how to work through or around this?

Thanks in advance, Nico

--
Nico Peruzzi, Ph.D.

PRIVILEGED AND CONFIDENTIAL INFORMATION
This transmittal and any attachments may contain PRIVILEGED AND
CONFIDENTIAL information and is intended only for the use of the
addressee. If you are not the designated recipient, or an employee
or agent authorized to deliver such transmittals to the designated
recipient, you are hereby notified that any dissemination,
copying or publication of this transmittal is strictly prohibited. If
you have received this transmittal in error, please notify us
immediately by replying to the sender and delete this copy from your
system. You may also call us at (309) 827-6026 for assistance.

Nico Peruzzi

Re: how to merge/compare across multiple datasets with duplicate IDs

Melissa,

There's no indicator of account type in dataset 3, just car information.

But match files looks like a good thing for me to explore.

But is not possible to use it for a many to many match - as in dataset 2 to
3?

Thanks, Nico

On 10/5/06, Melissa Ives <[hidden email]> wrote:

>
> Does dataset 3 have an indicator of which account type from dataset 2
> would go with which car?
> You could use a match files with a lookup table e.g.
>
> Match files file=table2/table=table 1.
>
> That will associate each (retained) value of table 1 with each record
> from table 2. However matching 2 and 3 is more questionable since it is
> now a many to many relationship.
>
> Melissa
>
> -----Original Message-----
> From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
> Nico Peruzzi
> Sent: Thursday, October 05, 2006 1:07 PM
> To: [hidden email]
> Subject: [SPSSX-L] how to merge/compare across multiple datasets with
> duplicate IDs
>
> Hi listers,
>
> I've got a pile of data that came in 3 datasets. They all have an ID
> variable, however here's the trick.
>
> dataset #1 has ID + some demographic variables (there is 1 case per ID)
> dataset #2 has ID + some variables related to someone's account (note
> that an ID can have more than 1 acocunt) (there is 1 case per account #)
> dataset #3 has ID + vehicle variables (note that an ID can have more
> than 1
> vehicle) (there is 1 case per vehicle ID)
>
> here's an example of one deceptively simple-sounding chart I need to
> create:
>
> Show frequencies of age ranges (comes from dataset #1) based on type of
> account (comes from dataset #2)
>
> here's another example:
>
> Show frequencies for each vehicle type (comes from dataset #3) based on
> account type (from #2) and ownership status (from #1)
>
> All would be great if I could just merge on ID, but as I mentioned above
> the only variable across all 3 datasets is ID, and there will be
> multiple occurences of ID in datasets #2 and #3.
>
> Any thoughts on how to work through or around this?
>
> Thanks in advance, Nico
>
> --
> Nico Peruzzi, Ph.D.
>
>
>
> PRIVILEGED AND CONFIDENTIAL INFORMATION
> This transmittal and any attachments may contain PRIVILEGED AND
> CONFIDENTIAL information and is intended only for the use of the
> addressee. If you are not the designated recipient, or an employee
> or agent authorized to deliver such transmittals to the designated
> recipient, you are hereby notified that any dissemination,
> copying or publication of this transmittal is strictly prohibited. If
> you have received this transmittal in error, please notify us
> immediately by replying to the sender and delete this copy from your
> system. You may also call us at (309) 827-6026 for assistance.
>

--
Nico Peruzzi, Ph.D.

Richard Ristow

Re: how to merge/compare across multiple datasets

In reply to this post by Nico Peruzzi

At 02:06 PM 10/5/2006, Nico Peruzzi wrote:

>I've 3 datasets. They all have an ID variable, however here's the
>trick.
>
>dataset #1 has ID + some demographic variables (there is 1 case per
>ID)
>dataset #2 has ID + some variables related to someone's account (note
>that an ID can have more than 1 acocunt) (there is 1 case per account
>#)
>dataset #3 has ID + vehicle variables (note that an ID can have more
>than 1 vehicle) (there is 1 case per vehicle ID)
>
>here's an example of one deceptively simple-sounding chart I need to
>create:
>
>Show frequencies of age ranges (comes from dataset #1) based on type
>of account (comes from dataset #2)

As Melissa Ives said, this is straightforward with MATCH FILES. You do
need a /BY subcommand to make it work. And this requires that both
files be sorted by ID:

MATCH FILES
/TABLE=dataset1
/FILE=dataset2
/BY ID.

The resulting file has one record per *account*, not per person, and
frequency of age ranges by account type is based on total accounts.
That's probably fine; but it's important to keep track of your 'unit of
analysis,' in a study where there are several possibilities.

>Here's another example:
>
>Show frequencies for each vehicle type (comes from dataset #3) based
>on account type (from #2) and ownership status (from #1).

>But match files looks like a good thing for me to explore. But is not
>possible to use it for a many to many match - as in dataset 2 to 3?

No, MATCH FILES won't do a many-to-many match. The best I know for
many-to-many is the "long to wide to long" technique. But before
getting into it for you, this analysis bothers me in principle. If
George has three accounts and two cars, he'll get six records, one for
each pairing of car and account, and I'd be hard put to it to define
what they mean. And "ownership status" sounds like it should belong to
a car, not to the person overall.

So: you CAN get a dataset of all pairs of account and car for each
person, one record for each pair. Individuals with either no cars, or
no accounts, won't be in it. And once it's done, adding demographic
data is a one-to-many match, as above.

But right now, what that dataset means worries me more, than does how
to get it.

Maguin, Eugene

Re: how to merge/compare across multiple datasets with duplicate IDs

In reply to this post by Nico Peruzzi

Nico,

This is going to be a quick email because I working on something else. I
have to confess that I am not quite sure how to work this problem. I wonder
if one way might not be through the multiple response procedure(s). I have
not used that proc for many years and don't remember much about it. If it
turns out that mult response is useful to you, then you have the problem of
getting your datasets in the right format. As I understand your three
datasets, I think that you should convert datasets 2 and 3 from the present
'long' format to a 'wide' format. I'll assume you are familiar with how to
do this. Then match the three files together. This will give you a very wide
dataset with one record per id. That record will contain information 1 to m
accounts and from 1 to n vehicles.

The other scheme is making a new data set such that each id has a many
records as accounts * vehicles and with dataset 1 data spread over every
record. It seems like Richard Ristow has written about this operation but I
don't recall his procedures. I haven't done this but I think this will work
to make such a dataset.

Aggregate dataset2 and dataset 3, breaking each by id, and keep only the
record count. The result is the number of accounts and the number of
vehicles, respectively, per id. Match those two files and either multiply
the counts together and do a frequencies on the product or crosstab the two
counts. Either way note the largest value. Either way, you see the number of
records per id such that every account is matched with every vehicle. You

Using the just matched files, set up a loop structure to create some new
variables. Like this:

* replace <ds2_ds3max> with the maximum of the count products from
* the frequencies or the crosstab.
Vector ds2rec ds3rec(<ds2_ds3max>,f2.0).
Loop #i=1 to ds2count. /* accounts per id, dataset 2.
+ loop #j=1 to ds3count. /* vehicles per id, dataset 3.
+ compute #k=#i*#j.
+ compute ds2rec(#k)=#i.
+ compute ds3rec(#k)=#j.
+ end loop.
End loop.

Convert the resulting file from 'wide' to 'long'.
Save this file. Let it be called 'Stemfile'.

Save this file and go back to dataset 2 and then to dataset 3 and number the
account and vehicle records, respectively. In dataset2 this variable is
called ds2rec; in dataset3 this variable is called ds3rec. I'll assume you
know how to do this.

Lastly, match dataset 2 as the table file to Stemfile using id and ds2rec as
the by variables. Then match dataset 3 as the table file to the just created
file using id and ds3rec as by variables. Lastly, match dataset 1 as the
table file to the just created file using id ONLY as the by variable.

I haven't tried this but I think it will work.
Let me know what happens.

Gene Maguin

Simon Phillip Freidin

Re: how to merge/compare across multiple datasets with duplicate IDs

In reply to this post by Nico Peruzzi

convert the vehicle file to a single id using:
casestovars/id=id.

match this file as well
match files file=#1/table=#2/file=#3/by id.

then use varstocases to convert the vehicle variables back to
multiple cases, keeping all the vars on the first two datasets

BTW this is database problem not a stats problem.

Try MS Access/My SQL and do the report using select statements

S.

On 06/10/2006, at 4:06 AM, Nico Peruzzi wrote:

> Hi listers,
>
> I've got a pile of data that came in 3 datasets. They all have an ID
> variable, however here's the trick.
>
> dataset #1 has ID + some demographic variables (there is 1 case per
> ID)
> dataset #2 has ID + some variables related to someone's account
> (note that
> an ID can have more than 1 acocunt) (there is 1 case per account #)
> dataset #3 has ID + vehicle variables (note that an ID can have
> more than 1
> vehicle) (there is 1 case per vehicle ID)
>
> here's an example of one deceptively simple-sounding chart I need
> to create:
>
> Show frequencies of age ranges (comes from dataset #1) based on
> type of
> account (comes from dataset #2)
>
> here's another example:
>
> Show frequencies for each vehicle type (comes from dataset #3)
> based on
> account type (from #2) and ownership status (from #1)
>
> All would be great if I could just merge on ID, but as I mentioned
> above the
> only variable across all 3 datasets is ID, and there will be multiple
> occurences of ID in datasets #2 and #3.
>
> Any thoughts on how to work through or around this?
>
> Thanks in advance, Nico
>
> --
> Nico Peruzzi, Ph.D.

Nico Peruzzi

Re: how to merge/compare across multiple datasets

In reply to this post by Richard Ristow

I think what I'm learning from this exchange is to first double check that
all the proposed charts make sense, and then to explore Access as a possible
way to simplify the database aspects. The stats aren't the hard part here -
it's the database manipulation, and Access may be better suited.

Thanks for all the responses, Nico

On 10/5/06, Richard Ristow <[hidden email]> wrote:

>
> At 02:06 PM 10/5/2006, Nico Peruzzi wrote:
>
> >I've 3 datasets. They all have an ID variable, however here's the
> >trick.
> >
> >dataset #1 has ID + some demographic variables (there is 1 case per
> >ID)
> >dataset #2 has ID + some variables related to someone's account (note
> >that an ID can have more than 1 acocunt) (there is 1 case per account
> >#)
> >dataset #3 has ID + vehicle variables (note that an ID can have more
> >than 1 vehicle) (there is 1 case per vehicle ID)
> >
> >here's an example of one deceptively simple-sounding chart I need to
> >create:
> >
> >Show frequencies of age ranges (comes from dataset #1) based on type
> >of account (comes from dataset #2)
>
> As Melissa Ives said, this is straightforward with MATCH FILES. You do
> need a /BY subcommand to make it work. And this requires that both
> files be sorted by ID:
>
> MATCH FILES
> /TABLE=dataset1
> /FILE=dataset2
> /BY ID.
>
> The resulting file has one record per *account*, not per person, and
> frequency of age ranges by account type is based on total accounts.
> That's probably fine; but it's important to keep track of your 'unit of
> analysis,' in a study where there are several possibilities.
>
> >Here's another example:
> >
> >Show frequencies for each vehicle type (comes from dataset #3) based
> >on account type (from #2) and ownership status (from #1).
>
> >But match files looks like a good thing for me to explore. But is not
> >possible to use it for a many to many match - as in dataset 2 to 3?
>
> No, MATCH FILES won't do a many-to-many match. The best I know for
> many-to-many is the "long to wide to long" technique. But before
> getting into it for you, this analysis bothers me in principle. If
> George has three accounts and two cars, he'll get six records, one for
> each pairing of car and account, and I'd be hard put to it to define
> what they mean. And "ownership status" sounds like it should belong to
> a car, not to the person overall.
>
> So: you CAN get a dataset of all pairs of account and car for each
> person, one record for each pair. Individuals with either no cars, or
> no accounts, won't be in it. And once it's done, adding demographic
> data is a one-to-many match, as above.
>
> But right now, what that dataset means worries me more, than does how
> to get it.
>
>

--
Nico Peruzzi, Ph.D.