Hi listers,
I've got a pile of data that came in 3 datasets. They all have an ID variable, however here's the trick. dataset #1 has ID + some demographic variables (there is 1 case per ID) dataset #2 has ID + some variables related to someone's account (note that an ID can have more than 1 acocunt) (there is 1 case per account #) dataset #3 has ID + vehicle variables (note that an ID can have more than 1 vehicle) (there is 1 case per vehicle ID) here's an example of one deceptively simple-sounding chart I need to create: Show frequencies of age ranges (comes from dataset #1) based on type of account (comes from dataset #2) here's another example: Show frequencies for each vehicle type (comes from dataset #3) based on account type (from #2) and ownership status (from #1) All would be great if I could just merge on ID, but as I mentioned above the only variable across all 3 datasets is ID, and there will be multiple occurences of ID in datasets #2 and #3. Any thoughts on how to work through or around this? Thanks in advance, Nico -- Nico Peruzzi, Ph.D. |
Does dataset 3 have an indicator of which account type from dataset 2
would go with which car? You could use a match files with a lookup table e.g. Match files file=table2/table=table 1. That will associate each (retained) value of table 1 with each record from table 2. However matching 2 and 3 is more questionable since it is now a many to many relationship. Melissa -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Nico Peruzzi Sent: Thursday, October 05, 2006 1:07 PM To: [hidden email] Subject: [SPSSX-L] how to merge/compare across multiple datasets with duplicate IDs Hi listers, I've got a pile of data that came in 3 datasets. They all have an ID variable, however here's the trick. dataset #1 has ID + some demographic variables (there is 1 case per ID) dataset #2 has ID + some variables related to someone's account (note that an ID can have more than 1 acocunt) (there is 1 case per account #) dataset #3 has ID + vehicle variables (note that an ID can have more than 1 vehicle) (there is 1 case per vehicle ID) here's an example of one deceptively simple-sounding chart I need to create: Show frequencies of age ranges (comes from dataset #1) based on type of account (comes from dataset #2) here's another example: Show frequencies for each vehicle type (comes from dataset #3) based on account type (from #2) and ownership status (from #1) All would be great if I could just merge on ID, but as I mentioned above the only variable across all 3 datasets is ID, and there will be multiple occurences of ID in datasets #2 and #3. Any thoughts on how to work through or around this? Thanks in advance, Nico -- Nico Peruzzi, Ph.D. PRIVILEGED AND CONFIDENTIAL INFORMATION This transmittal and any attachments may contain PRIVILEGED AND CONFIDENTIAL information and is intended only for the use of the addressee. If you are not the designated recipient, or an employee or agent authorized to deliver such transmittals to the designated recipient, you are hereby notified that any dissemination, copying or publication of this transmittal is strictly prohibited. If you have received this transmittal in error, please notify us immediately by replying to the sender and delete this copy from your system. You may also call us at (309) 827-6026 for assistance. |
Melissa,
There's no indicator of account type in dataset 3, just car information. But match files looks like a good thing for me to explore. But is not possible to use it for a many to many match - as in dataset 2 to 3? Thanks, Nico On 10/5/06, Melissa Ives <[hidden email]> wrote: > > Does dataset 3 have an indicator of which account type from dataset 2 > would go with which car? > You could use a match files with a lookup table e.g. > > Match files file=table2/table=table 1. > > That will associate each (retained) value of table 1 with each record > from table 2. However matching 2 and 3 is more questionable since it is > now a many to many relationship. > > Melissa > > -----Original Message----- > From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of > Nico Peruzzi > Sent: Thursday, October 05, 2006 1:07 PM > To: [hidden email] > Subject: [SPSSX-L] how to merge/compare across multiple datasets with > duplicate IDs > > Hi listers, > > I've got a pile of data that came in 3 datasets. They all have an ID > variable, however here's the trick. > > dataset #1 has ID + some demographic variables (there is 1 case per ID) > dataset #2 has ID + some variables related to someone's account (note > that an ID can have more than 1 acocunt) (there is 1 case per account #) > dataset #3 has ID + vehicle variables (note that an ID can have more > than 1 > vehicle) (there is 1 case per vehicle ID) > > here's an example of one deceptively simple-sounding chart I need to > create: > > Show frequencies of age ranges (comes from dataset #1) based on type of > account (comes from dataset #2) > > here's another example: > > Show frequencies for each vehicle type (comes from dataset #3) based on > account type (from #2) and ownership status (from #1) > > All would be great if I could just merge on ID, but as I mentioned above > the only variable across all 3 datasets is ID, and there will be > multiple occurences of ID in datasets #2 and #3. > > Any thoughts on how to work through or around this? > > Thanks in advance, Nico > > -- > Nico Peruzzi, Ph.D. > > > > PRIVILEGED AND CONFIDENTIAL INFORMATION > This transmittal and any attachments may contain PRIVILEGED AND > CONFIDENTIAL information and is intended only for the use of the > addressee. If you are not the designated recipient, or an employee > or agent authorized to deliver such transmittals to the designated > recipient, you are hereby notified that any dissemination, > copying or publication of this transmittal is strictly prohibited. If > you have received this transmittal in error, please notify us > immediately by replying to the sender and delete this copy from your > system. You may also call us at (309) 827-6026 for assistance. > -- Nico Peruzzi, Ph.D. |
In reply to this post by Nico Peruzzi
At 02:06 PM 10/5/2006, Nico Peruzzi wrote:
>I've 3 datasets. They all have an ID variable, however here's the >trick. > >dataset #1 has ID + some demographic variables (there is 1 case per >ID) >dataset #2 has ID + some variables related to someone's account (note >that an ID can have more than 1 acocunt) (there is 1 case per account >#) >dataset #3 has ID + vehicle variables (note that an ID can have more >than 1 vehicle) (there is 1 case per vehicle ID) > >here's an example of one deceptively simple-sounding chart I need to >create: > >Show frequencies of age ranges (comes from dataset #1) based on type >of account (comes from dataset #2) As Melissa Ives said, this is straightforward with MATCH FILES. You do need a /BY subcommand to make it work. And this requires that both files be sorted by ID: MATCH FILES /TABLE=dataset1 /FILE=dataset2 /BY ID. The resulting file has one record per *account*, not per person, and frequency of age ranges by account type is based on total accounts. That's probably fine; but it's important to keep track of your 'unit of analysis,' in a study where there are several possibilities. >Here's another example: > >Show frequencies for each vehicle type (comes from dataset #3) based >on account type (from #2) and ownership status (from #1). >But match files looks like a good thing for me to explore. But is not >possible to use it for a many to many match - as in dataset 2 to 3? No, MATCH FILES won't do a many-to-many match. The best I know for many-to-many is the "long to wide to long" technique. But before getting into it for you, this analysis bothers me in principle. If George has three accounts and two cars, he'll get six records, one for each pairing of car and account, and I'd be hard put to it to define what they mean. And "ownership status" sounds like it should belong to a car, not to the person overall. So: you CAN get a dataset of all pairs of account and car for each person, one record for each pair. Individuals with either no cars, or no accounts, won't be in it. And once it's done, adding demographic data is a one-to-many match, as above. But right now, what that dataset means worries me more, than does how to get it. |
In reply to this post by Nico Peruzzi
Nico,
This is going to be a quick email because I working on something else. I have to confess that I am not quite sure how to work this problem. I wonder if one way might not be through the multiple response procedure(s). I have not used that proc for many years and don't remember much about it. If it turns out that mult response is useful to you, then you have the problem of getting your datasets in the right format. As I understand your three datasets, I think that you should convert datasets 2 and 3 from the present 'long' format to a 'wide' format. I'll assume you are familiar with how to do this. Then match the three files together. This will give you a very wide dataset with one record per id. That record will contain information 1 to m accounts and from 1 to n vehicles. The other scheme is making a new data set such that each id has a many records as accounts * vehicles and with dataset 1 data spread over every record. It seems like Richard Ristow has written about this operation but I don't recall his procedures. I haven't done this but I think this will work to make such a dataset. Aggregate dataset2 and dataset 3, breaking each by id, and keep only the record count. The result is the number of accounts and the number of vehicles, respectively, per id. Match those two files and either multiply the counts together and do a frequencies on the product or crosstab the two counts. Either way note the largest value. Either way, you see the number of records per id such that every account is matched with every vehicle. You Using the just matched files, set up a loop structure to create some new variables. Like this: * replace <ds2_ds3max> with the maximum of the count products from * the frequencies or the crosstab. Vector ds2rec ds3rec(<ds2_ds3max>,f2.0). Loop #i=1 to ds2count. /* accounts per id, dataset 2. + loop #j=1 to ds3count. /* vehicles per id, dataset 3. + compute #k=#i*#j. + compute ds2rec(#k)=#i. + compute ds3rec(#k)=#j. + end loop. End loop. Convert the resulting file from 'wide' to 'long'. Save this file. Let it be called 'Stemfile'. Save this file and go back to dataset 2 and then to dataset 3 and number the account and vehicle records, respectively. In dataset2 this variable is called ds2rec; in dataset3 this variable is called ds3rec. I'll assume you know how to do this. Lastly, match dataset 2 as the table file to Stemfile using id and ds2rec as the by variables. Then match dataset 3 as the table file to the just created file using id and ds3rec as by variables. Lastly, match dataset 1 as the table file to the just created file using id ONLY as the by variable. I haven't tried this but I think it will work. Let me know what happens. Gene Maguin |
In reply to this post by Nico Peruzzi
convert the vehicle file to a single id using:
casestovars/id=id. match this file as well match files file=#1/table=#2/file=#3/by id. then use varstocases to convert the vehicle variables back to multiple cases, keeping all the vars on the first two datasets BTW this is database problem not a stats problem. Try MS Access/My SQL and do the report using select statements S. On 06/10/2006, at 4:06 AM, Nico Peruzzi wrote: > Hi listers, > > I've got a pile of data that came in 3 datasets. They all have an ID > variable, however here's the trick. > > dataset #1 has ID + some demographic variables (there is 1 case per > ID) > dataset #2 has ID + some variables related to someone's account > (note that > an ID can have more than 1 acocunt) (there is 1 case per account #) > dataset #3 has ID + vehicle variables (note that an ID can have > more than 1 > vehicle) (there is 1 case per vehicle ID) > > here's an example of one deceptively simple-sounding chart I need > to create: > > Show frequencies of age ranges (comes from dataset #1) based on > type of > account (comes from dataset #2) > > here's another example: > > Show frequencies for each vehicle type (comes from dataset #3) > based on > account type (from #2) and ownership status (from #1) > > All would be great if I could just merge on ID, but as I mentioned > above the > only variable across all 3 datasets is ID, and there will be multiple > occurences of ID in datasets #2 and #3. > > Any thoughts on how to work through or around this? > > Thanks in advance, Nico > > -- > Nico Peruzzi, Ph.D. |
In reply to this post by Richard Ristow
I think what I'm learning from this exchange is to first double check that
all the proposed charts make sense, and then to explore Access as a possible way to simplify the database aspects. The stats aren't the hard part here - it's the database manipulation, and Access may be better suited. Thanks for all the responses, Nico On 10/5/06, Richard Ristow <[hidden email]> wrote: > > At 02:06 PM 10/5/2006, Nico Peruzzi wrote: > > >I've 3 datasets. They all have an ID variable, however here's the > >trick. > > > >dataset #1 has ID + some demographic variables (there is 1 case per > >ID) > >dataset #2 has ID + some variables related to someone's account (note > >that an ID can have more than 1 acocunt) (there is 1 case per account > >#) > >dataset #3 has ID + vehicle variables (note that an ID can have more > >than 1 vehicle) (there is 1 case per vehicle ID) > > > >here's an example of one deceptively simple-sounding chart I need to > >create: > > > >Show frequencies of age ranges (comes from dataset #1) based on type > >of account (comes from dataset #2) > > As Melissa Ives said, this is straightforward with MATCH FILES. You do > need a /BY subcommand to make it work. And this requires that both > files be sorted by ID: > > MATCH FILES > /TABLE=dataset1 > /FILE=dataset2 > /BY ID. > > The resulting file has one record per *account*, not per person, and > frequency of age ranges by account type is based on total accounts. > That's probably fine; but it's important to keep track of your 'unit of > analysis,' in a study where there are several possibilities. > > >Here's another example: > > > >Show frequencies for each vehicle type (comes from dataset #3) based > >on account type (from #2) and ownership status (from #1). > > >But match files looks like a good thing for me to explore. But is not > >possible to use it for a many to many match - as in dataset 2 to 3? > > No, MATCH FILES won't do a many-to-many match. The best I know for > many-to-many is the "long to wide to long" technique. But before > getting into it for you, this analysis bothers me in principle. If > George has three accounts and two cars, he'll get six records, one for > each pairing of car and account, and I'd be hard put to it to define > what they mean. And "ownership status" sounds like it should belong to > a car, not to the person overall. > > So: you CAN get a dataset of all pairs of account and car for each > person, one record for each pair. Individuals with either no cars, or > no accounts, won't be in it. And once it's done, adding demographic > data is a one-to-many match, as above. > > But right now, what that dataset means worries me more, than does how > to get it. > > -- Nico Peruzzi, Ph.D. |
Free forum by Nabble | Edit this page |