|
Hi List,
I thought someone on this list might know the answer to my problem and that others might benefit from this question as well, so here goes:

I have two data files in SPSS format, one of which is information from medical patients collected onsite (N = 150) and the other is publicly available data from a large epidemiological study of people in the United States (N ~ 20,000). I would like to match patients in the publicly available dataset to those in my experimental dataset to create a matched control group using key demographic variables such as gender, ethnicity, and age. While gender and ethnicity should be exact matches, the age matches may be +/- 1 year. Does anyone know how to do this using syntax?

Ideally, the program/syntax would also create a variable for the matched cases that states which ID number ('pid') from the original dataset it matches, so that I am able to doublecheck the match/program.

I have not (yet) learned Python, so although I can use Python (I've downloaded the appropriate Python add-ons), I do not yet have the ability to write Python code. Maybe this is my opportunity to learn…. I'm running SPSS 15 on Windows Vista.

Thanks for your help!

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD |
|
Hi Daryl,
you can try to match using the propensity score, with demographic variables as covariates. Here is an example from Ray's site:

http://www.spsstools.net/Syntax/RandomSampling/MatchCasesOnBasisOfPropensityScores.txt

Hope that helps,
Vlad

On Nov 21, 2007 2:40 AM, Daryl Schrock <[hidden email]> wrote:
> Hi List,
>
> I thought someone on this list might know the answer to my problem and
> that others might benefit from this question as well, so here goes:
>
> I have two data files in SPSS format, one of which is information from
> medical patients collected onsite (N = 150) and the other is publicly
> available data from a large epidemiological study of people in the
> United States (N ~ 20,000). I would like to match patients in the
> publicly available dataset to those in my experimental dataset to
> create a matched control group using key demographic variables such as
> gender, ethnicity, and age. While gender and ethnicity should be exact
> matches, the age matches may be +/- 1 year. Does anyone know how to do
> this using syntax?
>
> Ideally, the program/syntax would also create a variable for the
> matched cases that states which ID number ('pid') from the original
> dataset it matches, so that I am able to doublecheck the match/program.
>
> I have not (yet) learned Python, so although I can use Python (I've
> downloaded the appropriate Python add-ons), I do not yet have the
> ability to write Python code. Maybe this is my opportunity to learn….
> I'm running SPSS 15 on Windows Vista.
>
> Thanks for your help!

--
Vlad Simion
Data Analyst
Tel: +40 0751945296 |
|
In reply to this post by Daryl Schrock
Daryl,
It may be that the propensity scoring method suggested by Vlad is the better method for your problem. I can't comment on the relative merits of propensity vs. exact match methods. Perhaps others can.

A) This is, at a minimum, a one-to-many match and, possibly, a many-to-many match. The first thing to do is to determine which of the two it is.

1) Open the patient dataset and make a string age range variable.

   String agerange(a4).
   Compute agerange=concat(string(trunc(age)-1,f2.0),string(trunc(age)+1,f2.0)).

2) Aggregate the patient dataset by your matching variables and count the number of cases with each combination. Call it NPatient.
3) Do frequencies on the count variable. If the count for every combination is 1, the match will be one-to-many; if any count is 2 or more, the match will be many-to-many.
4) If a combination frequency is 2 or more, aggregate again by the matching variables and keep a count variable. Call it NPatient.
5) Save a copy of this aggregated file; call it 'AggrPatient'.

B) Check that all matches are possible. That is, it may be that the patient file has a combination that is not represented at all, or is represented an insufficient number of times, in the survey database.

1) Open the survey file and make a string age range variable as for the patient file.
2) Aggregate the file by your matching variables and count the number of cases with each combination. Use a different variable name for the count in this file than you used in the patient file. Call it NSurvey.
3) Match the AggrPatient file to this aggregated file by your match variables.
4) Do a frequencies on NSurvey. If you have missing values, then you have cases in the patient file that have no match in the survey file. This is a problem.
5) If the match is going to be many-to-many, crosstab NPatient and NSurvey. Check for cases where NPatient is 2 or more and NSurvey is less than NPatient. Of course you could also do this with a temporary, select if, and list. If there are any such cases, this is also a problem.

C) So now you know the lay of the matching landscape. Now, for the match itself.

1) Open the patient dataset and make a string age range variable.
2) Run this piece of code to number the cases within each combination and then save the file to a new name; call it NewPatient.

   Sort cases by sex ethnicity agerange.
   Compute sequence=1.
   If (sex eq lag(sex) and ethnicity eq lag(ethnicity) and agerange eq lag(agerange)) sequence=lag(sequence)+1.

3) Open the survey dataset and make a string age range variable.
4) Run this piece of code to add a random number to each record.

   Compute rannum=uniform(1).

5) Run this piece of code to number the cases within each combination and then save the file to a new name; call it NewSurvey.

   Sort cases by sex ethnicity agerange rannum.
   Compute sequence=1.
   If (sex eq lag(sex) and ethnicity eq lag(ethnicity) and agerange eq lag(agerange)) sequence=lag(sequence)+1.

6) Do a match files on NewSurvey and NewPatient with the by variables being the matching variables AND sequence. Use In subcommands for NewPatient and for NewSurvey to mark which cases have data from which input file.
7) Crosstab the two variables specified on the In subcommands, and examine this crosstab carefully if either of the problems in B4) or B5) turned up.
8) Once you are satisfied with things, select on the In subcommand variable for NewPatient. That is your desired file.

Good luck,
Gene Maguin |
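For anyone who wants to try this in Python, as Daryl mentioned: below is a minimal sketch of the matching rule from the original question (exact on sex and ethnicity, age within 1 year, each control used at most once, picked at random in the spirit of step C). It is not SPSS syntax and runs on plain lists of dicts outside SPSS. The function name match_controls, the survey ID field sid, and the age_tol and seed parameters are made up for illustration; only pid, sex, ethnicity, and age come from the thread. It compares ages directly with a tolerance rather than through an agerange key, since two agerange strings built from trunc(age) appear to be equal only when the truncated ages themselves are equal.

```python
import random
from collections import defaultdict


def match_controls(patients, survey, age_tol=1, seed=0):
    """Greedy 1:1 matching sketch (hypothetical helper, not an SPSS API).

    patients: list of dicts with keys 'pid', 'sex', 'ethnicity', 'age'
    survey:   list of dicts with keys 'sid', 'sex', 'ethnicity', 'age'
              ('sid' is an assumed name for the survey record ID)
    Returns (matched_controls, unmatched_pids).
    """
    rng = random.Random(seed)  # fixed seed so a run can be reproduced

    # Group survey records by the variables that must match exactly.
    pool = defaultdict(list)
    for rec in survey:
        pool[(rec["sex"], rec["ethnicity"])].append(rec)

    # Shuffle each group once so the control picked for a patient is random.
    for recs in pool.values():
        rng.shuffle(recs)

    matched, unmatched = [], []
    for pat in patients:
        candidates = pool.get((pat["sex"], pat["ethnicity"]), [])
        for i, rec in enumerate(candidates):
            if abs(rec["age"] - pat["age"]) <= age_tol:
                # Remove the control from the pool so it is used only once,
                # and record which patient it was matched to.
                control = dict(candidates.pop(i))
                control["matched_pid"] = pat["pid"]
                matched.append(control)
                break
        else:
            # No unused control satisfied all three criteria.
            unmatched.append(pat["pid"])
    return matched, unmatched
```

Each returned control carries a matched_pid field, which corresponds to the check variable Daryl asked for, and unmatched_pids plays the role of the feasibility check in part B. The assignment is greedy (patients are handled in file order), so it is not guaranteed to find the largest possible set of matches when controls are scarce.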
