SPSSX Discussion

Matching cases to create matched control group

Daryl Schrock

Matching cases to create matched control group

Hi List,

I thought someone on this list might know the answer to my problem and that
others might benefit from this question as well, so here goes:

I have two data files in SPSS format, one of which is information from
medical patients collected onsite (N = 150) and the other is publicly
available data from a large epidemiological study of people in the United
States (N ~ 20,000). I would like to match patients in the publicly
available dataset to those in my experimental dataset to create a matched
control group using key demographic variables such as gender, ethnicity, and
age. While gender and ethnicity should be exact matches, the age matches may
be +/- 1 year. Does anyone know how to do this using syntax?

Ideally, the program/syntax would also create a variable for the matched
cases that states which ID number ('pid') from the original dataset it
matches, so that I am able to doublecheck the match/program.

I have not (yet) learned Python, so although I can use Python (I've
downloaded the appropriate Python add-ons), I do not yet have the ability to
write Python code. Maybe this is my opportunity to learn... I'm running SPSS
15 on Windows Vista.

Thanks for your help!

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

vlad simion

Re: Matching cases to create matched control group

Hi Daryl,

you can try to match using the propensity score, with demographic variables
as covariates.
Here is an example from Ray's site:

http://www.spsstools.net/Syntax/RandomSampling/MatchCasesOnBasisOfPropensityScores.txt

Hope that helps,
Vlad

On Nov 21, 2007 2:40 AM, Daryl Schrock <[hidden email]> wrote:

> Hi List,
>
> I thought someone on this list might know the answer to my problem and
> that others might benefit from this question as well, so here goes:
>
> I have two data files in SPSS format, one of which is information from
> medical patients collected onsite (N = 150) and the other is publicly
> available data from a large epidemiological study of people in the United
> States (N ~ 20,000). I would like to match patients in the publicly
> available dataset to those in my experimental dataset to create a matched
> control group using key demographic variables such as gender, ethnicity,
> and age. While gender and ethnicity should be exact matches, the age
> matches may be +/- 1 year. Does anyone know how to do this using syntax?
>
> Ideally, the program/syntax would also create a variable for the matched
> cases that states which ID number ('pid') from the original dataset it
> matches, so that I am able to doublecheck the match/program.
>
> I have not (yet) learned Python, so although I can use Python (I've
> downloaded the appropriate Python add-ons), I do not yet have the ability
> to write Python code. Maybe this is my opportunity to learn... I'm running
> SPSS 15 on Windows Vista.
>
> Thanks for your help!

--
Vlad Simion
Data Analyst
Tel: +40 0751945296


Maguin, Eugene

Re: Matching cases to create matched control group

In reply to this post by Daryl Schrock

Daryl,

It may be that the propensity scoring method suggested by Vlad is the better
method for your problem. I can't comment on the relative merits of
propensity vs. exact-match methods. Perhaps others can.

A) This is, at a minimum, a one-to-many match and, possibly, a many-to-many
match. The first thing to do is to determine whether it is a one-to-many or
a many-to-many.

1) Open and make a string age range variable in the patient dataset.

String agerange(a4).
Compute agerange=concat(string(trunc(age)-1,f2.0),string(trunc(age)+1,f2.0)).

2) Aggregate the patient dataset by your matching variables and count the
number of cases with each combination. Call it NPatient.

3) Do frequencies on the count variable. If every combination has a count
of 1, the match will be one-to-many; if any combination has a count of 2 or
more, the match will be many-to-many.

4) If a combination frequency is 2 or more, aggregate again by the matching
variables and keep a count variable. Call it NPatient.

5) Save a copy of this aggregated file, call it 'AggrPatient'.
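
For anyone who wants to see the counting logic of part A outside SPSS, here is a minimal Python sketch. It is only an illustration under assumptions: the records, the field names (pid, sex, ethnicity, age) and the helper names are hypothetical, and the key uses the truncated age directly rather than the agerange string, which identifies the same groups.

```python
from collections import Counter

# Hypothetical patient records; the field names are assumptions for illustration.
patients = [
    {"pid": 1, "sex": "F", "ethnicity": "A", "age": 34.6},
    {"pid": 2, "sex": "F", "ethnicity": "A", "age": 34.1},
    {"pid": 3, "sex": "M", "ethnicity": "B", "age": 51.0},
]


def match_key(record):
    """Return the (sex, ethnicity, truncated age) combination used for matching."""
    return (record["sex"], record["ethnicity"], int(record["age"]))


def count_combinations(records):
    """Count how many records share each combination (the NPatient of step 2)."""
    return Counter(match_key(r) for r in records)


npatient = count_combinations(patients)

# If any combination is held by 2 or more patients, the match is many-to-many.
many_to_many = any(n >= 2 for n in npatient.values())

for key, n in sorted(npatient.items()):
    print(key, n)
print("many-to-many:", many_to_many)
```

The Counter plays the role of the aggregated file: one entry per combination, with the count as its value.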

B) Checking that all matches are possible. That is, it may be that the
patient file has a combination that is not represented at all or an
insufficient number of times in the survey database.

1) Open the survey file and make a string age range variable as for the
patient file.

2) Aggregate the file by your matching variables and count the number of
cases with each combination. Use a different variable name for the count in
this file than you used in the patient file. Call it NSurvey.

3) Match the AggrPatient file to this aggregated file by your match
variables.

4) Do a frequencies on NSurvey. If you have missing values, then you have
cases in the patient file that have no match in the survey file. This is a
problem.

5) If the match is going to be a many-to-many, crosstab NPatient and
NSurvey. Check for cases where NPatient is 2 or more and NSurvey is less
than NPatient. Of course you could also do this with a temporary, select if,
and list. If there are any such cases, this is also a problem.
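
The two checks in part B (no survey cases at all, or fewer survey cases than patients) come down to comparing the two sets of counts. A rough Python sketch, assuming the counts are already available as dictionaries keyed by (sex, ethnicity, truncated age); the sample counts and the function name find_shortfalls are made up for illustration:

```python
from collections import Counter


def find_shortfalls(npatient, nsurvey):
    """Return {combination: (needed, available)} where the survey has too few cases."""
    return {
        key: (needed, nsurvey.get(key, 0))
        for key, needed in npatient.items()
        if nsurvey.get(key, 0) < needed
    }


# Hypothetical counts per combination in each file.
npatient = Counter({("F", "A", 34): 2, ("M", "B", 51): 1, ("M", "C", 60): 1})
nsurvey = Counter({("F", "A", 34): 1, ("M", "B", 51): 40})

shortfalls = find_shortfalls(npatient, nsurvey)
for key, (needed, available) in sorted(shortfalls.items()):
    print(key, "needed", needed, "available", available)
```

A combination with 0 available corresponds to the missing NSurvey values in step 4; one with fewer available than needed corresponds to the crosstab check in step 5.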

C) So now you know the lay of the matching landscape. Now, for the match
itself.

1) Open the patient dataset and make a string age range variable.

2) Run this piece of code to number the combinations and then save the file
to a new name, call it NewPatient.

Sort cases by sex ethnicity agerange.
Compute sequence=1.
If (sex eq lag(sex) and ethnicity eq lag(ethnicity) and
    agerange eq lag(agerange)) sequence=lag(sequence)+1.

3) Open the survey dataset and make a string age range variable.

4) Run this piece of code to add a random number to each record.

Compute rannum=uniform(1).

5) Run this piece of code to number the combinations and then save the file
to a new name, call it NewSurvey.

Sort cases by sex ethnicity agerange rannum.
Compute sequence=1.
If (sex eq lag(sex) and ethnicity eq lag(ethnicity) and
    agerange eq lag(agerange)) sequence=lag(sequence)+1.

6) Do a match files on NewSurvey and NewPatient with the by variables being
the matching variables AND sequence. Use an In subcommand for NewPatient and
another for NewSurvey to mark which cases have data from which input file.

7) Crosstab the two variables specified on the In subcommands and examine
this crosstab carefully if either B4) or B5) turned up a problem.

8) Once you are satisfied with things, select on the In subcommand variable
for NewPatient. That is your desired file.
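
The whole of part C can be sketched in Python as well. This is a minimal illustration, not tested against real data: the records, the survey identifier field "id", and the function names are hypothetical. Shuffling the survey records before numbering them stands in for sorting on the random number, and the (combination, sequence) pair stands in for the by variables of the match files step. The output also records which pid each control was paired with.

```python
import random
from collections import defaultdict


def match_key(record):
    """Return the (sex, ethnicity, truncated age) combination used for matching."""
    return (record["sex"], record["ethnicity"], int(record["age"]))


def number_within_groups(records):
    """Map (combination, sequence) -> record, numbering from 1 within each combination."""
    counters = defaultdict(int)
    numbered = {}
    for rec in records:
        key = match_key(rec)
        counters[key] += 1
        numbered[(key, counters[key])] = rec
    return numbered


def match_controls(patients, survey, seed=0):
    """Pair each patient with a randomly ordered survey case of the same combination."""
    rng = random.Random(seed)
    shuffled = list(survey)
    rng.shuffle(shuffled)  # random order within each combination
    numbered_survey = number_within_groups(shuffled)
    matches = []
    for (key, seq), patient in number_within_groups(patients).items():
        control = numbered_survey.get((key, seq))  # None if the survey runs out
        matches.append(
            {"pid": patient["pid"], "control_id": control["id"] if control else None}
        )
    return matches


# Hypothetical data for illustration only.
patients = [
    {"pid": 1, "sex": "F", "ethnicity": "A", "age": 34.6},
    {"pid": 2, "sex": "F", "ethnicity": "A", "age": 34.1},
    {"pid": 3, "sex": "M", "ethnicity": "B", "age": 51.0},
]
survey = [
    {"id": 101, "sex": "F", "ethnicity": "A", "age": 34.9},
    {"id": 102, "sex": "F", "ethnicity": "A", "age": 34.2},
    {"id": 103, "sex": "F", "ethnicity": "A", "age": 34.0},
    {"id": 104, "sex": "M", "ethnicity": "B", "age": 51.5},
    {"id": 105, "sex": "M", "ethnicity": "C", "age": 60.0},
]

matches = match_controls(patients, survey)
for m in matches:
    print(m)
```

Because the key uses the truncated age, this sketch pairs only cases with the same age in whole years; allowing +/- 1 year would need a looser comparison than an exact key match.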

Good luck,
Gene Maguin
