SPSSX Discussion

Exact matching from a larger population

ronnieslijkhuis

Exact matching from a larger population

Hello,

I've conducted an Internet survey among the general population in the Netherlands, and I'm trying to study whether the Internet is a valid mode for distributing questionnaires. To do so, I'm comparing the results with those of a paper-based questionnaire. The problem, however, is that, due to a low response rate in the Internet survey, the group sizes of the two modes are totally unbalanced (80 vs. 8000). Exact matching can be applied to deal with this kind of problem: each case in the smaller group is matched to a case in the larger group on key variables. In my case these are health, age, gender, income, ethnicity, current employment, and education (all of these variables are categorical). The goal is to have two groups that are equal in size and on these characteristics, so that only the mode of distribution and the answers given differ, ready for comparison through, for instance, ANOVAs.
My problem is that I have no idea how to match the populations... I hope one of you can help me out here. If it's not exactly clear what I'm trying to do, please let me know.

Regards,
Ronnie

Maguin, Eugene

Re: Exact matching from a larger population

Ronnie,

Your problem has two parts. One part is the actual match. First, you have to
make a matching variable. Normally, this variable is an id variable, which
is just part of the data. Here, it won't be. You mention a number of
candidate variables. You'll need to select a subset of matching variables.
I'd recommend that you combine these variables to make a new variable. For
example if you selected sex, age, education as matching variables, the new
variable would be the combination of those three. One way to do this is to
create a string variable.

String match(a5).
Compute match=concat(string(sex,f1.0),string(age,f2.0),string(education,f2.0)).
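To see why the fixed widths matter, here is a minimal sketch in Python (not SPSS syntax, and only an illustration). The function name make_key and the widths 1, 2, 2 are assumptions taken from the example formats above; it assumes each field is right-justified in its width, as the f1.0 and f2.0 formats suggest.

```python
def make_key(sex, age, education):
    """Build a 5-character key from three integer codes (hypothetical helper)."""
    # Each field gets a fixed width (1, 2, 2), padded with leading blanks,
    # so different combinations cannot run together into the same string.
    return f"{sex:1d}{age:2d}{education:2d}"


def naive_key(sex, age, education):
    """Same idea without padding, shown only to illustrate the collision."""
    return f"{sex}{age}{education}"


# Two different cases that collide without padding but not with it.
a = (1, 12, 3)
b = (1, 1, 23)
print(naive_key(*a) == naive_key(*b))  # the unpadded keys are identical
print(make_key(*a) == make_key(*b))    # the padded keys differ
```

The same key has to be built the same way in both files, otherwise nothing will match.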

The second part is that you may have multiple matching possibilities for a
given case in the small file. Thus you may have to somehow select a single
case from a set of possible matches. The way to do this is to compute a
random draw from a uniform distribution for each case in the big file, sort
by that variable within possible matches and select the case with the first
(smallest) value. In syntax, assuming the big file is open and the matching
variable is called match, this is:

Compute draw=uniform(1).

Sort cases by match draw.

* rec counts cases within each value of match; the first case gets 1.
Compute rec=1.
If (match eq lag(match)) rec=lag(rec)+1.

Select if (rec eq 1).

This defines your randomly selected matching big dataset.
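The logic of those steps (random draw, sort within key, keep the first case) can be sketched outside SPSS as well. The Python below is only an illustration under stated assumptions: cases are dictionaries, the key lives in a field called "match", and pick_one_per_key and seed are hypothetical names.

```python
import random


def pick_one_per_key(cases, key="match", seed=None):
    """Return one randomly chosen case per distinct key value."""
    rng = random.Random(seed)
    # Attach a uniform random draw to every case (the 'draw' variable).
    drawn = [(case[key], rng.random(), i) for i, case in enumerate(cases)]
    # Sort by key, then by draw; the index only breaks exact ties.
    drawn.sort()
    picked = []
    previous_key = object()  # sentinel that equals no real key
    for k, _, i in drawn:
        if k != previous_key:  # first case within this key, i.e. rec = 1
            picked.append(cases[i])
            previous_key = k
    return picked


big = [{"id": n, "match": k} for n, k in enumerate(["a", "a", "b", "c", "c", "c"])]
selected = pick_one_per_key(big, seed=1)
print([c["match"] for c in selected])
```

Each key appears once in the result, so the reduced file can serve as a lookup table for the small file.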

Now, just put the two files together in the usual way (MATCH FILES), with
match as the 'by' variable. Both files must contain match and be sorted by it.

Gene Maguin

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Richard Ristow

Re: Exact matching from a larger population

A comment on one technique --

At 03:08 PM 3/19/2008, Gene Maguin wrote:

>First, you have to make a matching variable. You mention a number of
>candidate variables. You'll need to select a subset of matching
>variables. I'd recommend that you combine these variables to make a
>new variable. For example, [from] sex, age, education, [you could]
>create a string variable.
>
>String match(a5).
>Compute match=concat(string(sex,f1.0),string(age,f2.0),string(education,f2.0)).

Something like this is fairly common advice. Why? All the SPSS
commands that use keys (MATCH FILES, ADD FILES, AGGREGATE, maybe I've
missed some) accept multi-variable keys. Instead of computing the new
variable (which must be done in both files), why not just

MATCH FILES
/FILE=*
/FILE=OTHER
/BY sex age education.
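The idea of a multi-variable key carries over to other tools too. As a rough Python illustration (not a reimplementation of MATCH FILES), a tuple of the key variables can serve directly as the lookup key. The name match_by is hypothetical, and the sketch assumes the big file has already been reduced to at most one case per key combination.

```python
def match_by(small, big, by):
    """Pair each small-file case with the big-file case sharing its key.

    Assumes `big` holds at most one case per key combination;
    if it holds more, the last one seen wins.
    """
    # The tuple of key variables is the key; no combined string is needed.
    index = {tuple(case[v] for v in by): case for case in big}
    pairs = []
    for case in small:
        key = tuple(case[v] for v in by)
        pairs.append((case, index.get(key)))  # None when nothing matches
    return pairs


small = [{"sex": 1, "age": 3, "education": 2}, {"sex": 2, "age": 5, "education": 1}]
big = [{"sex": 1, "age": 3, "education": 2, "id": 17}]
for mine, partner in match_by(small, big, ["sex", "age", "education"]):
    print(mine, "->", partner)
```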

>The second part is that you may have multiple matching possibilities
>for a given case in the small file. [Then, you should select one of
>the matching possibilities at random.]

The third part, unfortunately, is that you may have no matches at all
for some case in the small file. There are 100 cases in the large
file for every one in the small file; but once you filter down by,
say, sex, age and education, the mean cell size can be pretty small,
with a significant chance that a cell is empty. Then, you need a
strategy for an "almost exact" match, opening a bigger can of worms.

>The way to [select at random] is to compute a random draw from a
>uniform distribution for each case in the big file, sort by that
>variable within possible matches and select the case with the first
>(smallest) value.

Or, recently, Jim Marks showed how to do this with RANK instead of sorting(*):

>COMPUTE selector = UNIFORM (1).
>
>** change "grp" to match your grouping variable.
>RANK selector BY grp /RANK INTO select_order.
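For readers who want to see what the ranking step computes, here is a rough Python sketch (an illustration only, not Jim's code). It assigns each case its rank on a random draw within its group and leaves the case order untouched. The names rank_within_groups and seed are hypothetical; ties in the draw are ignored as practically impossible.

```python
import random
from collections import defaultdict


def rank_within_groups(cases, group="grp", seed=None):
    """Return, per case, its rank (1 = smallest draw) within its group."""
    rng = random.Random(seed)
    selector = [rng.random() for _ in cases]  # one uniform draw per case
    members_of = defaultdict(list)
    for i, case in enumerate(cases):
        members_of[case[group]].append(i)
    select_order = [0] * len(cases)
    for members in members_of.values():
        # Rank the draws within the group; the cases are not reordered.
        ordered = sorted(members, key=lambda j: selector[j])
        for rank, i in enumerate(ordered, start=1):
            select_order[i] = rank
    return select_order


cases = [{"grp": "a"}, {"grp": "a"}, {"grp": "b"}, {"grp": "a"}]
order = rank_within_groups(cases, seed=3)
chosen = [c for c, r in zip(cases, order) if r == 1]  # one case per group
print(order, len(chosen))
```

Keeping the cases with rank 1 gives one random case per group; keeping ranks up to k would give k of them.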
.....................
(*)Date: Mon, 17 Mar 2008 10:44:45 -0500
From: "Marks, Jim" <[hidden email]>
Subject: Re: Computation help
To: [hidden email]
