SPSSX Discussion

Obtaining a matched control group

Ivana

Obtaining a matched control group

Hi everyone

I desperately need help with generating a matched control group through SPSS(16). I have 1269 records of individuals with learning disability. 142 of these have a mental health problem (mhprob=1). The control group needs to be generated from the rest of the cases who do not have a mental health problem (mhprob=0) on 1:1 basis. The matching parameters are age, sex and a score (ABCTOT) which indicates the ability of the individuals to function independently (expressed as a percentage). I have tried applying the script in one of the answers on this forum:

http://spssx-discussion.1045642.n5.nabble.com/Sampling-question-How-to-draw-a-matched-control-group-td1086666.html

I cannot get it working at all. Please have in mind I am not much of an SPSS expert when it comes down to programming and scripts.

Many thanks

Ivana

David Marso

Re: Obtaining a matched control group

Administrator

"I have tried applying the script in one of the answers on this forum: "
Please help others help you! Which script?
There is initial reference to a rather sad piece of code
http://www.spsstools.net/Syntax/RandomSampling/findRandomPairsOfCasesWithSameCharacteristics.txt
Then Syntax by Albert-Jan Roskam
and an SPSS extension "CASECTRL"
What have you tried?
What errors do you receive?
"I can not get it working at all. " Is not that informative.

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

John F Hall

Re: Obtaining a matched control group

In reply to this post by Ivana

You could try something like this. Create two new data files, one for mhprob = 1 and the other for an equal-sized random sample of mhprob = 0, then use ADD FILES to generate a third data set with the same number of cases in each group. In syntax it would look something like this (untested; TEMPORARY makes the selection temporary, so SPSS reverts to the original file after each SAVE):

temporary.
select if mhprob = 1.
save outfile = 'file1.sav'.

temporary.
select if mhprob = 0.
sample 142 from 1127.
save outfile = 'file2.sav'.

This gives you two files of 142 cases each. (You could also use File > Save As.)

add files /file = 'file1.sav' /file = 'file2.sav'.

I'm not a statistician, so others may advise leaving your original file as is and using statistical procedures which don't need equal numbers of each group.

John Hall

[hidden email]

www.surveyresearch.weebly.com

Maguin, Eugene

Re: Obtaining a matched control group

In reply to this post by Ivana

Ivana,

So this is the code that you are referring to and will need to use. (Did you
understand that the first section of code was used to generate some example
data?). You said:

I have 1269 records of individuals with learning disability. 142
of these have a mental health problem (mhprob=1). The control group needs to
be generated from the rest of the cases who do not have a mental health
problem (mhprob=0) on 1:1 basis. The matching parameters are age, sex and a
score (ABCTOT) which indicates the ability of the individuals to function
independently (expressed as a percentage).

Here's how to convert the code to your instance. But there's some reading in
the syntax reference that will be helpful to understand what the commands
are doing.

* actual code.
compute random = rv.uniform(0,1).
sort cases by mhprob sex age abctot random.
aggr out = *
/ presorted
/ break = mhprob sex age abctot
/ dv1 to dv23=first(dv1 to dv23).
formats all (f5).

At this point you have pairs of cases (but see note following) that are
arranged so that the two cases in each pair are on separate lines. What you do
next depends on what you are going to do analytically. If you are going to
do paired t-tests you will need to restructure the data further. But, if you
are going to use independent sample t-tests, the data are ready for use.
NOTE. Before you do anything further you should carefully examine your data
to be sure that every treatment case (mhprob=1) has a match. You should be
seriously concerned about that since abctot is a percentage. This is where
you are going to have problems.
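
The check described in the note can be sketched outside SPSS. The following is a minimal Python illustration, not part of the code in this thread: it counts cases and controls per sex/age/abctot combination and reports every combination where the cases outnumber the available controls. The function name `unmatched_strata` and the toy records are assumptions made up for the example.

```python
from collections import Counter

def unmatched_strata(records, keys=("sex", "age", "abctot"), flag="mhprob"):
    """Return {stratum: (n_cases, n_controls)} where cases outnumber controls."""
    cases, controls = Counter(), Counter()
    for rec in records:
        stratum = tuple(rec[k] for k in keys)
        if rec[flag] == 1:
            cases[stratum] += 1
        else:
            controls[stratum] += 1
    # A Counter returns 0 for strata that never had a control.
    return {s: (n, controls[s]) for s, n in cases.items() if controls[s] < n}

# Hypothetical toy data: the second case has no control with abctot = 73.
records = [
    {"mhprob": 1, "sex": 1, "age": 20, "abctot": 50},
    {"mhprob": 0, "sex": 1, "age": 20, "abctot": 50},
    {"mhprob": 1, "sex": 2, "age": 31, "abctot": 73},
    {"mhprob": 0, "sex": 2, "age": 31, "abctot": 74},
]
shortfalls = unmatched_strata(records)
print(shortfalls)
```

An empty result would mean every case has at least one exact-match candidate. With a percentage score such as abctot, non-empty results are likely.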

If you are satisfied that every treatment case has an adequate match AND you
are doing, for example, paired t-tests, then you need to restructure your
data so that case and matched control are on the same record or line. This
next part does that.

sort cases by sex age abctot mhprob.
casestovars / id = sex age abctot / index = mhprob.
Execute.

From left to right the resulting file will have the match variables, dv1 to
dv23 for the controls followed by dv1 to dv23 for the treatment cases. The
.0 suffix indicating controls and the .1 suffix indicating treatment cases.

I don't know what this does. Get rid of it.
begin program.
import spss
spss.Submit("sample 41 from %s." % spss.GetCaseCount())
end program.
exe.

Gene Maguin

David Marso

Re: Obtaining a matched control group

Administrator

In reply to this post by John F Hall

John,
There was something crucial about "matched" control group...
I think we need to wait for the OP to tell us what was tried but didn't work.
Basic minimal code will require some SORTS and clever LAGS and TAGS.
Likely that exact matches will not be available for all requested attributes, so...
need to throw some fuzz into the mix.
D

Ivana

Re: Obtaining a matched control group

In reply to this post by David Marso

Sorry, this is what I tried to modify with not much luck

* seed, needed for reproducibility.
set rng=mt mtindex= 20090120.

* sample data.
input program.
loop #i=1 to 2000.
compute ses = trunc(rv.uniform(0, 5)).
compute age = trunc(rv.uniform(18, 45)).
compute sex = trunc(rv.uniform(1, 2.9)).
compute blah = rv.normal(1, 100).
compute bloh = rnd(rv.normal(1, 52)).
compute casecontr = trunc(rv.uniform(0,1.9)).
end case.
end loop.
end file.
end input program.
value labels casecontr 0 'control' 1 'case'.
variable label blah 'mysterious outcome var #1' / bloh 'mysterious outcome var #2'.

* actual code.
compute random = rv.uniform(0,1).
sort cases by casecontr sex age ses random.
aggr out = *
/ presorted
/ break = casecontr sex age ses
/ blah = first (blah) / bloh = first (bloh).
formats all (f5).
sort cases by sex age ses.
casestovars / id = sex age ses / index = casecontr.
begin program.
import spss
spss.Submit("sample 41 from %s." % spss.GetCaseCount())
end program.
exe.

Ivana

Re: Obtaining a matched control group

In reply to this post by Maguin, Eugene

Hurrah! This has worked! Thanks so much. I think I can take it from here.

Many thanks

Ivana

David Marso

Re: Obtaining a matched control group

Administrator

In reply to this post by Maguin, Eugene

Ivana and Gene,
Something really bothers me about the following:
"
* actual code.
compute random = rv.uniform(0,1).
sort cases by mhprob sex age abctot random.
aggr out = *
/ presorted
/ break = mhprob sex age abctot
/ dv1 to dv23=first(dv1 to dv23).
formats all (f5).
"
Consider a situation where you have multiple cases with the same desired matching profile and mhprob status. The AGGREGATE will lose the cases associated with data2 and data4.

mhprob sex age abctot data
0 1 20 .5 data1
0 1 20 .5 data2
1 1 20 .5 data3
1 1 20 .5 data4
-------
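
The loss can be reproduced with a small Python sketch. This is an illustration under assumptions, not SPSS code: the helper `first_per_group` is a made-up stand-in that mimics taking FIRST() within each break group, applied to the four example rows above.

```python
# The four example rows: (mhprob, sex, age, abctot, payload).
records = [
    (0, 1, 20, 0.5, "data1"),
    (0, 1, 20, 0.5, "data2"),
    (1, 1, 20, 0.5, "data3"),
    (1, 1, 20, 0.5, "data4"),
]

def first_per_group(rows):
    """Keep only the first payload per (mhprob, sex, age, abctot) key."""
    kept = {}
    for mhprob, sex, age, abctot, payload in rows:
        # setdefault stores the payload only the first time a key is seen.
        kept.setdefault((mhprob, sex, age, abctot), payload)
    return kept

kept = first_per_group(records)
lost = [row[4] for row in records if row[4] not in kept.values()]
print(lost)
```

Only one record per break group survives, so the second case and the second control in the same profile can never be paired.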
May need something a bit more complex and bullet proof .
--Here's my crack at it.
Rather than aggregating I do what I call a LAG and DRAG.
Note this hasn't been tested as I don't have SPSS immediately available.
If nothing else it should provide some insight into the complexities of the issue. Note the first pass obtains a random exact match on SEX AGE ABCTOT. The second on SEX and AGE.... etc.
This idea can be generalized to as many variables as needed.
----
* Making this up on the fly and no way to test without rebooting my box ;-(
The logic should suffice; there might be a misstep, but I believe it will work as is. Or someone will step up and correct my code.

*-----------
* First sort files by matching criteria*.
COMPUTE SCRAMBLER=UNIFORM(1).
COMPUTE PAIREDUP=0.
SORT CASES BY SEX AGE ABCTOT (A) SCRAMBLER mhprob (D) .
COMPUTE YOKE_ID=$CASENUM.
COMPUTE PAIRED=YOKE_ID.
* This will place cases with matching age sex abctot next to each other and tag them with a unique ID.
* Those with mhprob 0/1 randomly occurring within blocks of "matched cases" *.
* Now identify exact matches * .
DO IF SEX EQ LAG(SEX) AGE EQ LAG(AGE) AND AND ABCTOT=LAG(ABCTOT) AND mhprob EQ 0 AND LAG(mhprob) EQ 1.
COMPUTE PAIRED=LAG(YOKE_ID) .
COMPUTE MATE=YOKE_ID .
END IF.
* we have now something like this *.
matchedstuff mhprob yoke_id paired mate
xxxxxxxxxxx 1 4 4 .
xxxxxxxxxxx 0 5 4 5

SORT CASES BY YOKE_ID (D).
* we have now something like this *.
matchedstuff mhprob yoke_id paired mate
xxxxxxxxxxx 0 5 4 5
xxxxxxxxxxx 1 4 4 .

IF NOT (MISSING(LAG(MATE))) AND MISSING (MATE) MATE=LAG(MATE).
EXE.
* we have now something like this *.
matchedstuff mhprob yoke_id paired mate
xxxxxxxxxxx 0 5 4 5
xxxxxxxxxxx 1 4 4 5

DO IF NOT(MISSING(MATE)).
XSAVE OUTFILE "MATCHED1.SAV".
COMPUTE PAIREDUP=1.
ELSE.

END IF.

SELECT IF PAIREDUP=0.
MATCH FILES / FILE * / DROP SCRAMBLER PAIRED MATE .
*Every case in MATCHED1.SAV should be yoked to another case.
*Active file contains unmatched cases.

* Now repeat with relaxed criteria (ie not requiring exactly equal abctot).

COMPUTE SCRAMBLER=UNIFORM(1).
SORT CASES BY SEX AGE ABCTOT (A) SCRAMBLER mhprob (D) .
COMPUTE YOKE_ID=$CASENUM.
COMPUTE PAIRED=YOKE_ID.
* Now identify matches on AGE and SEX and tag CLOSEST ABCTOT* .
DO IF SEX EQ LAG(SEX) AND AGE EQ LAG(AGE) AND mhprob EQ 0 AND LAG(mhprob) EQ 1.
COMPUTE PAIRED=LAG(YOKE_ID).
COMPUTE MATE=YOKE_ID .
END IF.

SORT CASES BY YOKE_ID (D).
IF NOT (MISSING(LAG(MATE))) AND MISSING (MATE) MATE=LAG(MATE).
EXE.

DO IF NOT(MISSING(MATE)).
XSAVE OUTFILE "MATCHED2.SAV".
COMPUTE PAIREDUP=1.
ELSE.
END IF.
SELECT IF PAIREDUP=0.
*Matched2.sav contains exact matches on sex and age but possibly inexact on ABCTOT.

MATCH FILES / FILE * / DROP SCRAMBLER PAIRED MATE .

* Exercise for reader.... Adapt for relaxed criteria on age ;-)

Maguin, Eugene

Re: Obtaining a matched control group

David,

I'm glad you pointed out that possibility because I overlooked it in my
response. Thank you.

Ivana, this is something to check before you do the matching operation and
after you do the matching operation. Afterwards, the frequencies of mhprob=1
should match the frequencies of that value before matching. Actually, the
place to do the frequencies is after the aggregate and before the
casestovars.

Gene Maguin

David Marso

Re: Obtaining a matched control group

Administrator

In reply to this post by David Marso

Well, I'm glad Gene's code worked for Ivana but it has that fatal flaw
if there are sequences of exact matches.
I had a chance to test my code and there were a few typos (AND AND .. does not compute ;-).
Here is a revision (just the first main logic).
It could probably be made more efficient but I need to get dinner going.
Probably could lose an EXE but with the lags going I figured better safe than sorry and I don't have time to fine tune it.
HTH, David

** SIMULATION DATA **.
input program.
loop sex= 1 to 2.
loop #=1 to 100.
compute age=trunc(uniform(10)).
compute abctot = trunc(uniform(10))/10.
compute mhprob=1.
leave sex.
end case.
end loop.
end loop.
loop sex= 1 to 2.
loop #=1 to 1000.
compute age=trunc(uniform(10)).
compute abctot = trunc(uniform(10))/10.
compute mhprob=0.
leave sex.
end case.
end loop.
end loop.
end file.
end input program.
string datamark(a8).
COMPUTE datamark=CONCAT("DATA",STRING($CASENUM,N4)).
exe.

**
**RUN ONLY ONCE **.
COMPUTE YOKE_ID=$CASENUM.
COMPUTE PAIRED=YOKE_ID.
COMPUTE PAIREDUP=0.

**** REPEAT THIS CODE UNTIL ALL EXACT MATCHES HAVE BEEN DONE ***.
** CROSSTABS / TABLES SEX BY AGE BY ABCTOT BY MHPROB / CELLS = COUNT.
COMPUTE SCRAMBLE=UNIFORM(1).
SORT CASES BY PAIREDUP SEX AGE ABCTOT (A) SCRAMBLE mhprob (D) .
* This will place cases with matching age sex abctot next to each other and tag them with a unique ID.
* Those with mhprob 0/1 randomly occurring within blocks of "matched cases" *.
* Now identify exact matches * .

DO IF SEX EQ LAG(SEX) AND AGE EQ LAG(AGE) AND ABCTOT=LAG(ABCTOT) AND mhprob EQ 0 AND LAG(mhprob) EQ 1.
+ DO IF (NOT(PAIREDUP)).
+ COMPUTE PAIRED=LAG(YOKE_ID) .
+ COMPUTE MATE=YOKE_ID .
+ COMPUTE MATED=1.
+ END IF.
END IF.

SORT CASES BY PAIREDUP (A) PAIRED (D) MATE(D).

* we have now something like this *.
*matchedstuff mhprob yoke_id paired mate
*xxxxxxxxxxx 0 5 4 5
*xxxxxxxxxxx 1 4 4 .
*.

DO IF PAIRED=LAG(PAIRED) AND MISSING (MATE) AND NOT(PAIREDUP).
COMPUTE MATE=LAG(MATE).
COMPUTE MATED=1.
END IF.
EXE.

* we have now something like this *.
*matchedstuff mhprob yoke_id paired mate
*xxxxxxxxxxx 0 5 4 5
*xxxxxxxxxxx 1 4 4 5
*.
IF MATED PAIREDUP=1.
CROSSTABS TABLES PAIREDUP BY MHPROB.
freq pairedup.
** REPEAT UNTIL HAPPY!!! *.
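
For readers who find the LAG logic hard to follow, the repeated pass can be sketched outside SPSS. This is only an illustration in Python, not a translation of the syntax above: the record layout (dicts with `id`, `sex`, `age`, `abctot`, `mhprob`, `mate`) and the helper names are assumptions made for the example. Each pass sorts the still-unpaired records by stratum in random order and pairs a control that directly follows a case from the same stratum; passes repeat until no stratum holds both an unpaired case and an unpaired control.

```python
import random

def stratum(r):
    return (r["sex"], r["age"], r["abctot"])

def matches_remain(records):
    """True while some stratum still holds both an unpaired case and an unpaired control."""
    cases = {stratum(r) for r in records if r["mate"] is None and r["mhprob"] == 1}
    controls = {stratum(r) for r in records if r["mate"] is None and r["mhprob"] == 0}
    return bool(cases & controls)

def one_pass(records, rng):
    """Sort unpaired records by stratum (random order inside a stratum), then
    pair each control that directly follows a case from the same stratum."""
    unpaired = [r for r in records if r["mate"] is None]
    unpaired.sort(key=lambda r: stratum(r) + (rng.random(),))
    for prev, cur in zip(unpaired, unpaired[1:]):
        if stratum(prev) == stratum(cur) and prev["mhprob"] == 1 and cur["mhprob"] == 0:
            prev["mate"], cur["mate"] = cur["id"], prev["id"]

def match_exact(records, seed=0, max_passes=100):
    rng = random.Random(seed)
    for _ in range(max_passes):      # "repeat until happy"
        if not matches_remain(records):
            break
        one_pass(records, rng)
    return records

def rec(id_, sex, age, abctot, mhprob):
    return {"id": id_, "sex": sex, "age": age, "abctot": abctot,
            "mhprob": mhprob, "mate": None}

# Made-up data: two cases and three controls share a stratum; case 6 has no exact match.
demo = match_exact([
    rec(1, 1, 30, 50, 1), rec(2, 1, 30, 50, 1),
    rec(3, 1, 30, 50, 0), rec(4, 1, 30, 50, 0), rec(5, 1, 30, 50, 0),
    rec(6, 2, 40, 70, 1), rec(7, 2, 41, 70, 0),
])
for r in demo:
    print(r["id"], r["mhprob"], r["mate"])
```

A single pass can miss available matches (for example when every control in a stratum happens to sort ahead of the cases), which is why the pass is repeated.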

hillel vardi

Re: Obtaining a matched control group

In reply to this post by Ivana

Shalom
As David Marso points out, what you are looking for is more complicated
than what you stated.
The assumption that cases and controls are spread evenly along the file
is rarely met.
If you use AGGREGATE to form the matches, it is possible that some of the
groups won't have any control in them or, as David Marso points out, will have
more than one case in them.
Here is an example using only age to define the groups.

id  age  case/control
 1   11   1   >>> match
12   11   0   >>> match
14   12   1   >>> no match
23   14   1   >>> no match
31   15   1   >>> almost match
 7   16   0   >>> almost match
 2   16   0   >>> no match
 4   17   0   >>> no match
Here you may want to match 14 with 7, 14 with 2, and 15 with 17.
That kind of match is not possible using AGGREGATE.

To solve this kind of matching you can create a running sum, add 1 to
it whenever a case is met, and
subtract 1 when the first matching control is met.

Here is a general syntax (not tested):

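* "case" stands for the case/control indicator (mhprob in Ivana's data).
* The random tiebreaker has to exist before it can be used in the sort.
compute random = uniform(1).
sort cases by sex age abctot random.
numeric match_num run_sum (f4) .
* LEAVE keeps both counters from one record to the next.
leave match_num run_sum .
do if case eq 1 .
compute run_sum = sum( run_sum,1) .
compute match_num = sum(match_num,1) .
else if case eq 0 and run_sum gt 0 .
compute run_sum = sum(run_sum,-1) .
compute is_match= 1.
end if .
select if case eq 1 or is_match eq 1.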

This syntax will match the closest control AFTER the case, which may or
may not be a problem.
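
The running-sum idea can also be sketched in Python, purely as an illustration of the logic and not as tested SPSS. The function name and the dict layout (`id`, `age`, `case`) are assumptions for the example; the rows are the eight records from the age-only table, already sorted by age.

```python
def running_sum_match(rows):
    """rows must already be sorted by the matching variables.
    Keeps every case, plus one following control per case still waiting for a match."""
    pending = 0          # cases seen so far that still lack a control (run_sum)
    kept = []
    for row in rows:
        if row["case"] == 1:
            pending += 1
            kept.append(row)
        elif pending > 0:  # a control, and at least one case is waiting
            pending -= 1
            kept.append(row)
    return kept

rows = [
    {"id": 1, "age": 11, "case": 1}, {"id": 12, "age": 11, "case": 0},
    {"id": 14, "age": 12, "case": 1}, {"id": 23, "age": 14, "case": 1},
    {"id": 31, "age": 15, "case": 1}, {"id": 7, "age": 16, "case": 0},
    {"id": 2, "age": 16, "case": 0}, {"id": 4, "age": 17, "case": 0},
]
print([r["id"] for r in running_sum_match(rows)])
```

As in the syntax, a control is only ever taken from after the case in sort order, and a case with no later control is still kept, so the result may need a check for unmatched cases.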

Hillel Vardi
BGU

Ivana

RE: Obtaining a matched control group

In reply to this post by David Marso

Dear David

I have tested this precisely as you listed, and it worked beautifully. I am so grateful for your time and effort. My thanks also to the many other people who have replied.

Best wishes

Ivana

___________________________

Dr Ivana Dojcinov, MD MRCPsych

Date: Wed, 30 Mar 2011 13:50:55 -0700
From: [hidden email]
To: [hidden email]
Subject: Re: Obtaining a matched control group

Ivana

Re: Obtaining a matched control group

In reply to this post by David Marso

Dear David

I've just tested this and it worked beautifully. Thank you ever so much. My thanks also to all other people who have spared time and effort to help me.

Kind regards

Ivana

David Marso

RE: Obtaining a matched control group (A final Nail)

Administrator

In reply to this post by Ivana

Hi Ivana,
You are very welcome!
I was thinking about this further after an interesting email from Gene regarding sequences (similar to Hillel Vardi's post last night). I came up with the following tidbit, which is much easier than my previous post and has the added feature of being almost completely intuitive. Another nice benefit is that it does not require a SORT, and in my tests it is a KEEPER ;-).

COMPUTE ID=$CASENUM.
COMPUTE SCRAMBL=UNIFORM(1).
RANK SCRAMBL BY SEX AGE ABCTOT mhPROB.
IF MHPROB=0 ID0=ID.
IF MHPROB=1 ID1=ID.
AGGREGATE OUTFILE * / BREAK sex age ABCTOT rscrambl /id0 id1=max(id0 id1).
COMPUTE MATCH=NOT(MISSING(ID1)) AND NOT(MISSING(ID0)).
FREQ MATCH.

Comments:
RANK is able to construct 'counters' BY strata without the relevant cases being contiguous. NICE.
After the AGGREGATE the file will have the strata variables (and the paired IDs, ID0 and ID1) but not the MHPROB variable. No problem, since this information is implied by the presence/absence of ID0 and ID1.
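
A rough Python sketch of the same idea, for anyone who wants to see the pairing logic without SPSS. It is an illustration under assumed names (`rank_pair`, records as dicts with `id`, `sex`, `age`, `abctot`, `mhprob`), not the syntax above: within each stratum, cases and controls are put in random order separately, and the record at a given position in one list is paired with the record at the same position in the other.

```python
import random
from collections import defaultdict

def rank_pair(records, seed=0):
    """Return (case_id, control_id) pairs matched exactly on sex, age and abctot."""
    rng = random.Random(seed)
    groups = defaultdict(lambda: {0: [], 1: []})
    for r in records:
        groups[(r["sex"], r["age"], r["abctot"])][r["mhprob"]].append(r["id"])
    pairs = []
    for by_status in groups.values():
        rng.shuffle(by_status[0])   # random "rank" among controls
        rng.shuffle(by_status[1])   # random "rank" among cases
        # zip stops at the shorter list: a rank present in only one group stays unmatched
        pairs.extend(zip(by_status[1], by_status[0]))
    return pairs

def rec(id_, sex, age, abctot, mhprob):
    return {"id": id_, "sex": sex, "age": age, "abctot": abctot, "mhprob": mhprob}

sample = [
    rec(1, 1, 30, 50, 1), rec(2, 1, 30, 50, 1), rec(3, 1, 30, 50, 0),
    rec(4, 2, 40, 70, 1), rec(5, 2, 40, 70, 0), rec(6, 2, 40, 70, 0),
    rec(7, 2, 41, 70, 0),
]
print(rank_pair(sample))
```

Records left over (a second case in a stratum with one control, or a control in a stratum with no case) simply do not appear in the result, which corresponds to MATCH=0 in the aggregated file.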

Taking it further:
One could segregate the MATCH cases into a separate file, delete them from the working file, and then rerun the code after doing a VARSTOCASES (i.e. restoring ID from ID0 and ID1). In this case I would probably COMPUTE a random variable and sort on it, then use a variant of the RANK such as:
RANK ABCTOT BY SEX AGE mhPROB (may need to specify TIES to deal with duplicate values in ABCTOT?).
This would build RANKS of ABCTOT within the strata and a later AGGREGATE would group them together as previously (fuzzy match within the ranked values of ABCTOT).
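
The rank-based fuzzy match can be sketched in Python as follows. This is an illustration only, with an assumed function name and record layout; ties are left in input order here, which is one possible way to handle them and not necessarily what a TIES setting in RANK would do.

```python
from collections import defaultdict

def rank_abctot_pairs(records):
    """Within each sex/age stratum, pair the k-th lowest ABCTOT among cases
    with the k-th lowest ABCTOT among controls. Returns (case_id, control_id, gap)."""
    groups = defaultdict(lambda: {0: [], 1: []})
    for r in records:
        groups[(r["sex"], r["age"])][r["mhprob"]].append(r)
    pairs = []
    for by_status in groups.values():
        cases = sorted(by_status[1], key=lambda r: r["abctot"])      # stable sort: ties keep input order
        controls = sorted(by_status[0], key=lambda r: r["abctot"])
        for case, control in zip(cases, controls):
            pairs.append((case["id"], control["id"], abs(case["abctot"] - control["abctot"])))
    return pairs

people = [
    {"id": 1, "sex": 1, "age": 30, "abctot": 40, "mhprob": 1},
    {"id": 2, "sex": 1, "age": 30, "abctot": 60, "mhprob": 1},
    {"id": 3, "sex": 1, "age": 30, "abctot": 42, "mhprob": 0},
    {"id": 4, "sex": 1, "age": 30, "abctot": 75, "mhprob": 0},
    {"id": 5, "sex": 1, "age": 30, "abctot": 58, "mhprob": 0},
]
print(rank_abctot_pairs(people))
```

Equal rank does not guarantee similar ABCTOT values, so the sketch returns the gap for each pair; pairs with a large gap would probably need to be reviewed or dropped.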

NOTE: In contrast to Gene's example I do not spread the data elements, I just store the IDs. To map the data to the IDs will simply require a VARSTOCASES to make the file long -That's all you need to carry-
SORT CASES BY ID
MATCH FILES into the SORTED detail level file.
Hope this helps,
David

Jon K Peck

Re: Obtaining a matched control group (A final Nail)

Although it wasn't stated in the original post, it sounded to me like one of the match variables was continuous and therefore, exact matches would be unlikely. In that case you would need a tolerance factor in order to get a match. FUZZY, of course, handles all of this.
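
To make the idea of a tolerance concrete, here is a minimal Python sketch. It is not the FUZZY extension and does not use its syntax; the function name, the record layout and the choice of which variables are exact are all assumptions for the example. Sex and age must match exactly, ABCTOT must be within `tol` percentage points, and each control is used at most once.

```python
import random

def tolerance_match(cases, controls, tol, seed=0):
    """Return {case_id: control_id}; cases with no eligible control are left out."""
    rng = random.Random(seed)
    available = list(controls)
    pairs = {}
    for case in cases:
        eligible = [c for c in available
                    if c["sex"] == case["sex"] and c["age"] == case["age"]
                    and abs(c["abctot"] - case["abctot"]) <= tol]
        if eligible:
            pick = rng.choice(eligible)   # any eligible control is acceptable
            available.remove(pick)        # 1:1 matching, so use each control once
            pairs[case["id"]] = pick["id"]
    return pairs

cases = [
    {"id": 1, "sex": 1, "age": 30, "abctot": 55},
    {"id": 2, "sex": 1, "age": 30, "abctot": 56},
]
controls = [
    {"id": 10, "sex": 1, "age": 30, "abctot": 58},
    {"id": 11, "sex": 1, "age": 30, "abctot": 70},
    {"id": 12, "sex": 2, "age": 30, "abctot": 55},
]
print(tolerance_match(cases, controls, tol=5))
```

ABCTOT is kept as whole percentage points in the example so that the tolerance comparison is not affected by floating-point rounding.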

Contrary to what I recalled earlier, FUZZY should work with version 16 (but no earlier one). The clue is the one-word name of the extension command, which is a limitation in V16.

Jon Peck
Senior Software Engineer, IBM
[hidden email]
312-651-3435

From: David Marso <[hidden email]>
To: [hidden email]
Date: 03/31/2011 07:56 AM
Subject: Re: [SPSSX-L] Obtaining a matched control group (A final Nail)
Sent by: "SPSSX(r) Discussion" <[hidden email]>

Albert-Jan Roskam

Re: Obtaining a matched control group (A final Nail)

This has nothing to do with Fuzzy itself, but is the following code fragment used in conjunction with gettext?
    #enable localization
    global _
    try:
        _("---")
    except:
        def _(msg):
            return msg

Cheers!!
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

From: Jon K Peck <[hidden email]>
To: [hidden email]
Sent: Thu, March 31, 2011 4:04:30 PM
Subject: Re: [SPSSX-L] Obtaining a matched control group (A final Nail)

Jon K Peck

Re: Obtaining a matched control group (A final Nail)

Jon Peck
Senior Software Engineer, IBM
[hidden email]
312-651-3435

From: Albert-Jan Roskam <[hidden email]>
To: [hidden email]
Date: 03/31/2011 01:45 PM
Subject: Re: [SPSSX-L] Obtaining a matched control group (A final Nail)
Sent by: "SPSSX(r) Discussion" <[hidden email]>

Hi Jon,

Very interesting. I didn't know that extension command. From the docstring:
h is the current demander case hash
case is the current supplier case
return is
- 0 if no match
- 1 if fuzzy match
- 2 if exact match

Why only these discrete values and not values [0-1]? A better distinction could then be made between different candidate record pairs. Also, I wonder if it isn't a big penalty if a possible match is considered a non-match if one of the linkage vars is missing?

>>>When I designed this, I felt that a missing value should not be considered as a match with anything - there is no information. If someone wants different behavior, they can change the missing values temporarily.
>>>As for the metric, in order to provide a distance for the mismatch, there has to be some metric defined, so the user would have to provide that. Of course, that only applies when not using an exact match. In the case of categorical variables, this could be pretty messy. And the user might well want to weight variables differently. If a user wanted to provide a code fragment that calculated a distance, I could use that, but it would be hard for a user to get it right IMO.

The second problem here is that one might then want to minimize the total error in the matches, and that is a large integer programming problem that would require a substantially different approach to matching. Other than the EXACTPRIORITY keyword, FUZZY picks at random from among all cases that satisfy the fuzz criteria. If it picked the best match among the eligible ones, it would be giving priority to cases that are earlier in the file, and this could introduce a subtle bias in the matching behavior if the cases are not in random order (there are some comments about this in the documentation). Even with the current behavior, there is potential for this problem to occur in a milder way, which is why there is a SHUFFLE keyword to combat it, but that increases the time and memory requirements.
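
The order effect described here can be shown with a toy Python example. It is a sketch under assumptions (a single numeric variable, a made-up `greedy_best` function) and says nothing about how FUZZY is implemented: when each case in turn takes its closest eligible control, the cases that come first get the first choice, and the number of pairs found can depend on file order.

```python
def greedy_best(cases, controls, tol):
    """Each case, in file order, takes the closest still-available control within tol."""
    available = list(controls)
    pairs = []
    for case in cases:
        eligible = [c for c in available if abs(c["abctot"] - case["abctot"]) <= tol]
        if eligible:
            best = min(eligible, key=lambda c: abs(c["abctot"] - case["abctot"]))
            available.remove(best)
            pairs.append((case["id"], best["id"]))
    return pairs

case_a = {"id": "A", "abctot": 50}
case_b = {"id": "B", "abctot": 56}
pool = [{"id": "X", "abctot": 52}, {"id": "Y", "abctot": 47}]

# A first: A takes X (its closest), and B is left with nothing within tolerance.
print(greedy_best([case_a, case_b], pool, tol=5))
# B first: B takes X, and A can still take Y.
print(greedy_best([case_b, case_a], pool, tol=5))
```

Finding the assignment with the smallest total error would mean considering all cases together, which is the larger optimization problem mentioned above.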

This has nothing to do with Fuzzy itself, but is the following code fragment used in conjunction with gettext?
    #enable localization
    global _
    try:
        _("---")
    except:
        def _(msg):
            return msg
>>>
I added some automatic setup for translations to the extensions.py module in, IIRC, version 18. Since most of the extension commands also work with V17 and might not have the updated extensions.py module, the code above checks to see whether the _ function is defined and generates an identity function if not. There are some subtleties with _ that are explained in the extension module code. We write all the Python extension commands to be translatable now, even though many are not currently translated. Documentation on how this works is in the extension command doc.

Thanks for the comments.

Cheers!!
Albert-Jan

hillel vardi

Re: Obtaining a matched control group (A final Nail)

In reply to this post by David Marso

Shalom

After thinking over all the other answers I am quite sure that using AGGREGATE,
LAG or RANK will not work.
The reason is that the assumption that there will be controls in
all the groups is not met in all situations.
Here is an example using David Marso's program (I only reduced the
number of cases to 8 and controls to 20).

input program.
loop sex= 1 to 2.
loop #=1 to 4.
compute age=trunc(uniform(10)).
compute abctot = trunc(uniform(10))/10.
compute mhprob=1.
leave sex.
end case.
end loop.
end loop.
loop sex= 1 to 2.
loop #=1 to 10.
compute age=trunc(uniform(10)).
compute abctot = trunc(uniform(10))/10.
compute mhprob=0.
leave sex.
end case.
end loop.
end loop.
end file.
end input program.
string datamark(a8).
COMPUTE datamark=CONCAT("DATA",STRING($CASENUM,N4)).
exe.
COMPUTE ID=$CASENUM.
COMPUTE SCRAMBL=UNIFORM(1).
RANK SCRAMBL BY SEX AGE ABCTOT mhPROB.
IF MHPROB=0 ID0=ID.
IF MHPROB=1 ID1=ID.
AGGREGATE OUTFILE * / BREAK sex age ABCTOT rscrambl /id0 id1=max(id0 id1).
COMPUTE MATCH=NOT(MISSING(ID1)) AND NOT(MISSING(ID0)).
FREQ MATCH.

Hillel Vardi
BGU
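
The point about empty groups can be checked with a small simulation. The Python sketch below is an assumption-laden stand-in for the input program above (same 2 x 10 x 10 strata, but Python's random generator and made-up function names): it draws cases and controls at random and counts how many cases could get an exact match at all.

```python
import random
from collections import Counter

def simulate(cases_per_sex, controls_per_sex, seed):
    """Return (cases that can be matched exactly, total cases)."""
    rng = random.Random(seed)

    def draw(sex):
        # age 0..9 and abctot 0.0..0.9, like trunc(uniform(10)) and trunc(uniform(10))/10
        return (sex, rng.randrange(10), rng.randrange(10) / 10)

    cases = Counter(draw(sex) for sex in (1, 2) for _ in range(cases_per_sex))
    controls = Counter(draw(sex) for sex in (1, 2) for _ in range(controls_per_sex))
    # In each stratum at most min(cases, controls) exact pairs are possible.
    matched = sum(min(n, controls[s]) for s, n in cases.items())
    return matched, sum(cases.values())

print(simulate(4, 10, seed=1))       # 8 cases, 20 controls
print(simulate(100, 1000, seed=1))   # 200 cases, 2000 controls
```

With 200 possible strata and only 20 controls, most strata contain no control, so few of the 8 cases can be expected to find an exact match; the larger run gives a sense of how the picture changes with sample size.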

David Marso

Re: Obtaining a matched control group (A final Nail)

Administrator

I really wouldn't expect ANYTHING to work well with those sample sizes
and distributions ;-)
My code should be pretty much usable for reasonably large samples.
How does Jon's Fuzzy do with this data?
