SPSSX Discussion

Probability matching of two files

Muir Houston-2

Probability matching of two files

Hi all,
I have two datasets. The first is a baseline of school pupils containing the
usual suspects (dob, gender, postcode (zip in the US) and school name) plus a
motivational inventory and a number of items which ask about career
influence and future plans.

The second dataset contains dob, gender, postcode and school and was
collected at various events or activities related to a career in the
health sector from pupils drawn from the first sample.

What I would like to do is match respondents from the second dataset
to the first on the basis of probability matching. I think I need to
create a vector of log odds relating to the probability of each
component of a record (my variables noted above: gender, dob, postcode
and school name) being a match. So, if birth date matches in a comparison
of records from each dataset, this would provide one score or weight in
the vector. The other variables (gender, postcode and school name)
would also be scored as a match or a non-match, so a
vector of weights for all four variables would be formed.
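
The weight vector described above is close to what is usually called the Fellegi-Sunter approach to record linkage. A minimal Python sketch of the idea follows. The m and u probabilities, the field names, and the sample records are all invented for illustration; in practice they would have to be estimated from the data.

```python
import math

# Hypothetical probabilities, for illustration only:
#   m = P(field agrees | the two records are the same person)
#   u = P(field agrees | the two records are different people)
FIELD_PROBS = {
    "dob": (0.95, 0.003),
    "gender": (0.98, 0.5),
    "postcode": (0.90, 0.01),
    "school": (0.95, 0.05),
}


def weight_vector(rec_a, rec_b, probs=FIELD_PROBS):
    """Return one log2 odds weight per field for a pair of records."""
    weights = {}
    for field, (m, u) in probs.items():
        if rec_a[field] == rec_b[field]:
            # Agreement: positive weight whenever m > u.
            weights[field] = math.log2(m / u)
        else:
            # Disagreement: negative weight whenever m > u.
            weights[field] = math.log2((1 - m) / (1 - u))
    return weights


def match_score(rec_a, rec_b, probs=FIELD_PROBS):
    """Sum the per-field weights into a single score for the pair."""
    return sum(weight_vector(rec_a, rec_b, probs).values())


# Invented records: the event record differs from the baseline on postcode.
baseline = {"dob": "1993-04-02", "gender": "F",
            "postcode": "AB1 2CD", "school": "Northside High"}
event = {"dob": "1993-04-02", "gender": "F",
         "postcode": "AB1 2CE", "school": "Northside High"}

print(weight_vector(baseline, event))
print(match_score(baseline, event))
```

Pairs whose total score is above some upper threshold would be treated as matches, pairs below a lower threshold as non-matches, and those in between would be reviewed by hand. Where to put the thresholds is a judgment call that this sketch does not address.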

Any ideas how to go about this? My command of syntax, although evolving,
is not up to this yet!

Or references?

Thanks
Muir

Dr M. Houston
DACE
University of Glasgow
0141-330-4699

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Peck, Jon

Re: Probability matching of two files

There is no built-in way to do probability matching, but there is an extension command (usable with version 16 or 17) that will do case-control exact matching. You can specify a set of variables that must match exactly, and it will sample randomly for one or more cases from those that match exactly on the specified variables. The command is CASECTRL, and it can be downloaded from SPSS Developer Central. It requires the Python programmability plug-in, but no knowledge of Python is needed to use it.

If an exact match can't be found, the matching case will be, natch, missing. Sometimes collapsing fine-grained variables into slightly broader categories is enough to produce a match.
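
To illustrate the idea (not the CASECTRL implementation, which I am not reproducing here), below is a rough Python sketch of exact matching with random selection among the candidates, plus one way of collapsing a fine-grained variable. The function names, the `area` field and the records are made up for the example; the only outside fact used is that the outward part of a UK postcode is the part before the space.

```python
import random
from collections import defaultdict


def exact_match(cases, controls, keys, seed=0):
    """Pair each case with one randomly chosen control that agrees on all keys.

    Controls are used at most once. A case with no candidate gets None.
    """
    rng = random.Random(seed)
    pool = defaultdict(list)
    for ctrl in controls:
        pool[tuple(ctrl[k] for k in keys)].append(ctrl)

    matches = []
    for case in cases:
        candidates = pool.get(tuple(case[k] for k in keys), [])
        if candidates:
            chosen = rng.choice(candidates)
            candidates.remove(chosen)  # sample without replacement
        else:
            chosen = None  # no exact match available
        matches.append((case, chosen))
    return matches


def outward_code(postcode):
    """Collapse a full UK postcode to its outward part, e.g. 'AB1 2CD' -> 'AB1'."""
    return postcode.split()[0]


def with_area(record):
    """Return a copy of the record with a broader 'area' variable added."""
    return dict(record, area=outward_code(record["postcode"]))


# Invented records for illustration.
cases = [{"id": 1, "gender": "F", "postcode": "AB1 2CD"}]
controls = [{"id": 10, "gender": "F", "postcode": "AB1 2CE"},
            {"id": 11, "gender": "M", "postcode": "AB1 2CD"}]

# Matching on the full postcode finds nothing for case 1.
strict = exact_match(cases, controls, ["gender", "postcode"])

# Matching on the collapsed area finds control 10.
loose = exact_match([with_area(c) for c in cases],
                    [with_area(c) for c in controls],
                    ["gender", "area"])
print(strict[0][1], loose[0][1])
```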

HTH,
Jon Peck

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Muir Houston
Sent: Tuesday, November 18, 2008 3:19 AM
To: [hidden email]
Subject: [SPSSX-L] Probability matching of two files


Khaleel Hussaini

Re: Probability matching of two files

I have tried the CDC's Link Plus probabilistic matching software. The
documentation is self-explanatory, so you could try using that
software. The link is below.

http://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm

On Tue, Nov 18, 2008 at 8:10 AM, Peck, Jon <[hidden email]> wrote:


Rolf Pfister

AW: Re: Probability matching of two files

Muir,

There is also an R package called "MatchIt". You can integrate it into SPSS via the SPSS R plug-in.

You can find more here:
http://gking.harvard.edu/matchit/

HTH
Rolf

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Khaleel Hussaini
Sent: Tuesday, 18 November 2008 22:58
To: [hidden email]
Subject: Re: Probability matching of two files


Albert-Jan Roskam

Re: AW: Re: Probability matching of two files

Hi,

I used Febrl (Freely Extensible Biomedical Record Linkage), which is Python-based. It has a huge range of comparison functions, and the code has many comments, which makes it easier to read. The latest version is GUI-based. There's also a program called Link King, which is SAS-based. Both are free.

And while we're at it... I have heard of SQL-based 'LIKE' (%) linkage programs, but I wouldn't know where to get them. Does anybody happen to know where I can find more info on this?
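
For what it's worth, the basic idea behind LIKE-based linkage can be tried with nothing more than Python's built-in sqlite3 module. The sketch below is only a guess at what such programs do: it joins two tables on an exact dob and a school name that starts with the (possibly abbreviated) name in the other table. The table names, column names and rows are all invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE baseline (id INTEGER, dob TEXT, school TEXT);
CREATE TABLE events   (id INTEGER, dob TEXT, school TEXT);

-- Invented rows for illustration.
INSERT INTO baseline VALUES (1, '1993-04-02', 'Northside High School');
INSERT INTO baseline VALUES (2, '1993-04-02', 'Southbank Secondary School');
INSERT INTO events   VALUES (10, '1993-04-02', 'Northside');
INSERT INTO events   VALUES (11, '1992-11-30', 'Southbank');
""")

# Exact match on dob, prefix match on school via LIKE and the % wildcard.
# In SQLite, || concatenates strings and LIKE ignores ASCII case by default.
rows = conn.execute("""
SELECT e.id AS event_id, b.id AS baseline_id
FROM events AS e
JOIN baseline AS b
  ON b.dob = e.dob
 AND b.school LIKE e.school || '%'
ORDER BY e.id, b.id
""").fetchall()

print(rows)
conn.close()
```

Note that this is deterministic pattern matching, not probabilistic linkage: a pair either satisfies the LIKE pattern or it does not, and no weights are involved.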

Cheers!!
Albert-Jan

--- On Wed, 11/19/08, Rolf Pfister <[hidden email]> wrote:

> From: Rolf Pfister <[hidden email]>
> Subject: AW: Re: Probability matching of two files
> To: [hidden email]
> Date: Wednesday, November 19, 2008, 10:25 AM

la volta statistics

AW: AW: Re: Probability matching of two files

Hi

You may also look at example #14 (Syntax/Random sampling) in Raynald's
SPSS Tools (http://www.spsstools.net/)

http://www.spsstools.net/Syntax/RandomSampling/MatchCasesOnBasisOfPropensityScores.txt

Perhaps this helps, Christian

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Albert-Jan Roskam
Sent: Wednesday, 19 November 2008 11:46
To: [hidden email]
Subject: Re: AW: Re: Probability matching of two files


Muir Houston-2

Re: AW: AW: Re: Probability matching of two files

Thanks for all your suggestions. I now have plenty to occupy my time in finding a solution.

Cheers
Muir

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of la volta statistics
Sent: 19 November 2008 14:28
To: [hidden email]
Subject: AW: AW: Re: Probability matching of two files


Dennis Deck

Re: AW: AW: Re: Probability matching of two files

In reply to this post by Muir Houston-2

Link King (Campbell, et al.) requires access to SAS. The match rate is
quite good. See
http://www.uclaisap.org/slides/caloms/nov2007/day2/Campbell2.pdf for an
overview and a reference to an article about it. Kevin has made it
available at no cost (www.the-link-king.com). It applies both
probabilistic and deterministic linking strategies.

One alternative is Link Plus, downloadable free from CDC's web site.
It is easy to use, reasonably well documented, and performs well.
It can read SPSS files. It does not require any other software.
http://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm

CSAT hosted a project to link large data sets a while back.
http://csat.samhsa.gov/IDBSE/idb/modules/linking/material/linking.pdf

Dennis Deck, PhD
RMC Research Corporation
111 SW Columbia Street, Suite 1200
Portland, Oregon 97201-5843
voice: 503-223-8248 x715
voice: 800-788-1887 x715
fax: 503-223-8399
[hidden email]
