Restructuring a fuzzy matched data set

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Restructuring a fuzzy matched data set

Anton-24
Dear Listers,

I am trying to unduplicate a file using a fuzzy match approach via the CDC
software LinkPlus.  The issue I'm having is trying to restructure the data
set so all Ids that match appear on one row in the.  For example LinkPlus
spits out a data set (actually a report) that has all pairs of matches:

Match#         ID  Name
1  27 Carl
1  42 Carl
2  27 Carl
2  53 Carl
3  42 Carl
3  53 Carl
4  18 Sue
4  99 Sue


I'd love to have a data set that looks like this:

Match1 Match2 Match3
27  42  53
18  99

I can use the lag function to create the row for Sue, but I'm afraid that I
need multiple lag functions and/or some kind of looping action to solve the
"Carl" issue.

Note:  I don't have Python and the number of match records can exceed the
given example.

Any help is much appreciated.

Anton

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Restructuring a fuzzy matched data set

Maguin, Eugene
Anton,

>>I am trying to unduplicate a file using a fuzzy match approach via the CDC
software LinkPlus.  The issue I'm having is trying to restructure the data
set so all Ids that match appear on one row in the.  For example LinkPlus
spits out a data set (actually a report) that has all pairs of matches:

Match#         ID  Name
1  27 Carl
1  42 Carl
2  27 Carl
2  53 Carl
3  42 Carl
3  53 Carl
4  18 Sue
4  99 Sue

I'd love to have a data set that looks like this:

Match1 Match2 Match3
27  42  53
18  99


I think I'd work the problem this way.

Sort cases by name id.

Compute dups=0.
If (id eq lag(id)) dups=lag(dups)+1.

Select if (dups eq 0).
*  Now you have a set of unique ids within a name value.
*  Use Casestovars to build a single record.

Casestovars /id=name.

I really never use casestovars so you may need to fiddle around with it a
bit but I think you will get records consisting of
Name Match1 Match2 ... MatchN ID1 ID2 ... IDN

You will be interested in the variables ID1 thru IDN.

Gene Maguin

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Restructuring a fuzzy matched data set

Anton-24
In reply to this post by Anton-24
Hi Gene-

Thanks so much for your time and effort.  Unfortunately, since this data set
contains fuzzy matches, the name variable won't always be exactly the same
for each matched pair (it could even be missing on one of the pairs).  So...
sorting on name, in my example, is not an option.  Might there a work-around?

Anton

On Wed, 16 Jan 2008 13:06:14 -0500, Gene Maguin <[hidden email]> wrote:

>Anton,
>
>>>I am trying to unduplicate a file using a fuzzy match approach via the CDC
>software LinkPlus.  The issue I'm having is trying to restructure the data
>set so all Ids that match appear on one row in the.  For example LinkPlus
>spits out a data set (actually a report) that has all pairs of matches:
>
>Match#         ID  Name
>1  27 Carl
>1  42 Carl
>2  27 Carl
>2  53 Carl
>3  42 Carl
>3  53 Carl
>4  18 Sue
>4  99 Sue
>
>I'd love to have a data set that looks like this:
>
>Match1 Match2 Match3
>27  42  53
>18  99
>
>
>I think I'd work the problem this way.
>
>Sort cases by name id.
>
>Compute dups=0.
>If (id eq lag(id)) dups=lag(dups)+1.
>
>Select if (dups eq 0).
>*  Now you have a set of unique ids within a name value.
>*  Use Casestovars to build a single record.
>
>Casestovars /id=name.
>
>I really never use casestovars so you may need to fiddle around with it a
>bit but I think you will get records consisting of
>Name Match1 Match2 ... MatchN ID1 ID2 ... IDN
>
>You will be interested in the variables ID1 thru IDN.
>
>Gene Maguin
>
>=====================
>To manage your subscription to SPSSX-L, send a message to
>[hidden email] (not to SPSSX-L), with no body text except the
>command. To leave the list, send the command
>SIGNOFF SPSSX-L
>For a list of commands to manage subscriptions, send the command
>INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Restructuring a fuzzy matched data set

Maguin, Eugene
Anton,

I don't know if there is an alternative. You mentioned a CDC program. I am
not familiar with it. I guess I also don't understand (or misunderstood) the
process. I assumed that you passed your dataset through the CDC program and
it produced a dataset structured as shown in your example. In your example
the variable Name defined sets of cases. It sounds like you are now saying
that Name does not uniquely identify sets of cases. How can you identify
sets of cases? I think that is THE question.

Gene Maguin

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Restructuring a fuzzy matched data set

Richard Ristow
In reply to this post by Anton-24
At 03:37 PM 1/15/2008, Anton wrote:

>I am trying to unduplicate a file using a fuzzy match approach via
>the CDC software LinkPlus.  I'm trying to restructure the data set
>so all Ids that match appear on one row in the [file].  For example
>LinkPlus spits out a data set that has all pairs of matches:
|-----------------------------|---------------------------|
|Output Created               |16-JAN-2008 17:25:24       |
|-----------------------------|---------------------------|
Match#   ID Name

     1    27 Carl
     1    42 Carl
     2    27 Carl
     2    53 Carl
     3    42 Carl
     3    53 Carl
     4    18 Sue
     4    99 Sue

Number of cases read:  8    Number of cases listed:  8

>I'd love to have a data set that looks like this:
>
>Match1 Match2 Match3
>27     42     53
>18     99

Your posting is clear about this, but I'll re-state it to make sure
there's no misunderstanding: LinkPlus has done the pairing already;
you're not doing anything to compare names. For your current
purposes, the only variables you're working with are 'Match#' and 'ID'.

First, a major question (and the answer may not be immediately clear;
even the question may not be): Does the report from LinkPlus have the
property called 'transitive-complete'? Here, 'transitive' refers to
the rule that "two names that match the same name, match each other."
For example, in your data,

Match# ID  Name
1      27 Carl
1      42 Carl
2      27 Carl
2      53 Carl
3      42 Carl
3      53 Carl

is transitive-complete: 27 matches 42; 42 matches 53; so 27 should
match 53, and it does. If you can have situations like

Match# ID  Name
1      27 Carl
1      42 Carl
2      42 Carl
2      53 Carl

then to get your record

Match1 Match2 Match3
27     42     53

you need an operation called 'transitive closure' to infer that 27
matches 53. That takes a macro loop, or much preferably a Python loop.

But if your set of matches is transitive-complete, this should work.
It's tested, though I'd like to have a little more test data to be
sure. This is draft output; test data, and code without the printed
output, are in the Appendix at the end.

*  I.      Restructure, with matches on one record instead of two    .
*  I.A.    Make sure all matches are in direction low -> high        .

SORT CASES BY Match# ID.
*  I.B.    Restructure                                         ..... .
*          (This may be confusing. See comments, on the statement.)  .

CASESTOVARS
  /ID        = Match#   /* 'ID' is a subcommand, 'Match#' a variable*/
  /RENAME      ID=Match /* 'ID' is a variable,   'Match'  a vbl pfx.*/
  /SEPARATOR = ''
  /DROP      = Name     /* Name is irrelevant for this purpose      */
  /GROUPBY   = VARIABLE .


Cases to Variables
|-----------------------------|---------------------------|
|Output Created               |16-JAN-2008 17:25:24       |
|-----------------------------|---------------------------|
Generated Variables
|---------|------|
|Original |Result|
|Variabl  |------|
|e        |Name  |
|-------|-|------|
|ID     |1|Match1|
|       |2|Match2|
|-------|-|------|

Processing Statistics
|---------------|---|
|Cases In       |8  |
|Cases Out      |4  |
|---------------|---|
|Cases In/Cases |2.0|
|Out            |   |
|---------------|---|
|Variables In   |3  |
|Variables Out  |3  |
|---------------|---|
|Index Values   |2  |
|---------------|---|


.  /**/ LIST  /*-*/.

List
|-----------------------------|---------------------------|
|Output Created               |16-JAN-2008 17:25:26       |
|-----------------------------|---------------------------|
Match# Match1 Match2

     1     27     42
     2     27     53
     3     42     53
     4     18     99

Number of cases read:  4    Number of cases listed:  4


*  II.     Combine, to get the required dataset                      .
*  II.A.   Find the lowest-numbered record that each record matches  .
*          (This won't give what you want, unless the set of matches .
*          is transitive-complete.)                                  .

AGGREGATE  OUTFILE=*
   /BREAK = Match2
   /Root   'Lowest match to MATCH2 record' = MIN(Match1).


.  /**/ LIST /* sorry MATCH2 and MATCH1 come out in reversed order */.

List
|-----------------------------|---------------------------|
|Output Created               |16-JAN-2008 17:25:26       |
|-----------------------------|---------------------------|
Match2 Root

    42    27
    53    27
    99    18

Number of cases read:  3    Number of cases listed:  3


*  II.B.   Restructure, to the 'wide' records desired          ..... .
*          (This was requested in the posting; but for many purposes,.
*          it may not be desirable.)                                 .

CASESTOVARS
  /ID      = Root
  /RENAME    Match2 = Match
  /SEPARATOR = ''
  /GROUPBY   = VARIABLE .

Cases to Variables
|-----------------------------|---------------------------|
|Output Created               |16-JAN-2008 17:25:26       |
|-----------------------------|---------------------------|
Generated Variables
|---------|------|
|Original |Result|
|Variabl  |------|
|e        |Name  |
|-------|-|------|
|Match2 |1|Match1|
|       |2|Match2|
|-------|-|------|

Processing Statistics
|---------------|---|
|Cases In       |3  |
|Cases Out      |2  |
|---------------|---|
|Cases In/Cases |1.5|
|Out            |   |
|---------------|---|
|Variables In   |2  |
|Variables Out  |3  |
|---------------|---|
|Index Values   |2  |
|---------------|---|


LIST /* Variable names aren't as requested, but structure is */.

List
|-----------------------------|---------------------------|
|Output Created               |16-JAN-2008 17:25:26       |
|-----------------------------|---------------------------|
Root Match1 Match2

   27    42     53
   18    99      .

Number of cases read:  2    Number of cases listed:  2


=============================================
APPENDIX: Test data, and code without listing
=============================================
*  ................................................................. .
*  .................   Test data, from original posting  ........... .
DATA LIST LIST /
    Match#  ID  Name
   (F3,     F4, A10).
BEGIN DATA
    1       27  Carl
    1       42  Carl
    2       27  Carl
    2       53  Carl
    3       42  Carl
    3       53  Carl
    4       18  Sue
    4       99  Sue
END DATA.

LIST.

*  .................   Post after this point   ..................... .
*  ................................................................. .

*  I.      Restructure, with matches on one record instead of two    .
*  I.A.    Make sure all matches are in direction low -> high        .

SORT CASES BY Match# ID.

*  I.B.    Restructure                                         ..... .
*          (This may be confusing. See comments, on the statement.)  .

CASESTOVARS
  /ID        = Match#   /* 'ID' is a subcommand, 'Match#' a variable*/
  /RENAME      ID=Match /* 'ID' is a variable,   'Match'  a vbl pfx.*/
  /SEPARATOR = ''
  /DROP      = Name     /* Name is irrelevant for this purpose      */
  /GROUPBY   = VARIABLE .
.  /**/ LIST  /*-*/.


*  II.     Combine, to get the required dataset                      .
*  II.A.   Find the lowest-numbered record that each record matches  .
*          (This won't give what you want, unless the set of matches .
*          is transitive-complete.)                                  .

AGGREGATE  OUTFILE=*
   /BREAK = Match2
   /Root   'Lowest match to MATCH2 record' = MIN(Match1).


.  /**/ LIST /* sorry MATCH2 and MATCH1 come out in reversed order */.

*  II.B.   Restructure, to the 'wide' records desired          ..... .
*          (This was requested in the posting; but for many purposes,.
*          it may not be desirable.)                                 .

CASESTOVARS
  /ID      = Root
  /RENAME    Match2 = Match
  /SEPARATOR = ''
  /GROUPBY   = VARIABLE .

LIST /* Variable names aren't as requested, but structure is */.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Restructuring a fuzzy matched data set

Anton-24
In reply to this post by Anton-24
Sorry for the confusion.  My explanation of what I have and what I need
probably wasn't clearly articulated.

I have the following SPSS data set with two columns: Match# and ID.  What
Match# 1 this tells me in the following data set is that ID 27 and ID 42 are
linked or are matched. Match# 2 tells me that ID 27 and ID 53 are linked.
Match# 4 tells me that ID 42 and ID 53 are linked, and so on.  So,
logically, I know that Ids 27, 42 & 53 are linked to each other, as are 18
and 99.

Match# ID
1 27
1 42
2 27
2 53
3 18
3 99
4 42
4 53

I need a data set that takes the all of the IDs that are linked or matched
and put them on the same row:

VAR1 VAR2 VAR3
27 42 53
18 99

BTW: I don't have/use python.

Thanks again for any guidance.

Anton







On Wed, 16 Jan 2008 15:26:04 -0500, Gene Maguin <[hidden email]> wrote:

>Anton,
>
>I don't know if there is an alternative. You mentioned a CDC program. I am
>not familiar with it. I guess I also don't understand (or misunderstood) the
>process. I assumed that you passed your dataset through the CDC program and
>it produced a dataset structured as shown in your example. In your example
>the variable Name defined sets of cases. It sounds like you are now saying
>that Name does not uniquely identify sets of cases. How can you identify
>sets of cases? I think that is THE question.
>
>Gene Maguin
>
>=====================
>To manage your subscription to SPSSX-L, send a message to
>[hidden email] (not to SPSSX-L), with no body text except the
>command. To leave the list, send the command
>SIGNOFF SPSSX-L
>For a list of commands to manage subscriptions, send the command
>INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Restructuring a fuzzy matched data set

Albert-Jan Roskam
In reply to this post by Maguin, Eugene
> How can you identify
> sets of cases? I think that is THE question.
>

===> That's indeed the big question in probabilistic
linkage. AFAIK, sets of cases are defined as the sum
of the linkage weights. This sum reflects the
probability that those cases belong to the same set.
If this probability exceeds a certain threshold (which
is also estimated), the cases are considered linked.

So I'd say the linkage weight and the linkage
threshold are information you need for this decision.

Cheers!!
Albert-Jan



      ____________________________________________________________________________________
Looking for last minute shopping deals?
Find them fast with Yahoo! Search.  http://tools.search.yahoo.com/newsearch/category.php?category=shopping

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD