|
Dear Listers,
I am trying to unduplicate a file using a fuzzy match approach via the CDC software LinkPlus. The issue I'm having is trying to restructure the data set so all Ids that match appear on one row in the. For example LinkPlus spits out a data set (actually a report) that has all pairs of matches: Match# ID Name 1 27 Carl 1 42 Carl 2 27 Carl 2 53 Carl 3 42 Carl 3 53 Carl 4 18 Sue 4 99 Sue I'd love to have a data set that looks like this: Match1 Match2 Match3 27 42 53 18 99 I can use the lag function to create the row for Sue, but I'm afraid that I need multiple lag functions and/or some kind of looping action to solve the "Carl" issue. Note: I don't have Python and the number of match records can exceed the given example. Any help is much appreciated. Anton ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
Anton,
>>I am trying to unduplicate a file using a fuzzy match approach via the CDC software LinkPlus. The issue I'm having is trying to restructure the data set so all Ids that match appear on one row in the. For example LinkPlus spits out a data set (actually a report) that has all pairs of matches: Match# ID Name 1 27 Carl 1 42 Carl 2 27 Carl 2 53 Carl 3 42 Carl 3 53 Carl 4 18 Sue 4 99 Sue I'd love to have a data set that looks like this: Match1 Match2 Match3 27 42 53 18 99 I think I'd work the problem this way. Sort cases by name id. Compute dups=0. If (id eq lag(id)) dups=lag(dups)+1. Select if (dups eq 0). * Now you have a set of unique ids within a name value. * Use Casestovars to build a single record. Casestovars /id=name. I really never use casestovars so you may need to fiddle around with it a bit but I think you will get records consisting of Name Match1 Match2 ... MatchN ID1 ID2 ... IDN You will be interested in the variables ID1 thru IDN. Gene Maguin ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
In reply to this post by Anton-24
Hi Gene-
Thanks so much for your time and effort. Unfortunately, since this data set contains fuzzy matches, the name variable won't always be exactly the same for each matched pair (it could even be missing on one of the pairs). So... sorting on name, in my example, is not an option. Might there a work-around? Anton On Wed, 16 Jan 2008 13:06:14 -0500, Gene Maguin <[hidden email]> wrote: >Anton, > >>>I am trying to unduplicate a file using a fuzzy match approach via the CDC >software LinkPlus. The issue I'm having is trying to restructure the data >set so all Ids that match appear on one row in the. For example LinkPlus >spits out a data set (actually a report) that has all pairs of matches: > >Match# ID Name >1 27 Carl >1 42 Carl >2 27 Carl >2 53 Carl >3 42 Carl >3 53 Carl >4 18 Sue >4 99 Sue > >I'd love to have a data set that looks like this: > >Match1 Match2 Match3 >27 42 53 >18 99 > > >I think I'd work the problem this way. > >Sort cases by name id. > >Compute dups=0. >If (id eq lag(id)) dups=lag(dups)+1. > >Select if (dups eq 0). >* Now you have a set of unique ids within a name value. >* Use Casestovars to build a single record. > >Casestovars /id=name. > >I really never use casestovars so you may need to fiddle around with it a >bit but I think you will get records consisting of >Name Match1 Match2 ... MatchN ID1 ID2 ... IDN > >You will be interested in the variables ID1 thru IDN. > >Gene Maguin > >===================== >To manage your subscription to SPSSX-L, send a message to >[hidden email] (not to SPSSX-L), with no body text except the >command. To leave the list, send the command >SIGNOFF SPSSX-L >For a list of commands to manage subscriptions, send the command >INFO REFCARD ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
Anton,
I don't know if there is an alternative. You mentioned a CDC program. I am not familiar with it. I guess I also don't understand (or misunderstood) the process. I assumed that you passed your dataset through the CDC program and it produced a dataset structured as shown in your example. In your example the variable Name defined sets of cases. It sounds like you are now saying that Name does not uniquely identify sets of cases. How can you identify sets of cases? I think that is THE question. Gene Maguin ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
In reply to this post by Anton-24
At 03:37 PM 1/15/2008, Anton wrote:
>I am trying to unduplicate a file using a fuzzy match approach via >the CDC software LinkPlus. I'm trying to restructure the data set >so all Ids that match appear on one row in the [file]. For example >LinkPlus spits out a data set that has all pairs of matches: |-----------------------------|---------------------------| |Output Created |16-JAN-2008 17:25:24 | |-----------------------------|---------------------------| Match# ID Name 1 27 Carl 1 42 Carl 2 27 Carl 2 53 Carl 3 42 Carl 3 53 Carl 4 18 Sue 4 99 Sue Number of cases read: 8 Number of cases listed: 8 >I'd love to have a data set that looks like this: > >Match1 Match2 Match3 >27 42 53 >18 99 Your posting is clear about this, but I'll re-state it to make sure there's no misunderstanding: LinkPlus has done the pairing already; you're not doing anything to compare names. For your current purposes, the only variables you're working with are 'Match#' and 'ID'. First, a major question (and the answer may not be immediately clear; even the question may not be): Does the report from LinkPlus have the property called 'transitive-complete'? Here, 'transitive' refers to the rule that "two names that match the same name, match each other." For example, in your data, Match# ID Name 1 27 Carl 1 42 Carl 2 27 Carl 2 53 Carl 3 42 Carl 3 53 Carl is transitive-complete: 27 matches 42; 42 matches 53; so 27 should match 53, and it does. If you can have situations like Match# ID Name 1 27 Carl 1 42 Carl 2 42 Carl 2 53 Carl then to get your record Match1 Match2 Match3 27 42 53 you need an operation called 'transitive closure' to infer that 27 matches 53. That takes a macro loop, or much preferably a Python loop. But if your set of matches is transitive-complete, this should work. It's tested, though I'd like to have a little more test data to be sure. This is draft output; test data, and code without the printed output, are in the Appendix at the end. * I. Restructure, with matches on one record instead of two . * I.A. Make sure all matches are in direction low -> high . SORT CASES BY Match# ID. * I.B. Restructure ..... . * (This may be confusing. See comments, on the statement.) . CASESTOVARS /ID = Match# /* 'ID' is a subcommand, 'Match#' a variable*/ /RENAME ID=Match /* 'ID' is a variable, 'Match' a vbl pfx.*/ /SEPARATOR = '' /DROP = Name /* Name is irrelevant for this purpose */ /GROUPBY = VARIABLE . Cases to Variables |-----------------------------|---------------------------| |Output Created |16-JAN-2008 17:25:24 | |-----------------------------|---------------------------| Generated Variables |---------|------| |Original |Result| |Variabl |------| |e |Name | |-------|-|------| |ID |1|Match1| | |2|Match2| |-------|-|------| Processing Statistics |---------------|---| |Cases In |8 | |Cases Out |4 | |---------------|---| |Cases In/Cases |2.0| |Out | | |---------------|---| |Variables In |3 | |Variables Out |3 | |---------------|---| |Index Values |2 | |---------------|---| . /**/ LIST /*-*/. List |-----------------------------|---------------------------| |Output Created |16-JAN-2008 17:25:26 | |-----------------------------|---------------------------| Match# Match1 Match2 1 27 42 2 27 53 3 42 53 4 18 99 Number of cases read: 4 Number of cases listed: 4 * II. Combine, to get the required dataset . * II.A. Find the lowest-numbered record that each record matches . * (This won't give what you want, unless the set of matches . * is transitive-complete.) . AGGREGATE OUTFILE=* /BREAK = Match2 /Root 'Lowest match to MATCH2 record' = MIN(Match1). . /**/ LIST /* sorry MATCH2 and MATCH1 come out in reversed order */. List |-----------------------------|---------------------------| |Output Created |16-JAN-2008 17:25:26 | |-----------------------------|---------------------------| Match2 Root 42 27 53 27 99 18 Number of cases read: 3 Number of cases listed: 3 * II.B. Restructure, to the 'wide' records desired ..... . * (This was requested in the posting; but for many purposes,. * it may not be desirable.) . CASESTOVARS /ID = Root /RENAME Match2 = Match /SEPARATOR = '' /GROUPBY = VARIABLE . Cases to Variables |-----------------------------|---------------------------| |Output Created |16-JAN-2008 17:25:26 | |-----------------------------|---------------------------| Generated Variables |---------|------| |Original |Result| |Variabl |------| |e |Name | |-------|-|------| |Match2 |1|Match1| | |2|Match2| |-------|-|------| Processing Statistics |---------------|---| |Cases In |3 | |Cases Out |2 | |---------------|---| |Cases In/Cases |1.5| |Out | | |---------------|---| |Variables In |2 | |Variables Out |3 | |---------------|---| |Index Values |2 | |---------------|---| LIST /* Variable names aren't as requested, but structure is */. List |-----------------------------|---------------------------| |Output Created |16-JAN-2008 17:25:26 | |-----------------------------|---------------------------| Root Match1 Match2 27 42 53 18 99 . Number of cases read: 2 Number of cases listed: 2 ============================================= APPENDIX: Test data, and code without listing ============================================= * ................................................................. . * ................. Test data, from original posting ........... . DATA LIST LIST / Match# ID Name (F3, F4, A10). BEGIN DATA 1 27 Carl 1 42 Carl 2 27 Carl 2 53 Carl 3 42 Carl 3 53 Carl 4 18 Sue 4 99 Sue END DATA. LIST. * ................. Post after this point ..................... . * ................................................................. . * I. Restructure, with matches on one record instead of two . * I.A. Make sure all matches are in direction low -> high . SORT CASES BY Match# ID. * I.B. Restructure ..... . * (This may be confusing. See comments, on the statement.) . CASESTOVARS /ID = Match# /* 'ID' is a subcommand, 'Match#' a variable*/ /RENAME ID=Match /* 'ID' is a variable, 'Match' a vbl pfx.*/ /SEPARATOR = '' /DROP = Name /* Name is irrelevant for this purpose */ /GROUPBY = VARIABLE . . /**/ LIST /*-*/. * II. Combine, to get the required dataset . * II.A. Find the lowest-numbered record that each record matches . * (This won't give what you want, unless the set of matches . * is transitive-complete.) . AGGREGATE OUTFILE=* /BREAK = Match2 /Root 'Lowest match to MATCH2 record' = MIN(Match1). . /**/ LIST /* sorry MATCH2 and MATCH1 come out in reversed order */. * II.B. Restructure, to the 'wide' records desired ..... . * (This was requested in the posting; but for many purposes,. * it may not be desirable.) . CASESTOVARS /ID = Root /RENAME Match2 = Match /SEPARATOR = '' /GROUPBY = VARIABLE . LIST /* Variable names aren't as requested, but structure is */. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
In reply to this post by Anton-24
Sorry for the confusion. My explanation of what I have and what I need
probably wasn't clearly articulated. I have the following SPSS data set with two columns: Match# and ID. What Match# 1 this tells me in the following data set is that ID 27 and ID 42 are linked or are matched. Match# 2 tells me that ID 27 and ID 53 are linked. Match# 4 tells me that ID 42 and ID 53 are linked, and so on. So, logically, I know that Ids 27, 42 & 53 are linked to each other, as are 18 and 99. Match# ID 1 27 1 42 2 27 2 53 3 18 3 99 4 42 4 53 I need a data set that takes the all of the IDs that are linked or matched and put them on the same row: VAR1 VAR2 VAR3 27 42 53 18 99 BTW: I don't have/use python. Thanks again for any guidance. Anton On Wed, 16 Jan 2008 15:26:04 -0500, Gene Maguin <[hidden email]> wrote: >Anton, > >I don't know if there is an alternative. You mentioned a CDC program. I am >not familiar with it. I guess I also don't understand (or misunderstood) the >process. I assumed that you passed your dataset through the CDC program and >it produced a dataset structured as shown in your example. In your example >the variable Name defined sets of cases. It sounds like you are now saying >that Name does not uniquely identify sets of cases. How can you identify >sets of cases? I think that is THE question. > >Gene Maguin > >===================== >To manage your subscription to SPSSX-L, send a message to >[hidden email] (not to SPSSX-L), with no body text except the >command. To leave the list, send the command >SIGNOFF SPSSX-L >For a list of commands to manage subscriptions, send the command >INFO REFCARD ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
In reply to this post by Maguin, Eugene
> How can you identify
> sets of cases? I think that is THE question. > ===> That's indeed the big question in probabilistic linkage. AFAIK, sets of cases are defined as the sum of the linkage weights. This sum reflects the probability that those cases belong to the same set. If this probability exceeds a certain threshold (which is also estimated), the cases are considered linked. So I'd say the linkage weight and the linkage threshold are information you need for this decision. Cheers!! Albert-Jan ____________________________________________________________________________________ Looking for last minute shopping deals? Find them fast with Yahoo! Search. http://tools.search.yahoo.com/newsearch/category.php?category=shopping ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
| Free forum by Nabble | Edit this page |
