Matching on a partial key variable

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

Matching on a partial key variable

TheGhazi
This post was updated on .
Dear,

I would like merge 2 data that has a common key variable but i have problem with the second dataset as it has the key variable (numerical) with different length.

For example,
 data 1:

File_No   age   sex
123        75     F
4567      85     M
7896      83     F


Data 2:
File_No     disease_code
12345          C1.2
45677          D132
7896            C1.2

Can i match using the first 3 or 5 characters ?

Thanks
Reply | Threaded
Open this post in threaded view
|

Re: Matching on a partial key variable

David Marso
Administrator
Seems that data 1 doesn't bring anything to the party!
Obviously MATCH FILES wont work with what you have.
Chopping data 2 might give you something if you also use other variables as keys.
Still YMMV and you might consider razzing the data 1 peeps to get their shit together and cease providing defective data?
Please elaborate further if I have missed something here!
GIGO!!!!!!!
---

Ghazi wrote
Dear,

I would like merge 2 data that has a common key variable but i have problem with the second dataset as it has the key variable (numerical) with different length.

For example,
 data 1:

File_No   age   sex
123        75     F
4567      85     M
7896      83     F


Data 2:
File_No   age   sex  disease_code
12345     75    F      C1.2
45677     85    M     D132
7896       83    F      C1.2

Can i match using the first 3 or 5 characters ?

Thanks
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Matching on a partial key variable

TheGhazi
This post was updated on .
HI GiGO,

i edited dataset one, however, it was only meant to explain the idea and it is not part of the real data.

I am dealing with 2 datasets that has different patients information in each but data 2 has the file numbers appended with the visit number (i,e., if the file number is 112345 and this is the first visit, they will write the file number as 112345111 - with the addition of 111 to the real number). In addition, some of the file numbers are 4 digits and some go upto 8 digits.

Reply | Threaded
Open this post in threaded view
|

Re: Matching on a partial key variable

Maguin, Eugene
So the id number for file 2 is patient id plus visit number. Are the patient ids unique numbers (I can't imagine why they wouldn't be but stranger things have happened!)? How many patients are there? What is the base 10 exponent (2=100): 3, 4, 5, 6, >6? What structure is there in the numbering of the visits? Do they always start at 1 and always increment by 1 or, more generally, always increment by 1 from x to y, where x and y are patient-specific numbers? I'm assuming that the file 2 represents a dump of data from date A to date B so that the first record for some patients is 1, for others is 13 or 34,  and for still others it may be 123.  Can you conclude that every patient whose visit is in file 2 is named in file 1 and that every patient in file 1 has one or more visits in file 2?

This is where I'm going with these questions. (I don't really know, in the sense of having tested it, if this will work but it seems like it might or might be a starting point.) Suppose you had 5 patients in file 1 and 14 visit records in file 2. Suppose you had only those numbers with and no other useful information, which you may have. So suppose you deleted all variables in file 1 but the patient id (pid) and then restructured that file long to wide, giving one record with five variables (pid.1 to pid.5). Next, append that record to every record in file 2 and delete all variables but the patient+visit number and pid.1-pid.5. Now, one of pid.1-pid.5 has to be the left-most digits in the visit number such that the visit number sequence is preserved and every pid.x has to be accounted for. This is not a perfect deal in that file 1 ids will always be linked with the correct file 2 records. You might have multiple possible pids for file 2 records. It doesn't seem, however, that you have anything better at the moment.

Gene Maguin





-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Ghazi
Sent: Thursday, September 11, 2014 12:45 PM
To: [hidden email]
Subject: Re: Matching on a partial key variable

HI GiGO,

i edited dataset one, however, it wasn't meant to explain the idea and it is not part of the real data.

I am dealing with 2 datasets that has different patients information in each but data 2 has the file numbers appended with the visit number (i,e., if the file number is 112345 and this is the first visit, they will write the file number as 112345111 - with the addition of 111 to the real number). In addition, some of the file numbers are 4 digits and some go upto 8 digits.





--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Matching-on-a-partial-key-variable-tp5727214p5727222.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Matching on a partial key variable

David Marso
Administrator
In reply to this post by TheGhazi

If it is always 3 digits appended then simply divide by 1000 and truncate!
If not then you have a serious data definition problem and should probably get a bitching session going with your data provider.
BTW, GIGO is acronym for garbage in garbage out, NOT my name ;-)
--
Ghazi wrote
HI GiGO,

i edited dataset one, however, it wasn't meant to explain the idea and it is not part of the real data.

I am dealing with 2 datasets that has different patients information in each but data 2 has the file numbers appended with the visit number (i,e., if the file number is 112345 and this is the first visit, they will write the file number as 112345111 - with the addition of 111 to the real number). In addition, some of the file numbers are 4 digits and some go upto 8 digits.
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Matching on a partial key variable

TheGhazi
Ohhh Dear David,

I am truly sorry ☹

But yaa I believe this is the way the provider codes these variables. 

I thinked i figured away. I will let you know if it works. 

Many thanks,

Ghazi

On Thu, Sep 11, 2014 at 1:47 PM, David Marso [via SPSSX Discussion] <[hidden email]> wrote:

If it is always 3 digits appended then simply divide by 1000 and truncate!
If not then you have a serious data definition problem and should probably get a bitching session going with your data provider.
BTW, GIGO is acronym for garbage in garbage out, NOT my name ;-)
--
Ghazi wrote
HI GiGO,

i edited dataset one, however, it wasn't meant to explain the idea and it is not part of the real data.

I am dealing with 2 datasets that has different patients information in each but data 2 has the file numbers appended with the visit number (i,e., if the file number is 112345 and this is the first visit, they will write the file number as 112345111 - with the addition of 111 to the real number). In addition, some of the file numbers are 4 digits and some go upto 8 digits.
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"



If you reply to this email, your message will be added to the discussion below:
http://spssx-discussion.1045642.n5.nabble.com/Matching-on-a-partial-key-variable-tp5727214p5727223.html
To unsubscribe from Matching on a partial key variable, click here.
NAML

Reply | Threaded
Open this post in threaded view
|

Re: Matching on a partial key variable

TheGhazi
In reply to this post by Maguin, Eugene
Hi Gene, 

Thanks for the time and thoughtful response. 

So data A (the main data) has the the REAL unique pid and it also has the date of the required visit. Data B has the patient unique id PLUS 3 more digits in some patients and the pid itself as in data A for some patients. The total patients in data A is 627 and 755 in data B. The exact match in data B is only 317 using the pid as a key variable.

 The remaining patients has some weird 3 digits appended to the pid and i can't delete the 3 digits because the pid(s) themselves can be 4 to 8 digits. I hope this is clear. 

The ideal situation is to look at data A (pid variable-> count how many character in each variable-> then match with the first equal number of characters in the pid variable in data B).

Lets us not to discuss the presence of duplicate cases, although i have some but i can manually match using the pid and visit date to merge duplicate. 

Cheers, 

Ghazi

On Thu, Sep 11, 2014 at 2:26 PM, Maguin, Eugene [via SPSSX Discussion] <[hidden email]> wrote:
So the id number for file 2 is patient id plus visit number. Are the patient ids unique numbers (I can't imagine why they wouldn't be but stranger things have happened!)? How many patients are there? What is the base 10 exponent (2=100): 3, 4, 5, 6, >6? What structure is there in the numbering of the visits? Do they always start at 1 and always increment by 1 or, more generally, always increment by 1 from x to y, where x and y are patient-specific numbers? I'm assuming that the file 2 represents a dump of data from date A to date B so that the first record for some patients is 1, for others is 13 or 34,  and for still others it may be 123.  Can you conclude that every patient whose visit is in file 2 is named in file 1 and that every patient in file 1 has one or more visits in file 2?

This is where I'm going with these questions. (I don't really know, in the sense of having tested it, if this will work but it seems like it might or might be a starting point.) Suppose you had 5 patients in file 1 and 14 visit records in file 2. Suppose you had only those numbers with and no other useful information, which you may have. So suppose you deleted all variables in file 1 but the patient id (pid) and then restructured that file long to wide, giving one record with five variables (pid.1 to pid.5). Next, append that record to every record in file 2 and delete all variables but the patient+visit number and pid.1-pid.5. Now, one of pid.1-pid.5 has to be the left-most digits in the visit number such that the visit number sequence is preserved and every pid.x has to be accounted for. This is not a perfect deal in that file 1 ids will always be linked with the correct file 2 records. You might have multiple possible pids for file 2 records. It doesn't seem, however, that you have anything better at the moment.

Gene Maguin





-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Ghazi
Sent: Thursday, September 11, 2014 12:45 PM
To: [hidden email]
Subject: Re: Matching on a partial key variable

HI GiGO,

i edited dataset one, however, it wasn't meant to explain the idea and it is not part of the real data.

I am dealing with 2 datasets that has different patients information in each but data 2 has the file numbers appended with the visit number (i,e., if the file number is 112345 and this is the first visit, they will write the file number as 112345111 - with the addition of 111 to the real number). In addition, some of the file numbers are 4 digits and some go upto 8 digits.





--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Matching-on-a-partial-key-variable-tp5727214p5727222.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD



If you reply to this email, your message will be added to the discussion below:
http://spssx-discussion.1045642.n5.nabble.com/Matching-on-a-partial-key-variable-tp5727214p5727227.html
To unsubscribe from Matching on a partial key variable, click here.
NAML

Reply | Threaded
Open this post in threaded view
|

Re: Matching on a partial key variable

Bruce Weaver
Administrator
In reply to this post by David Marso
I thought Ghazi was suggesting a new nickname for you, David.  But personally, I think "The Marsonator" is much more apt.  (You've been Marsonated, etc.)  ;-)


David Marso wrote
If it is always 3 digits appended then simply divide by 1000 and truncate!
If not then you have a serious data definition problem and should probably get a bitching session going with your data provider.
BTW, GIGO is acronym for garbage in garbage out, NOT my name ;-)
--
Ghazi wrote
HI GiGO,

i edited dataset one, however, it wasn't meant to explain the idea and it is not part of the real data.

I am dealing with 2 datasets that has different patients information in each but data 2 has the file numbers appended with the visit number (i,e., if the file number is 112345 and this is the first visit, they will write the file number as 112345111 - with the addition of 111 to the real number). In addition, some of the file numbers are 4 digits and some go upto 8 digits.
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).
Reply | Threaded
Open this post in threaded view
|

Re: Matching on a partial key variable

Richard Ristow
In reply to this post by TheGhazi
At 04:43 PM 9/11/2014, Ghazi wrote:

>Data A (the main data) has the REAL unique pid and it also has the
>date of the required visit. Data B has the patient unique id PLUS 3
>more digits in some patients and the pid itself as in data A for
>some patients. The total patients in data A is 627 and 755 in data
>B. The exact match in data B is only 317 using the pid as a key variable.
>
>The remaining patients has some weird 3 digits appended to the pid
>and i can't delete the 3 digits because the pid(s) themselves can be
>4 to 8 digits. I hope this is clear.

This one is kind of a bear. I'd like to have a word with whomever set
up data B's key that way!  The worst problem is if the appended
digits give you the PID for the *wrong* patient: you have patients
1234 and 1234567, and file B has records for patient 1234 but has
appended digits 567 to the ID, so it matches patient 1234567. You'd
have to go deeper into your data sources, if there's any chance of
this happening.

Below is a solution (NOT TESTED), based on the assumption that in
data B, the ID is either the pid or the pid with *exactly* three
digits appended. Its result is the Data B records with the correct
pid. It will work only for those patients in data B who are also in
data A, and where the original ID in data B is not ambiguous.


*  I.    Since data A is visits, a patient may logically appear more .
*        than once in data A. So, create an index file with one      .
*        record for each patient in data A.                          .

GET FILE=data_A.
DATASET DECLARE   PtList.

AGGREGATE OUTFILE=PtList
    /BREAK=pid
    /NVISITS 'Number of visits for this patient' = NU.

*  II.   Similarly, create a file of all ID values that occur in     .
*        Data B:                                                     .

*        I'm guessing that the ID variable in data B is called 'pid' .
*        If so, that's undesirable, so it'll be changed on loading.  .
*        (If it's called something else, the RENAME isn't necessary.).

GET FILE Data_B
    /RENAME=(pid=Data_B_id).

DATASET NAME      Data_B_set  WINDOW=FRONT.

DATASET DECLARE   Data_B_keys.

AGGREGATE OUTFILE=Data_B_keys
    /BREAK=Data_B_id
    /N_Brcds 'Number of records for this key in data B' = NU.


*  III.  List all *possible* pid values from data B.                 .

*        This assumes that the ID field in Data B is always either   .
*        the pid or the pid with *exactly* three digits appended.    .
*        (If that's not true, a different approach is needed.)       .

DATASET ACTIVATE  Data_B_keys WINDOW=FRONT.

NUMERIC Longpid  Shortpid (F8).

NUMERIC LongType ShortType (F2).
VALUE LABELS
         LongType ShortType
         1 'Full id'
         2 'Suffix dropped'.

COMPUTE Longpid   = Data_B_id.
COMPUTE LongType  = 1.
COMPUTE ShortPid  = TRUNC(Data_B_id/1000).
COMPUTE ShortType = 2.

*   The following VARSTOCASES creates a file with two records for    .
*   each Data B record, containing the two possible pid values from  .
*   that record: the id field as given, and the id field dropping    .
*   three trailing digits.                                           .

VARSTOCASES
     /MAKE pid    FROM Longpid  Shortpid
     /MAKE Source FROM LongType ShortType
     /KEEP Data_B_id.

*  IV.   Match possible pid values from data B with actual values    .
*        from data A, to identify correct values.                    .

SORT CASES BY pid Data_B_id.
MATCH FILES
   /TABLE=PtList  /IN=Matched
   /FILE =*
   /BY   pid.

SELECT IF  Matched.

*  V.A   Summarize matches by Data B id,                             .

AGGREGATE OUTFILE=*
   /pid    = MIN(pid)
   /Source = MIN(Source)
   /NMatch = NU.

ADD VALUE LABELS Source
     0  'No match'
     9  'Ambiguous'.

DO IF   NMatch GT 1.
.  COMPUTE pid    = $SYSMIS.
.  COMPUTE Source = 9.
END IF.

IF NMatch GT 1 Source = 9.


*  V.B   Attach matches to the list of all Data B ids                .

MATCH FILES
    /FILE=*
    /FILE=Data_B_keys
    /BY   Data_B_id
    /DROP=N_Brcds NMatch.

RECODE Source (SYSMIS = 0).

*  VI.   Finally, attach pids to all Data B records                  .

MATCH FILES
    /TABLE=*
    /FILE =Data_B_set
    /BY   Data_B_id.


DATASET NAME       Data_B_pid WINDOW=FRONT.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Matching on a partial key variable

Richard Ristow
Of course, a correction. At 03:23 PM 9/12/2014, I posted code including:

>*  V.A   Summarize matches by Data B id,                             .
>
>AGGREGATE OUTFILE=*
>   /pid    = MIN(pid)
>   /Source = MIN(Source)
>   /NMatch = NU.

The AGGREGATE needs a BREAK clause:

AGGREGATE OUTFILE=*
   /BREAK  = Data_B_id    /* ADDED */
   /pid    = MIN(pid)
   /Source = MIN(Source)
   /NMatch = NU.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD