This post was updated on .
Dear,
I would like merge 2 data that has a common key variable but i have problem with the second dataset as it has the key variable (numerical) with different length. For example, data 1: File_No age sex 123 75 F 4567 85 M 7896 83 F Data 2: File_No disease_code 12345 C1.2 45677 D132 7896 C1.2 Can i match using the first 3 or 5 characters ? Thanks |
Administrator
|
Seems that data 1 doesn't bring anything to the party!
Obviously MATCH FILES wont work with what you have. Chopping data 2 might give you something if you also use other variables as keys. Still YMMV and you might consider razzing the data 1 peeps to get their shit together and cease providing defective data? Please elaborate further if I have missed something here! GIGO!!!!!!! ---
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
This post was updated on .
HI GiGO,
i edited dataset one, however, it was only meant to explain the idea and it is not part of the real data. I am dealing with 2 datasets that has different patients information in each but data 2 has the file numbers appended with the visit number (i,e., if the file number is 112345 and this is the first visit, they will write the file number as 112345111 - with the addition of 111 to the real number). In addition, some of the file numbers are 4 digits and some go upto 8 digits. |
So the id number for file 2 is patient id plus visit number. Are the patient ids unique numbers (I can't imagine why they wouldn't be but stranger things have happened!)? How many patients are there? What is the base 10 exponent (2=100): 3, 4, 5, 6, >6? What structure is there in the numbering of the visits? Do they always start at 1 and always increment by 1 or, more generally, always increment by 1 from x to y, where x and y are patient-specific numbers? I'm assuming that the file 2 represents a dump of data from date A to date B so that the first record for some patients is 1, for others is 13 or 34, and for still others it may be 123. Can you conclude that every patient whose visit is in file 2 is named in file 1 and that every patient in file 1 has one or more visits in file 2?
This is where I'm going with these questions. (I don't really know, in the sense of having tested it, if this will work but it seems like it might or might be a starting point.) Suppose you had 5 patients in file 1 and 14 visit records in file 2. Suppose you had only those numbers with and no other useful information, which you may have. So suppose you deleted all variables in file 1 but the patient id (pid) and then restructured that file long to wide, giving one record with five variables (pid.1 to pid.5). Next, append that record to every record in file 2 and delete all variables but the patient+visit number and pid.1-pid.5. Now, one of pid.1-pid.5 has to be the left-most digits in the visit number such that the visit number sequence is preserved and every pid.x has to be accounted for. This is not a perfect deal in that file 1 ids will always be linked with the correct file 2 records. You might have multiple possible pids for file 2 records. It doesn't seem, however, that you have anything better at the moment. Gene Maguin -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Ghazi Sent: Thursday, September 11, 2014 12:45 PM To: [hidden email] Subject: Re: Matching on a partial key variable HI GiGO, i edited dataset one, however, it wasn't meant to explain the idea and it is not part of the real data. I am dealing with 2 datasets that has different patients information in each but data 2 has the file numbers appended with the visit number (i,e., if the file number is 112345 and this is the first visit, they will write the file number as 112345111 - with the addition of 111 to the real number). In addition, some of the file numbers are 4 digits and some go upto 8 digits. -- View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Matching-on-a-partial-key-variable-tp5727214p5727222.html Sent from the SPSSX Discussion mailing list archive at Nabble.com. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Administrator
|
In reply to this post by TheGhazi
If it is always 3 digits appended then simply divide by 1000 and truncate! If not then you have a serious data definition problem and should probably get a bitching session going with your data provider. BTW, GIGO is acronym for garbage in garbage out, NOT my name ;-) --
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
Ohhh Dear David, I am truly sorry ☹ But yaa I believe this is the way the provider codes these variables. I thinked i figured away. I will let you know if it works. Many thanks, Ghazi On Thu, Sep 11, 2014 at 1:47 PM, David Marso [via SPSSX Discussion] <[hidden email]> wrote:
|
In reply to this post by Maguin, Eugene
Hi Gene, Thanks for the time and thoughtful response. So data A (the main data) has the the REAL unique pid and it also has the date of the required visit. Data B has the patient unique id PLUS 3 more digits in some patients and the pid itself as in data A for some patients. The total patients in data A is 627 and 755 in data B. The exact match in data B is only 317 using the pid as a key variable. The remaining patients has some weird 3 digits appended to the pid and i can't delete the 3 digits because the pid(s) themselves can be 4 to 8 digits. I hope this is clear. The ideal situation is to look at data A (pid variable-> count how many character in each variable-> then match with the first equal number of characters in the pid variable in data B). Lets us not to discuss the presence of duplicate cases, although i have some but i can manually match using the pid and visit date to merge duplicate. Cheers, Ghazi On Thu, Sep 11, 2014 at 2:26 PM, Maguin, Eugene [via SPSSX Discussion] <[hidden email]> wrote: So the id number for file 2 is patient id plus visit number. Are the patient ids unique numbers (I can't imagine why they wouldn't be but stranger things have happened!)? How many patients are there? What is the base 10 exponent (2=100): 3, 4, 5, 6, >6? What structure is there in the numbering of the visits? Do they always start at 1 and always increment by 1 or, more generally, always increment by 1 from x to y, where x and y are patient-specific numbers? I'm assuming that the file 2 represents a dump of data from date A to date B so that the first record for some patients is 1, for others is 13 or 34, and for still others it may be 123. Can you conclude that every patient whose visit is in file 2 is named in file 1 and that every patient in file 1 has one or more visits in file 2? |
Administrator
|
In reply to this post by David Marso
I thought Ghazi was suggesting a new nickname for you, David. But personally, I think "The Marsonator" is much more apt. (You've been Marsonated, etc.) ;-)
--
Bruce Weaver bweaver@lakeheadu.ca http://sites.google.com/a/lakeheadu.ca/bweaver/ "When all else fails, RTFM." PLEASE NOTE THE FOLLOWING: 1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above. 2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/). |
In reply to this post by TheGhazi
At 04:43 PM 9/11/2014, Ghazi wrote:
>Data A (the main data) has the REAL unique pid and it also has the >date of the required visit. Data B has the patient unique id PLUS 3 >more digits in some patients and the pid itself as in data A for >some patients. The total patients in data A is 627 and 755 in data >B. The exact match in data B is only 317 using the pid as a key variable. > >The remaining patients has some weird 3 digits appended to the pid >and i can't delete the 3 digits because the pid(s) themselves can be >4 to 8 digits. I hope this is clear. This one is kind of a bear. I'd like to have a word with whomever set up data B's key that way! The worst problem is if the appended digits give you the PID for the *wrong* patient: you have patients 1234 and 1234567, and file B has records for patient 1234 but has appended digits 567 to the ID, so it matches patient 1234567. You'd have to go deeper into your data sources, if there's any chance of this happening. Below is a solution (NOT TESTED), based on the assumption that in data B, the ID is either the pid or the pid with *exactly* three digits appended. Its result is the Data B records with the correct pid. It will work only for those patients in data B who are also in data A, and where the original ID in data B is not ambiguous. * I. Since data A is visits, a patient may logically appear more . * than once in data A. So, create an index file with one . * record for each patient in data A. . GET FILE=data_A. DATASET DECLARE PtList. AGGREGATE OUTFILE=PtList /BREAK=pid /NVISITS 'Number of visits for this patient' = NU. * II. Similarly, create a file of all ID values that occur in . * Data B: . * I'm guessing that the ID variable in data B is called 'pid' . * If so, that's undesirable, so it'll be changed on loading. . * (If it's called something else, the RENAME isn't necessary.). GET FILE Data_B /RENAME=(pid=Data_B_id). DATASET NAME Data_B_set WINDOW=FRONT. DATASET DECLARE Data_B_keys. AGGREGATE OUTFILE=Data_B_keys /BREAK=Data_B_id /N_Brcds 'Number of records for this key in data B' = NU. * III. List all *possible* pid values from data B. . * This assumes that the ID field in Data B is always either . * the pid or the pid with *exactly* three digits appended. . * (If that's not true, a different approach is needed.) . DATASET ACTIVATE Data_B_keys WINDOW=FRONT. NUMERIC Longpid Shortpid (F8). NUMERIC LongType ShortType (F2). VALUE LABELS LongType ShortType 1 'Full id' 2 'Suffix dropped'. COMPUTE Longpid = Data_B_id. COMPUTE LongType = 1. COMPUTE ShortPid = TRUNC(Data_B_id/1000). COMPUTE ShortType = 2. * The following VARSTOCASES creates a file with two records for . * each Data B record, containing the two possible pid values from . * that record: the id field as given, and the id field dropping . * three trailing digits. . VARSTOCASES /MAKE pid FROM Longpid Shortpid /MAKE Source FROM LongType ShortType /KEEP Data_B_id. * IV. Match possible pid values from data B with actual values . * from data A, to identify correct values. . SORT CASES BY pid Data_B_id. MATCH FILES /TABLE=PtList /IN=Matched /FILE =* /BY pid. SELECT IF Matched. * V.A Summarize matches by Data B id, . AGGREGATE OUTFILE=* /pid = MIN(pid) /Source = MIN(Source) /NMatch = NU. ADD VALUE LABELS Source 0 'No match' 9 'Ambiguous'. DO IF NMatch GT 1. . COMPUTE pid = $SYSMIS. . COMPUTE Source = 9. END IF. IF NMatch GT 1 Source = 9. * V.B Attach matches to the list of all Data B ids . MATCH FILES /FILE=* /FILE=Data_B_keys /BY Data_B_id /DROP=N_Brcds NMatch. RECODE Source (SYSMIS = 0). * VI. Finally, attach pids to all Data B records . MATCH FILES /TABLE=* /FILE =Data_B_set /BY Data_B_id. DATASET NAME Data_B_pid WINDOW=FRONT. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Of course, a correction. At 03:23 PM 9/12/2014, I posted code including:
>* V.A Summarize matches by Data B id, . > >AGGREGATE OUTFILE=* > /pid = MIN(pid) > /Source = MIN(Source) > /NMatch = NU. The AGGREGATE needs a BREAK clause: AGGREGATE OUTFILE=* /BREAK = Data_B_id /* ADDED */ /pid = MIN(pid) /Source = MIN(Source) /NMatch = NU. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Free forum by Nabble | Edit this page |