|
I have two database extracts that I am trying to merge, one of which
contains enrolled students and the other contains approximately the same group of students when they were applicants. In the Enrollment file students are uniquely identified by 'enroll_id'. In the Applicant file there is a primary ID, 'applicant_id1', but there may be up to two other IDs which were issued and updated provisionally as the student was going through the application process. The rub is that 'enroll_id' may match any one of the applicant IDs, not necessarily the primary one. So logically what I want to do is "Match by 'enroll_id' and ('applicant_id1' OR 'applicant_id2' OR 'applicant_id3')". It seems like this shouldn't be too hard to automate, but I'm having trouble this morning wrapping my mind around it... Any suggestions? Thanks, Dan R. -- Daniel Robertson Senior Research and Planning Associate Institutional Research and Planning Cornell University, Ithaca NY 14853-2801 607.255.9642 / irp.cornell.edu |
|
Daniel,
I'm going to assume that you are very familiar with syntaxs so that a discussion of approach would be sufficient. I'm also going to assume that a student appears exactly once in each file. My initial approach would be to add the two files together. (I would probably keep only the id variables but keep all variables is probably ok as well in that it doesn't affect what follows). Use the /In option or create a source id variable to identify which dataset the records came from. So what you would have is missing data for the id variables from the other file. (As well as all variables not common to the two files). Next, for the enrollment database records (which will have sysmis values for the three variables from the applicant file), copy the value of the enrollment id variable to the three variables from the applicant file. Next, sort the file by each of the applicant file id variables and the record source variable and track the results after each sort. Think about the results of the first sort. Some records will be paired up--two records having the same id number and different source file ids. For those records, the correct applicant database variable to use was the first. That fact needs to be recorded. Other records won't be paired up, which means that one of the two remaining applicant database variables must be the correct one. So. Sort the whole file by the second variable. Again, some records will be paired up. For those records you need to record that the second variable was the correct one. So you need three passes through the file, each time sorting by a different variable. A frequencies on the variable that records the correct id variable to use should have three values. If not, go look for the problem. The final step is to 'spread' the value of the id variable across both records in the pair. Here is the code for the first sort pass. Let Enid be the enrollment file id, Apid1, Apid2, Apid3 be the three applicant file ids, Source by the source of the record (1=enrollment, 2=applicant). Sort file by apid1 source. Compute apidvar=0. If (apid1 eq lag(apid1)) apidvar=1. Sort file by apid1 (a) match (d). If (apid1 eq lag(apid1) apidvar=1. This chunk is repeated three times. The second time apidvar=2 for successful cases. Next, you can write a do if structure to transfer the value from the correct applicant file variable to enrollment file variable on the applicant file record. (You could also create a new variable on both records for this operation). So to check you should be able to sort by the enrollment file id variable and find all records paired up. (Or, at least the same ones as were paired up across the three sorts in the preceding paragraph.) Last, I suspect you really want to have just one record for each person. So use the Casestovars command to reformat the datafile. (Before doing that, you may want to delete some variables using the Delete vars command.) So there you have it. I haven't tested this so it may not work but this is how I'd start. Gene Maguin |
|
Thanks Gene, this looks like a very reasonable approach. I'll give it a try.
Dan R. Gene Maguin wrote: > Daniel, > > I'm going to assume that you are very familiar with syntaxs so that a > discussion of approach would be sufficient. I'm also going to assume that a > student appears exactly once in each file. My initial approach would be to > add the two files together. (I would probably keep only the id variables but > keep all variables is probably ok as well in that it doesn't affect what > follows). Use the /In option or create a source id variable to identify > which dataset the records came from. So what you would have is missing data > for the id variables from the other file. (As well as all variables not > common to the two files). > > Next, for the enrollment database records (which will have sysmis values for > the three variables from the applicant file), copy the value of the > enrollment id variable to the three variables from the applicant file. > > Next, sort the file by each of the applicant file id variables and the > record source variable and track the results after each sort. Think about > the results of the first sort. Some records will be paired up--two records > having the same id number and different source file ids. For those records, > the correct applicant database variable to use was the first. That fact > needs to be recorded. Other records won't be paired up, which means that one > of the two remaining applicant database variables must be the correct one. > So. Sort the whole file by the second variable. Again, some records will be > paired up. For those records you need to record that the second variable was > the correct one. So you need three passes through the file, each time > sorting by a different variable. A frequencies on the variable that records > the correct id variable to use should have three values. If not, go look for > the problem. The final step is to 'spread' the value of the id variable > across both records in the pair. > > Here is the code for the first sort pass. Let > Enid be the enrollment file id, > Apid1, Apid2, Apid3 be the three applicant file ids, > Source by the source of the record (1=enrollment, 2=applicant). > > Sort file by apid1 source. > Compute apidvar=0. > If (apid1 eq lag(apid1)) apidvar=1. > > Sort file by apid1 (a) match (d). > If (apid1 eq lag(apid1) apidvar=1. > > This chunk is repeated three times. The second time apidvar=2 for successful > cases. > > > Next, you can write a do if structure to transfer the value from the correct > applicant file variable to enrollment file variable on the applicant file > record. (You could also create a new variable on both records for this > operation). So to check you should be able to sort by the enrollment file id > variable and find all records paired up. (Or, at least the same ones as were > paired up across the three sorts in the preceding paragraph.) > > Last, I suspect you really want to have just one record for each person. So > use the Casestovars command to reformat the datafile. (Before doing that, > you may want to delete some variables using the Delete vars command.) > > So there you have it. I haven't tested this so it may not work but this is > how I'd start. > > Gene Maguin > > -- Daniel Robertson Senior Research and Planning Associate Institutional Research and Planning Cornell University, Ithaca NY 14853-2801 607.255.9642 / irp.cornell.edu |
|
In reply to this post by Daniel Robertson
At 09:29 AM 7/12/2007, Daniel Robertson wrote:
>I have two database extracts that I am trying to merge, one of which >contains enrolled students and the other contains approximately the >same group of students when they were applicants. In the Enrollment >file students are uniquely identified by 'enroll_id'. In the Applicant >file there is a primary ID, 'applicant_id1', but there may be up to >two other IDs which were issued and updated provisionally as the >student was going through the application process. The rub is that >'enroll_id' may match any one of the applicant IDs, not necessarily >the primary one. Gene's given you a workable solution. It requires sorting the data three times; but with three keys, something like that is inevitable. You *can* combine the three sorting operations into one step, by using XSAVE to create three copies of each Applicant record, one each in which 'enroll_id' is loaded from each of the three candidate key variables in the Applicant file. Then sort the resulting file by that 'enroll_id', MATCH FILES with the Enrollment file, and discard any Applicant records that don't match. Now, that's the simplest possible case. You may need logic in case, say, the same ID value occurs in more than one of the Applicant-record fields. But it's another way to go. Sorry; no code this time. |
| Free forum by Nabble | Edit this page |
