SPSSX Discussion

Matching files on one of three possible ID's

Classic

List

Threaded

4 messages Options

Daniel Robertson

Matching files on one of three possible ID's

I have two database extracts that I am trying to merge, one of which
contains enrolled students and the other contains approximately the same
group of students when they were applicants. In the Enrollment file
students are uniquely identified by 'enroll_id'. In the Applicant file
there is a primary ID, 'applicant_id1', but there may be up to two other
IDs which were issued and updated provisionally as the student was going
through the application process. The rub is that 'enroll_id' may match
any one of the applicant IDs, not necessarily the primary one.

So logically what I want to do is "Match by 'enroll_id' and
('applicant_id1' OR 'applicant_id2' OR 'applicant_id3')".

It seems like this shouldn't be too hard to automate, but I'm having
trouble this morning wrapping my mind around it... Any suggestions?

Thanks,
Dan R.

--
Daniel Robertson
Senior Research and Planning Associate
Institutional Research and Planning
Cornell University, Ithaca NY 14853-2801
607.255.9642 / irp.cornell.edu

Maguin, Eugene

Re: Matching files on one of three possible ID's

Daniel,

I'm going to assume that you are very familiar with syntaxs so that a
discussion of approach would be sufficient. I'm also going to assume that a
student appears exactly once in each file. My initial approach would be to
add the two files together. (I would probably keep only the id variables but
keep all variables is probably ok as well in that it doesn't affect what
follows). Use the /In option or create a source id variable to identify
which dataset the records came from. So what you would have is missing data
for the id variables from the other file. (As well as all variables not
common to the two files).

Next, for the enrollment database records (which will have sysmis values for
the three variables from the applicant file), copy the value of the
enrollment id variable to the three variables from the applicant file.

Next, sort the file by each of the applicant file id variables and the
record source variable and track the results after each sort. Think about
the results of the first sort. Some records will be paired up--two records
having the same id number and different source file ids. For those records,
the correct applicant database variable to use was the first. That fact
needs to be recorded. Other records won't be paired up, which means that one
of the two remaining applicant database variables must be the correct one.
So. Sort the whole file by the second variable. Again, some records will be
paired up. For those records you need to record that the second variable was
the correct one. So you need three passes through the file, each time
sorting by a different variable. A frequencies on the variable that records
the correct id variable to use should have three values. If not, go look for
the problem. The final step is to 'spread' the value of the id variable
across both records in the pair.

Here is the code for the first sort pass. Let
Enid be the enrollment file id,
Apid1, Apid2, Apid3 be the three applicant file ids,
Source by the source of the record (1=enrollment, 2=applicant).

Sort file by apid1 source.
Compute apidvar=0.
If (apid1 eq lag(apid1)) apidvar=1.

Sort file by apid1 (a) match (d).
If (apid1 eq lag(apid1) apidvar=1.

This chunk is repeated three times. The second time apidvar=2 for successful
cases.

Next, you can write a do if structure to transfer the value from the correct
applicant file variable to enrollment file variable on the applicant file
record. (You could also create a new variable on both records for this
operation). So to check you should be able to sort by the enrollment file id
variable and find all records paired up. (Or, at least the same ones as were
paired up across the three sorts in the preceding paragraph.)

Last, I suspect you really want to have just one record for each person. So
use the Casestovars command to reformat the datafile. (Before doing that,
you may want to delete some variables using the Delete vars command.)

So there you have it. I haven't tested this so it may not work but this is
how I'd start.

Gene Maguin

Daniel Robertson

Re: Matching files on one of three possible ID's

Thanks Gene, this looks like a very reasonable approach. I'll give it a try.
Dan R.

Gene Maguin wrote:

> Daniel,
>
> I'm going to assume that you are very familiar with syntaxs so that a
> discussion of approach would be sufficient. I'm also going to assume that a
> student appears exactly once in each file. My initial approach would be to
> add the two files together. (I would probably keep only the id variables but
> keep all variables is probably ok as well in that it doesn't affect what
> follows). Use the /In option or create a source id variable to identify
> which dataset the records came from. So what you would have is missing data
> for the id variables from the other file. (As well as all variables not
> common to the two files).
>
> Next, for the enrollment database records (which will have sysmis values for
> the three variables from the applicant file), copy the value of the
> enrollment id variable to the three variables from the applicant file.
>
> Next, sort the file by each of the applicant file id variables and the
> record source variable and track the results after each sort. Think about
> the results of the first sort. Some records will be paired up--two records
> having the same id number and different source file ids. For those records,
> the correct applicant database variable to use was the first. That fact
> needs to be recorded. Other records won't be paired up, which means that one
> of the two remaining applicant database variables must be the correct one.
> So. Sort the whole file by the second variable. Again, some records will be
> paired up. For those records you need to record that the second variable was
> the correct one. So you need three passes through the file, each time
> sorting by a different variable. A frequencies on the variable that records
> the correct id variable to use should have three values. If not, go look for
> the problem. The final step is to 'spread' the value of the id variable
> across both records in the pair.
>
> Here is the code for the first sort pass. Let
> Enid be the enrollment file id,
> Apid1, Apid2, Apid3 be the three applicant file ids,
> Source by the source of the record (1=enrollment, 2=applicant).
>
> Sort file by apid1 source.
> Compute apidvar=0.
> If (apid1 eq lag(apid1)) apidvar=1.
>
> Sort file by apid1 (a) match (d).
> If (apid1 eq lag(apid1) apidvar=1.
>
> This chunk is repeated three times. The second time apidvar=2 for successful
> cases.
>
>
> Next, you can write a do if structure to transfer the value from the correct
> applicant file variable to enrollment file variable on the applicant file
> record. (You could also create a new variable on both records for this
> operation). So to check you should be able to sort by the enrollment file id
> variable and find all records paired up. (Or, at least the same ones as were
> paired up across the three sorts in the preceding paragraph.)
>
> Last, I suspect you really want to have just one record for each person. So
> use the Casestovars command to reformat the datafile. (Before doing that,
> you may want to delete some variables using the Delete vars command.)
>
> So there you have it. I haven't tested this so it may not work but this is
> how I'd start.
>
> Gene Maguin
>
>

--
Daniel Robertson
Senior Research and Planning Associate
Institutional Research and Planning
Cornell University, Ithaca NY 14853-2801
607.255.9642 / irp.cornell.edu

Richard Ristow

Re: Matching files on one of three possible ID's

In reply to this post by Daniel Robertson

At 09:29 AM 7/12/2007, Daniel Robertson wrote:

>I have two database extracts that I am trying to merge, one of which
>contains enrolled students and the other contains approximately the
>same group of students when they were applicants. In the Enrollment
>file students are uniquely identified by 'enroll_id'. In the Applicant
>file there is a primary ID, 'applicant_id1', but there may be up to
>two other IDs which were issued and updated provisionally as the
>student was going through the application process. The rub is that
>'enroll_id' may match any one of the applicant IDs, not necessarily
>the primary one.

Gene's given you a workable solution. It requires sorting the data
three times; but with three keys, something like that is inevitable.

You *can* combine the three sorting operations into one step, by using
XSAVE to create three copies of each Applicant record, one each in
which 'enroll_id' is loaded from each of the three candidate key
variables in the Applicant file. Then sort the resulting file by that
'enroll_id', MATCH FILES with the Enrollment file, and discard any
Applicant records that don't match.

Now, that's the simplest possible case. You may need logic in case,
say, the same ID value occurs in more than one of the Applicant-record
fields. But it's another way to go.

Sorry; no code this time.