SPSSX Discussion

MATCH FILES

Classic

List

Threaded

5 messages Options

Ron0z

MATCH FILES

I’m trying to MATCH FILES that do not have a common key. I can think of a solution but the word clumsy comes to mind. Perhaps someone out there may have experienced something similar and can pass on their experience.

There are two data files. This is data from an educational instruction.

Each record in file A can be identified by student, course, subject, date (YYYYMMDD as A8).
Each record in file B can be identified by student, course, subject, CRN (a number stored as A5).

If, with either file, I were to
AGGREGATE OUTFILE = *
/BREAK = student course subject
/COUNT = N.
A FREQUENCY of count would show several thousand records with count = 1, and a few hundred with count = 2, 3, 4.

Here's the clumsy part: I was thinking of computing a new variable from date, and another from CRN, having the same data type. I was thinking of sorting on these variables (date, CRN) and assigning a numeric value, but at this point I haven’t worked out how to do that.

If someone knows of something that's more elegant that would be nice.

Mike

Re: MATCH FILES

You leave a lot of questions unanswered. I had to deal with
a similar situation years back when I had to match pre-test
(September testing) with post-test (May testing) with the primary
goal of identifying pupils who had been in a remediation program.
In this context, I knew everyone at pre-test had to have a post-test
unless their data had somehow was not entered. Similarly,
each pupil should have a post-test unless their data was not
entered or they dropped out during the school year.

Duplicates at pre-test or post-test either repeats of the same
student (or an error entry and a correct entry) or different
students who are have the same name and similar info
but have to be distinguished as two (or more) different cases.

So, what is the relationship of file 1 to file 2? Is it a pre-test-
post-test situation? What percentage of matches do you
expect and why? For data you describe below in the aggregate
results, why do you have counts greater than 1? Are these
duplicates, errors, legitimate values (i.e., repeated measures
on the same pupil -- why did this occur?). And so on.

I assume that the variable "student" means that it is the pupil
name and that it is the same type in both files (e.g., A25).

You don't indicate how many cases you have in the two files
(I assume that they don't match but, again, you don't explain
why). A brute force method you might consider is the following

(1) Create a file based on file 1 containing student, course, subject
but with 1 at the end (i.e., student1, etc.)

(2) Sort cases on student, course, and subject.

(3) Create a new variable Case=$casenum.

(4) Create a file based on file 2 containing student, course, subject
but 2 at the end (i.e., student2, etc.)

(5) Sort cases on student, course, and subject.

(6) Create a new variable Case=$casenum.

(7) Match the two files on the basis of the variable Case.

(8) Test if Student1 eq Studen2, Course1 eq Course2,
and Subject1 eq Subject2. Determine how many
matches you have and determine if they are valid.
Determine how many mismatches you have and
determine why? Is it the same student taking multiple
courses and/or multiple subjects?

(9) It is unclear to me how many variables you expect
to have for each pupil: do you expect to have only
one course and one subject per pupil or multiple
courses and/or multiple subjects for each student?

My ultimate goal was to match pre-test with post-test
standardized test scores and determine whether post-
test score had increased beyond what was expected
by regression to the mean. What is your ultimate goal?

-Mike Palij
New York University
[hidden email]

----- Original Message -----
From: "Ron0z" <[hidden email]>
To: <[hidden email]>
Sent: Wednesday, April 13, 2016 10:15 PM
Subject: MATCH FILES

I’m trying to MATCH FILES that do not have a common key. I can think of
a
solution but the word clumsy comes to mind. Perhaps someone out there
may
have experienced something similar and can pass on their experience.

There are two data files. This is data from an educational instruction.

Each record in file A can be identified by student, course, subject,
date
(YYYYMMDD as A8).
Each record in file B can be identified by student, course, subject, CRN
(a
number stored as A5).

If, with either file, I were to
AGGREGATE OUTFILE = *
/BREAK = student course subject
/COUNT = N.
A FREQUENCY of count would show several thousand records with count = 1,
and
a few hundred with count = 2, 3, 4.

Here's the clumsy part: I was thinking of computing a new variable from
date, and another from CRN, having the same data type. I was thinking
of
sorting on these variables (date, CRN) and assigning a numeric value,
but at
this point I haven’t worked out how to do that.

If someone knows of something that's more elegant that would be nice.

--
View this message in context:
http://spssx-discussion.1045642.n5.nabble.com/MATCH-FILES-tp5731933.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Ron0z

Re: MATCH FILES

Thanks for your response Mike. I’m pleasantly surprised to hear you have an educational background. I didn’t write fully as I wasn’t sure anyone would respond. My goal in this is to test the validity of data submitted to a reporting body (because I have retained a record of what was submitted) against changes staff may have made to the student records in the meantime.

I was thinking of using $casenum, but was not quite sure how to apply it.

If everything were ideal file A and file B would have the identical number of records for each student, and that was to be my first check. If there were any discrepancies (more or less records in one file than the other) I would investigate the reasons. One file is an extract for the student database at one period. The other file is another extract at a different period in time, but targeting the same students. Differences are the result of staff changing the data by correcting errors, and I need to know what changes may have taken place. The only other difference is that one file has results and the other does not, but I’m not interested in results at this point.

The aggregate shows a count = 1 for all records where BREAK = student course subject CRN in one file, or where BREAK = student course subject DATE in the other file. As you know, we cannot use different variables used with BY. Hence my thoughts of attempting to create a new variable for CRN and DATE that has the same data type.

The reason for the duplicates is that some students repeat the same subject. They may have failed and wish to reenrol, or they may have discontinued then subsequently enrolled again.

Student is studentID (A10).

One file contains data from all of our students (Cert 1 to Bachelor), and has about 160,000 records. The other holds data from a subset of the other (Diploma and Adv Dip), and has about 14,000 records.

This is how the files look for one student in each file:

student course subject CRN
CIT000001 C2-BC33 COMP214 10001
CIT000001 C2-BC33 COMP214 20003
CIT000001 C2-BC33 COMP214 40002

student course subject Date
CIT000001 C2-BC33 COMP214 20150201
CIT000001 C2-BC33 COMP214 20150401
CIT000001 C2-BC33 COMP214 20150402

Mike

Re: MATCH FILES

I think you are confusing things and I'll try to simplify what I
think you're requesting.

(1) You say "The aggregate shows a count = 1 for all records
where BREAK = student course subject CRN in one file, or
where BREAK = student course subject DATE in the
other file. As you know, we cannot use different variables used with
BY.
Hence my thoughts of attempting to create a new variable for CRN and
DATE
that has the same data type."

(2) If there is only one pupil with a record in file 1 and
one record in file 2, what is the problem? These should
match up on "student"?

(3) You present the situation where one subject has three
"cases" in File 1 and three cases in File 2. Do you
want: a single case that represents all of the variables at different
times (you would do a cases-to-variables at some time to
get a single case per pupil) or do you want three cases
per pupil? In other words, what is the unit of analysis:
the pupil or the course?

File 1:
student course subject CRN
CIT000001 C2-BC33 COMP214 10001
CIT000001 C2-BC33 COMP214 20003
CIT000001 C2-BC33 COMP214 40002

File 2:
student course subject Date
CIT000001 C2-BC33 COMP214 20150201
CIT000001 C2-BC33 COMP214 20150401
CIT000001 C2-BC33 COMP214 20150402

It seems to me that you want one line/case per pupil
but you are not really addressing what you want to
do with the additional variables on the additional lines.
Do you want to drop them or keep them?

If it is one line/case per pupil, what would the data
look like?

-Mike Palij
New York University
[hidden email]

----- Original Message -----
From: "Ron0z" <[hidden email]>
To: <[hidden email]>
Sent: Wednesday, April 13, 2016 11:59 PM
Subject: Re: MATCH FILES

Thanks for your response Mike. I’m pleasantly surprised to hear you
have an
educational background. I didn’t write fully as I wasn’t sure anyone
would
respond. My goal in this is to test the validity of data submitted to a
reporting body (because I have retained a record of what was submitted)
against changes staff may have made to the student records in the
meantime.

I was thinking of using $casenum, but was not quite sure how to apply
it.

If everything were ideal file A and file B would have the identical
number
of records for each student, and that was to be my first check. If
there
were any discrepancies (more or less records in one file than the other)
I
would investigate the reasons. One file is an extract for the student
database at one period. The other file is another extract at a
different
period in time, but targeting the same students. Differences are the
result
of staff changing the data by correcting errors, and I need to know what
changes may have taken place. The only other difference is that one
file
has results and the other does not, but I’m not interested in results at
this point.

The aggregate shows a count = 1 for all records where BREAK = student
course
subject CRN in one file, or where BREAK = student course subject DATE in
the
other file. As you know, we cannot use different variables used with
BY.
Hence my thoughts of attempting to create a new variable for CRN and
DATE
that has the same data type.

The reason for the duplicates is that some students repeat the same
subject.
They may have failed and wish to reenrol, or they may have discontinued
then
subsequently enrolled again.

Student is studentID (A10).

One file contains data from all of our students (Cert 1 to Bachelor),
and
has about 160,000 records. The other holds data from a subset of the
other
(Diploma and Adv Dip), and has about 14,000 records.

This is how the files look for one student in each file:

student course subject CRN
CIT000001 C2-BC33 COMP214 10001
CIT000001 C2-BC33 COMP214 20003
CIT000001 C2-BC33 COMP214 40002

student course subject Date
CIT000001 C2-BC33 COMP214 20150201
CIT000001 C2-BC33 COMP214 20150401
CIT000001 C2-BC33 COMP214 20150402

--
View this message in context:
http://spssx-discussion.1045642.n5.nabble.com/MATCH-FILES-tp5731933p5731935.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

David Marso

Re: MATCH FILES

Administrator

In reply to this post by Ron0z

I'd prefer not to roll around in the mud with this one until you clarify the relationship between Date and CRN and why there should be any match on that basis. Meanwhile you might want to wrap your head around the following threads:

http://spssx-discussion.1045642.n5.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1068821&query=CARTESIAN&n=1068821

Keep an umbrella handy in case your head explodes.

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"