I’m trying to MATCH FILES that do not have a common key. I can think of a solution but the word clumsy comes to mind. Perhaps someone out there may have experienced something similar and can pass on their experience.
There are two data files. This is data from an educational instruction. Each record in file A can be identified by student, course, subject, date (YYYYMMDD as A8). Each record in file B can be identified by student, course, subject, CRN (a number stored as A5). If, with either file, I were to AGGREGATE OUTFILE = * /BREAK = student course subject /COUNT = N. A FREQUENCY of count would show several thousand records with count = 1, and a few hundred with count = 2, 3, 4. Here's the clumsy part: I was thinking of computing a new variable from date, and another from CRN, having the same data type. I was thinking of sorting on these variables (date, CRN) and assigning a numeric value, but at this point I haven’t worked out how to do that. If someone knows of something that's more elegant that would be nice. |
You leave a lot of questions unanswered. I had to deal with
a similar situation years back when I had to match pre-test (September testing) with post-test (May testing) with the primary goal of identifying pupils who had been in a remediation program. In this context, I knew everyone at pre-test had to have a post-test unless their data had somehow was not entered. Similarly, each pupil should have a post-test unless their data was not entered or they dropped out during the school year. Duplicates at pre-test or post-test either repeats of the same student (or an error entry and a correct entry) or different students who are have the same name and similar info but have to be distinguished as two (or more) different cases. So, what is the relationship of file 1 to file 2? Is it a pre-test- post-test situation? What percentage of matches do you expect and why? For data you describe below in the aggregate results, why do you have counts greater than 1? Are these duplicates, errors, legitimate values (i.e., repeated measures on the same pupil -- why did this occur?). And so on. I assume that the variable "student" means that it is the pupil name and that it is the same type in both files (e.g., A25). You don't indicate how many cases you have in the two files (I assume that they don't match but, again, you don't explain why). A brute force method you might consider is the following (1) Create a file based on file 1 containing student, course, subject but with 1 at the end (i.e., student1, etc.) (2) Sort cases on student, course, and subject. (3) Create a new variable Case=$casenum. (4) Create a file based on file 2 containing student, course, subject but 2 at the end (i.e., student2, etc.) (5) Sort cases on student, course, and subject. (6) Create a new variable Case=$casenum. (7) Match the two files on the basis of the variable Case. (8) Test if Student1 eq Studen2, Course1 eq Course2, and Subject1 eq Subject2. Determine how many matches you have and determine if they are valid. Determine how many mismatches you have and determine why? Is it the same student taking multiple courses and/or multiple subjects? (9) It is unclear to me how many variables you expect to have for each pupil: do you expect to have only one course and one subject per pupil or multiple courses and/or multiple subjects for each student? My ultimate goal was to match pre-test with post-test standardized test scores and determine whether post- test score had increased beyond what was expected by regression to the mean. What is your ultimate goal? -Mike Palij New York University [hidden email] ----- Original Message ----- From: "Ron0z" <[hidden email]> To: <[hidden email]> Sent: Wednesday, April 13, 2016 10:15 PM Subject: MATCH FILES I’m trying to MATCH FILES that do not have a common key. I can think of a solution but the word clumsy comes to mind. Perhaps someone out there may have experienced something similar and can pass on their experience. There are two data files. This is data from an educational instruction. Each record in file A can be identified by student, course, subject, date (YYYYMMDD as A8). Each record in file B can be identified by student, course, subject, CRN (a number stored as A5). If, with either file, I were to AGGREGATE OUTFILE = * /BREAK = student course subject /COUNT = N. A FREQUENCY of count would show several thousand records with count = 1, and a few hundred with count = 2, 3, 4. Here's the clumsy part: I was thinking of computing a new variable from date, and another from CRN, having the same data type. I was thinking of sorting on these variables (date, CRN) and assigning a numeric value, but at this point I haven’t worked out how to do that. If someone knows of something that's more elegant that would be nice. -- View this message in context: http://spssx-discussion.1045642.n5.nabble.com/MATCH-FILES-tp5731933.html Sent from the SPSSX Discussion mailing list archive at Nabble.com. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Thanks for your response Mike. I’m pleasantly surprised to hear you have an educational background. I didn’t write fully as I wasn’t sure anyone would respond. My goal in this is to test the validity of data submitted to a reporting body (because I have retained a record of what was submitted) against changes staff may have made to the student records in the meantime.
I was thinking of using $casenum, but was not quite sure how to apply it. If everything were ideal file A and file B would have the identical number of records for each student, and that was to be my first check. If there were any discrepancies (more or less records in one file than the other) I would investigate the reasons. One file is an extract for the student database at one period. The other file is another extract at a different period in time, but targeting the same students. Differences are the result of staff changing the data by correcting errors, and I need to know what changes may have taken place. The only other difference is that one file has results and the other does not, but I’m not interested in results at this point. The aggregate shows a count = 1 for all records where BREAK = student course subject CRN in one file, or where BREAK = student course subject DATE in the other file. As you know, we cannot use different variables used with BY. Hence my thoughts of attempting to create a new variable for CRN and DATE that has the same data type. The reason for the duplicates is that some students repeat the same subject. They may have failed and wish to reenrol, or they may have discontinued then subsequently enrolled again. Student is studentID (A10). One file contains data from all of our students (Cert 1 to Bachelor), and has about 160,000 records. The other holds data from a subset of the other (Diploma and Adv Dip), and has about 14,000 records. This is how the files look for one student in each file: student course subject CRN CIT000001 C2-BC33 COMP214 10001 CIT000001 C2-BC33 COMP214 20003 CIT000001 C2-BC33 COMP214 40002 student course subject Date CIT000001 C2-BC33 COMP214 20150201 CIT000001 C2-BC33 COMP214 20150401 CIT000001 C2-BC33 COMP214 20150402 |
I think you are confusing things and I'll try to simplify what I
think you're requesting. (1) You say "The aggregate shows a count = 1 for all records where BREAK = student course subject CRN in one file, or where BREAK = student course subject DATE in the other file. As you know, we cannot use different variables used with BY. Hence my thoughts of attempting to create a new variable for CRN and DATE that has the same data type." (2) If there is only one pupil with a record in file 1 and one record in file 2, what is the problem? These should match up on "student"? (3) You present the situation where one subject has three "cases" in File 1 and three cases in File 2. Do you want: a single case that represents all of the variables at different times (you would do a cases-to-variables at some time to get a single case per pupil) or do you want three cases per pupil? In other words, what is the unit of analysis: the pupil or the course? File 1: student course subject CRN CIT000001 C2-BC33 COMP214 10001 CIT000001 C2-BC33 COMP214 20003 CIT000001 C2-BC33 COMP214 40002 File 2: student course subject Date CIT000001 C2-BC33 COMP214 20150201 CIT000001 C2-BC33 COMP214 20150401 CIT000001 C2-BC33 COMP214 20150402 It seems to me that you want one line/case per pupil but you are not really addressing what you want to do with the additional variables on the additional lines. Do you want to drop them or keep them? If it is one line/case per pupil, what would the data look like? -Mike Palij New York University [hidden email] ----- Original Message ----- From: "Ron0z" <[hidden email]> To: <[hidden email]> Sent: Wednesday, April 13, 2016 11:59 PM Subject: Re: MATCH FILES Thanks for your response Mike. I’m pleasantly surprised to hear you have an educational background. I didn’t write fully as I wasn’t sure anyone would respond. My goal in this is to test the validity of data submitted to a reporting body (because I have retained a record of what was submitted) against changes staff may have made to the student records in the meantime. I was thinking of using $casenum, but was not quite sure how to apply it. If everything were ideal file A and file B would have the identical number of records for each student, and that was to be my first check. If there were any discrepancies (more or less records in one file than the other) I would investigate the reasons. One file is an extract for the student database at one period. The other file is another extract at a different period in time, but targeting the same students. Differences are the result of staff changing the data by correcting errors, and I need to know what changes may have taken place. The only other difference is that one file has results and the other does not, but I’m not interested in results at this point. The aggregate shows a count = 1 for all records where BREAK = student course subject CRN in one file, or where BREAK = student course subject DATE in the other file. As you know, we cannot use different variables used with BY. Hence my thoughts of attempting to create a new variable for CRN and DATE that has the same data type. The reason for the duplicates is that some students repeat the same subject. They may have failed and wish to reenrol, or they may have discontinued then subsequently enrolled again. Student is studentID (A10). One file contains data from all of our students (Cert 1 to Bachelor), and has about 160,000 records. The other holds data from a subset of the other (Diploma and Adv Dip), and has about 14,000 records. This is how the files look for one student in each file: student course subject CRN CIT000001 C2-BC33 COMP214 10001 CIT000001 C2-BC33 COMP214 20003 CIT000001 C2-BC33 COMP214 40002 student course subject Date CIT000001 C2-BC33 COMP214 20150201 CIT000001 C2-BC33 COMP214 20150401 CIT000001 C2-BC33 COMP214 20150402 -- View this message in context: http://spssx-discussion.1045642.n5.nabble.com/MATCH-FILES-tp5731933p5731935.html Sent from the SPSSX Discussion mailing list archive at Nabble.com. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Administrator
|
In reply to this post by Ron0z
I'd prefer not to roll around in the mud with this one until you clarify the relationship between Date and CRN and why there should be any match on that basis. Meanwhile you might want to wrap your head around the following threads:
http://spssx-discussion.1045642.n5.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1068821&query=CARTESIAN&n=1068821 Keep an umbrella handy in case your head explodes.
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
Free forum by Nabble | Edit this page |