|
Hello,
I commonly have a problem where a single case has multiple variables for a given factor(i.e., race), with some valid data and some missing. When these values are numeric, it is usually easy to do a max or min function to compile these data across observations. Right now I have a similar problem, but the variables are string variables and I'm not sure how to manage this most efficiently. Please advise. An example: Student Corp1 Schl1 Corp2 Schl2 Corp3 Schl3 1 XX -- -- Smith -- -- 2 -- -- YZ -- -- Johnson 3 -- Baker -- -- CD Baker So, what I want to do is create a set of variables that picks out the valid information from the missing, or from multiple repeated values to create a "cleaned" and complete set. Thank you for any advice. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
Adam
I'm not sure how you could identify invalid string values, unless you have a list of the valid values. If you do, you could use something like: DO IF (race EQ 'Black' | race EQ 'White' | race EQ 'Other'). ELSE. COMPUTE race = 'Invalid'. END IF. EXECUTE. -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Adam Sent: 30 June 2010 21:03 To: [hidden email] Subject: Cleaning data Hello, I commonly have a problem where a single case has multiple variables for a given factor(i.e., race), with some valid data and some missing. When these values are numeric, it is usually easy to do a max or min function to compile these data across observations. Right now I have a similar problem, but the variables are string variables and I'm not sure how to manage this most efficiently. Please advise. An example: Student Corp1 Schl1 Corp2 Schl2 Corp3 Schl3 1 XX -- -- Smith -- -- 2 -- -- YZ -- -- Johnson 3 -- Baker -- -- CD Baker So, what I want to do is create a set of variables that picks out the valid information from the missing, or from multiple repeated values to create a "cleaned" and complete set. Thank you for any advice. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
In reply to this post by Adam M.-3
Adam-
Would there be any times where the data would disagree? For instance, what if for student 3, school 95 said "Brown" but school 97 said "Jones"? I think that VARSTOCASES might be your best bet. That would fold all the Town variables and all the school variables each into a single variable. Then you can aggregate the town variable, aggregate the school variable, and delete the duplicates. If it's important to know WHICH town or school variable the data came from, the code will need a bit of tweaking. If there could be different data from different sources (see my question above), then the code will also need a bit of tweaking. Hope that helps, -Eric /* ===== Syntax ===== */ DATA LIST FREE / Student(F1) Town95(A10) School95(A10) Town96(A10) School96(A10) Town97(A10) School97(A10). BEGIN DATA 1, LA, , , Jones , , 2, , Smith, , , Chicago, 3, NYC, Brown, , , , Brown END DATA. dataset name sample window=front. VARSTOCASES /make Town FROM Town95 Town96 Town97 /make School FROM School96 School96 School97 /index=ind(3). * The number 3 represents the number of variables in the MAKE subcommand. Sort cases student town. AGGREGATE /OUTFILE=* MODE=ADDVARIABLES /BREAK=Student /Town_last=LAST(Town). Sort cases student School. AGGREGATE /OUTFILE=* MODE=ADDVARIABLES /BREAK=Student /School_last=LAST(School). * Identify Duplicate Cases. SORT CASES BY Student(A). MATCH FILES /FILE=* /BY Student /LAST=PrimaryLast. Select if PrimaryLast. EXECUTE. * Delete extra variables. match files file=* /drop ind town school primarylast. exe. /* ===== End Syntax ===== */ -----Original Message----- From: Adam M. [mailto:[hidden email]] Sent: Wednesday, June 30, 2010 8:11 PM Subject: Cleaning data Hello! I seem to have a recurring problem when it comes to cleaning data after merging numerous files with student records. Many times I have multiple observations (eg, schools attended) taken from different data sources and I want to merge these into a single observation. I have string variables and I want to take the "best" information from multiple observations to make a more complete and pared down data set. For example: Student Town95 School95 Town96 School96 Town97 School97 1 LA -- -- Jones -- -- 2 -- Smith -- -- Chicago -- 3 NYC Brown -- -- -- Brown I'd like the above to turn into: Student Town School 1 LA Jones 2 Chicago Smith 3 NYC Brown So, I want to create a pair of variables (town & school). To accomplish this I need to pick up the non-missing information from these sets of observations. I'm familiar with running a max or min command across numeric values, but in this case with strings that doesn't seem to work. Any ideas for automating this process with strings? Thanks for your help! ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
|
Adam- So, what’s the rule for picking the correct school for the
student? If it’s the Town/School variable with the highest number,
then it’s a pretty easy fix. Note that I changed on of the schools
to Zebra. I added two compute commands and then adjusted the sorting
before the aggregation. So that will take the last non-blank value in
your original variable order. If it’s something else, then you can adjust the sort
immediately before “Sort cases student school_flg” to get the value
you want at the bottom of the list – suppose you had a date field
associated with the data and you wanted the most recent or something like that. Also, try to “reply-to-all” to that the whole thread
gets saved in the listserv. J Hope that helps, -Eric /* ===== Syntax ===== */ DATA LIST FREE / Student(F1) Town95(A10) School95(A10)
Town96(A10) School96(A10) Town97(A10) School97(A10). BEGIN DATA 1, LA, , , Jones , , 2, , Smith, , , Chicago, 3, NYC, Zebra, , , , Brown END DATA. dataset name sample window=front. VARSTOCASES /make Town FROM Town95 Town96 Town97 /make School FROM School95 School96 School97 /index=ind(3). * The number 3 represents the number of variables in the MAKE subcommand. compute town_flg = (town NE ""). compute school_flg = (school NE ""). Sort cases student town_flg. AGGREGATE /OUTFILE=* MODE=ADDVARIABLES /BREAK=Student /Town_last=LAST(Town). sort cases student ind. Sort cases student school_flg. AGGREGATE /OUTFILE=* MODE=ADDVARIABLES /BREAK=Student /School_last=LAST(School). * Identify Duplicate Cases. SORT CASES BY Student(A). MATCH FILES /FILE=* /BY Student /LAST=PrimaryLast. Select if PrimaryLast. EXECUTE. * Delete extra variables. match files file=* /drop ind town school primarylast town_flg school_flg. exe. /* ===== End Syntax ===== */ From: Adam Maltese
[mailto:[hidden email]] Eric, On Thu, Jul 1, 2010 at 11:05 AM, Langston, Eric <[hidden email]> wrote: Adam- 1, LA, , , Jones , , END DATA.
Hello! I seem to have a recurring problem when it comes to
cleaning data |
|
Really no rule for picking the correct school within the current data set. Later on I need to add in attendance records and determine which school a student attended for the most days during a given school year - then that will likely become their "home" school for that year.
Cheers, Adam On Thu, Jul 1, 2010 at 3:04 PM, Langston, Eric <[hidden email]> wrote:
|
| Free forum by Nabble | Edit this page |
