Cleaning data

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Cleaning data

Adam M.-3
Hello! I seem to have a recurring problem when it comes to cleaning data
after merging numerous files with student records. Many times I have
multiple observations (eg, schools attended) taken from different data
sources and I want to merge these into a single observation. I have string
variables and I want to take the "best" information from multiple
observations to make a more complete and pared down data set. For example:

Student   Town95 School95 Town96 School96 Town97 School97
1       LA   --  --  Jones  --  --
2       --  Smith  --  --  Chicago --
3       NYC Brown --  --  -- Brown

I'd like the above to turn into:

Student   Town School
1       LA  Jones
2       Chicago  Smith
3       NYC Brown


So, I want to create a pair of variables (town & school). To accomplish this
I need to pick up the non-missing information from these sets of
observations. I'm familiar with running a max or min command across numeric
values, but in this case with strings that doesn't seem to work. Any ideas
for automating this process with strings? Thanks for your help!

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Cleaning data

Ruben Geert van den Berg
Dear Adam,
 
If I understand your situation correctly, I've a question: if there's more than one valid 'School' in the original data, then should the 'best School' contain the latest valid value? If so, the syntax below should work for you.
 
HTH,

Ruben van den Berg
Consultant Models & Methods
TNS NIPO
Email: [hidden email]
Mobiel: +31 6 24641435
Telefoon: +31 20 522 5738
Internet: www.tns-nipo.com
 
*Create test data.

data list free/s_95 t_95 s_96 t_96(4a1).
begin data
'' 'a' 'b' ''
'c' '' '' 'd'
'd' 'e' 'f' 'g'
'' '' 'h' 'i'
end data.
 
*Declare new ('best') variables, containing last valid value on s (School) or t (Town).
 
string s_all t_all(a1).
 
*Note that s will first be replaced by s_95 and then by s_96. If both are valid, s_96 (the last valid value) will prevail.
 
do repeat s=s_95 s_96/t=t_95 t_96.
if s ne '' s_all = s.
if t ne '' t_all = t.
end repeat.
 
execute./*Not really needed, but it will make the result visible.


 

> Date: Wed, 30 Jun 2010 21:10:49 -0400
> From: [hidden email]
> Subject: Cleaning data
> To: [hidden email]
>
> Hello! I seem to have a recurring problem when it comes to cleaning data
> after merging numerous files with student records. Many times I have
> multiple observations (eg, schools attended) taken from different data
> sources and I want to merge these into a single observation. I have string
> variables and I want to take the "best" information from multiple
> observations to make a more complete and pared down data set. For example:
>
> Student Town95 School95 Town96 School96 Town97 School97
> 1 LA -- -- Jones -- --
> 2 -- Smith -- -- Chicago --
> 3 NYC Brown -- -- -- Brown
>
> I'd like the above to turn into:
>
> Student Town School
> 1 LA Jones
> 2 Chicago Smith
> 3 NYC Brown
>
>
> So, I want to create a pair of variables (town & school). To accomplish this
> I need to pick up the non-missing information from these sets of
> observations. I'm familiar with running a max or min command across numeric
> values, but in this case with strings that doesn't seem to work. Any ideas
> for automating this process with strings? Thanks for your help!
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> [hidden email] (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD


New Windows 7: Find the right PC for you. Learn more.