Cleaning data

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Cleaning data

Adam M.-3
Hello,

I commonly have a problem where a single case has multiple variables for a
given factor(i.e., race), with some valid data and some missing. When these
values are numeric, it is usually easy to do a max or min function to
compile these data across observations. Right now I have a similar problem,
but the variables are string variables and I'm not sure how to manage this
most efficiently. Please advise.

An example:

Student Corp1 Schl1 Corp2 Schl2 Corp3 Schl3
1       XX    --     --    Smith --  --
2       --    --    YZ    --    --   Johnson
3       --   Baker  --    --  CD  Baker

So, what I want to do is create a set of variables that picks out the valid
information from the missing, or from multiple repeated values to create a
"cleaned" and complete set. Thank you for any advice.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Cleaning data

Garry Gelade
Adam

I'm not sure how you could identify invalid string values, unless you have a
list of the valid values.  If you do, you could use something like:

DO IF (race EQ 'Black' | race EQ 'White' | race EQ 'Other').
ELSE.
        COMPUTE race = 'Invalid'.
END IF.
EXECUTE.

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Adam
Sent: 30 June 2010 21:03
To: [hidden email]
Subject: Cleaning data

Hello,

I commonly have a problem where a single case has multiple variables for a
given factor(i.e., race), with some valid data and some missing. When these
values are numeric, it is usually easy to do a max or min function to
compile these data across observations. Right now I have a similar problem,
but the variables are string variables and I'm not sure how to manage this
most efficiently. Please advise.

An example:

Student Corp1 Schl1 Corp2 Schl2 Corp3 Schl3
1       XX    --     --    Smith --  --
2       --    --    YZ    --    --   Johnson
3       --   Baker  --    --  CD  Baker

So, what I want to do is create a set of variables that picks out the valid
information from the missing, or from multiple repeated values to create a
"cleaned" and complete set. Thank you for any advice.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Cleaning data

Langston, Eric
In reply to this post by Adam M.-3
Adam-

Would there be any times where the data would disagree?  For instance,
what if for student 3, school 95 said "Brown" but school 97 said
"Jones"?

I think that VARSTOCASES might be your best bet.  That would fold all
the Town variables and all the school variables each into a single
variable.  Then you can aggregate the town variable, aggregate the
school variable, and delete the duplicates.

If it's important to know WHICH town or school variable the data came
from, the code will need a bit of tweaking.
If there could be different data from different sources (see my question
above), then the code will also need a bit of tweaking.

Hope that helps,
-Eric



/* ===== Syntax ===== */

DATA LIST FREE / Student(F1) Town95(A10) School95(A10) Town96(A10)
School96(A10) Town97(A10) School97(A10).
BEGIN DATA
1, LA, , , Jones , ,
2, , Smith, , , Chicago,
3, NYC, Brown, , , , Brown
END DATA.
dataset name sample window=front.

VARSTOCASES /make Town FROM Town95 Town96 Town97
/make School FROM School96 School96 School97
/index=ind(3).
* The number 3 represents the number of variables in the MAKE
subcommand.

Sort cases student town.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=Student
  /Town_last=LAST(Town).

Sort cases student School.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=Student
  /School_last=LAST(School).

* Identify Duplicate Cases.
SORT CASES BY Student(A).
MATCH FILES
  /FILE=*
  /BY Student
  /LAST=PrimaryLast.
Select if PrimaryLast.
EXECUTE.

* Delete extra variables.
match files file=*
/drop ind town school primarylast.
exe.

/* ===== End Syntax ===== */

-----Original Message-----
From: Adam M. [mailto:[hidden email]]
Sent: Wednesday, June 30, 2010 8:11 PM
Subject: Cleaning data

Hello! I seem to have a recurring problem when it comes to cleaning data
after merging numerous files with student records. Many times I have
multiple observations (eg, schools attended) taken from different data
sources and I want to merge these into a single observation. I have
string
variables and I want to take the "best" information from multiple
observations to make a more complete and pared down data set. For
example:

Student   Town95 School95 Town96 School96 Town97 School97
1       LA   --  --  Jones  --  --
2       --  Smith  --  --  Chicago --
3       NYC Brown --  --  -- Brown

I'd like the above to turn into:

Student   Town School
1       LA  Jones
2       Chicago  Smith
3       NYC Brown


So, I want to create a pair of variables (town & school). To accomplish
this
I need to pick up the non-missing information from these sets of
observations. I'm familiar with running a max or min command across
numeric
values, but in this case with strings that doesn't seem to work. Any
ideas
for automating this process with strings? Thanks for your help!

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Cleaning data

Langston, Eric

Adam-

 

So, what’s the rule for picking the correct school for the student?  If it’s the Town/School variable with the highest number, then it’s a pretty easy fix.  Note that I changed on of the schools to Zebra.  I added two compute commands and then adjusted the sorting before the aggregation.  So that will take the last non-blank value in your original variable order.

 

If it’s something else, then you can adjust the sort immediately before “Sort cases student school_flg” to get the value you want at the bottom of the list – suppose you had a date field associated with the data and you wanted the most recent or something like that.

 

Also, try to “reply-to-all” to that the whole thread gets saved in the listserv. J

 

Hope that helps,

-Eric

 

 

 

/* ===== Syntax ===== */

 

DATA LIST FREE / Student(F1) Town95(A10) School95(A10) Town96(A10) School96(A10) Town97(A10) School97(A10).

BEGIN DATA

1, LA, , , Jones , ,

2, , Smith, , , Chicago,

3, NYC, Zebra, , , , Brown

END DATA.

dataset name sample window=front.

 

VARSTOCASES /make Town FROM Town95 Town96 Town97

/make School FROM School95 School96 School97

/index=ind(3).

 

* The number 3 represents the number of variables in the MAKE

subcommand.

 

compute town_flg = (town NE "").

compute school_flg = (school NE "").

 

Sort cases student town_flg.

AGGREGATE

 /OUTFILE=* MODE=ADDVARIABLES

 /BREAK=Student

 /Town_last=LAST(Town).

 

sort cases student ind.

Sort cases student school_flg.

AGGREGATE

 /OUTFILE=* MODE=ADDVARIABLES

 /BREAK=Student

 /School_last=LAST(School).

 

* Identify Duplicate Cases.

SORT CASES BY Student(A).

MATCH FILES

 /FILE=*

 /BY Student

 /LAST=PrimaryLast.

Select if PrimaryLast.

EXECUTE.

 

* Delete extra variables.

match files file=*

/drop ind town school primarylast town_flg school_flg.

exe.

 

/* ===== End Syntax ===== */

 

 

From: Adam Maltese [mailto:[hidden email]]
Sent: Thursday, July 01, 2010 1:23 PM
To: Langston, Eric
Subject: Re: Cleaning data

 

Eric,

Thanks for your input - it's so fascinating how everyone attacks the problem in a slightly different way. While most students have matching data, there are some who attended different schools. Needless to say, with over 600,000 cases in the data set it gets a bit tedious to try to clean this manually.

Thanks,
Adam

On Thu, Jul 1, 2010 at 11:05 AM, Langston, Eric <[hidden email]> wrote:

Adam-

Would there be any times where the data would disagree?  For instance,
what if for student 3, school 95 said "Brown" but school 97 said
"Jones"?

I think that VARSTOCASES might be your best bet.  That would fold all
the Town variables and all the school variables each into a single
variable.  Then you can aggregate the town variable, aggregate the
school variable, and delete the duplicates.

If it's important to know WHICH town or school variable the data came
from, the code will need a bit of tweaking.
If there could be different data from different sources (see my question
above), then the code will also need a bit of tweaking.

Hope that helps,
-Eric



/* ===== Syntax ===== */

DATA LIST FREE / Student(F1) Town95(A10) School95(A10) Town96(A10)
School96(A10) Town97(A10) School97(A10).
BEGIN DATA

1, LA, , , Jones , ,
2, , Smith, , , Chicago,
3, NYC, Brown, , , , Brown

END DATA.
dataset name sample window=front.

VARSTOCASES /make Town FROM Town95 Town96 Town97
/make School FROM School96 School96 School97
/index=ind(3).
* The number 3 represents the number of variables in the MAKE
subcommand.

Sort cases student town.
AGGREGATE
 /OUTFILE=* MODE=ADDVARIABLES
 /BREAK=Student
 /Town_last=LAST(Town).

Sort cases student School.
AGGREGATE
 /OUTFILE=* MODE=ADDVARIABLES
 /BREAK=Student
 /School_last=LAST(School).

* Identify Duplicate Cases.
SORT CASES BY Student(A).
MATCH FILES
 /FILE=*
 /BY Student
 /LAST=PrimaryLast.
Select if PrimaryLast.
EXECUTE.

* Delete extra variables.
match files file=*
/drop ind town school primarylast.
exe.

/* ===== End Syntax ===== */


-----Original Message-----
From: Adam M. [mailto:[hidden email]]
Sent: Wednesday, June 30, 2010 8:11 PM
Subject: Cleaning data

Hello! I seem to have a recurring problem when it comes to cleaning data
after merging numerous files with student records. Many times I have
multiple observations (eg, schools attended) taken from different data
sources and I want to merge these into a single observation. I have
string
variables and I want to take the "best" information from multiple
observations to make a more complete and pared down data set. For
example:

Student   Town95 School95 Town96 School96 Town97 School97
1       LA   --  --  Jones  --  --
2       --  Smith  --  --  Chicago --
3       NYC Brown --  --  -- Brown

I'd like the above to turn into:

Student   Town School
1       LA  Jones
2       Chicago  Smith
3       NYC Brown


So, I want to create a pair of variables (town & school). To accomplish
this
I need to pick up the non-missing information from these sets of
observations. I'm familiar with running a max or min command across
numeric
values, but in this case with strings that doesn't seem to work. Any
ideas
for automating this process with strings? Thanks for your help!

 

Reply | Threaded
Open this post in threaded view
|

Re: Cleaning data

Adam M.-3
Really no rule for picking the correct school within the current data set. Later on I need to add in attendance records and determine which school a student attended for the most days during a given school year - then that will likely become their "home" school for that year.

Cheers,
Adam

On Thu, Jul 1, 2010 at 3:04 PM, Langston, Eric <[hidden email]> wrote:

Adam-

 

So, what’s the rule for picking the correct school for the student?  If it’s the Town/School variable with the highest number, then it’s a pretty easy fix.  Note that I changed on of the schools to Zebra.  I added two compute commands and then adjusted the sorting before the aggregation.  So that will take the last non-blank value in your original variable order.

 

If it’s something else, then you can adjust the sort immediately before “Sort cases student school_flg” to get the value you want at the bottom of the list – suppose you had a date field associated with the data and you wanted the most recent or something like that.

 

Also, try to “reply-to-all” to that the whole thread gets saved in the listserv. J

 

Hope that helps,

-Eric

 

 

 

/* ===== Syntax ===== */

 

DATA LIST FREE / Student(F1) Town95(A10) School95(A10) Town96(A10) School96(A10) Town97(A10) School97(A10).

BEGIN DATA

1, LA, , , Jones , ,

2, , Smith, , , Chicago,

3, NYC, Zebra, , , , Brown

END DATA.

dataset name sample window=front.

 

VARSTOCASES /make Town FROM Town95 Town96 Town97

/make School FROM School95 School96 School97

/index=ind(3).

 

* The number 3 represents the number of variables in the MAKE

subcommand.

 

compute town_flg = (town NE "").

compute school_flg = (school NE "").

 

Sort cases student town_flg.

AGGREGATE

 /OUTFILE=* MODE=ADDVARIABLES

 /BREAK=Student

 /Town_last=LAST(Town).

 

sort cases student ind.

Sort cases student school_flg.

AGGREGATE

 /OUTFILE=* MODE=ADDVARIABLES

 /BREAK=Student

 /School_last=LAST(School).

 

* Identify Duplicate Cases.

SORT CASES BY Student(A).

MATCH FILES

 /FILE=*

 /BY Student

 /LAST=PrimaryLast.

Select if PrimaryLast.

EXECUTE.

 

* Delete extra variables.

match files file=*

/drop ind town school primarylast town_flg school_flg.

exe.

 

/* ===== End Syntax ===== */

 

 

From: Adam Maltese [mailto:[hidden email]]
Sent: Thursday, July 01, 2010 1:23 PM
To: Langston, Eric
Subject: Re: Cleaning data

 

Eric,

Thanks for your input - it's so fascinating how everyone attacks the problem in a slightly different way. While most students have matching data, there are some who attended different schools. Needless to say, with over 600,000 cases in the data set it gets a bit tedious to try to clean this manually.

Thanks,
Adam

On Thu, Jul 1, 2010 at 11:05 AM, Langston, Eric <[hidden email]> wrote:

Adam-

Would there be any times where the data would disagree?  For instance,
what if for student 3, school 95 said "Brown" but school 97 said
"Jones"?

I think that VARSTOCASES might be your best bet.  That would fold all
the Town variables and all the school variables each into a single
variable.  Then you can aggregate the town variable, aggregate the
school variable, and delete the duplicates.

If it's important to know WHICH town or school variable the data came
from, the code will need a bit of tweaking.
If there could be different data from different sources (see my question
above), then the code will also need a bit of tweaking.

Hope that helps,
-Eric



/* ===== Syntax ===== */

DATA LIST FREE / Student(F1) Town95(A10) School95(A10) Town96(A10)
School96(A10) Town97(A10) School97(A10).
BEGIN DATA

1, LA, , , Jones , ,
2, , Smith, , , Chicago,
3, NYC, Brown, , , , Brown

END DATA.
dataset name sample window=front.

VARSTOCASES /make Town FROM Town95 Town96 Town97
/make School FROM School96 School96 School97
/index=ind(3).
* The number 3 represents the number of variables in the MAKE
subcommand.

Sort cases student town.
AGGREGATE
 /OUTFILE=* MODE=ADDVARIABLES
 /BREAK=Student
 /Town_last=LAST(Town).

Sort cases student School.
AGGREGATE
 /OUTFILE=* MODE=ADDVARIABLES
 /BREAK=Student
 /School_last=LAST(School).

* Identify Duplicate Cases.
SORT CASES BY Student(A).
MATCH FILES
 /FILE=*
 /BY Student
 /LAST=PrimaryLast.
Select if PrimaryLast.
EXECUTE.

* Delete extra variables.
match files file=*
/drop ind town school primarylast.
exe.

/* ===== End Syntax ===== */


-----Original Message-----
From: Adam M. [mailto:[hidden email]]
Sent: Wednesday, June 30, 2010 8:11 PM
Subject: Cleaning data

Hello! I seem to have a recurring problem when it comes to cleaning data
after merging numerous files with student records. Many times I have
multiple observations (eg, schools attended) taken from different data
sources and I want to merge these into a single observation. I have
string
variables and I want to take the "best" information from multiple
observations to make a more complete and pared down data set. For
example:

Student   Town95 School95 Town96 School96 Town97 School97
1       LA   --  --  Jones  --  --
2       --  Smith  --  --  Chicago --
3       NYC Brown --  --  -- Brown

I'd like the above to turn into:

Student   Town School
1       LA  Jones
2       Chicago  Smith
3       NYC Brown


So, I want to create a pair of variables (town & school). To accomplish
this
I need to pick up the non-missing information from these sets of
observations. I'm familiar with running a max or min command across
numeric
values, but in this case with strings that doesn't seem to work. Any
ideas
for automating this process with strings? Thanks for your help!