recoding messy values

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

recoding messy values

Raffe, Sydelle, SSA
I'm working with data that has designated  fields for city and state. However, the original data (I don't have access to that) was badly parsed and in 30% of the records, city and state do not appear in the correct fields. Instead, they may all be contained, for instance,  in the street address field. Here are some exammples in the city field:

CITY CA           should be Daly City . .
CIFICA CA         should be Pacifica
ERRITO CA     should be El Cerrito
l Cerrito CA      should be El Cerrito  .               .
CERRITO CA  should be El Cerrito                .         .
RRITO CA          should be El Cerrito
NTIOCH            Should be Antioch
AN GABRIEL C Should be San Gabriel in city.

The variations are endless.

I'd like to take the fragments and recode them to proper city and state names. I want to look at fragments of words that I know would equate to a known city (such as "cifica" or "rrito") or state (A or C in the state field). Once I have cities correctly entered, I could easily compute a correct state field.

Any suggestions (other than working with the source to get cleaner data -- doing that) on how to approach this? .
        .                       .

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: recoding messy values

Maguin, Eugene
Sydelle,

Perhaps someone else has done this before and knows about some cleverly
written software. Lacking that, I think you have a long, hard, tedious slog
using recode commands. I think I'd export just those two variables (city and
state) and aggregate each variable (maybe, separately, or not, depending on
the data specifics) to get rid of duplicates.  Then delete the ones that are
correct. Export the rest to a text file, rearrange the lines so that all the
El Cerrito's are tgoether and write the recodes. One thing that will save
you some time is to include the text delimiters around the text value. That
is,
Write 'erito' or "erito" rather than just erito. There's probably other
tricks to use. You'll figure them out as you go or, perhaps, others will
comment.

Gene Maguin



-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Raffe, Sydelle, SSA
Sent: Thursday, November 18, 2010 11:40 AM
To: [hidden email]
Subject: recoding messy values

I'm working with data that has designated  fields for city and state.
However, the original data (I don't have access to that) was badly parsed
and in 30% of the records, city and state do not appear in the correct
fields. Instead, they may all be contained, for instance,  in the street
address field. Here are some exammples in the city field:

CITY CA           should be Daly City . .
CIFICA CA         should be Pacifica
ERRITO CA     should be El Cerrito
l Cerrito CA      should be El Cerrito  .               .
CERRITO CA  should be El Cerrito                .         .
RRITO CA          should be El Cerrito
NTIOCH            Should be Antioch
AN GABRIEL C Should be San Gabriel in city.

The variations are endless.

I'd like to take the fragments and recode them to proper city and state
names. I want to look at fragments of words that I know would equate to a
known city (such as "cifica" or "rrito") or state (A or C in the state
field). Once I have cities correctly entered, I could easily compute a
correct state field.

Any suggestions (other than working with the source to get cleaner data --
doing that) on how to approach this? .
        .                       .

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: recoding messy values

Mark Casazza-2
In reply to this post by Raffe, Sydelle, SSA
Hi Sydelle,

Here's a partial solution. You'll need to play a bit with the strings you're searching for and for the state string "CA" matches any "ca" so Pacifica would get snagged as well even if the state abbreviation is removed. You might want to do that last and then check. Anyway, here's my suggestion:

--
data list fixed/stringvar (a18).
begin data
CITY CA          
CIFICA CA        
ERRITO CA        
l Cerrito CA     
CERRITO CA       
RRITO CA         
NTIOCH           
AN GABRIEL C     
end data.
list.

string city (a16).
string state (a2).

if index(lower(stringvar),'ca') > 0 state = "CA".
if index(lower(stringvar),'errito') > 0 city = "El Cerrito".
if index(lower(stringvar),'ifica') > 0 city = "Pacifica".
if index(lower(stringvar),'tioch') > 0 city = "Antioch".
list.

--

Mark

“That which can be asserted without evidence, can be dismissed without evidence.”
- Christopher Hitchens


Mark Casazza
Director of Academic Information
The City University of New York
555 West 57th Street, Suite 1240
New York, NY 10019

Phone: 212.541.0396
Fax: 212.541.0392
email: [hidden email]

On 2010-11-18 11:40, Raffe, Sydelle, SSA wrote:
I'm working with data that has designated  fields for city and state. However, the original data (I don't have access to that) was badly parsed and in 30% of the records, city and state do not appear in the correct fields. Instead, they may all be contained, for instance,  in the street address field. Here are some exammples in the city field:

CITY CA           should be Daly City . .
CIFICA CA         should be Pacifica
ERRITO CA     should be El Cerrito
l Cerrito CA      should be El Cerrito  .               .
CERRITO CA  should be El Cerrito                .         .
RRITO CA          should be El Cerrito
NTIOCH            Should be Antioch
AN GABRIEL C Should be San Gabriel in city.

The variations are endless.

I'd like to take the fragments and recode them to proper city and state names. I want to look at fragments of words that I know would equate to a known city (such as "cifica" or "rrito") or state (A or C in the state field). Once I have cities correctly entered, I could easily compute a correct state field.

Any suggestions (other than working with the source to get cleaner data -- doing that) on how to approach this? .
        .                       .

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: recoding messy values

David Marso
Administrator
In reply to this post by Raffe, Sydelle, SSA
"badly parsed and in 30% of the records, city and state do not appear in the correct fields."
Badly parsed is an understatement!  FIRE that data vender NOW!
----
It would be good to know if preceding and following fields contain the missing information?  If so, glue the crap back together and reparse it.
Is this a large file?
It seems to me that you should extract the rightmost piece of some of these fragments and grab "state" and rebuild the "city" without it.
Is it conceivable to create a TABLE of mismatches and appropriate corrections and use MATCH files to do a first pass?

"The variations are endless."
Aren't they always?
Reminds me of a project from years ago where the client had oodles of scraps of excretion from a screen scraping program from online auctions and all of these data elements had to be normalized.
Initial attacks from several previous consultants using various forms of RECODE and IF logic proved to be a massive waste of time and money.  In the end, a combination of ingenious TABLE logic and excruciating manual corrections solved the basic problem.  GOOD LUCK, you are going to need it!  
HTH, David

Raffe, Sydelle, SSA wrote
I'm working with data that has designated  fields for city and state. However, the original data (I don't have access to that) was badly parsed and in 30% of the records, city and state do not appear in the correct fields. Instead, they may all be contained, for instance,  in the street address field. Here are some exammples in the city field:

CITY CA           should be Daly City . .
CIFICA CA         should be Pacifica
ERRITO CA     should be El Cerrito
l Cerrito CA      should be El Cerrito  .               .
CERRITO CA  should be El Cerrito                .         .
RRITO CA          should be El Cerrito
NTIOCH            Should be Antioch
AN GABRIEL C Should be San Gabriel in city.

The variations are endless.

I'd like to take the fragments and recode them to proper city and state names. I want to look at fragments of words that I know would equate to a known city (such as "cifica" or "rrito") or state (A or C in the state field). Once I have cities correctly entered, I could easily compute a correct state field.

Any suggestions (other than working with the source to get cleaner data -- doing that) on how to approach this? .
        .                       .

=====================
To manage your subscription to SPSSX-L, send a message to
LISTSERV@LISTSERV.UGA.EDU (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: recoding messy values

Vlad Lopez
Sydelle,
Do you have zip code in your records? There might be a straight solution to the problem.
 
vlad

On Fri, Nov 19, 2010 at 3:56 AM, David Marso <[hidden email]> wrote:
"badly parsed and in 30% of the records, city and state do not appear in the
correct fields."
Badly parsed is an understatement!  FIRE that data vender NOW!
----
It would be good to know if preceding and following fields contain the
missing information?  If so, glue the crap back together and reparse it.
Is this a large file?
It seems to me that you should extract the rightmost piece of some of these
fragments and grab "state" and rebuild the "city" without it.
Is it conceivable to create a TABLE of mismatches and appropriate
corrections and use MATCH files to do a first pass?

"The variations are endless."
Aren't they always?
Reminds me of a project from years ago where the client had oodles of scraps
of excretion from a screen scraping program from online auctions and all of
these data elements had to be normalized.
Initial attacks from several previous consultants using various forms of
RECODE and IF logic proved to be a massive waste of time and money.  In the
end, a combination of ingenious TABLE logic and excruciating manual
corrections solved the basic problem.  GOOD LUCK, you are going to need it!
HTH, David


Raffe, Sydelle, SSA wrote:
>
> I'm working with data that has designated  fields for city and state.
> However, the original data (I don't have access to that) was badly parsed
> and in 30% of the records, city and state do not appear in the correct
> fields. Instead, they may all be contained, for instance,  in the street
> address field. Here are some exammples in the city field:
>
> CITY CA           should be Daly City . .
> CIFICA CA         should be Pacifica
> ERRITO CA     should be El Cerrito
> l Cerrito CA      should be El Cerrito  .               .
> CERRITO CA  should be El Cerrito                .         .
> RRITO CA          should be El Cerrito
> NTIOCH            Should be Antioch
> AN GABRIEL C Should be San Gabriel in city.
>
> The variations are endless.
>
> I'd like to take the fragments and recode them to proper city and state
> names. I want to look at fragments of words that I know would equate to a
> known city (such as "cifica" or "rrito") or state (A or C in the state
> field). Once I have cities correctly entered, I could easily compute a
> correct state field.
>
> Any suggestions (other than working with the source to get cleaner data --
> doing that) on how to approach this? .
>         .                       .
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> [hidden email] (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD
>
>

--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/recoding-messy-values-tp3271015p3272199.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Reply | Threaded
Open this post in threaded view
|

Re: recoding messy values

Raffe, Sydelle, SSA
Thank you all for commiserating with me and giving me some good laughs. I will check to see if there's a zip to slice through this mess as well as concatie the fields and work backwords to parse.
 
P.S. I had been promised a better file about a month ago. On the 11/1810, I got an e-mail saying there would be a fix as of 11/10/10. Guess that didn't work either!

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]]On Behalf Of Vladimir Lopez-Prado
Sent: Friday, November 19, 2010 04:09 AM
To: [hidden email]
Subject: Re: recoding messy values

Sydelle,
Do you have zip code in your records? There might be a straight solution to the problem.
 
vlad

On Fri, Nov 19, 2010 at 3:56 AM, David Marso <[hidden email]> wrote:
"badly parsed and in 30% of the records, city and state do not appear in the
correct fields."
Badly parsed is an understatement!  FIRE that data vender NOW!
----
It would be good to know if preceding and following fields contain the
missing information?  If so, glue the crap back together and reparse it.
Is this a large file?
It seems to me that you should extract the rightmost piece of some of these
fragments and grab "state" and rebuild the "city" without it.
Is it conceivable to create a TABLE of mismatches and appropriate
corrections and use MATCH files to do a first pass?

"The variations are endless."
Aren't they always?
Reminds me of a project from years ago where the client had oodles of scraps
of excretion from a screen scraping program from online auctions and all of
these data elements had to be normalized.
Initial attacks from several previous consultants using various forms of
RECODE and IF logic proved to be a massive waste of time and money.  In the
end, a combination of ingenious TABLE logic and excruciating manual
corrections solved the basic problem.  GOOD LUCK, you are going to need it!
HTH, David


Raffe, Sydelle, SSA wrote:
>
> I'm working with data that has designated  fields for city and state.
> However, the original data (I don't have access to that) was badly parsed
> and in 30% of the records, city and state do not appear in the correct
> fields. Instead, they may all be contained, for instance,  in the street
> address field. Here are some exammples in the city field:
>
> CITY CA           should be Daly City . .
> CIFICA CA         should be Pacifica
> ERRITO CA     should be El Cerrito
> l Cerrito CA      should be El Cerrito  .               .
> CERRITO CA  should be El Cerrito                .         .
> RRITO CA          should be El Cerrito
> NTIOCH            Should be Antioch
> AN GABRIEL C Should be San Gabriel in city.
>
> The variations are endless.
>
> I'd like to take the fragments and recode them to proper city and state
> names. I want to look at fragments of words that I know would equate to a
> known city (such as "cifica" or "rrito") or state (A or C in the state
> field). Once I have cities correctly entered, I could easily compute a
> correct state field.
>
> Any suggestions (other than working with the source to get cleaner data --
> doing that) on how to approach this? .
>         .                       .
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> [hidden email] (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD
>
>

--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/recoding-messy-values-tp3271015p3272199.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD