I'm working with data that has designated fields for city and state. However, the original data (I don't have access to that) was badly parsed and in 30% of the records, city and state do not appear in the correct fields. Instead, they may all be contained, for instance, in the street address field. Here are some exammples in the city field:
CITY CA should be Daly City . . CIFICA CA should be Pacifica ERRITO CA should be El Cerrito l Cerrito CA should be El Cerrito . . CERRITO CA should be El Cerrito . . RRITO CA should be El Cerrito NTIOCH Should be Antioch AN GABRIEL C Should be San Gabriel in city. The variations are endless. I'd like to take the fragments and recode them to proper city and state names. I want to look at fragments of words that I know would equate to a known city (such as "cifica" or "rrito") or state (A or C in the state field). Once I have cities correctly entered, I could easily compute a correct state field. Any suggestions (other than working with the source to get cleaner data -- doing that) on how to approach this? . . . ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Sydelle,
Perhaps someone else has done this before and knows about some cleverly written software. Lacking that, I think you have a long, hard, tedious slog using recode commands. I think I'd export just those two variables (city and state) and aggregate each variable (maybe, separately, or not, depending on the data specifics) to get rid of duplicates. Then delete the ones that are correct. Export the rest to a text file, rearrange the lines so that all the El Cerrito's are tgoether and write the recodes. One thing that will save you some time is to include the text delimiters around the text value. That is, Write 'erito' or "erito" rather than just erito. There's probably other tricks to use. You'll figure them out as you go or, perhaps, others will comment. Gene Maguin -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Raffe, Sydelle, SSA Sent: Thursday, November 18, 2010 11:40 AM To: [hidden email] Subject: recoding messy values I'm working with data that has designated fields for city and state. However, the original data (I don't have access to that) was badly parsed and in 30% of the records, city and state do not appear in the correct fields. Instead, they may all be contained, for instance, in the street address field. Here are some exammples in the city field: CITY CA should be Daly City . . CIFICA CA should be Pacifica ERRITO CA should be El Cerrito l Cerrito CA should be El Cerrito . . CERRITO CA should be El Cerrito . . RRITO CA should be El Cerrito NTIOCH Should be Antioch AN GABRIEL C Should be San Gabriel in city. The variations are endless. I'd like to take the fragments and recode them to proper city and state names. I want to look at fragments of words that I know would equate to a known city (such as "cifica" or "rrito") or state (A or C in the state field). Once I have cities correctly entered, I could easily compute a correct state field. Any suggestions (other than working with the source to get cleaner data -- doing that) on how to approach this? . . . ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
In reply to this post by Raffe, Sydelle, SSA
Hi Sydelle,
Here's a partial solution. You'll need to play a bit with the strings you're searching for and for the state string "CA" matches any "ca" so Pacifica would get snagged as well even if the state abbreviation is removed. You might want to do that last and then check. Anyway, here's my suggestion: -- data list fixed/stringvar (a18). begin data CITY CA CIFICA CA ERRITO CA l Cerrito CA CERRITO CA RRITO CA NTIOCH AN GABRIEL C end data. list. string city (a16). string state (a2). if index(lower(stringvar),'ca') > 0 state = "CA". if index(lower(stringvar),'errito') > 0 city = "El Cerrito". if index(lower(stringvar),'ifica') > 0 city = "Pacifica". if index(lower(stringvar),'tioch') > 0 city = "Antioch". list. -- Mark “That which can be
asserted without evidence, can be dismissed without evidence.”
- Christopher Hitchens Mark Casazza Director of Academic Information The City University of New York 555 West 57th Street, Suite 1240 New York, NY 10019 Phone: 212.541.0396 Fax: 212.541.0392 email: [hidden email] On 2010-11-18 11:40, Raffe, Sydelle, SSA wrote: ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARDI'm working with data that has designated fields for city and state. However, the original data (I don't have access to that) was badly parsed and in 30% of the records, city and state do not appear in the correct fields. Instead, they may all be contained, for instance, in the street address field. Here are some exammples in the city field: CITY CA should be Daly City . . CIFICA CA should be Pacifica ERRITO CA should be El Cerrito l Cerrito CA should be El Cerrito . . CERRITO CA should be El Cerrito . . RRITO CA should be El Cerrito NTIOCH Should be Antioch AN GABRIEL C Should be San Gabriel in city. The variations are endless. I'd like to take the fragments and recode them to proper city and state names. I want to look at fragments of words that I know would equate to a known city (such as "cifica" or "rrito") or state (A or C in the state field). Once I have cities correctly entered, I could easily compute a correct state field. Any suggestions (other than working with the source to get cleaner data -- doing that) on how to approach this? . . . ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Administrator
|
In reply to this post by Raffe, Sydelle, SSA
"badly parsed and in 30% of the records, city and state do not appear in the correct fields."
Badly parsed is an understatement! FIRE that data vender NOW! ---- It would be good to know if preceding and following fields contain the missing information? If so, glue the crap back together and reparse it. Is this a large file? It seems to me that you should extract the rightmost piece of some of these fragments and grab "state" and rebuild the "city" without it. Is it conceivable to create a TABLE of mismatches and appropriate corrections and use MATCH files to do a first pass? "The variations are endless." Aren't they always? Reminds me of a project from years ago where the client had oodles of scraps of excretion from a screen scraping program from online auctions and all of these data elements had to be normalized. Initial attacks from several previous consultants using various forms of RECODE and IF logic proved to be a massive waste of time and money. In the end, a combination of ingenious TABLE logic and excruciating manual corrections solved the basic problem. GOOD LUCK, you are going to need it! HTH, David
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
Sydelle,
Do you have zip code in your records? There might be a straight solution to the problem.
vlad
On Fri, Nov 19, 2010 at 3:56 AM, David Marso <[hidden email]> wrote:
|
Thank you all for commiserating with me and giving me some good laughs. I will check to see if there's a zip to slice through this mess as well as concatie the fields and work backwords
to parse.
P.S. I had been promised a better file about a month ago. On the 11/1810, I got an e-mail saying there would be a fix as of 11/10/10. Guess that didn't work either!
|
Free forum by Nabble | Edit this page |