|
Hi,
I have problems with a dataset in which one variable contains the street address of the participants. Unfortunately they all look different :( But I need to extract the house number into a new variable. I have some workarounds but I would prefer a more complex version :-)

Chaussée de XXX 1
BERGL 2A
Bevrijd 1
513, chausee de XXX
133 RUE MONTAGNE

This is what I need afterwards:

1
2A
1
513
133

Thank you! |
|
You could get more fancy with regular expressions, see https://andrewpwheeler.wordpress.com/2014/08/22/using-regular-expressions-in-spss/, but here is a workflow in plain SPSS:
- find the first number
- extract the text afterwards
- find if a space exists after
- eliminate text after space

Example below.

**********************************************.
DATA LIST FREE / Add (A100).
BEGIN DATA.
"Chaussée de XXX 1"
"BERGL 2A"
"Bevrijd 1"
"513, chausee de XXX"
"133 RUE MONTAGNE"
END DATA.
DATASET NAME Addresses.
*Find first number.
COMPUTE Fn = CHAR.INDEX(Add,"1234567890",1).
*Find next space or end of string.
STRING Res (A100).
IF Fn > 0 Res = CHAR.SUBSTR(Add,Fn).
COMPUTE Ns = CHAR.INDEX(Res," ").
IF Ns > 0 Res = CHAR.SUBSTR(Res,1,Ns).
EXE.
**********************************************.

The 513 example ends up being "513," with this approach. You can eliminate commas or whatever superfluous characters using the REPLACE function. |
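For anyone who wants to try the four steps above outside SPSS, here is a minimal sketch of the same logic in Python 3. The function name `house_number` and the set of trailing characters that get stripped are my own assumptions, not part of the SPSS code in the post.

```python
def house_number(address):
    """Return the token that starts at the first digit, or '' if there is none."""
    # Step 1: position of the first digit (what CHAR.INDEX does in the SPSS code).
    first = next((i for i, ch in enumerate(address) if ch in "0123456789"), -1)
    if first < 0:
        return ""
    # Step 2: keep the text from that digit onward.
    rest = address[first:]
    # Steps 3 and 4: cut at the next space, if there is one.
    token = rest.split(" ", 1)[0]
    # Extra step (assumption): drop trailing punctuation such as the comma in "513,".
    return token.rstrip(",.;")


for address in ["BERGL 2A", "513, chausee de XXX", "133 RUE MONTAGNE"]:
    print(address, "->", house_number(address))
```

The final `rstrip` plays the role of the REPLACE clean-up mentioned above; without it the fourth record would come back as "513,".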
|
Try the code below (if the email mangles the indentation, I can send it as a file). It assumes that the number pattern is at least one digit followed by zero or more word characters (letters, digits, or underscore) up to a blank, comma, or end of string. This will work for accented characters such as in the last two records below because of the use of the UNICODE flag in the regular expression. If more than one number is found, the first is returned. The result is blank if no number is found.

data list fixed/s(a20).
begin data
Chaussée de XXX 1
BERGL 2A
Bevrijd 1
513, chausee de XXX
133 RUE MONTAGNE
there is no number
BERGL 2ã
BERGL 2ã, 123 456
end data.
dataset name addresses.

begin program.
import re
def getnumber(arg):
    # all runs of digits plus any word characters directly after them
    number = re.findall(r'\d+\w*', arg, flags=re.UNICODE)
    # return the first match as a string, or blank when nothing matched
    if len(number) >= 1:
        return number[0]
    return ""
end program.

spssinc trans result=housenumber type=20 /formula "getnumber(s)".
|
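The pattern can also be checked outside SPSS before running it through SPSSINC TRANS. The sketch below is only a test harness under my own assumptions: it uses Python 3 (where `str` patterns are Unicode-aware by default, so the UNICODE flag is not needed), and the names `PATTERN`, `first_number`, `samples` and `results` are made up for the illustration.

```python
import re

# Same pattern as in the post: digits followed by optional word characters.
PATTERN = re.compile(r"\d+\w*")


def first_number(text):
    """Return the first match of PATTERN in text, or '' when there is none."""
    match = PATTERN.search(text)
    return match.group(0) if match else ""


# The sample records from the post (\u00e9 is é, \u00e3 is ã).
samples = [
    "Chauss\u00e9e de XXX 1",
    "BERGL 2A",
    "Bevrijd 1",
    "513, chausee de XXX",
    "133 RUE MONTAGNE",
    "there is no number",
    "BERGL 2\u00e3",
    "BERGL 2\u00e3, 123 456",
]
results = [first_number(s) for s in samples]
print(results)
```

Note that the last record contains three candidates (2ã, 123 and 456) and only the first one is kept, as described in the post.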
|
Hi Jon,

Does the use of the Unicode flag in Python assume that Unicode mode in SPSS is used? In other words: is the case data for spssinc trans unicode in Unicode mode and str in code page mode (which would then require re.LOCALE)?

It might also be nice to define groups so findall returns a list of tuples (e.g. street number, suffix):

re.findall(r"(\d+)\s*(\w{,3})", s, re.U)

But suffixes are tricky (what if there is a three-letter city? :-)

Regards,
Albert-Jan

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD |
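To make the grouping idea and its pitfall concrete, here is a small sketch in Python 3 using the grouped pattern from the post. The wrapper `number_and_suffix` and the constant `GROUPED` are hypothetical names added for the illustration.

```python
import re

# Pattern from the post: a number, optional whitespace, up to three word characters.
GROUPED = re.compile(r"(\d+)\s*(\w{,3})")


def number_and_suffix(text):
    """Return (number, suffix) for the first match, or ('', '') if nothing matches."""
    found = GROUPED.findall(text)  # list of (number, suffix) tuples
    return found[0] if found else ("", "")


print(number_and_suffix("BERGL 2A"))          # suffix is directly attached
print(number_and_suffix("133 RUE MONTAGNE"))  # the next word is taken as the suffix
```

The second call shows why suffixes are tricky: because `\s*` allows a gap, "RUE" is captured as if it were a suffix of 133. Dropping the `\s*` avoids that, at the cost of missing suffixes written with a space, such as "2 A".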
|
I am always in Unicode mode, but re.LOCALE might be better in code page mode. The flag doc says that these flags affect the character classification for things like \w but doesn't mention the actual encoding. In the backend, Unicode mode is actually UTF-8.
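Why the distinction between encoded bytes and decoded text matters for `\w` can be shown with a short Python 3 sketch. This does not say anything about what SPSSINC TRANS actually passes to the function; it only assumes a value that arrives as UTF-8 bytes and compares matching before and after decoding. The variable names are made up for the example.

```python
import re

# "BERGL 2ã" as UTF-8 bytes; the ã becomes the two bytes C3 A3.
raw = "BERGL 2\u00e3".encode("utf-8")

# A bytes pattern treats only ASCII letters, digits and underscore as \w,
# so the match stops right after the digit.
bytes_hits = re.findall(rb"\d+\w*", raw)

# After decoding, \w is Unicode-aware and the accented letter is included.
text_hits = re.findall(r"\d+\w*", raw.decode("utf-8"))

print(bytes_hits, text_hits)
```

So if the accented suffix goes missing in a result, checking whether the value was matched as bytes or as decoded text is a reasonable first step.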
|
|
The subtle distinction between e.g. [0-9] and \d is, well, interesting:
Python 2.7.2 (default, Nov 2 2015, 01:07:37)
[GCC 4.9 20140827 (prerelease)] on linux4
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> import unicodedata as ud
>>> s = unichr(int("06f4", 16))
>>> ud.name(s)
'EXTENDED ARABIC-INDIC DIGIT FOUR'
>>> re.match(r"\d", s)
>>> re.match(r"\d", s, re.U)
<_sre.SRE_Match object at 0xb60b6678>
>>> re.match(r"[0-9]", s)
>>> re.match(r"[0-9]", s, re.U)
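The session above is Python 2.7. As a sketch of how the same comparison looks in Python 3, where `str` patterns are Unicode-aware by default and `re.ASCII` is the flag that narrows `\d` back to 0-9, something like the following should reproduce the distinction. The variable names are only illustrative.

```python
import re
import unicodedata

s = "\u06f4"  # same character as in the session above
name = unicodedata.name(s)

# \d follows the Unicode "decimal digit" category unless re.ASCII is given.
unicode_digit = re.match(r"\d", s) is not None
ascii_digit = re.match(r"\d", s, re.ASCII) is not None

# An explicit range only ever matches the ten ASCII digits.
range_digit = re.match(r"[0-9]", s) is not None

# int() also accepts Unicode decimal digits.
value = int(s)

print(name, unicode_digit, ascii_digit, range_digit, value)
```

For house numbers this means `[0-9]` (or `\d` with `re.ASCII`) is the safer choice when the extracted value must later be read as a Western numeral.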
|
|
Yes, Westerners have a very narrow view of digits. But don't try to enter those others in a numeric field in Statistics.
|
