Adress field

Adress field

emma78
Hi,
I have a problem with a dataset: one variable contains the participants' street addresses. Unfortunately they all look different :( But I need to extract the house number into a new variable.
I have some workarounds, but I would prefer a more sophisticated solution :-)


Chaussée de XXX 1
BERGL 2A
Bevrijd 1
513, chausee de XXX
133 RUE MONTAGNE

This is what I need afterwards:
1
2A
1
513
133


Thank you!                                                                                                                                                                                          
Re: Adress field

Andy W
You could get fancier with regular expressions (see https://andrewpwheeler.wordpress.com/2014/08/22/using-regular-expressions-in-spss/), but here is a workflow in plain SPSS:

 - find the first number
 - extract the text afterwards
 - find if a space exists after
 - eliminate text after space

Example below.

**********************************************.
DATA LIST FREE / Add (A100).
BEGIN DATA.
"Chaussée de XXX 1"          
"BERGL 2A"                          
"Bevrijd 1"                              
"513, chausee de XXX"        
"133 RUE MONTAGNE"
END DATA.
DATASET NAME Addresses.

*Find first number.
COMPUTE Fn = CHAR.INDEX(Add,"1234567890",1).
*Find next space or end of string.
STRING Res (A100).
IF Fn > 0 Res = CHAR.SUBSTR(Add,Fn).
COMPUTE Ns = CHAR.INDEX(Res," ").
IF Ns > 0 Res = CHAR.SUBSTR(Res,1,Ns).
EXE.
**********************************************.

The 513 example ends up as "513," with this approach. You can remove commas or other superfluous characters using the REPLACE function.
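
For readers who want to check the logic outside SPSS, the same four steps can be sketched in plain Python. This is only an illustration of the workflow, not a translation of the syntax above: the function name `house_number` is made up here, and the final comma cleanup stands in for the REPLACE step.

```python
def house_number(address):
    """Mirror the workflow: first digit, rest of string, next space, cut."""
    # Step 1: position of the first digit, or None if there is none.
    first = next((i for i, ch in enumerate(address) if ch in "0123456789"), None)
    if first is None:
        return ""
    # Step 2: keep the text from the first digit onward.
    rest = address[first:]
    # Steps 3 and 4: if a space follows, drop everything from it onward.
    space = rest.find(" ")
    if space >= 0:
        rest = rest[:space]
    # Cleanup: remove a trailing comma, as in "513,".
    return rest.rstrip(",")


samples = ["Chaussée de XXX 1", "BERGL 2A", "Bevrijd 1",
           "513, chausee de XXX", "133 RUE MONTAGNE"]
results = [house_number(s) for s in samples]
print(results)
```

Like the SPSS version, this takes whatever follows the first digit up to the next space, so a street name that itself contains a digit would still trip it up.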
Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/
Re: Adress field

Jon Peck
In reply to this post by emma78
Try the code below (if the email mangles the indentation, I can send it as a file).  It assumes that the number pattern is at least one digit followed by zero or more letter characters up to a blank, comma, or end of string.  This will work for accented characters such as in the last two records below because of the use of the UNICODE flag in the regular expression.

If more than one number is found, the first is returned.  The result is blank if no number is found.

data list fixed/s(a20).
begin data
Chaussée de XXX 1
BERGL 2A
Bevrijd 1
513, chausee de XXX
133 RUE MONTAGNE
there is no number
BERGL 2ã
BERGL 2ã,
123 456
end data.
dataset name addresses.


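begin program.
import re
def getnumber(arg):
    # First run of digits plus any trailing word characters, e.g. "2A".
    matches = re.findall(r'\d+\w*', arg, flags=re.UNICODE)
    # Always return a plain string: the first match, or blank if none.
    if matches:
        return matches[0]
    return ""
end program.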

spssinc trans result=housenumber type=20
/formula "getnumber(s)".
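
The pattern can also be tried outside Statistics. The sketch below is a standalone Python 3 check, not part of the SPSSINC TRANS job: the name `first_house_number` is invented for illustration, it uses `re.search` instead of `re.findall` because only the first match is wanted, and it relies on Python 3 treating `str` patterns as Unicode-aware by default.

```python
import re


def first_house_number(text):
    """Return the first digit run plus trailing word characters, or ''."""
    match = re.search(r"\d+\w*", text)
    return match.group(0) if match else ""


rows = [
    "Chaussée de XXX 1",
    "BERGL 2A",
    "Bevrijd 1",
    "513, chausee de XXX",
    "133 RUE MONTAGNE",
    "there is no number",
    "BERGL 2\u00e3",   # 2ã, written with an escape to avoid encoding surprises
    "BERGL 2\u00e3,",
    "123 456",
]
extracted = [first_house_number(r) for r in rows]
print(extracted)
```

Note that `\w` also matches digits and the underscore, so the "letters" after the number are really any word characters.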



--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Adress-field-tp5731339.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD



--
Jon K Peck
[hidden email]


Re: Adress field

Albert-Jan Roskam-3
Hi Jon,

Does the use of the UNICODE flag in Python assume that Unicode mode is used in SPSS? In other words: is the case data for SPSSINC TRANS unicode in Unicode mode and str in code page mode (which would then require re.LOCALE)?

It might also be nice to define groups so that findall returns a list of tuples (e.g., house number, suffix):

re.findall(r"(\d+)\s*(\w{,3})", s, re.U)

But suffixes are tricky (what if there is a three-letter city? :-)
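
A quick Python 3 sketch of what the grouped pattern returns on a few of the sample addresses, including the kind of false suffix mentioned above. The variable names are only for illustration.

```python
import re

# Group 1: the digits. Group 2: up to three word characters after optional spaces.
pattern = re.compile(r"(\d+)\s*(\w{,3})")

addresses = ["BERGL 2A", "513, chausee de XXX", "133 RUE MONTAGNE"]
pairs = {a: pattern.findall(a) for a in addresses}

for address, found in pairs.items():
    print(address, "->", found)
```

The last address shows the problem: because `\s*` lets the suffix start after a space, "RUE" is captured as if it were a house-number suffix.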

Regards,
Albert-Jan



Re: Adress field

Jon Peck
I am always in Unicode mode, but re.LOCALE might be better in code page mode.  The flag doc says that these flags affect the character classification for things like \w but doesn't mention the actual encoding.  In the backend, Unicode mode is actually UTF-8.
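
To illustrate the point about character classification, here is a small Python 3 sketch (the thread itself uses Python 2, where the flags behave differently). It assumes nothing about how Statistics passes the data; it only shows how `\w` treats an accented letter as text, as ASCII-only text, and as raw UTF-8 bytes.

```python
import re

s = "BERGL 2\u00e3"  # "BERGL 2ã"

# str pattern, default in Python 3: \w is Unicode-aware, so ã counts as a letter.
unicode_match = re.findall(r"\d+\w*", s)

# re.ASCII restricts \w to [a-zA-Z0-9_], so the match stops before ã.
ascii_match = re.findall(r"\d+\w*", s, flags=re.ASCII)

# bytes pattern on the UTF-8 encoding: the two bytes of ã are not word characters.
bytes_match = re.findall(rb"\d+\w*", s.encode("utf-8"))

print(unicode_match, ascii_match, bytes_match)
```

In Python 3, re.LOCALE is accepted only with bytes patterns, so for decoded text the choice is effectively between the default Unicode behaviour and re.ASCII.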



--
Jon K Peck
[hidden email]



Re: Adress field

Albert-Jan Roskam-3
The subtle distinction between e.g. [0-9] and \d is, well, interesting:

Python 2.7.2 (default, Nov  2 2015, 01:07:37) [GCC 4.9 20140827 (prerelease)] on linux4
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> import unicodedata as ud
>>> s = unichr(int("06f4", 16))
>>> ud.name(s)
'EXTENDED ARABIC-INDIC DIGIT FOUR'
>>> re.match(r"\d", s)
>>> re.match(r"\d", s, re.U)
<_sre.SRE_Match object at 0xb60b6678>
>>> re.match(r"[0-9]", s)
>>> re.match(r"[0-9]", s, re.U)
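
For anyone repeating this on Python 3, a rough equivalent is sketched below. The main difference is that `str` patterns are Unicode-aware by default, so `\d` matches the Arabic-Indic digit without any flag and re.ASCII is needed to switch that off. The variable names are only for illustration.

```python
import re
import unicodedata as ud

s = chr(0x06F4)   # Python 3 replacement for unichr(int("06f4", 16))
name = ud.name(s)

# \d follows the Unicode "decimal digit" category unless re.ASCII is given.
digit_default = re.match(r"\d", s) is not None
digit_ascii = re.match(r"\d", s, re.ASCII) is not None

# An explicit [0-9] range never matches it, with or without flags.
range_default = re.match(r"[0-9]", s) is not None

print(name, digit_default, digit_ascii, range_default)
```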



Re: Adress field

Jon Peck
Yes, Westerners have a very narrow view of digits.  But don't try to enter those others in a numeric field in Statistics.



--
Jon K Peck
[hidden email]



Re: Adress field

emma78
Perfect, thank you all