Stripping postcodes

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Stripping postcodes

John F Hall

I have a data set with British postcodes which normally consist of a postal district (one or two letters followed by 1 or 2 digits) followed by 3 characters (one digit two letters) signifying address group.  Full postcodes consist of 5, 6 or 7 characters, but some of the entries only show the district.  These are in a single string variable postcode:

 

NW11RS

NW52XB

N78HQ

N1

N78EU

N19DG

N18DW

N7

 

Some have spaces separating the district and the address group, but the spaces can be deleted using [CTRL]H

 

N1  9BL

N1  9JW

N1  9BA

 

A further complication is that some central London postcodes are subdivided by an additional letter:

 

EC1V97Q  [This one is impossible, but the district code is OK]

EC1N9BB

EC1V9BB

WC1N1BG

WC1N1HD

WC1N1HJ

 

To avoid any risk of disclosure, these need to be anonymised by stripping off the last three characters, but leaving the district code if it does not have these.  Thus for the above examples I need to have only:

 

NW1

NW5

N7

N1

N7

N1

N1

N7

N1

N1

N1

EC1V

EC1N

EC1V

WC1N

WC1N

WC1N

 

I’ve done it the long way round with:

 

autorecode var postcode / into district.

freq district.

 

. . then visually grouped (ignoring the additional letter for central London) with:

 

recode district

   (2 thru 96 =1)

   (97 thru 394 = 2)

   (395 thru 457 = 3)

   (458 thru 534 = 4)

   (535 thru 697 = 5)

   (702 thru 931 = 6)

   (932 thru 1011 = 7)

   (1013 thru 1022 = 8)

   (1023 thru 1125 = 9)

   (else = -1)

   into area.

 

format area (f2.0).

missing values area (-1).

freq area.

 

 

 

area

 

Frequency

Percent

Valid Percent

Cumulative Percent

Valid

1

115

8.2

8.2

8.2

2

341

24.2

24.4

32.6

3

71

5.0

5.1

37.6

4

95

6.7

6.8

44.4

5

200

14.2

14.3

58.7

6

304

21.5

21.7

80.4

7

92

6.5

6.6

87.0

8

33

2.3

2.4

89.4

9

149

10.6

10.6

100.0

Total

1400

99.2

100.0

 

Missing

-1

11

.8

 

 

Total

1411

100.0

 

 

 

. . but there must be a quicker way using one of the string functions.  Any suggestions?

 

John F Hall (Mr)

[Retired academic survey researcher]

 

Email:   [hidden email] 

Website: www.surveyresearch.weebly.com

SPSS start page:  www.surveyresearch.weebly.com/1-survey-analysis-workshop

 

 

 

 

 

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Stripping postcodes

Jignesh Sutar
This may help to some degree, starting point, to validate the postcode, group into parts and then remove as much of trailing postcode as required. SPSSINC TRANS will be useful here.

However I suspect RegEx will be new to you John...? :-) Well worth looking into..
Reply | Threaded
Open this post in threaded view
|

Re: Stripping postcodes

John F Hall
Jignesh

Thanks: interesting.  Think I'll stick to my solution, then try to work out
what the 11 dodgy codes might be (hasty typing by the client?).  I'm doing
this because Edinburgh have offered to take some small surveys off my hands
which I have preserved for years and want to put somewhere safe.  I can't
get authorisation to release them for teaching or research because the
original clients no longer exist or simply don't respond to requests for
authorisation.  I just can't bring myself to delete them.

John  

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Jignesh Sutar
Sent: 18 May 2015 11:10
To: [hidden email]
Subject: Re: Stripping postcodes

This
<http://stackoverflow.com/questions/164979/uk-postcode-regex-comprehensive>
may help to some degree, starting point, to validate the postcode, group
into parts and then remove as much of trailing postcode as required. SPSSINC
TRANS will be useful here.

However I suspect RegEx will be new to you John...? :-) Well worth looking
into..




--
View this message in context:
http://spssx-discussion.1045642.n5.nabble.com/Stripping-postcodes-tp5729566p
5729567.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command SIGNOFF SPSSX-L For a list of
commands to manage subscriptions, send the command INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Stripping postcodes

Mario Giesel
Hi, John,
  if you've got a Python installation within your SPSS version I can probably give you a syntax to do what Jignesh proposed.

GL,
  Mario



John F Hall <[hidden email]> schrieb am 12:08 Montag, 18.Mai 2015:


Jignesh

Thanks: interesting.  Think I'll stick to my solution, then try to work out
what the 11 dodgy codes might be (hasty typing by the client?).  I'm doing
this because Edinburgh have offered to take some small surveys off my hands
which I have preserved for years and want to put somewhere safe.  I can't
get authorisation to release them for teaching or research because the
original clients no longer exist or simply don't respond to requests for
authorisation.  I just can't bring myself to delete them.

John 

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Jignesh Sutar
Sent: 18 May 2015 11:10
To: [hidden email]
Subject: Re: Stripping postcodes

This
<http://stackoverflow.com/questions/164979/uk-postcode-regex-comprehensive>
may help to some degree, starting point, to validate the postcode, group
into parts and then remove as much of trailing postcode as required. SPSSINC
TRANS will be useful here.

However I suspect RegEx will be new to you John...? :-) Well worth looking
into..




--
View this message in context:
http://spssx-discussion.1045642.n5.nabble.com/Stripping-postcodes-tp5729566p
5729567.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command SIGNOFF SPSSX-L For a list of
commands to manage subscriptions, send the command INFO REFCARD


=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD


===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Mario Giesel
Munich, Germany
Reply | Threaded
Open this post in threaded view
|

Re: Stripping postcodes

Jon K Peck
In reply to this post by John F Hall
How about this, where postalcode is the string variable holding the code.

spssinc trans result=prunedpostal type=10
/formula "re.sub(r'(‎\‏d+)‎\‏D+$', r'‎\‏1',postalcode)".


Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621




From:        John F Hall <[hidden email]>
To:        [hidden email]
Date:        05/18/2015 02:28 AM
Subject:        [SPSSX-L] Stripping postcodes
Sent by:        "SPSSX(r) Discussion" <[hidden email]>




I have a data set with British postcodes which normally consist of a postal district (one or two letters followed by 1 or 2 digits) followed by 3 characters (one digit two letters) signifying address group.  Full postcodes consist of 5, 6 or 7 characters, but some of the entries only show the district.  These are in a single string variable postcode:

 

NW11RS

NW52XB

N78HQ

N1

N78EU

N19DG

N18DW

N7

 

Some have spaces separating the district and the address group, but the spaces can be deleted using [CTRL]H

 

N1  9BL

N1  9JW

N1  9BA

 

A further complication is that some central London postcodes are subdivided by an additional letter:

 

EC1V97Q  [This one is impossible, but the district code is OK]

EC1N9BB

EC1V9BB

WC1N1BG

WC1N1HD

WC1N1HJ

 

To avoid any risk of disclosure, these need to be anonymised by stripping off the last three characters, but leaving the district code if it does not have these.  Thus for the above examples I need to have only:

 

NW1

NW5

N7

N1

N7

N1

N1

N7

N1

N1

N1

EC1V

EC1N

EC1V

WC1N

WC1N

WC1N

 

I’ve done it the long way round with:

 

autorecode var postcode / into district.

freq district.

 

. . then visually grouped (ignoring the additional letter for central London) with:

 

recode district

   (2 thru 96 =1)

   (97 thru 394 = 2)

   (395 thru 457 = 3)

   (458 thru 534 = 4)

   (535 thru 697 = 5)

   (702 thru 931 = 6)

   (932 thru 1011 = 7)

   (1013 thru 1022 = 8)

   (1023 thru 1125 = 9)

   (else = -1)

   into area.

 

format area (f2.0).

missing values area (-1).

freq area.

 

 

 

area
 
Frequency
Percent
Valid Percent
Cumulative Percent
Valid 1
115
8.2
8.2
8.2
2
341
24.2
24.4
32.6
3
71
5.0
5.1
37.6
4
95
6.7
6.8
44.4
5
200
14.2
14.3
58.7
6
304
21.5
21.7
80.4
7
92
6.5
6.6
87.0
8
33
2.3
2.4
89.4
9
149
10.6
10.6
100.0
Total
1400
99.2
100.0
 
Missing -1
11
.8
   
Total
1411
100.0
   

 

. . but there must be a quicker way using one of the string functions.  Any suggestions?

 

John F Hall (Mr)

[Retired academic survey researcher]

 

Email:   johnfhall@...  

Website: www.surveyresearch.weebly.com

SPSS start page:  www.surveyresearch.weebly.com/1-survey-analysis-workshop

 

 

 

 

 

===================== To manage your subscription to SPSSX-L, send a message to LISTSERV@... (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Reply | Threaded
Open this post in threaded view
|

Re: Stripping postcodes

PRogman
In reply to this post by John F Hall
The programmer's Guide  to PAF of Royal Mail
http://www.royalmail.com/sites/default/files/docs/pdf/programmers_guide_edition_7_v5.pdf
specifies the British postcode system. The postcode is composed by two sections, outward and inward. Outward seems to hold  2-4 characters, and inward holds 3 characters (A##), in all 5 to 7 characters. Normally these are divided by a space. If you don't have to verify valid postcodes and trust your data you could just drop the inward codes and keep outward (which I believe was your wish). I assume PostCode is in correctly format and properly truncated if <5 characters.

HTH, PR

DATA LIST list / PostCode (A7).
BEGIN DATA
NW11RS
NW52XB
N78HQ
N1
N78EU
N19DG
N18DW
N7
N1  9BL
N1  9JW
N1  9BA
EC1V97Q
EC1N9BB
EC1V9BB
WC1N1BG
WC1N1HD
WC1N1HJ
END DATA.

* Length is 5-7 characters without the space *.
STRING
  #pc (A7)
 /PcOutward (A4)
 /PcInward  (A3). 
* Remove spaces *.
COMPUTE #pc=REPLACE(PostCode, ' ', '').
COMPUTE ## =CHAR.LENGTH(#pc).
DO IF (## LT 5). 
/* Assume inward code is missing */.
  COMPUTE PcOutward=#pc.
  COMPUTE PcInward =''.
ELSE.
/* Copy Outward code */.
  COMPUTE PcOutward=CHAR.SUBSTR(#pc, 1, ##-3).
  COMPUTE PcInward =CHAR.SUBSTR(#pc, ##-2, 3).
END IF.

LIST ALL.