SPSSX Discussion

Subtract information from the URL/Browser (http://)

mils

Subtract information from the URL/Browser (http://)

Hi SPSS gurus!

I've collected some consumer browser information. Example below:

http://www.amazon.co.uk/gp/feature.html/ref=gw_183381267_1/280-7736284-9043160?ie=UTF8&docId=1000819073&nav_sdd=aps&pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-1&pf_rd_r=0E998J5H0PQH1ZNFWKCR&pf_rd_t=101&pf_rd_p=536784447&pf_rd_i=468294

http://www1.skysports.com/football/news/11669/9520711/premier-league-raheem-sterling-defended-by-liverpool-boss-brendan-rodgers

http://www.whatcar.com/home/used-cars

http://www.amazon.co.uk/gp/product/B00I4WQ618/ref=s9_simh_gw_p465_d0_i3?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-3&pf_rd_r=0KTS5WW6D0TB29T9NXVY&pf_rd_t=101&pf_rd_p=455333147&pf_rd_i=468294

First, I would like to extract the website name:

web_names
amazon
skysports
whatcar
amazon

Then, I would like to keep a record of the sections they have visited. Ideally, I would like to extract everything between the slashes (/) into separate columns. Taking Skysports as an example:

Section_1 Section_2 Section_3 Section_n
Football news ….. ….

I'm happy to use either SPSS or Python.

Thanks in advance!

mils

Andy W

Re: Subtract information from the URL/Browser (http://)

Here is a quick Python program that removes the "http://" at the beginning of the string and then splits the string into up to 5 pieces (the 5th piece simply holds everything that is left).

****************************.
BEGIN PROGRAM Python.
def splitURL(url):
    # Drop the first "http://", then split on "/" at most 4 times (up to 5 pieces).
    t = url.replace(r'http://','',1)
    return t.split(r'/',4)
END PROGRAM.
****************************.

Now you can use SPSSINC TRANS to return the split string.

****************************.
*when pasting this make sure line breaks dont sneak in!.
DATA LIST FREE / url (A500).
BEGIN DATA
http://www.amazon.co.uk/gp/feature.html/ref=gw_183381267_1/280-7736284-9043160?ie=UTF8&docId=1000819073&nav_sdd=aps&pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-1&pf_rd_r=0E998J5H0PQH1ZNFWKCR&pf_rd_t=101&pf_rd_p=536784447&pf_rd_i=468294
http://www1.skysports.com/football/news/11669/9520711/premier-league-raheem-sterling-defended-by-liverpool-boss-brendan-rodgers
http://www.whatcar.com/home/used-cars
http://www.amazon.co.uk/gp/product/B00I4WQ618/ref=s9_simh_gw_p465_d0_i3?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-3&pf_rd_r=0KTS5WW6D0TB29T9NXVY&pf_rd_t=101&pf_rd_p=455333147&pf_rd_i=468294
END DATA.
SPSSINC TRANS Result = Split1 Split2 Split3 Split4 Split5 TYPE = 100 /FORMULA splitURL(url).
****************************.
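To see what the function hands back before wiring it into SPSSINC TRANS, it can be run in plain Python outside SPSS (Python 3 is assumed here). The sketch below repeats splitURL and adds a hypothetical helper, split_url_padded, which is not part of the original answer; it pads short results with empty strings so every URL yields exactly 5 values. Whether SPSSINC TRANS needs that padding is not something this thread establishes.

```python
def splitURL(url):
    # Same logic as above: drop the first "http://", then split on "/"
    # at most 4 times, so at most 5 pieces come back.
    t = url.replace('http://', '', 1)
    return t.split('/', 4)


def split_url_padded(url):
    # Hypothetical helper (an assumption, not from the thread):
    # pad with empty strings so the result always has 5 elements.
    parts = splitURL(url)
    return parts + [''] * (5 - len(parts))


short_url = 'http://www.whatcar.com/home/used-cars'
long_url = ('http://www1.skysports.com/football/news/11669/9520711/'
            'premier-league-raheem-sterling-defended-by-liverpool-boss-brendan-rodgers')

print(splitURL(short_url))          # 3 pieces: domain, "home", "used-cars"
print(split_url_padded(short_url))  # same 3 pieces plus two empty strings
print(splitURL(long_url))           # 5 pieces; the 5th keeps the remaining "/"
```

Note that the 5th piece of a long URL still contains slashes, because the split stops after 4 cuts.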

You can accomplish a similar feat in native SPSS code as well (e.g. create a vector of strings, remove "http://", then loop over the string and extract everything between each pair of forward slashes).

Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/

Andy W

Re: Subtract information from the URL/Browser (http://)

Here is what I could come up with as an example for just native SPSS code wrapped up in a macro.

*******************************************.
DEFINE !URLParse (url = !TOKENS(1)
/Stub = !TOKENS(1)
/N = !TOKENS(1)
/StrLen = !DEFAULT(500) !TOKENS(1) )
STRING #t (!CONCAT("A",!StrLen)).
COMPUTE #t = REPLACE(!url,"http://","",1).
COMPUTE #f = CHAR.INDEX(#t,"/").
VECTOR !Stub(!N,A100).
LOOP #i = 1 TO !N.
DO IF (#f > 0) AND (#i < !N).
COMPUTE !Stub(#i) = CHAR.SUBSTR(#t,1,#f-1).
COMPUTE #t = CHAR.SUBSTR(#t,#f+1,!StrLen).
COMPUTE #f = CHAR.INDEX(#t,"/").
ELSE.
COMPUTE !Stub(#i) = #t.
COMPUTE #f = -1.
END IF.
END LOOP IF #f = -1.
!ENDDEFINE.

!UrlParse url = url Stub = Base N = 5.
EXECUTE.
*******************************************.
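For readers more comfortable with Python, the loop inside the macro can be sketched as follows. This is only an illustration of the same find-slash, take-substring, keep-remainder logic; the function name url_parse and its n parameter are made up for this sketch, and Python 3 is assumed.

```python
def url_parse(url, n=5):
    # Mirrors the macro: strip the first "http://", then peel off one
    # piece per loop pass until n pieces exist or no "/" is left.
    t = url.replace('http://', '', 1)
    pieces = []
    f = t.find('/')                # -1 when there is no slash
    for i in range(1, n + 1):
        if f >= 0 and i < n:
            pieces.append(t[:f])   # text before the slash
            t = t[f + 1:]          # keep everything after the slash
            f = t.find('/')
        else:
            pieces.append(t)       # last slot takes whatever remains
            break
    return pieces


print(url_parse('http://www.whatcar.com/home/used-cars'))
```

For the sample URLs this gives the same pieces as the Python split('/', 4) approach; in the macro, any unused elements of the vector simply stay blank.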

Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/

mils

Re: Subtract information from the URL/Browser (http://)

In reply to this post by Andy W

Hi Andy,

Thanks for your Python code. It worked! However, I have a few URLs that start with https rather than http.

https://www.google.com.au/_/chrome/newtab?rlz=1C1CHWA_enAU595AU595&espv=2&ie=UTF-8

different from the normal URL:

http://www.ebay.co.uk/sch/i.html?_trksid=p2047675.m570.l1313.TR7.TRC1.A0.H0.Xferrari+engine&_nkw=ferrari+engine&_sacat=0&_from=R40

What can I do?

Thanks!

mils

David Marso

Re: Subtract information from the URL/Browser (http://)

Obviously a second replace statement!
Read the code and try to grok it!

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Andy W

Re: Subtract information from the URL/Browser (http://)

In reply to this post by mils

As David said, just a second replace in either code snippet would do the trick. This is a case in which the order of the two replacements does not matter. In Python you can chain the calls, since each replace returns a new string, e.g.

****************************.
BEGIN PROGRAM Python.
def splitURL(url):
    # Strip either prefix; each replace touches at most one occurrence.
    t = url.replace(r'http://','',1).replace(r'https://','',1)
    return t.split(r'/',4)
END PROGRAM.
****************************.
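One caveat with chained replace calls is that they remove the first match anywhere in the string, not only at the start. A possible alternative, sketched here under the assumption of Python 3, is a regular expression anchored at the beginning of the string that strips any scheme. The names strip_scheme and split_url_any_scheme are made up for this sketch, and the example.com URL is an invented test case, not data from the thread.

```python
import re

# Matches a URL scheme such as "http://" or "https://" only at the start.
_SCHEME = re.compile(r'^[a-z][a-z0-9+.-]*://', re.IGNORECASE)


def strip_scheme(url):
    # Remove the leading scheme, if any; the rest of the URL is untouched.
    return _SCHEME.sub('', url, count=1)


def split_url_any_scheme(url, maxsplit=4):
    # Same splitting as before, but independent of which scheme is used.
    return strip_scheme(url).split('/', maxsplit)


print(split_url_any_scheme(
    'https://www.google.com.au/_/chrome/newtab?rlz=1C1CHWA_enAU595AU595&espv=2&ie=UTF-8'))

# An https URL that carries another "http://" in its query string:
tricky = 'https://example.com/r?to=http://other.com'
print(strip_scheme(tricky))
print(tricky.replace('http://', '', 1).replace('https://', '', 1))
```

The last two lines differ: the anchored version leaves the embedded "http://" alone, while the chained replace removes it.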

Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/

Albert-Jan Roskam-2

Re: Subtract information from the URL/Browser (http://)

In reply to this post by Andy W

begin program.
import re
def splitURL(url):
    # Split on every "/" that is not itself preceded by a "/".
    return re.split("(?<!/)/", url, 6)
end program.

This returns the protocol (http:, https:, etc.), the domain (which keeps a leading slash and may need a little more cleaning), followed by the locations. Six indicates the maximum number of splits, so at most seven pieces come back.
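As a rough illustration of the extra cleaning, and of the website names asked for in the first post, the sketch below runs the same regular expression in plain Python 3 and derives a short site name from the domain. The helper site_name is hypothetical and purely a heuristic: it assumes the URL starts with a scheme, drops a leading "www" or "www1"-style label, and keeps the next label, so a host such as news.bbc.co.uk would come out as "news". The maxsplit value is passed by keyword here, which newer Python versions prefer.

```python
import re


def splitURL(url):
    # Split on each "/" not preceded by another "/", at most 6 times.
    return re.split("(?<!/)/", url, maxsplit=6)


def site_name(url):
    # Hypothetical helper: second piece is the domain with a leading "/".
    host = splitURL(url)[1].lstrip('/')
    host = re.sub(r'^www\d*\.', '', host)   # drop "www." or "www1."
    return host.split('.')[0]               # first remaining label


urls = [
    'http://www.amazon.co.uk/gp/product/B00I4WQ618/',
    'http://www1.skysports.com/football/news/11669/9520711/premier-league',
    'http://www.whatcar.com/home/used-cars',
    'https://www.google.com.au/_/chrome/newtab',
]

print(splitURL('http://www.whatcar.com/home/used-cars'))
for u in urls:
    print(site_name(u))
```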

Regards,

Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD