Here is a quick Python program that removes the "http://" at the beginning of the string and then splits the string into at most 5 pieces (the 5th piece simply holds everything that is left).
****************************.
BEGIN PROGRAM Python.
def splitURL(url):
    t = url.replace(r'http://','',1)
    return t.split(r'/',4)
END PROGRAM.
****************************.

Now you can use SPSSINC TRANS to return the split string.

****************************.
*when pasting this make sure line breaks dont sneak in!.
DATA LIST FREE / url (A500).
BEGIN DATA
http://www.amazon.co.uk/gp/feature.html/ref=gw_183381267_1/280-7736284-9043160?ie=UTF8&docId=1000819073&nav_sdd=aps&pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-1&pf_rd_r=0E998J5H0PQH1ZNFWKCR&pf_rd_t=101&pf_rd_p=536784447&pf_rd_i=468294
http://www1.skysports.com/football/news/11669/9520711/premier-league-raheem-sterling-defended-by-liverpool-boss-brendan-rodgers
http://www.whatcar.com/home/used-cars
http://www.amazon.co.uk/gp/product/B00I4WQ618/ref=s9_simh_gw_p465_d0_i3?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-3&pf_rd_r=0KTS5WW6D0TB29T9NXVY&pf_rd_t=101&pf_rd_p=455333147&pf_rd_i=468294
END DATA.
SPSSINC TRANS Result = Split1 Split2 Split3 Split4 Split5 TYPE = 100
  /FORMULA splitURL(url).
****************************.

You can accomplish a similar feat in native SPSS code as well (e.g. create a vector of strings, replace "http://", then loop over the string and extract everything between two forward slashes).
|
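To check the splitting logic before wiring it into SPSSINC TRANS, the function can be run in a plain Python session. This is only a sketch of that check, using two of the sample URLs from the data above; nothing here depends on SPSS.

```python
def splitURL(url):
    # Remove the first "http://" and split on "/" at most 4 times,
    # which gives at most 5 pieces.
    t = url.replace('http://', '', 1)
    return t.split('/', 4)

short = splitURL("http://www.whatcar.com/home/used-cars")
print(short)  # ['www.whatcar.com', 'home', 'used-cars']

long_url = ("http://www1.skysports.com/football/news/11669/9520711/"
            "premier-league-raheem-sterling-defended-by-liverpool-boss-brendan-rodgers")
# The 5th piece keeps the remaining slashes unsplit.
print(splitURL(long_url))
```

Note that a short URL yields fewer than five pieces, while a long one keeps its remaining path in the last piece.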
Here is what I could come up with as an example for just native SPSS code wrapped up in a macro.
*******************************************.
DEFINE !URLParse (url = !TOKENS(1)
                 /Stub = !TOKENS(1)
                 /N = !TOKENS(1)
                 /StrLen = !DEFAULT(500) !TOKENS(1) )
STRING #t (!CONCAT("A",!StrLen)).
COMPUTE #t = REPLACE(!url,"http://","",1).
COMPUTE #f = CHAR.INDEX(#t,"/").
VECTOR !Stub(!N,A100).
LOOP #i = 1 TO !N.
  DO IF (#f > 0) AND (#i < !N).
    COMPUTE !Stub(#i) = CHAR.SUBSTR(#t,1,#f-1).
    COMPUTE #t = CHAR.SUBSTR(#t,#f+1,!StrLen).
    COMPUTE #f = CHAR.INDEX(#t,"/").
  ELSE.
    COMPUTE !Stub(#i) = #t.
    COMPUTE #f = -1.
  END IF.
END LOOP IF #f = -1.
!ENDDEFINE.
!UrlParse url = url Stub = Base N = 5.
EXECUTE.
*******************************************.
|
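For readers who find the macro's loop hard to follow, here is a rough Python mirror of the same logic. It is only an illustration: the function name `parse_url_loop` is made up for this sketch, and the empty-string padding stands in for the blank string variables the VECTOR command creates.

```python
def parse_url_loop(url, n=5):
    """Hypothetical Python mirror of the !URLParse loop (illustration only)."""
    t = url.replace("http://", "", 1)
    pieces = [""] * n              # like VECTOR Stub(N, A100): n blank strings
    for i in range(n):
        f = t.find("/")            # -1 when no slash is left
        if f >= 0 and i < n - 1:
            pieces[i] = t[:f]      # text before the slash
            t = t[f + 1:]          # keep what follows the slash
        else:
            pieces[i] = t          # last slot, or no slash left: store the rest
            break
    return pieces

print(parse_url_loop("http://www.whatcar.com/home/used-cars"))
# ['www.whatcar.com', 'home', 'used-cars', '', '']
```

Unlike the Python split, this version always returns n values, with the unused ones left blank.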
In reply to this post by Andy W
Hi Andy,
Thanks for your Python code. It worked! However, I have a few URLs that start with https rather than http, for example:
https://www.google.com.au/_/chrome/newtab?rlz=1C1CHWA_enAU595AU595&espv=2&ie=UTF-8
which is different from the normal URLs:
http://www.ebay.co.uk/sch/i.html?_trksid=p2047675.m570.l1313.TR7.TRC1.A0.H0.Xferrari+engine&_nkw=ferrari+engine&_sacat=0&_from=R40
What can I do? Thanks!
mils
|
Obviously a second replace statement!
Read the code and try to grok it!
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
In reply to this post by mils
As David said, a second replace in either code snippet would do the trick. This is a case in which order does not matter. In Python you can chain the replace calls, since each one returns a new string, e.g.
****************************.
BEGIN PROGRAM Python.
def splitURL(url):
    t = url.replace(r'http://','',1).replace(r'https://','',1)
    return t.split(r'/',4)
END PROGRAM.
****************************.
|
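One caveat with chained replaces: `str.replace` removes the first occurrence anywhere in the string, not only at the start, so an https URL that happens to contain "http://" later on (say, in a redirect parameter) would lose that inner piece. A possible alternative, sketched here, anchors a regular expression at the start of the string. The name `splitURL_anchored` and the example.com URL are made up for illustration.

```python
import re

def splitURL_anchored(url):
    # "^https?://" matches only a leading "http://" or "https://".
    t = re.sub(r'^https?://', '', url, count=1)
    return t.split('/', 4)

print(splitURL_anchored(
    "https://www.google.com.au/_/chrome/newtab?rlz=1C1CHWA_enAU595AU595&espv=2&ie=UTF-8"))
# ['www.google.com.au', '_', 'chrome', 'newtab?rlz=1C1CHWA_enAU595AU595&espv=2&ie=UTF-8']

# Hypothetical URL with an embedded "http://": the inner one is left alone.
print(splitURL_anchored("https://example.com/go?to=http://other.example/x"))
```

This can be dropped into the BEGIN PROGRAM block in place of the chained version, provided `import re` is included there as well.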
In reply to this post by Andy W
begin program.
import re
def splitURL(url):
    return re.split("(?<!/)/", url, maxsplit=6)
end program.

This returns the protocol (http, https, etc), the domain (which may need a little more cleaning), followed by the locations. Six indicates the maximum number of splits.

Regards,
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
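To make the "little more cleaning" concrete, here is what the lookbehind split returns for one of the sample URLs, plus one possible cleanup. The helper name `split_url_clean` is hypothetical and not part of the original post.

```python
import re

def splitURL(url):
    # Split on any "/" that is not immediately preceded by another "/".
    return re.split(r"(?<!/)/", url, maxsplit=6)

def split_url_clean(url):
    # Hypothetical cleanup: drop the ":" left on the protocol and the
    # leading "/" left on the domain.
    parts = splitURL(url)
    parts[0] = parts[0].rstrip(":")
    if len(parts) > 1:
        parts[1] = parts[1].lstrip("/")
    return parts

print(splitURL("http://www.whatcar.com/home/used-cars"))
# ['http:', '/www.whatcar.com', 'home', 'used-cars']
print(split_url_clean("https://www.whatcar.com/home/used-cars"))
# ['https', 'www.whatcar.com', 'home', 'used-cars']
```

Because the protocol comes back as its own piece, this approach handles http and https alike without any replace step.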