Long story short, I have a file that (I believe) has some type of encoding that is preventing me from parsing text fields. I have attached the sav file I am working with to the NABBLE post, although I know that sometimes works haphazardly, so I also uploaded the data file, syntax, and original xls spreadsheet the data was generated from in this dropbox link (http://dl.dropbox.com/u/3385251/Nabble_Post.zip). I can provide more explicit code of how I went from the spreadsheet to the SPSS file if needed (or how the tables were generated if needed).

So attached I have an example utilizing both the attached file, and some tests using the same data reading in the data directly using DATA LIST commands.

I would appreciate it if someone would just open up the file and run the code to confirm that I am not crazy! My current version of SPSS is 19.0.0.2, and I have tested this on two different Windows machines (one XP and one 7). Any advice is appreciated.

Andy W

tables_AMIN.sav (http://spssx-discussion.1045642.n5.nabble.com/file/n5713724/tables_AMIN.sav) |
Reading in your sav file, although the first case for V1 looks like BLOCK, you can see that if you change the format to AHEX(240), it is actually BLOCKx0A. That is, BLOCK<LF>.
It's not an encoding problem.

HTH,
Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621

From: Andy W <[hidden email]>
To: [hidden email]
Date: 06/20/2012 01:55 PM
Subject: [SPSSX-L] Parsing text fields, a potential encoding issue?
Sent by: "SPSSX(r) Discussion" <[hidden email]>

Long story short, I have a file that (I believe) has some type of encoding that is preventing me from parsing text fields. I have attached the sav file I am working with to the NABBLE post, although I know that sometimes works haphazardly, so I also uploaded the data file, syntax, and original xls spreadsheet the data was generated from in this dropbox link. I can provide more explicit code of how I went from the spreadsheet to the SPSS file if needed (or how the tables were generated if needed).

So attached I have an example utilizing both the attached file, and some tests using the same data reading in the data directly using DATA LIST commands. What is weird is that in the first data list I used the data pasted exactly from the first three records and fields directly in the command, and the if statement failed to recognize "BLOCK". In the second data list set, with me just typing in the data, it worked as it should. Note I also tried all of this with and without setting the system encoding to UNICODE. Below I have the syntax pasted, although I have no idea if by the time it is copy-pasted from the browser the same problem will persist. Also I have copied and pasted the text into notepad++, and there doesn't appear to be anything awry with the encoding there either, I believe.

dataset close ALL.
output close ALL.
new file.

*tried also setting to UNICODE mode, had no impact.
*SET Unicode = NO.
*SET Unicode = YES.

*FILE HANDLE data /name = "YOUR PATH HERE".
get file = "data\tables_AMIN.sav".
dataset name full_file.

compute flag = 0.
if V1 = "BLOCK " flag = 1.
exe.

*This is the data copy and pasted from the data file, the if statement fails.
data list free ("|") / V1 (A20) V2 (A20) V3 (A20).
begin data
BLOCK | LOT | ADDRESS
2215 | 116 | 5210 BROADWAY
2215 | 116 | 5220 BROADWAY
end data.
dataset name input.
dataset activate input.

compute flag = 0.
if V1 = "BLOCK " flag = 1.
exe.

list ALL.

*This is the data I typed in, the if statement works like it is supposed to.
data list free ("|") / V1 (A20) V2 (A20) V3 (A20).
begin data
BLOCK | LOT | ADDRESS
2215 | 116 | 5210 BROADWAY
2215 | 116 | 5210 BROADWAY
end data.
dataset name input2.
dataset activate input2.

compute flag = 0.
if V1 = "BLOCK " flag = 1.
exe.

list ALL.

I would appreciate it if someone would just open up the file and run the code to confirm that I am not crazy! My current version of SPSS is 19.0.0.2, and I have tested this on two different Windows machines (one XP and one 7). Any advice is appreciated.

Andy W

tables_AMIN.sav
View this message in context: Parsing text fields, a potential encoding issue? Sent from the SPSSX Discussion mailing list archive at Nabble.com. |
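The diagnosis in the reply above can be mimicked outside SPSS. What follows is only a Python sketch under stated assumptions: the two strings are hypothetical stand-ins for an A20 field (they are not read from tables_AMIN.sav), and `hex_dump` is a helper made up for this illustration that plays roughly the role of viewing the field with an AHEX format.

```python
WIDTH = 20  # matches the A20 fields in the posted DATA LIST example

# Hypothetical stand-ins for the first V1 value (assumption, not real file contents).
stored = "BLOCK\n".ljust(WIDTH)  # BLOCK + LF, then space padding (as in the .sav file)
typed = "BLOCK".ljust(WIDTH)     # BLOCK followed only by spaces (as when typed by hand)


def hex_dump(value: str) -> str:
    """Return the bytes of an ASCII string as space-separated uppercase hex pairs."""
    return " ".join(f"{byte:02X}" for byte in value.encode("ascii"))


print(stored == typed)        # False: the LF makes the two values differ
print(hex_dump(stored[:6]))   # 42 4C 4F 43 4B 0A
print(hex_dump(typed[:6]))    # 42 4C 4F 43 4B 20
```

Both values look like BLOCK in a data grid, but the sixth byte is 0A in one and 20 in the other, which is consistent with the comparison against "BLOCK " failing.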
Administrator
|
In reply to this post by Andy W
Andy,
I opened your data file (.sav) and determined upon QAD analysis that the terminating character in your string fields is *NOT* a space. It turns out to be a LF character (ASCII 10 rather than ASCII 32 -space-).
HTH: David
--
Quick fix:
DO REPEAT V=V1 TO V8.
COMPUTE #=LENGTH(RTRIM(V)).
IF NUMBER(SUBSTR(V,#,1),PIB)=10 SUBSTR(V,#,1)=" ".
END REPEAT.
if V1 = "BLOCK " flag = 1.
exe.
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
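The logic of the quick fix above (find the last non-space character of each padded field and, if it is ASCII 10, overwrite it with a space) can be sketched in Python. This is an illustration only, not a replacement for the SPSS syntax; `blank_trailing_lf` is a hypothetical helper name and the sample values are assumptions modelled on the posted data.

```python
def blank_trailing_lf(value: str) -> str:
    """Replace a line feed that ends the non-blank part of a padded field with a space."""
    last = len(value.rstrip(" "))         # 1-based position of the last non-space character
    if last and value[last - 1] == "\n":  # "\n" is ASCII 10
        return value[:last - 1] + " " + value[last:]
    return value                          # nothing to fix, return the field unchanged


# Hypothetical A20 fields: one carrying the stray LF, one already clean.
dirty = "BLOCK\n".ljust(20)
clean = "5210 BROADWAY".ljust(20)

fixed = blank_trailing_lf(dirty)
print(fixed == "BLOCK".ljust(20))         # True
print(blank_trailing_lf(clean) == clean)  # True
```

The field width is preserved, so a later comparison against a space-padded literal behaves the way the original poster expected.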
Administrator
|
Another option:

compute flag = index(v1,"BLOCK") GT 0.

This method and David's flag the same 103 records in Andy's file.
--
Bruce Weaver bweaver@lakeheadu.ca http://sites.google.com/a/lakeheadu.ca/bweaver/ "When all else fails, RTFM." PLEASE NOTE THE FOLLOWING: 1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above. 2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/). |
Administrator
|
Except notice that *EVERY* string in the file has been affected by the appended <LF>. Best to fix it ASAP.

Aside from that, one could also use:

COMPUTE flag=SUBSTR(V1,1,5) EQ "BLOCK".
|
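The two flagging expressions suggested in the thread, index(v1,"BLOCK") GT 0 and SUBSTR(V1,1,5) EQ "BLOCK", are not quite the same test: the first matches BLOCK anywhere in the field, the second only at the start. A rough Python analogue, with hypothetical function names and made-up sample values, shows where they would agree and where they could differ:

```python
def flag_contains(value: str, needle: str = "BLOCK") -> int:
    """1 if needle occurs anywhere in value, else 0 (analogue of the index test)."""
    return int(value.find(needle) >= 0)


def flag_prefix(value: str, needle: str = "BLOCK") -> int:
    """1 if value starts with needle, else 0 (analogue of the SUBSTR test)."""
    return int(value[:len(needle)] == needle)


# Hypothetical field values; only the first mirrors what the thread reports.
samples = ["BLOCK\n".ljust(20), "2215\n".ljust(20), "CITY BLOCK\n".ljust(20)]
for value in samples:
    print(repr(value.strip()), flag_contains(value), flag_prefix(value))
```

Neither test depends on the character that follows BLOCK, which is why both sidestep the stray line feed. On a value such as the made-up "CITY BLOCK" only the contains-style test would flag the row.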
Using newer functionality...

match files file=* /keep table all.
do repeat #v = v1 to v10.
compute #v=replace(#v, string(10, pib1),"").
end repeat.

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621

From: David Marso <[hidden email]>
To: [hidden email]
Date: 06/20/2012 03:37 PM
Subject: Re: [SPSSX-L] Parsing text fields, a potential encoding issue?
Sent by: "SPSSX(r) Discussion" <[hidden email]>

Except notice that *EVERY* string in the file has been affected by the appended <LF>. Best to fix it ASAP.
Aside from that, one could also use:
COMPUTE flag=SUBSTR(V1,1,5) EQ "BLOCK".
|
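The replace-based cleanup above removes every line feed in each string variable rather than only a trailing one. A minimal Python sketch of the same idea follows; the row contents are hypothetical, `strip_line_feeds` is a made-up helper, and the re-padding to width 20 is an assumption that stands in for SPSS keeping string variables at a fixed width.

```python
LF = chr(10)  # the same character that string(10, pib1) denotes in the syntax above


def strip_line_feeds(value: str, width: int) -> str:
    """Remove every LF from value and pad the result with spaces to a fixed width."""
    return value.replace(LF, "").ljust(width)


# Hypothetical first record, each A20 field ending in LF before the padding.
row = {
    "V1": "BLOCK\n".ljust(20),
    "V2": "LOT\n".ljust(20),
    "V3": "ADDRESS\n".ljust(20),
}
cleaned = {name: strip_line_feeds(value, 20) for name, value in row.items()}

print(cleaned["V1"] == "BLOCK".ljust(20))  # True
```

Because the character is removed wherever it occurs, this variant would also catch a line feed embedded in the middle of a value, which the trailing-character fix would leave alone.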
Thank you Jon, David and Bruce. So in the future I should just

Alter type mystring (A = AHEX240).

(is that how you guys figured it out?) What is QAD analysis?

Apparently if you copy and paste directly from the data editor the line feed is not carried over into the text field (either into the native SPSS syntax editor or other text editors). Some quick experimentation suggests plain text copy and pasting on my Windows machine does not carry the line feed (is the line feed character a *NIX thing?).

Andy W

PS: I answer a question and I get 20 out of office replies to my email, I ask a question and don't get any (which I suppose I shouldn't be complaining about). WTF Nabble?
|
In reply to this post by Andy W
AHEX is just a format, not a type, so you change an An field (a string of width n) to an AHEX2n field using the FORMATS command.

Windows uses CRLF for a line break while *nix systems use just LF. Statistics tries to cope with this variation on all platforms, but it might depend on exactly what you did. Also, different editors will display these sequences differently, so only a hex dump can prove what is really there.

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621

From: Andy W <[hidden email]>
To: [hidden email]
Date: 06/20/2012 07:15 PM
Subject: Re: [SPSSX-L] Parsing text fields, a potential encoding issue?
Sent by: "SPSSX(r) Discussion" <[hidden email]>

Thank you Jon, David and Bruce. So in the future I should just Alter type mystring (A = AHEX240). (is that how you guys figured it out?) What is QAD analysis?

Apparently if you copy and paste directly from the data editor the line feed is not carried over into the text field (either into the native SPSS syntax editor or other text editors). Some quick experimentation suggests plain text copy and pasting on my Windows machine does not carry the line feed (is the line feed character a *NIX thing?).

Andy W
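The point that only a hex dump can prove what is really there can be illustrated with a few lines of Python. This is a sketch under assumptions: the byte strings are invented examples of Windows-style and *nix-style text, and `hex_dump` and `count_line_endings` are hypothetical helpers, not part of SPSS or of any library.

```python
def hex_dump(data: bytes) -> str:
    """Return raw bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{byte:02X}" for byte in data)


def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs and LF bytes that are not preceded by CR."""
    crlf = data.count(b"\r\n")
    bare_lf = data.count(b"\n") - crlf  # every CRLF contains one LF, so subtract those
    return {"crlf": crlf, "bare_lf": bare_lf}


windows_style = b"BLOCK\r\nLOT\r\n"  # invented sample with CRLF line breaks
unix_style = b"BLOCK\nLOT\n"         # invented sample with bare LF line breaks

print(hex_dump(windows_style))  # 42 4C 4F 43 4B 0D 0A 4C 4F 54 0D 0A
print(hex_dump(unix_style))     # 42 4C 4F 43 4B 0A 4C 4F 54 0A
print(count_line_endings(windows_style))
print(count_line_endings(unix_style))
```

An editor may render both samples identically, whereas the dump shows 0D 0A in one and a lone 0A in the other.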