Hi,
I have a CSV file that is in UTF-8. I switched to Unicode mode (SET UNICODE=ON) and read the file in (GET DATA). However, accented characters are still not displayed properly. Does SPSS expect a BOM, even though switching to Unicode mode is already the same as saying "assume UTF-8 from now on, irrespective of locale settings"?

I am now using Python codecs.open to read the file in and do value.decode("utf-8").encode("cp1252") on every value (I prefer cp1252). That works for all values except one, which generates a UnicodeDecodeError. It's just one error, but I still find it kind of suspicious.

An example of a wrong value when I use UNICODE=ON is "AziÃ«laan" ('Azi\xc3\xablaan'). If I use Python to decode that to a unicode string assuming UTF-8, i.e. value.decode("utf-8"), the result looks correct, and naturally I can encode that to a cp1252 bytestring. What am I missing here?

Regards,
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
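A minimal sketch of the codecs-based transcoding described above, assuming Python 2 (the message calls .decode()/.encode() on bytestrings); the file names are hypothetical, and errors="replace" is an optional safety net for the occasional byte that is not valid UTF-8:

    # Sketch of the workaround: read the UTF-8 csv, write a cp1252 copy
    # that GET DATA can read in code page mode. Python 2 assumed.
    import codecs

    infile = "data_utf8.csv"     # hypothetical: the original UTF-8 file
    outfile = "data_cp1252.csv"  # hypothetical: transcoded copy for SPSS

    # codecs.open decodes each line to unicode on read; the cp1252 writer
    # encodes it back to a bytestring on write. errors="replace" swaps in
    # a replacement character / '?' instead of raising an exception for
    # bytes that do not fit the encoding.
    with codecs.open(infile, "r", encoding="utf-8", errors="replace") as fin, \
         codecs.open(outfile, "w", encoding="cp1252", errors="replace") as fout:
        for line in fin:
            fout.write(line)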
If you explicitly specify that it's UTF-8,
then the file will be read as Unicode, regardless of whether or not the
file contains a BOM. Simply running in Unicode may not be sufficient.
Rick Oliver
Senior Information Developer
IBM Business Analytics (SPSS)
E-mail: [hidden email]
From: Rick Oliver <[hidden email]>
>To: [hidden email]
>Sent: Wednesday, September 25, 2013 4:24 PM
>Subject: Re: [SPSSX-L] encoding issue with a utf-8 csv file
>
>If you explicitly specify that it's UTF-8, then the file will be read as Unicode, regardless of whether or not the file contains a BOM. Simply running in Unicode may not be sufficient.

Hi Rick,

Yes, I also thought it would be strange if a BOM was needed. This is from the CSR, under GET DATA: "Simple (ASCII) text data files. The encoding used to read text data files is UTF-8 in Unicode mode or the code page determined by the current locale in code page mode." That seems to suggest (to me) that SET UNICODE=ON should do the trick. But it doesn't. I could have sworn that GET DATA had an /ENCODING subcommand, but I am probably confusing it with SAVE TRANSLATE.

Regards,
Albert-Jan
I'm probably a few releases ahead of you
(ENCODING was added to GET DATA in release 22) and you're right that it
should read the file as UTF-8 in Unicode mode.
Are you sure the file is UTF-8, not UTF-16? (Support for UTF-16 was also added in release 22.)

Rick Oliver
Senior Information Developer
IBM Business Analytics (SPSS)
E-mail: [hidden email]
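One quick way to check the UTF-8 vs. UTF-16 question is to look at the first few bytes of the file. A small sketch (the file name is hypothetical); note that the absence of a BOM proves nothing, since UTF-8 files are often written without one:

    # Guess the encoding family from the byte order mark, if there is one.
    import codecs

    def sniff_bom(filename):
        with open(filename, "rb") as f:
            head = f.read(4)
        if head.startswith(codecs.BOM_UTF8):      # '\xef\xbb\xbf'
            return "UTF-8 with BOM"
        if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            return "UTF-16"
        return "no BOM (could still be UTF-8, or a code page such as cp1252)"

    print(sniff_bom("data.csv"))   # hypothetical file name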
From: Rick Oliver <[hidden email]>
>To: Albert-Jan Roskam <[hidden email]>
>Cc: "[hidden email]" <[hidden email]>
>Sent: Wednesday, September 25, 2013 4:55 PM
>Subject: Re: [SPSSX-L] encoding issue with a utf-8 csv file
>
>I'm probably a few releases ahead of you (ENCODING was added to GET DATA in release 22) and you're right that it should read the file as UTF-8 in Unicode mode.

Ah, yes, I am using SPSS v22.0.0.2. It makes sense to add an /ENCODING subcommand to GET DATA too, just like many other commands have. It always feels a little drastic to use SET UNICODE=ON, as it requires that all non-empty datasets be closed. Still, SET UNICODE=ON followed by GET DATA does not cause the accented characters to be displayed properly. Using Python codecs.open(fn, encoding="utf-8") and then transcoding the values to cp1252 bytestrings does work. I am then using GET DATA to open the transcoded file.

>Are you sure the file is UTF-8, not UTF-16? (Support for UTF-16 was also added in 22.)

I also tried UTF-16, but nope. The one error that I got was probably a data entry error. Instead of à ('\xe0', backtick-SHIFT-a) they typed Ã ('\xc3', SHIFT-backtick-SHIFT-a), which can only be decoded to unicode using cp1252, not UTF-8.

Regards,
Albert-Jan
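To illustrate that last point: '\xc3' is only the lead byte of a two-byte UTF-8 sequence, so on its own it cannot be decoded as UTF-8, while cp1252 simply maps the single byte to Ã. A small sketch (Python 2 assumed; the second value is made up for illustration):

    good = 'Azi\xc3\xablaan'   # bytes from the thread: valid UTF-8 for u'Azi\xeblaan'
    bad = 'stra\xc3t'          # hypothetical: a lone '\xc3' with no continuation byte

    print(repr(good.decode("utf-8")))    # u'Azi\xeblaan' -- correct
    print(repr(bad.decode("cp1252")))    # u'stra\xc3t'   -- the stray byte becomes A-tilde
    try:
        bad.decode("utf-8")              # 't' (0x74) is not a UTF-8 continuation byte
    except UnicodeDecodeError as exc:
        print(exc)                       # "... invalid continuation byte"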