Hi,
I have a CSV file that is in UTF-8. I switched to Unicode mode (SET UNICODE=ON) and read the file in (GET DATA). However, accented characters are still not displayed properly. Does SPSS expect a BOM, even though switching to Unicode mode is already the same as saying "assume UTF-8 from now on, irrespective of locale settings"?

I am now using Python codecs.open to read the file in and do value.decode("utf-8").encode("cp1252") on every value (I prefer cp1252). That works for all values except one, which generates a UnicodeDecodeError. It's just one error, but I still find it kind of suspicious.

An example of a wrong value when I use UNICODE=ON is "AziÃ«laan" ('Azi\xc3\xablaan'). If I use Python to decode that to a unicode string assuming UTF-8, i.e. value.decode("utf-8"), the result looks correct, and naturally I can encode that to a cp1252 bytestring. What am I missing here?

Regards,
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
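A minimal sketch of the codecs-based transcoding described above, assuming Python 2 (the message calls .decode()/.encode() on bytestrings); the file names are hypothetical, and errors="replace" is an optional safety net for the occasional byte that is not valid UTF-8:

    # Sketch of the workaround: read the UTF-8 csv, write a cp1252 copy
    # that GET DATA can read in code page mode. Python 2 assumed.
    import codecs

    infile = "data_utf8.csv"     # hypothetical: the original UTF-8 file
    outfile = "data_cp1252.csv"  # hypothetical: transcoded copy for SPSS

    # codecs.open decodes each line to unicode on read; the cp1252 writer
    # encodes it back to a bytestring on write. errors="replace" swaps in
    # a replacement character / '?' instead of raising an exception for
    # bytes that do not fit the encoding.
    with codecs.open(infile, "r", encoding="utf-8", errors="replace") as fin, \
         codecs.open(outfile, "w", encoding="cp1252", errors="replace") as fout:
        for line in fin:
            fout.write(line)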
If you explicitly specify that it's UTF-8,
then the file will be read as Unicode, regardless of whether or not the
file contains a BOM. Simply running in Unicode may not be sufficient.
Rick Oliver
Senior Information Developer
IBM Business Analytics (SPSS)
E-mail: [hidden email]
From: Rick Oliver <[hidden email]>
>To: [hidden email]
>Sent: Wednesday, September 25, 2013 4:24 PM
>Subject: Re: [SPSSX-L] encoding issue with a utf-8 csv file
>
>If you explicitly specify that it's UTF-8, then the file will be read as Unicode, regardless of whether or not the file contains a BOM. Simply running in Unicode may not be sufficient.

Hi Rick,

Yes, I also thought it would be strange if a BOM was needed. This is from the CSR, under GET DATA: "Simple (ASCII) text data files. The encoding used to read text data files is UTF-8 in Unicode mode or the code page determined by the current locale in code page mode." That seems to suggest (to me) that SET UNICODE=ON should do the trick. But it doesn't. I could have sworn that GET DATA had an /ENCODING subcommand, but I am probably confusing it with SAVE TRANSLATE.

Regards,
Albert-Jan
I'm probably a few releases ahead of you
(ENCODING was added to GET DATA in release 22) and you're right that it
should read the file as UTF-8 in Unicode mode.
Are you sure the file is UTF-8, not UTF-16? (Support for UTF-16 was also added in release 22.)

Rick Oliver
Senior Information Developer
IBM Business Analytics (SPSS)
E-mail: [hidden email]
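One quick way to check the UTF-8 vs. UTF-16 question is to look at the first few bytes of the file. A small sketch (the file name is hypothetical); note that the absence of a BOM proves nothing, since UTF-8 files are often written without one:

    # Guess the encoding family from the byte order mark, if there is one.
    import codecs

    def sniff_bom(filename):
        with open(filename, "rb") as f:
            head = f.read(4)
        if head.startswith(codecs.BOM_UTF8):      # '\xef\xbb\xbf'
            return "UTF-8 with BOM"
        if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            return "UTF-16"
        return "no BOM (could still be UTF-8, or a code page such as cp1252)"

    print(sniff_bom("data.csv"))   # hypothetical file name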
From: Rick Oliver <[hidden email]>
>To: Albert-Jan Roskam <[hidden email]>
>Cc: "[hidden email]" <[hidden email]>
>Sent: Wednesday, September 25, 2013 4:55 PM
>Subject: Re: [SPSSX-L] encoding issue with a utf-8 csv file
>
>I'm probably a few releases ahead of you (ENCODING was added to GET DATA in release 22) and you're right that it should read the file as UTF-8 in Unicode mode.

Ah, yes, I am using SPSS v22.0.0.2. It makes sense to add an /ENCODING subcommand to GET DATA too, just like many other commands have. It always feels a little drastic to use SET UNICODE=ON, as it requires that all non-empty datasets be closed. Still, SET UNICODE=ON followed by GET DATA does not cause the accented characters to be displayed properly. Using Python codecs.open(fn, encoding="utf-8") and then transcoding the values to cp1252 bytestrings does work. I am then using GET DATA to open the transcoded file.

>Are you sure the file is UTF-8, not UTF-16? (Support for UTF-16 was also added in 22.)

I also tried UTF-16, but nope. The one error that I got was probably a data entry error. Instead of à ('\xe0', backtick-SHIFT-a) they typed Ã ('\xc3', SHIFT-backtick-SHIFT-a), which can only be decoded to unicode using cp1252, not UTF-8.

Regards,
Albert-Jan
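To illustrate that last point: '\xc3' is only the lead byte of a two-byte UTF-8 sequence, so on its own it cannot be decoded as UTF-8, while cp1252 simply maps the single byte to Ã. A small sketch (Python 2 assumed; the second value is made up for illustration):

    good = 'Azi\xc3\xablaan'   # bytes from the thread: valid UTF-8 for u'Azi\xeblaan'
    bad = 'stra\xc3t'          # hypothetical: a lone '\xc3' with no continuation byte

    print(repr(good.decode("utf-8")))    # u'Azi\xeblaan' -- correct
    print(repr(bad.decode("cp1252")))    # u'stra\xc3t'   -- the stray byte becomes A-tilde
    try:
        bad.decode("utf-8")              # 't' (0x74) is not a UTF-8 continuation byte
    except UnicodeDecodeError as exc:
        print(exc)                       # "... invalid continuation byte"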