encoding issue with a utf-8 csv file

encoding issue with a utf-8 csv file

Albert-Jan Roskam
Hi,

I have a csv file that is in utf-8. I switched to unicode mode (set unicode = on) and read the file in (get data).
However, accented characters are still not displayed properly. Does SPSS expect a BOM even though switching
to unicode mode is already the same as saying "assume utf-8 from now on, irrespective of locale settings"?
I am now using Python codecs.open to read the file in and do value.decode("utf-8").encode("cp1252") on every value (I prefer 1252).
That works for all values except for one, which generates a UnicodeDecodeError. It's just one error, but I still find it
kind of suspicious.

An example of a wrong value when I use unicode=on is: "Aziëlaan" ('Azi\xc3\xablaan')
If I use Python to decode that to a unicode string assuming utf-8, i.e. value.decode("utf-8"), the result looks correct,
and naturally I can encode that to a cp1252 bytestring.
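
For reference, a minimal sketch of the round trip described above. It is written in Python 3 syntax, which is an assumption on my part: the post uses Python 2 idioms, where value.decode() is called on a byte string. The variable names are only illustrative.

```python
# The raw bytes of the example value, as stored in the UTF-8 file.
raw = b"Azi\xc3\xablaan"

# Decoding with the correct codec gives the intended text.
text = raw.decode("utf-8")          # 'Aziëlaan'

# Re-encoding to Windows-1252 turns the two-byte sequence into one byte.
as_cp1252 = text.encode("cp1252")   # b'Azi\xeblaan'

# Decoding the same bytes with the wrong codec gives the typical garbled form.
garbled = raw.decode("cp1252")      # 'AziÃ«laan'
```

If a value shows up as the garbled form, that usually suggests UTF-8 bytes were interpreted as a single-byte code page somewhere along the way.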

What am I missing here?

Regards,
Albert-Jan


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a
fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: encoding issue with a utf-8 csv file

Rick Oliver-3
If you explicitly specify that it's UTF-8, then the file will be read as Unicode, regardless of whether or not the file contains a BOM. Simply running in Unicode may not be sufficient.
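
To check whether a given file actually starts with a BOM, something like the following Python sketch may help. The helper name has_utf8_bom and the sample bytes are made up for illustration; codecs.BOM_UTF8 and the "utf-8-sig" codec are part of the standard library.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)

# Hypothetical file contents, with and without a BOM.
without_bom = "straat;Aziëlaan\n".encode("utf-8")
with_bom = codecs.BOM_UTF8 + without_bom

# 'utf-8-sig' drops a leading BOM if there is one and otherwise
# behaves like plain 'utf-8', so it is safe for both variants.
clean_a = with_bom.decode("utf-8-sig")
clean_b = without_bom.decode("utf-8-sig")

# Plain 'utf-8' keeps the BOM as an invisible U+FEFF character.
kept = with_bom.decode("utf-8")
```

In practice the first few bytes of the file (read in binary mode) would be passed to has_utf8_bom.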

Rick Oliver
Senior Information Developer
IBM Business Analytics (SPSS)
E-mail: [hidden email]




From:        Albert-Jan Roskam <[hidden email]>
To:        [hidden email],
Date:        09/25/2013 08:23 AM
Subject:        encoding issue with a utf-8 csv file
Sent by:        "SPSSX(r) Discussion" <[hidden email]>







Re: encoding issue with a utf-8 csv file

Albert-Jan Roskam
From: Rick Oliver <[hidden email]>
>To: [hidden email]
>Sent: Wednesday, September 25, 2013 4:24 PM
>Subject: Re: [SPSSX-L] encoding issue with a utf-8 csv file
>
>
>
>If you explicitly specify that it's UTF-8, then the file will be read as Unicode, regardless of whether or not the file contains a BOM. Simply running in Unicode may not be sufficient.

Hi Rick,

Yes, I also thought it would be strange if a BOM was needed. This is from the CSR, under GET DATA:

TXT. Simple (ASCII) text data files. The encoding used to read text data files is UTF-8 in Unicode mode or the code page determined by the current locale in code page mode.

That seems to suggest (to me) that set unicode=on should do the trick. But it doesn't. I could have sworn that get data had an /encoding subcommand, but I am probably confusing it with SAVE TRANSLATE.

regards,
Albert-Jan


Re: encoding issue with a utf-8 csv file

Rick Oliver-3
I'm probably a few releases ahead of you (ENCODING was added to GET DATA in release 22) and you're right that it should read the file as UTF-8 in Unicode mode.

Are you sure the file is UTF-8, not UTF-16? (Support for UTF-16 was also added in 22.)
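
One rough way to answer that question outside SPSS is to look at the first bytes of the file. The sketch below is only a heuristic, and guess_text_encoding is a hypothetical helper name: it relies on a BOM to recognise UTF-16, so BOM-less UTF-16 is not detected, and UTF-32 is not considered at all.

```python
import codecs

def guess_text_encoding(data: bytes) -> str:
    """Guess 'utf-8-sig', 'utf-16', 'utf-8' or 'unknown' for a byte string."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        # Strict decoding fails on any byte sequence that is not valid UTF-8.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown"
    return "utf-8"

# Hypothetical samples of the same value in different encodings.
sample_utf8 = "Aziëlaan".encode("utf-8")
sample_utf16 = "Aziëlaan".encode("utf-16")    # Python prepends a BOM here
sample_cp1252 = "Aziëlaan".encode("cp1252")

guesses = [guess_text_encoding(s) for s in (sample_utf8, sample_utf16, sample_cp1252)]
```

A result of "unknown" only says the bytes are not valid UTF-8; it does not identify the code page.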

Rick Oliver
Senior Information Developer
IBM Business Analytics (SPSS)
E-mail: [hidden email]




From:        Albert-Jan Roskam <[hidden email]>
To:        Rick Oliver/Chicago/IBM@IBMUS, "[hidden email]" <[hidden email]>,
Date:        09/25/2013 09:47 AM
Subject:        Re: [SPSSX-L] encoding issue with a utf-8 csv file





Re: encoding issue with a utf-8 csv file

Albert-Jan Roskam
From: Rick Oliver <[hidden email]>
>To: Albert-Jan Roskam <[hidden email]>
>Cc: "[hidden email]" <[hidden email]>
>Sent: Wednesday, September 25, 2013 4:55 PM
>Subject: Re: [SPSSX-L] encoding issue with a utf-8 csv file
>
>
>I'm probably a few releases ahead of you (ENCODING was added to GET DATA in release 22) and you're right that it should read the file as UTF-8 in Unicode mode.

Ah, yes, I am using SPSS v22.0.0.2. It makes sense to add an /ENCODING subcommand to GET DATA too, as many other commands already have one. It always feels a little drastic to use SET UNICODE=ON, as it requires that all nonempty datasets be closed. Still, SET UNICODE=ON followed by GET DATA does not cause the accented characters to be displayed properly. Reading the file with Python's codecs.open(fn, encoding="utf-8") and then transcoding it to cp1252 bytestrings does work. I then use GET DATA to open the transcoded file.
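
The transcoding workaround could look roughly like this in Python 3 (an assumption; the thread itself uses Python 2 and codecs.open). The function name transcode_file, the file names, and the sample contents are all made up for illustration.

```python
import os
import tempfile

def transcode_file(src_path, dst_path, src_encoding="utf-8", dst_encoding="cp1252"):
    """Copy a text file, re-encoding it from src_encoding to dst_encoding."""
    # newline="" leaves the original line endings untouched.
    # Both codecs run in strict mode, so bad input raises an exception.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:
            dst.write(line)

# Small self-check in a temporary directory with hypothetical data.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in_utf8.csv")
    dst = os.path.join(tmp, "out_cp1252.csv")
    with open(src, "wb") as f:
        f.write("id;straat\r\n1;Aziëlaan\r\n".encode("utf-8"))
    transcode_file(src, dst)
    with open(dst, "rb") as f:
        converted = f.read()
```

Because decoding is strict, a single invalid byte in the input stops the conversion with a UnicodeDecodeError, which matches the one failing value mentioned earlier in the thread.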

>Are you sure the file is UTF-8, not UTF-16? (Support for UTF-16 was also added in 22.)

I also tried utf-16, but nope. The one error that I got was probably a data entry error. Instead of à ('\xe0', backtick-SHIFT-a) they typed Ã ('\xc3', SHIFT-backtick-SHIFT-a), which can only be decoded to unicode using cp1252, not utf-8.
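
A sketch of how such a stray byte might be located, again in Python 3 syntax. The helper find_undecodable and the two sample rows are hypothetical; the second row imitates a lone 0xC3 byte that is not followed by a valid continuation byte. Only the first problem on each line is reported.

```python
def find_undecodable(lines, encoding="utf-8"):
    """Return (line_number, byte_offset, offending_byte) for lines that fail to decode."""
    problems = []
    for lineno, raw in enumerate(lines, start=1):
        try:
            raw.decode(encoding)
        except UnicodeDecodeError as err:
            # err.start is the offset of the first byte that could not be decoded.
            problems.append((lineno, err.start, raw[err.start:err.start + 1]))
    return problems

rows = [
    b"1;Azi\xc3\xablaan\n",   # valid UTF-8: 0xC3 0xAB is the letter e with diaeresis
    b"2;voil\xc3 \n",         # lone 0xC3 followed by a space: not valid UTF-8
]
problems = find_undecodable(rows)

# The same lone byte is a perfectly valid character in cp1252.
lone_byte_as_cp1252 = b"\xc3".decode("cp1252")
```

Here the rows would normally come from iterating over the file opened in binary mode.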
