Reading UTF-8 encoded files in a mixed environment

Reading UTF-8 encoded files in a mixed environment

Eero Olli
Dear List,

Has anyone experience with using UTF-8 with SPSS v19?  I have tried to get help from the SPSS Norway office but without any luck.
For several years, I have been waiting for this feature, but now that it is here, I am not sure I can use it after all.
I am using SPSS v19 in a Windows environment.

The ability to read UTF-8 coded files is great, because I read some files from a SQL database that writes in UTF-8.
I have other software, like UltraEdit, that reads any encoding and saves the file with the same encoding unless you ask it to change the encoding. That is user-friendly behavior.  For now, I use UltraEdit to convert UTF-8 files to ASCII before feeding them to SPSS v19.

There are several things with UTF-8 and SPSS I have not been able to figure out.  To me it looks like
SET UNICODE ON
will change all read-write operations on SPSS syntax and data files to UTF-8.  This gives lots of trouble (all Scandinavian characters get wacky), as at present only a few files are UTF-8 encoded.  But it looks like EVERYTHING or NOTHING must be in UTF-8.

Questions:
1) How can I read just one UTF-8 encoded csv file with SPSS, without having the whole production environment in UTF-8?
2) Is it possible to have a mixed encoding environment for SPSS: some data files in UTF-8 and everything else in ASCII or Windows 1254 (Unicode)?
3) Has anyone tried to convert an existing SPSS production environment (data, production scripts, syntax, various outputs, etc.) to UTF-8?  Was it a success?

I would be happy to hear of any experiences, before I decide how to move further with this.

Sincerely,

Eero Olli

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Re: Reading UTF-8 encoded files in a mixed environment

Jon K Peck
See below.

Jon Peck (no "h")
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621




From:        Eero Olli <[hidden email]>
To:        [hidden email]
Date:        08/15/2011 06:55 AM
Subject:        [SPSSX-L] Reading UTF-8 encoded files in a mixed environment
Sent by:        "SPSSX(r) Discussion" <[hidden email]>




Dear List,

Has anyone experience with using UTF-8 with SPSS v19?  I have tried to get help from the SPSS Norway office but without any luck.
For several years, I have been waiting for this feature, but now that it is here, I am not sure I can use it after all.
I am using SPSS v19 in a Windows environment.

>>>SPSS has supported utf-8 since Version 16.

The ability to read UTF-8 coded files is great, because I read some files from a SQL database that writes in UTF-8.
I have other software, like UltraEdit, that reads any encoding and saves the file with the same encoding unless you ask it to change the encoding. That is user-friendly behavior.  For now, I use UltraEdit to convert UTF-8 files to ASCII before feeding them to SPSS v19.

There are several things with UTF-8 and SPSS I have not been able to figure out.  To me it looks like
SET UNICODE ON
will change all read-write operations on SPSS syntax and data files to UTF-8.  This gives lots of trouble (all Scandinavian characters get wacky), as at present only a few files are UTF-8 encoded.  But it looks like EVERYTHING or NOTHING must be in UTF-8.


>>>Statistics has to be either in Unicode mode or in code page mode.  These can't be mixed.  But data sources can be read in any encoding and will be converted to the appropriate internal representation (Unicode utf8 or code page) if possible.  If characters don't display correctly, that means that the input encoding specification was wrong.

In Unicode mode, any characters can be represented.  In code page mode, characters that cannot be represented in that mode will be replaced with question marks.
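
The two symptoms described above (garbled characters from a wrong input encoding, and question marks for characters a code page cannot hold) can be reproduced outside SPSS. The following is a minimal Python sketch, not SPSS syntax; the function names are made up for illustration, and Windows-1252 (cp1252) is an assumed code page that the thread does not name.

```python
# Illustration only: plain Python, not SPSS.
# cp1252 is an assumed code page; the helper names are hypothetical.

def misread_utf8_as_codepage(text: str, codepage: str = "cp1252") -> str:
    """Store text as UTF-8 bytes, then decode them with the wrong encoding."""
    return text.encode("utf-8").decode(codepage)


def squeeze_into_codepage(text: str, codepage: str = "cp1252") -> bytes:
    """Encode text into a code page; unrepresentable characters become '?'."""
    return text.encode(codepage, errors="replace")


# Wrong input encoding: each Scandinavian letter turns into two odd characters.
print(misread_utf8_as_codepage("æøå"))

# Greek omega does not exist in cp1252, so it is replaced by a question mark.
print(squeeze_into_codepage("Ωæ"))
```

The first call shows the "wacky" characters from the original question; the second shows the question-mark substitution that happens in code page mode.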

For GET DATA, SPSS looks for a byte order mark at the start of a file to see whether it is in utf8.  DATA LIST, however, allows you to specify an input encoding explicitly.  ODBC access automatically knows the encoding based on the source characteristics.
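
To make the byte order mark idea concrete, here is a small Python sketch of BOM-based detection. It shows the general technique only and is not a description of how GET DATA is implemented; the function names and the cp1252 fallback are assumptions for the example.

```python
import codecs

# Illustration only: plain Python, not SPSS. Helper names are hypothetical,
# and cp1252 is an assumed fallback code page.

def guess_encoding(raw: bytes, fallback: str = "cp1252") -> str:
    """Return 'utf-8-sig' if the data starts with the UTF-8 BOM, else the fallback."""
    return "utf-8-sig" if raw.startswith(codecs.BOM_UTF8) else fallback


def decode_text(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode raw bytes using the guessed encoding ('utf-8-sig' drops the BOM)."""
    return raw.decode(guess_encoding(raw, fallback))


with_bom = codecs.BOM_UTF8 + "Tromsø".encode("utf-8")
without_bom = "Tromsø".encode("utf-8")

print(decode_text(with_bom))     # BOM found: decoded as UTF-8
print(decode_text(without_bom))  # no BOM: decoded with the fallback, so ø is garbled
```

A UTF-8 file without a BOM is indistinguishable from a code page file under this rule, which is why an explicit encoding specification is needed in that case.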

Questions:
1) How can I read just one UTF-8 encoded csv file with SPSS, without having the whole production environment in UTF-8?


>>>I would recommend moving to an entirely Unicode environment, but you can use DATA LIST if the file does not have a BOM.

2) Is it possible to have a mixed encoding environment for SPSS: some data files in UTF-8 and everything else in ASCII or Windows 1254 (Unicode)?


>>>As above, internally everything must be consistent, but you can specify the source data encoding.  Windows 1254 is not Unicode.  1254 is the Turkish code page.

3) Has anyone tried to convert an existing SPSS production environment (data, production scripts, syntax, various outputs, etc.) to UTF-8?  Was it a success?


>>>I normally use Unicode mode for everything (except code page testing).  It solves many problems.
There are three caveats.  First, if your data contains extended characters and you save it in Unicode mode, SPSS versions prior to 15 will not be able to interpret and display the text properly.  Starting with V15, the character encoding is marked in the file, so SPSS automatically knows how to read the text properly.

Second, when converting code page texts to Unicode, string field widths are tripled in order to guarantee that no text will be lost.  This is almost always overly pessimistic.  You can use the ALTER TYPE command to optimize the field width to hold the characters actually present.  If there are no extended characters, the resulting field width will be the same as if you were in code page mode.
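
The arithmetic behind the tripling can be sketched in Python. This is an illustration rather than SPSS code: it assumes a single-byte code page, where each character needs at most three bytes in UTF-8, and the helper names are invented for the example.

```python
# Illustration only: plain Python, not SPSS. Helper names are hypothetical.
# Assumes a single-byte code page, so one character never needs more than
# three bytes once it is stored as UTF-8.

def utf8_width(value: str) -> int:
    """Bytes needed to store value in UTF-8, ignoring trailing blanks."""
    return len(value.rstrip(" ").encode("utf-8"))


def widths(values, codepage_width: int):
    """Return (worst-case tripled width, width actually needed) in bytes."""
    worst_case = 3 * codepage_width
    needed = max((utf8_width(v) for v in values), default=0)
    return worst_case, needed


# A 4-byte code page field: tripled to 12, but only 5 bytes are really used
# ("Ærø" has two 2-byte letters and one 1-byte letter).
print(widths(["Ærø", "Oslo"], codepage_width=4))

# Without extended characters the needed width equals the code page width.
print(widths(["abc", "Oslo"], codepage_width=4))
```

The second value in each pair corresponds to the width that remains after the field has been shrunk to fit the characters actually present.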

Third, the old character functions such as substr are byte oriented.  Since utf-8 characters have a different byte layout, the position and length indicators will not generally be appropriate.  The newer functions such as char.substr are character oriented and work the same way in either code page or Unicode mode.  These functions, where relevant, automatically trim trailing blanks from strings.
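
The difference between byte-oriented and character-oriented positions can be seen in a short Python sketch. The two helpers below are hypothetical stand-ins written for this example, not the SPSS functions themselves; they only mimic 1-based position and length arguments.

```python
# Illustration only: plain Python, not SPSS. Both helpers are hypothetical
# stand-ins that use 1-based positions, counted in bytes or in characters.

def byte_substr(value: str, pos: int, length: int) -> bytes:
    """Substring counted in UTF-8 bytes."""
    raw = value.encode("utf-8")
    return raw[pos - 1 : pos - 1 + length]


def char_substr(value: str, pos: int, length: int) -> str:
    """Substring counted in characters."""
    return value[pos - 1 : pos - 1 + length]


name = "Ærøskøbing"

# Three characters, as intended.
print(char_substr(name, 1, 3))

# Three bytes cover only two characters, because Æ occupies two bytes.
print(byte_substr(name, 1, 3).decode("utf-8"))

# Four bytes cut ø in half, leaving an invalid trailing byte.
print(byte_substr(name, 1, 4).decode("utf-8", errors="replace"))
```

With plain ASCII text both helpers agree, which is why byte-oriented code often appears to work until the first extended character shows up.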

HTH,
Jon Peck

I would be happy to hear of any experiences, before I decide how to move further with this.

Sincerely,

Eero Olli
