SPSSX Discussion

UNICODE or locale's writing system?


Robert L

UNICODE or locale's writing system?

I guess the problems will persist as long as people have old data files and where there will be a mismatch between these files and whatever language setting the system will be running. For the future, could anything be said about the best language set up in an environment where people want to use (Swedish) diacritical letters for variable names and labels, as well as in syntax files? Should it be UNICODE OFF as default? Any recommendations and/or experiences are welcome.

Robert

*´*´*´*´*´*´*´

Robert Lundqvist

Norrbotten county council

Sweden

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Robert Lundqvist

Jon K Peck

Re: UNICODE or locale's writing system?

I would strongly suggest that you use UNICODE ON. Swedish diacritics are in the standard western code page (1252), which is probably what your Windows setting is, so old sav files that use that encoding should open without any data loss. You might want to use ALTER TYPE to reduce string variable sizes if the file has to be converted, and for sure don't keep resaving in code page and reopening as then string expansion will happen again when you reopen the file.

If you have sav files or other inputs that are not in code page 1252 but are in some other single code page, you can specify the encoding when you do the import, or you can set the Statistics locale appropriately. (You can now set the Statistics locale from Edit > Options > Language as well as in syntax.)

You will also see in 22 or 23 that the Python Essentials installation is fully integrated into the main Statistics install, and, in V23, lots of extension commands are installed with it.

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621

From: Robert Lundqvist <[hidden email]>
To: [hidden email]
Date: 04/13/2015 07:57 AM
Subject: [SPSSX-L] UNICODE or locale's writing system?
Sent by: "SPSSX(r) Discussion" <[hidden email]>

We are about to upgrade to SPSS ver 23 (running under a network license), which could be the right time to find the best settings. Today there are problems from time to time with the language settings, and I haven’t found out what would be the best setup. Today, the default is UNICODE ON which works for most users. However, I have personally set UNICODE OFF, since this has meant the least problems with Swedish (diacritical) letters in the syntax files. Still, there are problems when I open other users’ files: diacritical letters are replaced by question marks or other strange symbols.


Albert-Jan Roskam-2

Re: UNICODE or locale's writing system?

In reply to this post by Robert L

Hi,

We recently changed to v22 and chose not to follow the new Unicode default. String widths in existing .sav files (created in code page mode) will triple unless the user switches back to code page mode. Fixed-width files and UTF-8 don't go well together (with multibyte chars). AFAIK, syntax files created in Unicode mode start with a BOM, so SPSS knows the file is UTF-8. Without a BOM, you may see two or three weird chars (mojibake) where there is supposed to be one accented letter. BTW, third-party software does not always know how to deal with a BOM. AFAIK all northern European chars can be represented in Windows-1252.
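A minimal Python sketch of the BOM point, assuming a UTF-8 syntax file and a Windows-1252 fallback. The helper name `decode_syntax_bytes` and the sample label are made up for illustration; this is not how SPSS itself reads files.

```python
import codecs


def decode_syntax_bytes(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode file bytes: honour a UTF-8 BOM, otherwise use the fallback code page."""
    if raw.startswith(codecs.BOM_UTF8):
        # "utf-8-sig" decodes UTF-8 and drops the leading BOM.
        return raw.decode("utf-8-sig")
    return raw.decode(fallback)


label = "Kön och ålder"  # hypothetical variable label

with_bom = codecs.BOM_UTF8 + label.encode("utf-8")
without_bom = label.encode("utf-8")

intact = decode_syntax_bytes(with_bom)      # BOM present: text comes back unchanged
garbled = decode_syntax_bytes(without_bom)  # no BOM: UTF-8 bytes read as cp1252

# Each accented letter was two UTF-8 bytes, so it shows up as two cp1252 characters.
extra_chars = len(garbled) - len(intact)
```

Here `garbled` comes out as "KÃ¶n och Ã¥lder", the typical two-characters-per-letter mojibake.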

I can send you the 'reg add' commands to change the unicode default to codepage, if you like.

Albert-Jan

From: Robert Lundqvist <[hidden email]>;
To: <[hidden email]>;
Subject: [SPSSX-L] UNICODE or locale's writing system?
Sent: Mon, Apr 13, 2015 1:56:48 PM


Robert L

SV: [SPSSX-L] UNICODE or locale's writing system?

In reply to this post by Jon K Peck

Robert

From: Jon K Peck [mailto:[hidden email]]
Sent: 13 April 2015 16:09
To: Robert Lundqvist
Cc: [hidden email]
Subject: Re: [SPSSX-L] UNICODE or locale's writing system?

Robert Lundqvist

Jon K Peck

Re: SV: [SPSSX-L] UNICODE or locale's writing system?

You can control your mode setting via Edit > Options > Language, and the setting will be remembered from session to session. There is no need to do this by direct manipulation of the Registry, although if installing on many machines, the Registry script could automate the default setting.

You should be consistent throughout the organization in the default mode in order to avoid constantly converting sav files from code page to Unicode and back. Generally, after converting a code page file to Unicode, it may be useful to recover the unnecessary expansion via ALTER TYPE or "Minimize string widths based on observed values" if the file contains a lot of string data that is mostly 7-bit ASCII text.
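To make the expansion concrete, here is a rough Python sketch that compares a tripled width with the width the observed values actually need in UTF-8. The sample values and function names are invented for illustration, and the sketch only mirrors the idea of minimizing widths from observed values, not SPSS's own implementation. Tripling is consistent with a worst-case allocation, since a single code page 1252 byte can need up to three bytes in UTF-8.

```python
def utf8_width(value: str) -> int:
    """Bytes needed to store the value in UTF-8, ignoring trailing blanks."""
    return len(value.rstrip(" ").encode("utf-8"))


def minimal_width(values) -> int:
    """Smallest byte width that holds every observed value (at least 1)."""
    return max((utf8_width(v) for v in values), default=0) or 1


# Hypothetical string values as they might appear in a code page file.
municipalities = ["Luleå", "Piteå", "Älvsbyn", "Boden"]

codepage_width = max(len(v.encode("cp1252")) for v in municipalities)  # one byte per letter
tripled_width = 3 * codepage_width             # worst-case width after conversion
needed_width = minimal_width(municipalities)   # width the data really needs in UTF-8
```

For this sample the code page width is 7, the tripled width is 21, and the width actually needed is 8, because only "Ä" in "Älvsbyn" takes a second byte.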

If data are read and show question marks or mojibake, you should fix that before moving on. Those errors could happen if the input file is not actually encoded in the expected way, so whether you are reading it in code page or Unicode mode, you should resolve the issue before saving the data. Of course, if the input characters span code pages, e.g., Swedish and Russian, it can only be processed in Unicode mode. SPSS data files created by version 15 or later have the character encoding marked in the file. I have seen a few instances, however, of sav files created by third parties that were incorrectly marked or not marked at all.
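Both checks described above can be sketched in a few lines of Python. The function names are hypothetical, and a successful UTF-8 decode is only a hint about the real encoding, not proof.

```python
def fits_code_page(text: str, code_page: str = "cp1252") -> bool:
    """True if every character in text can be stored in the given code page."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


def looks_like_utf8(raw: bytes) -> bool:
    """True if the bytes form valid UTF-8 (a hint about the encoding, not proof)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


swedish_only = fits_code_page("Västerbotten")                 # fits in cp1252
swedish_russian = fits_code_page("Västerbotten och Москва")   # Cyrillic does not fit
legacy_bytes_ok = looks_like_utf8("ålder".encode("cp1252"))   # cp1252 bytes are not valid UTF-8 here
```

Text that mixes Swedish and Russian fails the single code page test, which is the case where only Unicode mode works.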

In either mode, best practice is to abandon the old string functions that are byte oriented and use the new CHAR.* functions, which are character oriented. That syntax will then work in either mode.
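The difference between byte-oriented and character-oriented string handling can be shown with a Python analogy. This is not SPSS syntax, and the sample value is invented; it only illustrates why counting or cutting by bytes goes wrong once a character takes more than one byte.

```python
name = "Jämtland"          # hypothetical string value
raw = name.encode("utf-8")

char_length = len(name)    # counts characters
byte_length = len(raw)     # counts bytes; "ä" takes two bytes in UTF-8

first_two_chars = name[:2]  # character-oriented cut keeps "ä" whole
first_two_bytes = raw[:2]   # byte-oriented cut stops in the middle of "ä"

# Decoding the byte-oriented cut shows the damage: the half character
# becomes U+FFFD, the Unicode replacement character.
damaged = first_two_bytes.decode("utf-8", errors="replace")
```

The character count is 8 while the byte count is 9, and the byte-oriented cut ends with a replacement character instead of "ä".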

By the way, if you have large sav files, check out the SPSS Statistics Compressed (zsav) format. It often dramatically reduces the size of the file, but I suspect that most users are not aware of it. Working recently with a rather large file, I saw that the sav size was 854MB while the zsav file was 275MB, a ratio of 32%. zsav was introduced in version 21. It is the second choice in File > Save As for data, or /ZCOMPRESSED in SAVE syntax. The GET command automatically recognizes the compression type. There is some extra overhead in compression/decompression, but this is often more than compensated for by the reduced amount of I/O activity.

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621

From: Robert Lundqvist <[hidden email]>
To: [hidden email]
Date: 04/14/2015 07:27 AM
Subject: [SPSSX-L] SV: [SPSSX-L] UNICODE or locale's writing system?
Sent by: "SPSSX(r) Discussion" <[hidden email]>

Seems as if there are two answers here, Jon is suggesting UNICODE ON and Albert-Jan the opposite. I will discuss this further with our sys admins. However, whatever the solution, I guess there will be instances in the future where there is a mismatch between old files and the new settings. If it happens, if there are strange symbols (mojibake is it?) in either the syntax files or the data files, is there any easy way to convert the files into a better format? Would that be the “reg add” commands you are suggesting, Albert-Jan?

Robert


Albert-Jan Roskam-2

Re: SV: [SPSSX-L] UNICODE or locale's writing system?

Hi Robert,

These are the 'reg add' commands for SPSS v22, with a Dutch locale. You could probably omit the locale setting; I just wanted this to be guaranteed to be the same on every computer (useful e.g. when reading .csv files). As Jon said, you can accomplish the same thing via the GUI, but these commands are handy when you are configuring many machines.

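:: Note that capital letters are escaped with a slash. The Java Preferences API
:: does this because Windows registry key names are case-insensitive.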
set spss_base_path=HKEY_LOCAL_MACHINE\Software\JavaSoft\Prefs\com\ibm\/S/P/S/S
set spssKey=%spss_base_path%\/Statistics\22.0
set keyname=%spssKey%\core\set
reg add "%keyname%" /v "/Unicode" /t REG_SZ /d "/No" /f
reg add "%keyname%" /v "/Locale" /t REG_SZ /d "nl_/N/L.cp1252" /f

set keyname=%spssKey%\ui\dialog_settings\welcome_page
reg add "%keyname%" /v "show_welcome_dialog" /t REG_SZ /d "0" /f
reg add "%keyname%" /v "shown_unicode_warning" /t REG_SZ /d "1" /f
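
The slash-before-capitals pattern in these key names can be reproduced with a small Python helper, which may be handy when adapting the script to another version or locale. The function name `escape_capitals` is made up, it covers only the uppercase rule visible in the commands above, and the Swedish locale string in the example is an assumption, not something taken from this thread.

```python
def escape_capitals(name: str) -> str:
    """Prefix each ASCII capital letter with '/', as in the key names above."""
    return "".join("/" + ch if "A" <= ch <= "Z" else ch for ch in name)


vendor_key = escape_capitals("SPSS")            # as in ...\ibm\/S/P/S/S
product_key = escape_capitals("Statistics")     # as in \/Statistics\22.0
dutch_locale = escape_capitals("nl_NL.cp1252")  # the value used in the script

# Assumed Swedish equivalent; check the locale name before using it.
swedish_locale = escape_capitals("sv_SE.cp1252")
```

The first three results match the strings in the script: "/S/P/S/S", "/Statistics" and "nl_/N/L.cp1252".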

You can also modify the registry to use an external Python interpreter, but I later discovered that it's also possible (and easier, because, after all, we all loathe the Windows registry, don't we?) to set the PYTHONHOME environment variable. Be sure to either add the original site-packages (with the spss/python api files) to PYTHONPATH, or to copy those files to the external site-packages dir (we created an 'spss22' dir there and use .pth files to refer to spssaux, spss, SpssClient etc.)
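
A quick way to check whether the external interpreter can actually find the SPSS modules after setting PYTHONHOME, PYTHONPATH or the .pth files is to ask Python where it would load them from. This is only a sketch; the helper name `missing_modules` is made up, and the module list simply repeats the names mentioned above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names as mentioned above; run this with the external interpreter.
required = ["spss", "spssaux", "SpssClient"]
not_found = missing_modules(required)
print("Missing:", not_found if not_found else "none")
```

An empty list means all three names resolve on the current search path; it does not prove that the modules load correctly against the installed Statistics version.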

Regards,

Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--------------------------------------------
On Tue, 4/14/15, Jon K Peck <[hidden email]> wrote:

Subject: Re: [SPSSX-L] SV: [SPSSX-L] UNICODE or locale's writing system?
To: [hidden email]
Date: Tuesday, April 14, 2015, 3:55 PM
