UNICODE or locale's writing system?


Robert L

We are about to upgrade to SPSS version 23 (running under a network license), which seems like the right time to settle on the best settings. We currently run into occasional problems with the language settings, and I haven’t worked out what the best setup would be. The current default is UNICODE ON, which works for most users. However, I have personally set UNICODE OFF, since this has caused the fewest problems with Swedish (diacritical) letters in syntax files. Still, there are problems when I open other users’ files: diacritical letters are replaced by question marks or other strange symbols.

I guess the problems will persist as long as people have old data files that do not match whatever language setting the system is running. Looking ahead, can anything be said about the best language setup in an environment where people want to use (Swedish) diacritical letters in variable names and labels, as well as in syntax files? Should UNICODE OFF be the default? Any recommendations and/or experiences are welcome.

 

Robert

*´*´*´*´*´*´*´

Robert Lundqvist

Norrbotten county council

Sweden

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Re: UNICODE or locale's writing system?

Jon K Peck
I would strongly suggest that you use UNICODE ON.  Swedish diacritics are in the standard western code page (1252), which is probably what your Windows setting is, so old sav files that use that encoding should open without any data loss.  You might want to use ALTER TYPE to reduce string variable widths if the file has to be converted.  And for sure don't keep resaving in code page mode and reopening in Unicode mode, because the string expansion will then happen again each time you reopen the file.

If you have sav files or other inputs that are not in code page 1252 but are in some other single code page, you can specify the encoding when you do the import, or you can set the Statistics locale appropriately.  (You can now set the Statistics locale from Edit > Options > Language as well as in syntax.)

You will also see in 22 or 23 that the Python Essentials installation is fully integrated into the main Statistics install, and, in V23, lots of extension commands are installed with it.




Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621





Re: UNICODE or locale's writing system?

Albert-Jan Roskam-2
In reply to this post by Robert L
Hi,

We recently changed to v22 and chose not to follow the new Unicode default. The declared widths of string variables in existing .sav files (created in code page mode) triple when the files are opened in Unicode mode, unless the user switches back to code page mode. Fixed-width files and UTF-8 don't go well together (because of multibyte characters). Afaik, syntax files created in Unicode mode start with a BOM, so SPSS knows they are UTF-8. Without a BOM, you may see 2 or 3 weird characters (mojibake) where there is supposed to be one accented letter. Btw, third-party software does not always know how to deal with a BOM. Afaik all northern European characters can be represented in Windows-1252.
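
For anyone who wants to see the effect outside SPSS, here is a minimal Python sketch of the mojibake and BOM behaviour described above. The sample label and the helper name `has_utf8_bom` are made up for illustration; nothing here is SPSS-specific, it only shows what happens when the same bytes are read with the wrong encoding.

```python
import codecs

text = "Värde för län"  # hypothetical Swedish label, not from a real file

utf8_bytes = text.encode("utf-8")
cp1252_bytes = text.encode("cp1252")

# UTF-8 bytes misread as Windows-1252: each accented letter turns into two characters.
mojibake = utf8_bytes.decode("cp1252")
print(mojibake)  # VÃ¤rde fÃ¶r lÃ¤n

# Windows-1252 bytes misread as UTF-8: the invalid bytes become U+FFFD replacement characters.
garbled = cp1252_bytes.decode("utf-8", errors="replace")

# A UTF-8 BOM is three marker bytes at the start of the file.
with_bom = codecs.BOM_UTF8 + utf8_bytes


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 BOM."""
    return data.startswith(codecs.BOM_UTF8)


# The "utf-8-sig" codec strips the BOM while decoding.
print(with_bom.decode("utf-8-sig") == text)  # True
```

Software that is not BOM-aware would decode `with_bom` as plain UTF-8 and end up with an extra invisible character at the start of the text, which is probably the third-party problem mentioned above.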

I can send you the 'reg add' commands to change the unicode default to codepage, if you like.

Albert-Jan



SV: [SPSSX-L] UNICODE or locale's writing system?

Robert L
In reply to this post by Jon K Peck

It seems there are two answers here: Jon is suggesting UNICODE ON and Albert-Jan the opposite. I will discuss this further with our sys admins. However, whatever the solution, I guess there will be instances in the future where there is a mismatch between old files and the new settings. If that happens and strange symbols (mojibake, is it?) show up in either the syntax files or the data files, is there any easy way to convert the files into a better format? Would that be the “reg add” commands you are suggesting, Albert-Jan?

 

Robert

 


Re: SV: [SPSSX-L] UNICODE or locale's writing system?

Jon K Peck
You can control your mode setting via Edit > Options > Language, and the setting will be remembered from session to session.  There is no need to do this by direct manipulation of the Registry, although if installing on many machines, the Registry script could automate the default setting.

You should be consistent throughout the organization in the default mode in order to avoid constantly converting sav files from code page to Unicode and back.  And generally after converting a code page file to Unicode, recovering unnecessary expansion via ALTER TYPE or "Minimize string widths based on observed values" may be useful if the file contains a lot of string data but is mostly 7-bit ascii text.
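
To make the expansion concrete, here is a rough Python sketch of the arithmetic. It assumes (as reported elsewhere in this thread) that declared string widths are tripled on conversion, and it treats widths in Unicode mode as UTF-8 byte counts. The city names, the width of 10, and the helper names are made up for illustration; this is not how Statistics itself computes anything.

```python
def utf8_width(value: str) -> int:
    """Bytes needed to store the value as UTF-8, ignoring trailing blanks."""
    return len(value.rstrip().encode("utf-8"))


def minimal_width(values) -> int:
    """Smallest byte width that still holds every observed value."""
    return max((utf8_width(v) for v in values), default=0)


cities = ["Stockholm", "Göteborg", "Malmö", "Luleå"]  # hypothetical data

codepage_width = 10                 # assumed declared width in code page mode
unicode_width = 3 * codepage_width  # assumed width after conversion
needed = minimal_width(cities)

print(unicode_width, needed)  # 30 9
```

The worst case does occur (for example, the euro sign is one byte in code page 1252 but three bytes in UTF-8), but mostly-ASCII text with a few Swedish letters needs far less than the tripled width, which is why shrinking the strings afterwards pays off.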

If data are read and show question marks or mojibake, you should fix that before moving on.  Those errors could happen if the input file is not actually encoded in the expected way, so whether you are reading it in code page or Unicode mode, you should resolve the issue before saving the data.  Of course, if the input characters span code pages, e.g., Swedish and Russian, it can only be processed in Unicode mode.  SPSS data files created by version 15 or later have the character encoding marked in the file.  I have seen a few instances, however, of sav files created by third parties that were incorrectly marked or not marked at all.
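
For plain-text inputs such as syntax files, one way to check and repair the encoding before opening them is sketched below in Python. The function names are hypothetical, and the fallback to Windows-1252 is an assumption that fits the Swedish case discussed here; files in another code page would need a different fallback, and a few byte values are undefined in Python's cp1252 codec and would raise an error.

```python
import codecs
from typing import Tuple


def decode_guess(data: bytes, fallback: str = "cp1252") -> Tuple[str, str]:
    """Decode bytes as UTF-8 if they are valid UTF-8, else use the fallback code page.

    Returns the decoded text and the name of the codec that was used.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data.decode("utf-8-sig"), "utf-8-sig"
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the legacy code page.
        return data.decode(fallback), fallback


def to_utf8_with_bom(data: bytes) -> bytes:
    """Re-encode file contents as UTF-8 with a leading BOM."""
    text, _ = decode_guess(data)
    return codecs.BOM_UTF8 + text.encode("utf-8")


legacy = b"VARIABLE LABELS lan 'L\xe4n'."  # made-up syntax line in Windows-1252
print(decode_guess(legacy)[1])  # cp1252
```

The guess works because text containing accented letters in a single-byte code page is very rarely valid UTF-8 by accident, but it is a heuristic, so the result should still be inspected before the file is saved.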

In either mode, best practice is to abandon the old string functions that are byte oriented and use the new CHAR.* functions, which are character oriented.  That syntax will then work in either mode.
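
The difference between byte-oriented and character-oriented functions can be illustrated with a few lines of Python. This is only an analogy for what functions such as CHAR.SUBSTR protect against; the sample name is made up.

```python
name = "Björn"                   # hypothetical string value
data = name.encode("utf-8")      # how the value is stored in Unicode mode

print(len(name), len(data))      # 5 6

# Character-oriented: take the first three characters.
first3_chars = name[:3]          # "Bjö"

# Byte-oriented: the first three bytes cut the two-byte "ö" in half.
first3_bytes = data[:3]          # b"Bj\xc3"
broken = first3_bytes.decode("utf-8", errors="replace")  # "Bj" plus U+FFFD

# In a single-byte code page, bytes and characters coincide,
# so the byte-oriented approach happens to work there.
cp_ok = name.encode("cp1252")[:3].decode("cp1252")       # "Bjö"
```

Syntax written with byte-oriented functions can therefore appear correct in code page mode and silently break in Unicode mode, whereas character-oriented syntax gives the same result in both.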

By the way, if you have large sav files, check out the SPSS Statistics Compressed (zsav) format.  It often dramatically reduces the size of the file, but I suspect that most users are not aware of it.  Working recently with a rather large file, I saw that the sav size was 854MB while the zsav file was 275MB -  a ratio of 32%.  zsav was introduced in version 18.  It is the second choice in File > Save As for data or /ZCOMPRESSED in SAVE syntax.  The GET command automatically recognizes the compression type.  There is some extra overhead in compression/decompression, but this is often more than compensated for by the reduced amount of i/o activity.


Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621




From:        Robert Lundqvist <[hidden email]>
To:        [hidden email]
Date:        04/14/2015 07:27 AM
Subject:        [SPSSX-L] SV: [SPSSX-L] UNICODE or locale's writing system?
Sent by:        "SPSSX(r) Discussion" <[hidden email]>




Seems as if there are two answers here, Jon is suggesting UNICODE ON and Albert-Jan the opposite. I will discuss this further with our sys admins. However, whatever the solution, I guess there will be instances in the future where there is a mismatch between old files and the new settings. If it happens, if there are strange symbols (mojibake is it?) in either the syntax files or the data files, is there any easy way to convert the files into a better format? Would that be the “reg add” commands you are suggesting, Albert-Jan?
 
Robert
 
Från: Jon K Peck [mailto:peck@...]
Skickat:
den 13 april 2015 16:09
Till:
Robert Lundqvist
Kopia:
[hidden email]
Ämne:
Re: [SPSSX-L] UNICODE or locale's writing system?

 
I would strongly suggest that you use UNICODE ON.  Swedish diacritics are in the standard western code page (1252), which is probably what your Windows setting is, so old sav files that use that encoding should open without any data loss.  You might want to use ALTER TYPE to reduce string variable sizes if the file has to be converted, and for sure don't keep resaving in code page and reopening as then string expansion will happen again when you reopen the file.

If you have sav files or other inputs that are not in code page 1252 but are in some other single code page, you can specify the encoding when  you do the import, or you can set the Statistics locale appropriately.  (You can now set the Statistics locale from Edit > Options > Language as well as in syntax.)


You will also see in 22 or 23 that the Python Essentials installation is fully integrated into the main Statistics install, and, in V23, lots of extension commands are installed with it.





Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM

peck@...
phone: 720-342-5621





From:        
Robert Lundqvist <Robert.Lundqvist@...>
To:        
[hidden email]
Date:        
04/13/2015 07:57 AM
Subject:        
[SPSSX-L] UNICODE or locale's writing system?
Sent by:        
"SPSSX(r) Discussion" <[hidden email]>





We are about to upgrade to SPSS ver 23 (running under a network license), which could be the right time to find the best settings. Today there are problems from time to time with the language settings, and I haven’t found out what would be the best setup. Today, the default is UNICODE ON which works for most users. However, I have personally set UNICODE OFF, since this has meant the least problems with Swedish (diacritical) letters in the syntax files. Still, there are problems when I open other users’ files: diacritical letters are replaced by question marks or other strange symbols.

 
I guess the problems will persist as long as people have old data files and where there will be a mismatch between these files and whatever language setting the system will be running. For the future, could anything be said about the best language set up in an environment where people want to use (Swedish) diacritical letters for variable names and labels, as well as in syntax files? Should it be UNICODE OFF as default? Any recommendations and/or experiences are welcome.

 
Robert

*´*´*´*´*´*´*´

Robert Lundqvist

Norrbotten county council

Sweden

===================== To manage your subscription to SPSSX-L, send a message to LISTSERV@... (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

===================== To manage your subscription to SPSSX-L, send a message to LISTSERV@... (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Reply | Threaded
Open this post in threaded view
|

Re: SV: [SPSSX-L] UNICODE or locale's writing system?

Albert-Jan Roskam-2
Hi Robert,

These are the 'reg add' commands for SPSS v22, with a Dutch locale. You could probably omit the locale setting. I just wanted this to be guaranteed to be the same on every computer (useful e.g. when reading .csv files). As Jon said, you can accomplish the same thing via the GUI, but these commands are handy when you are configuring many machines.

:: notice that caps are escaped! (why?)
:: registry location of the SPSS Statistics 22 preferences
set spss_base_path=HKEY_LOCAL_MACHINE\Software\JavaSoft\Prefs\com\ibm\/S/P/S/S
set spssKey=%spss_base_path%\/Statistics\22.0

:: default to code page mode (Unicode off) with a Dutch cp1252 locale
set keyname=%spssKey%\core\set
reg add "%keyname%" /v "/Unicode" /t REG_SZ /d "/No" /f
reg add "%keyname%" /v "/Locale" /t REG_SZ /d "nl_/N/L.cp1252" /f

:: suppress the welcome dialog and mark the Unicode warning as already shown
set keyname=%spssKey%\ui\dialog_settings\welcome_page
reg add "%keyname%" /v "show_welcome_dialog" /t REG_SZ /d "0" /f
reg add "%keyname%" /v "shown_unicode_warning" /t REG_SZ /d "1" /f
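
On the "(why?)" in the comment above: as far as I know, the escaping comes from the way Java's Preferences API stores names in the Windows registry. Registry key names are not case sensitive, so each capital letter is written with a leading slash to keep it distinguishable from its lowercase form. The Python sketch below reproduces only that capital-letter rule; the function name is hypothetical, and other special cases (slashes, backslashes, non-ASCII characters) are deliberately ignored.

```python
def escape_java_pref_name(name: str) -> str:
    """Prefix every ASCII capital letter with '/' (simplified sketch)."""
    return "".join("/" + ch if "A" <= ch <= "Z" else ch for ch in name)


for plain in ["SPSS", "Statistics", "Unicode", "No", "nl_NL.cp1252"]:
    print(plain, "->", escape_java_pref_name(plain))
```

The results match the names and values used in the commands above, for example `nl_NL.cp1252` becomes `nl_/N/L.cp1252`, so a different locale would need the same treatment.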

You can also modify the registry to use an external Python interpreter, but I later discovered that it's also possible to set the PYTHONHOME environment variable instead (and easier, because, after all, we all loathe the Windows registry, don't we?). Be sure to either add the original site-packages (with the spss/python API files) to PYTHONPATH, or copy those files to the external site-packages dir (we created an 'spss22' dir there and use .pth files to refer to spssaux, spss, SpssClient etc.).


Regards,

Albert-Jan



~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
