We are about to upgrade to SPSS ver 23 (running under a network license), which could be the right time to find the best settings. Today there are problems from time to time with the language settings, and I haven’t found out what the best setup would be.

Today, the default is UNICODE ON, which works for most users. However, I have personally set UNICODE OFF, since this has caused the fewest problems with Swedish (diacritical) letters in the syntax files. Still, there are problems when I open other users’ files: diacritical letters are replaced by question marks or other strange symbols. I guess the problems will persist as long as people have old data files and there is a mismatch between these files and whatever language setting the system is running.

For the future, could anything be said about the best language setup in an environment where people want to use (Swedish) diacritical letters for variable names and labels, as well as in syntax files? Should it be UNICODE OFF as default?

Any recommendations and/or experiences are welcome.

Robert

*´*´*´*´*´*´*´
Robert Lundqvist
Norrbotten county council
Sweden
Robert Lundqvist
|
I would strongly suggest that you use UNICODE ON. Swedish diacritics are in the standard western code page (1252), which is probably what your Windows setting is, so old sav files that use that encoding should open without any data loss. You might want to use ALTER TYPE to reduce string variable sizes if the file has to be converted, and be sure not to keep resaving in code page mode and reopening in Unicode mode, because the string expansion will happen again each time you reopen the file.
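
As a rough illustration of why string widths grow on conversion: widths are counted in bytes, and each Swedish diacritic takes one byte in code page 1252 but two bytes in UTF-8. The sketch below is plain Python, not SPSS syntax, and the sample values and the helper name `byte_widths` are made up for the example.

```python
# Hypothetical sample values; any strings with Swedish diacritics would do.
samples = ["Norrbotten", "Räksmörgås", "Luleå älv"]

def byte_widths(text):
    """Return (characters, bytes in code page 1252, bytes in UTF-8) for text."""
    return len(text), len(text.encode("cp1252")), len(text.encode("utf-8"))

for value in samples:
    chars, cp1252_bytes, utf8_bytes = byte_widths(value)
    print(f"{value!r}: {chars} chars, {cp1252_bytes} bytes in cp1252, "
          f"{utf8_bytes} bytes in UTF-8")
```

Values that are pure ASCII need the same number of bytes either way, which is why trimming widths after a conversion can recover most of the extra space.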
If you have sav files or other inputs that are not in code page 1252 but are in some other single code page, you can specify the encoding when you do the import, or you can set the Statistics locale appropriately. (You can now set the Statistics locale from Edit > Options > Language as well as in syntax.)

You will also see in 22 or 23 that the Python Essentials installation is fully integrated into the main Statistics install, and, in V23, lots of extension commands are installed with it.

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621
=====================
To manage your subscription to SPSSX-L, send a message to LISTSERV@... (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command INFO REFCARD
|
In reply to this post by Robert L
From: Robert Lundqvist <[hidden email]>; To: <[hidden email]>; Subject: [SPSSX-L] UNICODE or locale's writing system? Sent: Mon, Apr 13, 2015 1:56:48 PM
|
In reply to this post by Jon K Peck
Seems as if there are two answers here: Jon is suggesting UNICODE ON and Albert-Jan the opposite. I will discuss this further with our sys admins.
However, whatever the solution, I guess there will be instances in the future where there is a mismatch between old files and the new settings. If it happens, if there are strange symbols (mojibake is it?) in either the syntax files or the data files, is there
any easy way to convert the files into a better format? Would that be the “reg add” commands you are suggesting, Albert-Jan?
Robert
Robert Lundqvist
|
You can control your mode setting via Edit > Options > Language, and the setting will be remembered from session to session. There is no need to do this by direct manipulation of the Registry, although if installing on many machines, the Registry script could automate the default setting.
You should be consistent throughout the organization in the default mode in order to avoid constantly converting sav files from code page to Unicode and back. In general, after converting a code page file to Unicode, recovering the unnecessary expansion via ALTER TYPE or "Minimize string widths based on observed values" may be useful if the file contains a lot of string data but is mostly 7-bit ASCII text.

If data are read and show question marks or mojibake, you should fix that before moving on. Those errors could happen if the input file is not actually encoded in the expected way, so whether you are reading it in code page or Unicode mode, you should resolve the issue before saving the data. Of course, if the input characters span code pages, e.g., Swedish and Russian, it can only be processed in Unicode mode.

SPSS data files created by version 15 or later have the character encoding marked in the file. I have seen a few instances, however, of sav files created by third parties that were incorrectly marked or not marked at all.

In either mode, best practice is to abandon the old string functions that are byte oriented and use the new CHAR.* functions, which are character oriented. That syntax will then work in either mode.

By the way, if you have large sav files, check out the SPSS Statistics Compressed (zsav) format. It often dramatically reduces the size of the file, but I suspect that most users are not aware of it. Working recently with a rather large file, I saw that the sav size was 854MB while the zsav file was 275MB - a ratio of 32%. zsav was introduced in version 21. It is the second choice in File > Save As for data or /ZCOMPRESSED in SAVE syntax. The GET command automatically recognizes the compression type. There is some extra overhead in compression/decompression, but this is often more than compensated for by the reduced amount of I/O activity.
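
To illustrate how the question marks and mojibake mentioned above typically arise, here is a small Python sketch. The sample text and the helper name `to_utf8` are only illustrative, and this says nothing about what Statistics does internally. It decodes bytes with the wrong encoding in both directions, then shows that converting with the correct source encoding round-trips cleanly.

```python
text = "Ålder på våren"  # made-up Swedish label

# UTF-8 bytes misread as code page 1252: each diacritic turns into two symbols.
mojibake = text.encode("utf-8").decode("cp1252")

# Code page 1252 bytes misread as UTF-8: the invalid bytes become U+FFFD,
# the replacement character that many programs display as a question mark.
replaced = text.encode("cp1252").decode("utf-8", errors="replace")

def to_utf8(raw_bytes, source_encoding="cp1252"):
    """Convert bytes in a known single-byte code page to UTF-8 bytes."""
    return raw_bytes.decode(source_encoding).encode("utf-8")

print(mojibake)
print(replaced)
print(to_utf8(text.encode("cp1252")).decode("utf-8"))
```

The mojibake case is usually reversible as long as nothing has been resaved, whereas the replacement characters in the second case have already lost the original letters. That is one reason to fix the encoding before saving.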
Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621
|
Hi Robert,
These are the 'reg add' commands for SPSS v22, with a Dutch locale. You could probably omit the locale setting. I just wanted this to be guaranteed to be the same on every computer (useful e.g. when reading .csv files). As Jon said, you can accomplish the same thing via the GUI, but these commands are handy when you are configuring many machines.

    :: notice that caps are escaped! (why?)
    set spss_base_path=HKEY_LOCAL_MACHINE\Software\JavaSoft\Prefs\com\ibm\/S/P/S/S
    set spssKey=%spss_base_path%\/Statistics\22.0
    set keyname=%spssKey%\core\set
    reg add "%keyname%" /v "/Unicode" /t REG_SZ /d "/No" /f
    reg add "%keyname%" /v "/Locale" /t REG_SZ /d "nl_/N/L.cp1252" /f
    set keyname=%spssKey%\ui\dialog_settings\welcome_page
    reg add "%keyname%" /v "show_welcome_dialog" /t REG_SZ /d "0" /f
    reg add "%keyname%" /v "shown_unicode_warning" /t REG_SZ /d "1" /f

You can also modify the registry to use an external Python interpreter, but I later discovered that it's also possible (and easier, because, after all, we all loathe the Windows registry, don't we?) to set the PYTHONHOME environment variable. Be sure to either add the original site-packages (with the spss/python api files) to PYTHONPATH, or to copy those files to the external site-packages dir (we created an 'spss22' dir there and use .pth files to refer to spssaux, spss, SpssClient etc.)

Regards,
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
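
On the "(why?)" next to the escaped capitals: as far as I understand it, these keys are written through the Java Preferences API, whose names are case sensitive while Windows registry key names are not, so the Windows implementation marks each capital letter with a leading slash. A simplified Python sketch of that rule follows. The function name is hypothetical, and it only covers the capital-letter case, not the other characters that Java escapes.

```python
def escape_capitals(name):
    """Prefix each ASCII capital letter with '/', as seen in the keys above.

    Simplification: the real encoding also treats slashes, backslashes and
    non-ASCII characters specially; those cases are ignored here.
    """
    return "".join("/" + ch if "A" <= ch <= "Z" else ch for ch in name)

for plain in ["SPSS", "Statistics", "Unicode", "No", "nl_NL.cp1252"]:
    print(plain, "->", escape_capitals(plain))
```

A helper like this can be handy for working out the escaped form of another value, for example a Swedish locale string, before putting it into a reg add command.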