Unicode Encoding

13 messages

Unicode Encoding

Atai Winkler

Hi,

I have run a program many times, and just this morning I received the following error message when it reads in a .sav file:

Warning # 5281.  Command name: GET FILE
SPSS Statistics is running in Unicode encoding mode.  This file is encoded in
a locale-specific (code page) encoding.  The defined width of any string
variables are automatically tripled in order to avoid possible data loss.  You
can use ALTER TYPE to set the width of string variables to the width of the
longest observed value for each string variable.

The encoding comment at the top of all the programs is

* Encoding: UTF-8.

The bottom right of the screen says ‘Unicode on’.

Why does this happen and how can I correct it?

Thank you.

Atai

 

Dr Atai Winkler
Principal Consultant
PAM Analytics
[hidden email]
pamanalytics.com

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Re: Unicode Encoding

John F Hall

Just open the file and carry on anyway.  I do this all the time, especially with older files, but sometimes it is advisable to save them with a different name.

 

John F Hall MA (Cantab) Dip Ed (Dunelm)
IBM-SPSS Academic Author 9900074
Email: [hidden email]
Website: Journeys in Survey Research
Course: Survey Analysis Workshop (SPSS)

From: SPSSX(r) Discussion <[hidden email]> On Behalf Of Atai Winkler
Sent: 19 January 2021 22:15
To: [hidden email]
Subject: Unicode Encoding

 


Re: Unicode Encoding

Jon Peck
In reply to this post by Atai Winkler
This is not an error message; it is just informative.  When Statistics is running in Unicode mode, which has been the default since V21, and you open a .sav file that was created in code page mode, all text (strings in the data, variable names, labels, etc.) is converted from the code page encoding to Unicode.  If your text is just plain ASCII characters, the encoding is actually the same.  If it contains accented or non-Western characters, their internal codes change and the strings get longer: Unicode supports well over 100,000 characters, so the codes don't necessarily fit in a single byte.  Using Unicode means that you can use any combination of characters, and strings will be displayed and handled correctly anywhere in the world.

In order to guarantee that no text is lost, string variable widths are tripled.  That is a worst-case expansion.  If you take the suggestion in the GUI to use ALTER TYPE, or run it explicitly to minimize the string sizes, the extra space is reclaimed, so for plain ASCII text you are back where you started, or even smaller if there was excess blank space.  When you resave the data file, it will be marked as Unicode, and you won't see that warning again when you reopen it.
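To make the worst-case claim concrete, here is a small Python sketch (not SPSS syntax, and not how Statistics does the conversion internally). It assumes cp1252 as an example code page; the helper name `utf8_expansion` is made up for illustration. It compares byte counts before and after conversion and checks that no character defined in that code page needs more than three bytes in UTF-8.

```python
# Sketch: why tripling is a safe upper bound when converting single-byte
# code page text to UTF-8. cp1252 is an assumed example code page.

def utf8_expansion(text, codepage="cp1252"):
    """Return (bytes in the code page, bytes in UTF-8) for the same text."""
    return len(text.encode(codepage)), len(text.encode("utf-8"))

samples = {
    "plain ascii": "plain ascii",        # ASCII only: same size in both
    "cafe with accent": "caf\u00e9",     # e-acute takes 2 bytes in UTF-8
    "three euro signs": "\u20ac\u20ac\u20ac",  # each euro sign takes 3 bytes
}
results = {label: utf8_expansion(text) for label, text in samples.items()}

# Check every byte value that cp1252 defines.
worst = 0
for b in range(256):
    try:
        ch = bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        continue  # a few byte values are undefined in cp1252
    worst = max(worst, len(ch.encode("utf-8")))
```

Printing `results` and `worst` shows that plain ASCII text does not grow at all, which is why shrinking the widths again after opening the file usually loses nothing.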

Most code will be unaffected, but you should use the char.* string functions, which are character oriented, rather than the equivalent old byte-oriented functions.  Another benefit is that these functions automatically strip trailing blanks, so you no longer need to use RTRIM.
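The difference between byte-oriented and character-oriented functions can be illustrated with a Python analogy. This is only an analogy: it does not call any SPSS function, and the helper `is_valid_utf8` is a made-up name. Python `bytes` stand in for byte-oriented handling and `str` for character-oriented handling.

```python
# Python analogy only: bytes play the role of byte-oriented string functions,
# str plays the role of character-oriented ones.

value = "caf\u00e9"              # "cafe" with an accented e: 4 characters
raw = value.encode("utf-8")      # 5 bytes, because the accented e takes 2

char_length = len(value)         # character-oriented length
byte_length = len(raw)           # byte-oriented length

char_prefix = value[:4]          # first 4 characters: the whole word
byte_prefix = raw[:4]            # first 4 bytes: cuts the accented e in half

def is_valid_utf8(data):
    """Report whether a byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Counting or cutting by bytes gives a different length and can leave half a character behind, which is the kind of problem character-oriented functions avoid.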



--
Jon K Peck
[hidden email]


Re: Unicode Encoding

Bruce Weaver
Administrator
And here is how to implement the advice about ALTER TYPE:

* Format all string variables to have the maximum width needed.
ALTER TYPE ALL (A=AMIN).










--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

Re: Unicode Encoding

MLIves
The issue with both Jon's and Bruce's suggestions is that when you have multiple large files, or both old (non-Unicode) and new (Unicode) files that are merged together, it can be very tedious to run ALTER TYPE for all string variables, and the AMIN width may differ for the same variable across files.

I found that double-clicking on the file name to open it in SPSS results in the notice AND a question of whether you want to multiply the string widths.  If you don't use a language with accented or non-Western characters, you can choose 'No' in response to that question.  The file is then opened in Unicode without changing the variable widths and can be saved as a Unicode version.

Melissa

-----Original Message-----
From: SPSSX(r) Discussion <[hidden email]> On Behalf Of Bruce Weaver
Sent: Wednesday, January 20, 2021 10:50 AM
To: [hidden email]
Subject: Re: [SPSSX-L] Unicode Encoding


Re: Unicode Encoding

Jon Peck
You can use the STATS ADJUST WIDTHS extension command to synchronize string widths across a batch of .sav files with one command.

As for the second point, that is not correct.  You cannot open a code page file with Statistics in Unicode mode without tripling the widths, at least temporarily, but you can turn off Unicode mode.



On Thu, Jan 21, 2021 at 8:49 AM Ives, Melissa L <[hidden email]> wrote:
--
Jon K Peck
[hidden email]


Re: Unicode Encoding

Jon Peck
To elaborate a little on what happens with the "No" answer (not reclaiming excess space in string variables): the widths are tripled, so if you do this with all the files that you might want to use together, they will work the same way but just be a little bigger.  That isn't usually a big problem with .sav files, as they tend not to have a lot of string variables.  But you would have to open and resave all of those files.

With a "Yes" answer, Statistics just runs ALTER TYPE (A=AMIN) to reset the widths.  However, the outcome is data dependent, so there is no guarantee that string widths will match across files afterwards.  But, as I said, STATS ADJUST WIDTHS will easily fix this for a batch of files.  It provides several ways to set the string widths, but the default is like A=AMIN, except applied across all the selected files.
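The data-dependence problem, and the idea of taking one width across all files, can be sketched in Python. This is an illustration of the idea only, not the implementation of STATS ADJUST WIDTHS. The two dictionaries are hypothetical stand-ins for .sav files, the function names are made up, and widths are assumed to be counted in UTF-8 bytes.

```python
# Hypothetical in-memory stand-ins for two .sav files: variable name -> values.
file_a = {"city": ["Oslo", "Rome"], "code": ["A1", "B22"]}
file_b = {"city": ["Malm\u00f6", "Lund"], "code": ["C3"]}

def amin_widths(dataset):
    """Per-file minimum width: longest observed value in UTF-8 bytes
    (ignoring trailing blanks), never below 1."""
    return {
        name: max([len(v.rstrip(" ").encode("utf-8")) for v in values] + [1])
        for name, values in dataset.items()
    }

def synchronized_widths(datasets):
    """One width per variable: the largest per-file minimum across all files."""
    merged = {}
    for dataset in datasets:
        for name, width in amin_widths(dataset).items():
            merged[name] = max(merged.get(name, 0), width)
    return merged

per_file = [amin_widths(file_a), amin_widths(file_b)]
common = synchronized_widths([file_a, file_b])
```

Here the per-file minimum widths disagree for both variables, while `common` gives a single width per variable that is wide enough for every file.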

Another thing to consider is that with Statistics V27+, Python 2 is no longer supported out of the box, although you can install and use it anyway.  Python 3 is Unicode only, so in code page mode you can't use the extension commands on the Extension Hub, which were all converted to Python 3.  I posted a note yesterday about a new extension command that converts your existing Python 2 code to Python 3.

On Thu, Jan 21, 2021 at 8:53 AM Jon Peck <[hidden email]> wrote:
You can use the STATS ADJUST WIDTHS extension command to synchronize string widths over a batch of sav files with one command.

As for the second point, that is not correct.  You cannot open a code page file with Statistics in Unicode mode without tripling the widths, at least temporarily, but you can turn off Unicode mode.



On Thu, Jan 21, 2021 at 8:49 AM Ives, Melissa L <[hidden email]> wrote:
The issue with both Jon and Bruce's suggestions is when you have multiple large files or both old (non Unicode) and new (Unicode) files that are merged together, it can be very tedious to ALTER TYPE for all string variables, and AMIN may differ for the same variable across files.

I found that double clicking on the file name to open it in SPSS, results in the notice AND a question of whether you want to multiply the string widths.  If you don't use a language that uses the accented or nonwestern characters, and choose 'No' as the response to that question.  The file is opened in Unicode without changing the variable widths and can be saved as a Unicode version.

Melissa

-----Original Message-----
From: SPSSX(r) Discussion <[hidden email]> On Behalf Of Bruce Weaver
Sent: Wednesday, January 20, 2021 10:50 AM
To: [hidden email]
Subject: Re: [SPSSX-L] Unicode Encoding

EXTERNAL EMAIL: This email originated from outside of the organization. Do not click any links or open any attachments unless you trust the sender and know the content is safe.

And here is how to implement the advice about ALTER TYPE:

* Format all string variables to have the maximum width needed.
ALTER TYPE ALL (A=AMIN).




Jon Peck wrote
> This is not an error message.  It's just informative.  When Statistics
> is running in Unicode mode, which has been the default since V21, and
> you open a sav file that was created in code page mode, all text,
> including strings in data, variable names and labels, etc, are
> converted from code page encoding to Unicode.  If your text is just
> plain ascii characters, the encoding is actually the same, but if it
> contains accented characters or text with nonwestern characters, their
> internal codes change, and strings get longer, because Unicode
> supports well over 100,000 characters, so the codes don't necessarily
> fit in a single byte.  Using Unicode means that you can use any
> combination of characters, and strings will be displayed and handled
> correctly anywhere in the world.
>
> In order to guarantee that no text is lost, string variable widths are
> tripled.  That is a worst case expansion.  If you take the suggestion
> in the gui to use ALTER TYPE or do this explicitly to minimize the
> string sizes, the extra space will be reclaimed, so for plain ascii
> text, you are back where you started or even smaller if there was excess blank space.
> When you resave the data file, it will be marked as in Unicode, and
> you won't see that warning again when you reopen it.
>
> Most code will be unaffected, but you should use the char.* string
> functions, which are character oriented, rather than the equivalent
> old byte-oriented functions.  Another benefit is that these functions
> automatically strip trailing blanks, so that you no longer need to use
> RTRIM.
>

--
Jon K Peck
[hidden email]


Re: Unicode Encoding

wsu_wright
In reply to this post by MLIves
The problem we have with Unicode mode is that SPSS does not appear to detect data that already come from a Unicode environment, and so it artificially expands the width of columns that are already in Unicode format.  For example, we import and export data from our student information system (Ellucian Banner), which resides in an Oracle database that is Unicode defined.  A column that holds a 4-character code, such as the major code 'G20F', is defined in the Oracle database as 16 characters wide.  As long as we have Unicode turned OFF in SPSS, that width of 16 is honored when we import from the Oracle database, which makes it easy to export back into the database.  If, however, we have Unicode turned ON, then SPSS takes the column that is already defined as Unicode in our database and applies the Unicode expansion to it again, changing the width from 16 to 48.  We then have to use ALTER TYPE to set the SPSS column (48) back to the width it has in the database of origin (16) before we can export.

Thankfully, at least as of version 27, you can turn Unicode OFF in the Options\Language tab, which saves us from modifying literally hundreds of syntax jobs to insert ALTER TYPE and set every string column back to its original width (which can involve hundreds of columns).  Having this as an SPSS setting makes imports from and exports to a Unicode environment convenient, while still leaving the option to turn Unicode ON temporarily when needed.

However, in working with the development team designing the new SPSS interface, SPSS-NX, we have learned that at this point the option to turn Unicode OFF in the SPSS settings will no longer be available.  Instead, to prevent SPSS from redundantly expanding columns that are already defined as Unicode, you must insert the setting (SET UNICODE=OFF) in each syntax job.  We would have to alter thousands of syntax jobs to accommodate something that could be a setting rather than a syntax command.  Adding insult to injury, we would not only have to insert the SET UNICODE=OFF syntax but also insert ALTER TYPE and redefine the correct width for each string column we need to export, which could mean anything from a few string columns to hundreds.  SPSS-NX is not yet done, so I'm hoping they change their minds about not allowing users to turn Unicode OFF at the settings level.
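
To put rough numbers on the 16-to-48 change described above, here is a small sketch in plain Python (not SPSS syntax).  It assumes the source text is in a single-byte code page such as cp1252 and that widths are counted in UTF-8 bytes; the function names are made up for illustration.

```python
# Illustrative only: plain Python, not SPSS syntax.
def utf8_len(text: str) -> int:
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def worst_case_expansion(codepage: str) -> int:
    """Largest UTF-8 byte count produced by any single byte of a one-byte code page."""
    worst = 0
    for value in range(256):
        # Bytes that the code page leaves undefined decode to '' and are skipped.
        char = bytes([value]).decode(codepage, errors="ignore")
        worst = max(worst, utf8_len(char))
    return worst


declared_width = 16                 # width reported for the column in the post above
tripled_width = declared_width * 3  # 48, the width seen after the import
needed_width = utf8_len("G20F")     # bytes actually needed for the sample code

print(worst_case_expansion("cp1252"), tripled_width, needed_width)
```

Under these assumptions the factor of three is the worst case for cp1252 (the euro sign, for instance, takes three bytes in UTF-8), while a plain ASCII code such as 'G20F' needs no extra room at all.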



> On January 21, 2021 at 10:48 AM "Ives, Melissa L" <[hidden email]> wrote:
>
>
> The issue with both Jon's and Bruce's suggestions is that when you have multiple large files, or both old (non-Unicode) and new (Unicode) files that are merged together, it can be very tedious to ALTER TYPE for all string variables, and AMIN may differ for the same variable across files.
>
> I found that double-clicking on the file name to open it in SPSS results in the notice AND a question asking whether you want to multiply the string widths.  If you don't use a language with accented or non-Western characters, you can choose 'No' in response to that question; the file is then opened in Unicode without changing the variable widths and can be saved as a Unicode version.
>
> Melissa
>
> -----Original Message-----
> From: SPSSX(r) Discussion <[hidden email]> On Behalf Of Bruce Weaver
> Sent: Wednesday, January 20, 2021 10:50 AM
> To: [hidden email]
> Subject: Re: [SPSSX-L] Unicode Encoding
>
> And here is how to implement the advice about ALTER TYPE:
>
> * Format all string variables to have the maximum width needed.
> ALTER TYPE ALL (A=AMIN).

Re: Unicode Encoding

Jon Peck
This sounds to me like a problem with the Oracle ODBC driver you are using, or maybe with the database encoding definition.  Normally, a driver would supply the encoding information, and if the data are already in Unicode, Statistics would not increase the string width.  However, it might be that the Oracle database is in UTF-16, which is two bytes per character for most text (four for characters outside the Basic Multilingual Plane), while Statistics uses UTF-8, which is one to four bytes per character (at most three within the Basic Multilingual Plane), so some adjustment would still be required.  I know that some years ago Oracle supported an encoding they called UTF8, but it was not actually standard UTF-8.  There was a big fuss, and Oracle didn't want to change their encoding definition, because it would have broken some Oracle jobs.  I think they came up with another way to define the encoding, but I don't know where things stand now.
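
As a rough illustration of why a width that is right in one Unicode encoding may not be right in the other, the following plain Python sketch (not SPSS or Oracle code) compares byte counts for a few sample strings.  The samples are arbitrary, and `byte_lengths` is a name made up for this example.

```python
# Illustrative only: plain Python, not SPSS or Oracle code.
samples = ["G20F", "\u00e9", "\u20ac", "\u4e2d"]  # ASCII code, e-acute, euro sign, a CJK character


def byte_lengths(text: str) -> tuple[int, int]:
    """Return (UTF-8 bytes, UTF-16 bytes) for the text; utf-16-le avoids counting a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for text in samples:
    utf8_bytes, utf16_bytes = byte_lengths(text)
    print(f"{text!r}: UTF-8 {utf8_bytes} bytes, UTF-16 {utf16_bytes} bytes")
```

Plain ASCII text is half the size in UTF-8, whereas the euro sign and the CJK character are larger in UTF-8 than in UTF-16, so neither byte width can simply be reused for the other encoding.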

This link has some discussion of Oracle Unicode encoding names.

If there is a specific encoding name in use that Statistics does not currently understand, I suggest reporting that as a bug.  The page I referenced above says:
"Note that UTF8 and AL32UTF8 are Oracle specific names and UTF-8 (with a -) refers to the Unicode standard UTF-8 encoding scheme."

Besides running ALTER TYPE on strings after the import, which would fix things, if you need to set widths according to previous exact widths, the STATS ADJUST WIDTHS extension command can adjust widths for a batch of files in various ways, including having a reference file or dataset of widths to process.  Unfortunately, the new UI version of Statistics does not currently support Python or R programs or extension commands, although I have bugged them about this regularly.
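
The idea of choosing widths either from the observed data or from a reference list can be sketched in plain Python.  This is only an illustration of the logic, not the STATS ADJUST WIDTHS implementation; it assumes widths are counted in UTF-8 bytes, and the function names, column names, and data are all hypothetical.

```python
# Illustrative only: plain Python, not the STATS ADJUST WIDTHS implementation.
def min_width(values: list[str]) -> int:
    """Smallest byte width (at least 1) that holds every value, ignoring trailing blanks."""
    return max([1] + [len(v.rstrip(" ").encode("utf-8")) for v in values])


def plan_widths(columns: dict[str, list[str]], reference: dict[str, int]) -> dict[str, int]:
    """Use the reference width when one is given, else the minimum width that fits the data."""
    plan = {}
    for name, values in columns.items():
        needed = min_width(values)
        target = reference.get(name, needed)
        if target < needed:
            # Refuse to pick a width that would truncate observed values.
            raise ValueError(f"{name}: reference width {target} is below the {needed} bytes needed")
        plan[name] = target
    return plan


columns = {"major": ["G20F  ", "B10A  "], "city": ["Z\u00fcrich", "Oslo"]}  # made-up data
reference = {"major": 16}  # hypothetical width taken from the source database
print(plan_widths(columns, reference))
```

Here "major" keeps the reference width of 16 even though its values need only 4 bytes, while "city" falls back to the 7 bytes its longest value requires.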

If you do need to alter a lot of syntax at some point, you might consider using a startup script (Python or Basic) to set a configuration or define a macro or extension outside the existing scripts.


--
Jon K Peck
[hidden email]


Re: Unicode Encoding

wsu_wright

Thanks Jon.  I'll check with our system folks tomorrow on the UTF issue and get back to you.  In terms of drivers, we are currently using the IBM SPSS OEM 8.0 Oracle Wire Protocol driver provided via the Data Access Pack.

On January 31, 2021 at 11:03 AM Jon Peck <[hidden email]> wrote:

This sounds to me like a problem with the Oracle ODBC driver you are using or maybe the database encoding definition.  Normally, a driver would supply the encoding information, and if already in Unicode, Statistics would not increase the string width.  However, it might be that the Oracle database is in UTF-16, which is two bytes per character, while Statistics uses utf-8, which is 1 to 3 bytes per character so would still require some adjustment.  I know that some years ago Oracle supported an encoding they called utf-8, but it was actually incorrect.  There was a big fuss, and Oracle didn't want to change their encoding definition, because it would have broken some Oracle jobs.  I think they came up with another way to define the encoding, but I don't know where things stand now.

This link
has some discussion of Oracle Unicode encoding names.

If there is a specific encoding name in use that Statistics does not currently understand, I suggest reporting that as a bug.  The page I referenced above says
Note that UTF8 and AL32UTF8 are Oracle specific names and UTF-8 (with a -) refers to the Unicode standard UTF-8 encoding scheme.

Besides running ALTER TYPE on strings after the import, which would fix things, if you need to set widths according to previous exact widths, the STATS ADJUST WIDTHS extension command can adjust widths for a batch of files in various ways, including having a reference file or dataset of widths to process.  Unfortunately, the new UI version of Statistics does not currently support Python or R programs or extension commands, although I have bugged them about this regularly.

If you do need to alter a lot of syntax at some point, you might consider using a startup script (Python or Basic) to set a configuration or define a macro or extension outside the existing scripts.

On Sun, Jan 31, 2021 at 6:09 AM coxspss coxspss <[hidden email]> wrote:
The problem we have with unicode is that it appears it is not capable of detecting previously defined unicode environments, and accordingly, artificially expands column width of columns which are already in unicode format.  For example, we import and export data from our student information system (Ellucian Banner) that resides in an oracle database that is unicode defined.  So a column that holds a 4 char code, like a major code of 'G20F', is defined in the oracle database as being 16 characters in width.  As long as we have unicode turned OFF in SPSS, that column width of 16 will be honored as we import from the orcale database making it easy to export back into the database.  If, however, we have unicode turned ON in SPSS, than SPSS takes the previously defined unicode column from our database and applies unicode on it again making the column change from 16 to 48.  This  then requires the use of ALTER TYPE to reconfigure the SPSS unicode column (48) back to its original unicode co
 lumn size (16) in the database of origin in order to perform exports.  Thankfully, at least as of version 27, you can change the setting of unicode in the Options\Language tab to OFF, which allows us to not have to modify literally hundreds of syntax jobs to insert ALTER TYPE and define every string column back to its original unicode state (which can include hundreds of columns).  This ability to set the SPSS settings makes it convenient for import/exports to a unicode environment, while still having the option to temporarily turn unicode ON when needed.  However, in working with the development team designing the new SPSS interface, SPSS-NX, at this point the option to set unicode OFF in the SPSS settings will no longer be available.  Instead, if you choose to prevent SPSS from redundantly expanding columns on previously defined unicode columns, you must insert the setting (SET UNICODE=OFF) in each syntax. We would have to alter thousands of syntax jobs to accommodate something th
 at could be a setting rather than a syntax function. Adding insult to injury, we would not only have to insert the SET UNICODE=OFF syntax but also insert ALTER TYPE and redefine the correct unicode width on each string column we need to export which could be from a few string columns to hundreds of string defined formats. SPSS-NX is not yet done so I'm hoping they change their minds on not allowing users to set unicode OFF at the settings level.



> On January 21, 2021 at 10:48 AM "Ives, Melissa L" <[hidden email]> wrote:
>
>
> The issue with both Jon and Bruce's suggestions is when you have multiple large files or both old (non Unicode) and new (Unicode) files that are merged together, it can be very tedious to ALTER TYPE for all string variables, and AMIN may differ for the same variable across files.
>
> I found that double clicking on the file name to open it in SPSS, results in the notice AND a question of whether you want to multiply the string widths.  If you don't use a language that uses the accented or nonwestern characters, and choose 'No' as the response to that question.  The file is opened in Unicode without changing the variable widths and can be saved as a Unicode version.
>
> Melissa
>
> -----Original Message-----
> From: SPSSX(r) Discussion <[hidden email]> On Behalf Of Bruce Weaver
> Sent: Wednesday, January 20, 2021 10:50 AM
> To: [hidden email]
> Subject: Re: [SPSSX-L] Unicode Encoding
>
> And here is how to implement the advice about ALTER TYPE:
>
> * Format all string variables to have the maximum width needed.
> ALTER TYPE ALL (A=AMIN).
>
>
>
>
> Jon Peck wrote
> > This is not an error message.  It's just informative.  When Statistics
> > is running in Unicode mode, which has been the default since V21, and
> > you open a sav file that was created in code page mode, all text,
> > including strings in data, variable names and labels, etc, are
> > converted from code page encoding to Unicode.  If your text is just
> > plain ascii characters, the encoding is actually the same, but if it
> > contains accented characters or text with nonwestern characters, their
> > internal codes change, and strings get longer, because Unicode
> > supports well over 100,000 characters, so the codes don't necessarily
> > fit in a single byte.  Using Unicode means that you can use any
> > combination of characters, and strings will be displayed and handled
> > correctly anywhere in the world.
> >
> > In order to guarantee that no text is lost, string variable widths are
> > tripled.  That is a worst case expansion.  If you take the suggestion
> > in the gui to use ALTER TYPE or do this explicitly to minimize the
> > string sizes, the extra space will be reclaimed, so for plain ascii
> > text, you are back where you started or even smaller if there was excess blank space.
> > When you resave the data file, it will be marked as in Unicode, and
> > you won't see that warning again when you reopen it.
> >
> > Most code will be unaffected, but you should use the char.* string
> > functions, which are character oriented, rather than the equivalent
> > old byte-oriented functions.  Another benefit is that these functions
> > automatically strip trailing blanks, so that you no longer need to use
> > RTRIM.
> >
> > --
> > Jon K Peck
> -----
> --
> Bruce Weaver
> [hidden email]
> http://sites.google.com/a/lakeheadu.ca/bweaver/
>
> "When all else fails, RTFM."
>
> NOTE: My Hotmail account is not monitored regularly.
> To send me an e-mail, please use the address shown above.
>



--
Jon K Peck
[hidden email]


Re: Unicode Encoding

wsu_wright

Jon, I did verify that our Oracle environment is AL32UTF8.  Given that we are using the SPSS-supplied Oracle Wire Protocol driver and our database is AL32UTF8, are you suggesting this should be an IBM ticket?


On January 31, 2021 at 12:50 PM coxspss coxspss <[hidden email]> wrote:

Thanks Jon.  I'll check with our system folks tomorrow on the UTF issue and get back to you.  In terms of drivers, we are currently using the IBM SPSS OEM 8.0 Oracle Wire Protocol driver provided via the Data Access Pack.

On January 31, 2021 at 11:03 AM Jon Peck <[hidden email]> wrote:

This sounds to me like a problem with the Oracle ODBC driver you are using or maybe the database encoding definition.  Normally, a driver would supply the encoding information, and if already in Unicode, Statistics would not increase the string width.  However, it might be that the Oracle database is in UTF-16, which is two bytes per character, while Statistics uses utf-8, which is 1 to 3 bytes per character so would still require some adjustment.  I know that some years ago Oracle supported an encoding they called utf-8, but it was actually incorrect.  There was a big fuss, and Oracle didn't want to change their encoding definition, because it would have broken some Oracle jobs.  I think they came up with another way to define the encoding, but I don't know where things stand now.
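
To make the bytes-per-character point concrete, here is a small Python sketch. Python is used only because it makes encodings easy to inspect; this is not SPSS syntax, and the sample characters are arbitrary choices for illustration. It also shows that standard UTF-8 needs 4 bytes for characters outside the Basic Multilingual Plane, and UTF-16 needs 4 bytes for those same characters, while everything inside that plane fits in at most 3 UTF-8 bytes. That 3-byte ceiling is presumably where the tripling rule comes from, but that is an assumption rather than something stated in this thread.

```python
# Byte cost of single characters in UTF-8 and UTF-16 (illustrative sketch only).

def byte_cost(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-le" is used so that no byte-order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

# Arbitrary samples: plain ASCII, an accented Latin letter,
# a symbol inside the Basic Multilingual Plane, and one outside it.
samples = ["G", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    utf8_bytes, utf16_bytes = byte_cost(ch)
    print(f"U+{ord(ch):04X}: utf-8={utf8_bytes} bytes, utf-16={utf16_bytes} bytes")
```

For plain ASCII the UTF-8 cost is one byte per character, which is why ASCII-only data does not actually need the extra width.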

This link
has some discussion of Oracle Unicode encoding names.

If there is a specific encoding name in use that Statistics does not currently understand, I suggest reporting that as a bug.  The page I referenced above says:
"Note that UTF8 and AL32UTF8 are Oracle specific names and UTF-8 (with a -) refers to the Unicode standard UTF-8 encoding scheme."

Besides running ALTER TYPE on strings after the import, which would fix things, if you need to set widths according to previous exact widths, the STATS ADJUST WIDTHS extension command can adjust widths for a batch of files in various ways, including having a reference file or dataset of widths to process.  Unfortunately, the new UI version of Statistics does not currently support Python or R programs or extension commands, although I have bugged them about this regularly.

If you do need to alter a lot of syntax at some point, you might consider using a startup script (Python or Basic) to set a configuration or define a macro or extension outside the existing scripts.

On Sun, Jan 31, 2021 at 6:09 AM coxspss coxspss <[hidden email]> wrote:
The problem we have with Unicode mode is that SPSS does not appear to detect data that is already in Unicode, and so it artificially expands the width of columns that are already in Unicode format.  For example, we import and export data from our student information system (Ellucian Banner), which resides in an Oracle database that is defined as Unicode.  A column that holds a 4-character code, like a major code of 'G20F', is defined in the Oracle database as being 16 characters wide.  As long as we have Unicode turned OFF in SPSS, that column width of 16 is honored when we import from the Oracle database, which makes it easy to export back into the database.  If, however, we have Unicode turned ON in SPSS, then SPSS takes the already-Unicode column from our database and applies the Unicode expansion again, changing the column width from 16 to 48.  We then have to use ALTER TYPE to set the SPSS column (48) back to its original size in the database of origin (16) before we can export.

Thankfully, at least as of version 27, you can change the Unicode setting to OFF in the Options\Language tab, which means we do not have to modify literally hundreds of syntax jobs to insert ALTER TYPE and define every string column back to its original width (which can involve hundreds of columns).  Being able to do this in the SPSS settings makes imports and exports to a Unicode environment convenient, while still leaving the option to turn Unicode ON temporarily when needed.

However, from working with the development team designing the new SPSS interface, SPSS-NX, I understand that at this point the option to set Unicode OFF in the SPSS settings will no longer be available.  Instead, if you want to prevent SPSS from redundantly expanding columns that are already Unicode, you must insert the setting (SET UNICODE=OFF) in each syntax file.  We would have to alter thousands of syntax jobs to accommodate something that could be a setting rather than a syntax function.  Adding insult to injury, we would not only have to insert SET UNICODE=OFF but also insert ALTER TYPE and redefine the correct width on each string column we need to export, which could be anywhere from a few string columns to hundreds.  SPSS-NX is not yet finished, so I'm hoping they change their minds about not allowing users to set Unicode OFF at the settings level.
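
The width arithmetic in the 'G20F' example can be sketched in a few lines of Python. This is only an illustration of the numbers discussed above, not SPSS code: the function names and the sample values are hypothetical, and `min_width` is a rough stand-in for what ALTER TYPE with A=AMIN aims for, on the assumption that widths are counted in UTF-8 bytes in Unicode mode.

```python
# Illustrative width arithmetic for the 16 -> 48 example (hypothetical helper names).

def tripled_width(declared_width: int) -> int:
    """Worst-case width after the automatic tripling described in warning 5281."""
    return declared_width * 3

def min_width(values: list[str]) -> int:
    """Smallest width, in UTF-8 bytes, that holds every value once trailing blanks are dropped."""
    longest = max((len(v.rstrip(" ").encode("utf-8")) for v in values), default=0)
    return max(1, longest)  # a string variable needs a width of at least 1

# Hypothetical major codes, padded the way a fixed-width column would store them.
major_codes = ["G20F".ljust(16), "M10A".ljust(16)]

print(tripled_width(16))       # width after tripling a 16-wide column
print(min_width(major_codes))  # width an AMIN-style shrink would arrive at
```

Note that a shrink to the longest observed value gives 4 here, not the 16 defined in the database, so restoring the original definition would still need an explicit width such as `ALTER TYPE varname (A16).` rather than AMIN.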




Re: Unicode Encoding

Jon Peck

Yes,  but you might also want to try the ODBC driver from Oracle.  From the note on the page I mentioned, though, I am not sure that AL32UTF8 is actually the standard UTF-8 encoding as per the Unicode Consortium.  If your text is plain ascii, though, that would not matter.
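
The remark that the exact encoding would not matter for plain ASCII text can be checked with a short Python sketch. It is illustrative only: `cp1252` stands in for "some Western code page" and is an assumption, since the thread does not say which code page is involved, and the helper name is hypothetical.

```python
# Check whether a string produces identical bytes in a code page and in UTF-8 (sketch).

def same_bytes_everywhere(text: str, encodings: tuple[str, ...] = ("cp1252", "utf-8")) -> bool:
    """Return True when every listed encoding yields exactly the same bytes for text."""
    try:
        encoded = {text.encode(enc) for enc in encodings}
    except UnicodeEncodeError:
        # The text cannot even be represented in one of the encodings.
        return False
    return len(encoded) == 1

print(same_bytes_everywhere("G20F"))       # plain ASCII
print(same_bytes_everywhere("caf\u00e9"))  # contains an accented character
```

Plain ASCII strings come out byte-for-byte identical, so for such data a mismatch between encoding names should not change the stored values; strings with accented characters do differ, which is the case where the width and encoding questions start to matter.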



--
Jon K Peck
[hidden email]

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD


 


 

--
Jon K Peck
[hidden email]

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Re: Unicode Encoding

Jon Peck
In reply to this post by wsu_wright
Here is a little more information on Oracle utf8/utf-8 character sets from a third-party site.  Surrogate characters are quite rare, so unless you are a linguist or in an exotic locale, you are unlikely to encounter any.

Recently, one of our clients had a question on the differences between these two character sets since they were in the process of making their application global.  In an upcoming whitepaper, we will discuss in detail what it takes (from a RDBMS perspective) to address localization and globalization issues.  As far as these two character sets go in Oracle,  the only difference between AL32UTF8 and UTF8 character sets is that AL32UTF8 stores characters beyond U+FFFF as four bytes (exactly as Unicode defines UTF-8). Oracle’s “UTF8” stores these characters as a sequence of two UTF-16 surrogate characters encoded using UTF-8 (or six bytes per character).  Besides this storage difference, another difference is better support for supplementary characters in AL32UTF8 character set.
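
To make the storage difference concrete, here is a minimal Python sketch (not Oracle code) that encodes one supplementary character both ways. The helper name cesu8_encode is made up for illustration: it mimics the surrogate-pair scheme the passage above attributes to Oracle's "UTF8", while Python's built-in utf-8 codec stands in for AL32UTF8.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text the way the quoted passage describes Oracle's "UTF8":
    characters above U+FFFF become two UTF-16 surrogates, each written
    as a 3-byte UTF-8 sequence. Hypothetical helper, for illustration only."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Split the code point into a UTF-16 surrogate pair.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for surrogate in (high, low):
                # "surrogatepass" lets the utf-8 codec emit lone surrogates.
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


supplementary = "\U0001F600"  # any character beyond U+FFFF would do

print(len(supplementary.encode("utf-8")))  # standard UTF-8: 4 bytes
print(len(cesu8_encode(supplementary)))    # surrogate-pair scheme: 6 bytes
print(cesu8_encode("G20F"))                # plain ASCII is identical in both
```

For text that stays within U+0000 to U+FFFF the two encodings produce the same bytes, which is why the difference rarely shows up in practice.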

 


On Mon, Feb 1, 2021 at 9:15 AM coxspss coxspss <[hidden email]> wrote:

Jon, I did verify that our Oracle environment is AL32UTF8.  Given that we are using the SPSS-supplied Oracle Wire Protocol driver and are on AL32UTF8, are you suggesting this should be an IBM ticket?


On January 31, 2021 at 12:50 PM coxspss coxspss <[hidden email]> wrote:

Thanks Jon.  I'll check with our system folks tomorrow on the UTF issue and get back to you.  In terms of drivers, we are currently using the IBM SPSS OEM 8.0 Oracle Wire Protocol driver provided via the Data Access Pack.

On January 31, 2021 at 11:03 AM Jon Peck <[hidden email]> wrote:

This sounds to me like a problem with the Oracle ODBC driver you are using or maybe the database encoding definition.  Normally, a driver would supply the encoding information, and if the data are already in Unicode, Statistics would not increase the string width.  However, it might be that the Oracle database is in UTF-16, which uses two bytes for most characters, while Statistics uses UTF-8, which takes 1 to 3 bytes per character for anything in the Basic Multilingual Plane (and 4 bytes for supplementary characters), so some adjustment would still be required.  I know that some years ago Oracle supported an encoding they called utf-8, but it was actually incorrect.  There was a big fuss, and Oracle didn't want to change their encoding definition, because it would have broken some Oracle jobs.  I think they came up with another way to define the encoding, but I don't know where things stand now.
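
As a rough illustration of why widths can differ between encodings, the following Python sketch counts the bytes a few sample characters take in a single-byte code page, in UTF-8, and in UTF-16. The choice of cp1252 as the code page and the function name byte_widths are assumptions made for the example; nothing here is taken from Statistics or the Oracle driver.

```python
def byte_widths(ch: str) -> dict:
    """Return how many bytes one character occupies in each encoding."""
    return {
        "cp1252": len(ch.encode("cp1252")),    # assumed legacy code page
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),  # -le avoids counting a BOM
    }


for ch in ["G", "\u00e9", "\u20ac"]:  # G, e-acute, euro sign
    print(repr(ch), byte_widths(ch))
```

Every character in a single-byte code page such as this one maps to a code point below U+FFFF, so it never needs more than 3 bytes in UTF-8. That is consistent with tripling being a worst-case allowance when a code page file is converted.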

This link
has some discussion of Oracle Unicode encoding names.

If there is a specific encoding name in use that Statistics does not currently understand, I suggest reporting that as a bug.  The page I referenced above says
Note that UTF8 and AL32UTF8 are Oracle specific names and UTF-8 (with a -) refers to the Unicode standard UTF-8 encoding scheme.

Besides running ALTER TYPE on strings after the import, which would fix things, if you need to set widths according to previous exact widths, the STATS ADJUST WIDTHS extension command can adjust widths for a batch of files in various ways, including having a reference file or dataset of widths to process.  Unfortunately, the new UI version of Statistics does not currently support Python or R programs or extension commands, although I have bugged them about this regularly.

If you do need to alter a lot of syntax at some point, you might consider using a startup script (Python or Basic) to set a configuration or define a macro or extension outside the existing scripts.

On Sun, Jan 31, 2021 at 6:09 AM coxspss coxspss <[hidden email]> wrote:
The problem we have with Unicode is that SPSS does not appear to detect previously defined Unicode environments, and accordingly it artificially expands the width of columns that are already in Unicode format.  For example, we import and export data from our student information system (Ellucian Banner), which resides in an Oracle database that is Unicode defined.  So a column that holds a 4-character code, like a major code of 'G20F', is defined in the Oracle database as being 16 characters in width.  As long as we have Unicode turned OFF in SPSS, that column width of 16 is honored when we import from the Oracle database, making it easy to export back into the database.  If, however, we have Unicode turned ON in SPSS, then SPSS takes the previously defined Unicode column from our database and applies Unicode to it again, changing the column width from 16 to 48.  This then requires the use of ALTER TYPE to set the SPSS Unicode column (48) back to its original Unicode column size (16) in the database of origin in order to perform exports.

Thankfully, at least as of version 27, you can change the Unicode setting in the Options\Language tab to OFF, which means we do not have to modify literally hundreds of syntax jobs to insert ALTER TYPE and define every string column back to its original Unicode state (which can include hundreds of columns).  This ability to set the SPSS settings makes it convenient for imports/exports to a Unicode environment, while still having the option to temporarily turn Unicode ON when needed.

However, in working with the development team designing the new SPSS interface, SPSS-NX, we learned that at this point the option to set Unicode OFF in the SPSS settings will no longer be available.  Instead, if you want to prevent SPSS from redundantly expanding columns that are already Unicode defined, you must insert the setting (SET UNICODE=OFF) in each syntax file.  We would have to alter thousands of syntax jobs to accommodate something that could be a setting rather than a syntax function.  Adding insult to injury, we would not only have to insert the SET UNICODE=OFF syntax but also insert ALTER TYPE and redefine the correct Unicode width on each string column we need to export, which could range from a few string columns to hundreds of string-defined formats.  SPSS-NX is not yet done, so I'm hoping they change their minds on not allowing users to set Unicode OFF at the settings level.
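
For readers who want to see what "shrinking a tripled column back down" amounts to, here is a small Python sketch of the idea behind trimming each string column to its longest observed value. It is not how ALTER TYPE is implemented; the function name min_utf8_widths and the sample rows are invented for the example, and it assumes that widths are counted in UTF-8 bytes and that trailing blanks do not count.

```python
def min_utf8_widths(rows: list) -> dict:
    """Return, per column, the smallest byte width that holds every
    observed value once trailing blanks are ignored (minimum width 1)."""
    widths = {}
    for row in rows:
        for name, value in row.items():
            needed = len(value.rstrip(" ").encode("utf-8"))
            widths[name] = max(widths.get(name, 1), needed)
    return widths


# Hypothetical data: a padded 4-character major code and a name column.
rows = [
    {"major": "G20F            ", "name": "Zo\u00eb"},
    {"major": "B11A", "name": "Li"},
]

print(min_utf8_widths(rows))
```

Note that a width derived from the data can be smaller than the width declared in the source database, so a job that exports back to a fixed-width column may prefer to restore the declared width rather than the observed minimum.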



> On January 21, 2021 at 10:48 AM "Ives, Melissa L" <[hidden email]> wrote:
>
>
> The issue with both Jon and Bruce's suggestions is when you have multiple large files or both old (non Unicode) and new (Unicode) files that are merged together, it can be very tedious to ALTER TYPE for all string variables, and AMIN may differ for the same variable across files.
>
> I found that double-clicking on the file name to open it in SPSS results in the notice AND a question asking whether you want to multiply the string widths.  If you don't use a language that has accented or non-Western characters, you can choose 'No' in response to that question.  The file is then opened in Unicode without changing the variable widths and can be saved as a Unicode version.
>
> Melissa
>
> -----Original Message-----
> From: SPSSX(r) Discussion <[hidden email]> On Behalf Of Bruce Weaver
> Sent: Wednesday, January 20, 2021 10:50 AM
> To: [hidden email]
> Subject: Re: [SPSSX-L] Unicode Encoding
>
> And here is how to implement the advice about ALTER TYPE:
>
> * Format all string variables to have the maximum width needed.
> ALTER TYPE ALL (A=AMIN).
>
>
>
>
> Jon Peck wrote
> > This is not an error message.  It's just informative.  When Statistics
> > is running in Unicode mode, which has been the default since V21, and
> > you open a sav file that was created in code page mode, all text,
> > including strings in data, variable names and labels, etc, are
> > converted from code page encoding to Unicode.  If your text is just
> > plain ascii characters, the encoding is actually the same, but if it
> > contains accented characters or text with nonwestern characters, their
> > internal codes change, and strings get longer, because Unicode
> > supports well over 100,000 characters, so the codes don't necessarily
> > fit in a single byte.  Using Unicode means that you can use any
> > combination of characters, and strings will be displayed and handled
> > correctly anywhere in the world.
> >
> > In order to guarantee that no text is lost, string variable widths are
> > tripled.  That is a worst case expansion.  If you take the suggestion
> > in the gui to use ALTER TYPE or do this explicitly to minimize the
> > string sizes, the extra space will be reclaimed, so for plain ascii
> > text, you are back where you started or even smaller if there was excess blank space.
> > When you resave the data file, it will be marked as in Unicode, and
> > you won't see that warning again when you reopen it.
> >
> > Most code will be unaffected, but you should use the char.* string
> > functions, which are character oriented, rather than the equivalent
> > old byte-oriented functions.  Another benefit is that these functions
> > automatically strip trailing blanks, so that you no longer need to use
> > RTRIM.

--
Jon K Peck
[hidden email]
