SPSSX Discussion

Followup to "insert character in string'

Maguin, Eugene

Followup to "insert character in string'

I had another problem which, when fixed, yielded a character using the posted syntax bit 'string(241,PIB1)'.

However, it turns out that the '+/-' character is not produced by code 241; it is produced by code 177 in ANSI/ISO Latin 1. Why is this? I assumed the code page for US English is 850.

Gene Maguin

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Jon K Peck

Re: Followup to "insert character in string'

You have the wrong code page for US English. 850 is an older code page used mainly in Europe and on some older operating systems. You want code page 1252. In cp1252 the +/- character is hex B1, or decimal 177. Of course, in Unicode mode everything works.

You can enter this directly from your keyboard (on Windows) by typing Alt+0177 on the numeric keypad. B1 also happens to be the Unicode code point (U+00B1).
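To make the difference between the two code pages concrete, here is a minimal sketch in Python (used only for illustration; it is not SPSS syntax). It relies on Python's built-in cp850 and cp1252 codecs, and the variable names are made up for this example.

```python
# Decode the same byte values under two different code pages.
PLUS_MINUS = "\u00b1"  # the +/- character, Unicode code point U+00B1

in_cp850 = bytes([241]).decode("cp850")    # 241 (hex F1) is +/- in code page 850
in_cp1252 = bytes([177]).decode("cp1252")  # 177 (hex B1) is +/- in code page 1252

# The same numbers mean something else under the other code page.
swapped_241 = bytes([241]).decode("cp1252")  # n with tilde in cp1252
swapped_177 = bytes([177]).decode("cp850")   # a shading block in cp850

print(in_cp850 == PLUS_MINUS, in_cp1252 == PLUS_MINUS)
print(ascii(swapped_241), ascii(swapped_177))
print(hex(ord(PLUS_MINUS)))  # the Unicode code point equals the cp1252 value
```

So 241 gives +/- only if the bytes are interpreted as code page 850, which would explain why it failed in a cp1252 session.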

Regards,
Jon

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621

From: "Maguin, Eugene" <[hidden email]>
To: [hidden email],
Date: 01/03/2013 12:34 PM
Subject: [SPSSX-L] Followup to "insert character in string'
Sent by: "SPSSX(r) Discussion" <[hidden email]>


Maguin, Eugene

Re: Followup to "insert character in string'

Thank you for your reply. I learned some new things. And, now, of course, some followup questions.

Something I’ve always been curious about is why the number for a character needs to be converted to PIB1 in order for the character to appear. I understand that String(177,F3.0) can’t mean two things and that the common meaning is ‘177’. Is it that the character set is stored somewhere, maybe in the OS code, as PIB?

I think I remember seeing that version 21 is all Unicode (but are there multiple Unicode standards?). So again there will be a lookup table like there is now. How will people be able to see that? Will it be in the documentation somewhere?

Thanks, Gene Maguin

From: Jon K Peck [mailto:[hidden email]]
Sent: Thursday, January 03, 2013 8:52 PM
To: Maguin, Eugene
Cc: [hidden email]
Subject: Re: [SPSSX-L] Followup to "insert character in string'


Jon K Peck

Re: Followup to "insert character in string'

See below.


From: "Maguin, Eugene" <[hidden email]>
To: Jon K Peck/Chicago/IBM@IBMUS,
Cc: "[hidden email]" <[hidden email]>
Date: 01/04/2013 07:47 AM
Subject: RE: [SPSSX-L] Followup to "insert character in string'

Thank you for your reply. I learned some new things. And, now, of course, some followup questions.

Something I’ve always been curious about is why the number for a character needs to be converted to PIB1 in order for the character to appear. I understand that String(177,F3.0) can’t mean two things and that the common meaning is ‘177’. Is it that the character set is stored somewhere, maybe in the OS code, as PIB?
>>>The purpose of PIB format is actually to treat the numeric value as a character code: 177 in PIB1 is written as the single byte with value 177 (hex B1). By contrast, '177' would be stored as three characters, with the character code for each digit. So it would look like 313737 internally, because the code point value for '1' is x31 and for '7' is x37. If you apply AHEX format to a character variable (with double the width), you can see the numeric codes representing the characters in hexadecimal. As far as I know, all the code pages currently in use except IBM mainframe EBCDIC encode the digits the same way.
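The contrast between the digit characters and the single binary byte can be sketched in Python (illustration only, not SPSS syntax). Here `struct.pack("B", ...)` stands in for the idea behind PIB1, and `.hex()` plays a role similar to AHEX; the variable names are hypothetical.

```python
import struct

# The number 177 written out as text: three digit characters.
as_text = "177".encode("ascii")

# The number 177 written as one unsigned binary byte (the idea behind PIB1).
as_byte = struct.pack("B", 177)

print(as_text.hex())  # hexadecimal view of the three digit characters
print(as_byte.hex())  # hexadecimal view of the single byte

# Interpreted as a cp1252 character, that single byte is the +/- sign.
print(as_byte.decode("cp1252") == "\u00b1")
```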

I think I remember seeing that version 21 is all Unicode (but are there multiple Unicode standards?). So again there will be a lookup table like there is now. How will people be able to see that? Will it be in the documentation somewhere?
>>>There is one Unicode standard, but because Unicode defines over 100,000 characters (really!), you can't fit the code point value into just one or even two bytes. Let's take a Greek capital omega as an example.
Omega is not included in cp 1252. You would have to use Windows cp 1253 to include the Greek alphabet. If you also needed to include accented Roman characters, you would be out of luck, because they are not in cp 1253. Worse, unless you know what code page your characters are in, you can't interpret the codes correctly. Starting with SPSS version 15, the code page in use is recorded in the SAV file, although there are still SAV files generated by third-party code that do not include this information. Enter Unicode.

Unicode assigns the value 03A9 to omega. It assigns the value E9 to lower case e with acute accent.
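A small Python sketch (illustration only) shows both points: the code point values, and the fact that neither cp1252 nor cp1253 can hold both characters. The helper `fits` is a hypothetical name introduced for this example; the codecs are Python's built-in ones.

```python
OMEGA = "\u03a9"    # Greek capital omega
E_ACUTE = "\u00e9"  # lower case e with acute accent

def fits(text, codec):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

for name, ch in (("omega", OMEGA), ("e-acute", E_ACUTE)):
    print(name, f"U+{ord(ch):04X}",
          "cp1252:", fits(ch, "cp1252"),
          "cp1253:", fits(ch, "cp1253"))

# Only a Unicode encoding can hold both characters in the same string.
both = OMEGA + E_ACUTE
print(fits(both, "cp1252"), fits(both, "cp1253"), fits(both, "utf-8"))
```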

So both can coexist in the same scheme, but the values in that scheme require at least two bytes per character. In fact, though, since Unicode defines more than 64K characters, even two bytes is not always enough. 0010FE80, for example, is a valid Unicode code point. As you might expect, though, characters beyond 64K are very rare and are not well supported in most software.
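As a rough check of the "beyond 64K" case, this Python sketch (illustration only) takes the code point mentioned above and counts how much storage it needs in two common encodings. The variable names are made up for the example.

```python
rare = chr(0x10FE80)  # the code point mentioned above, beyond the first 64K

utf8_len = len(rare.encode("utf-8"))              # bytes needed in UTF-8
utf16_units = len(rare.encode("utf-16-be")) // 2  # 16-bit units needed in UTF-16

print(ord(rare) > 0xFFFF)  # True: it does not fit in two bytes
print(utf8_len)            # number of UTF-8 bytes
print(utf16_units)         # number of UTF-16 code units (a surrogate pair)
```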

So every character has a fixed numerical value assigned by the Unicode standard, but how that value is actually stored in computer memory varies according to the encoding scheme. For practical reasons, allocating two bytes for every character is not always desirable. The letter e, say, which has character code hex 65 (decimal 101), could be stored in one byte. So Statistics uses UTF-8 (Unicode Transformation Format-8) to store data in variable-length units. UTF-8 needs between one and three bytes for any character in the first 64K code points, and four bytes for the rare characters beyond that. This is why, when a code page file is read in Unicode mode, we have to triple the field width, which is defined in bytes, in order to guarantee that no data are lost. (ALTER TYPE can shrink field widths back to what is actually required for the data they hold, which will be one byte per character if the contents are all plain 7-bit ASCII.)
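The variable width, and the reason tripling is a safe upper bound for cp1252 data, can be sketched in Python (illustration only; this says nothing about how Statistics implements the conversion). The names `samples`, `widths`, and `worst` are hypothetical.

```python
# UTF-8 byte counts for a few characters.
samples = {"e": "e", "e-acute": "\u00e9", "omega": "\u03a9", "euro": "\u20ac"}
widths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(hex(ord("e")), widths)

# Worst case over every byte value that cp1252 defines.
worst = 0
for value in range(256):
    try:
        ch = bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        continue  # a few byte values are undefined in cp1252
    worst = max(worst, len(ch.encode("utf-8")))
print(worst)  # no cp1252 character needs more than this many UTF-8 bytes
```

Since one cp1252 byte never expands to more than three UTF-8 bytes, a field three times as wide cannot overflow.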

Going back to omega, in UTF-8 it is actually stored as the two bytes CE A9, but fortunately you don't need to know that. What you do need to know is that the number of bytes used for a character varies. That is why we introduced the char.* string functions in V16: they let you ignore this problem, except that field widths are always defined in bytes, not characters. Statistics also uses UTF-16, which is a different encoding of the same character values, in the frontend, but this is transparent. That means that any spv file can be opened in any locale or Unicode mode and will display correctly.
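The gap between counting bytes and counting characters can be sketched in Python (illustration only; Python string slicing stands in for character-based functions, and byte slicing for byte-based ones). The sample string and variable names are made up.

```python
text = "\u03a9mega"  # Greek capital omega followed by "mega"
encoded = text.encode("utf-8")

char_count = len(text)     # length in characters
byte_count = len(encoded)  # length in bytes, which is what a field width measures
omega_hex = "\u03a9".encode("utf-8").hex().upper()

# Taking the first character is safe; taking the first byte splits omega in half.
first_char = text[:1]
first_byte = encoded[:1].decode("utf-8", errors="replace")

print(char_count, byte_count, omega_hex)
print(first_char == "\u03a9", first_byte == "\ufffd")
```

The byte-based cut yields the Unicode replacement character because half of a two-byte sequence is not a valid character on its own.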

Boiling this down, here's what you need to know about Unicode mode:
1) Code page data files are automatically converted, but it is critical that the code page information is correct. If it isn't declared in the file, the current SPSS locale setting is used. Syntax files are also converted, but I'll skip how that works for now.
2) Use the char.* functions where they exist so that your code will work on characters regardless of the number of bytes they require.
3) If you need characters that don't fit into a single code page, you must use Unicode mode.
4) If you save a file in Unicode mode AND it includes characters outside plain ASCII, SPSS versions before 16 will not display the text correctly, although the file can still be used. Functions like upper and lower case conversion will not work correctly in that case except for 7-bit ASCII.
5) If you open a Unicode file when Statistics is in code page mode, it will convert characters to that code page, but any characters not defined in that code page are inevitably lost. So if you are using cp1252 and the file contains an omega, it will be lost. You see these lost characters as ? in the file.
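Points 1 and 5 can be sketched in Python (illustration only). The assumption here is that Python's `errors="replace"` option, which substitutes a question mark for unencodable characters, behaves like the conversion described in point 5; how Statistics does this internally is not shown. The sample text is made up.

```python
# Point 5: converting Unicode text to a code page loses what the page lacks.
unicode_text = "R = 10 \u03a9 \u00b1 5%"  # contains omega and the +/- sign
as_cp1252 = unicode_text.encode("cp1252", errors="replace")
round_trip = as_cp1252.decode("cp1252")
print(ascii(round_trip))  # omega has become ?, the +/- sign survives

# Point 1: bytes read under the wrong code page turn into the wrong characters.
stored = "\u00b1".encode("cp1252")  # +/- saved from a cp1252 session
misread = stored.decode("cp850")    # but interpreted as code page 850
print(ascii(misread))               # a shading block, not +/-
```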

So this is doubtless more than you wanted to know, but it's a cold, cruel world :-) If it is any comfort, Microsoft Office has been purely Unicode since Office 97, and the world has survived.

Regards,

Thanks, Gene Maguin
