SPSSX Discussion

insert character in string

Maguin, Eugene

insert character in string

All,

I want to add a '+/-' character (code page 850 = 241) to a concatenated string. The question is how to convert the code page value to the character. Ray Levesque and Richard Ristow have posted on this in the past, but I can't get their code to work.

I did this: STRING(241,PIB1) to attempt to create the character.

How do I do it correctly?

Thanks, Gene Maguin

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Richard Ristow

Re: insert character in string

At 02:12 PM 1/3/2013, Maguin, Eugene wrote:

>I want to add a '+/-' character (code page
>850=241) to a concatenated string. Question is
>how to convert the code page value to the character.
>
>I did this: STRING(241,PIB1) to attempt to create the character.

The code we posted assumed single-byte character
representations, which was true at the time. From
version 16 on, that's been known as 'code page'
mode. In code page mode, the posted techniques
should work to get any desired byte value as a
character. Whether 241 would be interpreted as +/-, I don't know.

But it may be that you're operating in Unicode
mode, as is available and encouraged from SPSS 16
on. (The latest SPSS I have running is v.14, so I
can't test any of the following.) In Unicode, it
appears that a character designation is always an
integer, conventionally represented in hex and
preceded by 'U+'; it looks like the character you want is U+00B1 (1).
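
A quick check of both points can be made outside SPSS. The sketch below is Python, not SPSS syntax (an assumption of this note; nothing in the thread uses Python). It relies on Python's standard `cp850` codec to show which character byte 241 stands for in code page 850 and what that character's Unicode designation is.

```python
# Decode byte 241 (hex F1) as code page 850, then look up its code point.
byte_value = 241
char = bytes([byte_value]).decode("cp850")
code_point = ord(char)

print(char)                    # the character that byte 241 represents in cp850
print(f"U+{code_point:04X}")   # its Unicode designation
```

This says nothing about how SPSS itself treats the byte; it only confirms the code page 850 mapping and the U+00B1 designation.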

Now, there are several ways of representing a
stream of Unicode characters (which, remember,
are integers) as a stream of bytes. From hints in
the documentation(2), SPSS probably uses the
representation UTF-8. UTF-8 represents characters
0-127, which exactly match ASCII codes and
characters, as single bytes, always with the
high-order bit 0. Characters numbered above 127
are represented as two or more bytes, with the
high-order bit 1 in all of them and the integer
spread across them(3). Value B1x fits into two bytes, of the form

110xxxxx 10xxxxxx

where the 'x's are numeric bits. The binary for B1x
is 10110001, which (padded to 11 bits) splits into 5 and 6 bits as
00010 110001, so the character is represented as the two bytes

11000010 10110001; in hex, C2 B1.
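
The bit-splitting can be verified mechanically. This is a minimal Python sketch (Python is an assumption here, chosen only because the arithmetic is easy to check in it); the variable names are illustrative. It builds the two bytes from the 110xxxxx 10xxxxxx pattern and compares them with Python's built-in UTF-8 encoder.

```python
code_point = 0xB1  # U+00B1, the plus-or-minus sign

# Two-byte UTF-8 form: 110xxxxx 10xxxxxx (covers U+0080 through U+07FF).
high_bits = code_point >> 6         # upper 5 of the 11 payload bits
low_bits = code_point & 0b111111    # lower 6 payload bits
first_byte = 0b11000000 | high_bits
second_byte = 0b10000000 | low_bits

encoded = bytes([first_byte, second_byte])
print(f"{first_byte:08b} {second_byte:08b}")   # the two bytes in binary
print(" ".join(f"{b:02X}" for b in encoded))   # the two bytes in hex

# Cross-check against the built-in encoder.
assert encoded == "\u00b1".encode("utf-8")
```

The result is the byte pair C2 B1.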

Now, the question is, how can you get a byte
string, in hex, into a Unicode string? I will
guess that the method that I posted in 2006
(following Raynald Levesque) would do it(4).
Here's the substance. Watch out! If you can put
an arbitrary byte string into an SPSS Unicode
string, you can enter text that isn't valid in UTF-8.

>I've adapted [Raynald Levesque's] code to make a
>crude hex-to-character converter:
>
>NEW FILE.
>DATA LIST FIXED
> / HEX_CHAR 01-20 (A).
>
>Data List will read 1 records from the command file
>
>Variable Rec Start End Format
>HEX_CHAR 1 1 20 A20
>
>BEGIN DATA
>5261796e616c64
>END DATA.
>
>STRING CHAR(A20).
>LOOP #POS = 01 TO 99 BY 2
> IF #POS LE LENGTH(HEX_CHAR) - 1.
>. NUMERIC #HEX_HI
> #HEX_LO (F2).
>. STRING #HEX_DIG
> #ASCII (A1).
>
>. COMPUTE #HEX_DIG = LOWER(SUBSTR(HEX_CHAR,#POS, 1)).
>. COMPUTE #HEX_HI = INDEX('0123456789abcdef',#HEX_DIG).
>. DO IF #HEX_HI GT 0.
>. COMPUTE #HEX_HI = #HEX_HI - 1.
>. ELSE.
>. BREAK.
>. END IF.
>
>. COMPUTE #HEX_DIG = LOWER(SUBSTR(HEX_CHAR,#POS+1,1)).
>. COMPUTE #HEX_LO = INDEX('0123456789abcdef',#HEX_DIG).
>. DO IF #HEX_LO GT 0.
>. COMPUTE #HEX_LO = #HEX_LO - 1.
>. ELSE.
>. BREAK.
>. END IF.
>
>. COMPUTE #ASCII = STRING(16*#HEX_HI+#HEX_LO,PIB1).
>. COMPUTE CHAR = CONCAT(RTRIM(CHAR),#ASCII).
>END LOOP.
>LIST.
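
For comparison, here is the same pair-of-hex-digits logic as a Python sketch (not SPSS syntax; the function name `hex_to_bytes` is made up for this illustration). It shows what the sample data decodes to, and it illustrates the warning above: an arbitrary byte string need not be valid UTF-8.

```python
def hex_to_bytes(hex_text: str) -> bytes:
    """Convert pairs of hex digits to bytes, stopping at the first non-hex digit."""
    digits = "0123456789abcdef"
    text = hex_text.strip().lower()
    out = bytearray()
    for pos in range(0, len(text) - 1, 2):
        hi = digits.find(text[pos])
        lo = digits.find(text[pos + 1])
        if hi < 0 or lo < 0:
            break                      # same early exit as the BREAK commands
        out.append(16 * hi + lo)
    return bytes(out)

# The sample record from the posting.
sample = hex_to_bytes("5261796e616c64")
print(sample.decode("ascii"))

# The two UTF-8 bytes of U+00B1 decode to a single character.
plus_minus = hex_to_bytes("c2b1").decode("utf-8")
print(plus_minus)

# A lone byte F1 is not a complete UTF-8 sequence, so decoding fails.
try:
    hex_to_bytes("f1").decode("utf-8")
    lone_byte_is_valid = True
except UnicodeDecodeError:
    lone_byte_is_valid = False
print(lone_byte_is_valid)
```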

(By the way, Gene, I see you put me onto Raynald's code in the first place.)

I'd start out by generating the desired UTF-8
character in a string containing nothing else,
and then using SPSS string-manipulating functions
to put it into the desired string.

=====================
(1) Wikipedia article "List of Unicode characters"

(2) For example, from the v.18 Command Syntax Reference, (p.105):

"String functions that include a byte position or
count argument or return a byte position or count
may return different results in Unicode mode than
in code page mode. For example, é is one byte in
code page mode but is two bytes in Unicode mode;
so résumé is six bytes in code page mode and eight bytes in Unicode mode."

(Compare the Wikipedia article on UTF-8).

(3) Wikipedia article UTF-8

(4) SPSSX-L posting
Date: Wed, 6 Dec 2006 20:39:58 -0500
From: Richard Ristow <[hidden email]>
Subject: Re: Non-printing characters
To: [hidden email]


Richard Ristow

Re: insert character in string

In reply to this post by Maguin, Eugene

A second thought, without checking for responses to the previous posting:

Recent SPSS documentation is vague about how AHEX formats operate
when in Unicode mode. I'll take a flying guess that they operate byte
by byte, with each byte written as, or read in from, two hex digits,
without regard to whether the bytes belong to multiple-byte
characters. If that is so, an easier way to get bytes C2 and B1 into
a string (if this works) could be

STRING PlusOrMinus (A2) /* Minimum, to allow the two-byte character */
COMPUTE PlusOrMinus = STRING('C2B1',AHEX4).

By the way, it shouldn't be very hard to write SPSS code that
converts integer Unicode values into the corresponding UTF-8
single-byte or multiple-byte representations.
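
As a reference for what such a converter has to do, here is a hedged sketch of the algorithm in Python rather than SPSS syntax (the function name `utf8_encode` is hypothetical; the byte patterns are those of the UTF-8 definition). An SPSS version would need the same range tests and bit arithmetic.

```python
def utf8_encode(code_point: int) -> bytes:
    """Return the UTF-8 bytes for one Unicode code point (a sketch)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if code_point < 0x80:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),    # 11110xxx + three 10xxxxxx
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

print(utf8_encode(0xB1).hex().upper())   # bytes for the plus-or-minus sign
```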


Jon K Peck

Re: insert character in string

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621

From: Richard Ristow <[hidden email]>
To: [hidden email],
Date: 01/13/2013 05:19 PM
Subject: Re: [SPSSX-L] insert charcter in string
Sent by: "SPSSX(r) Discussion" <[hidden email]>

>A second thought, without checking for responses to the previous posting:
>Recent SPSS documentation is vague about how AHEX formats operate when in Unicode mode.

>>>AHEX works the way it always has. It shows the contents in a string field as hexadecimal numbers. Since field widths are always in bytes, the doubling rule from A to AHEX for the width still applies. Bytes per character will, of course vary from 1-3 in Utf-8 or 1-2 in code page multibyte character sets.

>I'll take a flying guess that they operate byte by byte, with each byte written as, or read in from, two hex digits, without regard to whether the bytes belong to multiple-byte characters. If that is so, an easier way to get bytes C2 and B1 into a string (if this works) could be
>STRING PlusOrMinus (A2) /* Minimum, to allow the two-byte character */
>COMPUTE PlusOrMinus = STRING('C2B1',AHEX4).

>>>I posted a detailed explanation and solution for all this some time ago. Using PIB format and the string function is what we usually recommend.

>By the way, it shouldn't be very hard to write SPSS code that converts integer Unicode values into the corresponding UTF-8 single-byte or multiple-byte representations.

>>>There is no such thing as a single byte utf-8 representation.

Regards,
Jon

Richard Ristow

Re: insert character in string

At 08:53 PM 1/13/2013, Jon K Peck wrote:

>>Recent SPSS documentation is vague about how AHEX formats operate
>>when in Unicode mode.
>AHEX works the way it always has. It shows the contents in a string
>field as hexadecimal numbers. Since field widths are always in
>bytes, the doubling rule from A to AHEX for the width still
>applies. Bytes per character will, of course vary from 1-3 in Utf-8
>or 1-2 in code page multibyte character sets.

Thank you. So AHEX represents the *bytes* of a character string in
hex, whether (as formerly) those correspond one-to-one to characters,
or (in Unicode) there may be more than one byte for a single character.
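
The byte-versus-character distinction is easy to see with the documentation's own example word. This Python sketch (not SPSS; `cp1252` is an assumed stand-in for a single-byte Western code page, and `hex_view` is a hypothetical helper) prints byte counts and a hex dump of the bytes, two hex digits per byte.

```python
word = "r\u00e9sum\u00e9"   # "resume" with both accented e's, precomposed

code_page_bytes = word.encode("cp1252")   # one byte per character
utf8_bytes = word.encode("utf-8")         # accented e takes two bytes

def hex_view(data: bytes) -> str:
    """Two hex digits per byte, so the display is twice the byte width."""
    return data.hex().upper()

print(len(code_page_bytes), hex_view(code_page_bytes))
print(len(utf8_bytes), hex_view(utf8_bytes))
```

The word has six characters either way; only the byte count, and therefore the hex width, changes between the two encodings.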

>>STRING PlusOrMinus (A2) /* Minimum, to allow the two-byte character */
>>COMPUTE PlusOrMinus = STRING('C2B1',AHEX4).
>I posted a detailed explanation and solution for all this some time
>ago. Using PIB format and the string function is what we usually recommend.

So, Gene, see my first posting in response to your question. Or, try
the above anyway, and see if it works -- it is, at any rate, more compact code.

Jon noted,
>There is no such thing as a single byte utf-8 representation.

Have I missed something? Jon's posting says that there are some
single-byte representations ("Bytes per character will vary from 1-3
in Utf-8"), and I understand the same from the Wikipedia article on UTF-8:

"One-byte codes are used for the ASCII values 0 through 127. In this
case the UTF-8 code has the same value as the ASCII code. The
high-order bit of these codes is always 0."
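
The quoted statement can be confirmed with a short Python sketch (Python is an assumption of this note, and the variable names are illustrative): every code point from 0 through 127 encodes to exactly one byte with the high-order bit clear, and the first code point past that range already needs two bytes.

```python
# Encode every ASCII code point and collect the distinct byte lengths.
ascii_lengths = {len(chr(cp).encode("utf-8")) for cp in range(128)}

# For ASCII, the single UTF-8 byte equals the ASCII code itself.
same_as_ascii = all(chr(cp).encode("utf-8") == bytes([cp]) for cp in range(128))

# Code point 128 is the first one that needs a second byte.
first_multibyte = chr(128).encode("utf-8")

print(ascii_lengths, same_as_ascii, len(first_multibyte))
```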


Jon K Peck

Re: insert character in string

What I meant by ...

>Jon noted,
>>There is no such thing as a single byte utf-8 representation.
>Have I missed something? Jon's posting says that there are some single-byte representations ("Bytes per character will vary from 1-3 in Utf-8"), and I understand the same from the Wikipedia article on UTF-8:

...
was not that some characters don't fit in one byte but that utf-8 is intrinsically multibyte.

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621

From: Richard Ristow <[hidden email]>
To: [hidden email],
Date: 01/15/2013 12:14 PM
Subject: Re: [SPSSX-L] insert character in string
Sent by: "SPSSX(r) Discussion" <[hidden email]>
