All,
I want to add a '+/-' character (code page 850 = 241) to a concatenated string. The question is how to convert the code page value to the character. Ray Levesque and Richard Ristow have posted on this in the past, but I can't get their code to work. I did this: STRING(241,PIB1) to attempt to create the character. How do I do it correctly?
Thanks, Gene Maguin
=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
At 02:12 PM 1/3/2013, Maguin, Eugene wrote:
>I want to add a '+/-' character (code page 850=241) to a concatenated string.
>Question is how to convert the code page value to the character.
>
>I did this: STRING(241,PIB1) to attempt to create the character.

The code we posted assumed single-byte character representations, which was true at the time. From version 16 on, that's been known as 'code page' mode. In code page mode, the posted techniques should work to get any desired byte value as a character. Whether 241 would be interpreted as +/-, I don't know.

But it may be that you're operating in Unicode mode, as is available and encouraged from SPSS 16 on. (The latest SPSS I have running is v.14, so I can't test any of the following.)

In Unicode, it appears that a character designation is always an integer, conventionally represented in hex and preceded by 'U+'; it looks like the character you want is U+00B1 (1).

Now, there are several ways of representing a stream of Unicode characters (which, remember, are integers) as a stream of bytes. From hints in the documentation(2), SPSS probably uses the representation UTF-8.

UTF-8 represents characters 0-127, which exactly match ASCII codes and characters, as single bytes, always with the high-order bit 0. Characters numbered above 127 are represented as two or more bytes, with the high-order bit 1 in all of them and the integer spread across them(3). The value B1 (hex) fits into two bytes, of the form 110xxxxx 10xxxxxx, where the 'x's are numeric bits. The binary for hex B1 is 10110001, which can be split into 5 and 6 bits as 00010 110001, so the character is represented as the two bytes 11000010 10110001; in hex, C2 B1.

Now, the question is, how can you get a byte string, in hex, into a Unicode string? I will guess that the method that I posted in 2006 (following Raynald Levesque) would do it(4). Here's the substance.

Watch out! If you can put an arbitrary byte string into an SPSS Unicode string, you can enter text that isn't valid in UTF-8.
>I've adapted [Raynald Levesque's] code to make a
>crude hex-to-character converter:
>
>NEW FILE.
>DATA LIST FIXED
>   / HEX_CHAR 01-20 (A).
>
>Data List will read 1 records from the command file
>
>Variable  Rec  Start  End  Format
>HEX_CHAR    1      1   20  A20
>
>BEGIN DATA
>5261796e616c64
>END DATA.
>
>STRING CHAR(A20).
>LOOP #POS = 01 TO 99 BY 2
>   IF #POS LE LENGTH(HEX_CHAR) - 1.
>.  NUMERIC #HEX_HI
>           #HEX_LO  (F2).
>.  STRING  #HEX_DIG
>           #ASCII   (A1).
>
>.  COMPUTE #HEX_DIG = LOWER(SUBSTR(HEX_CHAR,#POS, 1)).
>.  COMPUTE #HEX_HI  = INDEX('0123456789abcdef',#HEX_DIG).
>.  DO IF   #HEX_HI GT 0.
>.     COMPUTE #HEX_HI = #HEX_HI - 1.
>.  ELSE.
>.     BREAK.
>.  END IF.
>
>.  COMPUTE #HEX_DIG = LOWER(SUBSTR(HEX_CHAR,#POS+1,1)).
>.  COMPUTE #HEX_LO  = INDEX('0123456789abcdef',#HEX_DIG).
>.  DO IF   #HEX_LO GT 0.
>.     COMPUTE #HEX_LO = #HEX_LO - 1.
>.  ELSE.
>.     BREAK.
>.  END IF.
>
>.  COMPUTE #ASCII = STRING(16*#HEX_HI+#HEX_LO,PIB1).
>.  COMPUTE CHAR   = CONCAT(RTRIM(CHAR),#ASCII).
>END LOOP.
>LIST.

(By the way, Gene, I see you put me onto Raynald's code in the first place.)

I'd start out by generating the desired UTF-8 character in a string containing nothing else, and then using SPSS string-manipulating functions to put it into the desired string.
=====================
(1) Wikipedia article "List of Unicode characters"
(2) For example, from the v.18 Command Syntax Reference (p.105): "String functions that include a byte position or count argument or return a byte position or count may return different results in Unicode mode than in code page mode. For example, é is one byte in code page mode but is two bytes in Unicode mode; so résumé is six bytes in code page mode and eight bytes in Unicode mode." (Compare the Wikipedia article on UTF-8.)
(3) Wikipedia article "UTF-8"
(4) SPSSX-L posting. Date: Wed, 6 Dec 2006 20:39:58 -0500; From: Richard Ristow <[hidden email]>; Subject: Re: Non-printing characters; To: [hidden email]
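For readers who want to check the byte arithmetic outside SPSS, the hex-pair loop in the quoted converter can be mirrored in a few lines of Python. This is only a sketch under stated assumptions: the function name hex_to_bytes is hypothetical, and the code says nothing about how SPSS itself stores strings in either mode. It shows the same two-digits-per-byte logic, and lets you confirm that the bytes C2 B1 decode as the plus-or-minus sign in UTF-8 and that byte 241 is the plus-or-minus sign in code page 850.

```python
HEX_DIGITS = "0123456789abcdef"


def hex_to_bytes(hex_text):
    """Turn pairs of hex digits into bytes (hypothetical helper).

    Like the SPSS loop, it stops at the first character that is not a
    hex digit and ignores a trailing unpaired digit.
    """
    out = bytearray()
    text = hex_text.strip().lower()
    for pos in range(0, len(text) - 1, 2):
        hi = HEX_DIGITS.find(text[pos])      # -1 if not a hex digit
        lo = HEX_DIGITS.find(text[pos + 1])
        if hi < 0 or lo < 0:
            break
        out.append(16 * hi + lo)
    return bytes(out)


# The test string from the quoted SPSS job.
print(hex_to_bytes("5261796e616c64").decode("ascii"))        # Raynald

# U+00B1 as two UTF-8 bytes, and as one byte in code page 850.
print(hex_to_bytes("C2B1").decode("utf-8") == "\u00b1")      # True
print(bytes([241]).decode("cp850") == "\u00b1")              # True
```

The single byte F1 (decimal 241) is only the plus-or-minus sign under code page 850; in a UTF-8 string the same character needs the two bytes C2 B1.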
In reply to this post by Maguin, Eugene
A second thought, without checking for responses to the previous posting:
Recent SPSS documentation is vague about how AHEX formats operate when in Unicode mode. I'll take a flying guess that they operate byte by byte, with each byte written as, or read in from, two hex digits, without regard to whether the bytes belong to multiple-byte characters. If that is so, an easier way to get bytes C2 and B1 into a string (if this works) could be

STRING PlusOrMinus (A2). /* Minimum, to allow the two-byte character */
COMPUTE PlusOrMinus = STRING('C2B1',AHEX4).

By the way, it shouldn't be very hard to write SPSS code that converts integer Unicode values into the corresponding UTF-8 single-byte or multiple-byte representations.
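The conversion from an integer Unicode value to UTF-8 bytes mentioned above is short in any language. The following Python sketch is not SPSS code and is not anything posted to the list; the function name utf8_encode is hypothetical, and it simply follows the bit layout described in the Wikipedia UTF-8 article (leading byte 0xxxxxxx, 110xxxxx, 1110xxxx or 11110xxx, continuation bytes 10xxxxxx).

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 bytes (hypothetical helper)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:                       # 7 bits: one byte, same as ASCII
        return bytes([code_point])
    if code_point < 0x800:                      # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                    # up to 16 bits: three bytes
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),    # up to 21 bits: four bytes
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


print(utf8_encode(0x00B1).hex().upper())        # C2B1
```

Python's built-in chr(0xB1).encode("utf-8") gives the same bytes; the hand-written version is only there to make the 5-bit and 6-bit split visible.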
Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621

From: Richard Ristow <[hidden email]>
To: [hidden email]
Date: 01/13/2013 05:19 PM
Subject: Re: [SPSSX-L] insert charcter in string
Sent by: "SPSSX(r) Discussion" <[hidden email]>

A second thought, without checking for responses to the previous posting:

Recent SPSS documentation is vague about how AHEX formats operate when in Unicode mode.

>>>AHEX works the way it always has. It shows the contents in a string field as hexadecimal numbers. Since field widths are always in bytes, the doubling rule from A to AHEX for the width still applies. Bytes per character will, of course, vary from 1-3 in Utf-8 or 1-2 in code page multibyte character sets.

I'll take a flying guess that they operate byte by byte, with each byte written as, or read in from, two hex digits, without regard to whether the bytes belong to multiple-byte characters. If that is so, an easier way to get bytes C2 and B1 into a string (if this works) could be

STRING PlusOrMinus (A2). /* Minimum, to allow the two-byte character */
COMPUTE PlusOrMinus = STRING('C2B1',AHEX4).

>>>I posted a detailed explanation and solution for all this some time ago. Using PIB format and the string function is what we usually recommend.

By the way, it shouldn't be very hard to write SPSS code that converts integer Unicode values into the corresponding UTF-8 single-byte or multiple-byte representations.

>>>There is no such thing as a single byte utf-8 representation.

Regards,
Jon
At 08:53 PM 1/13/2013, Jon K Peck wrote:
>>Recent SPSS documentation is vague about how AHEX formats operate
>>when in Unicode mode.

>AHEX works the way it always has. It shows the contents in a string
>field as hexadecimal numbers. Since field widths are always in
>bytes, the doubling rule from A to AHEX for the width still
>applies. Bytes per character will, of course vary from 1-3 in Utf-8
>or 1-2 in code page multibyte character sets.

Thank you. So AHEX represents the *bytes* of a character string in hex, whether (as formerly) those correspond one-to-one to characters, or (in Unicode) there may be more than one byte for a single character.

>>STRING PlusOrMinus (A2). /* Minimum, to allow the two-byte character */
>>COMPUTE PlusOrMinus = STRING('C2B1',AHEX4).

>I posted a detailed explanation and solution for all this some time
>ago. Using PIB format and the string function is what we usually recommend.

So, Gene, see my first posting in response to your question. Or, try the above anyway, and see if it works -- it is, at any rate, more compact code.

Jon noted,

>There is no such thing as a single byte utf-8 representation.

Have I missed something? Jon's posting says that there are some single-byte representations ("Bytes per character will vary from 1-3 in Utf-8"), and I understand the same from the Wikipedia article on UTF-8: "One-byte codes are used for the ASCII values 0 through 127. In this case the UTF-8 code has the same value as the ASCII code. The high-order bit of these codes is always 0."
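The relationship between characters, bytes and hex width discussed here can be checked outside SPSS. The Python sketch below is an illustration only: ahex_like is a hypothetical helper that imitates the "two hex digits per byte" rule described for AHEX, and Latin-1 stands in for a single-byte code page. It does not call or reproduce SPSS's own AHEX format.

```python
def ahex_like(text, encoding):
    """Hex view of the bytes of `text` in `encoding`, two digits per byte
    (hypothetical helper imitating the A-to-AHEX width doubling)."""
    return text.encode(encoding).hex().upper()


resume = "r\u00e9sum\u00e9"          # the documentation's example word

# Six bytes in a single-byte code page, eight bytes in UTF-8.
print(len(resume.encode("latin-1")), len(resume.encode("utf-8")))   # 6 8

# The hex view is always twice as wide as the byte count.
print(ahex_like(resume, "utf-8"))    # 72C3A973756DC3A9

# An ASCII character is a single byte in UTF-8; U+00B1 takes two.
print(ahex_like("A", "utf-8"))       # 41
print(ahex_like("\u00b1", "utf-8"))  # C2B1
```

The output agrees with both statements quoted in this message: ASCII characters occupy one byte each in UTF-8, while the encoding as a whole is variable-width.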
What I meant by ...

>Jon noted,
>>There is no such thing as a single byte utf-8 representation.
>Have I missed something? Jon's posting says that there are some
>single-byte representations ("Bytes per character will vary from 1-3
>in Utf-8"), and I understand the same from the Wikipedia article on UTF-8:

... was not that some characters don't fit in one byte but that utf-8 is intrinsically multibyte.

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621

In reply to: Richard Ristow <[hidden email]>, 01/15/2013 12:14 PM, Subject: Re: [SPSSX-L] insert character in string