Subject: Re: [SPSSX-L] Odd, very odd, something. Info correction.
To: "Albert-Jan Roskam" <[hidden email]>
Cc: [hidden email]
Date: Friday, January 10, 2014, 5:08 PM

With over 100,000 characters in Unicode, why scrimp on dashes?

===> Good thing only normal hyphens are allowed in URLs (http://tools.ietf.org/html/rfc3986#page-13). E.g. "Visit our page at www dot foo 'mongolian todo soft hyphen' bar dot com" sounds *very* awkward. ;-) It would also make URLs very vulnerable to hacking attempts.

From: Albert-Jan Roskam <[hidden email]>
To: [hidden email], Jon K Peck/Chicago/IBM@IBMUS,
Date: 01/10/2014 08:30 AM
Subject: Re: [SPSSX-L] Odd, very odd, something. Info correction.

So many "high" dashes! Why not just have one and only one.

Code     Name
U+002D   hyphen-minus
U+007E   tilde (when used as swung dash)
U+058A   armenian hyphen
U+05BE   hebrew punctuation maqaf
U+1400   canadian syllabics hyphen
U+1806   mongolian todo soft hyphen
U+2010   hyphen
U+2011   non-breaking hyphen
U+2012   figure dash
U+2013   en dash
U+2014   em dash
U+2015   horizontal bar (=quotation dash)
U+2053   swung dash
U+207B   superscript minus
U+208B   subscript minus
U+2212   minus sign
U+2E17   double oblique hyphen
U+301C   wave dash
U+3030   wavy dash
U+30A0   katakana-hiragana double hyphen
U+FE31   presentation form for vertical em dash
U+FE32   presentation form for vertical en dash
U+FE58   small em dash
U+FE63   small hyphen-minus
U+FF0D   fullwidth hyphen-minus

source: http://www.unicode.org/versions/Unicode6.3.0/ch06.pdf, p 196.

Regards,
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--------------------------------------------
On Thu, 1/9/14, Jon K Peck <[hidden email]> wrote:

Subject: Re: [SPSSX-L] Odd, very odd, something. Info correction.
To: [hidden email]
Date: Thursday, January 9, 2014, 1:41 AM

I believe that you had a non-ascii dash, which is two bytes in Unicode, and the logic of your code would only work if each character, including the dash, is one byte, so the result is an invalid utf-8 character.

If any of the input fields can also contain accented or other non-ascii characters, the situation will be even worse.

When you retyped the RANGE string, you apparently got an ascii dash.

It is important for people to stop assuming that a byte is a character and to use the char.* functions that Statistics has provided since V16. And avoid left hand side substr.

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621

From: "Maguin, Eugene" <[hidden email]>
To: [hidden email],
Date: 01/08/2014 02:45 PM
Subject: Re: [SPSSX-L] Odd, very odd, something. Info correction.
Sent by: "SPSSX(r) Discussion" <[hidden email]>

I just looked at the edit-options-general box and the two options in character encoding section are both grayed out but the Unicode circle is bulleted. So perhaps I am really running in Unicode and didn’t realize it.

I retyped the line COMPUTE RANGE='     -     '. And re-ran the section and no diamonds, just dashes. Even if I start over, I can’t reproduce the problem. So: FWIW.

Gene Maguin

From: Jon K Peck [mailto:[hidden email]]
Sent: Wednesday, January 08, 2014 3:56 PM
To: Maguin, Eugene
Cc: [hidden email]
Subject: Re: [SPSSX-L] Odd, very odd, something

The question mark indicates that you have an unprintable character in that location. If you are not in Unicode mode and using a western code page such as the usual cp1252, there are only a few such character slots. Please post some code that shows this behavior.

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621

From: "Maguin, Eugene" <[hidden email]>
To: [hidden email],
Date: 01/08/2014 01:41 PM
Subject: [SPSSX-L] Odd, very odd, something
Sent by: "SPSSX(r) Discussion" <[hidden email]>

Given an A11 variable, initially defined to be '.....-.....', where a dot is a space, I replace using the SUBSTR function the 5 spaces on either side of the dash character, '-', with a 5 character string such as 'Jan08'. The result in the data window shows a black diamond shaped character with an embedded, white question mark character. An example is 'Jan08�Dec10'.

So, naturally, the question is what is going on? And how can it be fixed so that the dash character shows instead of the diamond character?

If it matters: 21, fully patched, not Unicode.

Thanks, Gene Maguin

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command INFO REFCARD
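Jon's explanation of the diamond (a multi-byte dash overwritten by code that assumes one byte per character) can be reproduced outside Statistics. The following is only a rough Python 3 analogy, not SPSS syntax: the helper names `replace_by_bytes` and `replace_by_chars` are made up for illustration, and the choice of an en dash (U+2013, which happens to take three bytes in UTF-8) is an assumption, since the thread never establishes which dash was actually typed.

```python
# Rough analogy only: Python 3, not SPSS. Helper names are hypothetical.

def replace_by_bytes(value: str, start: int, new: str) -> bytes:
    """Overwrite bytes of the UTF-8 encoding, starting at a 0-based BYTE offset."""
    raw = bytearray(value.encode("utf-8"))
    patch = new.encode("utf-8")
    raw[start:start + len(patch)] = patch
    return bytes(raw)


def replace_by_chars(value: str, start: int, new: str) -> str:
    """Overwrite characters, starting at a 0-based CHARACTER offset."""
    chars = list(value)
    chars[start:start + len(new)] = new
    return "".join(chars)


EN_DASH = "\u2013"                      # assumed non-ASCII dash
template = " " * 5 + EN_DASH + " " * 5  # 11 characters, but 13 bytes in UTF-8

# Byte-oriented logic: "the second field starts at position 6".
step1 = replace_by_bytes(template, 0, "Jan08").decode("utf-8")
broken_bytes = replace_by_bytes(step1, 6, "Dec10")
# The second write clobbers two of the dash's three bytes, leaving a lone lead byte.
broken_text = broken_bytes.decode("utf-8", errors="replace")

# Character-oriented logic: the dash survives intact.
fixed_text = replace_by_chars(replace_by_chars(template, 0, "Jan08"), 6, "Dec10")

print(repr(broken_bytes))
print(broken_text)
print(fixed_text)
```

The byte-oriented version leaves an orphaned UTF-8 lead byte where the dash was, which a display layer typically renders as the replacement character (the black diamond with a question mark). The character-oriented version plays the role of the char.* functions Jon recommends.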
However, non-ASCII characters, including
these hyphen variations, can be used in URLs. They have to be
percent-encoded in hex form, and they can be misleading because,
depending on the browser, they may be displayed in their character form.
Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621
--------------------------------------------
On Fri, 1/10/14, Jon K Peck <[hidden email]> wrote:

Subject: Re: [SPSSX-L] Odd, very odd, something. Info correction.
To: [hidden email]
Date: Friday, January 10, 2014, 6:29 PM

However, non-ASCII characters, including these hyphen variations, can be used in URLs. They have to be percent-encoded in hex form, and they can be misleading because, depending on the browser, they may be displayed in their character form.

===> You're right! www.foo%E2%80%90bar.com and www.foo%D6%8Abar.com look identical in Firefox. Impossible to tell the difference.

>>> url_hyphen = u"www.foo\u2010bar.com"
>>> url_armenian = u"www.foo\u058Abar.com"
>>> print url_hyphen, url_armenian
www.foo‐bar.com www.foo֊bar.com
>>> import urllib
>>> print urllib.quote(url_hyphen.encode("utf-8"))
www.foo%E2%80%90bar.com
>>> print urllib.quote(url_armenian.encode("utf-8"))
www.foo%D6%8Abar.com
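The interactive session above uses Python 2 syntax (the `print` statement and `urllib.quote`). As a hedged sketch, the same check might look like this in Python 3, where the function lives in `urllib.parse`. The helper `non_ascii_names` is hypothetical and added only for illustration. It uses the standard `unicodedata` module to name each non-ASCII character, which is one way to tell the two look-alike addresses apart when the eye cannot.

```python
import unicodedata
from urllib.parse import quote

url_hyphen = "www.foo\u2010bar.com"    # U+2010 hyphen
url_armenian = "www.foo\u058Abar.com"  # U+058A armenian hyphen


def non_ascii_names(text: str) -> list:
    """Return (code point, Unicode name) for every non-ASCII character in text.

    Hypothetical helper for illustration; not part of the original thread.
    """
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if ord(ch) > 127
    ]


# quote() encodes str input as UTF-8 by default before percent-encoding.
quoted_hyphen = quote(url_hyphen)
quoted_armenian = quote(url_armenian)

print(quoted_hyphen)
print(quoted_armenian)
print(non_ascii_names(url_hyphen))
print(non_ascii_names(url_armenian))
```

The percent-encoded forms differ even when the rendered strings look the same, so comparing the encoded text or the character names is more reliable than comparing what the browser displays.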