Odd, very odd, something

Re: Odd, very odd, something. Info correction.

Albert-Jan Roskam
 Subject: Re: [SPSSX-L] Odd, very odd, something. Info correction.
 To: "Albert-Jan Roskam" <[hidden email]>
 Cc: [hidden email]
 Date: Friday, January 10, 2014, 5:08 PM

 With over 100,000 characters in Unicode, why scrimp on dashes?


 ===> Good thing only normal hyphens are allowed in URLs (http://tools.ietf.org/html/rfc3986#page-13). E.g. "Visit our page at www dot foo 'mongolian todo soft hyphen' bar dot com" sounds *very* awkward. ;-) It would also make URLs very vulnerable to hacking attempts.
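For reference, here is a minimal Python 3 sketch of what the RFC's "unreserved" set (section 2.3) allows to appear literally. The helper name `is_unreserved` is made up for illustration; only the character set itself comes from the RFC.

```python
import string

# RFC 3986, section 2.3: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = frozenset(string.ascii_letters + string.digits + "-._~")


def is_unreserved(ch):
    """Return True if ch is in the RFC 3986 unreserved set (hypothetical helper)."""
    return ch in UNRESERVED


# ASCII hyphen-minus, Unicode HYPHEN, MONGOLIAN TODO SOFT HYPHEN
for ch in ["-", "\u2010", "\u1806"]:
    print("U+%04X" % ord(ch), is_unreserved(ch))
```

Only the ASCII hyphen-minus passes this check; anything else has to be encoded in some way before it can travel in a URI.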










 From: Albert-Jan Roskam <[hidden email]>
 To: [hidden email], Jon K Peck/Chicago/IBM@IBMUS
 Date: 01/10/2014 08:30 AM
 Subject: Re: [SPSSX-L] Odd, very odd, something. Info correction.

 So many "high" dashes! Why not just have one and only one?

 Code    Name
 U+002D  hyphen-minus
 U+007E  tilde (when used as swung dash)
 U+058A  armenian hyphen
 U+05BE  hebrew punctuation maqaf
 U+1400  canadian syllabics hyphen
 U+1806  mongolian todo soft hyphen
 U+2010  hyphen
 U+2011  non-breaking hyphen
 U+2012  figure dash
 U+2013  en dash
 U+2014  em dash
 U+2015  horizontal bar (=quotation dash)
 U+2053  swung dash
 U+207B  superscript minus
 U+208B  subscript minus
 U+2212  minus sign
 U+2E17  double oblique hyphen
 U+301C  wave dash
 U+3030  wavy dash
 U+30A0  katakana-hiragana double hyphen
 U+FE31  presentation form for vertical em dash
 U+FE32  presentation form for vertical en dash
 U+FE58  small em dash
 U+FE63  small hyphen-minus
 U+FF0D  fullwidth hyphen-minus

 source: http://www.unicode.org/versions/Unicode6.3.0/ch06.pdf, p 196.
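One practical use of such a list is to fold the look-alikes back to the ASCII hyphen-minus before processing text. This is a rough Python 3 sketch, not anything the list members posted. `normalize_dashes` is a hypothetical helper, and leaving the tilde and the swung/wave dashes untouched is an editorial assumption.

```python
import unicodedata

# Code points copied from the table above.
TABLE = [0x002D, 0x007E, 0x058A, 0x05BE, 0x1400, 0x1806, 0x2010, 0x2011,
         0x2012, 0x2013, 0x2014, 0x2015, 0x2053, 0x207B, 0x208B, 0x2212,
         0x2E17, 0x301C, 0x3030, 0x30A0, 0xFE31, 0xFE32, 0xFE58, 0xFE63,
         0xFF0D]

# Assumption: tilde and the swung/wave dashes are not meant as hyphens.
KEEP_AS_IS = {0x007E, 0x2053, 0x301C, 0x3030}

# Translation table for str.translate: code point -> replacement string.
TO_ASCII_HYPHEN = {cp: "-" for cp in TABLE if cp not in KEEP_AS_IS}


def normalize_dashes(text):
    """Replace dash look-alikes with U+002D (hypothetical helper)."""
    return text.translate(TO_ASCII_HYPHEN)


# Print the official Unicode name of every code point in the table.
for cp in TABLE:
    print("U+%04X %s" % (cp, unicodedata.name(chr(cp))))
```

For example, `normalize_dashes("Jan08\u2013Dec10")` turns the en dash into a plain hyphen-minus.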



 Regards,
 Albert-Jan

 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 --------------------------------------------
 On Thu, 1/9/14, Jon K Peck <[hidden email]> wrote:

  Subject: Re: [SPSSX-L] Odd, very odd, something. Info correction.
  To: [hidden email]
  Date: Thursday, January 9, 2014, 1:41 AM

  I believe that you had a non-ASCII dash, which takes more than one byte in UTF-8 (Unicode mode), and the logic of your code would only work if each character, including the dash, is one byte, so the result is an invalid UTF-8 sequence. If any of the input fields can also contain accented or other non-ASCII characters, the situation will be even worse.

  When you retyped the RANGE string, you apparently got an ASCII dash.

  It is important for people to stop assuming that a byte is a character and to use the char.* functions that Statistics has provided since V16. And avoid SUBSTR on the left-hand side of an assignment.
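The byte-versus-character effect described here can be reproduced outside Statistics. The following is only a Python 3 sketch under stated assumptions, not SPSS syntax: it assumes the template held an en dash (U+2013), which the thread never confirms, and it imitates byte-position replacement with slices on a `bytearray`.

```python
# Assumption: the dash was an en dash (U+2013); the thread does not say which
# non-ASCII dash was actually involved.
template = " " * 5 + "\u2013" + " " * 5      # 11 characters

# Byte-wise replacement, as if every character were one byte.
raw = bytearray(template.encode("utf-8"))    # 13 bytes: the dash takes 3
raw[0:5] = b"Jan08"                          # byte positions 1-5 (1-based)
raw[6:11] = b"Dec10"                         # byte positions 7-11, cuts into the dash
byte_result = raw.decode("utf-8", errors="replace")

# Character-wise replacement, the safe variant.
chars = list(template)
chars[0:5] = "Jan08"                         # character positions 1-5
chars[6:11] = "Dec10"                        # character positions 7-11
char_result = "".join(chars)

print(repr(byte_result))
print(repr(char_result))
```

The byte-wise version leaves a lone lead byte of the dash behind, which a UTF-8 decoder can only show as the replacement character U+FFFD (the black diamond with a question mark). The character-wise version keeps the dash intact.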











  Jon Peck (no "h") aka Kim
  Senior Software Engineer, IBM
  [hidden email]
  phone: 720-342-5621

  From: "Maguin, Eugene" <[hidden email]>
  To: [hidden email]
  Date: 01/08/2014 02:45 PM
  Subject: Re: [SPSSX-L] Odd, very odd, something. Info correction.
  Sent by: "SPSSX(r) Discussion" <[hidden email]>

  I just looked at the edit-options-general box and the two options in the character encoding section are both grayed out, but the Unicode circle is bulleted. So perhaps I am really running in Unicode and didn’t realize it.

  I retyped the line COMPUTE RANGE='     -     '. And re-ran the section and no diamonds, just dashes. Even if I start over, I can’t reproduce the problem. So: FWIW.

  Gene Maguin

  From: Jon K Peck [mailto:[hidden email]]
  Sent: Wednesday, January 08, 2014 3:56 PM
  To: Maguin, Eugene
  Cc: [hidden email]
  Subject: Re: [SPSSX-L] Odd, very odd, something

  The question mark indicates that you have an unprintable character in that location. If you are not in Unicode mode and are using a western code page such as the usual cp1252, there are only a few such character slots. Please post some code that shows this behavior.
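As a rough illustration of how few such slots there are, this Python 3 sketch asks Python's own cp1252 codec which single-byte values it cannot decode. Whether these are exactly the slots meant above is an assumption. The check also says nothing about control characters, which decode fine but do not print.

```python
# Collect every byte value that Python's cp1252 codec refuses to decode.
undefined = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(value)

print(["0x%02X" % v for v in undefined])
```

With CPython's codec this comes out as 0x81, 0x8D, 0x8F, 0x90 and 0x9D. An ASCII hyphen-minus (0x2D) is never among them, which fits the later finding that the session was in Unicode mode after all.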













  Jon Peck (no "h") aka Kim
  Senior Software Engineer, IBM
  [hidden email]
  phone: 720-342-5621

  From: "Maguin, Eugene" <[hidden email]>
  To: [hidden email]
  Date: 01/08/2014 01:41 PM
  Subject: [SPSSX-L] Odd, very odd, something
  Sent by: "SPSSX(r) Discussion" <[hidden email]>

  Given an A11 variable, initially defined to be '.....-.....', where a dot is a space, I use the SUBSTR function to replace the 5 spaces on either side of the dash character, '-', with a 5-character string such as 'Jan08'. The result in the data window shows a black diamond-shaped character with an embedded white question mark. An example is 'Jan08�Dec10'.

  So, naturally, the question is what is going on? And how can it be fixed so that the dash character shows instead of the diamond character?

  If it matters: version 21, fully patched, not Unicode.

  Thanks, Gene Maguin

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: Odd, very odd, something. Info correction.

Jon K Peck
However, non-ASCII characters, including these hyphen variations, can be used in URLs. They have to be percent-encoded in hex form, but they can be misleading, as they may be displayed in their character form, depending on the browser.


Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621




From:        Albert-Jan Roskam <[hidden email]>
To:        [hidden email],
Date:        01/10/2014 10:03 AM
Subject:        Re: [SPSSX-L] Odd, very odd, something. Info correction.
Sent by:        "SPSSX(r) Discussion" <[hidden email]>







Re: Odd, very odd, something. Info correction.

Albert-Jan Roskam
--------------------------------------------
On Fri, 1/10/14, Jon K Peck <[hidden email]> wrote:

 Subject: Re: [SPSSX-L] Odd, very odd, something. Info correction.
 To: [hidden email]
 Date: Friday, January 10, 2014, 6:29 PM

 However, non-ascii characters, including these hyphen variations, can be used in urls. They have to be encoded in hex form, but they can be misleading as they may appear displayed in their character form, depending on the browser.

===> You're right! www.foo%E2%80%90bar.com and www.foo%D6%8Abar.com look identical in Firefox. Impossible to tell the difference.

>>> url_hyphen = u"www.foo\u2010bar.com"
>>> url_armenian = u"www.foo\u058Abar.com"
>>> print url_hyphen, url_armenian
www.foo‐bar.com www.foo֊bar.com
>>> import urllib
>>> print urllib.quote(url_hyphen.encode("utf-8"))
www.foo%E2%80%90bar.com
>>> print urllib.quote(url_armenian.encode("utf-8"))
www.foo%D6%8Abar.com
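The session above is Python 2. A Python 3 sketch of the same check might look like this, with `unicodedata.name` added as one way to tell the two look-alikes apart. The host names are the same made-up examples.

```python
import unicodedata
from urllib.parse import quote

url_hyphen = "www.foo\u2010bar.com"     # U+2010 HYPHEN
url_armenian = "www.foo\u058Abar.com"   # U+058A ARMENIAN HYPHEN

for url in (url_hyphen, url_armenian):
    # quote() percent-encodes the UTF-8 bytes of every non-ASCII character.
    encoded = quote(url)
    # Name each non-ASCII character so the difference becomes visible.
    names = [unicodedata.name(ch) for ch in url if ord(ch) > 127]
    print(encoded, names)
```

The percent-encoded forms differ even when the rendered characters look alike, so comparing the encoded strings or the character names is more reliable than trusting the display.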



