|
Hi,
Thanks to Jon Peck and Richard Ristow for their advice on how to remove hidden control characters visible in Data Editor but not copyable. I have not yet been able to get Richard's approach to work. Jon's suggestion to try the REPLACE function has the appeal of simplicity if I could get a handle on the character needing to be replaced. The REPLACE function below, for example, works fine in replacing the character 'z' (or any other character) with a space. COMPUTE var=REPLACE(Old_var,'z',' '). EXECUTE. Unfortunately, my attempts to substitute various ASCII codes for 'z' (the hidden carriage return/line feed characters) have not met with success. I am using ODBC to access data from MS Access via SPSS 14. Any help is appreciated. Thanks. Kevin _________________________________________________________________ Refi Now: Rates near 39yr lows! $430,000 Mortgage for $1,399/mo - Calculate new payment http://www.lowermybills.com/lre/index.jsp?sourceid=lmb-9632-17727&moid=7581 |
|
If your data file is not too large, you could try an editor such as
Ultraedit or similar, that displays all non-standard/non-printing characters, to identify and replace or remove them first. HTH Mike -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Kevin Hynes Sent: Thursday, February 22, 2007 7:53 PM To: [hidden email] Subject: Deleting embedded control characters (CR/LF) from very long text field Hi, Thanks to Jon Peck and Richard Ristow for their advice on how to remove hidden control characters visible in Data Editor but not copyable. I have not yet been able to get Richard's approach to work. Jon's suggestion to try the REPLACE function has the appeal of simplicity if I could get a handle on the character needing to be replaced. The REPLACE function below, for example, works fine in replacing the character 'z' (or any other character) with a space. COMPUTE var=REPLACE(Old_var,'z',' '). EXECUTE. Unfortunately, my attempts to substitute various ASCII codes for 'z' (the hidden carriage return/line feed characters) have not met with success. I am using ODBC to access data from MS Access via SPSS 14. Any help is appreciated. Thanks. Kevin _________________________________________________________________ Refi Now: Rates near 39yr lows! $430,000 Mortgage for $1,399/mo - Calculate new payment http://www.lowermybills.com/lre/index.jsp?sourceid=lmb-9632-17727&moid=7 581 |
|
In reply to this post by Kevin Hynes
The easiest way to do this would be with a Python regular expression program, and the trans and extendedTransforms modules on Developer Central make this pretty easy.
But you can do this in straight (well, slightly bent) SPSS. If it survives the email transmission, the syntax below contains a CR/LF sequence in the second field, so you can just change the variable names and replacement ('x'), and you're good to go. If the email creams this, you can do this: create a text field in Access and enter a cr/lf (ctrl-enter will do this). Read that into SPSS via ODBC and copy a cell with this value (not from the edit bar, copy the cell directly), and paste it into a replace function. STRING nobadch (A8). COMPUTE nobadch = REPLACE(badch," ",'x') . -Jon Peck -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Kevin Hynes Sent: Thursday, February 22, 2007 6:53 PM To: [hidden email] Subject: [SPSSX-L] Deleting embedded control characters (CR/LF) from very long text field Hi, Thanks to Jon Peck and Richard Ristow for their advice on how to remove hidden control characters visible in Data Editor but not copyable. I have not yet been able to get Richard's approach to work. Jon's suggestion to try the REPLACE function has the appeal of simplicity if I could get a handle on the character needing to be replaced. The REPLACE function below, for example, works fine in replacing the character 'z' (or any other character) with a space. COMPUTE var=REPLACE(Old_var,'z',' '). EXECUTE. Unfortunately, my attempts to substitute various ASCII codes for 'z' (the hidden carriage return/line feed characters) have not met with success. I am using ODBC to access data from MS Access via SPSS 14. Any help is appreciated. Thanks. Kevin _________________________________________________________________ Refi Now: Rates near 39yr lows! $430,000 Mortgage for $1,399/mo - Calculate new payment http://www.lowermybills.com/lre/index.jsp?sourceid=lmb-9632-17727&moid=7581 |
|
In reply to this post by Kevin Hynes
At 04:52 PM 2/22/2007, Kevin Hynes wrote:
>Thanks to Jon Peck and Richard Ristow for their advice on how to >remove hidden control characters visible in Data Editor but not >copyable. I have not yet been able to get Richard's approach to work. >Jon's suggestion to try the REPLACE function has the appeal of >simplicity if I could get a handle on the character needing to be >replaced. What went wrong with what I posted? It was tested code; on the other hand, that was only one test run. >The REPLACE function has the appeal of simplicity REPLACE (which Jon suggested) is definitely better than using SUBSTRING (which I originally suggested), if you have SPSS 14+ and have REPLACE. I see I posted a version using REPLACE. >if I could get a handle on the character needing to be replaced. Exactly. I posted logic (using a technique from Raynald Levesque) to get those characters from the ASCII numerical equivalents. What went wrong? You'll understand that "I cannot get the approach to work" isn't very helpful for debugging. |
|
Hi all,
Sorry for the delay in responding to Richard's question below. I'm not saying that the suggestions do not work; it's just that I was not able to implement the suggested approaches. I have to confess I ultimately dodged the issue by asking IT to resolve the problem on the front end. They used the following Perl to strip the line breaks and carriage returns: $comment =~ s/\n/ /g; # replace newlines with spaces $comment =~ s/\r//g; # remove hard returns $comment =~ s/\cM//g; # delete ^M's $comment =~ s/"/'/g; # replaces double quotes with single quotes I apologize for not having saved a good SPSS example for debugging. Here is what I was working with: DO REPEAT VAR=COMMENT. COMPUTE LocHex01 = INDEX(VAR,' '). IF (LocHex01 > 0)VAR = SUBSTR(VAR,1,LocHex01-1). END REPEAT. EXECUTE. For what it's woth, I was able to utilize Jon's suggestion for copying the hidden characters from MS Access. However, when I copied the character into the apostrophes in the COMPUTE statement, the hidden characters essentially were "translated" and created a hard return in the COMPUTE statement which then produced an error in the program since the statement could not be continued across two lines without inserting a "+" for continuation. But then inserting the "+" destroyed the hard return. Unfortunately I wrote over the syntax and did not save a copy. Basically, after copying the hidden character, the program looked like the following: DO REPEAT VAR=COMMENT. COMPUTE LocHex01 = INDEX(VAR,' '). IF (LocHex01 > 0)VAR = SUBSTR(VAR,1,LocHex01-1). END REPEAT. EXECUTE. I did save a copy of 3 records which I had edited down to about 25 caharacters even though the length was 1400. "Comment" was to be the result of having removed the hard return/bad char from Junk. Listing the data produces the following: The variables are listed in the following order: LINE 1: CrnSt LINE 2: JUNK LINE 3: COMMENT CrnSt: 1 JUNK: 1. slide show COMMENT: CrnSt: 2 JUNK: 2. One suggestion COMMENT: CrnSt: 3 JUNK: 3. were clear COMMENT: Number of cases read: 3 Number of cases listed: 3 >From: Richard Ristow <[hidden email]> >To: Kevin Hynes <[hidden email]>,[hidden email] >Subject: Re: Deleting embedded control characters >Date: Fri, 23 Feb 2007 12:37:00 -0800 > >At 04:52 PM 2/22/2007, Kevin Hynes wrote: > >>Thanks to Jon Peck and Richard Ristow for their advice on how to remove >>hidden control characters visible in Data Editor but not copyable. I have >>not yet been able to get Richard's approach to work. Jon's suggestion to >>try the REPLACE function has the appeal of simplicity if I could get a >>handle on the character needing to be replaced. > >What went wrong with what I posted? It was tested code; on the other hand, >that was only one test run. > >>The REPLACE function has the appeal of simplicity > >REPLACE (which Jon suggested) is definitely better than using SUBSTRING >(which I originally suggested), if you have SPSS 14+ and have REPLACE. I >see I posted a version using REPLACE. > >>if I could get a handle on the character needing to be replaced. > >Exactly. I posted logic (using a technique from Raynald Levesque) to get >those characters from the ASCII numerical equivalents. What went wrong? >You'll understand that "I cannot get the approach to work" isn't very >helpful for debugging. > > > _________________________________________________________________ Mortgage refinance is hot 1) Rates near 30-yr lows 2) Good credit get intro-rate 4.625%* https://www2.nextag.com/goto.jsp?product=100000035&url=%2fst.jsp&tm=y&search=mortgage_text_links_88_h2a5f&s=4056&p=5117&disc=y&vers=743 |
|
At 07:54 PM 3/19/2007, Kevin Hynes wrote:
>Here is what I was working with [indents added]: > >DO REPEAT VAR=COMMENT. >. COMPUTE LocHex01 = INDEX(VAR,' '). >. IF (LocHex01 > 0)VAR = SUBSTR(VAR,1,LocHex01-1). >END REPEAT. >EXECUTE. This isn't the problem you ran into; but, I'm not sure what the DO REPEAT is there for. If I read it right, the whole structure is equivalent to . COMPUTE LocHex01 = INDEX(COMMENT,' '). . IF (LocHex01 > 0) COMMENT = SUBSTR(COMMENT,1,LocHex01-1). If you want to check for, and remove, all instances of some character, you need a LOOP construct. >I was able to utilize Jon's suggestion for copying the hidden >characters from MS Access. However, when I copied the character into >the apostrophes in the COMPUTE statement, the hidden characters >essentially were "translated" and created a hard return in the COMPUTE >statement They weren't translated. They were, and stayed, what they had been in Access. Your problem was, and is, that a return character is a return character; if you paste it into SPSS syntax, it has whatever meaning it has in the syntax. The quote marks delimit a literal string - a 'character literal'. MOST, BUT NOT ALL, characters between quotes become part of the string, not part of the syntax. It looks like CR is one of the exceptions. To repeat: the CR was not translated, in any way, shape or form. It started as a CR and ended as a CR. Since CR characters have a syntactic meaning - the end of a line of the command - even if they're within quotes, you got the result you saw. You CANNOT, it looks like, get a CR as a one-character string, or as part of a multiple-character string, by enclosing it in quotes. Of course you got >DO REPEAT VAR=COMMENT. >COMPUTE LocHex01 = INDEX(VAR,' >'). >IF (LocHex01 > 0)VAR = SUBSTR(VAR,1,LocHex01-1). >END REPEAT. >EXECUTE. Now: there are more ways to get a character into a string value, than by enclosing it in quotes. You may remember: on Wed, 31 Jan 2007 <23:27:49 -0500> in your thread, I posted a technique by Raynald Levesque, passed on by Gene Maguin, to get any character into a string value (NOT string literal, between quotes) from its ASCII numeric value. THAT is how to get a CR as an SPSS character value, rather than as part of a syntax line. I won't repeat from that post; you should be able to look it up OK. For numeric values of ASCII characters, see many sources, including http://www.asciitable.com/. CR is 13d, or 0Ex. If you're doing OK having the data pre-processed before you read it into SPSS, fine. But the problem can be solved in SPSS. Python isn't even needed, and I'm not sure it would even make things easier. You may be making this hard on yourself, by not using the information you have. If you do want to pursue an SPSS solution, the 31 January posting should give you what you need. Remember, there are two parts: A. Get a list of the characters you want to remove, as an SPSS string value (NOT an SPSS string literal) B. With a LOOP construct, search for, and remove or replace, every character on that list. The 31 January posting uses a loop with SUBSTR. On Thu, 1 Feb 2007 <11:29:33 -0500> I posted a simpler variation, using REPLACE instead of SUBSTR. -Best regards and good luck, Richard |
|
I had the same annoying problem and I wrote this macro for it. It's
actually the other way around. It does not delete the hidden characters but it keeps the 'normal' characters which you can define yourself. When a not not normal character is found it will bij replaced by a space. The macro: DEFINE !DelENTER (!POS !CHAREND('/')) !DO !Var !IN (!1) LOOP #i=1 TO LENGTH(RTRIM(!Var)) . IF (RANGE(SUBSTR(!Var,#i,1),'0','9','a','z','A','Z')~=1 AND ANY(SUBSTR(!Var,#i,1),'&','%',';',',',':','-','/','?','#','*','(',')','$ ','%','\','+','=','.')~=1) SUBSTR(!Var,#i,1)="". END LOOP. EXE. !DOEND. !ENDDEFINE . Call the macro for v1,v2 ... with: !DelENTER v1 v2 / . Hope this helps, Max -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Richard Ristow Sent: dinsdag 20 maart 2007 6:25 To: [hidden email] Subject: Re: Deleting embedded control characters At 07:54 PM 3/19/2007, Kevin Hynes wrote: >Here is what I was working with [indents added]: > >DO REPEAT VAR=COMMENT. >. COMPUTE LocHex01 = INDEX(VAR,' '). >. IF (LocHex01 > 0)VAR = SUBSTR(VAR,1,LocHex01-1). >END REPEAT. >EXECUTE. This isn't the problem you ran into; but, I'm not sure what the DO REPEAT is there for. If I read it right, the whole structure is equivalent to . COMPUTE LocHex01 = INDEX(COMMENT,' '). . IF (LocHex01 > 0) COMMENT = SUBSTR(COMMENT,1,LocHex01-1). If you want to check for, and remove, all instances of some character, you need a LOOP construct. >I was able to utilize Jon's suggestion for copying the hidden >characters from MS Access. However, when I copied the character into >the apostrophes in the COMPUTE statement, the hidden characters >essentially were "translated" and created a hard return in the COMPUTE >statement They weren't translated. They were, and stayed, what they had been in Access. Your problem was, and is, that a return character is a return character; if you paste it into SPSS syntax, it has whatever meaning it has in the syntax. The quote marks delimit a literal string - a 'character literal'. MOST, BUT NOT ALL, characters between quotes become part of the string, not part of the syntax. It looks like CR is one of the exceptions. To repeat: the CR was not translated, in any way, shape or form. It started as a CR and ended as a CR. Since CR characters have a syntactic meaning - the end of a line of the command - even if they're within quotes, you got the result you saw. You CANNOT, it looks like, get a CR as a one-character string, or as part of a multiple-character string, by enclosing it in quotes. Of course you got >DO REPEAT VAR=COMMENT. >COMPUTE LocHex01 = INDEX(VAR,' >'). >IF (LocHex01 > 0)VAR = SUBSTR(VAR,1,LocHex01-1). >END REPEAT. >EXECUTE. Now: there are more ways to get a character into a string value, than by enclosing it in quotes. You may remember: on Wed, 31 Jan 2007 <23:27:49 -0500> in your thread, I posted a technique by Raynald Levesque, passed on by Gene Maguin, to get any character into a string value (NOT string literal, between quotes) from its ASCII numeric value. THAT is how to get a CR as an SPSS character value, rather than as part of a syntax line. I won't repeat from that post; you should be able to look it up OK. For numeric values of ASCII characters, see many sources, including http://www.asciitable.com/. CR is 13d, or 0Ex. If you're doing OK having the data pre-processed before you read it into SPSS, fine. But the problem can be solved in SPSS. Python isn't even needed, and I'm not sure it would even make things easier. You may be making this hard on yourself, by not using the information you have. If you do want to pursue an SPSS solution, the 31 January posting should give you what you need. Remember, there are two parts: A. Get a list of the characters you want to remove, as an SPSS string value (NOT an SPSS string literal) B. With a LOOP construct, search for, and remove or replace, every character on that list. The 31 January posting uses a loop with SUBSTR. On Thu, 1 Feb 2007 <11:29:33 -0500> I posted a simpler variation, using REPLACE instead of SUBSTR. -Best regards and good luck, Richard |
|
If you use the macro below, bear in mind that it may remove accented roman characters, and it definitely will not work with nonwestern scripts. RANGE respects the collating sequence of the locale in which it is run, and in some locales, accented characters fall outside the a-z (or A-Z) range.
Since all control characters are below hexadecimal 20, using range on that interval is safer. There are some tricky characters, taking the western code page (1252), for example, the non-breaking space is xA0 (=160 decimal) . (You can impress your friends and have blanks in variable names by using this character at the cost of making your syntax terminally confusing.) This code removes tab characters, value 09. Values 10 and 12 correspond to cr and lf characters. compute notabs = replace(strwtabs, string(09, ib1),' '). The most powerful way to do this, however, is with Python and regular expressions (the trans.py and extendedTransforms.subs functions package this conveniently). -Jon Peck SPSS -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Max Bell Sent: Tuesday, March 20, 2007 3:22 AM To: [hidden email] Subject: Re: [SPSSX-L] Deleting embedded control characters I had the same annoying problem and I wrote this macro for it. It's actually the other way around. It does not delete the hidden characters but it keeps the 'normal' characters which you can define yourself. When a not not normal character is found it will bij replaced by a space. The macro: DEFINE !DelENTER (!POS !CHAREND('/')) !DO !Var !IN (!1) LOOP #i=1 TO LENGTH(RTRIM(!Var)) . IF (RANGE(SUBSTR(!Var,#i,1),'0','9','a','z','A','Z')~=1 AND ANY(SUBSTR(!Var,#i,1),'&','%',';',',',':','-','/','?','#','*','(',')','$ ','%','\','+','=','.')~=1) SUBSTR(!Var,#i,1)="". END LOOP. EXE. !DOEND. !ENDDEFINE . Call the macro for v1,v2 ... with: !DelENTER v1 v2 / . Hope this helps, Max -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Richard Ristow Sent: dinsdag 20 maart 2007 6:25 To: [hidden email] Subject: Re: Deleting embedded control characters At 07:54 PM 3/19/2007, Kevin Hynes wrote: >Here is what I was working with [indents added]: > >DO REPEAT VAR=COMMENT. >. COMPUTE LocHex01 = INDEX(VAR,' '). >. IF (LocHex01 > 0)VAR = SUBSTR(VAR,1,LocHex01-1). >END REPEAT. >EXECUTE. This isn't the problem you ran into; but, I'm not sure what the DO REPEAT is there for. If I read it right, the whole structure is equivalent to . COMPUTE LocHex01 = INDEX(COMMENT,' '). . IF (LocHex01 > 0) COMMENT = SUBSTR(COMMENT,1,LocHex01-1). If you want to check for, and remove, all instances of some character, you need a LOOP construct. >I was able to utilize Jon's suggestion for copying the hidden >characters from MS Access. However, when I copied the character into >the apostrophes in the COMPUTE statement, the hidden characters >essentially were "translated" and created a hard return in the COMPUTE >statement They weren't translated. They were, and stayed, what they had been in Access. Your problem was, and is, that a return character is a return character; if you paste it into SPSS syntax, it has whatever meaning it has in the syntax. The quote marks delimit a literal string - a 'character literal'. MOST, BUT NOT ALL, characters between quotes become part of the string, not part of the syntax. It looks like CR is one of the exceptions. To repeat: the CR was not translated, in any way, shape or form. It started as a CR and ended as a CR. Since CR characters have a syntactic meaning - the end of a line of the command - even if they're within quotes, you got the result you saw. You CANNOT, it looks like, get a CR as a one-character string, or as part of a multiple-character string, by enclosing it in quotes. Of course you got >DO REPEAT VAR=COMMENT. >COMPUTE LocHex01 = INDEX(VAR,' >'). >IF (LocHex01 > 0)VAR = SUBSTR(VAR,1,LocHex01-1). >END REPEAT. >EXECUTE. Now: there are more ways to get a character into a string value, than by enclosing it in quotes. You may remember: on Wed, 31 Jan 2007 <23:27:49 -0500> in your thread, I posted a technique by Raynald Levesque, passed on by Gene Maguin, to get any character into a string value (NOT string literal, between quotes) from its ASCII numeric value. THAT is how to get a CR as an SPSS character value, rather than as part of a syntax line. I won't repeat from that post; you should be able to look it up OK. For numeric values of ASCII characters, see many sources, including http://www.asciitable.com/. CR is 13d, or 0Ex. If you're doing OK having the data pre-processed before you read it into SPSS, fine. But the problem can be solved in SPSS. Python isn't even needed, and I'm not sure it would even make things easier. You may be making this hard on yourself, by not using the information you have. If you do want to pursue an SPSS solution, the 31 January posting should give you what you need. Remember, there are two parts: A. Get a list of the characters you want to remove, as an SPSS string value (NOT an SPSS string literal) B. With a LOOP construct, search for, and remove or replace, every character on that list. The 31 January posting uses a loop with SUBSTR. On Thu, 1 Feb 2007 <11:29:33 -0500> I posted a simpler variation, using REPLACE instead of SUBSTR. -Best regards and good luck, Richard |
| Free forum by Nabble | Edit this page |
