Deleting embedded control characters (CR/LF) from very long text field

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Deleting embedded control characters (CR/LF) from very long text field

Kevin Hynes
Hi,

Thanks to Jon Peck and Richard Ristow for their advice on how to remove
hidden control characters visible in Data Editor but not copyable. I have
not yet been able to get Richard's approach to work. Jon's suggestion to try
the REPLACE function has the appeal of simplicity if I could get a handle on
the character needing to be replaced.

The REPLACE function below, for example, works fine in replacing the
character 'z' (or any other character) with a space.

COMPUTE var=REPLACE(Old_var,'z',' ').
EXECUTE.

Unfortunately, my attempts to substitute various ASCII codes for 'z' (the
hidden carriage return/line feed characters) have not met with success.

I am using ODBC to access data from MS Access via SPSS 14. Any help is
appreciated.

Thanks. Kevin

_________________________________________________________________
Refi Now: Rates near 39yr lows!  $430,000 Mortgage for $1,399/mo - Calculate
new payment
http://www.lowermybills.com/lre/index.jsp?sourceid=lmb-9632-17727&moid=7581
Reply | Threaded
Open this post in threaded view
|

Re: Deleting embedded control characters (CR/LF) from very long text field

Roberts, Michael
If your data file is not too large, you could try an editor such as
Ultraedit or similar, that displays all non-standard/non-printing
characters, to identify and replace or remove them first.

HTH

Mike




-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Kevin Hynes
Sent: Thursday, February 22, 2007 7:53 PM
To: [hidden email]
Subject: Deleting embedded control characters (CR/LF) from very long
text field

Hi,

Thanks to Jon Peck and Richard Ristow for their advice on how to remove
hidden control characters visible in Data Editor but not copyable. I
have not yet been able to get Richard's approach to work. Jon's
suggestion to try the REPLACE function has the appeal of simplicity if I
could get a handle on the character needing to be replaced.

The REPLACE function below, for example, works fine in replacing the
character 'z' (or any other character) with a space.

COMPUTE var=REPLACE(Old_var,'z',' ').
EXECUTE.

Unfortunately, my attempts to substitute various ASCII codes for 'z'
(the hidden carriage return/line feed characters) have not met with
success.

I am using ODBC to access data from MS Access via SPSS 14. Any help is
appreciated.

Thanks. Kevin

_________________________________________________________________
Refi Now: Rates near 39yr lows!  $430,000 Mortgage for $1,399/mo -
Calculate new payment
http://www.lowermybills.com/lre/index.jsp?sourceid=lmb-9632-17727&moid=7
581
Reply | Threaded
Open this post in threaded view
|

Re: Deleting embedded control characters (CR/LF) from very long text field

Peck, Jon
In reply to this post by Kevin Hynes
The easiest way to do this would be with a Python regular expression program, and the trans and extendedTransforms modules on Developer Central make this pretty easy.

But you can do this in straight (well, slightly bent) SPSS.

If it survives the email transmission, the syntax below contains a CR/LF sequence in the second field, so you can just change the variable names and replacement ('x'), and you're good to go.

If the email creams this, you can do this: create a text field in Access and enter a cr/lf
 (ctrl-enter will do this).  Read that into SPSS via ODBC and copy a cell with this value (not from the edit bar, copy the cell directly), and paste it into a replace function.

STRING nobadch (A8).
COMPUTE nobadch = REPLACE(badch,"      ",'x') .

-Jon Peck

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Kevin Hynes
Sent: Thursday, February 22, 2007 6:53 PM
To: [hidden email]
Subject: [SPSSX-L] Deleting embedded control characters (CR/LF) from very long text field

Hi,

Thanks to Jon Peck and Richard Ristow for their advice on how to remove
hidden control characters visible in Data Editor but not copyable. I have
not yet been able to get Richard's approach to work. Jon's suggestion to try
the REPLACE function has the appeal of simplicity if I could get a handle on
the character needing to be replaced.

The REPLACE function below, for example, works fine in replacing the
character 'z' (or any other character) with a space.

COMPUTE var=REPLACE(Old_var,'z',' ').
EXECUTE.

Unfortunately, my attempts to substitute various ASCII codes for 'z' (the
hidden carriage return/line feed characters) have not met with success.

I am using ODBC to access data from MS Access via SPSS 14. Any help is
appreciated.

Thanks. Kevin

_________________________________________________________________
Refi Now: Rates near 39yr lows!  $430,000 Mortgage for $1,399/mo - Calculate
new payment
http://www.lowermybills.com/lre/index.jsp?sourceid=lmb-9632-17727&moid=7581
Reply | Threaded
Open this post in threaded view
|

Re: Deleting embedded control characters

Richard Ristow
In reply to this post by Kevin Hynes
At 04:52 PM 2/22/2007, Kevin Hynes wrote:

>Thanks to Jon Peck and Richard Ristow for their advice on how to
>remove hidden control characters visible in Data Editor but not
>copyable. I have not yet been able to get Richard's approach to work.
>Jon's suggestion to try the REPLACE function has the appeal of
>simplicity if I could get a handle on the character needing to be
>replaced.

What went wrong with what I posted? It was tested code; on the other
hand, that was only one test run.

>The REPLACE function has the appeal of simplicity

REPLACE (which Jon suggested) is definitely better than using SUBSTRING
(which I originally suggested), if you have SPSS 14+ and have REPLACE.
I see I posted a version using REPLACE.

>if I could get a handle on the character needing to be replaced.

Exactly. I posted logic (using a technique from Raynald Levesque) to
get those characters from the ASCII numerical equivalents. What went
wrong? You'll understand that "I cannot get the approach to work" isn't
very helpful for debugging.
Reply | Threaded
Open this post in threaded view
|

Re: Deleting embedded control characters

Kevin Hynes
Hi all,

Sorry for the delay in responding to Richard's question below. I'm not
saying that the suggestions do not work; it's just that I was not able to
implement the suggested approaches. I have to confess I ultimately dodged
the issue by asking IT to resolve the problem on the front end. They used
the following Perl to strip the line breaks and carriage returns:

    $comment =~ s/\n/ /g;   # replace newlines with spaces
    $comment =~ s/\r//g;    # remove hard returns
    $comment =~ s/\cM//g;   # delete ^M's
    $comment =~ s/"/'/g;    # replaces double quotes with single quotes

I apologize for not having saved a good SPSS example for debugging. Here is
what I was working with:

DO REPEAT VAR=COMMENT.
COMPUTE LocHex01 = INDEX(VAR,' ').
IF (LocHex01 > 0)VAR = SUBSTR(VAR,1,LocHex01-1).
END REPEAT.
EXECUTE.

For what it's woth, I was able to utilize Jon's suggestion for copying the
hidden characters from MS Access. However, when I copied the character into
the apostrophes in the COMPUTE statement,
the hidden characters essentially were "translated" and created a hard
return in the COMPUTE statement which then produced an error in the program
since the statement could not be continued across two lines without
inserting a "+" for continuation. But then inserting the "+" destroyed the
hard return. Unfortunately I wrote over the syntax and did not save a copy.
Basically, after copying the hidden character, the program looked like the
following:

DO REPEAT VAR=COMMENT.
COMPUTE LocHex01 = INDEX(VAR,'
').
IF (LocHex01 > 0)VAR = SUBSTR(VAR,1,LocHex01-1).
END REPEAT.
EXECUTE.

I did save a copy of 3 records which I had edited down to about 25
caharacters even though the length was 1400. "Comment" was to be the result
of having removed the hard return/bad char  from Junk. Listing the data
produces the following:

The variables are listed in the following order:

LINE   1: CrnSt

LINE   2: JUNK

LINE   3: COMMENT


       CrnSt:     1
        JUNK:     1. slide
show
     COMMENT:

       CrnSt:     2
        JUNK:     2. One
suggestion
     COMMENT:

       CrnSt:     3
        JUNK:     3. were
clear
     COMMENT:


Number of cases read:  3    Number of cases listed:  3


>From: Richard Ristow <[hidden email]>
>To: Kevin Hynes <[hidden email]>,[hidden email]
>Subject: Re: Deleting embedded control characters
>Date: Fri, 23 Feb 2007 12:37:00 -0800
>
>At 04:52 PM 2/22/2007, Kevin Hynes wrote:
>
>>Thanks to Jon Peck and Richard Ristow for their advice on how to remove
>>hidden control characters visible in Data Editor but not copyable. I have
>>not yet been able to get Richard's approach to work. Jon's suggestion to
>>try the REPLACE function has the appeal of simplicity if I could get a
>>handle on the character needing to be replaced.
>
>What went wrong with what I posted? It was tested code; on the other hand,
>that was only one test run.
>
>>The REPLACE function has the appeal of simplicity
>
>REPLACE (which Jon suggested) is definitely better than using SUBSTRING
>(which I originally suggested), if you have SPSS 14+ and have REPLACE. I
>see I posted a version using REPLACE.
>
>>if I could get a handle on the character needing to be replaced.
>
>Exactly. I posted logic (using a technique from Raynald Levesque) to get
>those characters from the ASCII numerical equivalents. What went wrong?
>You'll understand that "I cannot get the approach to work" isn't very
>helpful for debugging.
>
>
>

_________________________________________________________________
Mortgage refinance is hot 1) Rates near 30-yr lows 2) Good credit get
intro-rate 4.625%*
https://www2.nextag.com/goto.jsp?product=100000035&url=%2fst.jsp&tm=y&search=mortgage_text_links_88_h2a5f&s=4056&p=5117&disc=y&vers=743
Reply | Threaded
Open this post in threaded view
|

Re: Deleting embedded control characters

Richard Ristow
At 07:54 PM 3/19/2007, Kevin Hynes wrote:

>Here is what I was working with [indents added]:
>
>DO REPEAT VAR=COMMENT.
>.  COMPUTE LocHex01 = INDEX(VAR,' ').
>.  IF (LocHex01 > 0)VAR = SUBSTR(VAR,1,LocHex01-1).
>END REPEAT.
>EXECUTE.

This isn't the problem you ran into; but, I'm not sure what the DO
REPEAT is there for. If I read it right, the whole structure is
equivalent to

.  COMPUTE LocHex01 = INDEX(COMMENT,' ').
.  IF (LocHex01 > 0) COMMENT = SUBSTR(COMMENT,1,LocHex01-1).

If you want to check for, and remove, all instances of some character,
you need a LOOP construct.

>I was able to utilize Jon's suggestion for copying the hidden
>characters from MS Access. However, when I copied the character into
>the apostrophes in the COMPUTE statement, the hidden characters
>essentially were "translated" and created a hard return in the COMPUTE
>statement

They weren't translated. They were, and stayed, what they had been in
Access. Your problem was, and is, that a return character is a return
character; if you paste it into SPSS syntax, it has whatever meaning it
has in the syntax.

The quote marks delimit a literal string - a 'character literal'. MOST,
BUT NOT ALL, characters between quotes become part of the string, not
part of the syntax. It looks like CR is one of the exceptions.

To repeat: the CR was not translated, in any way, shape or form. It
started as a CR and ended as a CR. Since CR characters have a syntactic
meaning - the end of a line of the command - even if they're within
quotes, you got the result you saw. You CANNOT, it looks like, get a CR
as a one-character string, or as part of a multiple-character string,
by enclosing it in quotes. Of course you got

>DO REPEAT VAR=COMMENT.
>COMPUTE LocHex01 = INDEX(VAR,'
>').
>IF (LocHex01 > 0)VAR = SUBSTR(VAR,1,LocHex01-1).
>END REPEAT.
>EXECUTE.

Now: there are more ways to get a character into a string value, than
by enclosing it in quotes. You may remember: on Wed, 31 Jan 2007
<23:27:49 -0500> in your thread, I posted a technique by Raynald
Levesque, passed on by Gene Maguin, to get any character into a string
value (NOT string literal, between quotes) from its ASCII numeric
value. THAT is how to get a CR as an SPSS character value, rather than
as part of a syntax line. I won't repeat from that post; you should be
able to look it up OK.

For numeric values of ASCII characters, see many sources, including
http://www.asciitable.com/.  CR is 13d, or 0Ex.

If you're doing OK having the data pre-processed before you read it
into SPSS, fine. But the problem can be solved in SPSS. Python isn't
even needed, and I'm not sure it would even make things easier.

You may be making this hard on yourself, by not using the information
you have. If you do want to pursue an SPSS solution, the 31 January
posting should give you what you need. Remember, there are two parts:

A. Get a list of the characters you want to remove, as an SPSS string
value (NOT an SPSS string literal)

B. With a LOOP construct, search for, and remove or replace, every
character on that list.

The 31 January posting uses a loop with SUBSTR. On Thu, 1 Feb 2007
<11:29:33 -0500> I posted a simpler variation, using REPLACE instead of
SUBSTR.

-Best regards and good luck,
  Richard
Reply | Threaded
Open this post in threaded view
|

Re: Deleting embedded control characters

Max Bell-2
I had the same annoying problem and I wrote this macro for it. It's
actually the other way around. It does not delete the hidden characters
but it keeps the 'normal' characters which you can define yourself. When
a not not normal character is found it will bij replaced by a space.

The macro:

DEFINE !DelENTER (!POS !CHAREND('/'))
!DO !Var !IN (!1)
LOOP #i=1 TO  LENGTH(RTRIM(!Var)) .
IF (RANGE(SUBSTR(!Var,#i,1),'0','9','a','z','A','Z')~=1 AND
ANY(SUBSTR(!Var,#i,1),'&','%',';',',',':','-','/','?','#','*','(',')','$
','%','\','+','=','.')~=1)  SUBSTR(!Var,#i,1)="".
END LOOP.
EXE.
!DOEND.
!ENDDEFINE .

Call the macro for v1,v2 ...  with:
!DelENTER  v1 v2 / .


Hope this helps, Max


-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Richard Ristow
Sent: dinsdag 20 maart 2007 6:25
To: [hidden email]
Subject: Re: Deleting embedded control characters

At 07:54 PM 3/19/2007, Kevin Hynes wrote:

>Here is what I was working with [indents added]:
>
>DO REPEAT VAR=COMMENT.
>.  COMPUTE LocHex01 = INDEX(VAR,' ').
>.  IF (LocHex01 > 0)VAR = SUBSTR(VAR,1,LocHex01-1).
>END REPEAT.
>EXECUTE.

This isn't the problem you ran into; but, I'm not sure what the DO
REPEAT is there for. If I read it right, the whole structure is
equivalent to

.  COMPUTE LocHex01 = INDEX(COMMENT,' ').
.  IF (LocHex01 > 0) COMMENT = SUBSTR(COMMENT,1,LocHex01-1).

If you want to check for, and remove, all instances of some character,
you need a LOOP construct.

>I was able to utilize Jon's suggestion for copying the hidden
>characters from MS Access. However, when I copied the character into
>the apostrophes in the COMPUTE statement, the hidden characters
>essentially were "translated" and created a hard return in the COMPUTE
>statement

They weren't translated. They were, and stayed, what they had been in
Access. Your problem was, and is, that a return character is a return
character; if you paste it into SPSS syntax, it has whatever meaning it
has in the syntax.

The quote marks delimit a literal string - a 'character literal'. MOST,
BUT NOT ALL, characters between quotes become part of the string, not
part of the syntax. It looks like CR is one of the exceptions.

To repeat: the CR was not translated, in any way, shape or form. It
started as a CR and ended as a CR. Since CR characters have a syntactic
meaning - the end of a line of the command - even if they're within
quotes, you got the result you saw. You CANNOT, it looks like, get a CR
as a one-character string, or as part of a multiple-character string, by
enclosing it in quotes. Of course you got

>DO REPEAT VAR=COMMENT.
>COMPUTE LocHex01 = INDEX(VAR,'
>').
>IF (LocHex01 > 0)VAR = SUBSTR(VAR,1,LocHex01-1).
>END REPEAT.
>EXECUTE.

Now: there are more ways to get a character into a string value, than by
enclosing it in quotes. You may remember: on Wed, 31 Jan 2007
<23:27:49 -0500> in your thread, I posted a technique by Raynald
Levesque, passed on by Gene Maguin, to get any character into a string
value (NOT string literal, between quotes) from its ASCII numeric value.
THAT is how to get a CR as an SPSS character value, rather than as part
of a syntax line. I won't repeat from that post; you should be able to
look it up OK.

For numeric values of ASCII characters, see many sources, including
http://www.asciitable.com/.  CR is 13d, or 0Ex.

If you're doing OK having the data pre-processed before you read it into
SPSS, fine. But the problem can be solved in SPSS. Python isn't even
needed, and I'm not sure it would even make things easier.

You may be making this hard on yourself, by not using the information
you have. If you do want to pursue an SPSS solution, the 31 January
posting should give you what you need. Remember, there are two parts:

A. Get a list of the characters you want to remove, as an SPSS string
value (NOT an SPSS string literal)

B. With a LOOP construct, search for, and remove or replace, every
character on that list.

The 31 January posting uses a loop with SUBSTR. On Thu, 1 Feb 2007
<11:29:33 -0500> I posted a simpler variation, using REPLACE instead of
SUBSTR.

-Best regards and good luck,
  Richard
Reply | Threaded
Open this post in threaded view
|

Re: Deleting embedded control characters

Peck, Jon
If you use the macro below, bear in mind that it may remove accented roman characters, and it definitely will not work with nonwestern scripts.  RANGE respects the collating sequence of the locale in which it is run, and in some locales, accented characters fall outside the a-z (or A-Z) range.

Since all control characters are below hexadecimal 20, using range on that interval is safer.
There are some tricky characters, taking the western code page (1252), for example, the non-breaking space is xA0 (=160 decimal) .  (You can impress your friends and have blanks in variable names by using this character at the cost of making your syntax terminally confusing.)

This code removes tab characters, value 09.  Values 10 and 12 correspond to cr and lf characters.

compute notabs = replace(strwtabs, string(09, ib1),' ').

The most powerful way to do this, however, is with Python and regular expressions (the trans.py and extendedTransforms.subs functions package this conveniently).

-Jon Peck
SPSS

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Max Bell
Sent: Tuesday, March 20, 2007 3:22 AM
To: [hidden email]
Subject: Re: [SPSSX-L] Deleting embedded control characters

I had the same annoying problem and I wrote this macro for it. It's
actually the other way around. It does not delete the hidden characters
but it keeps the 'normal' characters which you can define yourself. When
a not not normal character is found it will bij replaced by a space.

The macro:

DEFINE !DelENTER (!POS !CHAREND('/'))
!DO !Var !IN (!1)
LOOP #i=1 TO  LENGTH(RTRIM(!Var)) .
IF (RANGE(SUBSTR(!Var,#i,1),'0','9','a','z','A','Z')~=1 AND
ANY(SUBSTR(!Var,#i,1),'&','%',';',',',':','-','/','?','#','*','(',')','$
','%','\','+','=','.')~=1)  SUBSTR(!Var,#i,1)="".
END LOOP.
EXE.
!DOEND.
!ENDDEFINE .

Call the macro for v1,v2 ...  with:
!DelENTER  v1 v2 / .


Hope this helps, Max


-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Richard Ristow
Sent: dinsdag 20 maart 2007 6:25
To: [hidden email]
Subject: Re: Deleting embedded control characters

At 07:54 PM 3/19/2007, Kevin Hynes wrote:

>Here is what I was working with [indents added]:
>
>DO REPEAT VAR=COMMENT.
>.  COMPUTE LocHex01 = INDEX(VAR,' ').
>.  IF (LocHex01 > 0)VAR = SUBSTR(VAR,1,LocHex01-1).
>END REPEAT.
>EXECUTE.

This isn't the problem you ran into; but, I'm not sure what the DO
REPEAT is there for. If I read it right, the whole structure is
equivalent to

.  COMPUTE LocHex01 = INDEX(COMMENT,' ').
.  IF (LocHex01 > 0) COMMENT = SUBSTR(COMMENT,1,LocHex01-1).

If you want to check for, and remove, all instances of some character,
you need a LOOP construct.

>I was able to utilize Jon's suggestion for copying the hidden
>characters from MS Access. However, when I copied the character into
>the apostrophes in the COMPUTE statement, the hidden characters
>essentially were "translated" and created a hard return in the COMPUTE
>statement

They weren't translated. They were, and stayed, what they had been in
Access. Your problem was, and is, that a return character is a return
character; if you paste it into SPSS syntax, it has whatever meaning it
has in the syntax.

The quote marks delimit a literal string - a 'character literal'. MOST,
BUT NOT ALL, characters between quotes become part of the string, not
part of the syntax. It looks like CR is one of the exceptions.

To repeat: the CR was not translated, in any way, shape or form. It
started as a CR and ended as a CR. Since CR characters have a syntactic
meaning - the end of a line of the command - even if they're within
quotes, you got the result you saw. You CANNOT, it looks like, get a CR
as a one-character string, or as part of a multiple-character string, by
enclosing it in quotes. Of course you got

>DO REPEAT VAR=COMMENT.
>COMPUTE LocHex01 = INDEX(VAR,'
>').
>IF (LocHex01 > 0)VAR = SUBSTR(VAR,1,LocHex01-1).
>END REPEAT.
>EXECUTE.

Now: there are more ways to get a character into a string value, than by
enclosing it in quotes. You may remember: on Wed, 31 Jan 2007
<23:27:49 -0500> in your thread, I posted a technique by Raynald
Levesque, passed on by Gene Maguin, to get any character into a string
value (NOT string literal, between quotes) from its ASCII numeric value.
THAT is how to get a CR as an SPSS character value, rather than as part
of a syntax line. I won't repeat from that post; you should be able to
look it up OK.

For numeric values of ASCII characters, see many sources, including
http://www.asciitable.com/.  CR is 13d, or 0Ex.

If you're doing OK having the data pre-processed before you read it into
SPSS, fine. But the problem can be solved in SPSS. Python isn't even
needed, and I'm not sure it would even make things easier.

You may be making this hard on yourself, by not using the information
you have. If you do want to pursue an SPSS solution, the 31 January
posting should give you what you need. Remember, there are two parts:

A. Get a list of the characters you want to remove, as an SPSS string
value (NOT an SPSS string literal)

B. With a LOOP construct, search for, and remove or replace, every
character on that list.

The 31 January posting uses a loop with SUBSTR. On Thu, 1 Feb 2007
<11:29:33 -0500> I posted a simpler variation, using REPLACE instead of
SUBSTR.

-Best regards and good luck,
  Richard