Remove repeated "words" in a string variable

Remove repeated "words" in a string variable

BjornAhlstrom
Hi, I'm working on a large dataset. In one string variable I have ICD-10 codes. In many cases they are repeated like this:

" T810 J969 T812 R651 J809 T810 N178 B371 M628 B968C T810 T812..."

As you can see, there are five "T812" and three "T810". Sometimes there are 50 repeats of ICD codes, which makes this variable unnecessarily wide. I would like to keep only one of each code for each case. I have tried the syntax presented in this discussion:

http://spssx-discussion.1045642.n5.nabble.com/How-to-remove-duplicate-repeated-character-in-a-variable-td5728880.html

However, the suggested solutions do not give the desired result. Data example:

DATA LIST LIST / id * icd(a50).
BEGIN DATA
1 "T079 S370 S220 S270 S220 S369 T079 T079 S370 S220 "
2 "J809B N179 J969 R572 J459 J159 J969 J809C R651 N179 "
3 "I609 N179 R572 J809C B371 I609 N179 N179 I609"
END DATA.

Any suggestions?

Thanks in advance,
Björn

Sent from the SPSSX Discussion mailing list archive at Nabble.com.
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Re: Remove repeated "words" in a string variable

Andy W
The Python code at the link can be modified slightly to do what you want.
Note that this does not retain the original order of the ICD codes, so if the
first code is the primary diagnosis (or otherwise meaningful), its position may change.

***************************************************.
DATA LIST LIST / id (F1.0) icd(a100).
BEGIN DATA
1 "T079 S370 S220 S270 S220 S369 T079 T079 S370 S220 "
2 "J809B N179 J969 R572 J459 J159 J969 J809C R651 N179 "
3 "I609 N179 R572 J809C B371 I609 N179 N179 I609"
END DATA.
DATASET NAME ICD.
EXECUTE.

BEGIN PROGRAM PYTHON.
# Define a function that removes duplicate space-separated codes
def del_sub(x):
    split_str = x.split()  # split on any whitespace, so no empty strings are produced
    return " ".join(set(split_str))  # set() drops duplicates but does not keep the order

# Example
te = "I609 N179 R572 J809C B371 I609 N179 N179 I609"
print( del_sub(te) )
END PROGRAM.

SPSSINC TRANS RESULT=NoDupICD TYPE=100
  /FORMULA "del_sub(x=icd)".
***************************************************.




-----
Andy W
[hidden email]
http://andrewpwheeler.wordpress.com/

Re: Remove repeated "words" in a string variable

Jon Peck
In reply to this post by BjornAhlstrom
Here is a simple solution using the SPSSINC TRANS extension command, which can be installed from Extensions > Extension Hub if you don't already have it.
First, it defines a function; then TRANS uses it. The TYPE value is the length of the output string variable, typically the same as the input length.
This does not preserve order. That would require slightly more complicated code.

begin program python3.
def unique(codes):
    return " ".join(set(codes.split()))
end program.

spssinc trans result=codes type=200
/formula "unique(codes)".
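
An order-preserving variant could look like the sketch below. It is not from the original post: the function name unique_ordered is made up for illustration, and it assumes Python 3.7 or later, where a dict keeps its keys in insertion order.

```python
def unique_ordered(codes):
    """Return the whitespace-separated codes with duplicates removed,
    keeping each code at the position of its first occurrence."""
    # dict.fromkeys keeps one key per distinct code, in insertion order
    return " ".join(dict.fromkeys(codes.split()))


# Example with one of the rows from the question
print(unique_ordered("I609 N179 R572 J809C B371 I609 N179 N179 I609"))
# prints: I609 N179 R572 J809C B371
```

If defined inside a begin program block, it could presumably be passed to SPSSINC TRANS in the same way as above, for example with /formula "unique_ordered(icd)".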

--
Jon K Peck
[hidden email]


Re: Remove repeated "words" in a string variable

BjornAhlstrom
In reply to this post by Andy W
Thanks, it works like a charm! I'm not worried about the code order.

Regards,
Björn




Re: Remove repeated "words" in a string variable

BjornAhlstrom
In reply to this post by Jon Peck
Thanks for helping out!

Sadly, SPSSINC TRANS does not run on my university computer. I need to look
into that. Luckily, I got another solution in this thread.

Best regards,
Björn




Re: Remove repeated "words" in a string variable

Robert L
In reply to this post by BjornAhlstrom

The suggested Python solutions are of course neat, but I could not help thinking about a possible "native" SPSS solution. There might be more, but I think this would work too. The syntax is not clean; I used the menus to generate it:

 

*****************************.
*Data in text file, followed by import to SPSS.
PRESERVE.
SET DECIMAL COMMA.

GET DATA  /TYPE=TXT
  /FILE="diagnoses.txt"
  /ENCODING='UTF8'
  /DELCASE=LINE
  /DELIMITERS=" "
  /ARRANGEMENT=DELIMITED
  /FIRSTCASE=1
  /DATATYPEMIN PERCENTAGE=95.0
  /VARIABLES=
  id AUTO
  V2 AUTO
  V3 AUTO
  V4 AUTO
  V5 AUTO
  V6 AUTO
  V7 AUTO
  V8 AUTO
  V9 AUTO
  V10 AUTO
  V11 AUTO
  /MAP.
RESTORE.

CACHE.
EXECUTE.
DATASET NAME data0 WINDOW=FRONT.

*Adding copies of dataset just to get something to compare with.
DATASET ACTIVATE data0.
DATASET COPY data1.
DATASET ACTIVATE data1.

*Wide to long format.
VARSTOCASES
  /MAKE trans1 FROM V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
  /INDEX=Index1(10)
  /KEEP=id
  /NULL=KEEP.

DATASET ACTIVATE data1.
DATASET COPY data2.
DATASET ACTIVATE data2.

* Identify Duplicate Cases.
SORT CASES BY id(A) trans1(A) Index1(A).
MATCH FILES
  /FILE=*
  /BY id trans1
  /FIRST=PrimaryFirst.
VARIABLE LABELS  PrimaryFirst 'Indicator of each first matching case as Primary'.
VALUE LABELS  PrimaryFirst 0 'Duplicate Case' 1 'Primary Case'.
VARIABLE LEVEL  PrimaryFirst (ORDINAL).

*Select only first records.
SELECT IF PrimaryFirst.

EXECUTE.

DATASET ACTIVATE data2.
DATASET COPY data3.
DATASET ACTIVATE data3.

*Long to wide.
SORT CASES BY id Index1.
CASESTOVARS
  /ID=id
  /GROUPBY=VARIABLE.
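
As a rough illustration of the logic in the syntax above (not part of the original post), the three steps, wide to long, keep the first occurrence of each code per id, and long to wide, can be sketched in plain Python. The variable names are made up for illustration, and the sample rows are the ones from the question.

```python
# Sample rows from the question: id -> string of ICD codes
rows = {
    1: "T079 S370 S220 S270 S220 S369 T079 T079 S370 S220 ",
    2: "J809B N179 J969 R572 J459 J159 J969 J809C R651 N179 ",
    3: "I609 N179 R572 J809C B371 I609 N179 N179 I609",
}

# Step 1, wide to long: one (id, index, code) record per code
long_records = [
    (case_id, index, code)
    for case_id, text in rows.items()
    for index, code in enumerate(text.split(), start=1)
]

# Step 2, keep only the first occurrence of each (id, code) pair
seen = set()
first_only = []
for case_id, index, code in long_records:
    if (case_id, code) not in seen:
        seen.add((case_id, code))
        first_only.append((case_id, index, code))

# Step 3, long to wide: collect the surviving codes per id, ordered by index
wide = {}
for case_id, index, code in sorted(first_only):
    wide.setdefault(case_id, []).append(code)

for case_id, codes in wide.items():
    print(case_id, " ".join(codes))
# prints:
# 1 T079 S370 S220 S270 S369
# 2 J809B N179 J969 R572 J459 J159 J809C R651
# 3 I609 N179 R572 J809C B371
```

Because the surviving records are sorted by their original index before being collected, this sketch keeps each code at the position of its first occurrence.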

 

From: SPSSX(r) Discussion [mailto:[hidden email]] On behalf of BjornAhlstrom
Sent: 20 May 2021 19:53
To: [hidden email]
Subject: Remove repeated "words" in a string variable

 

Robert Lundqvist