identify duplicate strings across columns

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

identify duplicate strings across columns

wsu_wright
Tried searching the archives but may have missed it.  Need to identify
whether a record has any duplicate values across a set of columns.  In
the below examples, ID records 13, 14 & 16 would be duplicate candidates
that need to be flagged (e.g., 0,1 1=yes).  Note that IDs can be
repeating as in the 2 records on ID 14 but only one is a duplicate
candidate.  No python please.

ID  C1  C2  C3  C4
12  A3  A7  H5  B2
13  A7  D3  J4  D3
14  G3
14  F3  H7  A4  H7
15  J9  H5
16  F3  F3

David

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: identify duplicate strings across columns

Ki Park
You could try this. I used your data to run this code.

*This is to keep the original ID as C5 so you can see after this is indexed.
compute c5=id.
execute.


*Data transformation.
VARSTOCASES
  /ID=id
  /MAKE trans1 FROM C1 to   c4
  /INDEX=Index1(4)
  /KEEP=c5
  /NULL=KEEP.


*This sort is critical.
sort cases by id trans1.

*Flag your variables. I think from here, you can extract what you need.
compute flag=0.
if trans1=lag(trans1) flag=1.
execute.

*Of course, this code may not work if your data is more complex than you
have shared.


Ki





--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: identify duplicate strings across columns

Ki Park
*The last three lines of code should be:
*it needs a "do if" for ids.
*The previous one was also flagging "missing" or repeated empty cells.

compute flag3=0.
do if id=lag(id).
if trans1=lag(trans1) and trans1<>"" flag3=1.
end if.
execute.

__
Ki



--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: identify duplicate strings across columns

David Marso
Administrator
In reply to this post by wsu_wright
VECTOR c=C1 TO C4.
LOOP #=1 TO 3.
+  LOOP ##=#+1 TO 4.
+    IF c(#) EQ c(##) flagged =1.
+  END LOOP.
END LOOP.


coxspss wrote

> Tried searching the archives but may have missed it.  Need to identify
> whether a record has any duplicate values across a set of columns.  In
> the below examples, ID records 13, 14 & 16 would be duplicate candidates
> that need to be flagged (e.g., 0,1 1=yes).  Note that IDs can be
> repeating as in the 2 records on ID 14 but only one is a duplicate
> candidate.  No python please.
>
> ID  C1  C2  C3  C4
> 12  A3  A7  H5  B2
> 13  A7  D3  J4  D3
> 14  G3
> 14  F3  H7  A4  H7
> 15  J9  H5
> 16  F3  F3
>
> David
>
> =====================
> To manage your subscription to SPSSX-L, send a message to

> LISTSERV@.UGA

>  (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD





-----
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: identify duplicate strings across columns

wsu_wright
In reply to this post by wsu_wright
David,

Thanks for the syntax, as always, your contributions are eloquent &
concise.  However, I'm having issues with nulls.  In cases where there
are multiple nulls across the columns per record, it rightly identifies
them as a duplicate but in the case of this project nulls are to be
ignored, only cells with non null or code entries are to be evaluated.
The current syntax, when I ran it, identifies ID records 13 thru 16 as
having duplicate codes whereas IDs 14 (1st occurrence) & 15 should not
be flagged as duplicates.  Is there a way to adjust the syntax to bypass
null cells?


On Sat, Aug 25, 2018 at 12:01 PM, David Marso wrote:

> VECTOR c=C1 TO C4.
> LOOP #=1 TO 3.
> +  LOOP ##=#+1 TO 4.
> +    IF c(#) EQ c(##) flagged =1.
> +  END LOOP.
> END LOOP.
>
>
> coxspss wrote
>> Tried searching the archives but may have missed it.  Need to
>> identify whether a record has any duplicate values across a set of
>> columns.  In the below examples, ID records 13, 14 & 16 would be
>> duplicate candidates that need to be flagged (e.g., 0,1 1=yes).  Note
>> that IDs can be repeating as in the 2 records on ID 14 but only one
>> is a duplicate candidate.  No python please.
>>
>> ID  C1  C2  C3  C4
>> 12  A3  A7  H5  B2
>> 13  A7  D3  J4  D3
>> 14  G3
>> 14  F3  H7  A4  H7
>> 15  J9  H5
>> 16  F3  F3
>>
>> David
>>
>> =====================
>> To manage your subscription to SPSSX-L, send a message to
>
>> LISTSERV@.UGA
>
>>  (not to SPSSX-L), with no body text except the
>> command. To leave the list, send the command
>> SIGNOFF SPSSX-L
>> For a list of commands to manage subscriptions, send the command
>> INFO REFCARD
>
>
>
>
>
> -----
> Please reply to the list and not to my personal email.
> Those desiring my consulting or training services please feel free to
> email me.
> ---
> "Nolite dare sanctum canibus neque mittatis margaritas vestras ante
> porcos ne forte conculcent eas pedibus suis."
> Cum es damnatorum possederunt porcos iens ut salire off sanguinum
> cliff in abyssum?"
> --
> Sent from: http://spssx-discussion.1045642.n5.nabble.com/
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> [hidden email] (not to SPSSX-L), with no body text except
> the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: identify duplicate strings across columns

David Marso
Administrator
IF  c(#) NE ""  AND c(#) EQ c(##) flagged =1.


coxspss wrote

> David,
>
> Thanks for the syntax, as always, your contributions are eloquent &
> concise.  However, I'm having issues with nulls.  In cases where there
> are multiple nulls across the columns per record, it rightly identifies
> them as a duplicate but in the case of this project nulls are to be
> ignored, only cells with non null or code entries are to be evaluated.
> The current syntax, when I ran it, identifies ID records 13 thru 16 as
> having duplicate codes whereas IDs 14 (1st occurrence) & 15 should not
> be flagged as duplicates.  Is there a way to adjust the syntax to bypass
> null cells?
>
>
> On Sat, Aug 25, 2018 at 12:01 PM, David Marso wrote:
>
>> VECTOR c=C1 TO C4.
>> LOOP #=1 TO 3.
>> +  LOOP ##=#+1 TO 4.
>> +    IF c(#) EQ c(##) flagged =1.
>> +  END LOOP.
>> END LOOP.
>>
>>
>> coxspss wrote
>>> Tried searching the archives but may have missed it.  Need to
>>> identify whether a record has any duplicate values across a set of
>>> columns.  In the below examples, ID records 13, 14 & 16 would be
>>> duplicate candidates that need to be flagged (e.g., 0,1 1=yes).  Note
>>> that IDs can be repeating as in the 2 records on ID 14 but only one
>>> is a duplicate candidate.  No python please.
>>>
>>> ID  C1  C2  C3  C4
>>> 12  A3  A7  H5  B2
>>> 13  A7  D3  J4  D3
>>> 14  G3
>>> 14  F3  H7  A4  H7
>>> 15  J9  H5
>>> 16  F3  F3
>>>
>>> David
>>>
>>> =====================
>>> To manage your subscription to SPSSX-L, send a message to
>>
>>> LISTSERV@.UGA
>>
>>>  (not to SPSSX-L), with no body text except the
>>> command. To leave the list, send the command
>>> SIGNOFF SPSSX-L
>>> For a list of commands to manage subscriptions, send the command
>>> INFO REFCARD
>>
>>
>>
>>
>>
>> -----
>> Please reply to the list and not to my personal email.
>> Those desiring my consulting or training services please feel free to
>> email me.
>> ---
>> "Nolite dare sanctum canibus neque mittatis margaritas vestras ante
>> porcos ne forte conculcent eas pedibus suis."
>> Cum es damnatorum possederunt porcos iens ut salire off sanguinum
>> cliff in abyssum?"
>> --
>> Sent from: http://spssx-discussion.1045642.n5.nabble.com/
>>
>> =====================
>> To manage your subscription to SPSSX-L, send a message to
>>

> LISTSERV@.UGA

>  (not to SPSSX-L), with no body text except
>> the
>> command. To leave the list, send the command
>> SIGNOFF SPSSX-L
>> For a list of commands to manage subscriptions, send the command
>> INFO REFCARD
>
> =====================
> To manage your subscription to SPSSX-L, send a message to

> LISTSERV@.UGA

>  (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD





-----
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
--
Sent from: http://spssx-discussion.1045642.n5.nabble.com/

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: identify duplicate strings across columns

wsu_wright
In reply to this post by wsu_wright
Perfect, thank you David.


On Sun, Aug 26, 2018 at 7:14 AM, David Marso wrote:

> IF  c(#) NE ""  AND c(#) EQ c(##) flagged =1.
>
> coxspss wrote
>> David,
>>
>> Thanks for the syntax, as always, your contributions are eloquent &
>> concise.  However, I'm having issues with nulls.  In cases where
>> there are multiple nulls across the columns per record, it rightly
>> identifies them as a duplicate but in the case of this project nulls
>> are to be ignored, only cells with non null or code entries are to be
>> evaluated. The current syntax, when I ran it, identifies ID records
>> 13 thru 16 as having duplicate codes whereas IDs 14 (1st occurrence)
>> & 15 should not be flagged as duplicates.  Is there a way to adjust
>> the syntax to bypass null cells?
>>
>>
>> On Sat, Aug 25, 2018 at 12:01 PM, David Marso wrote:
>>
>>> VECTOR c=C1 TO C4.
>>> LOOP #=1 TO 3.
>>> +  LOOP ##=#+1 TO 4.
>>> +    IF c(#) EQ c(##) flagged =1.
>>> +  END LOOP.
>>> END LOOP.
>>>
>>>
>>> coxspss wrote
>>>> Tried searching the archives but may have missed it.  Need to
>>>> identify whether a record has any duplicate values across a set of
>>>> columns.  In the below examples, ID records 13, 14 & 16 would be
>>>> duplicate candidates that need to be flagged (e.g., 0,1 1=yes).
>>>> Note that IDs can be repeating as in the 2 records on ID 14 but
>>>> only one is a duplicate candidate.  No python please.
>>>>
>>>> ID  C1  C2  C3  C4
>>>> 12  A3  A7  H5  B2
>>>> 13  A7  D3  J4  D3
>>>> 14  G3
>>>> 14  F3  H7  A4  H7
>>>> 15  J9  H5
>>>> 16  F3  F3
>>>>
>>>> David
>>>>
>>>> =====================
>>>> To manage your subscription to SPSSX-L, send a message to
>>>
>>>> LISTSERV@.UGA
>>>
>>>>  (not to SPSSX-L), with no body text except the
>>>> command. To leave the list, send the command
>>>> SIGNOFF SPSSX-L
>>>> For a list of commands to manage subscriptions, send the command
>>>> INFO REFCARD
>>>
>>>
>>>
>>>
>>>
>>> -----
>>> Please reply to the list and not to my personal email.
>>> Those desiring my consulting or training services please feel free
>>> to email me.
>>> ---
>>> "Nolite dare sanctum canibus neque mittatis margaritas vestras ante
>>> porcos ne forte conculcent eas pedibus suis."
>>> Cum es damnatorum possederunt porcos iens ut salire off sanguinum
>>> cliff in abyssum?"
>>> --
>>> Sent from: http://spssx-discussion.1045642.n5.nabble.com/
>>>
>>> =====================
>>> To manage your subscription to SPSSX-L, send a message to
>>>
>
>> LISTSERV@.UGA
>
>>  (not to SPSSX-L), with no body text except
>>> the
>>> command. To leave the list, send the command
>>> SIGNOFF SPSSX-L
>>> For a list of commands to manage subscriptions, send the command
>>> INFO REFCARD
>>
>> =====================
>> To manage your subscription to SPSSX-L, send a message to
>
>> LISTSERV@.UGA
>
>>  (not to SPSSX-L), with no body text except the
>> command. To leave the list, send the command
>> SIGNOFF SPSSX-L
>> For a list of commands to manage subscriptions, send the command
>> INFO REFCARD
>
>
>
>
>
> -----
> Please reply to the list and not to my personal email.
> Those desiring my consulting or training services please feel free to
> email me.
> ---
> "Nolite dare sanctum canibus neque mittatis margaritas vestras ante
> porcos ne forte conculcent eas pedibus suis."
> Cum es damnatorum possederunt porcos iens ut salire off sanguinum
> cliff in abyssum?"
> --
> Sent from: http://spssx-discussion.1045642.n5.nabble.com/
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> [hidden email] (not to SPSSX-L), with no body text except
> the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD