SPSSX Discussion

Identify Duplicates Rowwise

Classic

List

Threaded

8 messages Options

Rajeshms

Identify Duplicates Rowwise

Hi all,

I have

ID String1 String2 String3 String4.......... String684

947475 ...................................................................................................

That is,I have 684 string variables and 947475 cases,now I need to identify duplicates in string 1 to string 684 and remove those values(Row wise) not column wise,as we do by default in SPSS.

Any syntax or suggestion will help me. I have tried using restructure and transpose but was not successful.

Regards,

Rich Ulrich

Re: Identify Duplicates Rowwise

Details?
Everything simple starts with going from Wide to Long, but ...
The file sounds too big for FLIP, but writing with CasesToVars or
merely using LOOP with XSAVE is simple. What failed for you?

In Long form: sort, and eliminate dups by AGG or LAG( ), and then
there is the step of reconstructing Long records. Lines *can* be
sorted back to the original order, if you want, and LAG( ) used to
insert a sequential ordering without gaps.

Do strings start out sorted in order? Is it okay if they end up that way?
If not sorted: Do you want the result to have blanks, or be crunched to
the left? Use the position of earliest one?

--
Rich Ulrich

Date: Fri, 20 Sep 2013 08:50:24 +0530
From: [hidden email]
Subject: Identify Duplicates Rowwise
To: [hidden email]

Hi all,

I have

ID String1 String2 String3 String4.......... String684

947475 ...................................................................................................

That is,I have 684 string variables and 947475 cases,now I need to identify duplicates in string 1 to string 684 and remove those values(Row wise) not column wise,as we do by default in SPSS.

Any syntax or suggestion will help me. I have tried using restructure and transpose but was not successful.

Andy W

Re: Identify Duplicates Rowwise

In reply to this post by Rajeshms

This is going to be pretty painful anyway you slice it I believe. While Rich gives good advice, a VARSTOCASES that large will take a very long time and may not be possible with the full set (947475*684~650 million). Slicing it up into smaller chunks will work though.

I've used variants of the code below before to avoid the VARSTOCASES, I have no idea though how long this will take with near a million cases and (684*(684-1)/2~234,000) comparisons. How big the strings are I suspect will make a difference as well.

******************************************************.
data list free / str1 to str4 (4A1).
begin data
a a a a
a b c d
a a c d
a b b d
a b c c
end data.
dataset name string.

*ID duplicates - note the last string can never be a duplicate.
vector flag(3,F1.0).
vector str = str1 to str4.
loop #i = 1 to 3.
loop #j = #i+1 to 4.
if str(#i) = str(#j) flag(#i) = 1.
end loop.
end loop.
recode flag1 to flag3 (SYSMIS = 0)(ELSE = COPY).
exe.
******************************************************.

Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/

Alexandra Chirilov-2

Re: Identify Duplicates Rowwise

You can try to use this macro for removing the duplicated rowwise.
To call it just write: duble var1 var2 var3 .... varn.

define duble (!positional= !charend('|')
/zm =!default(miss) !charend('|')) .

!let !ciag=''.
!let !jeden=''.
!let !dwa=''.
recode !1 (0=sysmis).

!do !i !in (!1).
!let !jeden=(!i).
!do !j !in (!1).
!let !dwa=(!j).
!if (!jeden<>!dwa) !then
if (!i=!j) !j=$sysmis.
!ifend.
!doend.
!doend.

!do !i !in (!1).
!let !ciag=!concat(!ciag,',',!i).
!doend.
!let !ciag=!tail(!ciag).
if (nvalid(!ciag)=0) !zm=0.
value labels /!zm 0'missing case'.

!enddefine.

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Andy W
Sent: Friday, September 20, 2013 4:04 PM
To: [hidden email]
Subject: Re: Identify Duplicates Rowwise

This is going to be pretty painful anyway you slice it I believe. While Rich
gives good advice, a VARSTOCASES that large will take a very long time and
may not be possible with the full set (947475*684~650 million). Slicing it
up into smaller chunks will work though.

I've used variants of the code below before to avoid the VARSTOCASES, I have
no idea though how long this will take with near a million cases and
(684*(684-1)/2~234,000) comparisons. How big the strings are I suspect will
make a difference as well.

******************************************************.
data list free / str1 to str4 (4A1).
begin data
a a a a
a b c d
a a c d
a b b d
a b c c
end data.
dataset name string.

*ID duplicates - note the last string can never be a duplicate.
vector flag(3,F1.0).
vector str = str1 to str4.
loop #i = 1 to 3.
loop #j = #i+1 to 4.
if str(#i) = str(#j) flag(#i) = 1.
end loop.
end loop.
recode flag1 to flag3 (SYSMIS = 0)(ELSE = COPY).
exe.
******************************************************.

-----
Andy W
[hidden email]
http://andrewpwheeler.wordpress.com/
--
View this message in context:
http://spssx-discussion.1045642.n5.nabble.com/Identify-Duplicates-Rowwise-tp
5722130p5722133.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command SIGNOFF SPSSX-L For a list of
commands to manage subscriptions, send the command INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

David Marso

Re: Identify Duplicates Rowwise

Administrator

In reply to this post by Rajeshms

I have NO idea of what you mean by
"I need to identify duplicates in string 1 to string 684 and remove those values(Row wise)"
Please do us (and yourself) a favor by cobbling together a small example of what the data looks like before and what it should look like when done?
---

Rajeshms wrote

Hi all,

I have

ID String1 String2 String3
String4.......... String684
1
2
3
.
.
.
.
.
947475
...................................................................................................

That is,I have 684 string variables and 947475 cases,now I need to identify
duplicates in string 1 to string 684 and remove those values(Row wise) not
column wise,as we do by default in SPSS.

Any syntax or suggestion will help me. I have tried using restructure and
transpose but was not successful.
--
Regards,

Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Art Kendall

Re: Identify Duplicates Rowwise

In reply to this post by Rajeshms

Is there any importance to the order of the value in a case?
I.e., if you had 4 variables

cherry cherry apple cherry pear   cherry apple appleapple pear   cherry cherryorange pear   cherry apple

would you this work for you
apple cherry blank blankapple cherry pear   blankapple cherry pear   blankapple cherry orange pear

or do you want

cherry blank apple blank pear   cherry apple blankapple pear   cherry blankorange pear   cherry applealso are there a limited number of values that the 684 can take on?What is the purpose of deduplicating the list of 684 variables? I.e. what is the context, there may be another way to reach you goal.

Art Kendall
Social Research Consultants

On 9/19/2013 11:22 PM, Rajeshms [via SPSSX Discussion] wrote:

Hi all,

I have

ID String1 String2 String3 String4.......... String684

1

2

3

.

.

.

.

.

947475 ...................................................................................................

That is,I have 684 string variables and 947475 cases,now I need to identify duplicates in string 1 to string 684 and remove those values(Row wise) not column wise,as we do by default in SPSS.

Any syntax or suggestion will help me. I have tried using restructure and transpose but was not successful.
--

Regards,

If you reply to this email, your message will be added to the discussion below:
http://spssx-discussion.1045642.n5.nabble.com/Identify-Duplicates-Rowwise-tp5722130.html

To start a new topic under SPSSX Discussion, email [hidden email]
To unsubscribe from SPSSX Discussion, click here.
NAML

Art Kendall
Social Research Consultants

Rich Ulrich

Re: Identify Duplicates Rowwise

In reply to this post by Andy W

I don't know why a VarsToCases this simple should take a lot of time,
but here is the XSAVE version that does the same thing. If some of the
strings are already blank (is 684 just a maximum?), you don't have to write them.
This XSAVE should be quick (relatively) since the sort-and-compare by AGGREGATE
is what entails a huge number of steps for each line.

As Andy points out, there are a whole lot of comparisons to be done, and
any process could be time-consuming for a million instances. It could be
useful to set N OF CASES to 1000 in order to benchmark alternate versions.

*example using A10.
string strval(A10) .
vector str = str1 to str684.
loop loc = 1 to 684
compute strval= str(loc).
XSAVE outfile= */keep= ID loc strval .
end loop.
EXE.

--
Rich Ulrich

> Date: Fri, 20 Sep 2013 06:03:50 -0700

> From: [hidden email]
> Subject: Re: Identify Duplicates Rowwise
> To: [hidden email]
>
> This is going to be pretty painful anyway you slice it I believe. While Rich
> gives good advice, a VARSTOCASES that large will take a very long time and
> may not be possible with the full set (947475*684~650 million). Slicing it
> up into smaller chunks will work though.
>
> I've used variants of the code below before to avoid the VARSTOCASES, I have
> no idea though how long this will take with near a million cases and
> (684*(684-1)/2~234,000) comparisons. How big the strings are I suspect will
> make a difference as well.
>
> ******************************************************.
> data list free / str1 to str4 (4A1).
> begin data
> a a a a
> a b c d
> a a c d
> a b b d
> a b c c
> end data.
> dataset name string.
>
> *ID duplicates - note the last string can never be a duplicate.
> vector flag(3,F1.0).
> vector str = str1 to str4.
> loop #i = 1 to 3.
> loop #j = #i+1 to 4.
> if str(#i) = str(#j) flag(#i) = 1.
> end loop.
> end loop.
> recode flag1 to flag3 (SYSMIS = 0)(ELSE = COPY).
> exe.
> ******************************************************.
>

David Marso

Re: Identify Duplicates Rowwise

Administrator

"XSAVE outfile= */keep= ID loc strval ."
Really ;-) We'll blame it on Friday !

I'd say this beast calls for

DATASET DECLARE agg.
VARSTOCASES /ID=id /MAKE strvar FROM S001 TO S700 .
AGGREGATE OUTFILE agg / BREAK ID strvar / Count=N.
DATASET ACTIVATE agg.
CASESTOVARS /ID=id .

VARSTOCASES happens almost instantaneously on my machine.

Rich Ulrich wrote

I don't know why a VarsToCases this simple should take a lot of time,
but here is the XSAVE version that does the same thing. If some of the
strings are already blank (is 684 just a maximum?), you don't have to write them.
This XSAVE should be quick (relatively) since the sort-and-compare by AGGREGATE
is what entails a huge number of steps for each line.

As Andy points out, there are a whole lot of comparisons to be done, and
any process could be time-consuming for a million instances. It could be
useful to set N OF CASES to 1000 in order to benchmark alternate versions.

*example using A10.
string strval(A10) .
vector str = str1 to str684.
loop loc = 1 to 684
compute strval= str(loc).
XSAVE outfile= */keep= ID loc strval .
end loop.
EXE.

--
Rich Ulrich

> Date: Fri, 20 Sep 2013 06:03:50 -0700
> From: [hidden email]
> Subject: Re: Identify Duplicates Rowwise
> To: [hidden email]
>
> This is going to be pretty painful anyway you slice it I believe. While Rich
> gives good advice, a VARSTOCASES that large will take a very long time and
> may not be possible with the full set (947475*684~650 million). Slicing it
> up into smaller chunks will work though.
>
> I've used variants of the code below before to avoid the VARSTOCASES, I have
> no idea though how long this will take with near a million cases and
> (684*(684-1)/2~234,000) comparisons. How big the strings are I suspect will
> make a difference as well.
>
> ******************************************************.
> data list free / str1 to str4 (4A1).
> begin data
> a a a a
> a b c d
> a a c d
> a b b d
> a b c c
> end data.
> dataset name string.
>
> *ID duplicates - note the last string can never be a duplicate.
> vector flag(3,F1.0).
> vector str = str1 to str4.
> loop #i = 1 to 3.
> loop #j = #i+1 to 4.
> if str(#i) = str(#j) flag(#i) = 1.
> end loop.
> end loop.
> recode flag1 to flag3 (SYSMIS = 0)(ELSE = COPY).
> exe.
> ******************************************************.
>