Identify Duplicates Rowwise

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Identify Duplicates Rowwise

Rajeshms
Hi all,

I have

ID                  String1      String2       String3       String4.......... String684
1
2
3
.
.
.
.
.
947475           ...................................................................................................

That is,I have 684 string variables and 947475 cases,now I need to identify duplicates in string 1 to string 684 and remove those values(Row wise) not column wise,as we do by default in SPSS.

Any syntax or suggestion will help me. I have tried using restructure and transpose but was not successful.
--
Regards,





Reply | Threaded
Open this post in threaded view
|

Re: Identify Duplicates Rowwise

Rich Ulrich
Details? 
Everything simple starts with going from Wide to Long, but ...
The file sounds too big for FLIP, but writing with CasesToVars or
merely using LOOP with XSAVE  is simple.  What failed for you?

In Long form: sort, and eliminate dups by AGG or LAG( ), and then
there is the step of reconstructing Long records.  Lines *can* be
sorted back to the original order, if you want, and LAG( ) used to
insert a sequential ordering without gaps.

Do strings start out sorted in order?  Is it okay if they end up that way?
If not sorted:  Do you want the result to have blanks, or be crunched to
the left?  Use the position of earliest one?

--
Rich Ulrich


Date: Fri, 20 Sep 2013 08:50:24 +0530
From: [hidden email]
Subject: Identify Duplicates Rowwise
To: [hidden email]

Hi all,

I have

ID                  String1      String2       String3       String4.......... String684
1
2
3
.
.
.
.
.
947475           ...................................................................................................

That is,I have 684 string variables and 947475 cases,now I need to identify duplicates in string 1 to string 684 and remove those values(Row wise) not column wise,as we do by default in SPSS.

Any syntax or suggestion will help me. I have tried using restructure and transpose but was not successful.




Reply | Threaded
Open this post in threaded view
|

Re: Identify Duplicates Rowwise

Andy W
In reply to this post by Rajeshms
This is going to be pretty painful anyway you slice it I believe. While Rich gives good advice, a VARSTOCASES that large will take a very long time and may not be possible with the full set (947475*684~650 million). Slicing it up into smaller chunks will work though.

I've used variants of the code below before to avoid the VARSTOCASES, I have no idea though how long this will take with near a million cases and (684*(684-1)/2~234,000) comparisons. How big the strings are I suspect will make a difference as well.

******************************************************.
data list free / str1 to str4 (4A1).
begin data
a a a a
a b c d
a a c d
a b b d
a b c c
end data.
dataset name string.

*ID duplicates - note the last string can never be a duplicate.
vector flag(3,F1.0).
vector str = str1 to str4.
loop #i = 1 to 3.
  loop #j = #i+1 to 4.
    if str(#i) = str(#j) flag(#i) = 1.
  end loop.
end loop.
recode flag1 to flag3 (SYSMIS = 0)(ELSE = COPY).
exe.
******************************************************.
Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/
Reply | Threaded
Open this post in threaded view
|

Re: Identify Duplicates Rowwise

Alexandra Chirilov-2
You can try to use this macro for removing the duplicated rowwise.
To call it just write: duble var1 var2 var3 .... varn.


define duble (!positional= !charend('|')
 /zm =!default(miss) !charend('|')) .

!let !ciag=''.
!let !jeden=''.
!let !dwa=''.
recode !1 (0=sysmis).

!do !i !in (!1).
!let !jeden=(!i).
!do !j !in (!1).
!let !dwa=(!j).
!if (!jeden<>!dwa) !then
if (!i=!j) !j=$sysmis.
!ifend.
!doend.
!doend.

!do !i !in (!1).
!let !ciag=!concat(!ciag,',',!i).
!doend.
!let !ciag=!tail(!ciag).
if (nvalid(!ciag)=0) !zm=0.
value labels /!zm 0'missing case'.

!enddefine.

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Andy W
Sent: Friday, September 20, 2013 4:04 PM
To: [hidden email]
Subject: Re: Identify Duplicates Rowwise

This is going to be pretty painful anyway you slice it I believe. While Rich
gives good advice, a VARSTOCASES that large will take a very long time and
may not be possible with the full set (947475*684~650 million). Slicing it
up into smaller chunks will work though.

I've used variants of the code below before to avoid the VARSTOCASES, I have
no idea though how long this will take with near a million cases and
(684*(684-1)/2~234,000) comparisons. How big the strings are I suspect will
make a difference as well.

******************************************************.
data list free / str1 to str4 (4A1).
begin data
a a a a
a b c d
a a c d
a b b d
a b c c
end data.
dataset name string.

*ID duplicates - note the last string can never be a duplicate.
vector flag(3,F1.0).
vector str = str1 to str4.
loop #i = 1 to 3.
  loop #j = #i+1 to 4.
    if str(#i) = str(#j) flag(#i) = 1.
  end loop.
end loop.
recode flag1 to flag3 (SYSMIS = 0)(ELSE = COPY).
exe.
******************************************************.



-----
Andy W
[hidden email]
http://andrewpwheeler.wordpress.com/
--
View this message in context:
http://spssx-discussion.1045642.n5.nabble.com/Identify-Duplicates-Rowwise-tp
5722130p5722133.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command SIGNOFF SPSSX-L For a list of
commands to manage subscriptions, send the command INFO REFCARD

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Identify Duplicates Rowwise

David Marso
Administrator
In reply to this post by Rajeshms
I have NO idea of what you mean by
"I need to identify duplicates in string 1 to string 684 and remove those values(Row wise)"
Please do us (and yourself) a favor by cobbling together a small example of what the data looks like before and what it should look like when done?
---
Rajeshms wrote
Hi all,

I have

ID                  String1      String2       String3
String4.......... String684
1
2
3
.
.
.
.
.
947475
...................................................................................................

That is,I have 684 string variables and 947475 cases,now I need to identify
duplicates in string 1 to string 684 and remove those values(Row wise) not
column wise,as we do by default in SPSS.

Any syntax or suggestion will help me. I have tried using restructure and
transpose but was not successful.
--
Regards,
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: Identify Duplicates Rowwise

Art Kendall
In reply to this post by Rajeshms
Is there any importance to the order of the value in a case?
I.e., if you had 4 variables

cherry cherry apple  cherry
pear   cherry apple  apple
apple  pear   cherry cherry
orange pear   cherry apple


would you this work for you
apple cherry blank  blank
apple cherry pear   blank
apple cherry pear   blank
apple cherry orange pear


or do you want

cherry blank  apple  blank
pear   cherry apple  blank
apple  pear   cherry blank
orange pear   cherry apple


also are there a limited number of values that the 684 can take on?

What is the purpose of deduplicating the list of 684 variables? I.e. what is the context, there may be another way to reach you goal.
Art Kendall
Social Research Consultants
On 9/19/2013 11:22 PM, Rajeshms [via SPSSX Discussion] wrote:
Hi all,

I have

ID                  String1      String2       String3       String4.......... String684
1
2
3
.
.
.
.
.
947475           ...................................................................................................

That is,I have 684 string variables and 947475 cases,now I need to identify duplicates in string 1 to string 684 and remove those values(Row wise) not column wise,as we do by default in SPSS.

Any syntax or suggestion will help me. I have tried using restructure and transpose but was not successful.
--
Regards,








If you reply to this email, your message will be added to the discussion below:
http://spssx-discussion.1045642.n5.nabble.com/Identify-Duplicates-Rowwise-tp5722130.html
To start a new topic under SPSSX Discussion, email [hidden email]
To unsubscribe from SPSSX Discussion, click here.
NAML

Art Kendall
Social Research Consultants
Reply | Threaded
Open this post in threaded view
|

Re: Identify Duplicates Rowwise

Rich Ulrich
In reply to this post by Andy W
I don't know why a VarsToCases this simple should take a lot of time,
but here is the XSAVE version that does the same thing.  If some of the
strings are already blank (is 684 just a maximum?), you don't have to write them. 
This XSAVE should be quick (relatively) since the sort-and-compare  by AGGREGATE
is what entails a huge number of steps for each line.

As Andy points out, there are a whole lot of comparisons to be done, and
any process could be time-consuming for a million instances.  It could be
useful to set N OF CASES to 1000  in order to benchmark alternate versions.

*example using A10.
string  strval(A10) .
vector str = str1 to str684.
loop loc = 1 to 684
compute strval= str(loc).
XSAVE outfile= */keep= ID loc strval .
end loop.
EXE.

--
Rich Ulrich

> Date: Fri, 20 Sep 2013 06:03:50 -0700

> From: [hidden email]
> Subject: Re: Identify Duplicates Rowwise
> To: [hidden email]
>
> This is going to be pretty painful anyway you slice it I believe. While Rich
> gives good advice, a VARSTOCASES that large will take a very long time and
> may not be possible with the full set (947475*684~650 million). Slicing it
> up into smaller chunks will work though.
>
> I've used variants of the code below before to avoid the VARSTOCASES, I have
> no idea though how long this will take with near a million cases and
> (684*(684-1)/2~234,000) comparisons. How big the strings are I suspect will
> make a difference as well.
>
> ******************************************************.
> data list free / str1 to str4 (4A1).
> begin data
> a a a a
> a b c d
> a a c d
> a b b d
> a b c c
> end data.
> dataset name string.
>
> *ID duplicates - note the last string can never be a duplicate.
> vector flag(3,F1.0).
> vector str = str1 to str4.
> loop #i = 1 to 3.
> loop #j = #i+1 to 4.
> if str(#i) = str(#j) flag(#i) = 1.
> end loop.
> end loop.
> recode flag1 to flag3 (SYSMIS = 0)(ELSE = COPY).
> exe.
> ******************************************************.
>
Reply | Threaded
Open this post in threaded view
|

Re: Identify Duplicates Rowwise

David Marso
Administrator
"XSAVE outfile= */keep= ID loc strval ."
Really ;-)  We'll blame it on Friday !

I'd say this beast calls for

DATASET DECLARE agg.
VARSTOCASES /ID=id /MAKE strvar FROM S001 TO S700 .
AGGREGATE   OUTFILE agg / BREAK ID strvar / Count=N.
DATASET ACTIVATE agg.
CASESTOVARS  /ID=id .

VARSTOCASES happens almost instantaneously on my machine.

Rich Ulrich wrote
I don't know why a VarsToCases this simple should take a lot of time,
but here is the XSAVE version that does the same thing.  If some of the
strings are already blank (is 684 just a maximum?), you don't have to write them.  
This XSAVE should be quick (relatively) since the sort-and-compare  by AGGREGATE
is what entails a huge number of steps for each line.

As Andy points out, there are a whole lot of comparisons to be done, and
any process could be time-consuming for a million instances.  It could be
useful to set N OF CASES to 1000  in order to benchmark alternate versions.

*example using A10.
string  strval(A10) .
vector str = str1 to str684.
loop loc = 1 to 684
compute strval= str(loc).
XSAVE outfile= */keep= ID loc strval .
end loop.
EXE.

--
Rich Ulrich

> Date: Fri, 20 Sep 2013 06:03:50 -0700
> From: [hidden email]
> Subject: Re: Identify Duplicates Rowwise
> To: [hidden email]
>
> This is going to be pretty painful anyway you slice it I believe. While Rich
> gives good advice, a VARSTOCASES that large will take a very long time and
> may not be possible with the full set (947475*684~650 million). Slicing it
> up into smaller chunks will work though.
>
> I've used variants of the code below before to avoid the VARSTOCASES, I have
> no idea though how long this will take with near a million cases and
> (684*(684-1)/2~234,000) comparisons. How big the strings are I suspect will
> make a difference as well.
>
> ******************************************************.
> data list free / str1 to str4 (4A1).
> begin data
> a a a a
> a b c d
> a a c d
> a b b d
> a b c c
> end data.
> dataset name string.
>
> *ID duplicates - note the last string can never be a duplicate.
> vector flag(3,F1.0).
> vector str = str1 to str4.
> loop #i = 1 to 3.
>   loop #j = #i+1 to 4.
>     if str(#i) = str(#j) flag(#i) = 1.
>   end loop.
> end loop.
> recode flag1 to flag3 (SYSMIS = 0)(ELSE = COPY).
> exe.
> ******************************************************.
>
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"