Hi all,

I have a file laid out as:

ID String1 String2 String3 String4 ... String684
1
2
3
...
947475

That is, I have 684 string variables and 947,475 cases. I need to identify duplicates within String1 to String684 for each case and remove those values row-wise, not column-wise as SPSS does by default.

Any syntax or suggestion will help me. I have tried using restructure and transpose but was not successful.

--
Regards,
Details?
Everything simple starts with going from Wide to Long, but ... The file sounds too big for FLIP, but writing with VARSTOCASES, or merely using LOOP with XSAVE, is simple. What failed for you?

In Long form: sort, and eliminate dups by AGGREGATE or LAG(), and then there is the step of reconstructing the Wide records. Lines *can* be sorted back to the original order, if you want, and LAG() used to insert a sequential ordering without gaps.

Do the strings start out sorted in order? Is it okay if they end up that way? If not sorted: do you want the result to have blanks, or be crunched to the left? Use the position of the earliest one?

-- Rich Ulrich

> Date: Fri, 20 Sep 2013 08:50:24 +0530
> From: [hidden email]
> Subject: Identify Duplicates Rowwise
> To: [hidden email]
In reply to this post by Rajeshms
This is going to be pretty painful any way you slice it, I believe. While Rich gives good advice, a VARSTOCASES that large will take a very long time and may not be possible with the full set (947475 * 684 ≈ 650 million records). Slicing it up into smaller chunks will work, though.

I've used variants of the code below before to avoid the VARSTOCASES. I have no idea, though, how long this will take with near a million cases and 684*(684-1)/2 ≈ 234,000 comparisons per case. How big the strings are will make a difference as well, I suspect.

******************************************************.
data list free / str1 to str4 (4A1).
begin data
a a a a
a b c d
a a c d
a b b d
a b c c
end data.
dataset name string.

* ID duplicates - note the last string can never be a duplicate.
vector flag(3,F1.0).
vector str = str1 to str4.
loop #i = 1 to 3.
loop #j = #i+1 to 4.
if str(#i) = str(#j) flag(#i) = 1.
end loop.
end loop.
recode flag1 to flag3 (SYSMIS = 0)(ELSE = COPY).
exe.
******************************************************.
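For readers prototyping outside SPSS, the double-LOOP logic above can be sketched in Python (the function name `flag_row_duplicates` is mine, not from the thread); it flags each position whose value reappears later in the row, just as the nested LOOPs do:

```python
def flag_row_duplicates(row):
    """Flag positions whose value reappears later in the row.

    Mirrors the SPSS double LOOP: position i is flagged (1) when
    row[i] equals some row[j] with j > i, so the *last* occurrence
    of each value is the one left unflagged.  The final position
    can never be a duplicate of a later one, so it gets no flag.
    """
    n = len(row)
    flags = [0] * (n - 1)
    for i in range(n - 1):
        for j in range(i + 1, n):
            if row[i] == row[j]:
                flags[i] = 1
                break
    return flags

# The five test rows from the DATA LIST example above:
rows = [list("aaaa"), list("abcd"), list("aacd"), list("abbd"), list("abcc")]
```

As in the SPSS version, only the first n-1 positions get flags; `list("aacd")` flags position 1 only, while `list("abcd")` flags nothing.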
You can try to use this macro for removing the duplicates row-wise.

To call it, just write: duble var1 var2 var3 ... varn.

define duble (!positional = !charend('|')
             /zm = !default(miss) !charend('|')).
!let !ciag = ''.
!let !jeden = ''.
!let !dwa = ''.
recode !1 (0 = sysmis).
!do !i !in (!1).
!let !jeden = (!i).
!do !j !in (!1).
!let !dwa = (!j).
!if (!jeden <> !dwa) !then
if (!i = !j) !j = $sysmis.
!ifend.
!doend.
!doend.
!do !i !in (!1).
!let !ciag = !concat(!ciag, ',', !i).
!doend.
!let !ciag = !tail(!ciag).
if (nvalid(!ciag) = 0) !zm = 0.
value labels /!zm 0 'missing case'.
!enddefine.
=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L. For a list of commands to manage subscriptions, send the command INFO REFCARD.
=====================
In reply to this post by Rajeshms
I have NO idea what you mean by
"I need to identify duplicates in string 1 to string 684 and remove those values (Row wise)".
Please do us (and yourself) a favor and cobble together a small example of what the data looks like before, and what it should look like when done.
---
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
In reply to this post by Rajeshms
Is there any importance to the order of the values in a case? I.e., if you had 4 variables:

cherry  cherry  apple   cherry
pear    cherry  apple   apple
apple   pear    cherry  cherry
orange  pear    cherry  apple

would this work for you:

apple   cherry  blank   blank
apple   cherry  pear    blank
apple   cherry  pear    blank
apple   cherry  orange  pear

or do you want:

cherry  blank   apple   blank
pear    cherry  apple   blank
apple   pear    cherry  blank
orange  pear    cherry  apple

Also, are there a limited number of values that the 684 variables can take on? What is the purpose of deduplicating the list of 684 variables? I.e., what is the context? There may be another way to reach your goal.
Art Kendall
Social Research Consultants
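Art's two candidate layouts correspond to two different dedup rules, which can be sketched in Python using his cherry/pear example (the function names are mine, not from the thread):

```python
def unique_sorted_left(row, blank="blank"):
    """First layout: keep one copy of each value, sorted and
    crunched to the left, padding the row out with blanks."""
    vals = sorted(set(row))
    return vals + [blank] * (len(row) - len(vals))

def unique_in_place(row, blank="blank"):
    """Second layout: keep each value at its first position and
    blank out every later repeat."""
    seen = set()
    out = []
    for v in row:
        out.append(blank if v in seen else v)
        seen.add(v)
    return out

row = ["cherry", "cherry", "apple", "cherry"]
```

`unique_sorted_left(row)` gives `["apple", "cherry", "blank", "blank"]` and `unique_in_place(row)` gives `["cherry", "blank", "apple", "blank"]`, matching the first rows of the two layouts above.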
In reply to this post by Andy W
I don't know why a VarsToCases this simple should take a lot of time, but here is the XSAVE version that does the same thing. If some of the strings are already blank (is 684 just a maximum?), you don't have to write them. This XSAVE should be quick (relatively), since it is the sort-and-compare by AGGREGATE that entails a huge number of steps for each line.

As Andy points out, there are a whole lot of comparisons to be done, and any process could be time-consuming for a million instances. It could be useful to set N OF CASES to 1000 in order to benchmark alternate versions.

* Example using A10.
string strval (A10).
vector str = str1 to str684.
loop loc = 1 to 684.
compute strval = str(loc).
XSAVE outfile= */keep= ID loc strval .
end loop.
EXE.

-- Rich Ulrich
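Rich's Wide-to-Long strategy - explode each case into (position, value) pairs, keep the earliest position per distinct value, then rebuild the record - can be sketched for a single case in Python (the helper is mine, not Rich's syntax; it implements his "position of the earliest one" option with blanks crunched to the left):

```python
def dedupe_via_long(record, blank=""):
    """Wide -> Long -> dedupe -> Wide for one case.

    Keeps the earliest occurrence of each value and crunches the
    survivors to the left, padding with blanks on the right.
    """
    # Long form: one (loc, value) pair per variable, as LOOP + XSAVE writes.
    long_rows = list(enumerate(record))
    # The sort-and-eliminate step: remember the first loc per value.
    first_loc = {}
    for loc, val in long_rows:
        first_loc.setdefault(val, loc)
    # Reconstruct the Wide record, ordered by earliest position.
    kept = sorted(first_loc, key=first_loc.get)
    return kept + [blank] * (len(record) - len(kept))
```

On a file this size the real work is the per-value grouping; in SPSS that is the SORT plus AGGREGATE (or LAG) pass over the Long file.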
"XSAVE outfile= */keep= ID loc strval ."
Really ;-) We'll blame it on Friday!

I'd say this beast calls for:

DATASET DECLARE agg.
VARSTOCASES /ID=id /MAKE strvar FROM S001 TO S700 .
AGGREGATE OUTFILE agg / BREAK ID strvar / Count=N.
DATASET ACTIVATE agg.
CASESTOVARS /ID=id .

VARSTOCASES happens almost instantaneously on my machine.
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
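The shape of David's pipeline (restructure long, AGGREGATE on id + value, restructure back) can be mimicked for a small in-memory table in plain Python, with a Counter standing in for the AGGREGATE Count=N step (the function name and dict layout are mine, for illustration only):

```python
from collections import Counter

def dedupe_cases(cases):
    """Mimic VARSTOCASES -> AGGREGATE -> CASESTOVARS for a small dict.

    cases maps each id to its list of string values.  AGGREGATE with
    BREAK id value collapses the repeats (Counter keeps the Count=N),
    and the restructure back yields one deduplicated row per id, in
    AGGREGATE's sorted break order.
    """
    result = {}
    for case_id, values in cases.items():
        counts = Counter(values)      # BREAK id value / Count=N
        result[case_id] = sorted(counts)
    return result
```

Note this leaves each row as long as its distinct-value count; CASESTOVARS would similarly produce ragged rows padded out to the longest case.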