Identify Duplicate Cases

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Identify Duplicate Cases

David J. Peng

Hi everyone,

 

Does anybody know a way to get around the 64 variable limit for the Identify Duplicate Cases tool? Or another solution for large data with tons of variables?

 

Thanks,

 

David

Reply | Threaded
Open this post in threaded view
|

Re: Identify Duplicate Cases

mpirritano

You can get around this by running if statements that include all of your criteria for example:

 

If (a=1) & (c=2) & ((d=3) or (d=7)) dup = 1.

 

I don’t think there’s a 64 variable limit on an if statement. Then you just filter on the dup variable.

 

Thanks

Matt

 

Matthew Pirritano, Ph.D.

Research Analyst IV

Medical Services Initiative (MSI)

Orange County Health Care Agency

(714) 568-5648


From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of David J. Peng
Sent: Monday, May 04, 2009 12:51 PM
To: [hidden email]
Subject: Identify Duplicate Cases

 

Hi everyone,

 

Does anybody know a way to get around the 64 variable limit for the Identify Duplicate Cases tool? Or another solution for large data with tons of variables?

 

Thanks,

 

David

Reply | Threaded
Open this post in threaded view
|

Re: Identify Duplicate Cases

mpirritano
In reply to this post by David J. Peng

Actually more appropriately it would be a bunch of if statements using the lag function:

 

First you would sort by all of the variables you want to id dups on.

 

Then just say:

 

Do if (lag(a) ne a) or

       (lag(b) ne b) … or

       (lag(z) ne z).

             Compute count = 1.

Else if (lag(a) = a) &

          (lag(b) = b) … &

          (lag(z) = z)

             Compute count = count + 1.

End if

 

 

Since you’ve got more than 64 I would advise using excel to create the syntax. I use the drag function to duplicate lots of duplicate syntax and the concatenate to create the full syntax.

Each line in excel could be the column (lag(  [new column] ‘a’ [just copy and past from the variable view all of your variables] ) new column ne new column ‘a’ [copy and past your variables again] new column )  Then you just drag all of the redundant components to get as many as you need. Concatenate the columns and then paste the finished product into spss syntax.

 

Hope that’s sort of clear.  There’s probably a simpler way. But that’ s how I’d do it. Relatively quick and painless though it would take up a lot of page space.

 

Then you just select only the 1’s.

 

matt

 

Matthew Pirritano, Ph.D.

Research Analyst IV

Medical Services Initiative (MSI)

Orange County Health Care Agency

(714) 568-5648


From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of David J. Peng
Sent: Monday, May 04, 2009 12:51 PM
To: [hidden email]
Subject: Identify Duplicate Cases

 

Hi everyone,

 

Does anybody know a way to get around the 64 variable limit for the Identify Duplicate Cases tool? Or another solution for large data with tons of variables?

 

Thanks,

 

David

Reply | Threaded
Open this post in threaded view
|

Re: Identify Duplicate Cases

Peck, Jon
In reply to this post by David J. Peng

Although a few workarounds have been posted, this seems like a pretty unusual request.  What is it you are really trying to do?

 

And note that the SORT command, which you would need in order to arrange the cases for detecting duplicates in adjacent ones, has a 64 variable limit, which is why IDC is limited to 64.  With care you can get around that with repeated sorts, but still…

 

 

Regards,

Jon Peck

 


From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of David J. Peng
Sent: Monday, May 04, 2009 1:51 PM
To: [hidden email]
Subject: [SPSSX-L] Identify Duplicate Cases

 

Hi everyone,

 

Does anybody know a way to get around the 64 variable limit for the Identify Duplicate Cases tool? Or another solution for large data with tons of variables?

 

Thanks,

 

David

Reply | Threaded
Open this post in threaded view
|

Re: Identify Duplicate Cases

Richard Ristow
In reply to this post by mpirritano
At 07:08 PM 5/4/2009, Pirritano, Matthew wrote:

[You could use] of IF statements using the lag function. First you would sort by all of the variables you want to id dups on.
 
Then just say:
 
Do if (lag(a) ne a) or
       (lag(b) ne b) … or
       (lag(z) ne z).
             Compute count = 1.
Else if (lag(a) = a) &
          (lag(b) = b) … &
          (lag(z) = z)
             Compute count = count + 1.
End if

Since you’ve got more than 64 I would advise using excel to create the syntax.

Probably even easier with DO REPEAT, like this (untested). As in the above example, it only works if LEAVE is specified for "count". And treating the first case as special is necessary. (And take care what happens if you have missing data.) Code not tested:

COMPUTE    #Dup = 1.
DO IF    $CASENUM EQ 1.
.  COMPUTE #Dup  = 0.
ELSE.
.  DO REPEAT TestVar = a TO z.
.     IF LAG(TestVar) NE TestVar
.          #Dup  = 0.
.  END REPEAT.

DO IF   #Dup.
.  COMPUTE Count = Count + 1.
ELSE.
.  COMPUTE Count = 1.
END IF.

====================To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD