SPSSX Discussion

identifying duplicate/similar cases

Classic

List

Threaded

4 messages Options

Marin Vladu

identifying duplicate/similar cases

Hello everybody!

I need to identify the cases which are not 100% identical (duplicates) but over 80% identical.

Is this possible?

Many thanks!

Marin

---------------------------------
Talk is cheap. Use Yahoo! Messenger to make PC-to-Phone calls. Great rates starting at 1¢/min.

Maguin, Eugene

Re: identifying duplicate/similar cases

Manu,

I think that responders may need more information to give you a good answer.
But, I'll give you an answer. Since I don't know anything else about your
problem, I'll just make stuff up to suit me when I need to. I'll also assume
that you are pretty good with syntax.

Suppose you have two files that you've added together and they are sorted in
so that records from file 1 and those from file 2 are interleaved by the
common id variable, called id. There are no duplicate ids in either file.

Suppose you want an 80% match on 10 variables, v1 to v10, all numeric with
no missing data.

So use a do repeat and count the matches.

Compute rec=1.
Compute matches=0.
Do repeat var=v1 to v10.
+ do if (id eq lag(id)).
+ compute rec=2. /* I agree. Not an elegant line.
+ if (var eq lag(var)) matches=matches+1.
+ end if.
End repeat.

Select if (rec eq 2 and match ge 8).

There you go. If this isn't what you want post more information back to the
list.

Gene Maguin

Marin Vladu

Re: identifying duplicate/similar cases

Hi Gene,

Thanks for your answer.
I'll try to be more precise about the issue: I have a data base with many variables (more than 200) which are both numeric and string, with and without blanks.
I need to flag/list cases (by an ID variable) which match at least 80%.

I hope this makes things more clear.

Thanks a lot!

Marin

Gene Maguin <[hidden email]> wrote:
Manu,

I think that responders may need more information to give you a good answer.
But, I'll give you an answer. Since I don't know anything else about your
problem, I'll just make stuff up to suit me when I need to. I'll also assume
that you are pretty good with syntax.

Suppose you have two files that you've added together and they are sorted in
so that records from file 1 and those from file 2 are interleaved by the
common id variable, called id. There are no duplicate ids in either file.

Suppose you want an 80% match on 10 variables, v1 to v10, all numeric with
no missing data.

So use a do repeat and count the matches.

Compute rec=1.
Compute matches=0.
Do repeat var=v1 to v10.
+ do if (id eq lag(id)).
+ compute rec=2. /* I agree. Not an elegant line.
+ if (var eq lag(var)) matches=matches+1.
+ end if.
End repeat.

Select if (rec eq 2 and match ge 8).

There you go. If this isn't what you want post more information back to the
list.

Gene Maguin

---------------------------------
Get your own web address for just $1.99/1st yr. We'll help. Yahoo! Small Business.

Maguin, Eugene

Re: identifying duplicate/similar cases

Marin,

The presence of string variables adds some complexity but not much. You
didn't mention it but I'll now assume that you have missing data on both
your numeric and string variables. It sounds like you have a single file
rather than the two files that I assumed yesterday. I also assumed that in
the merged files (or the present single file) you had some sets of exactly
two records each having the same id number. It occurs to me that you may
actually have sets of varying numbers of records each having the same id
number. This is a minor variant. It also occurs to me that you might have a
set of records with each having a unique id number. If this is true, the
following code won't work at all (nor will yesterdays).

Sort cases by id.
* number records within set.
Compute rec=1.
if (id eq lag(id)) rec=lag(rec)+1.

* this is for numeric variables. I think that a more skillful writer
* can use python or sax basic to write do if structure that branches
* depending on the type of variable. I'll assume that you have
* separated your variables into a list of numerics and a list of
* strings.
* note also that each record in a set except the first record is
* compared to the previous record. If there are more than two records
* in a set, you may want a different comparison.

Compute matches=0.
Do repeat var=v1 to v10. /* this is your numeric list.
+ do if (id eq lag(id)).
+ if (value(var) eq value(lag(var))) matches=matches+1.
* I'm not sure about whether it is value(lag(var)) or
* lag(value(var)). If one doesn't work, try the other.
+ end if.
End repeat.

* there is a discussion of this in the syntax reference. Short
* strings can have missing values (but not sysmis). Long strings
* can not. Short strings are 8 characters or less. I've assumed
* that your short strings have missing values. Thus, the variable
* list must include only short strings.
Do repeat var=ss1 to ss10. /* this is your short string list.
+ do if (id eq lag(id)).
+ if (value(var)) eq value(lag(var))) matches=matches+1.
+ end if.
End repeat.

Do repeat var=ls1 to ls10. /* this is your long string list.
+ do if (id eq lag(id)).
+ if (var eq lag(var)) matches=matches+1.
+ end if.
End repeat.

* what you next depends on what you want to do.
* if you want to delete 80% match records.

Select if (match ge <insert 80% number here>).

* if you want to pullout those sets whose members have an 80%
* match. Remember, record 1 has the same match frequency to
* record 2 as record 2 does to record 1.

Sort cases by id (a) rec (d).
if (id eq lag(id) and rec eq 1) matches=lag(matches).

Select if (match ge <insert 80% number here>).

Gene Maguin