Comparing cases for similarity

Silviu Matei
Hi,

I have a question: how can I compare cases in SPSS and compute a similarity
ratio?
The problem looks like this: say I have a data file with 1000 cases and 500
variables.
I need a case-by-case comparison and a similarity ratio for each pair of
cases. The algorithm would compare the value of each variable, case by
case. If two cases have the same value in variable V1, this adds 1 to the
similarity score.
It seems clear to me that I need to create a new file containing (n-1)*n/2
cases (the number of all pairs of cases) in which to store these
comparison values.
I do not know what syntax to use in order to create vectors of cases instead
of vectors of variables.
As a matter of fact, I solved this for small data sets by flipping the
data file so that I compare variables. The trouble is that for a data file
with 1000 cases, I would need to create 999*1000/2=499500 variables in the
data file. I don't even know whether this can be done in SPSS: several times
I left my computer running overnight and it produced nothing.
Could you please help me with this?
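
For illustration only, the pair count above can be checked with a short
Python sketch (not SPSS syntax). The function name count_pairs is made up
for this example; the 5-case enumeration is a hypothetical small file.

```python
from itertools import combinations
from math import comb


def count_pairs(n_cases: int) -> int:
    """Number of unordered pairs of distinct cases: n * (n - 1) / 2."""
    return n_cases * (n_cases - 1) // 2


# 1000 cases give 499500 pairs, the figure quoted in the post.
assert count_pairs(1000) == comb(1000, 2) == 499500

# For a hypothetical 5-case file, each case is paired only with later cases.
pairs = list(combinations(range(1, 6), 2))
print(len(pairs), pairs[:4])  # 10 [(1, 2), (1, 3), (1, 4), (1, 5)]
```

Storing one row per pair (499500 rows) rather than one variable per pair
keeps the number of variables small.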

I also tried using a statistical similarity measure (Euclidean distance),
but this is not what I need. The similarity ratio I need would be
computed as follows: if 2 cases take the same value in a variable, we
count it as 1; if the values are different, we count it as 0. We know the
total number of variables, so we just count the variables where the
2 cases had the same value. Dividing this count by the total number of
variables gives the similarity ratio (the proportion of variables taking
the same value).
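
A minimal Python sketch of this ratio (not SPSS syntax), assuming each case
is a plain list of values with no missing data. The function name and the
sample values are hypothetical.

```python
from typing import Sequence


def similarity_ratio(case_a: Sequence, case_b: Sequence) -> float:
    """Share of variables on which two cases hold exactly the same value."""
    if len(case_a) != len(case_b) or len(case_a) == 0:
        raise ValueError("cases must have the same, non-zero number of variables")
    # Each variable contributes 1 if the two values are equal, 0 otherwise.
    matches = sum(1 for a, b in zip(case_a, case_b) if a == b)
    return matches / len(case_a)


# Hypothetical cases with 4 variables: they agree on the 1st and 3rd.
ratio = similarity_ratio([1, 0, 2, 2], [1, 1, 2, 0])
print(ratio)  # 0.5
```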

Thanks a lot for your help!

Silviu

Re: Comparing cases for similarity

hillel vardi
Shalom

Here is syntax for 15 cases and 4 variables which creates 15 * 15 cases
and 4 + 4 variables. You can change the syntax to 1000 cases with 500
variables.
The result will be 1000000 cases and 1000 variables, which is fairly big.
You will need SPSS 14 or later for the DATASET commands to work;
otherwise you will have to save the files.
I must say that I am quite skeptical whether computing a similarity ratio
is the right way of dealing with your problem.

Hillel vardi

* Build a demo file of 15 cases and 4 variables of random integers.
title 'Comparing cases for similarity'.
input program.
loop case_file1 = 1 to 15.
compute var1 = trunc(uniform(case_file1) * 60).
compute var2 = trunc(uniform(case_file1) * 60).
compute var3 = trunc(uniform(case_file1) * 60).
compute var4 = trunc(uniform(case_file1) * 60).
end case.
end loop.
end file.
end input program.
* dummi is a constant key used to attach every case ID to every case.
compute dummi = 1.
dataset name data1.
execute.

* Put all 15 case IDs side by side in a single wide record.
DATASET COPY wide.
DATASET ACTIVATE wide.
SORT CASES BY dummi.
CASESTOVARS
 /ID = dummi
 /DROP = var1 to var4
 /SEPARATOR = ""
 /GROUPBY = VARIABLE.

* Attach the wide record to each original case.
MATCH FILES /FILE = data1
 /TABLE = wide
 /BY dummi.
EXECUTE.

* Restructure to one record per ordered pair of case_file1 and case_file2.
VARSTOCASES /MAKE case_file2 FROM case_file11 to case_file115
 /INDEX = Index1(15)
 /KEEP = case_file1 var1 var2 var3 var4 dummi
 /NULL = KEEP.
dataset name long1.

* Look up the values of the second case as bvar1 to bvar4.
SORT CASES BY case_file2.
match files file = long1
 /table = data1
 /rename = (case_file1 var1 to var4 = case_file2 bvar1 to bvar4)
 /by case_file2
 /drop = dummi Index1.
execute.




Re: Comparing cases for similarity

Silviu Matei
Hi, and thanks a lot for your feedback. Unfortunately, I was not clear
enough in my explanation.
Suppose I have this datafile (4 variables x 5 cases):
          Var1 Var2 Var3 Var4
case1     1    2    3    4
case2     2    3    4    1
case3     3    4    1    2
case4     4    1    2    3
case5     1    2    3    5

And I need to see whether the values of each variable are the same for each
pair of cases (1 if they are the same, 0 if they differ):

The final table would look like this:
         W1    W2   W3   W4    Nvar   SumW1_W4      Similar
case1_2   0    0    0    0     4      0             0
case1_3   0    0    0    0     4      0             0
case1_4   0    0    0    0     4      0             0
case1_5   1    1    1    0     4      3             0.75
case2_3   0    0    0    0     4      0             0
case2_4   0    0    0    0     4      0             0
case2_5   0    0    0    0     4      0             0
case3_4   0    0    0    0     4      0             0
case3_5   0    0    0    0     4      0             0
case4_5   0    0    0    0     4      0             0

Please note that the final purpose is to see whether there are pairs of
cases with a similarity ratio higher than a specified threshold.
Note also that there is no point in comparing case1 to itself; each case
only has to be compared with those coming after it.
Nvar is the total number of variables. SumW1_W4 is sum(W1, W2, W3, W4).
Similar = SumW1_W4 / Nvar.
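
For illustration, the table above can be reproduced with a short Python
sketch (not SPSS syntax) using the five cases from this post. The helper
name compare and the threshold of 0.5 are assumptions made for the example.

```python
from itertools import combinations

# The 5 cases x 4 variables from the post.
cases = {
    "case1": [1, 2, 3, 4],
    "case2": [2, 3, 4, 1],
    "case3": [3, 4, 1, 2],
    "case4": [4, 1, 2, 3],
    "case5": [1, 2, 3, 5],
}


def compare(a, b):
    """Return the 0/1 match flags (W1..Wk) for two cases."""
    return [int(x == y) for x, y in zip(a, b)]


rows = []
# combinations() pairs each case only with the cases that come after it.
for (name_a, vals_a), (name_b, vals_b) in combinations(cases.items(), 2):
    flags = compare(vals_a, vals_b)
    nvar = len(flags)
    total = sum(flags)
    label = f"{name_a}_{name_b[4:]}"  # e.g. "case1_5"
    rows.append((label, flags, nvar, total, total / nvar))

for label, flags, nvar, total, similar in rows:
    print(label, flags, nvar, total, similar)

threshold = 0.5  # hypothetical cut-off, not taken from the post
above = [label for label, *_, similar in rows if similar > threshold]
print(above)  # ['case1_5']
```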

Could anyone help me, please?
Thanks a lot in advance!

Re: Comparing cases for similarity

Silviu Matei
Hi Arthur, and thanks a lot for your help. Your solution is the closest to
what I need. One question, though: why do you think SPSS drops some cases
from the final matrix?
I ran your code on a data file with 1023 cases and the matrix obtained has
only 459 rows and columns. I should mention that I recoded SYSMIS values to
zeros because it would not work otherwise.
Thanks again!

Silviu

Fwd: Comparing cases for similarity

flo statistik
Hi Silviu,

I've got no time to program this, but you should go into the matrix mode of
SPSS:

matrix.

end matrix.

There you can loop over cases and variables and create a new k x k matrix
with your similarity measures.
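
The looping idea can be sketched as follows. This is Python, not SPSS
matrix syntax, and is only meant to show the structure of the loops; the
function name, the sample data, and the choice to put 1.0 on the diagonal
are assumptions.

```python
def similarity_matrix(data):
    """Build a k x k matrix of match proportions for k cases."""
    k = len(data)
    n_vars = len(data[0])
    sim = [[0.0] * k for _ in range(k)]
    for i in range(k):
        sim[i][i] = 1.0  # a case always matches itself (assumed convention)
        for j in range(i + 1, k):  # only compare with later cases
            matches = 0
            for v in range(n_vars):  # loop over variables
                if data[i][v] == data[j][v]:
                    matches += 1
            # The measure is symmetric, so fill both halves of the matrix.
            sim[i][j] = sim[j][i] = matches / n_vars
    return sim


# Hypothetical data: 3 cases x 4 variables.
data = [
    [1, 2, 3, 4],
    [1, 2, 0, 0],
    [9, 2, 3, 4],
]
sim = similarity_matrix(data)
for row in sim:
    print(row)
```

With these values, cases 1 and 2 agree on 2 of 4 variables (0.5), cases 1
and 3 on 3 of 4 (0.75), and cases 2 and 3 on 1 of 4 (0.25).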

Re: Comparing cases for similarity

Art Kendall-2
Are you sure they are not in the output file?

Click inside the output just below the 459th case and see if there is
more listing below the little arrow.



Do you also have missing values where you know why the values are
missing (USER missing)?

Why did you have SYSMIS values, rather than user missing values?
Was your input data unreadable by the format you specified or did you do
transformations where arguments were outside the legitimate domain of
functions?

Depending on why you are looking at similarity, how you deal with
missing values may make a difference.
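
As a hedged illustration of that point, the Python sketch below (not SPSS
syntax) compares two possible treatments of missing values. Missing values
are represented as None, and both function names and the sample cases are
hypothetical.

```python
def ratio_missing_as_zero(a, b):
    """Recode missing (None) to 0 first, so two missing values count as a match."""
    a0 = [0 if x is None else x for x in a]
    b0 = [0 if x is None else x for x in b]
    return sum(x == y for x, y in zip(a0, b0)) / len(a0)


def ratio_pairwise_complete(a, b):
    """Use only the variables where both cases have a non-missing value."""
    valid = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not valid:
        return None  # no variable can be compared
    return sum(x == y for x, y in valid) / len(valid)


# Hypothetical cases: both are missing on the 2nd and 3rd variables.
case_a = [1, None, None, 3]
case_b = [1, None, None, 2]

print(ratio_missing_as_zero(case_a, case_b))    # 0.75
print(ratio_pairwise_complete(case_a, case_b))  # 0.5
```

Recoding to zero treats shared missingness as agreement, which raises the
ratio here from 0.5 to 0.75.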

Art Kendall
Social Research Consultants
