|
Hi,
I have a question: how can I compare cases in SPSS and compute a similarity ratio?

The problem looks like this: say I have a datafile with 1000 cases and 500 variables. I need a case-by-case comparison and a similarity ratio computed for each pair of cases. The algorithm would compare the value of each variable, case by case. If two cases have the same value in variable V1, this adds to the similarity score.

It seems clear to me that I need to create a new file containing (n-1)*n/2 cases (the number of all pairs of cases) in which to store these comparison values. I do not know what syntax to use in order to create vectors of cases instead of vectors of variables.

As a matter of fact, I solved this for small data sets by flipping the datafile so that I compare variables. The trouble is that for a datafile with 1000 cases, I would need to create 999*1000/2 = 499500 variables in the datafile. I don't even know if this can be done in SPSS, as several times I left my computer running overnight and it produced nothing. Could you please help me with this?

I also tried a statistical measure from the similarities procedures (Euclidean distance), but this is not what I need. The similarity ratio I need would be computed as follows: if 2 cases take the same value in a variable, we count it as 1; if the values are different, we count it as 0. We know the total number of variables, so we just count the variables where the 2 cases had the same value. By dividing this by the total number of variables, we get the similarity ratio (the proportion of variables taking the same value).

Thanks a lot for your help! Silviu |
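Written out, the ratio described in the question can be stated as below. The symbols are notation assumed here for illustration and do not come from the post: n is the number of cases, p the number of variables, and x_ik the value of variable k for case i.

```latex
% Similarity ratio between cases i and j: share of the p variables with equal values
S_{ij} = \frac{1}{p} \sum_{k=1}^{p} \mathbf{1}\left[ x_{ik} = x_{jk} \right],
\qquad 1 \le i < j \le n

% Number of distinct pairs for n = 1000 cases
\frac{n(n-1)}{2} = \frac{1000 \cdot 999}{2} = 499500
```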
|
Shalom
Here is syntax for 15 cases and 4 variables which creates 15 * 15 cases and 4 + 4 variables. You can change the syntax to 1000 cases with 500 variables. The result will be 1,000,000 cases and 1000 variables, which is fairly big. You will need SPSS 14 or later for the DATASET commands to work; otherwise you will have to save the files. I must say that I am quite skeptical whether computing a similarity ratio is the right way of dealing with your problem.

Hillel Vardi

run name Comparing cases for similarity.

* Generate 15 test cases with 4 random variables each.
input program.
loop case_file1 = 1 to 15.
compute var1 = trunc(uniform(case_file1) * 60).
compute var2 = trunc(uniform(case_file1) * 60).
compute var3 = trunc(uniform(case_file1) * 60).
compute var4 = trunc(uniform(case_file1) * 60).
end case.
end loop.
end file.
end input program.
compute dummi = 1.
dataset name data1.
execute.

* Put all case ids on one wide record, then attach that record to every case.
DATASET COPY wide.
DATASET ACTIVATE wide.
SORT CASES BY dummi.
CASESTOVARS
  /ID = dummi
  /DROP = var1 to var4
  /SEPARATOR = ""
  /GROUPBY = VARIABLE.
MATCH FILES
  /FILE = data1
  /TABLE = wide
  /BY dummi.
EXECUTE.

* Expand to one record per (case_file1, case_file2) pair.
VARSTOCASES
  /MAKE case_file2 FROM case_file11 to case_file115
  /INDEX = Index1(15)
  /KEEP = case_file1 var1 var2 var3 var4 dummi
  /NULL = KEEP.
dataset name long1.

* Add the second case's values as bvar1 to bvar4.
SORT CASES BY case_file2.
match files
  /file = long1
  /table = data1
  /rename = (case_file1 var1 to var4 = case_file2 bvar1 to bvar4)
  /by case_file2
  /drop = dummi Index1.
execute.
|
|
In reply to this post by Silviu Matei
Hi and thanks a lot for your feedback. Unfortunately, I was not clear enough with my explanation. Suppose I have this datafile (4 variables x 5 cases):

        Var1  Var2  Var3  Var4
case1      1     2     3     4
case2      2     3     4     1
case3      3     4     1     2
case4      4     1     2     3
case5      1     2     3     5

I need to see whether the values of each variable are the same for each pair of cases (if the same, I put 1; if different, I put 0). The final table would look like this:

          W1  W2  W3  W4  Nvar  SumW1_W4  Similar
case1_2    0   0   0   0     4         0     0
case1_3    0   0   0   0     4         0     0
case1_4    0   0   0   0     4         0     0
case1_5    1   1   1   0     4         3     0.75
case2_3    0   0   0   0     4         0     0
case2_4    0   0   0   0     4         0     0
case2_5    0   0   0   0     4         0     0
case3_4    0   0   0   0     4         0     0
case3_5    0   0   0   0     4         0     0
case4_5    0   0   0   0     4         0     0

Please note that the final purpose is to see whether there are cases with a similarity ratio higher than a specified threshold. I should add that there is no use in comparing case1 to itself; each case only has to be compared with those coming after it. Nvar is the total number of variables. SumW1_W4 is sum(W1, W2, W3, W4). Similar = SumW1_W4 / Nvar.

Could anyone help me, please? Thanks a lot in advance! |
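The pairwise table described in this post can be sketched outside SPSS to check the logic. The following is a minimal Python illustration, not SPSS syntax and not anyone's posted solution; the names `similarity_ratio`, `pairs`, `flagged` and the cut-off `THRESHOLD = 0.5` are hypothetical choices made for the example.

```python
from itertools import combinations

# Toy data copied from the post: 5 cases x 4 variables.
cases = {
    "case1": [1, 2, 3, 4],
    "case2": [2, 3, 4, 1],
    "case3": [3, 4, 1, 2],
    "case4": [4, 1, 2, 3],
    "case5": [1, 2, 3, 5],
}


def similarity_ratio(a, b):
    """Share of variables on which two cases hold the same value."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


# combinations() pairs each case only with the cases that come after it,
# so no case is compared with itself and no pair is counted twice.
pairs = {
    (first, second): similarity_ratio(cases[first], cases[second])
    for first, second in combinations(cases, 2)
}

THRESHOLD = 0.5  # hypothetical cut-off, chosen only for this example
flagged = {pair: ratio for pair, ratio in pairs.items() if ratio > THRESHOLD}
print(flagged)
```

With the toy data this yields 10 pairs, and only the case1/case5 pair exceeds the assumed threshold.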
|
In reply to this post by Silviu Matei
Hi Arthur and thanks a lot for your help. Your solution is the closest to what I need. One question though: why do you think SPSS drops some cases from the final matrix? I ran your code on a datafile with 1023 cases and the matrix obtained has only 459 rows x columns. I should mention that I recoded sysmis values into zeros because otherwise it would not work. Thanks again! Silviu |
|
In reply to this post by Silviu Matei
Hi Silviu,
I've got no time to program this, but you should go into the matrix mode of SPSS (MATRIX. ... END MATRIX.). There you can loop over cases and variables and create a new k x k matrix with your similarity measures. |
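The looping idea suggested here can be sketched as follows. This is a plain Python illustration of the nested loops, not SPSS MATRIX syntax; the function name `similarity_matrix` and the sample rows are assumptions made for the example. The result is square with one row and one column per case.

```python
def similarity_matrix(rows):
    """Build a symmetric cases-by-cases matrix of matching ratios."""
    n = len(rows)       # number of cases
    p = len(rows[0])    # number of variables
    # Diagonal is 1.0 because every case matches itself on all variables.
    sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n - 1):
        for j in range(i + 1, n):
            matches = sum(1 for v in range(p) if rows[i][v] == rows[j][v])
            # Fill both triangles so the matrix can be read either way.
            sim[i][j] = sim[j][i] = matches / p
    return sim


# Hypothetical sample: three cases with four variables each.
rows = [
    [1, 2, 3, 4],
    [2, 3, 4, 1],
    [1, 2, 3, 5],
]
sim = similarity_matrix(rows)
for line in sim:
    print(line)
```

Only the upper triangle is actually computed, which matches the n*(n-1)/2 comparisons mentioned earlier in the thread.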
|
In reply to this post by Silviu Matei
Are you sure they are not in the output file?
Click inside the output just below the 459th case and see if there is more of the listing below the little arrow. Do you also have missing values where you know why the values are missing (USER missing)? Why did you have SYSMIS values rather than user-missing values? Was your input data unreadable by the format you specified, or did you do transformations where arguments were outside the legitimate domain of functions? Depending on why you are looking at similarity, how you deal with missing values may make a difference. Art Kendall Social Research Consultants |
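To illustrate why the treatment of missing values can change the ratio, here is a small Python sketch (not SPSS syntax). Representing a missing value as `None`, and the `skip_missing` switch, are assumptions made for this example only.

```python
def similarity_ratio(a, b, skip_missing=True):
    """Matching ratio for two cases; None stands for a missing value.

    skip_missing=True  : variables missing in either case are left out
                         of both the numerator and the denominator.
    skip_missing=False : missing is treated like any other value, so two
                         missing entries count as a match (the same effect
                         as recoding missing to a common code such as 0).
    """
    pairs = list(zip(a, b))
    if skip_missing:
        pairs = [(x, y) for x, y in pairs if x is not None and y is not None]
    if not pairs:
        return None  # nothing left to compare
    return sum(1 for x, y in pairs if x == y) / len(pairs)


# Hypothetical pair of cases, both missing on the second variable.
case_a = [1, None, 3, 4]
case_b = [1, None, 3, 5]

print(similarity_ratio(case_a, case_b))                      # pairwise exclusion
print(similarity_ratio(case_a, case_b, skip_missing=False))  # missing counted as a match
```

In this example, excluding the missing variable gives 2 matches out of 3 compared variables, while counting it as a match gives 3 out of 4, so the choice can move a pair across a threshold.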
