Postscript. At 06:17 PM 11/13/2006, I wrote:
>I'd try something like this (not tested):
>
>* Prepare the variables to be "left": .
>NUMERIC #X1 TO #X10 (F5.3) /* or, any other format .
>
>* Check for match with previous case, as before: .
>COMPUTE MATCH=0.
>DO IF (FAMID EQ LAG(FAMID) AND PARID EQ LAG(PARID)).

[Etc. See previous posting for the complete code.]

I don't think this logic, or the earlier logic in this thread (if it had worked as intended), will detect records - combinations of the key variables - that are found in only one of the two files being compared.
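One possible way to catch records that occur in only one file is to match the two files on the key variables and let SPSS flag where each case came from. The following is an untested sketch, not code from this thread: the file names FILE_A.SAV and FILE_B.SAV and the flag variables IN_A and IN_B are made up for illustration, and both files are assumed to be sorted by FAMID and PARID with no duplicate key combinations.

```spss
* Sketch only. FILE_A.SAV and FILE_B.SAV are hypothetical names.
* Both files must already be sorted by FAMID and PARID.
MATCH FILES
  /FILE='FILE_A.SAV' /IN=IN_A
  /FILE='FILE_B.SAV' /IN=IN_B
  /BY FAMID PARID.
* IN_A and IN_B are 1 when the key combination occurs in that file.
* Keep only key combinations present in exactly one file.
SELECT IF (IN_A + IN_B EQ 1).
LIST VARIABLES=FAMID PARID IN_A IN_B.
```

If LIST shows no cases, every key combination occurs in both files. The value-by-value comparison still has to be done separately.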
In reply to this post by Fiona Graff
Hello Fiona,
This is the one reason that we also buy a license for SPSS Data Entry Builder. With this program you can compare two datasets and receive an extensive report on cases that are missing from one dataset or the other, and on cases that appear in both datasets but differ in one or more variables.

Unfortunately I cannot reproduce the exact way to do it, as our license has expired (it must have been so since August, but nobody seems to have noticed it as yet...). Perhaps another forum member can cough it up?

Regards,
Kees de Boer

Ing. C.P.J. de Boer
EMGO, VUmc
Dept. of Data Management & System Administration
BS-7 D-451
(020) 44 49828

In theory there is no difference between theory and practice. Practice, however, is different...

> -----Original message-----
> From: SPSSX(r) Discussion [mailto:[hidden email]]
> On behalf of: Fiona Graff
> Sent: Tuesday 14 November 2006 0:07
> To: [hidden email]
> Subject: Comparing two data sets
>
> Hi all,
>
> Does anyone know if there is an easy way to compare two data
> files to determine if their data contents are identical? I
> know how to do this by exporting SPSS files to excel and then
> writing a formula to compare cell-by-cell, but I know there
> must be an easier way in SPSS.
>
> Thanks very much,
>
> Fiona Graff
In reply to this post by Fiona Graff
Hi Fiona,
I found our licence code for the Data Entry Builder and tried it out. It would go like this:

1) Start SPSS Data Entry Builder
2) Open the original file (Invoer.sav): File-->Open
3) Remove the mark at "Open in Table Entry Mode" and click <Yes>
4) Click in the main menu on "View" --> "Form Entry"
5) Click in the main menu on "View" --> "Builder Window"
6) You are now in the SPSS Data Entry Builder
7) Click in the main menu on "File" --> "Compare Versions"
8) Browse to the file you want to check against the original file
9) Put a mark at "Match Cases by ID variable"
10) Select the unique variable in both files (often the respondent number or patient number or so)
11) Put a mark at "Sort active file by case ID variable"
12) Finally click on <OK> to start the comparison

The output would be something like this:

=============================================================
Active file: C:\zandbak\Een.sav
Verification file: C:\zandbak\Twee.sav
14-11-2006 9:16:29
Warning: Value mismatch for Case ID = 108.00.
Variable Active Verification
CNTMIN1 1422.41 1000.00
Warning: The verification file does not contain a matching case with the Case ID of 119.00.
Warning: The verification file does not contain a matching case with the Case ID of 141.00.
Warning: Value mismatch for Case ID = 715.00.
Variable Active Verification
COUNT5 6.00
Warning: Value mismatch for Case ID = 717.00.
Variable Active Verification
CNTMIN4 93.74
=============================================================

The active variables are in the original file, the verification variables are in the second file. On lines 4 to 6 of the output it says:

...
Warning: Value mismatch for Case ID = 108.00.
Variable Active Verification
CNTMIN1 1422.41 1000.00
...

This means that in the original file, in the case with Case ID 108.00, the variable CNTMIN1 has the value 1422.41, while in the second file its value is 1000.00. Cases 119 and 141 are missing from the second file, and so on. You can either save or print the output.
This seems to me a better way than simply comparing the files themselves (but of course you need a license for the Builder...)

Kees
In reply to this post by Maguin, Eugene
Shalom
I believe that using AGGREGATE to find duplicates is simpler than using LAG. As far as I can tell today (SPSS 14), there is no limit on the number of break variables. In the following example, AGGREGATE correctly handles every combination of value types, that is:

value vs. value = true
value vs. sysmis = false
value vs. user missing = false
sysmis vs. sysmis = true
user missing vs. user missing = true
sysmis vs. user missing = false

title identify duplicates .
input program .
vector num(4).
loop #j =1 to 15 .
compute subnum=#j.
loop #i =1 to 4 .
compute num(#i)= trunc(uniform(11)).
end loop .
end case .
end loop .
end file .
end input program .
execute .
formats num1 to num4(f3) .
save outfile='tmp.sav' .
recode num1 to num4(3 9=sysmis)(6=7) .
missing values num1 to num4(7) .
compute line=1.
add files file=*/ file='tmp.sav' /keep=subnum line num1 to num4 .
recode line(1=1)(else=2).
sort cases by subnum line .
AGGREGATE /OUTFILE=* MODE=ADDVARIABLES
 /BREAK=subnum num1 to num4
 /line_n = N(line).

Hillel Vardi

Gene Maguin wrote:
> All,
>
> Here is the setup. Two datasets are allegedly identical. Consider variables
> x1 to x10. Possible values include sysmis and user missing (two or three
> values). Let's do this in syntax and not through the identify duplicates
> thing.
>
> If there are no user missing, then this will work.
>
> COMPUTE MATCH=0.
> DO IF (FAMID EQ LAG(FAMID)).
> + DO REPEAT X=X1 TO X10.
> + IF (X EQ LAG(X)) MATCH=MATCH+1.
> + IF (SYSMIS(X) AND SYSMIS(LAG(X))) MATCH=MATCH+1.
> + END REPEAT.
> ELSE.
> + COMPUTE MATCH=99.
> END IF.
> IF ($CASENUM EQ 1) MATCH=99. /* NEEDED FOR PAIR 1 RECORD 1.
>
> Now add user missing. I'd like to say
>
> COMPUTE MATCH=0.
> DO IF (FAMID EQ LAG(FAMID) AND PARID EQ LAG(PARID)).
> + DO REPEAT X=C1PRCA1 TO C1PRCA9.
> + IF (X EQ VALUE(LAG(X))) MATCH=MATCH+1.
> + IF (SYSMIS(X) AND SYSMIS(LAG(X))) MATCH=MATCH+1.
> + END REPEAT.
> ELSE.
> + COMPUTE MATCH=99.
> END IF.
> IF ($CASENUM EQ 1) MATCH=99. /* NEEDED FOR PAIR 1 RECORD 1.
>
> However, the Value and Lag functions don't work together--in any sequence.
>
> A plausible alternative is
>
> COMPUTE MATCH=0.
> DO IF (FAMID EQ LAG(FAMID) AND PARID EQ LAG(PARID)).
> + DO REPEAT X=C1PRCA1 TO C1PRCA9.
> + COMPUTE #TEMP=LAG(X).
> + IF (X EQ VALUE(#TEMP)) MATCH=MATCH+1.
> + IF (SYSMIS(X) AND SYSMIS(#TEMP)) MATCH=MATCH+1.
> + END REPEAT.
> ELSE.
> + COMPUTE MATCH=99.
> END IF.
> IF ($CASENUM EQ 1) MATCH=99. /* NEEDED FOR PAIR 1 RECORD 1.
>
> But the problem here is that a user missing value is reset to sysmis in
> this statement.
>
> + COMPUTE #TEMP=LAG(X).
>
> So things don't work correctly.
>
> The only alternative I can think of is to set user missing off, execute the
> comparison and then set user missing back on.
>
> My question: is there a one step alternative?
>
> Follow up question: Will the identify duplicates thing work correctly in
> this case?
>
> Thanks, Gene Maguin
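To read the result of Hillel's AGGREGATE step, one could list only the records that did not find an identical partner. This is an untested sketch that assumes each SUBNUM occurs exactly once in each of the two stacked files, so that LINE_N = 2 means both copies agree on NUM1 TO NUM4 and LINE_N = 1 means they differ somewhere.

```spss
* Sketch only, to be run after the AGGREGATE step.
* LINE_N counts the cases sharing one combination of SUBNUM and NUM1 TO NUM4.
TEMPORARY.
SELECT IF (line_n NE 2).
LIST VARIABLES=subnum line num1 TO num4 line_n.
```

Because of TEMPORARY, the selection applies to the LIST only and the full file stays in place afterwards.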
In reply to this post by Richard Ristow
[snip]
>However, the Value and Lag functions don't work together--in any
>sequence.

Barf. One of those things SPSS didn't fully think out. (If it'll make you feel any 'better', VALUE also doesn't work for variables referenced as vector elements; see thread "VECTOR and VALUE problem", Thu, 20 Jan 2005 ff. If it'll make you feel still 'better', that one surprised the SPSS folks.)

[>>>Peck, Jon] You may think this was not fully thought out, but there is in fact a good reason for these restrictions, which are documented.

VALUE. VALUE(variable). Numeric or string. Returns the value of variable, ignoring user missing-value definitions for variable, which must be a variable name or a vector reference to a variable name.

The reason is that when values are fetched as input to a transformation program, their use is not yet bound to a particular operator or function. In fact, they may be used in several places. Following the general rules, they are converted to sysmis based on the originating variable missing value definitions.

Now certain functions have to work differently, including VALUE, because its behavior is tightly bound to metadata for its argument. So for this function to work, it has to be able to identify the variable whose value is being taken in order to counteract the general conversion rule. That means that functions such as VALUE and VALUELABEL cannot take an expression as the argument: they must have a simple variable reference. In the instance described, VALUE is being passed an expression.

Of course there is nothing stopping the use of a compute with a simple VALUE function usage or a TEMPORARY removal of missing values for the variables.

-Jon Peck
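The "compute with a simple VALUE function usage" might look roughly like the following untested sketch. It copies each variable into a helper variable with VALUE, so the helpers carry the raw codes and have no user-missing definitions, and then applies LAG to the helpers. RAW1 TO RAW9 are hypothetical names introduced here; the rest follows Gene Maguin's code from this thread and assumes C1PRCA1 TO C1PRCA9 are adjacent in the file.

```spss
* Sketch only. RAW1 TO RAW9 are hypothetical helper variables.
* Step 1: copy the raw values, ignoring user-missing definitions.
DO REPEAT X=C1PRCA1 TO C1PRCA9 / RAW=RAW1 TO RAW9.
+ COMPUTE RAW=VALUE(X).
END REPEAT.
* Step 2: compare each helper with its value in the previous case.
COMPUTE MATCH=0.
DO IF (FAMID EQ LAG(FAMID) AND PARID EQ LAG(PARID)).
+ DO REPEAT RAW=RAW1 TO RAW9.
+ IF (RAW EQ LAG(RAW)) MATCH=MATCH+1.
+ IF (SYSMIS(RAW) AND SYSMIS(LAG(RAW))) MATCH=MATCH+1.
+ END REPEAT.
ELSE.
+ COMPUTE MATCH=99.
END IF.
IF ($CASENUM EQ 1) MATCH=99.
```

With this arrangement VALUE only ever sees a simple variable reference, and LAG only ever sees variables that have no user-missing values defined, so a user-missing code is compared like any other value.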
Ah, well. Sorry to step on toes, especially as I've left a long trail of bruised toes.

At 08:30 AM 11/14/2006, Peck, Jon wrote:

>To the (accurate) observation that
>>>However, the Value and Lag functions don't work together--in any
>>>sequence.
>>I'd written,
>>Barf. One of those things SPSS didn't fully think out.
>
>[>>>Peck, Jon] There is in fact a good reason for these restrictions,
>which are documented. [From the Command Syntax Reference; the
>definition in the SPSS 14 CSR differs slightly from this:]
>
>>VALUE. VALUE(variable). Numeric or string. Returns the value of
>>variable, ignoring user missing-value definitions for variable, which
>>must be a variable name or a vector reference to a variable name.
>
>The reason is that when values are fetched as input to a
>transformation program, their use is not yet bound to a particular
>operator or function. In fact, they may be used in several
>places. Following the general rules, they are converted to sysmis
>based on the originating variable missing value definitions.

Granted. To simplify, I hope accurately: A function that takes a general expression as argument must treat user-missing values, for an argument consisting of a single variable, exactly as it treats system-missing values.

>Now certain functions have to work differently, including VALUE,
>because its behavior is tightly bound to metadata for its argument. In
>the instance described, VALUE is being passed an expression[, namely
>LAG(variable)].

Also granted. That said, LAG(variable) is also one of those functions that work differently, in that its argument is not only a single variable, its returned value (like that of VALUE) is a value of a variable. It would be desirable, then, to have VALUE (and SYSMIS, which also takes special cognizance of user-missing values) be passed the metadata of LAG's variable, and act on the LAGged value they receive, as they would on an un-LAGged value of the same variable.

Further granting that it's a lot easier to say "not fully thought out" in retrospect, than to consider (and likely write special code for) an interaction of this subtlety, during development.

...................

As for "VALUE doesn't work for variables referenced as vector elements", that restriction isn't there, at least in SPSS 14; the quoted documentation says the same. (It was a problem, in SPSS 9.) See the following, which is SPSS 14 draft output (code and output not saved separately):

NEW FILE.
DATA LIST FREE / A.
BEGIN DATA
1 2 3
END DATA.
MISSING VALUES A(2).
COMPUTE B = A.
COMPUTE C = VALUE(A).
VECTOR VECT=A TO C.
COMPUTE D_VECT = VECT(1).
COMPUTE E_VECT = VALUE(VECT(1)).
LIST.

|-----------------------------|---------------------------|
|Output Created               |16-NOV-2006 14:04:03       |
|-----------------------------|---------------------------|

       A        B        C   D_VECT   E_VECT

    1.00     1.00     1.00     1.00     1.00
    2.00      .       2.00      .       2.00
    3.00     3.00     3.00     3.00     3.00

Number of cases read: 3 Number of cases listed: 3
In reply to this post by Fiona Graff
Fiona,
I found the following some years back when I had the same problem. It works perfectly.

Best wishes
Eric

___________________________________________________

Duanren Yuan wrote:

Dear SPSS netter,

Would you help me solve the problem. I asked two persons to do data entry. They used the same set of questionnaires. I want to check whether each value in the two data files are exactly the same by using SPSS program. How Can I do that? Any suggestion will be very much appreciated.

Jeremy Yuan

* Sample data files * .
DATA LIST /ID (F1) NX NY NZ (3F1) SX SY SZ (3A1).
BEGIN DATA
1124ABC
2245SEF
3234DFG
END DATA.
SAVE OUTFILE 'F1'.

DATA LIST /ID (F1) NX NY NZ (3F1) SX SY SZ (3A1).
BEGIN DATA
1123ABC
2345SEF
3234DFH
END DATA.
SAVE OUTFILE 'F2'.

* Assuming your files are sorted by ID, have ID's which match,
  the variables are the same (String lengths are the same etc) *.
ADD FILES FILE 'F1' / IN=ONE / FILE='F2' / IN=TWO / BY ID .
DO IF TWO.
DO REPEAT V=NX NY NZ SX SY SZ /TAG=TNX TNY TNZ TSX TSY TSZ.
COMPUTE TAG=(V<>LAG(V)).
END REPEAT.
END IF.
COMPUTE DIFF=SUM(TNX TO TSZ).
FORMATS TNX TNY TNZ TSX TSY TSZ (F1).
LIST.

ID NX NY NZ SX SY SZ ONE TWO TNX TNY TNZ TSX TSY TSZ DIFF

 1  1  2  4 A  B  C    1   0   .   .   .   .   .   .    .
 1  1  2  3 A  B  C    0   1   0   0   1   0   0   0 1.00
 2  2  4  5 S  E  F    1   0   .   .   .   .   .   .    .
 2  3  4  5 S  E  F    0   1   1   0   0   0   0   0 1.00
 3  2  3  4 D  F  G    1   0   .   .   .   .   .   .    .
 3  2  3  4 D  F  H    0   1   0   0   0   0   0   1 1.00
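With larger files it may help to list only the rows where something differs, rather than every case. The following untested sketch builds on the DIFF variable computed in the syntax above. Like that syntax, it assumes every ID occurs exactly once in each file; an ID present in only one file would be compared against an unrelated row.

```spss
* Sketch only, to be run after the syntax above.
* DIFF is system-missing for rows from F1, so only rows from F2
* with at least one differing variable are listed.
TEMPORARY.
SELECT IF (DIFF GT 0).
LIST VARIABLES=ID TNX TO TSZ DIFF.
```

In the sample data this would list one row per ID, with a 1 under each tag variable whose value differs between the two files.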