|
Hi,
I have two datasets with some the same variables. How do you check/compare the datasets to find the cases that do not match within the variables that are the same in both datasets? For example, I have facility name and type of facility in both datasets. I would like to see which facilities are in one dataset and not in the other and which facilities changed their type of facility in the datasets. We also have two datasets with all of the same variables because our data entry staff enter the same data twice. We would like to compare/check the datasets. Basically, how do you carry out the "double data entry concept" in SPSS? Is there a function in SPSS that does this? Thanks, Deepa Deepa Bhat, MPH, MS Monitoring & Evaluation Associate Technical Officer Making Medical Injections Safer (MMIS) John Snow, Inc. 1616 N. Fort Myer Drive Arlington, VA 22209 Phone: 703-528-7474 x5180 Fax: 703-528-7480 [hidden email] mmis.jsi.com |
|
Do you have SPSS Data Entry Builder module? If yes, it is very easy to
do using that module. Let me know and I can tell you how to do it. Bozena Bozena Zdaniuk, Ph.D. University of Pittsburgh UCSUR, 6th Fl. 121 University Place Pittsburgh, PA 15260 Ph.: 412-624-5736 Fax: 412-624-4810 email: [hidden email] -----Original Message----- From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Deepa Bhat Sent: Monday, July 02, 2007 3:05 PM To: [hidden email] Subject: Question on Double Data Entry Checking Hi, I have two datasets with some the same variables. How do you check/compare the datasets to find the cases that do not match within the variables that are the same in both datasets? For example, I have facility name and type of facility in both datasets. I would like to see which facilities are in one dataset and not in the other and which facilities changed their type of facility in the datasets. We also have two datasets with all of the same variables because our data entry staff enter the same data twice. We would like to compare/check the datasets. Basically, how do you carry out the "double data entry concept" in SPSS? Is there a function in SPSS that does this? Thanks, Deepa Deepa Bhat, MPH, MS Monitoring & Evaluation Associate Technical Officer Making Medical Injections Safer (MMIS) John Snow, Inc. 1616 N. Fort Myer Drive Arlington, VA 22209 Phone: 703-528-7474 x5180 Fax: 703-528-7480 [hidden email] mmis.jsi.com |
|
In reply to this post by Deepa Bhat
Deepa,
Spss has an identify duplicates function accessible from the Data dropdown in the Data window. I've never used it but there is no reason to believe that it won't do some or all of what you want. However, if that function fails, you can do it the old way. The steps are the following. However, the following procedure does not work if there are more than two records with the same idvariable value. Different code is required. 1) Open each dataset and create a new variable having the same name, e.g., File, and one value in the first dataset and another value in the second dataset. Save both datasets. 2) Add the two files and sort by your id variable and the new variable, File. 3) Run one of the following code segments depending on your needs. * To locate duplicate records. Compute dups=0. Do if (idvariable eq lag(idvariable)). + Do repeat x=<your list of variable to be check for duplicates>. + If (value(x) eq value(lag(x))) dups=dups+1. + If (sysmis(x) and (sysmis(lag(x)))) dups=dups+1. + End repeat. End if. * To check whether allegedly duplicate records are the same. Compute errs=0. Do if (idvariable eq lag(idvariable)). + Do repeat x=<your list of variable to be check for duplicates>. + If (value(x) ne value(lag(x))) errs=errs+1. + If (sysmis(x) and not(sysmis(lag(x)))) errs=errs+1. + If (not(sysmis(x)) and sysmis(lag(x))) errs=errs+1. + End repeat. End if. 4) Sort cases by idvariable in ascending order and File in descending order. 5) copy the value of Dups or Errs to second record in the pair cases with the same idvariable value. That is, If (idvariable eq lag(idvariable)) errs=lag(errs). Gene Maguin |
|
A couple of comments, again. At 04:03 PM 7/2/2007, Gene Maguin wrote:
>Spss has an identify duplicates function accessible from the Data >dropdown in the Data window. I've never used it but there is no reason >to believe that it won't do some or all of what you want. I hadn't, either. It expands into a surprisingly elaborate piece of code. Here's SPSS 15 draft output (WRR-data & code not saved separately). The LIST commands are mine. Everything from the "Identify Duplicate Cases" comment through the FREQUENCIES command is generated. I've reformatted the generated code. Note that it seems to look for duplicates by the ID variable only. That logic may be adaptable for some of what you want. >|-----------------------------|---------------------------| >|Output Created |02-JUL-2007 16:53:01 | >|-----------------------------|---------------------------| >[DataSet0] > > ID Data1 Data2 Data3 > > 1 1.20 3.40 5.60 > 1 1.20 3.40 7.80 > 2 1.00 2.00 3.00 > 2 1.00 2.00 3.00 > 2 1.00 2.00 4.00 > 3 1.23 4.56 7.89 > 3 1.23 4.56 7.89 > > >Number of cases read: 7 Number of cases listed: 7 > > >* Identify Duplicate Cases. >SORT CASES BY ID(A) Data1(A) Data2(A) Data3(A) . >MATCH FILES /FILE = * > /BY ID > /FIRST = PrimaryFirst > /LAST = PrimaryLast. >DO IF (PrimaryFirst). >. COMPUTE MatchSequence = 1 - PrimaryLast. >ELSE. >. COMPUTE MatchSequence = MatchSequence + 1. >END IF. >LEAVE MatchSequence. >FORMAT MatchSequence (f7). >COMPUTE InDupGrp = MatchSequence > 0. >SORT CASES InDupGrp(D). >MATCH FILES /FILE = * /DROP = PrimaryLast InDupGrp MatchSequence. >VARIABLE LABELS PrimaryFirst 'Indicator of each first matching case >as'+ > ' Primary' . >VALUE LABELS PrimaryFirst 0 'Duplicate Case' 1 'Primary Case'. >VARIABLE LEVEL PrimaryFirst (ORDINAL). >FREQUENCIES VARIABLES = PrimaryFirst . > > >Frequencies >|-----------------------------|---------------------------| >|Output Created |02-JUL-2007 16:55:41 | >|-----------------------------|---------------------------| >[DataSet0] > >Statistics [suppressed] > > >PrimaryFirst Indicator of each first matching case as Primary >|-----|---------------|---------|-------|-------------|---------------| >| | |Frequency|Percent|Valid >Percent|Cumulative | >| | | | | |Percent >| >|-----|---------------|---------|-------|-------------|---------------| >|Valid|0 Duplicate |4 |57.1 |57.1 |57.1 >| >| |Case | | | | >| >| >|---------------|---------|-------|-------------|---------------| >| |1 Primary >Case|3 |42.9 |42.9 |100.0 | >| >|---------------|---------|-------|-------------|---------------| >| |Total |7 |100.0 |100.0 | >| >|-----|---------------|---------|-------|-------------|---------------| > > >LIST. > >List >|-----------------------------|---------------------------| >|Output Created |02-JUL-2007 16:56:09 | >|-----------------------------|---------------------------| >[DataSet0] > > ID Data1 Data2 Data3 PrimaryFirst > > 1 1.20 3.40 5.60 1 > 1 1.20 3.40 7.80 0 > 2 1.00 2.00 3.00 1 > 2 1.00 2.00 3.00 0 > 2 1.00 2.00 4.00 0 > 3 1.23 4.56 7.89 1 > 3 1.23 4.56 7.89 0 > >Number of cases read: 7 Number of cases listed: 7 >You can [also] do it the old way. However, the following procedure >does not work if there are more than two records with the same >idvariable value. Different code is required. > >1) Open each dataset and create a new variable having the same name, >e.g., File, and one value in the first dataset and another value in >the second >dataset. Save both datasets. >2) Add the two files and sort by your id variable and the new >variable, File. If your two files are already sorted by ID, this variant is good: ADD FILES /FILE=File1 /IN=IN_File1 /FILE=File2 /IN=IN_File2 /FIRST = New_ID /LAST = End_ID /BY ID. The two "IN" variables serve the purpose, not quiet as conveniently,as the "File" variable. And "NOT New_ID" is equivalent to "idvariable eq lag(idvariable)", except the latter is missing for the first case. See the above remark, that the code does not work if there are more than two records with the same idvariable value. That can be checked with the logic above: if there are any records where which neither New_ID nor End_ID is 1, there are more than two records for that idvariable. >3) Run one of the following code segments depending on your needs. > >* To locate duplicate records. >Compute dups=0. >Do if (idvariable eq lag(idvariable)). >+ Do repeat x=<your list of variable to be check for duplicates>. >+ If (value(x) eq value(lag(x))) dups=dups+1. >+ If (sysmis(x) and (sysmis(lag(x)))) dups=dups+1. >+ End repeat. >End if. > > >* To check whether allegedly duplicate records are the same. >Compute errs=0. >Do if (idvariable eq lag(idvariable)). >+ Do repeat x=<your list of variable to be check for duplicates>. >+ If (value(x) ne value(lag(x))) errs=errs+1. >+ If (sysmis(x) and not(sysmis(lag(x)))) errs=errs+1. >+ If (not(sysmis(x)) and sysmis(lag(x))) errs=errs+1. >+ End repeat. >End if. > >4) Sort cases by idvariable in ascending order and File in descending >order. Or, as above: the ADD FILES will leave the file sorted in ascending order by ID, with File1 records preceding File2 records within an ID value. >5) copy the value of Dups or Errs to second record in the pair cases >with the same idvariable value. That is, > >If (idvariable eq lag(idvariable)) errs=lag(errs). Cheers! And now, I may have got away from myself - maybe it's confusing to suggest such a large variant in the logic. If so, ignore the above, and onward-- Richard Ristow |
| Free forum by Nabble | Edit this page |
