SPSSX Discussion

Question on Double Data Entry Checking

Classic

List

Threaded

4 messages Options

Deepa Bhat

Question on Double Data Entry Checking

Hi,

I have two datasets with some the same variables.

How do you check/compare the datasets to find the cases that do not
match within the variables that are the same in both datasets?

For example, I have facility name and type of facility in both
datasets. I would like to see which facilities are in one dataset and
not in the other and which facilities changed their type of facility in
the datasets.

We also have two datasets with all of the same variables because our
data entry staff enter the same data twice. We would like to
compare/check the datasets.

Basically, how do you carry out the "double data entry concept" in
SPSS? Is there a function in SPSS that does this?

Thanks,
Deepa

Deepa Bhat, MPH, MS
Monitoring & Evaluation Associate Technical Officer
Making Medical Injections Safer (MMIS)
John Snow, Inc.
1616 N. Fort Myer Drive
Arlington, VA 22209
Phone: 703-528-7474 x5180
Fax: 703-528-7480
[hidden email]
mmis.jsi.com

Zdaniuk, Bozena

Re: Question on Double Data Entry Checking

Do you have SPSS Data Entry Builder module? If yes, it is very easy to
do using that module. Let me know and I can tell you how to do it.
Bozena

Bozena Zdaniuk, Ph.D.

University of Pittsburgh

UCSUR, 6th Fl.

121 University Place

Pittsburgh, PA 15260

Ph.: 412-624-5736

Fax: 412-624-4810

email: [hidden email]

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Deepa Bhat
Sent: Monday, July 02, 2007 3:05 PM
To: [hidden email]
Subject: Question on Double Data Entry Checking

Hi,

I have two datasets with some the same variables.

How do you check/compare the datasets to find the cases that do not
match within the variables that are the same in both datasets?

For example, I have facility name and type of facility in both
datasets. I would like to see which facilities are in one dataset and
not in the other and which facilities changed their type of facility in
the datasets.

We also have two datasets with all of the same variables because our
data entry staff enter the same data twice. We would like to
compare/check the datasets.

Basically, how do you carry out the "double data entry concept" in
SPSS? Is there a function in SPSS that does this?

Thanks,
Deepa

Deepa Bhat, MPH, MS
Monitoring & Evaluation Associate Technical Officer
Making Medical Injections Safer (MMIS)
John Snow, Inc.
1616 N. Fort Myer Drive
Arlington, VA 22209
Phone: 703-528-7474 x5180
Fax: 703-528-7480
[hidden email]
mmis.jsi.com

Maguin, Eugene

Re: Question on Double Data Entry Checking

In reply to this post by Deepa Bhat

Deepa,

Spss has an identify duplicates function accessible from the Data dropdown
in the Data window. I've never used it but there is no reason to believe
that it won't do some or all of what you want.

However, if that function fails, you can do it the old way. The steps are
the following.
However, the following procedure does not work if there are more than two
records with the same idvariable value. Different code is required.

1) Open each dataset and create a new variable having the same name, e.g.,
File, and one value in the first dataset and another value in the second
dataset. Save both datasets.
2) Add the two files and sort by your id variable and the new variable,
File.
3) Run one of the following code segments depending on your needs.

* To locate duplicate records.
Compute dups=0.
Do if (idvariable eq lag(idvariable)).
+ Do repeat x=<your list of variable to be check for duplicates>.
+ If (value(x) eq value(lag(x))) dups=dups+1.
+ If (sysmis(x) and (sysmis(lag(x)))) dups=dups+1.
+ End repeat.
End if.

* To check whether allegedly duplicate records are the same.
Compute errs=0.
Do if (idvariable eq lag(idvariable)).
+ Do repeat x=<your list of variable to be check for duplicates>.
+ If (value(x) ne value(lag(x))) errs=errs+1.
+ If (sysmis(x) and not(sysmis(lag(x)))) errs=errs+1.
+ If (not(sysmis(x)) and sysmis(lag(x))) errs=errs+1.
+ End repeat.
End if.

4) Sort cases by idvariable in ascending order and File in descending order.
5) copy the value of Dups or Errs to second record in the pair cases with
the same idvariable value. That is,

If (idvariable eq lag(idvariable)) errs=lag(errs).

Gene Maguin

Richard Ristow

Re: Question on Double Data Entry Checking

A couple of comments, again. At 04:03 PM 7/2/2007, Gene Maguin wrote:

>Spss has an identify duplicates function accessible from the Data
>dropdown in the Data window. I've never used it but there is no reason
>to believe that it won't do some or all of what you want.

I hadn't, either. It expands into a surprisingly elaborate piece of
code. Here's SPSS 15 draft output (WRR-data & code not saved
separately). The LIST commands are mine. Everything from the "Identify
Duplicate Cases" comment through the FREQUENCIES command is generated.
I've reformatted the generated code.

Note that it seems to look for duplicates by the ID variable only. That
logic may be adaptable for some of what you want.

>|-----------------------------|---------------------------|
>|Output Created |02-JUL-2007 16:53:01 |
>|-----------------------------|---------------------------|
>[DataSet0]
>
> ID Data1 Data2 Data3
>
> 1 1.20 3.40 5.60
> 1 1.20 3.40 7.80
> 2 1.00 2.00 3.00
> 2 1.00 2.00 3.00
> 2 1.00 2.00 4.00
> 3 1.23 4.56 7.89
> 3 1.23 4.56 7.89
>
>
>Number of cases read: 7 Number of cases listed: 7
>
>
>* Identify Duplicate Cases.
>SORT CASES BY ID(A) Data1(A) Data2(A) Data3(A) .
>MATCH FILES /FILE = *
> /BY ID
> /FIRST = PrimaryFirst
> /LAST = PrimaryLast.
>DO IF (PrimaryFirst).
>. COMPUTE MatchSequence = 1 - PrimaryLast.
>ELSE.
>. COMPUTE MatchSequence = MatchSequence + 1.
>END IF.
>LEAVE MatchSequence.
>FORMAT MatchSequence (f7).
>COMPUTE InDupGrp = MatchSequence > 0.
>SORT CASES InDupGrp(D).
>MATCH FILES /FILE = * /DROP = PrimaryLast InDupGrp MatchSequence.
>VARIABLE LABELS PrimaryFirst 'Indicator of each first matching case
>as'+
> ' Primary' .
>VALUE LABELS PrimaryFirst 0 'Duplicate Case' 1 'Primary Case'.
>VARIABLE LEVEL PrimaryFirst (ORDINAL).
>FREQUENCIES VARIABLES = PrimaryFirst .
>
>
>Frequencies
>|-----------------------------|---------------------------|
>|Output Created |02-JUL-2007 16:55:41 |
>|-----------------------------|---------------------------|
>[DataSet0]
>
>Statistics [suppressed]
>
>
>PrimaryFirst Indicator of each first matching case as Primary
>|-----|---------------|---------|-------|-------------|---------------|
>| | |Frequency|Percent|Valid
>Percent|Cumulative |
>| | | | | |Percent
>|
>|-----|---------------|---------|-------|-------------|---------------|
>|Valid|0 Duplicate |4 |57.1 |57.1 |57.1
>|
>| |Case | | | |
>|
>|
>|---------------|---------|-------|-------------|---------------|
>| |1 Primary
>Case|3 |42.9 |42.9 |100.0 |
>|
>|---------------|---------|-------|-------------|---------------|
>| |Total |7 |100.0 |100.0 |
>|
>|-----|---------------|---------|-------|-------------|---------------|
>
>
>LIST.
>
>List
>|-----------------------------|---------------------------|
>|Output Created |02-JUL-2007 16:56:09 |
>|-----------------------------|---------------------------|
>[DataSet0]
>
> ID Data1 Data2 Data3 PrimaryFirst
>
> 1 1.20 3.40 5.60 1
> 1 1.20 3.40 7.80 0
> 2 1.00 2.00 3.00 1
> 2 1.00 2.00 3.00 0
> 2 1.00 2.00 4.00 0
> 3 1.23 4.56 7.89 1
> 3 1.23 4.56 7.89 0
>
>Number of cases read: 7 Number of cases listed: 7

>You can [also] do it the old way. However, the following procedure
>does not work if there are more than two records with the same
>idvariable value. Different code is required.
>
>1) Open each dataset and create a new variable having the same name,
>e.g., File, and one value in the first dataset and another value in
>the second
>dataset. Save both datasets.
>2) Add the two files and sort by your id variable and the new
>variable, File.

If your two files are already sorted by ID, this variant is good:

ADD FILES
/FILE=File1 /IN=IN_File1
/FILE=File2 /IN=IN_File2
/FIRST = New_ID
/LAST = End_ID
/BY ID.

The two "IN" variables serve the purpose, not quiet as conveniently,as
the "File" variable. And "NOT New_ID" is equivalent to "idvariable eq
lag(idvariable)", except the latter is missing for the first case.

See the above remark, that the code does not work if there are more
than two records with the same idvariable value. That can be checked
with the logic above: if there are any records where which neither
New_ID nor End_ID is 1, there are more than two records for that
idvariable.

>3) Run one of the following code segments depending on your needs.
>
>* To locate duplicate records.
>Compute dups=0.
>Do if (idvariable eq lag(idvariable)).
>+ Do repeat x=<your list of variable to be check for duplicates>.
>+ If (value(x) eq value(lag(x))) dups=dups+1.
>+ If (sysmis(x) and (sysmis(lag(x)))) dups=dups+1.
>+ End repeat.
>End if.
>
>
>* To check whether allegedly duplicate records are the same.
>Compute errs=0.
>Do if (idvariable eq lag(idvariable)).
>+ Do repeat x=<your list of variable to be check for duplicates>.
>+ If (value(x) ne value(lag(x))) errs=errs+1.
>+ If (sysmis(x) and not(sysmis(lag(x)))) errs=errs+1.
>+ If (not(sysmis(x)) and sysmis(lag(x))) errs=errs+1.
>+ End repeat.
>End if.
>
>4) Sort cases by idvariable in ascending order and File in descending
>order.

Or, as above: the ADD FILES will leave the file sorted in ascending
order by ID, with File1 records preceding File2 records within an ID
value.

>5) copy the value of Dups or Errs to second record in the pair cases
>with the same idvariable value. That is,
>
>If (idvariable eq lag(idvariable)) errs=lag(errs).

Cheers! And now, I may have got away from myself - maybe it's confusing
to suggest such a large variant in the logic. If so, ignore the above,
and onward--

Richard Ristow