SPSSX Discussion

Poisson regression python module.

Richard Ristow

Re: Hmm.. How to do this?

Postscript. At 06:17 PM 11/13/2006, I wrote:

>I'd try something like this (not tested):
>
>* Prepare the variables to be "left": .
>NUMERIC #X1 TO #X10 (F5.3) /* or, any other format .
>
>* Check for match with previous case, as before: .
>COMPUTE MATCH=0.
>DO IF (FAMID EQ LAG(FAMID) AND PARID EQ LAG(PARID)).
[Etc. See previous posting for the complete code.]

I don't think this logic, or the earlier logic in this thread (if it
had worked as intended), will detect records - combinations of the key
variables - that are found in only one of the two files being compared.
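One way to check for such records directly is to compare the sets of key combinations in the two files. A minimal sketch in Python (outside SPSS), assuming the records have already been read into lists of dictionaries; the helper name `keys_only_in_one` and the sample rows are hypothetical, and FAMID and PARID are the key variables from the code above:

```python
def keys_only_in_one(rows_a, rows_b, key_fields=("FAMID", "PARID")):
    """Return (keys found only in A, keys found only in B) as sorted lists."""
    keys_a = {tuple(row[f] for f in key_fields) for row in rows_a}
    keys_b = {tuple(row[f] for f in key_fields) for row in rows_b}
    return sorted(keys_a - keys_b), sorted(keys_b - keys_a)


# Hypothetical sample data: each dict is one record.
file_a = [{"FAMID": 1, "PARID": 1}, {"FAMID": 1, "PARID": 2}, {"FAMID": 2, "PARID": 1}]
file_b = [{"FAMID": 1, "PARID": 1}, {"FAMID": 2, "PARID": 1}, {"FAMID": 3, "PARID": 1}]

only_a, only_b = keys_only_in_one(file_a, file_b)
print("Only in A:", only_a)
print("Only in B:", only_b)
```

Any key listed in either result would be skipped silently by a comparison that only looks at adjacent cases with equal keys.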

Boer, CPJ de

Re: Comparing two data sets

In reply to this post by Fiona Graff

Hello Fiona,

This is the one reason that we also buy a license for the Data Entry
Builder of SPSS.
With this program you can compare two datasets and receive an extensive
report on cases missing from one dataset or the other, and on cases
that appear in both datasets but differ in one or more variables.
Unfortunately I cannot reproduce the exact steps, as our license
has expired (it must have been so since August, but nobody seems to have
noticed yet...). Perhaps another forum member can supply them?

Regards,

Kees de Boer
Ing. C.P.J. de Boer
EMGO, VUmc
afd. Datamanagement & Systeembeheer
BS-7 D-451
(020) 44 49828
In theory there is no difference between theory and practice. Practice,
however, is different...

> -----Original message-----
> From: SPSSX(r) Discussion [mailto:[hidden email]]
> On behalf of Fiona Graff
> Sent: Tuesday 14 November 2006 0:07
> To: [hidden email]
> Subject: Comparing two data sets
>
> Hi all,
>
> Does anyone know if there is an easy way to compare two data
> files to determine if their data contents are identical? I
> know how to do this by exporting SPSS files to excel and then
> writing a formula to compare cell-by-cell, but I know there
> must be an easier way in SPSS.
>
> Thanks very much,
>
> Fiona Graff
>
>
>
>
>
> Any information, including protected health information
> (PHI), transmitted in this email is intended only for the
> person or entity to which it is addressed and may contain
> information that is privileged, confidential and or exempt
> from disclosure under applicable Federal or State law. Any
> review, retransmission, dissemination or other use of or
> taking of any action in reliance upon, protected health
> information (PHI) by persons or entities other than the
> intended recipient is prohibited. If you received this email
> in error, please contact the sender and delete the material
> from any computer.
>

Boer, CPJ de

Re: Comparing two data sets

In reply to this post by Fiona Graff

Hi Fiona,

I found our licence code for the Data Entry Builder and tried it out.
It would go like this:

1) Start SPSS Data Entry Builder
2) Open the original file (Invoer.sav): File-->Open
3) Remove the mark at "Open in Table Entry Mode" and click <Yes>
4) Click in the main menu on "View" --> "Form Entry"
5) Click in the main menu on "View" --> "Builder Window"
6) You are now in the SPSS Data Entry Builder
7) Click in the main menu on "File" --> "Compare Versions"
8) Browse to the file you want to check against the original file
9) Put a mark at "Match Cases by ID variable"
10) Select the unique variable in both files (often respondent number or
patient number or so)
11) Put a mark at "Sort active file by case ID variable"
12) Finally click on <OK> to start the comparison

The output would be something like this:
=============================================================
Active file: C:\zandbak\Een.sav
Verification file: C:\zandbak\Twee.sav
14-11-2006 9:16:29
Warning: Value mismatch for Case ID = 108.00.
Variable Active Verification
CNTMIN1 1422.41 1000.00
Warning: The verification file does not contain a matching case with the
Case ID of 119.00.
Warning: The verification file does not contain a matching case with the
Case ID of 141.00.
Warning: Value mismatch for Case ID = 715.00.
Variable Active Verification
COUNT5 6.00
Warning: Value mismatch for Case ID = 717.00.
Variable Active Verification
CNTMIN4 93.74
=============================================================

The "Active" values come from the original file; the "Verification" values
come from the second file.
Lines 4 to 6 of the output say:
...
Warning: Value mismatch for Case ID = 108.00.
Variable Active Verification
CNTMIN1 1422.41 1000.00
...
This means that in the original file in the case with Case ID 108.00 the
variable CNTMIN1 has value 1422.41 while in the second file its value is
1000.00

Cases 119 and 141 are missing from the verification file, and so on.

You can either save or print the output.

This seems to me a better way than simply comparing the files themselves
(but of course you need a license for the Builder...)
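For those without a Builder license, the kind of report shown above can be approximated outside SPSS. The following is a rough Python sketch, not the Builder's actual algorithm: it assumes each file has been loaded into a dictionary keyed by case ID, the function name `compare_by_id` is made up, and the sample values only loosely echo the report (the value for case 119 is invented, and `None` stands in for an empty cell).

```python
def compare_by_id(active, verification):
    """Compare two {case_id: {variable: value}} mappings; return report lines."""
    report = []
    for case_id in sorted(set(active) | set(verification)):
        if case_id not in verification:
            report.append(f"Case ID {case_id}: missing from verification file")
        elif case_id not in active:
            report.append(f"Case ID {case_id}: missing from active file")
        else:
            a_row, v_row = active[case_id], verification[case_id]
            # Check every variable that occurs in either record.
            for var in sorted(set(a_row) | set(v_row)):
                a_val, v_val = a_row.get(var), v_row.get(var)
                if a_val != v_val:
                    report.append(
                        f"Case ID {case_id}: {var} active={a_val} verification={v_val}"
                    )
    return report


# Hypothetical sample data.
active = {108: {"CNTMIN1": 1422.41}, 119: {"CNTMIN1": 5.0}, 715: {"COUNT5": 6.0}}
verification = {108: {"CNTMIN1": 1000.0}, 715: {"COUNT5": None}}

report = compare_by_id(active, verification)
for line in report:
    print(line)
```

Unlike a cell-by-cell comparison of two sorted files, matching on the ID means one missing case does not throw every later row out of alignment.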

Kees


hillel vardi

Re: Hmm.. How to do this?

In reply to this post by Maguin, Eugene

Shalom

I believe that using AGGREGATE to find duplicates is simpler
than using LAG.

As far as I can tell today (SPSS 14), there is no limit on the number
of break variables.

In the following example AGGREGATE correctly handles every combination of
value types,

that is: value vs. value = true

value vs. sysmis = false

value vs. user missing = false

sysmis vs. sysmis = true

user missing vs. user missing = true

sysmis vs. user missing = false

title identify duplicates .
* Generate 15 cases with four random integer variables (0 to 10) .
input program .
vector num(4).
loop #j =1 to 15 .
compute subnum=#j.
loop #i =1 to 4 .
compute num(#i)= trunc(uniform(11)).
end loop .
end case .
end loop .
end file .
end input program .
execute .

formats num1 to num4(f3) .
save outfile='tmp.sav' .
* Alter the active copy: add system-missing and user-missing values .
recode num1 to num4(3 9=sysmis)(6=7) .
missing values num1 to num4(7) .
compute line=1.
* Stack the altered copy on top of the saved original .
add files file=*/ file='tmp.sav'
/keep=subnum line num1 to num4 .
recode line(1=1)(else=2).
sort cases by subnum line .
* line_n counts the cases in each break group: 2 = match, 1 = no match .
AGGREGATE
/OUTFILE=*
MODE=ADDVARIABLES
/BREAK=subnum num1 to num4
/line_n = N(line).
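The counting idea behind the AGGREGATE step can be sketched outside SPSS as well. This Python version is an illustration under simplifying assumptions, not a model of SPSS internals: system-missing is represented by `None`, a user-missing code is just its raw number (so it only matches the same code), and the function name `count_by_break` and the sample rows are hypothetical.

```python
from collections import Counter

SYSMIS = None  # stand-in for SPSS system-missing


def count_by_break(rows, break_fields):
    """Add line_n to each row: the number of rows sharing its break values."""
    def key(row):
        return tuple(row[f] for f in break_fields)

    counts = Counter(key(row) for row in rows)
    return [dict(row, line_n=counts[key(row)]) for row in rows]


# Hypothetical stacked data: two records per subnum, one from each file.
rows = [
    {"subnum": 1, "num1": 4, "num2": SYSMIS},  # sysmis vs. sysmis: match
    {"subnum": 1, "num1": 4, "num2": SYSMIS},
    {"subnum": 2, "num1": 7, "num2": 5},       # user-missing code 7 vs. value 6
    {"subnum": 2, "num1": 6, "num2": 5},
    {"subnum": 3, "num1": SYSMIS, "num2": 2},  # sysmis vs. value 3
    {"subnum": 3, "num1": 3, "num2": 2},
]

result = count_by_break(rows, ("subnum", "num1", "num2"))
for row in result:
    print(row)
```

Records with `line_n` equal to 2 have an identical partner in the other file; a value of 1 flags a record that differs somewhere or has no partner at all.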

Hillel Vardi

Gene Maguin wrote:

> All,
>
> Here is the setup. Two datasets are allegedly identical. Consider variables
> x1 to x10. Possible values include sysmis and user missing (two or three
> values). Let's do this in syntax and not through the identify duplicates
> thing.
>
> If there are no user missing, then this will work.
>
> COMPUTE MATCH=0.
> DO IF (FAMID EQ LAG(FAMID)).
> + DO REPEAT X=X1 TO X10.
> + IF (X EQ LAG(X)) MATCH=MATCH+1.
> + IF (SYSMIS(X) AND SYSMIS(LAG(X))) MATCH=MATCH+1.
> + END REPEAT.
> ELSE.
> + COMPUTE MATCH=99.
> END IF.
> IF ($CASENUM EQ 1) MATCH=99. /* NEEDED FOR PAIR 1 RECORD 1.
>
> Now add user missing. I'd like to say
>
> COMPUTE MATCH=0.
> DO IF (FAMID EQ LAG(FAMID) AND PARID EQ LAG(PARID)).
> + DO REPEAT X=C1PRCA1 TO C1PRCA9.
> + IF (X EQ VALUE(LAG(X))) MATCH=MATCH+1.
> + IF (SYSMIS(X) AND SYSMIS(LAG(X))) MATCH=MATCH+1.
> + END REPEAT.
> ELSE.
> + COMPUTE MATCH=99.
> END IF.
> IF ($CASENUM EQ 1) MATCH=99. /* NEEDED FOR PAIR 1 RECORD 1.
>
> However, the Value and Lag functions don't work together--in any sequence.
>
> A plausible alternative is
>
> COMPUTE MATCH=0.
> DO IF (FAMID EQ LAG(FAMID) AND PARID EQ LAG(PARID)).
> + DO REPEAT X=C1PRCA1 TO C1PRCA9.
> + COMPUTE #TEMP=LAG(X).
> + IF (X EQ VALUE(#TEMP)) MATCH=MATCH+1.
> + IF (SYSMIS(X) AND SYSMIS(#TEMP)) MATCH=MATCH+1.
> + END REPEAT.
> ELSE.
> + COMPUTE MATCH=99.
> END IF.
> IF ($CASENUM EQ 1) MATCH=99. /* NEEDED FOR PAIR 1 RECORD 1.
>
> But the problem here is that a user missing value is reset to sysmis in
> this statement.
>
> + COMPUTE #TEMP=LAG(X).
>
> So things don't work correctly.
>
> The only alternative I can think of is to set user missing off, execute the
> comparison and then set user missing back on.
>
> My question: is there a one step alternative?
>
> Follow up question: Will the identify duplicates thing work correctly in
> this case?
>
>
> Thanks, Gene Maguin
>
>

Peck, Jon

Re: Hmm.. How to do this?

In reply to this post by Richard Ristow

[snip]

>However, the Value and Lag functions don't work together--in any
>sequence.

Barf. One of those things SPSS didn't fully think out. (If it'll make
you feel any 'better', VALUE also doesn't work for variables referenced
as vector elements; see thread "VECTOR and VALUE problem", Thu, 20 Jan
2005 ff. If it'll make you feel still 'better', that one surprised the
SPSS folks.)

[>>>Peck, Jon] You may think this was not fully thought out, but there is in fact a good reason for these restrictions, which are documented.

VALUE. VALUE(variable). Numeric or string. Returns the value of variable, ignoring user
missing-value definitions for variable, which must be a variable name or a vector reference
to a variable name.

The reason is that when values are fetched as input to a transformation program, their use is not yet bound to a particular operator or function. In fact, they may be used in several places. Following the general rules, they are converted to sysmis based on the originating variable missing value definitions. Now certain functions have to work differently, including VALUE, because its behavior is tightly bound to metadata for its argument. So for this function to work, it has to be able to identify the variable whose value is being taken in order to counteract the general conversion rule. That means that functions such as VALUE and VALUELABEL cannot take an expression as the argument: they must have a simple variable reference. In the instance described, VALUE is being passed an expression.

Of course there is nothing stopping the use of a compute with a simple VALUE function usage or a TEMPORARY removal of missing values for the variables.
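To make the reasoning above concrete, here is a toy model in Python. It is only an analogy and says nothing about how SPSS is actually implemented; the class `Variable` and the functions `fetch` and `value` are invented for the illustration. The point it shows is that once a value has been fetched under the general rule, the missing-value metadata is no longer attached to it, so a VALUE-like function needs the variable itself.

```python
class Variable:
    """Toy stand-in for a variable: a raw value plus user-missing metadata."""

    def __init__(self, raw, user_missing=()):
        self.raw = raw
        self.user_missing = set(user_missing)


def fetch(var):
    """General rule: a user-missing value is handed on as sysmis (None)."""
    return None if var.raw in var.user_missing else var.raw


def value(var):
    """VALUE-like function: needs the variable, not an already-fetched number."""
    if not isinstance(var, Variable):
        raise TypeError("VALUE needs a simple variable reference")
    return var.raw


a = Variable(raw=2, user_missing={2})
temp = fetch(a)      # like COMPUTE #TEMP = LAG(X): the metadata is left behind

print(fetch(a))      # the user-missing 2 arrives as None
print(value(a))      # the raw value is still recoverable from the variable
try:
    value(temp)      # an expression result carries no metadata to undo
except TypeError as err:
    print("error:", err)
```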

-Jon Peck

Richard Ristow

Re: Hmm.. How to do this?

Ah, well. Sorry to step on toes, especially as I've left a long trail
of bruised toes.

At 08:30 AM 11/14/2006, Peck, Jon wrote:

>To the (accurate) observation that
>>>However, the Value and Lag functions don't work together--in any
>>>sequence.
>>I'd written,
>>Barf. One of those things SPSS didn't fully think out.
>
>[>>>Peck, Jon] There is in fact a good reason for these restrictions,
>which are documented. [From the Command Syntax Reference; the
>definition in the SPSS 14 CSR differs slightly from this:]
>
>>VALUE. VALUE(variable). Numeric or string. Returns the value of
>>variable, ignoring user missing-value definitions for variable, which
>>must be a variable name or a vector reference to a variable name.
>
>The reason is that when values are fetched as input to a
>transformation program, their use is not yet bound to a particular
>operator or function. In fact, they may be used in several
>places. Following the general rules, they are converted to sysmis
>based on the originating variable missing value definitions.

Granted. To simplify, I hope accurately:
A function that takes a general expression as argument must treat
user-missing values, for an argument consisting of a single variable,
exactly as it treats system-missing values.

>Now certain functions have to work differently, including VALUE,
>because its behavior is tightly bound to metadata for its argument. In
>the instance described, VALUE is being passed an expression[, namely
>LAG(variable)].

Also granted. That said, LAG(variable) is also one of those functions
that work differently, in that its argument is not only a single
variable, its returned value (like that of VALUE) is a value of a
variable.

It would be desirable, then, to have VALUE (and SYSMIS, which also
takes special cognizance of user-missing values) be passed the metadata
of LAG's variable, and act on the LAGged value they receive, as they
would on an un-LAGged value of the same variable.

Further granting that it's a lot easier to say "not fully thought out"
in retrospect, than to consider (and likely write special code for) an
interaction of this subtlety, during development.
...................
As for "VALUE doesn't work for variables referenced as vector
elements", that restriction isn't there, at least in SPSS 14; the
quoted documentation says the same. (It was a problem, in SPSS 9.) See
the following, which is SPSS 14 draft output (code and output not saved
separately):

NEW FILE.
DATA LIST FREE / A.
BEGIN DATA
1 2 3
END DATA.
MISSING VALUES A(2).
COMPUTE B = A.
COMPUTE C = VALUE(A).
VECTOR VECT=A TO C.
COMPUTE D_VECT = VECT(1).
COMPUTE E_VECT = VALUE(VECT(1)).

LIST.
|-----------------------------|---------------------------|
|Output Created |16-NOV-2006 14:04:03 |
|-----------------------------|---------------------------|
A B C D_VECT E_VECT

1.00 1.00 1.00 1.00 1.00
2.00 . 2.00 . 2.00
3.00 3.00 3.00 3.00 3.00

Number of cases read: 3 Number of cases listed: 3

Eric Skuja

Re: Comparing two data sets

In reply to this post by Fiona Graff

Fiona,

I found the following some years back when I had the same problem. It
works perfectly. Best wishes

Eric

___________________________________________________

Duanren Yuan wrote:

Dear SPSS netter,

Would you help me solve the problem. I asked two persons to do data
entry. They used the same set of questionnaires. I want to check
whether each value in the two data files are exactly the same by using
SPSS program. How Can I do that? Any suggestion will be very much
appreciated.

Jeremy Yuan

* Sample data files * .
DATA LIST /ID (F1) NX NY NZ (3F1) SX SY SZ (3A1).
BEGIN DATA
1124ABC
2245SEF
3234DFG
END DATA.
SAVE OUTFILE 'F1'.

DATA LIST /ID (F1) NX NY NZ (3F1) SX SY SZ (3A1).
BEGIN DATA
1123ABC
2345SEF
3234DFH
END DATA.
SAVE OUTFILE 'F2'.

* Assuming your files are sorted by ID, have ID's which match,
the variables are the same (String lengths are the same etc) *.

ADD FILES FILE 'F1' / IN=ONE / FILE='F2' / IN=TWO / BY ID .
DO IF TWO.
DO REPEAT V=NX NY NZ SX SY SZ /TAG=TNX TNY TNZ TSX TSY TSZ.
COMPUTE TAG=(V<>LAG(V)).
END REPEAT.
END IF.
COMPUTE DIFF=SUM(TNX TO TSZ).
FORMATS TNX TNY TNZ TSX TSY TSZ (F1).
LIST.

ID NX NY NZ SX SY SZ ONE TWO TNX TNY TNZ TSX TSY TSZ DIFF

1 1 2 4 A B C 1 0 . . . . . . .
1 1 2 3 A B C 0 1 0 0 1 0 0 0 1.00
2 2 4 5 S E F 1 0 . . . . . . .
2 3 4 5 S E F 0 1 1 0 0 0 0 0 1.00
3 2 3 4 D F G 1 0 . . . . . . .
3 2 3 4 D F H 0 1 0 0 0 0 0 1 1.00
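The same per-variable flags can be reproduced outside SPSS. The sketch below is in Python and swaps the technique: instead of interleaving the files and using LAG, it looks up each ID from the first file in a dictionary built from the second. The data lines are the ones from the sample above; the function names `parse` and `flag_differences` are hypothetical.

```python
VARIABLES = ("NX", "NY", "NZ", "SX", "SY", "SZ")


def parse(lines):
    """Split fixed-width records: one ID character, then six 1-character fields."""
    return [dict(zip(("ID",) + VARIABLES, line)) for line in lines]


def flag_differences(file1, file2, variables=VARIABLES):
    """For each ID present in both files, flag differing variables with 1."""
    second = {row["ID"]: row for row in file2}
    out = []
    for row in file1:
        other = second.get(row["ID"])
        if other is None:
            continue  # ID only in file1: nothing to compare against
        flags = {v: int(row[v] != other[v]) for v in variables}
        out.append({"ID": row["ID"], **flags, "DIFF": sum(flags.values())})
    return out


f1 = parse(["1124ABC", "2245SEF", "3234DFG"])
f2 = parse(["1123ABC", "2345SEF", "3234DFH"])

diffs = flag_differences(f1, f2)
for row in diffs:
    print(row)
```

Note that this sketch, like the syntax above, says nothing about IDs that occur in only one file; those need a separate check.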

________________________________________________________________

Fiona Graff wrote:

> Hi all,
>
> Does anyone know if there is an easy way to compare two data files to
> determine if their data contents are identical? I know how to do this by
> exporting SPSS files to excel and then writing a formula to compare
> cell-by-cell, but I know there must be an easier way in SPSS.
>
> Thanks very much,
>
> Fiona Graff