Poisson regression python module.

Re: Hmm.. How to do this?

Richard Ristow
Postscript. At 06:17 PM 11/13/2006, I wrote:

>I'd try something like this (not tested):
>
>*  Prepare the variables to be "left":             .
>NUMERIC #X1 TO #X10 (F5.3) /* or, any other format .
>
>*  Check for match with previous case, as before:  .
>COMPUTE MATCH=0.
>DO IF (FAMID EQ LAG(FAMID) AND PARID EQ LAG(PARID)).
[Etc. See previous posting for the complete code.]

I don't think this logic, or the earlier logic in this thread (if it
had worked as intended), will detect records - combinations of the key
variables - that are found in only one of the two files being compared.
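
A minimal sketch of that gap, written in Python rather than SPSS syntax. It assumes each file has already been read into a list of dicts; the key names FAMID and PARID come from the quoted code, while the function name `unmatched_keys` and the sample rows are made up for illustration:

```python
def unmatched_keys(file_a, file_b, keys=("FAMID", "PARID")):
    """Return key combinations present in only one of the two files."""
    keys_a = {tuple(row[k] for k in keys) for row in file_a}
    keys_b = {tuple(row[k] for k in keys) for row in file_b}
    return sorted(keys_a - keys_b), sorted(keys_b - keys_a)

# Hypothetical sample rows, not taken from the thread.
file_a = [{"FAMID": 1, "PARID": 1}, {"FAMID": 1, "PARID": 2}, {"FAMID": 2, "PARID": 1}]
file_b = [{"FAMID": 1, "PARID": 1}, {"FAMID": 2, "PARID": 1}, {"FAMID": 3, "PARID": 1}]

only_a, only_b = unmatched_keys(file_a, file_b)
print("only in A:", only_a)
print("only in B:", only_b)
```

A check like this would have to run alongside any case-by-case comparison, since comparing adjacent records says nothing about keys that never pair up.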

Re: Comparing two data sets

Boer, CPJ de
In reply to this post by Fiona Graff
Hello Fiona,

This is one reason we also buy a license for SPSS Data Entry
Builder.
With this program you can compare two datasets and receive an extensive
report on cases that are missing from one dataset or the other, and on
cases that appear in both datasets but differ in one or more variables.
Unfortunately I cannot reproduce the exact steps, as our license
has expired (it must have lapsed in August, but nobody seems to have
noticed yet...). Perhaps another forum member can come up with them?


Regards,

Kees de Boer
Ing. C.P.J. de Boer
EMGO, VUmc
Dept. of Data Management & System Administration
BS-7 D-451
(020) 44 49828
In theory there is no difference between theory and practice. Practice,
however, is different...

> -----Original message-----
> From: SPSSX(r) Discussion [mailto:[hidden email]]
> On behalf of Fiona Graff
> Sent: Tuesday, 14 November 2006 0:07
> To: [hidden email]
> Subject: Comparing two data sets
>
> Hi all,
>
> Does anyone know if there is an easy way to compare two data
> files to determine if their data contents are identical?  I
> know how to do this by exporting SPSS files to excel and then
> writing a formula to compare cell-by-cell, but I know there
> must be an easier way in SPSS.
>
> Thanks very much,
>
> Fiona Graff
>
>
>
>
>
> Any information, including protected health information
> (PHI), transmitted  in this email is intended only for the
> person or entity to which it is addressed and may contain
> information that is privileged, confidential and or exempt
> from disclosure under applicable Federal or State law. Any
> review, retransmission, dissemination or other use of or
> taking of any action in reliance upon, protected health
> information (PHI) by persons or entities other than the
> intended recipient is prohibited. If you received this email
> in error, please contact the sender and delete the material
> from any computer.
>

Re: Comparing two data sets

Boer, CPJ de
In reply to this post by Fiona Graff
Hi Fiona,

I found our licence code for the Data Entry Builder and tried it out.
It goes like this:

1) Start SPSS Data Entry Builder
2) Open the original file (Invoer.sav): File --> Open
3) Clear the check mark at "Open in Table Entry Mode" and click <Yes>
4) In the main menu, click "View" --> "Form Entry"
5) In the main menu, click "View" --> "Builder Window"
6) You are now in the SPSS Data Entry Builder
7) In the main menu, click "File" --> "Compare Versions"
8) Browse to the file you want to check against the original file
9) Check "Match Cases by ID variable"
10) Select the unique variable present in both files (often a respondent
number or patient number)
11) Check "Sort active file by case ID variable"
12) Finally, click <OK> to start the comparison

The output would be something like this:
=============================================================
Active file: C:\zandbak\Een.sav
Verification file: C:\zandbak\Twee.sav
14-11-2006 9:16:29
Warning: Value mismatch for Case ID =   108.00.
Variable       Active         Verification
CNTMIN1         1422.41        1000.00
Warning: The verification file does not contain a matching case with the
Case ID of   119.00.
Warning: The verification file does not contain a matching case with the
Case ID of   141.00.
Warning: Value mismatch for Case ID =   715.00.
Variable       Active         Verification
COUNT5                            6.00
Warning: Value mismatch for Case ID =   717.00.
Variable       Active         Verification
CNTMIN4           93.74
=============================================================

The Active column shows values from the original file; the Verification
column shows values from the second file.
On line 4..6 of the output it says:
...
Warning: Value mismatch for Case ID =   108.00.
Variable       Active         Verification
CNTMIN1         1422.41        1000.00
...
This means that in the original file in the case with Case ID 108.00 the
variable CNTMIN1 has value 1422.41 while in the second file its value is
1000.00


Cases 119 and 141 seem to be missing etc.
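
For readers without a Builder license, the shape of that report can be sketched in Python. This is only an illustration of the matching logic, not how the Builder works internally; it assumes each file is held as a dict mapping case ID to a dict of variable values, and the function name `compare_by_id` is hypothetical:

```python
def compare_by_id(active, verification):
    """Return report lines for mismatched values and unmatched case IDs."""
    lines = []
    for case_id in sorted(active.keys() | verification.keys()):
        if case_id not in verification:
            lines.append(f"Case ID {case_id}: no matching case in verification file")
        elif case_id not in active:
            lines.append(f"Case ID {case_id}: no matching case in active file")
        else:
            a, v = active[case_id], verification[case_id]
            for var in sorted(a.keys() | v.keys()):
                if a.get(var) != v.get(var):
                    lines.append(
                        f"Case ID {case_id}: {var} "
                        f"active={a.get(var)} verification={v.get(var)}"
                    )
    return lines

# Cases 108 and 715 follow the report above; the value for case 119 is made up.
active = {108: {"CNTMIN1": 1422.41}, 119: {"CNTMIN1": 50.0}, 715: {"COUNT5": None}}
verification = {108: {"CNTMIN1": 1000.00}, 715: {"COUNT5": 6.00}}

report = compare_by_id(active, verification)
print("\n".join(report))
```

Matching on the union of case IDs is what lets cases that exist in only one file show up in the report at all.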

You can either save or print the output.


This seems to me a better way than simply comparing the files themselves
(but of course you need a license for the Builder...)

Kees


Re: Hmm.. How to do this?

hillel vardi
In reply to this post by Maguin, Eugene
Shalom


  I believe that using AGGREGATE to find duplicates is simpler
than using LAG.

As far as I can tell today (SPSS 14), there is no limit to the number
of break variables.

In the next example, AGGREGATE correctly identifies all combinations of
value types, that is:

    value         vs. value         = true
    value         vs. sysmis        = false
    value         vs. user missing  = false
    sysmis        vs. sysmis        = true
    user missing  vs. user missing  = true
    sysmis        vs. user missing  = false





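title    identify duplicates .
* Generate 15 test cases, each with 4 random integers from 0 to 10 .
input program .
vector    num(4).
loop      #j =1 to 15 .
compute   subnum=#j.
loop      #i =1 to 4 .
compute    num(#i)= trunc(uniform(11)).
end loop .
end case .
end loop .
end file .
end input program .
execute .

formats  num1 to num4(f3) .
* Keep an untouched copy, then alter the active file .
save     outfile=tmp.sav .
* Introduce system-missing (3, 9) and user-missing (6 becomes 7) values .
recode   num1 to num4(3 9=sysmis)(6=7) .
missing values num1 to num4(7) .
compute      line=1.
* Stack the altered file on the original; line is 1 or 2 by source .
add files    file=*/ file=tmp.sav
            /keep=subnum line num1 to num4 .
recode       line(1=1)(else=2).
sort cases   by subnum line .
* line_n is 2 where both copies of a case agree on every variable, else 1 .
AGGREGATE
  /OUTFILE=*
  MODE=ADDVARIABLES
  /BREAK=subnum num1 to num4
  /line_n = N(line).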
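
The grouping idea can also be sketched outside SPSS. The Python below is an illustration under one assumption: that break groups are formed from the stored values, with system-missing as its own value, which is what the table above describes. `SYSMIS`, `count_duplicates` and the sample rows are hypothetical:

```python
from collections import Counter

SYSMIS = None  # stand-in for SPSS system-missing

def count_duplicates(rows, break_vars):
    """Attach to each row the number of rows sharing its break-variable values."""
    def key(row):
        return tuple(row[v] for v in break_vars)
    counts = Counter(key(row) for row in rows)
    return [dict(row, line_n=counts[key(row)]) for row in rows]

rows = [
    {"subnum": 1, "num1": 4, "num2": SYSMIS},  # copy 1
    {"subnum": 1, "num1": 4, "num2": SYSMIS},  # copy 2, identical
    {"subnum": 2, "num1": 6, "num2": 5},       # copy 1
    {"subnum": 2, "num1": 7, "num2": 5},       # copy 2, 6 recoded to 7
]

result = count_duplicates(rows, ["subnum", "num1", "num2"])
print([r["line_n"] for r in result])
```

A count of 2 marks a case whose two copies agree on every break variable; a count of 1 marks a case that differs somewhere.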


Hillel Vardi



Gene Maguin wrote:

> All,
>
> Here is the setup. Two datasets are allegedly identical. Consider variables
> x1 to x10. Possible values include sysmis and user missing (two or three
> values). Let's do this in syntax and not through the identify duplicates
> thing.
>
> If there are no user missing, then this will work.
>
> COMPUTE MATCH=0.
> DO IF (FAMID EQ LAG(FAMID)).
> +  DO REPEAT X=X1 TO X10.
> +     IF (X EQ LAG(X)) MATCH=MATCH+1.
> +     IF (SYSMIS(X) AND SYSMIS(LAG(X))) MATCH=MATCH+1.
> +  END REPEAT.
> ELSE.
> +  COMPUTE MATCH=99.
> END IF.
> IF ($CASENUM EQ 1) MATCH=99. /* NEEDED FOR PAIR 1 RECORD 1.
>
> Now add user missing. I'd like to say
>
> COMPUTE MATCH=0.
> DO IF (FAMID EQ LAG(FAMID) AND PARID EQ LAG(PARID)).
> +  DO REPEAT X=C1PRCA1 TO C1PRCA9.
> +     IF (X EQ VALUE(LAG(X))) MATCH=MATCH+1.
> +     IF (SYSMIS(X) AND SYSMIS(LAG(X))) MATCH=MATCH+1.
> +  END REPEAT.
> ELSE.
> +  COMPUTE MATCH=99.
> END IF.
> IF ($CASENUM EQ 1) MATCH=99. /* NEEDED FOR PAIR 1 RECORD 1.
>
> However, the Value and Lag functions don't work together--in any sequence.
>
> A plausible alternative is
>
> COMPUTE MATCH=0.
> DO IF (FAMID EQ LAG(FAMID) AND PARID EQ LAG(PARID)).
> +  DO REPEAT X=C1PRCA1 TO C1PRCA9.
> +     COMPUTE #TEMP=LAG(X).
> +     IF (X EQ VALUE(#TEMP)) MATCH=MATCH+1.
> +     IF (SYSMIS(X) AND SYSMIS(#TEMP)) MATCH=MATCH+1.
> +  END REPEAT.
> ELSE.
> +  COMPUTE MATCH=99.
> END IF.
> IF ($CASENUM EQ 1) MATCH=99. /* NEEDED FOR PAIR 1 RECORD 1.
>
> But the problem here is that a user missing value is reset to sysmis in
> this statement.
>
> +     COMPUTE #TEMP=LAG(X).
>
> So things don't work correctly.
>
> The only alternative I can think of is to set user missing off, execute the
> comparison and then set user missing back on.
>
> My question: is there a one step alternative?
>
> Follow up question: Will the identify duplicates thing work correctly in
> this case?
>
>
> Thanks, Gene Maguin
>
>

Re: Hmm.. How to do this?

Peck, Jon
In reply to this post by Richard Ristow
[snip]

>However, the Value and Lag functions don't work together--in any
>sequence.

Barf. One of those things SPSS didn't fully think out. (If it'll make
you feel any 'better', VALUE also doesn't work for variables referenced
as vector elements; see thread "VECTOR and VALUE problem", Thu, 20 Jan
2005 ff. If it'll make you feel still 'better', that one surprised the
SPSS folks.)

[>>>Peck, Jon] You may think this was not fully thought out, but there is in fact a good reason for these restrictions, which are documented.

VALUE. VALUE(variable). Numeric or string. Returns the value of variable, ignoring user
missing-value definitions for variable, which must be a variable name or a vector reference
to a variable name.

The reason is that when values are fetched as input to a transformation program, their use is not yet bound to a particular operator or function.  In fact, they may be used in several places.  Following the general rules, they are converted to sysmis based on the originating variable missing value definitions.  Now certain functions have to work differently, including VALUE, because its behavior is tightly bound to metadata for its argument.  So for this function to work, it has to be able to identify the variable whose value is being taken in order to counteract the general conversion rule.  That means that functions such as VALUE and VALUELABEL cannot take an expression as the argument: they must have a simple variable reference.  In the instance described, VALUE is being passed an expression.

Of course there is nothing stopping the use of a compute with a simple VALUE function usage or a TEMPORARY removal of missing values for the variables.
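
The rule can be pictured with a toy model in Python. This is not SPSS's implementation, only a sketch of the behavior described above; the class `Variable` and its methods `fetch` and `value` are invented for the illustration:

```python
class Variable:
    """Toy model of a numeric variable with user-missing definitions."""

    def __init__(self, values, user_missing=()):
        self.values = list(values)
        self.user_missing = set(user_missing)

    def fetch(self, i):
        """General rule: a user-missing value arrives as system-missing (None)."""
        raw = self.values[i]
        return None if raw in self.user_missing else raw

    def value(self, i):
        """Like VALUE(variable): needs the variable itself to skip the rule."""
        return self.values[i]

# Illustrative data: 2 is declared user-missing.
a = Variable([1, 2, 3], user_missing=[2])

fetched = [a.fetch(i) for i in range(3)]
raw = [a.value(i) for i in range(3)]
print(fetched)
print(raw)
```

In this model, once an expression has produced a plain number or None, nothing is left that could undo the conversion, which is why `value` has to be given the variable and not an expression's result.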

-Jon Peck

Re: Hmm.. How to do this?

Richard Ristow
Ah, well. Sorry to step on toes, especially as I've left a long trail
of bruised toes.

At 08:30 AM 11/14/2006, Peck, Jon wrote:

>To the (accurate) observation that
>>>However, the Value and Lag functions don't work together--in any
>>>sequence.
>>I'd written,
>>Barf. One of those things SPSS didn't fully think out.
>
>[>>>Peck, Jon] There is in fact a good reason for these restrictions,
>which are documented. [From the Command Syntax Reference; the
>definition in the SPSS 14 CSR differs slightly from this:]
>
>>VALUE. VALUE(variable). Numeric or string. Returns the value of
>>variable, ignoring user missing-value definitions for variable, which
>>must be a variable name or a vector reference to a variable name.
>
>The reason is that when values are fetched as input to a
>transformation program, their use is not yet bound to a particular
>operator or function.  In fact, they may be used in several
>places.  Following the general rules, they are converted to sysmis
>based on the originating variable missing value definitions.

Granted. To simplify, I hope accurately:
A function that takes a general expression as argument must treat
user-missing values, for an argument consisting of a single variable,
exactly as it treats system-missing values.

>Now certain functions have to work differently, including VALUE,
>because its behavior is tightly bound to metadata for its argument. In
>the instance described, VALUE is being passed an expression[, namely
>LAG(variable)].

Also granted. That said, LAG(variable) is also one of those functions
that work differently, in that its argument is not only a single
variable, its returned value (like that of VALUE) is a value of a
variable.

It would be desirable, then, to have VALUE (and SYSMIS, which also
takes special cognizance of user-missing values) be passed the metadata
of LAG's variable, and act on the LAGged value they receive, as they
would on an un-LAGged value of the same variable.

Further granting that it's a lot easier to say "not fully thought out"
in retrospect, than to consider (and likely write special code for) an
interaction of this subtlety, during development.
...................
As for "VALUE doesn't work for variables referenced as vector
elements", that restriction isn't there, at least in SPSS 14; the
quoted documentation says the same. (It was a problem, in SPSS 9.) See
the following, which is SPSS 14 draft output (code and output not saved
separately):

NEW FILE.
DATA LIST FREE / A.
BEGIN DATA
1 2 3
END DATA.
MISSING VALUES A(2).
COMPUTE B = A.
COMPUTE C = VALUE(A).
VECTOR  VECT=A TO C.
COMPUTE D_VECT = VECT(1).
COMPUTE E_VECT = VALUE(VECT(1)).

LIST.
|-----------------------------|---------------------------|
|Output Created               |16-NOV-2006 14:04:03       |
|-----------------------------|---------------------------|
        A        B        C   D_VECT   E_VECT

     1.00     1.00     1.00     1.00     1.00
     2.00      .       2.00      .       2.00
     3.00     3.00     3.00     3.00     3.00

Number of cases read:  3    Number of cases listed:  3

Re: Comparing two data sets

Eric Skuja
In reply to this post by Fiona Graff
Fiona,

I found the following some years back when I had the same problem.  It
works perfectly.  Best wishes

Eric

___________________________________________________

Duanren Yuan wrote:

      Dear SPSS netter,

      Would you help me solve a problem? I asked two people to do data
      entry. They used the same set of questionnaires. I want to use
      SPSS to check whether every value in the two data files is exactly
      the same. How can I do that? Any suggestion will be very much
      appreciated.

      Jeremy Yuan

* Sample data files * .
DATA LIST /ID (F1) NX NY NZ (3F1) SX SY SZ (3A1).
BEGIN DATA
1124ABC
2245SEF
3234DFG
END DATA.
SAVE OUTFILE 'F1'.

DATA LIST /ID (F1) NX NY NZ (3F1) SX SY SZ (3A1).
BEGIN DATA
1123ABC
2345SEF
3234DFH
END DATA.
SAVE OUTFILE 'F2'.

* Assuming your files are sorted by ID,  have ID's which match,
the variables are the same (String lengths are the same etc) *.

ADD FILES FILE 'F1' / IN=ONE / FILE='F2' / IN=TWO / BY ID .
DO IF TWO.
DO REPEAT V=NX NY NZ SX SY SZ /TAG=TNX TNY TNZ TSX TSY TSZ.
COMPUTE TAG=(V<>LAG(V)).
END REPEAT.
END IF.
COMPUTE DIFF=SUM(TNX TO TSZ).
FORMATS TNX TNY TNZ TSX TSY TSZ (F1).
LIST.

ID NX NY NZ SX SY SZ ONE TWO TNX TNY TNZ TSX TSY TSZ     DIFF

 1  1  2  4 A  B  C   1   0   .   .   .   .   .   .       .
 1  1  2  3 A  B  C   0   1   0   0   1   0   0   0      1.00
 2  2  4  5 S  E  F   1   0   .   .   .   .   .   .       .
 2  3  4  5 S  E  F   0   1   1   0   0   0   0   0      1.00
 3  2  3  4 D  F  G   1   0   .   .   .   .   .   .       .
 3  2  3  4 D  F  H   0   1   0   0   0   0   0   1      1.00

