SPSSX Discussion

Poisson regression python module.

Mike P-5

Poisson regression python module.

Hi All,

Does anybody have an example of using the Poisson regression module for
Python (downloadable from spss.com/devcentral) with example data?
I'd like to have a look at what other people are doing in order to apply
it to my work.

Thanks in advance

Mike

Reutter, Alex

Re: Poisson regression python module.

If you're running 15.0, you might want to check out the new GENLIN procedure, which can fit Poisson regressions.

Alex

Marta García-Granero

Re: Poisson regression python module.

Hi everybody

RA> If you're running 15.0, you might want to check out the new
RA> GENLIN procedure, which can fit Poisson regressions.

Just in case anyone is interested, here is a worked example (from
Campbell's "Statistics at Square Two") of Poisson regression solved
using GENLIN (SPSS 15) or GENLOG (older SPSS versions). Results are
consistent with the ones obtained with STATA (as shown in the
mentioned book), and POISSREG.exe (from PEPI 4.0 freeware statistical
package).

Sorry but I'm not Pythoning yet (but that will change soon).

Cheers,
Marta

* Example dataset (from MJ Campbell "Statistics at Square Two", BMJ
books) *.
DATA LIST list / id(f2.0) agegroup(f8.0) smoker(f1.0) pyears(f8.0)
deaths(f4.0).
BEGIN DATA
1 0 0 18790 2
2 1 0 10673 12
3 2 0 5712 28
4 3 0 2585 28
5 4 0 1462 31
6 0 1 52407 32
7 1 1 43248 104
8 2 1 28612 206
9 3 1 12663 186
10 4 1 5317 102
END DATA.
DOCUMENT 'Coronary deaths from British male doctors. Doll & Hill
(Nat Cancer Inst Monog 1966; 19:205-68)'.
VARIABLE LABELS agegroup "Age group".
VALUE LABELS agegroup
0 "35-44 years"
1 "45-54 years"
2 "55-64 years"
3 "65-74 years"
4 "75-84 years".
VARIABLE LABELS smoker "Smoking status".
VALUE LABELS smoker
0 "No"
1 "Yes".

* Using SPSS 15 - GENLIN *.
COMPUTE logpyears=LN(pyears).
GENLIN deaths
BY agegroup smoker
(ORDER=DESCENDING)
/MODEL agegroup smoker
INTERCEPT=YES
OFFSET=logpyears
DISTRIBUTION=POISSON
LINK=LOG.

* Older SPSS: use GENLOG *.

* GENLOG uses the last group as reference group: agegroup needs recoding *.
RECODE agegroup (0=5) .
ADD VALUE LABELS agegroup 0 "" 5 "35-44 years".
RECODE smoker (0=2) .
ADD VALUE LABELS smoker 0 "" 2 "No".

FREQUENCIES
VARIABLES=agegroup smoker
/ORDER VARIABLES .

* Statistical analysis *.
GENLOG
agegroup smoker
/CSTRUCTURE=pyears
/MODEL=POISSON
/PRINT FREQ RESID ESTIM
/PLOT NONE
/CRITERIA =DELTA(0)
/DESIGN agegroup smoker .

Bauer, John H.

Re: Poisson regression python module.

With a little care to make sure that the same reference categories are used, the Python poisson_regression module delivers parameter estimates identical to GENLIN. See the Draft Viewer output below.

* Example dataset (from MJ Campbell "Statistics at Square Two", BMJ
books) *.
DATA LIST list / id(f2.0) agegroup(f8.0) smoker(f1.0) pyears(f8.0) deaths(f4.0).
BEGIN DATA
1 0 0 18790 2
2 1 0 10673 12
3 2 0 5712 28
4 3 0 2585 28
5 4 0 1462 31
6 0 1 52407 32
7 1 1 43248 104
8 2 1 28612 206
9 3 1 12663 186
10 4 1 5317 102
END DATA.
DOCUMENT 'Coronary deaths from British male doctors. Doll & Hill
(Nat Cancer Inst Monog 1966; 19:205-68)'.
VARIABLE LABELS agegroup "Age group".
VALUE LABELS agegroup
0 "35-44 years"
1 "45-54 years"
2 "55-64 years"
3 "65-74 years"
4 "75-84 years".
VARIABLE LABELS smoker "Smoking status".
VALUE LABELS smoker
0 "No"
1 "Yes".
DATASET NAME campbell .

* GENLOG uses the last group as reference group: agegroup needs recoding *.
RECODE agegroup (0=5) .
ADD VALUE LABELS agegroup 0 "" 5 "35-44 years".
RECODE smoker (0=2) .
ADD VALUE LABELS smoker 0 "" 2 "No".

BEGIN PROGRAM.
from poisson_regression import *
#help(poisson_regression)
poisson_regression(dependent="deaths",
factors=["agegroup", "smoker"],
ratevar="pyears")
END PROGRAM.

************************************************************************************************************.
* ... lots of output omitted ...
************************************************************************************************************.

Parameter Estimates (Bootstrap; see notes a, b)

                                             95% Confidence Interval   95% Trimmed Range
Parameter             Estimate  Std. Error  Lower Bound  Upper Bound  Lower Bound  Upper Bound
Intercept               -7.919        .613       -9.127       -6.712       -9.148       -7.301
agegroup_45_54years      1.484        .593         .315        2.653         .840        2.923
agegroup_55_64years      2.627        .613        1.418        3.836        2.341        4.020
agegroup_65_74years      3.350        .585        2.197        4.504        3.135        4.755
agegroup_75_84years      3.700        .645        2.428        4.972        3.447        5.294
smoker_Yes                .355        .305        -.247         .956        -.100        1.081

a Based on 210 samples.
b Loss function value equals 33.600.

Correlations of Parameter Estimates
|---------|-------------------|---------|-------------------|-------------------|-------------------|-------------------|-----------|
| | |Intercept|agegroup_45_54years|agegroup_55_64years|agegroup_65_74years|agegroup_75_84years|smoker_Yes|
|---------|-------------------|---------|-------------------|-------------------|-------------------|-------------------|-----------|
|Bootstrap|Intercept |1.000 |-.840 |-.908 |-.816 |-.820 |-.147 |
| |-------------------|---------|-------------------|-------------------|-------------------|-------------------|----------|
| |agegroup_45_54years|-.840 |1.000 |.819 |.689 |.727 |-.126 |
| |-------------------|---------|-------------------|-------------------|-------------------|-------------------|----------|
| |agegroup_55_64years|-.908 |.819 |1.000 |.774 |.764 |-.082 |
| |-------------------|---------|-------------------|-------------------|-------------------|-------------------|----------|
| |agegroup_65_74years|-.816 |.689 |.774 |1.000 |.735 |-.063 |
| |-------------------|---------|-------------------|-------------------|-------------------|-------------------|----------|
| |agegroup_75_84years|-.820 |.727 |.764 |.735 |1.000 |-.061 |
| |-------------------|---------|-------------------|-------------------|-------------------|-------------------|----------|
| |smoker_Yes |-.147 |-.126 |-.082 |-.063 |-.061 |1.000 |
|---------|-------------------|---------|-------------------|-------------------|-------------------|-------------------|-----------|
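
For anyone who wants to cross-check these estimates outside SPSS, here is a minimal sketch in plain Python of the same model: Poisson errors, log link, log person-years as offset, with 35-44 years and non-smokers as reference categories. It fits the model by iteratively reweighted least squares. This is not the code of the poisson_regression module, whose internals are not shown in this thread; the function names (design_row, solve, fit_poisson) are made up for the illustration.

```python
import math

# Data from the thread: (agegroup, smoker, pyears, deaths), original 0-4 coding.
ROWS = [
    (0, 0, 18790, 2), (1, 0, 10673, 12), (2, 0, 5712, 28),
    (3, 0, 2585, 28), (4, 0, 1462, 31), (0, 1, 52407, 32),
    (1, 1, 43248, 104), (2, 1, 28612, 206), (3, 1, 12663, 186),
    (4, 1, 5317, 102),
]


def design_row(agegroup, smoker):
    """Intercept, four age dummies (reference: group 0), smoker dummy."""
    return [1.0] + [1.0 if agegroup == g else 0.0 for g in (1, 2, 3, 4)] + [float(smoker)]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_poisson(rows, iterations=25):
    """Poisson regression with log link and log(pyears) offset, fitted by IRLS."""
    X = [design_row(a, s) for a, s, _, _ in rows]
    offset = [math.log(py) for _, _, py, _ in rows]
    y = [float(d) for _, _, _, d in rows]
    n, p = len(X), len(X[0])
    mu = [yi + 0.5 for yi in y]  # starting values for the fitted counts
    beta = [0.0] * p
    for _ in range(iterations):
        # Working response with the offset removed; the IRLS weights are mu.
        z = [math.log(m) + (yi - m) / m - o for m, yi, o in zip(mu, y, offset)]
        xtwx = [[sum(mu[i] * X[i][j] * X[i][k] for i in range(n)) for k in range(p)]
                for j in range(p)]
        xtwz = [sum(mu[i] * X[i][j] * z[i] for i in range(n)) for j in range(p)]
        beta = solve(xtwx, xtwz)
        mu = [math.exp(o + sum(b * x for b, x in zip(beta, xi)))
              for o, xi in zip(offset, X)]
    return beta


names = ["Intercept", "age 45-54", "age 55-64", "age 65-74", "age 75-84", "smoker"]
for name, b in zip(names, fit_poisson(ROWS)):
    print(f"{name:<12}{b:8.3f}")
```

The coefficients should come out close to the Estimate column above. The sketch does not compute standard errors.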

Maguin, Eugene

Re: Poisson regression python module.

John,

I'd like to respectfully ask where SPSS is going with Python specifically
and, more generally, with add-on languages such as Python and Sax Basic. I
ask this out of a curiosity driven by the above Poisson regression example.
It seems to me that Python could be or will become either a parallel to the
current syntax or a replacement for it. Either way, I am curious as to the
thinking that drove this decision (and I can imagine that it was a decision
not undertaken lightly). For instance, are there groups of SPSS users for
whom Python is a benefit relative to syntax?

Gene Maguin

Gary Rosin

Binary Logistic Classification Cutoffs

Assume a binary logistic regression, where the overall
proportion of positive responses is well above 0.50.

If I run a BLR using the default classification cutoff
of 0.50, the classification table shows that the
resulting model is better at predicting positive
responses than it is at predicting negative responses.

If I raise the classification cutoff to just under the
observed overall proportion of positive responses, the
overall proportion of correct classifications declines
somewhat, but the positive v. negative responses are more
equivalent--the model predicts positives and negatives
about equally well.

My initial reaction is to prefer the more balanced model
that results from using the higher classification
cutoff. Thoughts?

Gary

---

Gary S. Rosin
Professor of Law
South Texas College of Law
1303 San Jacinto
Houston, TX 77002

<[hidden email]>
713-646-1854

Gary Rosin

Classification cutoffs & Probit & GenLin

In reply to this post by Maguin, Eugene

When using Probit (logit) or GenLin (logit) models in
SPSS 15.0, I notice no obvious equivalent to changing
the classification cutoff in binary logistic. What
am I missing?

David Greenberg

Re: Binary Logistic Classification Cutoffs

In reply to this post by Gary Rosin

The relative importance of a false negative prediction and a false
positive prediction will depend on the particular situation you are
analyzing (e.g. a wrong prediction that a convict will commit a crime
may lead a parole board to deny release, resulting in the prisoner
serving a longer prison sentence; a wrong prediction that the convict
will not commit a crime may result in the prisoner's release, and the
commission of a new crime). Your question is not a statistics question.
You have to make a judgment about the tradeoffs here. David Greenberg,
Sociology Department, New York University

Reutter, Alex

Re: Classification cutoffs & Probit & GenLin

In reply to this post by Gary Rosin

There is no equivalent control in Probit or Genlin. You can obtain equivalent results in Genlin by saving the predicted probabilities, computing new variables for each competing classification cutoff, and then running Crosstabs (or Ctables) to create the classification tables.

Note that when choosing between cutoffs, saving the predicted probabilities can actually save you time, because you won't be refitting the logistic regression to test each cutoff. You can also run the predicted probabilities through ROC Curve, which can help you find the right balance between specificity and sensitivity in your choice of cutoff.

Alex
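
The approach described above (save the predicted probabilities once, then build one classification table per candidate cutoff) can be sketched in plain Python as follows. The probabilities and outcomes are made-up illustration data, the function name classification_table is hypothetical, and the rule "predict positive when the probability is at or above the cutoff" is an assumption. In SPSS the same tables would come from Crosstabs on the computed variables.

```python
def classification_table(probs, observed, cutoff):
    """Cross-classify observed 0/1 outcomes against predictions at one cutoff."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, observed):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        "accuracy": (tp + tn) / len(observed),
    }


# Made-up saved probabilities and outcomes (7 of 10 cases are positive).
probs = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.40]
observed = [1, 1, 1, 1, 0, 1, 1, 0, 1, 0]

# The model is fitted once; only the cutoff changes between tables.
for cutoff in (0.50, 0.68):
    print(cutoff, classification_table(probs, observed, cutoff))
```

With these invented numbers, the higher cutoff lowers overall accuracy a little while bringing sensitivity and specificity closer together, which is the pattern Gary describes.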

Bauer, John H.

Re: Poisson regression python module.

In reply to this post by Maguin, Eugene

Gene,

I don't speak for SPSS here, so I will not comment on what the company was thinking. However, there is one group of users that springs to mind immediately for whom Python is of greatest benefit relative to syntax: MACRO users. Or to be precise: those who need the kind of functionality that should be provided by MACRO, but who are not using it because of its difficulty.

But check SPSS Developer Central http://www.spss.com/devcentral/index.cfm for more of where I'm going with Python. There should be a couple of new downloads this week.

John

Weeks, Kyle

Re: Poisson regression python module.

I've been having some email problems and had thought I sent a response earlier today. Apologies to those of you who may receive this twice.

1. SPSS has no intention of abandoning traditional SPSS syntax.

2. By adding the capability of using a general purpose programming language within SPSS, we can offer a major increase in the capabilities of SPSS by combining the power of such a language with the traditional strengths of the SPSS statistical and data management engine. You can find a discussion of the benefits on SPSS Developer Central (www.spss.com/devcentral) in the Directions presentation article, which is linked on the right-hand side of the main page. The SPSS Programming and Data Management book, also linked there, illustrates many of the capabilities of this combination. In brief, a general purpose language such as Python or VB, the two currently provided, allows building of more flexible and more robust jobs and automating many tasks that previously had to be done manually.

3. While programmability offers many capabilities for those building statistical applications, it also benefits users who do not need or want to learn the new language: they can take advantage of modules built by SPSS or by other users. Recent examples include the partial least squares regression and raking modules, which will be made available soon for download on Developer Central, and the expansion of the SPSS transformation system using the trans and extendedTransforms modules.

4. In the future, SPSS will continue to offer modules that use the traditional syntax, but some new capabilities will be offered as programmability modules. The combination of SPSS and user-written modules means that the capabilities of the SPSS system can advance faster than would be the case with the traditional development methods.

Regards.

Kyle Weeks, Ph.D.
Director of Product Management, SPSS Product Line
Product Management
SPSS Inc.
[hidden email]
www.spss.com
SPSS Inc. helps organizations turn data into insight through predictive
analytics.

Marta García-Granero

Re: Poisson regression python module.

In reply to this post by Bauer, John H.

Hi everybody

Going back to the subject of Poisson regression with SPSS (I haven't
learnt Python, so I don't think I can say anything about where this
SPSS-Python marriage is going...), one thing that strikes me as unusual
in the output (with GENLOG, GENLIN or the Python module alike) is the
lack of EXP(b) with its confidence interval (it is called IRR: Incidence
Rate Ratio). In logistic or Cox regression, SPSS output gives EXP(b) (OR
in the first case and HR in the second), which is more easily interpreted
than the coefficient on the log scale. Neither GENLIN nor the Python
module does. This is my small contribution to improve the output (the
last table mimics STATA output):

* POISSON REGRESSION WITH SPSS 15 *.

* Sample dataset *.
DATA LIST list /id(F2.0) agegroup(F8.0) smoker(F1.0) pyears(F8.0) deaths(F4.0).
BEGIN DATA
1 0 0 18790 2
2 1 0 10673 12
3 2 0 5712 28
4 3 0 2585 28
5 4 0 1462 31
6 0 1 52407 32
7 1 1 43248 104
8 2 1 28612 206
9 3 1 12663 186
10 4 1 5317 102
END DATA.
DOCUMENT 'Coronary deaths from British male doctors. Doll & Hill
(Nat Cancer Inst Monog 1966; 19:205-68)'.
VARIABLE LABELS agegroup "Age group".
VALUE LABELS agegroup
0 "35-44 years"
1 "45-54 years"
2 "55-64 years"
3 "65-74 years"
4 "75-84 years".
VARIABLE LABELS smoker "Smoking status".
VALUE LABELS smoker
0 "No"
1 "Yes".
DATASET NAME Campbell .

* OMS (to capture the parameter estimates) *.
DATASET DECLARE Coefficients.
SET OLANG=ENGLISH.
OMS /SELECT TABLES
/IF COMMANDS = 'Generalized Linear Models'
SUBTYPES = 'ParameterEstimates'
/DESTINATION FORMAT = SAV
OUTFILE = Coefficients.

* GENLIN *.
COMPUTE logpyears=LN(pyears).
GENLIN deaths
BY agegroup smoker
(ORDER=DESCENDING)
/MODEL agegroup smoker
INTERCEPT=YES
OFFSET=logpyears
DISTRIBUTION=POISSON
LINK=LOG.
OMSEND.

* Computing & displaying IRR *.
DATASET ACTIVATE Coefficients.
DELETE VARIABLES Command_ TO Label_.
SELECT IF (NOT MISSING(Sig)) AND (Var1 NE '(Intercept)').
COMPUTE IRR=EXP(B).
COMPUTE LowerIRR=EXP(Lower).
COMPUTE UpperIRR=EXP(Upper).
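COMPUTE Zvalue=SQRT(WaldChiSquare).
* SQRT gives only the magnitude of Z, so restore the sign of the coefficient *.
IF (B LT 0) Zvalue=-Zvalue.
VAR LABEL Var1 'Parameter'/
IRR 'IRR'/
LowerIRR 'Lower 95% CL for IRR'/
UpperIRR 'Upper 95% CL for IRR'/
Zvalue 'Z'/
Sig 'Sig.'.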
FORMAT IRR TO Zvalue (F8.4).
COMPUTE id=$casenum.
SORT CASES BY id(D).
OMS /SELECT TABLES
/IF COMMANDS = 'Summarize'
SUBTYPES = 'Case Processing Summary'
/DESTINATION VIEWER = NO.
SUMMARIZE
/TABLES=Var1 IRR Zvalue Sig LowerIRR UpperIRR
/FORMAT=LIST NOCASENUM NOTOTAL
/TITLE='Poisson regression: Incidence Rate Ratio (IRR) & 95% Wald CI'
/MISSING=VARIABLE
/CELLS=NONE.
DATASET ACTIVATE Campbell.
DATASET CLOSE Coefficients.
OMSEND.

Regards,
Marta
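
The arithmetic behind the IRR columns is just exponentiation of the coefficient and of its confidence limits. A minimal Python sketch follows, assuming a Wald interval and using the smoker_Yes estimate and standard error from John Bauer's table earlier in the thread. That standard error is a bootstrap one, so the interval GENLIN reports may differ. The function name incidence_rate_ratio is made up for the illustration.

```python
import math


def incidence_rate_ratio(b, se, z_crit=1.959964):
    """Return (IRR, lower, upper): exp of a log-scale coefficient and its Wald limits."""
    return (math.exp(b), math.exp(b - z_crit * se), math.exp(b + z_crit * se))


# smoker_Yes: Estimate .355, Std. Error .305 (bootstrap values from the thread).
irr, lower, upper = incidence_rate_ratio(0.355, 0.305)
print(f"IRR = {irr:.3f}, 95% CI {lower:.3f} to {upper:.3f}")
```

An interval that contains 1 on the IRR scale corresponds to an interval that contains 0 on the log scale.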

Marta García-Granero

Re: Poisson regression python module.

Hi

Thursday, November 9, 2006, 10:36:45 AM, I wrote:

MGG> Going back to the subject of Poisson regression with SPSS [...]
MGG> one thing that strikes me as unusual in the output (either with
MGG> GENLOG, GENLIN or the Python module) is the lack of EXP(b)
MGG> with its confidence interval (it is called IRR: Incidence Rate
MGG> Ratio).

Ooooops! I didn't spot the "/PRINT SOLUTION(EXPONENTIATED)" subcommand
in GENLIN, sorry.

Still very new with SPSS 15, I didn't have time to mess with every
option...

Regards,
Marta

Marta García-Granero

Re: Poisson regression python module.

In reply to this post by Marta García-Granero

Hi Eric

I'd rather follow this exchange of mail on the list, if you don't
mind.

ejff> Dear Marta

ejff> That's what I did before and was answered the same tip by Kyle (from SPSS).
ejff> Thanks anyway.

Well that's how things go... If you don't have SPSS 15, then you have
to work a bit harder to fit a negative binomial regression model.

ejff> By the way when it says "You may be able to use the following set of command
ejff> syntax, after editing": 'after editing' means 'getting the
ejff> dataset'?

No, it means that you have to replace the variable names supplied with
the example (v1, v2, v3...) with the ones you are going to use to fit the
model. If we take a look at resolution 54271 (see my comments to the
syntax) :

Resolution Description:

SPSS releases through Release 14 have no procedure designed to fit
negative binomial regression models. The new GENLIN procedure in
Release 15 includes the ability to fit negative binomial regression
models. In releases prior to 15, the following approach may be of use.

The CNLR procedure fits nonlinear regression models, including ones
with user defined loss functions. You may be able to use the following
set of command syntax, after editing, to fit negative binomial
regression models:

* Change y to the actual dependent variable.
* Add as many parameters b0, b1, ... to the model program as needed.
* Change v1, v2, v3... into the names of the independent variables.
* Modify "compute bx" to be the sum of parameters times independent variables.

* MLE FOR NEGATIVE BINOMIAL (x = threshold p = prob) .

* My comments: define as many "b" as independent variables + 1 (don't
forget the intercept!), those are only starting values *.

Model program x = 1.5 b0 = 0.0 b1 = 1 b2 = 1 b3 = 1 .

* Change "v1", "v2", "v3"... by the names of the independent variables
of your model *.

compute bx = b0+b1*v1+b2*v2+b3*v3 .

* Now, leave the following commands unchanged *.

compute k = exp(bx) .
compute pred_ = x/k .
COMPUTE loss_ = -(lngamma(x+y)-lngamma(x)-lngamma(y+1)+x*bx-(x+y)*ln(1+k)) .
CNLR y
/PRED pred_
/LOSS loss_
/BOUNDS x >= 1 .
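
To see what the loss function above is doing, here is a small Python sketch that mirrors the three COMPUTE lines. It checks numerically that exp(-loss_) behaves like a negative binomial probability function whose mean is pred_ = x/exp(bx). The function names nb_loss and nb_mean and the trial values of x and bx are made up for the illustration. One consequence, as far as the posted formulas go, is that a larger bx lowers the predicted mean. The slopes from this setup would therefore be expected to carry the opposite sign to those from software that models the log of the mean directly.

```python
import math


def nb_loss(y, bx, x):
    """Negative log-likelihood of one count y, term by term as in COMPUTE loss_."""
    k = math.exp(bx)
    return -(math.lgamma(x + y) - math.lgamma(x) - math.lgamma(y + 1)
             + x * bx - (x + y) * math.log(1 + k))


def nb_mean(bx, x):
    """Predicted mean as in COMPUTE pred_: x divided by exp(bx)."""
    return x / math.exp(bx)


# Arbitrary trial values: x is the bounded parameter, bx the linear predictor.
x, bx = 1.5, 0.4
probs = [math.exp(-nb_loss(y, bx, x)) for y in range(2000)]
total = sum(probs)                                 # should be very close to 1
mean = sum(y * p for y, p in enumerate(probs))     # should match nb_mean(bx, x)
print(total, mean, nb_mean(bx, x))
```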

>> You'd better ask the whole list, since some SPSS people might give you
>> a better answer than mine. See this SPSS Resolution:
>>
>> http://support.spss.com/tech/troubleshooting/ressearchdetail.asp?ID=54271
>>
>> (if requested to login, use Guest both as user & password)
>>
>> I have never tried it myself.
>>
>> ejff> Right, I wish to run a Poisson regression but I have trouble with
>> ejff> overdispersion. Note that the DV is a rate and IV are categorical and
>> ejff> continuous. Thing is:
>>
>> ejff> - I don't have SPSS 15 and therefore not allowed to use the brand new
>> Genlin option in order to run a negative binomial regression;
>> ejff> - SAS offers the possibility of including a correction term within a
>> Poisson regression. Nevertheless SAS does not accept continuous independent
>> variable and I'm afraid not to be able to categorize the ones I intend to use.
>>
>> ejff> So: are you aware of the possibility of this correcting term in SPSS
>> (version 12 or 14)? How can I manage to take into account a variance greater (or
>> smaller) than the mean?

--
Regards,
Dr. Marta García-Granero,PhD mailto:[hidden email]
Statistician

---
"It is unwise to use a statistical procedure whose use one does
not understand. SPSS syntax guide cannot supply this knowledge, and it
is certainly no substitute for the basic understanding of statistics
and statistical thinking that is essential for the wise choice of
methods and the correct interpretation of their results".

(Adapted from WinPepi manual - I'm sure Joe Abramson will not mind)

Eric Janssen

Re: Poisson regression python module.

Hi Marta

Thanks for your concern.
Another couple of questions: I did as I was told, edited the variables and then ran
the CNLR. But the output appeared quite weird, at least to my understanding:

- first, does CNLR accept categorical IVs? I put one in the model (the other IVs are
continuous) and got just one estimate, which does not make it easy to
interpret.

- I did a NB regression using SAS and the results are quite different. I know it
might be a difference in the calculation algorithm used in each software, but still
I think I did not handle the regression very well.

- next, when I add the BOOTSTRAP subcommand, I get valueless estimated SE.

Any other tips?
Thanks again
E.

Marta García-Granero

Re: Poisson regression python module.

Hi Eric

ejff> Another couple questions: I did as was told, edited the variables and then ran
ejff> the CNLR. But the output appeared quite weird, at least for my understanding:

ejff> - first, does CNLR accept categorical IVs? I put one in the model (the other IVs are
ejff> continuous) and just got one estimator, which does not make it quite easy to
ejff> interpret.

Not directly, you have to dummy code them and pass them to CNLR as
different variables (each dummy with its own "Bi" coefficient). For
instance, to dummy code Agegroup (with the following categories:
1: 25-35 years; 2: 35-45 years; 3: 45-55 years; 4: 55-65 years) using
the first category as the reference, use the following syntax:

DO REPEAT A=agegrp1 agegrp2 agegrp3 /B=2 3 4.
COMPUTE A=(agegroup EQ B).
END REPEAT.

This will create 3 dummy variables, called agegrp1 to agegrp3. Pass
them to the CNLR syntax as b1*agegrp1, b2*agegrp2, b3*agegrp3...
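
Editorial sketch (not part of the original post): the same indicator coding written in plain Python, for readers who want to see what the three dummies contain. The function name dummy_code and the six example agegroup codes are made up for illustration; category 1 is the reference and so scores 0 on every dummy.

```python
def dummy_code(values, levels):
    """Return one 0/1 indicator list per level in `levels`.

    Categories not listed in `levels` (here the reference category 1)
    get 0 on every indicator.
    """
    return {level: [1 if v == level else 0 for v in values] for level in levels}

# Hypothetical agegroup codes for six cases (illustration only).
agegroup = [1, 2, 3, 4, 2, 1]

dummies = dummy_code(agegroup, levels=[2, 3, 4])
agegrp1, agegrp2, agegrp3 = dummies[2], dummies[3], dummies[4]

print(agegrp1)  # [0, 1, 0, 0, 1, 0]
print(agegrp2)  # [0, 0, 1, 0, 0, 0]
print(agegrp3)  # [0, 0, 0, 1, 0, 0]
```

Each dummy then gets its own coefficient in the model expression, which is why a single categorical variable entered as-is yields only one estimate.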

ejff> - I did a NB regression using SAS and the results are quite different. I know it
ejff> might be a difference in the calculation algorithm used in each software but still
ejff> I think I did not handle the regression very well.

Try now and see if the results agree.

ejff> - next, when I add the BOOTSTRAP subcommand, I get valueless estimated SE.

I don't know, check them after dummy coding the categorical variable
and running the syntax again.

Regards,
Marta

Maguin, Eugene

Hmm.. How to do this?

In reply to this post by Bauer, John H.

All,

Here is the setup. Two datasets are allegedly identical. Consider variables
x1 to x10. Possible values include sysmis and user missing (two or three
values). Let's do this in syntax and not through the identify duplicates
thing.

If there are no user missing, then this will work.

COMPUTE MATCH=0.
DO IF (FAMID EQ LAG(FAMID)).
+ DO REPEAT X=X1 TO X10.
+ IF (X EQ LAG(X)) MATCH=MATCH+1.
+ IF (SYSMIS(X) AND SYSMIS(LAG(X))) MATCH=MATCH+1.
+ END REPEAT.
ELSE.
+ COMPUTE MATCH=99.
END IF.
IF ($CASENUM EQ 1) MATCH=99. /* NEEDED FOR PAIR 1 RECORD 1.

Now add user missing. I'd like to say

COMPUTE MATCH=0.
DO IF (FAMID EQ LAG(FAMID) AND PARID EQ LAG(PARID)).
+ DO REPEAT X=C1PRCA1 TO C1PRCA9.
+ IF (X EQ VALUE(LAG(X))) MATCH=MATCH+1.
+ IF (SYSMIS(X) AND SYSMIS(LAG(X))) MATCH=MATCH+1.
+ END REPEAT.
ELSE.
+ COMPUTE MATCH=99.
END IF.
IF ($CASENUM EQ 1) MATCH=99. /* NEEDED FOR PAIR 1 RECORD 1.

However, the Value and Lag functions don't work together--in any sequence.

A plausible alternative is

COMPUTE MATCH=0.
DO IF (FAMID EQ LAG(FAMID) AND PARID EQ LAG(PARID)).
+ DO REPEAT X=C1PRCA1 TO C1PRCA9.
+ COMPUTE #TEMP=LAG(X).
+ IF (X EQ VALUE(#TEMP)) MATCH=MATCH+1.
+ IF (SYSMIS(X) AND SYSMIS(#TEMP)) MATCH=MATCH+1.
+ END REPEAT.
ELSE.
+ COMPUTE MATCH=99.
END IF.
IF ($CASENUM EQ 1) MATCH=99. /* NEEDED FOR PAIR 1 RECORD 1.

But the problem here is that a user missing value is reset to sysmis in
this statement.

+ COMPUTE #TEMP=LAG(X).

So things don't work correctly.

The only alternative I can think of is to set user missing off, execute the
comparison and then set user missing back on.

My question: is there a one step alternative?

Follow up question: Will the identify duplicates thing work correctly in
this case?

Thanks, Gene Maguin

Fiona Graff

Comparing two data sets

Hi all,

Does anyone know if there is an easy way to compare two data files to
determine if their data contents are identical? I know how to do this by
exporting SPSS files to excel and then writing a formula to compare
cell-by-cell, but I know there must be an easier way in SPSS.

Thanks very much,

Fiona Graff

Any information, including protected health information (PHI), transmitted
in this email is intended only for the person or entity to which it is
addressed and may contain information that is privileged, confidential and or
exempt from disclosure under applicable Federal or State law. Any review,
retransmission, dissemination or other use of or taking of any action in
reliance upon, protected health information (PHI) by persons or entities other
than the intended recipient is prohibited. If you received this email in error,
please contact the sender and delete the material from any computer.

Richard Ristow

Re: Hmm.. How to do this?

In reply to this post by Maguin, Eugene

At 05:09 PM 11/13/2006, Gene Maguin wrote:

>Two datasets are allegedly identical. [This will be tested by
>interleaving them, and comparing values in successive cases.] Consider
>variables x1 to x10.
>
>If there are no user missing, then this will work.
>
>COMPUTE MATCH=0.
>DO IF (FAMID EQ LAG(FAMID)).
>+ DO REPEAT X=X1 TO X10.
>+ IF (X EQ LAG(X)) MATCH=MATCH+1.
>+ IF (SYSMIS(X) AND SYSMIS(LAG(X))) MATCH=MATCH+1.
>+ END REPEAT.
>ELSE.
>+ COMPUTE MATCH=99.
>END IF.
>IF ($CASENUM EQ 1) MATCH=99. /* NEEDED FOR PAIR 1 RECORD 1.
>
>Now add user missing. I'd like to say
>
>COMPUTE MATCH=0.
>DO IF (FAMID EQ LAG(FAMID) AND PARID EQ LAG(PARID)).
>+ DO REPEAT X=C1PRCA1 TO C1PRCA9.
>+ IF (X EQ VALUE(LAG(X))) MATCH=MATCH+1.
>+ IF (SYSMIS(X) AND SYSMIS(LAG(X))) MATCH=MATCH+1.
>+ END REPEAT.
>ELSE.
>+ COMPUTE MATCH=99.
>END IF.
>IF ($CASENUM EQ 1) MATCH=99. /* NEEDED FOR PAIR 1 RECORD 1.
>
>However, the Value and Lag functions don't work together--in any
>sequence.

Barf. One of those things SPSS didn't fully think out. (If it'll make
you feel any 'better', VALUE also doesn't work for variables referenced
as vector elements; see thread "VECTOR and VALUE problem", Thu, 20 Jan
2005 ff. If it'll make you feel still 'better', that one surprised the
SPSS folks.)

As an alternative to LAG, there's logic using LEAVE - except I'd use
scratch variables, which always behave as if LEAVE had been specified
for them, and which don't clutter up the final file.

I'd try something like this (not tested):

* Prepare the variables to be "left": .
NUMERIC #X1 TO #X9 (F5.3) /* one per compared variable; any format .

* Check for match with previous case, as before: .
COMPUTE MATCH=0.
DO IF (FAMID EQ LAG(FAMID) AND PARID EQ LAG(PARID)).
+ DO REPEAT X = C1PRCA1 TO C1PRCA9
     /LAG_X = #X1 TO #X9.
+ IF (VALUE(X) EQ LAG_X) MATCH=MATCH+1.
+ IF (SYSMIS(X) AND SYSMIS(LAG_X)) MATCH=MATCH+1.
+ END REPEAT.
ELSE.
+ COMPUTE MATCH=99.
END IF.
IF ($CASENUM EQ 1) MATCH=99. /* NEEDED FOR PAIR 1 RECORD 1.

* Save the variables to be "left" for the next case.
* (I'm retaining your indents so I can write this .
* code easily by copying yours.) .

+ DO REPEAT X = C1PRCA1 TO C1PRCA9
     /LAG_X = #X1 TO #X9.
+ COMPUTE LAG_X = VALUE(X).
+ END REPEAT.
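
Editorial sketch (not part of the original post): the carry-forward logic above, restated in Python so the intended behaviour can be checked outside SPSS. It is only a model of the idea, under these assumptions: None stands in for system-missing, user-missing codes such as -9 are ordinary stored values (which is what VALUE() exposes), and the field names famid, x1, x2, x3 are invented for the example.

```python
SYSMIS = None  # stand-in for SPSS system-missing

def count_matches(records, key, variables):
    """For each record, count variables equal to the previous record's.

    The previous record's raw values are carried forward, like the
    scratch variables in the syntax above. Two system-missing values
    count as a match; user-missing codes are compared as plain numbers.
    Records whose key differs from the previous record (including the
    very first record) get 99, as in the syntax above.
    """
    results = []
    previous = None
    for rec in records:
        if previous is not None and rec[key] == previous[key]:
            match = 0
            for var in variables:
                cur, old = rec[var], previous[var]
                if cur is SYSMIS and old is SYSMIS:
                    match += 1
                elif cur is not SYSMIS and cur == old:
                    match += 1
            results.append(match)
        else:
            results.append(99)
        previous = rec  # "leave" this case's values for the next one
    return results

# Invented example: -9 plays the role of a user-missing code.
records = [
    {"famid": 1, "x1": 5, "x2": -9, "x3": SYSMIS},
    {"famid": 1, "x1": 5, "x2": -9, "x3": SYSMIS},
    {"famid": 2, "x1": 3, "x2": 4, "x3": 7},
    {"famid": 2, "x1": 3, "x2": -9, "x3": 7},
]
matches = count_matches(records, "famid", ["x1", "x2", "x3"])
print(matches)  # [99, 3, 99, 2]
```

A count equal to the number of compared variables marks a fully matching pair; anything lower points at the pair to inspect.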

Barnett, Adrian (HEALTH)

Re: Comparing two data sets

In reply to this post by Fiona Graff

Hi Fiona
I'm not aware of a way to do what you want with SPSS, but there are free
(and payware) utilities available which will compare plain text (data)
files. So if you are prepared to export your SPSS files to a plain ASCII
file, you can tell the utility to compare the 2 files.

What the ones I'm used to will do is open both files side by side and
highlight lines which have been inserted in one but not the other, and
then draw a line over to the other file to show where it was put in.
Ditto for lines which have been altered. Some then have the ability to
nominate which bits from one file you want duplicated into the other.
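
Editorial sketch (not part of the original post): the same kind of line-by-line comparison can be scripted with Python's standard difflib module once both files have been exported as text. The helper name compare_exports, the file names, and the three comma-separated rows are invented for illustration; in practice the two lists would come from reading the exported files.

```python
import difflib

def compare_exports(lines_a, lines_b, name_a="file_a.txt", name_b="file_b.txt"):
    """Return unified-diff lines; an empty list means the inputs are identical."""
    return list(difflib.unified_diff(lines_a, lines_b,
                                     fromfile=name_a, tofile=name_b,
                                     lineterm=""))

# Invented rows standing in for two exported data files.
a = ["1,25,3", "2,31,1", "3,47,2"]
b = ["1,25,3", "2,31,4", "3,47,2"]

diff = compare_exports(a, b)
for line in diff:
    print(line)

if not diff:
    print("No differences found")
```

Lines prefixed with "-" appear only in the first file and lines prefixed with "+" only in the second, so a changed row shows up as one of each.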

I did a quick Google on "file comparison" and "freeware" and the first
one it came up with was this one:
http://www.prestosoft.com/ps.asp?page=edp_examdiff

I've not used it so I can't comment on it, but this is a way you could
start.

Some text editors have a file comparison utility built in, so if you
spend most of your program editing time in one of these, it is very
convenient. Some people use TextPad, which has a built-in file
comparison utility, but it is payware. A freeware editor with built-in
comparison is ConText:
http://context.cx/component/option,com_frontpage/Itemid,1/

I've not used either of these but their list of features looks useful
and extensive.

Hope this is of some help

Regards

Adrian

--
Adrian Barnett
Senior Project Officer Ph: +61 8 82266615
Research, Analysis and Evaluation Fax: +61 8 82267088
Strategic Planning and Research Branch
Policy and Intergovernment Relations
Department of Health

This e-mail may contain confidential information, which also may be
legally privileged. Only the intended recipient(s) may access, use,
distribute or copy this e-mail. If this e-mail is received in error,
please inform the sender by return e-mail and delete the original. If
there are doubts about the validity of this message, please contact the
sender by telephone. It is the recipient's responsibility to check the
e-mail and any attached files for viruses.
