difficulties with PLUM analysis - parameter estimates [Sec: UNOFFICIAL]


Gosse, Michelle

Greetings all,

I have been undertaking some multiple regressions in SPSS, starting with OLS and moving away from that method where its assumptions are not met.

My DV is an average of counts of response options from 4 survey questions, censored between 0 and 14, so this variable has an explicit order.

There is a series of IVs, mainly comprising dummy variables: for example, sex is coded 0 (male)/1 (female), and for income quintiles the bottom quintile is the omitted comparator, with 0/1 codes on the four quintile dummy variables included in the model. There are also 3 ratio-level IVs.
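
For illustration, the quintile dummies were created along these lines (a sketch only; the single source variable INCOMEQUINTILE, coded 1 to 5, is an assumed name):

* Quintile 1 is the omitted comparator, so it gets no dummy of its own.
RECODE INCOMEQUINTILE (2=1) (ELSE=0) INTO INCOMEQUINTILE2.
RECODE INCOMEQUINTILE (3=1) (ELSE=0) INTO INCOMEQUINTILE3.
RECODE INCOMEQUINTILE (4=1) (ELSE=0) INTO INCOMEQUINTILE4.
RECODE INCOMEQUINTILE (5=1) (ELSE=0) INTO INCOMEQUINTILE5.
EXECUTE.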

 

For the normal OLS regression, the dummy variables are handled correctly: SPSS seems to recognise the dummy variables as such automatically, and I get sensible parameter estimates.

In PLUM, on the other hand, SPSS gives output for both levels of every dummy variable (so I get a line for the "0" value and a line for the "1" value in the parameter estimates table), with all the "1" values footnoted "This parameter is set to zero as it is redundant". Basically, it looks like the model is over-specified. On a probably related note, I get a warning: "There are 3381 (75.0%) cells (i.e., dependent variable levels by combinations of predictor variable values) with zero frequencies." A subset of the parameter estimates is:

Parameter Estimates (subset)

Parameter                Estimate  Std. Error  Wald    df  Sig.  95% CI Lower  95% CI Upper
[INCOMEQUINTILE2=.00]    .384      .185        4.315   1   .038  .022          .745
[INCOMEQUINTILE2=1.00]   0         .           .       0   .     .             .
[INCOMEQUINTILE3=.00]    .423      .201        4.452   1   .035  .030          .817
[INCOMEQUINTILE3=1.00]   0         .           .       0   .     .             .
[INCOMEQUINTILE4=.00]    .873      .193        20.486  1   .000  .495          1.251
[INCOMEQUINTILE4=1.00]   0         .           .       0   .     .             .
[INCOMEQUINTILE5=.00]    .920      .202        20.701  1   .000  .524          1.316
[INCOMEQUINTILE5=1.00]   0         .           .       0   .     .             .

All the dummy variables are specified as factors in the PLUM command, and the three ratio variables are entered as covariates. I have looked at the PLUM information in the help menu and searched online, but cannot find any information on what I am doing wrong. I have also emailed SPSS support, but they don't seem to be familiar with PLUM, and a search of this listserv's archives did not turn up a similar issue.

I'm running PLUM because my DV is highly non-normal. I tried it with an unbinned and then a binned DV, and the problem remains (initially I thought the issue might have been too many DV categories).
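
For reference, the binning was along these lines (a sketch; the source variable name OVERALLBENEFIT and the bin boundaries are illustrative only):

* Collapse the 0-14 DV into four ordered bins; earlier specs win on ties.
RECODE OVERALLBENEFIT (0 THRU 3.5=1) (3.5 THRU 7=2) (7 THRU 10.5=3)
    (10.5 THRU 14=4) INTO BINNEDOVERALLBENEFIT.
EXECUTE.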

 

Could someone please advise me where I am going wrong on this? The syntax I have used is:

PLUM BINNEDOVERALLBENEFIT BY CLAIMPRESENT AGE18TO34 AGE35TO54 SEXFEMALE
    COUNTRYAUSTRALIA DEPENDENTS INCOMEQUINTILE2 INCOMEQUINTILE3
    INCOMEQUINTILE4 INCOMEQUINTILE5 NOINCOMEGIVEN EDUCATIONHIGH
    EDUCATIONGIVEN CONCERNSGENERAL CONCERNSSPECIFIC CONCERNSBOTH
    NUTRITCORMOD NUTRITCORHIGH NUTRITMOTMOD NUTRITMOTHIGH
  WITH MICRONUTKNOW MICRONUTFAML FRUITANDVEG
  /CRITERIA=CIN(95) DELTA(0) LCONVERGE(0) MXITER(100) MXSTEP(5)
      PCONVERGE(1.0E-6) SINGULAR(1.0E-8)
  /LINK=LOGIT
  /PRINT=FIT PARAMETER SUMMARY.

 

Cheers
Michelle

Michelle Gosse
Consumer and Social Sciences
Food Standards Australia New Zealand
108 The Terrace
Wellington
New Zealand
ph: 0064-4-978-5652
email: [hidden email]


Re: difficulties with PLUM analysis - parameter estimates [Sec: UNOFFICIAL]

Bruce Weaver
Administrator
Gosse, Michelle wrote
<snip quoted message>

Hi Michelle.  The key word BY precedes "factors", which are categorical variables.  So for your income quintiles, you would need a SINGLE variable with values from 1 to 5; and for your age groups, a single variable with values from 1 to 3.  But given that you've already computed indicator variables for everything, you could also just get rid of the BY, and put all your indicator variables (and continuous variables) after the key word WITH.  
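
In syntax, the second of those options would look something like this (a sketch using the variable names from the original post; untested):

* Drop BY: enter the 0/1 indicators and the continuous variables after WITH.
PLUM BINNEDOVERALLBENEFIT WITH CLAIMPRESENT AGE18TO34 AGE35TO54 SEXFEMALE
    COUNTRYAUSTRALIA DEPENDENTS INCOMEQUINTILE2 INCOMEQUINTILE3
    INCOMEQUINTILE4 INCOMEQUINTILE5 NOINCOMEGIVEN EDUCATIONHIGH
    EDUCATIONGIVEN CONCERNSGENERAL CONCERNSSPECIFIC CONCERNSBOTH
    NUTRITCORMOD NUTRITCORHIGH NUTRITMOTMOD NUTRITMOTHIGH
    MICRONUTKNOW MICRONUTFAML FRUITANDVEG
  /LINK=LOGIT
  /PRINT=FIT PARAMETER SUMMARY.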

BUT...it looks to me like you have a much more serious problem here.  I.e., you are fitting 24 parameters (including the constant), and so you would need an ENORMOUS sample size, with enough cases falling into each of the outcome bins to avoid SERIOUS over-fitting of the model.  For ordinary binary logistic regression, for example, one needs in the order of 15-20 'events' per model parameter to avoid over-fitting (where 'event' = the less frequent outcome category).  I've not come across a similar guideline geared specifically to ordinal logistic regression; but I suspect you have far more variables in the model than your data can support.
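
(By that guideline, a 24-parameter model would call for somewhere in the region of 24 x 15 = 360 to 24 x 20 = 480 events.)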

For a nice readable discussion of over-fitting in regression models, see Mike Babyak's article.

   http://www.class.uidaho.edu/psy586/Course%20Readings/Babyak_04.pdf

HTH.
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

Re: difficulties with PLUM analysis - parameter estimates [Sec: UNOFFICIAL]

Gosse, Michelle
Hi Bruce,

Thanks for the quick and clear response; this had me tearing my hair out for half of yesterday.

It is very handy to know that I don't have to spend another hour or so recoding (and then checking that I did the recoding correctly); that is fantastic news about rejigging the model line in PLUM.

My sample size is 1127. I had worked on a rule of thumb of 10 observations per IV, which sounds a bit different from your recommendation. I created the bins so that there are in the order of 50+ observations per dummy category. I'll look into the events-per-parameter versus observations-per-IV approaches for deciding on the maximum number of IVs; I hadn't heard of the former until you mentioned it, so thanks for that information. It sounds very useful for working out the required sample size in the first place.

The variables are in the model because previous research tells us that these factors should be important for the work I am doing, so the model specification is theory-based.

I've come back to SPSS after numerous years of using SAS, and SPSS code is very different.

Thanks again for your help. :)

Cheers
Michelle

<snip quoted exchange>




