Dear All,
I would be grateful if you could help me with the following; I am getting desperate, as I have to present my data on Wednesday. I checked the linearity assumption for my only continuous variable and it is violated, so I applied the natural logarithm transformation. How can I check whether the assumption holds now? Thank you in advance, Dimitrios
One straightforward way to get an idea about the functional relationship between a continuous explanatory variable and the log-odds of an "event" (with "event" being defined as Outcome variable = 1) is as follows:
1. For exploratory purposes only, recode the continuous variable into some number of categories (e.g., quintiles; one way to do this is sketched after this post).
2. Estimate a model with the categorical variable in place of the continuous variable, and save the predicted probabilities.
3. Convert the predicted probabilities to predicted log-odds.
4. Make a scatterplot with X = the original continuous variable and Y = predicted log-odds.

Here's an example from something I helped a colleague with a while ago.

* Model 1: Exploratory with categorical Age variable.
LOGISTIC REGRESSION VARIABLES Admission_status2
  /METHOD=ENTER AgeGroup Sex ED_only locum
  /CONTRAST (AgeGroup)=Indicator(1)
  /PRINT=CI(95)
  /SAVE pred(PP1)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

COMPUTE LogOdds1 = ln(PP1 / (1 - PP1)).
VARIABLE LABELS LogOdds1 "Log-odds of outcome (Model 1)".
DESCRIPTIVES PP1 LogOdds1.
GRAPH /SCATTERPLOT(BIVAR)=AgeGroup WITH LogOdds1.

* That scatterplot shows a clear quadratic (U-shaped) relationship.
* Therefore, when we use Age as a continuous variable in Model 2,
* we'll want to include Age-squared as well.

* Model 2: Treat Age as a continuous variable, and include Age-squared.
COMPUTE AgeSq = Age**2.
LOGISTIC REGRESSION VARIABLES Admission_status2
  /METHOD=ENTER Age AgeSq Sex ED_only locum
  /PRINT=CI(95)
  /SAVE pred(PP2)
  /CRITERIA=PIN(.05) POUT(.10) ITERATE(20) CUT(.5).

COMPUTE LogOdds2 = ln(PP2 / (1 - PP2)).
VARIABLE LABELS LogOdds2 "Log-odds of outcome (Model 2)".

HTH.
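A minimal sketch of the step-1 recoding mentioned above (RANK with NTILES is one standard option; the five-group choice and the tie handling are assumptions, not part of the original example):

* Hypothetical step-1 syntax: carve Age into quintiles for the exploratory model.
RANK VARIABLES=Age (A)
  /NTILES(5) INTO AgeGroup
  /PRINT=NO
  /TIES=MEAN.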
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/
"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).
Thank you for your reply.
Is it acceptable to transform a continuous variable into a categorical one for the logistic regression, since my variable is not linear, or is it advisable to go through the transformation? Thank you in advance, Dimitrios
I don't entirely understand your question, but will offer these comments.
1. Generally speaking, it is preferable to treat continuous variables as continuous. E.g., see the Streiner article "Breaking up is hard to do" (link given below). But if the functional relationship between that continuous variable and the outcome is not linear, you'll have to take that into account somehow (e.g., by including higher order polynomial terms, or regression splines, etc.).

http://isites.harvard.edu/fs/docs/icb.topic477909.files/dichotomizing_continuous.pdf

2. In the example I gave earlier in the thread, I carved age into categories for a *preliminary*, *exploratory* analysis that was carried out to provide information about the shape of the functional relationship between age and the log-odds of the 1-0 outcome variable being = 1. A plot of the fitted log-odds as a function of age showed a clear U-shaped functional relationship. Therefore, when I reverted to treating age as a continuous variable (in my final model), I knew I had to include both Age and Age-squared as explanatory variables. Including Age-squared allowed the functional relationship to be U-shaped.

I hope this clarifies things somewhat.
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/
"When all else fails, RTFM."
At 02:34 PM 4/16/2014, Bruce Weaver wrote:
>when I reverted to treating age as a continuous variable (in my
>final model), I knew I had to include both Age and Age-squared as
>explanatory variables. Including Age-squared allowed the functional
>relationship to be U-shaped.

Bruce is far more the methodologist than I, but it's worth adding that, for variables (like age) with strictly positive values, the linear and squared terms tend to be highly correlated, leading to the usual difficulties when estimating with correlated independent variables.

One can mean-center the age before estimating, to avoid this. Or, it works pretty well to choose an age near the middle of the range you have, and use the square of the difference from that age. (It's fine to use the plain age, rather than mean-centered, as the linear term.)
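In SPSS terms, either variant Richard describes is just a pair of COMPUTE statements (a sketch; the centering value 45 is a made-up mid-range age, not one taken from the thread):

* Option 1: mean-center Age before squaring (substitute the sample mean for 45).
COMPUTE AgeC = Age - 45.
COMPUTE AgeCSq = AgeC**2.
* Option 2: keep plain Age as the linear term, and square the deviation
  from a convenient age near the middle of the range.
COMPUTE AgeDevSq = (Age - 45)**2.
* Then enter Age and AgeDevSq (or AgeC and AgeCSq) as the predictors.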
Hi Richard. Just a quick off-the-cuff response here, because it's time to get off home for the Easter weekend.
I would argue that the collinearity of X and X-squared is "illusory", meaning that it is completely non-problematic. (I know there is a published article somewhere making this argument, but I can't lay my hands on it right now.) Here's one reason for thinking that: If you run the model with and without centering, and save the fitted values of Y (or the predicted probabilities, in the case of logistic regression), those fitted values (or predicted probabilities) will be identical. And the R-squared (for OLS models) or -2LL values (for models fit via MLE) will be identical too. So it's the same model, regardless of whether you center or not.

Having said that, I often do center the variables. But I do so simply to make (some of) the coefficients more interpretable. And rather than center on the mean, I often center on a convenient value near the minimum. Part of the reason I do that is to emphasize the point that it is nowhere written in stone that thou shalt center on the mean! (Even if one does want to mean-center, it is better practice, I think, to center on a value near the mean, and to center on the same value each time if one is conducting multiple studies. After all, the sample means will not all be the same; so centering on the same value each time makes the results more comparable across studies.)

Cheers!
Bruce
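Bruce's identical-fits claim is easy to verify empirically. A sketch of that check, reusing the variable names from his earlier example (the centering value 45 and the names PP_raw and PP_ctr are hypothetical):

* Fit the quadratic model twice - uncentered and centered - saving predictions.
COMPUTE AgeSq = Age**2.
COMPUTE AgeC = Age - 45.
COMPUTE AgeCSq = AgeC**2.
LOGISTIC REGRESSION VARIABLES Admission_status2
  /METHOD=ENTER Age AgeSq Sex ED_only locum
  /SAVE pred(PP_raw).
LOGISTIC REGRESSION VARIABLES Admission_status2
  /METHOD=ENTER AgeC AgeCSq Sex ED_only locum
  /SAVE pred(PP_ctr).
* The predictions (and the -2LL values in the output) should agree to numerical precision.
COMPUTE PP_diff = PP_raw - PP_ctr.
DESCRIPTIVES PP_diff /STATISTICS=MINIMUM MAXIMUM.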
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/
"When all else fails, RTFM."
http://m.orm.sagepub.com/content/15/3/339.abstract
> On Apr 17, 2014, at 7:01 PM, Bruce Weaver <[hidden email]> wrote:
> [snip, previous]
That's the one I was thinking of. Thanks Ryan.
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/
"When all else fails, RTFM."
In reply to this post by Ryan
I haven't read it yet, but I have what appears to be a pretty similar article in my to-read list:
Shieh, G. (2011). Clarifying the role of mean centring in multicollinearity of interaction effects. British Journal of Mathematical and Statistical Psychology, 64(3), 462-477. (No pre-print PDF I'm afraid; DOI here: http://dx.doi.org/10.1111/j.2044-8317.2010.02002.x)

I would note - if the variable has a mean far away from zero, you can have numerical instability in inverting the design matrix for squared or higher polynomial terms. E.g., in this post for illustration (http://andrewpwheeler.wordpress.com/2013/04/03/some-notes-on-single-line-charts-in-spss/) I had polynomial terms of years starting in 1985. If I remember correctly, SPSS would drop the squared year term when I estimated a linear regression equation - let alone the regression with both the squared and cubed terms.

Also FYI, I wrote a macro to estimate restricted cubic spline bases (http://andrewpwheeler.wordpress.com/2013/06/06/restricted-cubic-splines-in-spss/), a popular alternative to polynomial terms. I guess I will do the next blog post on how you can use them in logistic regression, as I got a comment asking about that as well.
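A minimal sketch of the origin-shifting fix for the situation Andy describes (Year, Outcome, and the 1985 origin are placeholders; any origin inside the observed range will serve):

* Shift the origin before building polynomial terms, so the squared and
  cubed columns do not become huge and nearly collinear.
COMPUTE YearC = Year - 1985.
COMPUTE YearC2 = YearC**2.
COMPUTE YearC3 = YearC**3.
REGRESSION
  /DEPENDENT Outcome
  /METHOD=ENTER YearC YearC2 YearC3.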
In reply to this post by Bruce Weaver
Amen, Bruce. I see this misconception repeated all the time on this list and elsewhere. No matter how many times I assert that computationally this makes no difference, it doesn't seem to get through, even though the results are exactly equivalent up to a very high level of numerical precision. Maybe people will believe it when you say it.
Jon Peck (no "h") aka Kim Senior Software Engineer, IBM [hidden email] phone: 720-342-5621 From: Bruce Weaver <[hidden email]> To: [hidden email], Date: 04/17/2014 05:02 PM Subject: Re: [SPSSX-L] logistic regression assumption Sent by: "SPSSX(r) Discussion" <[hidden email]> Hi Richard. Just a quick off the cuff response here, because it's time to get off home for the Easter weekend. I would argue that the collinearity of X and X-squared is "illusory", meaning that it is completely non-problematic. (I know there is an published article somewhere making this argument, but I can't lay my hands on it right now.) Here's one reason for thinking that: If you run the model with and without centering, and save the fitted values of Y (or the predicted probabilities, in the case of logistic regression), those fitted values (or predicted probabilities) will be identical. And the R-squared (for OLS models) or -2LL values (for models fit via MLE) will be identical too. So it's the same model, regardless of whether you center or not. Having said that, I often do center the variables. But I do so simply to make (some of) the coefficients more interpretable. And rather than center on the mean, I often center on a convenient value near the minimum. Part of the reason I do that is to emphasize the point that it is nowhere written in stone that thou shalt center on the mean! (Even if one does want to mean-center, it is better practice, I think, to center on a value near the mean, and to center on the same value each time if one is conducting multiple studies. After all, the sample means will not all be the same; so centering on the same value each time makes the results more comparable across studies.) Cheers! Bruce Richard Ristow wrote > At 02:34 PM 4/16/2014, Bruce Weaver wrote: > >>when I reverted to treating age as a continuous variable (in my >>final model), I knew I had to include both Age and Age-squared as >>explanatory variables. Including Age-squared allowed the functional >>relationship to be U-shaped. > > Bruce is far more the methodologist than I, but it's worth adding > that, for variables (like age) with strictly positive values, the > linear and squared terms tend to be highly correlated, leading to the > usual difficulties when estimating using correlated independent variables. > > One can mean-center the age before estimating, to avoid this. Or, it > works pretty well to choose an age near the middle of the range you > have, and use the square of the difference from that age. (It's fine > to use the plain age, rather than mean-centered, as the linear term.) > > ===================== > To manage your subscription to SPSSX-L, send a message to > LISTSERV@.UGA > (not to SPSSX-L), with no body text except the > command. To leave the list, send the command > SIGNOFF SPSSX-L > For a list of commands to manage subscriptions, send the command > INFO REFCARD ----- -- Bruce Weaver [hidden email] http://sites.google.com/a/lakeheadu.ca/bweaver/ "When all else fails, RTFM." NOTE: My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above. -- View this message in context: http://spssx-discussion.1045642.n5.nabble.com/logistic-regression-assumption-tp5725433p5725508.html Sent from the SPSSX Discussion mailing list archive at Nabble.com. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. 
To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
In reply to this post by Bruce Weaver
Hi Bruce, I just posted the link to that article without comment before because I was preoccupied, but now that I have a moment I'd like to chime in here. First and foremost, I agree with you entirely. I have not encountered a situation in which centering a variable resulted in any change in the actual model being fit. I have, on occasion, encountered challenges achieving convergence when fitting random-effects models via Bayesian estimation in WinBUGS and SAS without mean-centering [due to high autocorrelation - an issue with Bayesian estimation I care not to delve into at the moment].
Knowing that (1) generally, regression models are not changed by mean-centering variables, and (2) I can utilize the coefficient matrix L to obtain parameter estimates/contrasts at whatever values of the variables I desire via sub-commands of various procedures (e.g., LMATRIX in GLM, TEST in MIXED), I virtually never mean-center before fitting models.
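For instance, a sketch of the GLM-style L-matrix approach Ryan mentions (y, x, and x_squared are placeholder names; the evaluation point x = 2 is arbitrary). Since the slope of b0 + b1*x + b2*x^2 at x = c is b1 + 2c*b2, the L coefficients are 1 for x and 2c for x_squared:

* Estimate the slope at x = 2 directly, without refitting a centered model.
UNIANOVA y WITH x x_squared
  /PRINT=PARAMETER
  /LMATRIX='slope at x=2' x 1 x_squared 4
  /DESIGN=x x_squared.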
Best,
Ryan

On Thu, Apr 17, 2014 at 8:41 PM, Bruce Weaver <[hidden email]> wrote:
[snip, previous]
In reply to this post by Jon K Peck
Naturally, I have to agree with the mathematics. If you want to say that the difference is "illusory", that's okay, too, for certain values of the word "illusory".

I have to say, here: please keep in mind that "illusions" can serve a useful function. Twenty frames per second of fixed images showing moving figures gives the human viewer the illusion of motion. That makes possible flip-books and movies.

Have you ever had to show your results to someone else? I assure you, it is easier to discuss two regression coefficients - their sizes and tests - when they are not highly correlated. I try to avoid modeling with such terms, period. For two highly correlated variables among the IVs, I suggest to consultees that they be modeled by some (relatively uncorrelated) composites for the sum and difference, or the sum and difference of the logarithms. Putting in two highly correlated terms is something we should do only when it is unavoidable - that is, when we *want* to puzzle over their confounding after the fact. What you can say about the correlated ones most often comes down to, "Ignore these numbers; take my word that it means what I say." My own consultees have been happier with the illusion presented by values and tests for separate terms. And it *does* tell them about the relative impact of the terms, fairly concisely and precisely.

But I learned to center for the other purpose that was mentioned: the *occasional* failure of a program to get an answer because of a near-collinearity error - convergence, or otherwise. That purpose is not illusory. It seems like sloppy practice to wait for the error to happen when it can be prevented.

--
Rich Ulrich

Date: Thu, 17 Apr 2014 19:41:54 -0600
From: [hidden email]
Subject: Re: logistic regression assumption
[snip, previous]
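A sketch of the composite recoding Rich suggests for two highly correlated predictors (X and Z are placeholder names):

* Replace a highly correlated pair with relatively uncorrelated composites.
COMPUTE SumXZ = X + Z.
COMPUTE DifXZ = X - Z.
* Or, for positive, skewed variables, take sums and differences of logs.
COMPUTE SumLogXZ = LN(X) + LN(Z).
COMPUTE DifLogXZ = LN(X) - LN(Z).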
In reply to this post by Bruce Weaver
Might I also add that one could perform a Likelihood Ratio Test (LRT) to test whether including the AgeSq term significantly improves model fit in Bruce's example. Although untested, I'm fairly certain the following adjustment to Bruce's syntax will provide the LRT in the Omnibus Tests of Model Coefficients Table:
LOGISTIC REGRESSION VARIABLES Admission_status2
  /METHOD=ENTER Age Sex ED_only locum
  /METHOD=ENTER Age AgeSq Sex ED_only locum.

Best,
Ryan

On Mon, Apr 14, 2014 at 4:17 PM, Bruce Weaver <[hidden email]> wrote:
[snip, previous]
In reply to this post by Rich Ulrich
While I agree that mean-centered variables are easier to interpret - please add a chart if you want to substantively talk about them! I can do the derivatives in my head, although I suspect much of any audience won't go to that trouble. I also do not have a good mental model of the steepness of the parabola from just the estimated parameters, nor of how large or small the estimates get toward the reasonable values of the explanatory variable in question. (This is important, as polynomial terms often behave badly in the tails - one of the reasons to use restricted cubic splines.) My mental model of these things gets worse if you include a cubed term.

So please, graph your effect estimates! All the things of interest (inflection point, how fast the curve rises or falls, how extreme the tails are) are immediately visible in a graph. You can also add confidence intervals or prediction intervals to the graph. This advice extends to any set of functionally related explanatory variables.
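One minimal way to make such a graph in SPSS (a sketch with made-up coefficients; substitute the estimates from your own fitted model, and run it on a fresh dataset since INPUT PROGRAM builds its own cases):

* Build a grid of ages, compute the fitted curve, and plot it.
* The coefficients -4, 0.12, and -0.0015 below are hypothetical.
INPUT PROGRAM.
LOOP Age = 18 TO 80.
COMPUTE LogOdds = -4 + 0.12*Age - 0.0015*Age**2.
COMPUTE PredProb = 1 / (1 + EXP(-LogOdds)).
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
EXECUTE.
GRAPH /SCATTERPLOT(BIVAR)=Age WITH PredProb.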
In reply to this post by Andy W
Just curious. It seems that some people post directly and only to Nabble, and some post the same to this list. When I was looking at the logistic regression discussion this morning, I noticed that one of the posts "had not been accepted by the list", which I think Bruce, David or Andy have noted before. What is the functional relationship between Nabble and this list? And is that relationship bidirectional or unidirectional only? Then, why the delay?
Gene Maguin

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Andy W
Sent: Thursday, April 17, 2014 9:05 PM
To: [hidden email]
Subject: Re: logistic regression assumption
[snip, previous]
In reply to this post by Bruce Weaver
At 07:01 PM 4/17/2014, Bruce Weaver wrote:
>I would argue that the collinearity of X and X-squared is
>"illusory", meaning that it is completely non-problematic. (I know
>there is a published article somewhere making this argument, but I
>can't lay my hands on it right now.) Here's one reason for thinking
>that: If you run the model with and without centering, and save the
>fitted values of Y (or the predicted probabilities, in the case of
>logistic regression), those fitted values (or predicted
>probabilities) will be identical.

Whatever the collinearity is, it isn't illusory; it's there, and readily calculable and displayable in the usual fashions. What you, and others, are arguing is that re-parameterizing the model as I've suggested doesn't change the subspace of possible models (defining a 'model' as a set of predicted values), which is correct; that, therefore, it doesn't change the best-fitting model, which is also correct; and that, therefore, it doesn't matter, which I disagree with.

The two reasons I advocate re-parameterizing are, first, that it makes the resulting coefficients much more interpretable, as others have noted - the linear term becomes the predicted DV change per unit IV change in a central part of the range; and second, that keeping the original, near-collinear parameterization greatly inflates the standard errors and confidence intervals of the estimated coefficients. Among other things, that makes using t- or F-tests for whether non-linear terms belong in the model very insensitive. (It may be argued that using ANY test to exclude terms from a model results in overstating the F-based significance of the model; but that argument applies equally to choosing whether to include higher-order terms on the basis of a graph.)

It's been noted that collinear predictors also make the estimation more difficult, numerically, though with modern hardware and software that's a lesser issue.
Good morning Richard. :-)
For the record, I want to clarify that I did not intend to advocate NOT centering variables. As I said...

"... I often do center the variables. But I do so simply to make (some of) the coefficients more interpretable. And rather than center on the mean, I often center on a convenient value near the minimum. Part of the reason I do that is to emphasize the point that it is nowhere written in stone that thou shalt center on the mean! (Even if one does want to mean-center, it is better practice, I think, to center on a value near the mean, and to center on the same value each time if one is conducting multiple studies. After all, the sample means will not all be the same; so centering on the same value each time makes the results more comparable across studies.)"

Upon reflection, one change I would make in that (off-the-cuff) paragraph is to change "simply" to "mainly" in the second sentence, i.e., "I do so MAINLY to make (some of) the coefficients more interpretable". The main point I was *trying* to make is that I disagree with those authors who say that one MUST (mean) center their variables when the model includes product terms or higher order polynomial terms (which are really product terms too - X-sq = X*X, for example). But having read some of the other posts in the thread, I will concede that even with modern computing power and software, one may sometimes run into computational difficulties that can be alleviated by centering on some reasonable, in-the-observed-range value (not necessarily the mean).

By the way, I also strongly agree with Andy W on the importance of plotting fitted values for models that include product terms. Looking at such plots is FAR more illuminating than looking at tables of coefficients. (Even if one does wish to interpret the coefficients, it is much easier to do so having looked at plots of fitted values, in my experience.)

Cheers!
Bruce
--
Bruce Weaver bweaver@lakeheadu.ca http://sites.google.com/a/lakeheadu.ca/bweaver/ "When all else fails, RTFM." PLEASE NOTE THE FOLLOWING: 1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above. 2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/). |
Bruce and others,

Suppose the population regression model is:

Y = 0.5 + 1.5*x + 2.0*(x^2) + Epsilon

Further, suppose we randomly select 10,000 subjects, and collect data on both y and x for each subject. Below my name is a simulation experiment which shows that we can obtain estimated parameters (intercept and main effect), standard errors, t-statistics, and p-values from a model employed on non-centered data that are *identical* to those from a model employed on centered data. The TEST statements of the MIXED procedure provide proof of what I claim, at least for this simulation example.
To construct those TEST statements, all I needed to do was to recognize the relationship between the non-centered and centered equations. With the exception of numerical instability due to various factors (which I very rarely encounter, and certainly did not encounter in this simulation experiment), I continue to assert that there is no need to mean-center the model provided above with respect to accurately estimating model fit, parameters, standard errors/confidence intervals, test statistics, p-values, etc.
Ryan

--

*Generate Data.
set seed 1234.
new file.
input program.
loop ID = 1 to 10000.
compute x = rv.normal(2,1).
compute y = 0.5 + 1.5*x + 2.0*(x**2) + rv.normal(0,1).
end case.
end loop.
end file.
end input program.
execute.

COMPUTE x_squared = x*x.
EXECUTE.

*OLS Regression without mean centering.
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT y
  /METHOD=ENTER x x_squared.

COMPUTE x_mean_centered = x - 1.9797462214653716.
COMPUTE x_mean_centered_sqrd = x_mean_centered**2.

*OLS Regression with mean centering.
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT y
  /METHOD=ENTER x_mean_centered x_mean_centered_sqrd.

*REML Regression without mean centering.
*Note: Used TEST subcommands to recover the intercept and main effect
 tests from the OLS Regression with mean centering.
MIXED y WITH x x_squared
  /FIXED=x x_squared | SSTYPE(3)
  /PRINT=SOLUTION
  /METHOD=REML
  /TEST 'intercept @ x=0' intercept 1 x 0 x_squared 0
  /TEST 'main eff @ x=0' intercept 0 x 1 x_squared 0
  /TEST 'intercept @ x=mean' intercept 1 x 1.9797462214653716 x_squared 3.919395101406414
  /TEST 'main eff @ x=mean' intercept 0 x 1 x_squared 3.959492442930742.
On Fri, Apr 18, 2014 at 10:13 AM, Bruce Weaver <[hidden email]> wrote:
[snip, previous]
Thank you all for your input.
I am rather naive in stats, so I would like to clarify this: When I use age as a continuous variable, its relationship with the outcome is not linear, and therefore I cannot use it as-is. If I use age as ordinal (18-40, 41-60, 61-80), I guess I do not need to worry about linearity. Results come back similar, and from a practical point of view it does not change a lot. I may miss the information that a continuous variable offers (e.g., HR per year), but I still get valuable information about the impact of age. Is this considered acceptable?

I am grateful to you for all your input, but I am a little concerned about using advanced stats (at least for me), since I may make a significant mistake without even realizing it.

Thank you in advance,
I'll repeat something I noted earlier in the thread, and expand on it.
Here's the repeated bit:

1. Generally speaking, it is preferable to treat continuous variables as continuous. E.g., see the Streiner article "Breaking up is hard to do" (link given below). But if the functional relationship between that continuous variable and the outcome is not linear, you'll have to take that into account somehow (e.g., by including higher order polynomial terms, or regression splines, etc.).

http://isites.harvard.edu/fs/docs/icb.topic477909.files/dichotomizing_continuous.pdf

And here is the expansion. With the age groups you list below:

1. Everyone within an age group will have exactly the same fitted value, despite differing in age by up to about 20 years for those at the extremes.
2. Two people just on either side of the age-group cut-points can have very different fitted values, despite tiny differences in age.
3. The age-group cut-points are probably arbitrary, and the fitted values for individuals near the cut-points will likely change fairly substantially if you change the cut-points.

These are some of the reasons why it is usually preferable (if at all possible) to model continuous variables (like Age) as continuous. HTH.
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/
"When all else fails, RTFM."