Dear All,
I would be grateful if you could help me with the following; I am getting desperate, as I have to present my data on Wednesday. I checked the linearity assumption for my only continuous variable and it is violated, so I applied the natural logarithm transformation. How can I check whether the assumption holds now? Thank you in advance, Dimitrios
One straightforward way to get an idea about the functional relationship between a continuous explanatory variable and the log-odds of an "event" (with "event" being defined as Outcome variable = 1) is as follows:
1. For exploratory purposes only, recode the continuous variable into some number of categories (e.g., quintiles; one way to do this is sketched after this post).
2. Estimate a model with the categorical variable in place of the continuous variable, and save the predicted probabilities.
3. Convert the predicted probabilities to predicted log-odds.
4. Make a scatterplot with X = the original continuous variable and Y = predicted log-odds.

Here's an example from something I helped a colleague with a while ago.

* Model 1: Exploratory with categorical Age variable.
LOGISTIC REGRESSION VARIABLES Admission_status2
  /METHOD=ENTER AgeGroup Sex ED_only locum
  /CONTRAST (AgeGroup)=Indicator(1)
  /PRINT=CI(95)
  /SAVE pred(PP1)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

COMPUTE LogOdds1 = ln(PP1 / (1 - PP1)).
VARIABLE LABELS LogOdds1 "Log-odds of outcome (Model 1)".
DESCRIPTIVES PP1 LogOdds1.
GRAPH /SCATTERPLOT(BIVAR)=AgeGroup WITH LogOdds1.

* That scatterplot shows a clear quadratic (U-shaped) relationship.
* Therefore, when we use Age as a continuous variable in Model 2,
* we'll want to include Age-squared as well.

* Model 2: Treat Age as a continuous variable, and include Age-squared.
COMPUTE AgeSq = Age**2.
LOGISTIC REGRESSION VARIABLES Admission_status2
  /METHOD=ENTER Age AgeSq Sex ED_only locum
  /PRINT=CI(95)
  /SAVE pred(PP2)
  /CRITERIA=PIN(.05) POUT(.10) ITERATE(20) CUT(.5).

COMPUTE LogOdds2 = ln(PP2 / (1 - PP2)).
VARIABLE LABELS LogOdds2 "Log-odds of outcome (Model 2)".

HTH.
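A minimal sketch of the step-1 recoding mentioned above (RANK with NTILES is one standard option; the five-group choice and the tie handling are assumptions, not part of the original example):

* Hypothetical step-1 syntax: carve Age into quintiles for the exploratory model.
RANK VARIABLES=Age (A)
  /NTILES(5) INTO AgeGroup
  /PRINT=NO
  /TIES=MEAN.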
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/
"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).
Thank you for your reply.
Is it acceptable to transform a continuous variable into a categorical one for the logistic regression, since my variable is not linear, or is it advisable to go through the transformation? Thank you in advance, Dimitrios
I don't entirely understand your question, but will offer these comments.
1. Generally speaking, it is preferable to treat continuous variables as continuous. E.g., see the Streiner article "Breaking up is hard to do" (link given below). But if the functional relationship between that continuous variable and the outcome is not linear, you'll have to take that into account somehow (e.g., by including higher order polynomial terms, or regression splines, etc.).

http://isites.harvard.edu/fs/docs/icb.topic477909.files/dichotomizing_continuous.pdf

2. In the example I gave earlier in the thread, I carved age into categories for a *preliminary*, *exploratory* analysis that was carried out to provide information about the shape of the functional relationship between age and the log-odds of the 1-0 outcome variable being = 1. A plot of the fitted log-odds as a function of age showed a clear U-shaped functional relationship. Therefore, when I reverted to treating age as a continuous variable (in my final model), I knew I had to include both Age and Age-squared as explanatory variables. Including Age-squared allowed the functional relationship to be U-shaped.

I hope this clarifies things somewhat.
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/
"When all else fails, RTFM."
At 02:34 PM 4/16/2014, Bruce Weaver wrote:
>when I reverted to treating age as a continuous variable (in my
>final model), I knew I had to include both Age and Age-squared as
>explanatory variables. Including Age-squared allowed the functional
>relationship to be U-shaped.

Bruce is far more the methodologist than I, but it's worth adding that, for variables (like age) with strictly positive values, the linear and squared terms tend to be highly correlated, leading to the usual difficulties when estimating with correlated independent variables.

One can mean-center the age before estimating, to avoid this. Or, it works pretty well to choose an age near the middle of the range you have, and use the square of the difference from that age. (It's fine to use the plain age, rather than mean-centered, as the linear term.)
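In SPSS terms, either variant Richard describes is just a pair of COMPUTE statements (a sketch; the centering value 45 is a made-up mid-range age, not one taken from the thread):

* Option 1: mean-center Age before squaring (substitute the sample mean for 45).
COMPUTE AgeC = Age - 45.
COMPUTE AgeCSq = AgeC**2.
* Option 2: keep plain Age as the linear term, and square the deviation
  from a convenient age near the middle of the range.
COMPUTE AgeDevSq = (Age - 45)**2.
* Then enter Age and AgeDevSq (or AgeC and AgeCSq) as the predictors.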
Hi Richard. Just a quick off-the-cuff response here, because it's time to get off home for the Easter weekend.
I would argue that the collinearity of X and X-squared is "illusory", meaning that it is completely non-problematic. (I know there is a published article somewhere making this argument, but I can't lay my hands on it right now.) Here's one reason for thinking that: If you run the model with and without centering, and save the fitted values of Y (or the predicted probabilities, in the case of logistic regression), those fitted values (or predicted probabilities) will be identical. And the R-squared (for OLS models) or -2LL values (for models fit via MLE) will be identical too. So it's the same model, regardless of whether you center or not.

Having said that, I often do center the variables. But I do so simply to make (some of) the coefficients more interpretable. And rather than center on the mean, I often center on a convenient value near the minimum. Part of the reason I do that is to emphasize the point that it is nowhere written in stone that thou shalt center on the mean! (Even if one does want to mean-center, it is better practice, I think, to center on a value near the mean, and to center on the same value each time if one is conducting multiple studies. After all, the sample means will not all be the same; so centering on the same value each time makes the results more comparable across studies.)

Cheers!
Bruce
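Bruce's identical-fits claim is easy to verify empirically. A sketch of that check, reusing the variable names from his earlier example (the centering value 45 and the names PP_raw and PP_ctr are hypothetical):

* Fit the quadratic model twice - uncentered and centered - saving predictions.
COMPUTE AgeSq = Age**2.
COMPUTE AgeC = Age - 45.
COMPUTE AgeCSq = AgeC**2.
LOGISTIC REGRESSION VARIABLES Admission_status2
  /METHOD=ENTER Age AgeSq Sex ED_only locum
  /SAVE pred(PP_raw).
LOGISTIC REGRESSION VARIABLES Admission_status2
  /METHOD=ENTER AgeC AgeCSq Sex ED_only locum
  /SAVE pred(PP_ctr).
* The predictions (and the -2LL values in the output) should agree to numerical precision.
COMPUTE PP_diff = PP_raw - PP_ctr.
DESCRIPTIVES PP_diff /STATISTICS=MINIMUM MAXIMUM.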
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/
"When all else fails, RTFM."
http://m.orm.sagepub.com/content/15/3/339.abstract
> On Apr 17, 2014, at 7:01 PM, Bruce Weaver <[hidden email]> wrote:
> [snip, previous]
That's the one I was thinking of. Thanks Ryan.
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/
"When all else fails, RTFM."
In reply to this post by Ryan
I haven't read it yet, but I have what appears to be a pretty similar article in my to-read list:
Shieh, G. (2011). Clarifying the role of mean centring in multicollinearity of interaction effects. British Journal of Mathematical and Statistical Psychology, 64(3), 462-477. (No pre-print PDF I'm afraid; DOI here: http://dx.doi.org/10.1111/j.2044-8317.2010.02002.x)

I would note - if the variable has a mean far away from zero, you can have numerical instability in inverting the design matrix for squared or higher polynomial terms. E.g., in this post for illustration (http://andrewpwheeler.wordpress.com/2013/04/03/some-notes-on-single-line-charts-in-spss/) I had polynomial terms of years starting in 1985. If I remember correctly, SPSS would drop the squared year term when I estimated a linear regression equation - let alone the regression with both the squared and cubed terms.

Also FYI, I wrote a macro to estimate restricted cubic spline bases (http://andrewpwheeler.wordpress.com/2013/06/06/restricted-cubic-splines-in-spss/), a popular alternative to polynomial terms. I guess I will do the next blog post on how you can use them in logistic regression, as I got a comment asking about that as well.
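A minimal sketch of the origin-shifting fix for the situation Andy describes (Year, Outcome, and the 1985 origin are placeholders; any origin inside the observed range will serve):

* Shift the origin before building polynomial terms, so the squared and
  cubed columns do not become huge and nearly collinear.
COMPUTE YearC = Year - 1985.
COMPUTE YearC2 = YearC**2.
COMPUTE YearC3 = YearC**3.
REGRESSION
  /DEPENDENT Outcome
  /METHOD=ENTER YearC YearC2 YearC3.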
In reply to this post by Bruce Weaver
Amen, Bruce. I see this misconception repeated all the time on this list and elsewhere. No matter how many times I assert that computationally this makes no difference, it doesn't seem to get through, even though the results are exactly equivalent up to a very high level of numerical precision. Maybe people will believe it when you say it.
Jon Peck (no "h") aka Kim Senior Software Engineer, IBM [hidden email] phone: 720-342-5621 From: Bruce Weaver <[hidden email]> To: [hidden email], Date: 04/17/2014 05:02 PM Subject: Re: [SPSSX-L] logistic regression assumption Sent by: "SPSSX(r) Discussion" <[hidden email]> Hi Richard. Just a quick off the cuff response here, because it's time to get off home for the Easter weekend. I would argue that the collinearity of X and X-squared is "illusory", meaning that it is completely non-problematic. (I know there is an published article somewhere making this argument, but I can't lay my hands on it right now.) Here's one reason for thinking that: If you run the model with and without centering, and save the fitted values of Y (or the predicted probabilities, in the case of logistic regression), those fitted values (or predicted probabilities) will be identical. And the R-squared (for OLS models) or -2LL values (for models fit via MLE) will be identical too. So it's the same model, regardless of whether you center or not. Having said that, I often do center the variables. But I do so simply to make (some of) the coefficients more interpretable. And rather than center on the mean, I often center on a convenient value near the minimum. Part of the reason I do that is to emphasize the point that it is nowhere written in stone that thou shalt center on the mean! (Even if one does want to mean-center, it is better practice, I think, to center on a value near the mean, and to center on the same value each time if one is conducting multiple studies. After all, the sample means will not all be the same; so centering on the same value each time makes the results more comparable across studies.) Cheers! Bruce Richard Ristow wrote > At 02:34 PM 4/16/2014, Bruce Weaver wrote: > >>when I reverted to treating age as a continuous variable (in my >>final model), I knew I had to include both Age and Age-squared as >>explanatory variables. Including Age-squared allowed the functional >>relationship to be U-shaped. > > Bruce is far more the methodologist than I, but it's worth adding > that, for variables (like age) with strictly positive values, the > linear and squared terms tend to be highly correlated, leading to the > usual difficulties when estimating using correlated independent variables. > > One can mean-center the age before estimating, to avoid this. Or, it > works pretty well to choose an age near the middle of the range you > have, and use the square of the difference from that age. (It's fine > to use the plain age, rather than mean-centered, as the linear term.) > > ===================== > To manage your subscription to SPSSX-L, send a message to > LISTSERV@.UGA > (not to SPSSX-L), with no body text except the > command. To leave the list, send the command > SIGNOFF SPSSX-L > For a list of commands to manage subscriptions, send the command > INFO REFCARD ----- -- Bruce Weaver [hidden email] http://sites.google.com/a/lakeheadu.ca/bweaver/ "When all else fails, RTFM." NOTE: My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above. -- View this message in context: http://spssx-discussion.1045642.n5.nabble.com/logistic-regression-assumption-tp5725433p5725508.html Sent from the SPSSX Discussion mailing list archive at Nabble.com. ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. 
To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
In reply to this post by Bruce Weaver
Hi Bruce, I just posted the link to that article without comment before because I was preoccupied, but now that I have a moment I'd like to chime in here. First and foremost, I agree with you entirely. I have not encountered a situation in which centering a variable resulted in any change in the actual model being fit. I have, on occasion, encountered challenges achieving convergence when fitting random-effects models via Bayesian estimation in WinBUGS and SAS without mean-centering [due to high autocorrelation - an issue with Bayesian estimation I care not to delve into at the moment].
Knowing that (1) generally, regression models are not changed by mean-centering variables, and (2) I can utilize the coefficient matrix L to obtain parameter estimates/contrasts at whatever values of the variables I desire via sub-commands of various procedures (e.g., LMATRIX in GLM, TEST in MIXED), I virtually never mean-center before fitting models.
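For instance, a sketch of the GLM-style L-matrix approach Ryan mentions (y, x, and x_squared are placeholder names; the evaluation point x = 2 is arbitrary). Since the slope of b0 + b1*x + b2*x^2 at x = c is b1 + 2c*b2, the L coefficients are 1 for x and 2c for x_squared:

* Estimate the slope at x = 2 directly, without refitting a centered model.
UNIANOVA y WITH x x_squared
  /PRINT=PARAMETER
  /LMATRIX='slope at x=2' x 1 x_squared 4
  /DESIGN=x x_squared.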
Best,
Ryan

On Thu, Apr 17, 2014 at 8:41 PM, Bruce Weaver <[hidden email]> wrote:
[snip, previous]
In reply to this post by Jon K Peck
Naturally, I have to agree with the mathematics. If you want to say that the difference is "illusory", that's okay, too, for certain values of the word "illusory".

I have to say, here: please keep in mind that "illusions" can serve a useful function. Twenty frames per second of fixed images showing moving figures gives the human viewer the illusion of motion. That makes possible flip-books and movies.

Have you ever had to show your results to someone else? I assure you, it is easier to discuss two regression coefficients - their sizes and tests - when they are not highly correlated. I try to avoid modeling with such terms, period. For two highly correlated variables among the IVs, I suggest to consultees that they be modeled by some (relatively uncorrelated) composites for the sum and difference, or the sum and difference of the logarithms. Putting in two highly correlated terms is something we should do only when it is unavoidable - that is, when we *want* to puzzle over their confounding after the fact. What you can say about the correlated ones most often comes down to, "Ignore these numbers; take my word that it means what I say." My own consultees have been happier with the illusion presented by values and tests for separate terms. And it *does* tell them about the relative impact of the terms, fairly concisely and precisely.

But I learned to center for the other purpose that was mentioned: the *occasional* failure of a program to get an answer because of a near-collinearity error - convergence, or otherwise. That purpose is not illusory. It seems like sloppy practice to wait for the error to happen when it can be prevented.

--
Rich Ulrich

Date: Thu, 17 Apr 2014 19:41:54 -0600
From: [hidden email]
Subject: Re: logistic regression assumption
[snip, previous]
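A sketch of the composite recoding Rich suggests for two highly correlated predictors (X and Z are placeholder names):

* Replace a highly correlated pair with relatively uncorrelated composites.
COMPUTE SumXZ = X + Z.
COMPUTE DifXZ = X - Z.
* Or, for positive, skewed variables, take sums and differences of logs.
COMPUTE SumLogXZ = LN(X) + LN(Z).
COMPUTE DifLogXZ = LN(X) - LN(Z).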
In reply to this post by Bruce Weaver
Might I also add that one could perform a Likelihood Ratio Test (LRT) to test whether including the AgeSq term significantly improves model fit in Bruce's example. Although untested, I'm fairly certain the following adjustment to Bruce's syntax will provide the LRT in the Omnibus Tests of Model Coefficients Table:
LOGISTIC REGRESSION VARIABLES Admission_status2
  /METHOD=ENTER Age Sex ED_only locum
  /METHOD=ENTER Age AgeSq Sex ED_only locum.

Best,
Ryan

On Mon, Apr 14, 2014 at 4:17 PM, Bruce Weaver <[hidden email]> wrote:
[snip, previous]
In reply to this post by Rich Ulrich
While I agree that mean-centered variables are easier to interpret - please add a chart if you want to substantively talk about them! I can do the derivatives in my head, although I suspect much of any audience won't go to that trouble. I also do not have a good mental model of the steepness of the parabola from just the estimated parameters, nor of how large or small the estimates get toward the reasonable values of the explanatory variable in question. (This is important, as polynomial terms often behave badly in the tails - one of the reasons to use restricted cubic splines.) My mental model of these things gets worse if you include a cubed term.

So please, graph your effect estimates! All the things of interest (inflection point, how fast the curve rises or falls, how extreme the tails are) are immediately visible in a graph. You can also add confidence intervals or prediction intervals to the graph. This advice extends to any set of functionally related explanatory variables.
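One minimal way to make such a graph in SPSS (a sketch with made-up coefficients; substitute the estimates from your own fitted model, and run it on a fresh dataset since INPUT PROGRAM builds its own cases):

* Build a grid of ages, compute the fitted curve, and plot it.
* The coefficients -4, 0.12, and -0.0015 below are hypothetical.
INPUT PROGRAM.
LOOP Age = 18 TO 80.
COMPUTE LogOdds = -4 + 0.12*Age - 0.0015*Age**2.
COMPUTE PredProb = 1 / (1 + EXP(-LogOdds)).
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
EXECUTE.
GRAPH /SCATTERPLOT(BIVAR)=Age WITH PredProb.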
In reply to this post by Andy W
Just curious. It seems that some people post directly and only to Nabble, and some post the same to this list. When I was looking at the logistic regression discussion this morning, I noticed that one of the posts "had not been accepted by the list", which I think Bruce, David or Andy have noted before. What is the functional relationship between Nabble and this list? And is that relationship bidirectional or unidirectional only? Then, why the delay?
Gene Maguin

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Andy W
Sent: Thursday, April 17, 2014 9:05 PM
To: [hidden email]
Subject: Re: logistic regression assumption
[snip, previous]
In reply to this post by Bruce Weaver
At 07:01 PM 4/17/2014, Bruce Weaver wrote:
>I would argue that the collinearity of X and X-squared is
>"illusory", meaning that it is completely non-problematic. (I know
>there is a published article somewhere making this argument, but I
>can't lay my hands on it right now.) Here's one reason for thinking
>that: If you run the model with and without centering, and save the
>fitted values of Y (or the predicted probabilities, in the case of
>logistic regression), those fitted values (or predicted
>probabilities) will be identical.

Whatever the collinearity is, it isn't illusory; it's there, and readily calculable and displayable in the usual fashions. What you, and others, are arguing is that re-parameterizing the model as I've suggested doesn't change the subspace of possible models (defining a 'model' as a set of predicted values), which is correct; that, therefore, it doesn't change the best-fitting model, which is also correct; and that, therefore, it doesn't matter, which I disagree with.

The two reasons I advocate re-parameterizing are, first, that it makes the resulting coefficients much more interpretable, as others have noted - the linear term becomes the predicted DV change per unit IV change in a central part of the range; and second, that keeping the original, near-collinear parameterization greatly inflates the standard errors and confidence intervals of the estimated coefficients. Among other things, that makes using t- or F-tests for whether non-linear terms belong in the model very insensitive. (It may be argued that using ANY test to exclude terms from a model results in overstating the F-based significance of the model; but that argument applies equally to choosing whether to include higher-order terms on the basis of a graph.)

It's been noted that collinear predictors also make the estimation more difficult, numerically, though with modern hardware and software that's a lesser issue.
Good morning Richard. :-)
For the record, I want to clarify that I did not intend to advocate NOT centering variables. As I said...

"... I often do center the variables. But I do so simply to make (some of) the coefficients more interpretable. And rather than center on the mean, I often center on a convenient value near the minimum. Part of the reason I do that is to emphasize the point that it is nowhere written in stone that thou shalt center on the mean! (Even if one does want to mean-center, it is better practice, I think, to center on a value near the mean, and to center on the same value each time if one is conducting multiple studies. After all, the sample means will not all be the same; so centering on the same value each time makes the results more comparable across studies.)"

Upon reflection, one change I would make in that (off-the-cuff) paragraph is to change "simply" to "mainly" in the second sentence, i.e., "I do so MAINLY to make (some of) the coefficients more interpretable". The main point I was *trying* to make is that I disagree with those authors who say that one MUST (mean) center their variables when the model includes product terms or higher order polynomial terms (which are really product terms too - X-sq = X*X, for example). But having read some of the other posts in the thread, I will concede that even with modern computing power and software, one may sometimes run into computational difficulties that can be alleviated by centering on some reasonable, in-the-observed-range value (not necessarily the mean).

By the way, I also strongly agree with Andy W on the importance of plotting fitted values for models that include product terms. Looking at such plots is FAR more illuminating than looking at tables of coefficients. (Even if one does wish to interpret the coefficients, it is much easier to do so having looked at plots of fitted values, in my experience.)

Cheers!
Bruce
--
Bruce Weaver bweaver@lakeheadu.ca http://sites.google.com/a/lakeheadu.ca/bweaver/ "When all else fails, RTFM." PLEASE NOTE THE FOLLOWING: 1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above. 2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/). |
Bruce and others,

Suppose the population regression model is:

Y = 0.5 + 1.5*x + 2.0*(x^2) + Epsilon

Further, suppose we randomly select 10,000 subjects, and collect data on both y and x for each subject. Below my name is a simulation experiment which shows that we can obtain estimated parameters (intercept and main effect), standard errors, t-statistics, and p-values from a model employed on non-centered data that are *identical* to those from a model employed on centered data. The TEST statements of the MIXED procedure provide proof of what I claim, at least for this simulation example.
To construct those TEST statements, all I needed to do was to recognize the relationship between the non-centered and centered equations. With the exception of numerical instability due to various factors (which I very rarely encounter, and certainly did not encounter in this simulation experiment), I continue to assert that there is no need to mean-center the model provided above with respect to accurately estimating model fit, parameters, standard errors/confidence intervals, test statistics, p-values, etc.
Ryan

--

*Generate Data.
set seed 1234.
new file.
input program.
loop ID = 1 to 10000.
compute x = rv.normal(2,1).
compute y = 0.5 + 1.5*x + 2.0*(x**2) + rv.normal(0,1).
end case.
end loop.
end file.
end input program.
execute.

COMPUTE x_squared = x*x.
EXECUTE.

*OLS Regression without mean centering.
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT y
  /METHOD=ENTER x x_squared.

COMPUTE x_mean_centered = x - 1.9797462214653716.
COMPUTE x_mean_centered_sqrd = x_mean_centered**2.

*OLS Regression with mean centering.
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT y
  /METHOD=ENTER x_mean_centered x_mean_centered_sqrd.

*REML Regression without mean centering.
*Note: Used TEST subcommands to recover the intercept and main effect
 tests from the OLS Regression with mean centering.
MIXED y WITH x x_squared
  /FIXED=x x_squared | SSTYPE(3)
  /PRINT=SOLUTION
  /METHOD=REML
  /TEST 'intercept @ x=0' intercept 1 x 0 x_squared 0
  /TEST 'main eff @ x=0' intercept 0 x 1 x_squared 0
  /TEST 'intercept @ x=mean' intercept 1 x 1.9797462214653716 x_squared 3.919395101406414
  /TEST 'main eff @ x=mean' intercept 0 x 1 x_squared 3.959492442930742.
On Fri, Apr 18, 2014 at 10:13 AM, Bruce Weaver <[hidden email]> wrote:
[snip, previous]
Thank you all for your input.
I am rather naive in stats, so I would like to clarify this: When I use age as a continuous variable, its relationship with the outcome is not linear, and therefore I cannot use it as-is. If I use age as ordinal (18-40, 41-60, 61-80), I guess I do not need to worry about linearity. Results come back similar, and from a practical point of view it does not change a lot. I may miss the information that a continuous variable offers (e.g., HR per year), but I still get valuable information about the impact of age. Is this considered acceptable?

I am grateful to you for all your input, but I am a little concerned about using advanced stats (at least for me), since I may make a significant mistake without even realizing it.

Thank you in advance,
I'll repeat something I noted earlier in the thread, and expand on it.
Here's the repeated bit:

1. Generally speaking, it is preferable to treat continuous variables as continuous. E.g., see the Streiner article "Breaking up is hard to do" (link given below). But if the functional relationship between that continuous variable and the outcome is not linear, you'll have to take that into account somehow (e.g., by including higher order polynomial terms, or regression splines, etc.).

http://isites.harvard.edu/fs/docs/icb.topic477909.files/dichotomizing_continuous.pdf

And here is the expansion. With the age groups you list below:

1. Everyone within an age group will have exactly the same fitted value, despite differing in age by up to about 20 years for those at the extremes.
2. Two people just on either side of the age-group cut-points can have very different fitted values, despite tiny differences in age.
3. The age-group cut-points are probably arbitrary, and the fitted values for individuals near the cut-points will likely change fairly substantially if you change the cut-points.

These are some of the reasons why it is usually preferable (if at all possible) to model continuous variables (like Age) as continuous. HTH.
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/
"When all else fails, RTFM."