This is a question that Diana Kornbrot tacked on at the end of a post in the recent thread on the Benjamini-Hochberg FDR (http://spssx-discussion.1045642.n5.nabble.com/False-Discovery-Benjamin-Hochberg-tp5725087.html). I thought it deserved its own thread, so am re-posting it here.
Diana wrote:

SPSS was far from eager to run logistic with 2 factors, 1 of which had 3219 levels [after 3 hours it was still on iteration2 - i gave up]

Comments welcome

best

Diana
___________

Diana, please provide more information. What are the 2 factors? Bearing in mind that "factor" means categorical variable in SPSS lingo, I am struggling to think of a categorical variable that would have 3219 levels. Should it really be treated as a continuous variable (or a "covariate" in SPSS lingo)?

On the other hand, if it really does have 3219 levels, should you be using GENLINMIXED, and estimating the variance of random intercepts, etc., rather than treating it as fixed?

HTH.
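For concreteness, here is a rough sketch of what the GENLINMIXED random-intercept alternative could look like. All of the variable names are placeholders (y for the binary outcome, smallfac for the other factor, bigfac for the 3219-level identifier), since we don't yet know what Diana's variables actually are, and the subcommands should be checked against the GENLINMIXED documentation before use:

*Rough sketch only - y, smallfac, and bigfac are hypothetical placeholder names.
*Random intercepts for the 3219-level identifier instead of 3219 fixed dummies.
GENLINMIXED
  /DATA_STRUCTURE SUBJECTS=bigfac
  /FIELDS TARGET=y
  /TARGET_OPTIONS DISTRIBUTION=BINOMIAL LINK=LOGIT
  /FIXED EFFECTS=smallfac USE_INTERCEPT=TRUE
  /RANDOM USE_INTERCEPT=TRUE SUBJECTS=bigfac COVARIANCE_TYPE=VARIANCE_COMPONENTS.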
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).
Speaking as someone who works with SAS and SPSS on a daily basis: with every new version of SAS (now at SAS 9.4), the developers have made leaps and bounds. One example is the development of "high performance" procedures (e.g., logistic, mixed). From what I have seen, these high performance procedures take a fraction of the time the traditional procs require to converge on very complex models. There are many other advances SAS has made that I would like to see SPSS achieve, but I'll stick *loosely* to the current topic of this thread. (If one were to take Bruce's advice to treat this fixed factor as random, it would likely take very long for the model to converge using GENLINMIXED.)

Ryan

On Thu, Apr 3, 2014 at 8:51 AM, Bruce Weaver <[hidden email]> wrote:
> <snip>
In reply to this post by Bruce Weaver
If one wants to do fixed effects regression with many groups (fixed effects in Economics speak - not multi-level model speak), it is equivalent to estimating a regression model with the expanded set of dummy variables. See http://www.stata.com/support/faqs/statistics/intercept-in-fixed-effects-model/ for an example. Fitting a random effects model is not equivalent - so regardless of computational difficulty, I don't believe suggesting random effects in lieu of the dummy variables is reasonable advice. In most circumstances the study design and identification of what effects you are interested in should dictate which model you fit.
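To make the "expanded set of dummy variables" point concrete in SPSS terms for the linear case, a minimal sketch would simply declare the group identifier as a factor and let the procedure expand it into dummies internally. The names here (y, x, grp) are hypothetical placeholders, not anything from the thread:

*Minimal sketch of the dummy-expansion fixed effects fit (linear case).
*y, x, and grp are placeholder names; grp is expanded into dummy variables internally.
UNIANOVA y BY grp WITH x
  /PRINT=PARAMETER
  /DESIGN=grp x.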
The Stata link also describes an option for linear regression when you can't estimate all of those dummy variables: you can group mean center the variables and then estimate the regression equation without an intercept (although the standard errors for the effect estimates are not correct at that point - and you do not have individual group intercept estimates if you are interested in those). This advice does not extend to logistic regression though.

Fixed effects logistic regression is sometimes referred to as conditional logistic regression. See https://www-304.ibm.com/support/docview.wss?uid=swg21477360 for how you can use the COXREG procedure with the STRATA subcommand to fit this model. It is worth a shot to see if that fits the model faster than LOGISTIC (or whatever procedure you are using). Again, I do not believe that results in individual intercept estimates for the groups though, which is what it seems Diana wants from the original question.

I've had problems fitting a large number of dummy variables in the past as well. For this paper, http://dx.doi.org/10.1007/s10940-011-9161-7, I had a fixed effects OLS regression with 29,000 observations and around 10,000 groups. The reported results are from Stata, and I was unable to fit the equivalent models in SPSS (although I could fit them in SPSS on smaller subgroups of the data).

Here I made a quick simulation to test out PLUM (I do not have access to LOGISTIC, GENLIN, GENLINMIXED or COXREG on my work machine) with 4,000 groups and between 2 and 20 observations per group. I am on iteration 6 about 40 minutes in. I'd be interested to hear if any of the other procedures have better luck.

*******************************************.
SET SEED 10.
INPUT PROGRAM.
LOOP #g = 1 TO 4000.
COMPUTE #gRan = RV.NORMAL(0.5,0.1).
LOOP #obs = 1 TO TRUNC(RV.UNIFORM(2,21)).
COMPUTE Group = #g.
COMPUTE GroupRand = #gRan.
END CASE.
END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME sim.
*One categorical covariate.
COMPUTE cat = RV.BERNOULLI(0.5).
*Two continuous covariates.
COMPUTE cont1 = RV.NORMAL(0,1).
COMPUTE cont2 = RV.NORMAL(0,1).
FORMATS cat (F1.0) Group (F4.0) GroupRand cont1 cont2 (F4.2).
VARIABLE LEVEL cat Group (NOMINAL) GroupRand cont1 cont2 (SCALE).
*Inverse logit function.
DEFINE !INVLOGIT (!POSITIONAL !ENCLOSE("(",")") )
1/(1 + EXP(-!1))
!ENDDEFINE.
*Making a model where the effect of cont1 varies with group, but cont2 is fixed.
COMPUTE prob = !INVLOGIT(-0.7 + 0.6*cat + GroupRand*cont1 + 0.2*cont2 + RV.NORMAL(0,1)).
COMPUTE out = RV.BERNOULLI(prob).
FORMATS out (F1.0) prob (F3.2).
VARIABLE LEVEL out (NOMINAL) prob (SCALE).
EXECUTE.
*Now estimating the logit regression.
*I only have Base - so no LOGISTIC or GENLIN.
PLUM out BY Group WITH cont2
  /CRITERIA=CIN(95) DELTA(0) LCONVERGE(0) MXITER(100) MXSTEP(5) PCONVERGE(1.0E-6) SINGULAR(1.0E-8)
  /LINK=LOGIT
  /PRINT=FIT PARAMETER SUMMARY.
*Also might want to check out COXREG with the STRATA subcommand.
*See https://www-304.ibm.com/support/docview.wss?uid=swg21477360.
*******************************************.
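For anyone who does have COXREG available and wants to try the conditional logistic route on the simulated data above, something along these lines is a rough sketch of the usual "conditional logistic via Cox" trick. The pseudo-time construction (cases get an earlier event time, non-cases a later censored time) is an assumption on my part about the standard setup; the IBM technote linked above should be treated as the authority on the exact recipe:

*Rough sketch of conditional (fixed effects) logistic via COXREG on the simulated data.
*The pseudo-time setup is an assumption - check the IBM technote above for the exact recipe.
COMPUTE ptime = 2 - out.
COXREG ptime WITH cont2
  /STATUS=out(1)
  /STRATA=Group
  /METHOD=ENTER cont2
  /PRINT=CI(95).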
In reply to this post by Bruce Weaver
Diana,
This sounds to me like taking a shot into the clouds to see whether any bird falls down. Can you interpret the difference between, say, the 1746th and the 2001st level? Wouldn't a regrouping of levels before starting the calculation produce results that are easier to interpret?

HTH
ftr

On 03/04/2014 14:51, Bruce Weaver wrote:
> <snip>
In reply to this post by Andy W
Andy,

While I agree that simply having a large number of levels of a categorical variable does not indicate that the variable should be treated as random, it certainly is indicative of the possibility that the levels are being sampled (perhaps randomly) from a larger population. Now, if one has no interest in making inferences from the sampled units to the population of units, then continuing to treat it as fixed is reasonable. In my line of work, usually when I have thousands of categories (e.g., cities, zip codes), I'm dealing with multilevel data and plan on making inferences beyond the sampled categories.
Ryan

On Thu, Apr 3, 2014 at 10:45 AM, Andy W <[hidden email]> wrote:
> <snip>
It helps to be specific here, so I will quote myself:
"In most circumstances the study design and identification of what effects you are interested in should dictate which model you fit." If you are substantively interested in random effects then by all means go and fit a random effects model. Fixed effects models are typically motivated by omitted "group invariant" variables that are possible confounders for effects of interest. E.g. in my simulated example, even if you don't observe "cont1" (and it happens to be correlated with "cont2" - which isn't the case in my example), you can still estimate the unbiased effect of "cont2" with the fixed effects approach. And yes, you can make inferences based on that estimated effect. (Talking about inferences to super-populations though is getting a bit into the weeds for a forum post.) ------ Now, the simulated code example I posted finished in 2 hours and 10 minutes (although I have various warnings for low cell counts). Do any of the other procedures do any better? |
In reply to this post by Bruce Weaver
At 08:51 AM 4/3/2014, Bruce Weaver wrote:
>Diana Kornbrot wrote:
>
>>SPSS was far from eager to run logistic with 2 factors, 1 of which
>>had 3219 levels [after 3 hours it was still on iteration2 - i gave up]
>
>Diana, please provide more information. What are the 2 factors? Bearing in mind
>that "factor" means categorical variable in SPSS lingo, I am struggling to think
>of a categorical variable that would have 3219 levels. Should it really be
>treated as a continuous variable (or a "covariate" in SPSS lingo)?

Actually, I ran into a problem with something like this number of levels a few years ago -- a linear, rather than logistic, problem. The data were a set of outcomes, plus presenting condition and demographic information, for a large set of patients, each treated by one of some thousands of physical therapists, identified by a code number. The question was whether one could meaningfully identify 'better' therapists, with a view to looking at what characteristics -- education, experience, etc. -- 'better' therapists had.

This was done (in Stata, if I recall correctly) by fitting a regression model for outcome based on presenting characteristics and demographics, and then running a simple ANOVA on the residuals of that regression, with therapist being the factor with a very large number of levels. Of course that was very expensive of sample size, but the dataset was large. (All therapists with fewer than 7 patients on record were dropped from the analysis.) Having demonstrated the effect of therapist on reducing outcome variance, therapists in the top 10% of outcome effects were compared with 'ordinary' ones (40-60 percentile).

So I can see Diana having a similar problem, though logistic regression will be much more expensive in computing time and sample size than is plain ANOVA.
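In SPSS terms, a rough sketch of that two-step approach might look like the following. All variable names are hypothetical stand-ins for the presenting/demographic measures, the outcome, and the therapist code:

*Rough sketch of the two-step approach - all variable names are hypothetical.
*Step 1: regress the outcome on presenting condition and demographics, saving residuals.
REGRESSION
  /DEPENDENT outcome
  /METHOD=ENTER presenting1 presenting2 age sex
  /SAVE RESID(res_out).
*Step 2: one-way ANOVA on the residuals, with therapist as the many-level factor.
ONEWAY res_out BY therapist.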
This ends up being pretty similar to (but less efficient than) estimating the regression equation with the expanded dummy variables. So say we have a regression equation of the form:
y = B0 + Bk*X + e

You can also write it in terms of expectations, since the expectation is linear and the error term has a mean of zero:

E[y] = B0 + Bk*E[X]

(Remember that if we draw a scatterplot of X vs. Y, the linear regression line always has to pass through the mean of X and the mean of Y.) So if you then take the residual from this equation, e, and examine the differences in means among the e's across your set of dummy variables D1 to Dj, that can be written as another linear regression:

E[e] = A1*E[D1] + A2*E[D2] + ... + Aj*E[Dj]

Note there is no intercept here, and the D's are all orthogonal by construction. If we replace e with its original constituent parts, we then have:

E[(y - B0 - Bk*X)] = A1*E[D1] + A2*E[D2] + ... + Aj*E[Dj]

Again because of the linearity of the expectation, we may expand out the terms in the parentheses:

E[y] - B0 - Bk*E[X] = A1*E[D1] + A2*E[D2] + ... + Aj*E[Dj]

Then, putting back on the right hand side all the things that should not be on the left hand side, we have:

E[y] = B0 + Bk*E[X] + A1*E[D1] + A2*E[D2] + ... + Aj*E[Dj]

Because the D's are collinear with the intercept, you need to leave out the intercept to estimate all of the A's (or just use a post-estimation command like EMMEANS, which would be just as simple in terms of SPSS estimates). You actually won't get the same coefficient estimates in the two equations (unless the dummy variables are orthogonal to the other covariates in the original model). When I see someone do this I like to refer to Gary King's article, in which he calls this regression on residuals and warns against it: How Not to Lie with Statistics: Avoiding Common Mistakes in Quantitative Political Science, http://gking.harvard.edu/files/gking/files/mist.pdf.

Besides being more efficient, estimating the dummy variables allows you to set a confidence interval on the estimate and to form future prediction intervals in the original metric.

It may be a bit of a sore spot for some of the education researchers on the list, but this is pretty similar to (if not synonymous with) how value added scores for teachers are frequently calculated. That actually ends up being a situation in which random effects are reasonable. For future predictions, the shrinkage of the effect estimates produces better forecasts in the original metric. Also, with the dummy variable approach, individuals with few observations will have a wide interval for their effect; with random effects those individuals will just be shrunk back toward the grand mean by a larger amount, making ranking of the individuals simpler.
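To make the shrinkage point concrete, here is the textbook result for the simplest case only (a linear random-intercept-only model with known variance components, which is an idealization of the value-added setting above, not the full setup). The estimated (BLUP) effect for group j is roughly

uhat_j = [tau^2 / (tau^2 + sigma^2/n_j)] * (ybar_j - ybar)

where tau^2 is the between-group variance, sigma^2 the within-group variance, n_j the number of observations in group j, ybar_j the group mean, and ybar the grand mean. Groups with a small n_j get a multiplier closer to zero, so they are pulled more strongly back toward the grand mean - exactly the behavior described above.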
The PDF Andy linked to doesn't contain any citation info, so here it is (from Google Scholar):
King, G. (1986). How not to lie with statistics: Avoiding common mistakes in quantitative political science. American Journal of Political Science, 30(3), 666-687.

More info on the author can be found here:

http://gking.harvard.edu/
http://www.gov.harvard.edu/people/faculty/gary-king

HTH.
In reply to this post by Andy W
One comment inserted below:
On Thu, Apr 3, 2014 at 12:35 PM, Andy W <[hidden email]> wrote:
> It helps to be specific here, so I will quote myself: <snip>

--> We may have to agree to disagree here about your parenthetical statement. If one views the levels of a factor as a [random] subset of all possible levels of that factor and is interested in making inferences that go beyond the subset sampled, then treating that factor as "random" would be appropriate. I do not see this concept as getting into the weeds for a forum post that brought in the concept of treating a factor as random.
I don't think we are in disagreement, Ryan. I am saying that in the fixed effects design you aren't interested in making inferences about the factor level effects - they are just a means to an end for estimating the other effects you are interested in, absent potential group level confounding factors. That is what I mean when I say "what effects you are interested in will dictate the model you fit".
That is why I initially said replacing a set of dummy variables with random effects is frequently not warranted. Given the nature of the observational study or quasi-experiment, you might need those 3,000 dummy variables to properly identify the estimate of the effect you are interested in.

Thank you, Bruce, for the full citation to the King article.