Hi,
I am currently in the end faze of writing my bachelor's thesis in political science, where I use a panel data set to investigate the effects of corporate contribution bans on policy outcomes at the state level. Since I don't have much statistical background to draw upon, I've been forced to read up on panel data, how to model it, and the various problems as I've gone along.

Now I've encountered one problem which I don't know how to tackle, and hopefully someone here will be able to offer some advice. My question is whether my model is overfitted (a concept I just discovered). I first got suspicious due to the very high value of my model's R-squared (0,766) despite the fact that I'm only using 4 predictor variables. The 732 degrees of freedom also appear very high to my untrained eye. Due to my use of LSDV I also have 62 dummy variables (the study includes 47 states and 17 time periods, resulting in 46+16 dummy variables). Could this have made my model too complex, and should I apply a bootstrap or another similar test?

I greatly appreciate all help and wish you all a happy new year.

Best,
A. Severin
Perhaps 'daze' would be a better word than 'faze'!?
That aside, it seems to me that you have a (very) industrious topic for a senior thesis and that you must have an industrious senior thesis advisor. That person ought, therefore, to know, or to know who at your institution knows, about analyzing the kind of data that you have. That person is your advisor and they should advise.

That said, please begin by defining 'LSDV'. Others on the list will know more than I do about this type of analysis problem.

You have data for 17 time periods for each of 47 states. That data must be arranged in a long-format file, which will give you 47*17 = 799 records. In addition to your main IVs, you need to model state differences and time differences, so you have 46 state dummies and 16 time dummies. So your dof will be 799 - 46 - 16 - 4 - 1 = 732.

Is your Rsq too high? Hard to tell; it depends on your correlations. What do the correlations of the 4 IVs and the DV look like (ignore the dummies for the moment)?

Gene Maguin

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Molte
Sent: Monday, December 31, 2012 10:11 AM
To: [hidden email]
Subject: Fixed Effects Model (LSDV) and Overfitting??
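The degrees-of-freedom count in Gene Maguin's reply can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the thread; the adjusted R-squared at the end is an addition that was not reported by the poster, computed with the standard formula under the assumption that the panel is fully balanced.

```python
# Degrees-of-freedom bookkeeping for the LSDV model described in the thread.
n_states = 47
n_periods = 17
n_predictors = 4      # the substantive IVs
r_squared = 0.766     # reported in the thread as 0,766

n_obs = n_states * n_periods            # balanced long-format panel
n_state_dummies = n_states - 1          # one state is the reference category
n_time_dummies = n_periods - 1          # one period is the reference category
n_dummies = n_state_dummies + n_time_dummies

# Residual df = observations - dummies - predictors - intercept
residual_df = n_obs - n_dummies - n_predictors - 1

# Adjusted R-squared (assumption: not reported in the thread, standard formula)
adjusted_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / residual_df

print(n_obs, n_dummies, residual_df, round(adjusted_r_squared, 3))
```

Because the adjustment penalizes all 66 estimated slopes, comparing it with the raw R-squared gives a rough sense of how much of the fit is bought by the sheer number of parameters.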
In reply to this post by Molte
How often have bans been imposed?
To start with a simple model, before you think of "bans" that have been imposed, you are looking at "policy outcomes" by State, across 17 years. I would expect that there would be some fairly strong and consistent differences between States -- and this *might* be considered as "nuisance" variance, something to be accounted for but not very interesting at all. But that is part of what is making up your "large R^2".

Depending on what the "policy outcomes" consist of, there might also be a sizable component of linear trend, since these are time series. The time trend *also* could be "nuisance". Time series are notorious for showing large and largely irrelevant R^2. Is that one of your other variables?

The interesting terms would seem to be the effects, on parallel time series, of imposing Interventions (bans). This seems more and more like one of those complicated models in econometrics, since (for instance) the previous change in "policy outcomes" in a particular state (or neighboring states, or any states) might have helped create opinions that led to the bans.

In any case -- the large R^2 is not surprising to me, but it is also (probably) not interesting until you look at terms that model the possible effects of bans. That could be more complicated than Yes/No.

--
Rich Ulrich
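Rich Ulrich's point that stable between-state differences inflate R^2 can be illustrated with a toy panel. The sketch below uses invented numbers (three hypothetical states, one predictor, state fixed effects only), not the thesis data. It relies on the fact that, with state dummies only, the LSDV residuals equal the residuals of the regression on state-demeaned data, so the two R^2 values differ only in which total sum of squares they are compared against.

```python
# Toy balanced panel: 3 hypothetical states x 4 periods (invented numbers).
state_levels = {"A": 10.0, "B": 50.0, "C": 90.0}   # large stable state differences
x_values = [1.0, 2.0, 3.0, 4.0]
noise = [0.5, -0.5, -0.5, 0.5]                     # fixed "noise", sums to zero
true_slope = 0.5

panel = {
    state: [(x, level + true_slope * x + e) for x, e in zip(x_values, noise)]
    for state, level in state_levels.items()
}

def mean(values):
    return sum(values) / len(values)

# Within transformation: subtract each state's own mean from x and y.
x_within, y_within = [], []
for rows in panel.values():
    x_bar = mean([x for x, _ in rows])
    y_bar = mean([y for _, y in rows])
    x_within += [x - x_bar for x, _ in rows]
    y_within += [y - y_bar for _, y in rows]

# Slope from the demeaned data (identical to the LSDV slope for this model).
slope = sum(x * y for x, y in zip(x_within, y_within)) / sum(x * x for x in x_within)

# Residual sum of squares, shared by the within and LSDV fits.
ssr = sum((y - slope * x) ** 2 for x, y in zip(x_within, y_within))

# Total sums of squares: around state means (within) vs. around the grand mean.
tss_within = sum(y * y for y in y_within)
all_y = [y for rows in panel.values() for _, y in rows]
grand_mean = mean(all_y)
tss_total = sum((y - grand_mean) ** 2 for y in all_y)

within_r2 = 1 - ssr / tss_within   # variation explained inside states
lsdv_r2 = 1 - ssr / tss_total      # also credits the state dummies

print(round(slope, 3), round(within_r2, 3), round(lsdv_r2, 5))
```

In this toy case the predictor explains only a bit more than half of the within-state variation, while the LSDV R^2 is close to 1 because the dummies absorb the large level differences between states.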
In reply to this post by Maguin, Eugene
Gene:
Daze is indeed a fitting word. 67 hours to go. It is indeed a bit industrious and I feel like I have dug a statistical hole for myself. No choice but to see it through now, though. Unfortunately, my advisor has left the country, so I am on my own for the last leg.

By 'LSDV' I mean the Least Squares Dummy Variable approach to modeling fixed effects, rather than the Within Group approach.

Regarding my correlations, they are presented below, where CCB denotes the contribution ban and is coded dichotomously, and the remaining 3 IVs represent partisan control of the branches of state government. The dependent variable is an index of pollution abatement costs in the states.

I appreciate the input :)
In reply to this post by Rich Ulrich
Rich:
During the time period 1977-1994, a corporate contribution ban was implemented in one state and lifted in four states. A total of 20 states had a ban in place throughout the time period and the remaining 22 states never had a ban in place.

There is indeed a lot of variation between the states in terms of the dependent variable (pollution abatement costs). These constant differences are what I aim to control for by employing state fixed effects. Just as you say, I suspect that the heterogeneity captured by these fixed effects is the reason for much of the high R2. It is noteworthy that the state fixed effects have a combined F-value of 29,3 and are very significant.

As I wrote in the other reply, I absolutely believe that there's a trend in the time series, but that is what I hope to control for when, in a further analysis, I employ a lagged dependent variable. This gives me a total of 686 observations instead of 799, since the first year of the time series is dropped for each state. I also control for time fixed effects, but they seem to be very limited and not statistically significant.

Thanks so much for your help.

A further question of mine concerns how much weight to put on the significance of my results. Since almost all US states are included in the study, can it be said to resemble a study of the total population, and if so, doesn't the importance of the p-value diminish, since I'm not using a small sample from a larger universe of US states?
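How a lagged dependent variable costs one observation per state can be sketched as follows. The function name `add_lagged_dv` and the toy records are hypothetical and chosen for illustration. In a fully balanced panel of 47 states and 17 periods, the same logic would leave 47 × 16 = 752 rows; the 686 reported above is smaller, which would be consistent with additional missing values, though the thread does not say.

```python
from collections import defaultdict

def add_lagged_dv(records):
    """Attach the previous year's y to each (state, year, y) record.

    Rows without an immediately preceding year in the same state are dropped,
    so each state loses at least its first observation.
    """
    by_state = defaultdict(list)
    for state, year, y in records:
        by_state[state].append((year, y))

    lagged = []
    for state, series in by_state.items():
        series.sort()  # chronological order within the state
        for (prev_year, prev_y), (year, y) in zip(series, series[1:]):
            if year - prev_year == 1:  # never lag across a gap
                lagged.append((state, year, y, prev_y))
    return lagged

# Invented toy panel: 2 hypothetical states x 3 years.
toy = [
    ("A", 1977, 1.0), ("A", 1978, 1.2), ("A", 1979, 1.1),
    ("B", 1977, 2.0), ("B", 1978, 2.1), ("B", 1979, 2.4),
]
toy_lagged = add_lagged_dv(toy)

# Row count expected for a balanced 47 x 17 panel after lagging.
balanced_expected = 47 * (17 - 1)

print(len(toy), len(toy_lagged), balanced_expected)
```

Lagging within each state, rather than over the stacked file as a whole, matters: otherwise the first year of one state would borrow the last year of the previous state as its lag.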
In reply to this post by Molte
For the effects of changing a ban: Tiny, tiny Ns, with 4 lifted and one imposed. This is too few to expect tests to show much, especially since there are so many other potential differences between states -- like, How big? How industrialized?

You have a question of what happens across years. Is there a big, general change? You have questions of "scaling" -- Are you looking at logs, or at some sort of "normalized" value that makes states comparable?

I would say that your major *test* comparison can only be the no-ban/all-ban contrast, using 20 versus 22 states. You can plot those means across the 17 years and decide whether there is a time trend that needs to be accommodated. (That is most easily done if "linear" takes out most of the change.)

The data you have available only let you try to tell a somewhat-solid story about these states (if there is one story). If this were mine, what seems to come next is to use a log scale (vertical) to show the costs across time for the 5 states with Changes, marking the point of "intervention". On that, superimpose the two lines for the averages of the 20 and 22 states. Is there a further story here?

--
Rich Ulrich

> Date: Tue, 1 Jan 2013 09:54:46 -0800
> From: [hidden email]
> Subject: Re: Fixed Effects Model (LSDV) and Overfitting??
> To: [hidden email]
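The group comparison Rich Ulrich suggests (yearly means on a log scale for the always-ban and never-ban states, then a check for a linear time trend) might be prepared along these lines. This is a minimal sketch with invented toy records; the group labels, state names, and both helper functions are hypothetical, and the resulting yearly means are what one would then plot.

```python
import math
from collections import defaultdict

# Invented toy records: (state, ban_group, year, cost). Not data from the thread.
records = [
    ("S1", "always_ban", 1977, 100.0), ("S1", "always_ban", 1978, 110.0),
    ("S1", "always_ban", 1979, 121.0),
    ("S2", "always_ban", 1977, 200.0), ("S2", "always_ban", 1978, 220.0),
    ("S2", "always_ban", 1979, 242.0),
    ("S3", "never_ban", 1977, 100.0), ("S3", "never_ban", 1978, 100.0),
    ("S3", "never_ban", 1979, 100.0),
    ("S4", "never_ban", 1977, 50.0), ("S4", "never_ban", 1978, 50.0),
    ("S4", "never_ban", 1979, 50.0),
]

def yearly_mean_log_cost(rows):
    """Mean of log10(cost) for each (ban_group, year) cell."""
    cells = defaultdict(list)
    for _state, group, year, cost in rows:
        cells[(group, year)].append(math.log10(cost))
    return {cell: sum(vals) / len(vals) for cell, vals in cells.items()}

def linear_trend_slope(points):
    """Ordinary least-squares slope of y on x for a list of (x, y) pairs."""
    x_bar = sum(x for x, _ in points) / len(points)
    y_bar = sum(y for _, y in points) / len(points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    return sxy / sxx

cell_means = yearly_mean_log_cost(records)

# One trend slope (change in mean log10 cost per year) for each group.
trend = {}
for group in ("always_ban", "never_ban"):
    series = sorted((year, m) for (g, year), m in cell_means.items() if g == group)
    trend[group] = linear_trend_slope(series)

print(trend)
```

On the log scale, a constant percentage growth shows up as a straight line, which is why a linear trend is easier to judge there than on the raw cost scale.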