Dear SPSS List Folks,
I have data that was transformed to meet the assumptions of parametric tests. The transformation is as follows: V1..V4 --> transformed to log10 --> saved standardized (z) scores --> saved all as Mean=50, SD=10.
I now have standardized and unstandardized beta coefficients from my linear regression output that I would like to make statements about in their original units. Is there a typical way of handling these conditions such that I can say a 1-unit increase in my IV predicts an X-unit increase in my DV, or a 1-unit increase in my IV predicts an X-standard-deviation increase in my DV?
Trying to keep this clear... Peter
Peter, Can you describe the dependent variables in their original form in as much detail as possible, and why you felt the need to transform them? (Keep in mind that one assumes the errors are normally distributed when performing regression analyses.)
Thanks, Ryan On Tue, Apr 16, 2013 at 8:17 PM, Peter Spangler <[hidden email]> wrote:
For example, the original dependent variable of interest is in dollars (gross market value) and the IV is repeat buyers. Both are scale variables. I transformed them because the distributions were very skewed and so that they would share the same scale.
On Tue, Apr 16, 2013 at 5:31 PM, R B <[hidden email]> wrote:
Ryan,
Would it be correct to say that a 1% increase in the IV would predict, on average, a .558% increase in the DV? Such that a repeat-buyer increase of .2 would predict a $32 increase in GMV:
Change in DV = (.558/100) * 5735 = 32.0013
Unstandardized Beta: log_rb = .558
Mean GMV = $5735; Mean Repeat Buyers = 20
On Tue, Apr 16, 2013 at 5:40 PM, Peter Spangler <[hidden email]> wrote:
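A quick numeric check of the back-of-envelope arithmetic above, written in Python for illustration (the thread's own tool is SPSS). The slope, mean GMV, and mean repeat buyers are the values quoted in the posts; the log-log slope is read as "% change in Y per 1% change in X":

```python
# Values below come from the posts in this thread; this mirrors the
# back-of-envelope calculation, not a verified estimation method.
b = 0.558          # unstandardized slope from the log-log regression
mean_gmv = 5735.0  # mean GMV in dollars
mean_rb = 20.0     # mean repeat buyers

delta_x = 0.2                           # a 0.2 increase on a mean of 20 ...
pct_change_x = 100 * delta_x / mean_rb  # ... is a 1% increase in X
pct_change_y = b * pct_change_x         # predicts ~0.558% increase in Y
delta_y = (pct_change_y / 100) * mean_gmv
print(round(delta_y, 4))  # 32.0013
```

This reproduces the $32 figure above; note it is an approximation that is only accurate for small percentage changes around the means.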
In reply to this post by Peter Spangler
I know little about the use of statistics in economics except for the occasional example I have come across in a textbook and/or online forum. I think I dabbled in cost-benefit analyses a while back, but that was a long time ago. So bear with me for a moment... Additional questions:
What is the possible range of values that the dependent variable can take on (e.g., 1 dollar to infinitely many dollars, 1 dollar to a fixed upper limit, 0 to ...). I assume the values are positive integers, and that the distribution is positively skewed.
How in the world is the dependent variable (number of dollars spent) linked to the independent variable (repeat buyers)? In fact, what do you mean by repeat buyers? Repeat buyers of a specific product? So does that mean that each record represents a different product?
Sorry, but I am still not clear. Ryan On Tue, Apr 16, 2013 at 8:40 PM, Peter Spangler <[hidden email]> wrote:
In reply to this post by Peter Spangler
Peter, Without understanding your model, I will simply direct you to a specific answer with respect to interpretation: Go to the last section of this page that discusses interpretation of regression coefficients when the DV and predictor(s) are log-transformed. HTH, Ryan On Tue, Apr 16, 2013 at 8:50 PM, Peter Spangler <[hidden email]> wrote:
In reply to this post by Ryan
Thanks for your patience, Ryan. Range for the DV is 1 to 5 million. Positively skewed indeed. The IV can be thought of as individuals that have purchased a product from a distinct seller more than once. The greater the number of buyers that come back to purchase from the same seller, the greater the sales. Sent from my iPhone
In reply to this post by Ryan
Yes, this section is very helpful. I guess my question remains: if the unstandardized coefficient is .11, is it divided by 100 to get .11% before multiplying by the mean of the DV in order to get the actual unit increase in the DV? Sent from my iPhone
In reply to this post by Peter Spangler
Okay. I'm going to cut off this back-and-forth because it would take a long time to obtain all the information necessary for me to provide any substantive advice (e.g., are the DV units in hundreds, thousands, millions, etc.?). Let's not go down this path because time is against me. Perhaps somebody else will pick up where I have left off; I simply do not have the experience with these kinds of data. Moreover, I would need a lot more information before providing any advice.
I have provided you with information on how to interpret coefficients when the variables are log-transformed. Hope that information proves useful. I will make three general statements:
1. The solution for dealing with skewed data is not always a transformation (see the 3rd point).
2. There are no distributional assumptions about IVs in regression, but highly skewed IVs can have certain implications. Note that whether the skew runs in the same or a different direction for two variables influences the attainable range of the Pearson correlation.
3. There is a large family of exponential-family distributions besides the Gaussian (e.g., binomial, Poisson, negative binomial, gamma, beta, etc.) that may be entirely appropriate for modeling this type of data (dollars spent). There are zero-truncated variations of these models as well that may be appropriate. I'm not suggesting that linear regression is necessarily inappropriate, but you ought to familiarize yourself with generalized linear models and see how economists commonly model these data. I wouldn't be surprised to see that they do not always transform the DV; perhaps sometimes they consider other distributions.
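A tiny Python illustration of the point above: positively skewed "dollars" data can arise naturally from a non-Gaussian distribution rather than being something to transform away. The gamma parameters here are invented; this is a sketch, not a model of the actual data in this thread.

```python
import random
import statistics

# Simulate spending from a gamma distribution (shape 2, scale 500 are
# invented for illustration). Gamma draws are positive and right-skewed.
random.seed(0)
spend = [random.gammavariate(2.0, 500.0) for _ in range(5000)]

# For a positively skewed distribution the mean exceeds the median.
print(statistics.mean(spend) > statistics.median(spend))  # True
```

A generalized linear model with a gamma family and log link (e.g., via GENLIN in SPSS, as suggested later in the thread) models such data directly, without transforming the DV.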
Best, Ryan On Tue, Apr 16, 2013 at 9:03 PM, Peter Spangler <[hidden email]> wrote:
In reply to this post by Ryan
Correction: I *know*. May the grammar gods forgive me for the grammatically incorrect messages I post. ;-) Rarely do I take the time to double-check my messages. I usually type the message, submit, and move on... Ryan
In reply to this post by Peter Spangler
Considering the fact that you haven't even bothered to post the actual regression model, anyone jumping further into your rabbit hole is bound to become a mad hatter!
I decline! --
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
In reply to this post by Ryan
<insert tongue in cheek>
> Rarely do I take the time to double check my messages
Hey, you are weakening my soapbox about always checking syntax, because it is like any other kind of writing. <remove tongue from cheek>
Art Kendall
Social Research Consultants
On 4/16/2013 11:44 PM, R B [via SPSSX Discussion] wrote:
In reply to this post by David Marso
The regression model is simple linear using two log-transformed variables:
DV = nlog_gmv (scale variable in dollars, $1 - $5 million)
IV = nlog_rb (scale variable, the number of buyers that a seller had more than one transaction with)
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT nlog_gmv
  /METHOD=ENTER nlog_rb.
Coefficients: Unstandardized Beta, nlog_rb = .558
On Tue, Apr 16, 2013 at 9:00 PM, David Marso <[hidden email]> wrote: Considering the fact that you haven't even bothered to post the actual
Trying this again for clarity and completeness: My data consist of two scale variables, the DV (gmv) in dollars and the IV (repeat_buyers) in persons. Both variables were transformed to t_gmv and t_repeat_buyers: Log10 --> Z scores --> Mean = 50, SD = 10.
My goal is to calculate GMV in its original units (dollars) based on a one-unit (person) increase in Repeat Buyers. I essentially need to back-transform to calculate:
t_GMV = B0 + B1 (t_repeat_buyers) + E1
t_GMV = 5.37 + .426 (t_repeat_buyers) + E1
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT t_gmv
  /METHOD=ENTER t_rb.
Coefficients: Unstandardized Beta, constant = 5.37, t_rb = .426
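The transformation pipeline described above (log10 --> z score --> mean 50, SD 10) is invertible step by step. A hypothetical Python sketch, with invented GMV values (only the sequence of steps comes from this thread; the thread's own tool is SPSS):

```python
import math
import statistics

# Invented sample values for illustration only.
gmv = [120.0, 450.0, 2300.0, 18000.0, 95000.0]

logs = [math.log10(v) for v in gmv]
mu, sd = statistics.mean(logs), statistics.stdev(logs)

def to_t(v):
    """Forward: log10 -> z score -> T score (mean 50, SD 10)."""
    return 50 + 10 * (math.log10(v) - mu) / sd

def from_t(t):
    """Inverse: undo each step to recover dollars."""
    z = (t - 50) / 10
    return 10 ** (z * sd + mu)

round_trip = [from_t(to_t(v)) for v in gmv]
print(all(math.isclose(a, b, rel_tol=1e-9)
          for a, b in zip(gmv, round_trip)))  # True
```

One caution: back-transforming a *predicted mean* this way returns a geometric-mean-style prediction on the dollar scale, not the arithmetic mean, which is part of why the discussion below steers toward the elasticity interpretation instead.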
On Wed, Apr 17, 2013 at 8:48 AM, Peter Spangler <[hidden email]> wrote:
Peter, Okay. I've given this some thought... Taking the derivative of both sides of the log-log simple regression equation w.r.t. x yields a straightforward interpretation of the unstandardized slope; that is,
unstandardized slope = <unstandardized slope value> percent change in y given unit percent change in x. The unstandardized slope is the point elasticity of y with respect to x. I would abandon the notion of back-transforming the unstandardized slope from a log-log simple regression since the linear relationship is on a multiplicative or percentage scale. That's how I see it; perhaps someone else will have a different perspective. Frankly, I tend to avoid transforming variables as it tends to complicate interpretation. Furthermore, there is usually a misunderstanding as to when it is appropriate to employ certain transformations, and often I find that people (not you) mistakenly transform data for the wrong reason(s) (e.g., examining the distribution of the DV as opposed to the distribution of the residuals).
What you should ask yourself, IMHO: 1. Did you find that the assumption(s) of a simple linear regression model did not hold when using the variables in their original forms? If so, which assumption(s) were not tenable? How did taking the log of both variables resolve the problem(s)? You will need to be able to defend these transformations if and when you submit this for peer review.
2. Further, why did you standardize the variables after the logarithmic transformations? Again, you will need to defend this decision. While I can see why someone would perform a log transformation to linearize a relationship, I really do not see why one would standardize the variables to a mean of 50 and sd of 10 after the transformation.
HTH, Ryan On Wed, Apr 17, 2013 at 3:45 PM, Peter Spangler <[hidden email]> wrote:
I concur with Ryan's comment about people often transforming for the wrong reasons, and with the two questions he posed. As he says, people often fail to understand that the assumptions for OLS linear regression concern the errors, not to be confused with the residuals. I think the Wikipedia page on errors versus residuals is quite good.
http://en.wikipedia.org/wiki/Errors_and_residuals_in_statistics

The assumptions for OLS linear regression are that the errors are independently and identically distributed as Normal with a mean of 0 and variance = sigma-squared. In the usual notation, the errors are assumed to be i.i.d. N(0, sigma-squared). That's it. And as Herman Rubin has frequently reminded readers of the sci.stat.* newsgroups, the independence assumption is by far the most important one, followed by identically distributed (i.e., homoscedasticity).

One way to think of it is that if those assumptions were met perfectly, then the statistical tests associated with OLS regression would be exact tests; but as the assumptions are never met perfectly, the tests are always approximate, and the question is whether the approximation is good enough for them to be useful. (And yes, I am thinking of what George Box said about models being wrong, but still useful.)

As Ryan pointed out in another post, if the error distribution is too far from normal, one may wish to consider a generalized linear model that employs a different error distribution (e.g., via GENLIN). HTH.
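A small Python simulation of the errors-versus-residuals distinction discussed above: the i.i.d. N(0, sigma-squared) assumption concerns the unobservable errors, while the residuals are what we can actually inspect. All numbers here are invented for illustration.

```python
import random
import statistics

random.seed(1)
n = 200
x = [random.uniform(1, 10) for _ in range(n)]
errors = [random.gauss(0, 0.5) for _ in range(n)]   # the true errors
y = [2.0 + 0.6 * xi + ei for xi, ei in zip(x, errors)]

# Closed-form simple OLS: slope = cov(x, y) / var(x)
mx, my = statistics.mean(x), statistics.mean(y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
slope = sxy / sxx
intercept = my - slope * mx

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
# OLS residuals average to zero by construction, even though the
# underlying errors generally do not in any finite sample.
print(abs(statistics.mean(residuals)) < 1e-9)  # True
```

The residuals are constrained by the fit (they always sum to zero here), which is one reason they only approximate the behavior of the true errors.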
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/
"When all else fails, RTFM."
PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).
In reply to this post by Ryan
Ryan and Bruce, thank you very much indeed! After some further reading today, I better understand Ryan's interpretation that a single unit percent change in x predicts an <unstandardized slope value> percent change in y.
The reason I transformed the data was not only to handle a horrid positive skew but to minimize the variance among scores. I believe Andy Field mentions log transformation as a way of handling data that comes up significant on Levene's test of homogeneity of variance.
Log-transforming the variables, saving them as z scores, and setting the means and standard deviations removed the different units of some of the other variables (ratios, etc.) and allowed scores to be added to create an overall score that could rank cases.
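For what it's worth, the variance-compression effect of the log transform mentioned above is easy to see numerically. A Python illustration with invented values spanning a wide dynamic range:

```python
import math
import statistics

# Invented values spanning four orders of magnitude.
raw = [10.0, 100.0, 1000.0, 10000.0, 100000.0]
logged = [math.log10(v) for v in raw]  # becomes [1, 2, 3, 4, 5]

# Coefficient of variation (SD relative to mean) before and after.
cv_raw = statistics.stdev(raw) / statistics.mean(raw)
cv_log = statistics.stdev(logged) / statistics.mean(logged)
print(cv_log < cv_raw)  # True: relative spread shrinks dramatically
```

This shows why the transform tames the spread, though (per Ryan's and Bruce's points) shrinking variance is not by itself a justification; the regression assumptions concern the errors.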
Responses are interspersed below.
On Wed, Apr 17, 2013 at 10:37 PM, Peter Spangler <[hidden email]> wrote:
***You are welcome.
***Use of the term "horrid" suggests that you believe something is wrong with positively skewed data. It is not uncommon to observe positively skewed sample data arising from Poisson, negative binomial, and other distributions.
***Why would you want to minimize variance among scores?
***What do you think is the source of the heteroscedasticity? I fear that you are trying to force your data to conform to meet the assumptions of OLS regression without considering other estimation methods and models.
***As someone who lives in the world of psychometrics, what you just stated above is very concerning. A simple algebra trick does not give someone permission to sum scores across variables. I assume you have good reason to do so, aside from simply forcing the distributions to have the same mean and sd.
***I don't recall you stating that you were ranking cases, and I have no idea how that has anything to do with the two variables you described initially (but perhaps you did). ***Anyway, I will just assume that you understand what you are doing.
***Good luck. ***Ryan
I believe I mentioned a rabbit hole? One reason I rarely involve myself with stat discussions on X-L... On Wed, Apr 17, 2013 at 11:36 PM, R B [via SPSSX Discussion] <[hidden email]> wrote:
In reply to this post by Peter Spangler
At 11:48 AM 4/17/2013, Peter Spangler wrote:
> The regression model is simple linear using two log transformed
> variables: DV = nlog_gmv (scale variable in dollars, $1 - $5 million)
>
> IV = nlog_rb (scale variable, the number of buyers that a seller had
> more than one transaction with)

All right. First of all, others have noted that it's doubtful practice to transform variables to make them 'look' better -- to reduce skewness, for example. In addition to commonly being statistically inadvisable, it has the great drawback you've run into: when you make a transformation that doesn't have theoretical backing, you have a hard time understanding what the resulting model means.

Now, there are legitimate reasons to transform variables, especially when theory supports the transformation. In particular, when a variable has a very wide dynamic range (ratio of largest to smallest values), and the behavior over the whole range is of interest, a log transformation is frequently recommended. Taking the log transformation asserts, implicitly, that the same percentage change is about equally important over the whole range, and that the same absolute change is less important toward the high end of the range. There are often good reasons to accept this. Your case, where the dynamic range is 5,000,000:1, is a good one for log transformation.

Now, you're also log-transforming your independent variable. That gives you a power model: the model

log(gmv) = a*log(rv) + b

corresponds (taking anti-logs) to

gmv = exp(b)*(rv**a)

If you're fitting such a model, make sure it makes theoretical sense. Often in a case like yours, one log-transforms only the independent variable, fitting the model

log(gmv) = a*rv + b

That corresponds to an exponential growth model,

gmv = exp(b)*exp(a)**rv = B*A**rv

and it's one you may well consider, depending on the theory you're working from and the particulars of variable rv.
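The algebra behind the two model forms described above can be verified directly. A Python sketch with invented coefficients (natural logs used throughout to match the exp() form):

```python
import math

# Invented coefficients for illustration.
a, b = 0.558, 1.2
rv = 37.0

# Power model: log(gmv) = a*log(rv) + b  <=>  gmv = exp(b) * rv**a
power_from_logs = math.exp(a * math.log(rv) + b)
power_direct = math.exp(b) * rv ** a
print(math.isclose(power_from_logs, power_direct))  # True

# Exponential model: log(gmv) = a*rv + b  <=>  gmv = exp(b) * exp(a)**rv
exp_from_logs = math.exp(a * rv + b)
exp_direct = math.exp(b) * math.exp(a) ** rv
print(math.isclose(exp_from_logs, exp_direct))  # True
```

The two forms grow very differently in rv (polynomially versus exponentially), which is why the choice between them should rest on theory, not on which makes the data look tidier.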