|
Hello all!
I'm writing here to get some input from people who hopefully know more about statistics than I do.

I have a problem with the distribution of my dataset (don't we all?). I have 4 measures of eating disorder pathology, in which a score of '0' corresponds to an absence of such pathology. All 4 measures have about 40-60% zero scorers (n = 1080), so the data are heavily positively skewed. I also have an age and a weight variable, with no particular non-normality problems.

I wish to run both a two-way ANOVA on each of my 4 measures (grouped according to age and weight) and correlational/regression analyses between my 4 measures, age and weight. Of course, my dataset violates the normality assumption. Also, I have read that neither transforming the problematic variables nor using non-parametric tests will do the trick, since all those who scored '0' would be assigned the same rank.

My question is obviously what is reasonable to do in this situation. Which is more problematic: the skewness itself or the high frequency of equal scores? I have read that it is possible to exclude everyone who scores '0' from correlational analyses and only analyse those who have some degree of eating disorder pathology. Is this feasible?

Any comments are appreciated! Regards, Lasse |
|
One option is to do two separate analyses:
- a logistic regression on disorder yes/no
- after removing the zeros, a normal OLS regression on those who do have the disorder

Another option is to perform a zero-inflated negative binomial regression (not implemented in SPSS, if I'm correct, but available in R). I don't have experience with it myself yet, but I will in a few weeks because I have similar data distributions.

Maurice
--
Maurice Vergeer
Department of Communication, Radboud University (www.ru.nl), Nijmegen, The Netherlands

=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L. For a list of commands to manage subscriptions, send the command INFO REFCARD. |
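The two-part idea in the reply above can be sketched as follows. This is only a minimal illustration in Python using the standard library; the scores are made up and `split_two_part` is a hypothetical helper, not a function from SPSS or R. It shows how the outcome is split into a zero/non-zero indicator (the outcome for a logistic regression) and the non-zero scores (the outcome for the OLS step), and that the overall mean is the product of the two parts.

```python
from statistics import mean

def split_two_part(scores):
    """Split scores into a 0/1 indicator and the list of non-zero scores."""
    indicator = [1 if s > 0 else 0 for s in scores]   # part 1: any pathology?
    positives = [s for s in scores if s > 0]          # part 2: severity if present
    return indicator, positives

# Made-up scores for illustration only; half of them are zero.
scores = [0, 0, 0, 0, 0, 0.5, 1.0, 1.5, 2.0, 5.0]

indicator, positives = split_two_part(scores)
p_nonzero = mean(indicator)          # estimated P(Y > 0)
mean_if_nonzero = mean(positives)    # estimated E[Y | Y > 0]

# The overall mean is the product of the two parts.
overall = p_nonzero * mean_if_nonzero
print(p_nonzero, mean_if_nonzero, overall)
```

In a real analysis each part would get its own regression on age and BMI; the split itself is all that is shown here.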
|
In reply to this post by BanLas
At 01:58 AM 10/28/2010, BanLas wrote:
> Hello all!

You haven't said much about the other three measures, and whether they truly represent a scale of progressively worse pathology. Assuming that they do, what you may have here is a variable with a Poisson distribution, which I would guess is the case anyway. That is, it is reasonable to guess that most people do not have the pathology, only a very few have extreme pathology, and the categories decline in frequency from "none" to "extreme." If that is the case, then by definition the distribution is not "normal," and a transformation is not going to make it normal, because the mode sits at the lowest possible score rather than in the middle of the distribution.

I'd be tempted to do an ANOVA in which your eating pathology variable is used to define 4 groups, and the null hypothesis is that the other variables are distributed the same regardless of which eating pathology group the subject is in.

Bob Schacht
Northern Arizona University |
|
In reply to this post by BanLas
Thank you for all the helpful comments! No, there are no repeated measures. Eating disorder pathology is measured once for each participant. Also, considering the explanation of a Poisson distribution posted above, I would say that all 4 of my measures of eating disorder pathology have a Poisson distribution. Thanks, -Lasse |
|
Lasse,
A standard Poisson distribution may or may not be the optimal distribution to select for your particular problem. Poisson regression is used for modelling count data that can theoretically range from 0 to infinity. If you have [zero-]inflation, truncation and/or a high conditional variance relative to the conditional mean, then it might be worthwhile modifying the log-likelihood function accordingly. Also, based on the information you've provided, it seems to me that an ordered logits equation should probably be considered as well. Finally, if you decide to model all 4 measures simultaneously (multivariate model), which may or may not be a reasonable approach, then you will likely need to account for within-subject correlation.
Ryan
|
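The two warning signs mentioned in the reply above (zero inflation and a variance that is high relative to the mean) can be checked with a few lines. This is a rough sketch in standard-library Python on made-up counts; `poisson_diagnostics` is a hypothetical helper, and it looks only at the marginal distribution, not at the conditional mean and variance given predictors. It applies only if the scores really are counts.

```python
import math
from statistics import mean, pvariance

def poisson_diagnostics(counts):
    """Return (variance/mean ratio, observed zero share, Poisson-expected zero share)."""
    m = mean(counts)
    dispersion = pvariance(counts) / m          # about 1 for Poisson data
    observed_zeros = counts.count(0) / len(counts)
    expected_zeros = math.exp(-m)               # P(Y = 0) under Poisson(m)
    return dispersion, observed_zeros, expected_zeros

# Made-up counts for illustration: 60% zeros plus a long right tail.
counts = [0] * 6 + [1, 2, 4, 13]

dispersion, observed_zeros, expected_zeros = poisson_diagnostics(counts)
print(dispersion, observed_zeros, expected_zeros)
```

A ratio well above 1, or far more zeros than a Poisson with the same mean would produce, suggests that a plain Poisson model is too restrictive for data like these.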
|
In reply to this post by BanLas
Please cobble together a small set of data that describes what you have.
Please detail the number of variables and the levels that are possible for each. Do you have 4 groups of respondents, where each group has a level of pathology? I.e., you do not have 4 measures of pathology, but 1 measure with 4 levels. Or do you have 4 measures of different kinds of pathology, all with a yes/no (2-level) response scale or some other response scale? If a person does not score zero, what other possible scores can they have? Which are the DVs (dependent variables) and which are the IVs (independent variables)? Are you interested in finding out how the people in the 4 groups can be distinguished? Are you interested in predicting which of the 4 groups someone would be in? Etc.
Art Kendall
Social Research Consultants |
|
In reply to this post by BanLas
Some have requested more information about variables and aims of analyses, so here goes:

Sample size: n = 1080.

I have 4 different measures of eating disorder pathology (each measure consists of items in a 7-point Likert-style format); each measure provides a global mean score indicating the severity of eating disorder pathology, ranging from 0 upwards. These measures are therefore continuous. A zero score is interpreted as a total absence of pathology. Needless to say, these measures are all positively skewed, with as much as 40-60% of my sample scoring zero (or very close to zero) across all 4 measures. I have two additional variables; age and weight, which both are categorical (4 categories for both variables).

The first thing I would be interested in is whether there are any group differences in eating disorder pathology. Therefore, I was thinking of doing 4 two-way ANOVAs, one for each eating disorder measure (that is, age and BMI are independent variables and the measures of eating disorder pathology are dependent variables). I would also like to know, with post hoc tests, between which groups any differences lie.

Secondly, I would like to know the strength of the relationships between the 4 measures of eating disorder pathology, i.e. whether any of the measurements covary more than others. A related question is how well 3 of the measures perform in predicting the fourth. I was thinking of a multiple regression analysis, with three of the measures as predictors and the last one as the dependent variable. (I am not suggesting doing 4 separate regression analyses; I am only interested in predicting one of the measures.)

But, alas, there is the problem with skewness / high frequency of zero scorers, and I was wondering what statisticians think about this:
- What are the consequences of doing parametric tests on a sample distribution that clearly violates the normality assumption?
- Are there any alternative approaches other than transforming variables or using non-parametric tests, which I have read are all poor alternatives when dealing with distributions that are skewed and contain many zero scores?
- Are bootstrapping or permutation tests a feasible alternative?

Regards, -Lasse |
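For readers wondering what the permutation test asked about above looks like in practice, here is a minimal sketch in standard-library Python. The two groups and their scores are made up, and `permutation_test` is a hypothetical helper rather than a library routine. It compares the observed difference in group means with the differences obtained after shuffling group labels, so it does not rely on normality, although the many tied zeros still limit what it can detect.

```python
import random
from statistics import mean

def permutation_test(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                      # reassign group labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add 1 to numerator and denominator so the p-value is never exactly 0.
    return (hits + 1) / (n_perm + 1)

# Made-up, zero-heavy scores for two hypothetical groups.
normal_weight = [0, 0, 0, 0, 0, 0, 0.2, 0.5, 1.0, 1.5]
overweight = [0, 0, 0, 0.5, 1.0, 2.0, 2.5, 3.0, 4.0, 5.0]

p_value = permutation_test(normal_weight, overweight)
print(p_value)
```

The same shuffling logic extends to other statistics (medians, proportion of non-zero scores); a two-way design with interaction needs a more careful permutation scheme than this one-factor sketch.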
|
Since the 4 measures of pathology are all on the same response scale (an extent response scale), how are they intended to be the same/different? Are they intended to be items in a summative scale? If all four are measured on each subject, why do you not consider them to be some form of repeated measure? What are the groups? The only IVs you mentioned are continuous.
Art Kendall
Social Research Consultants |
|
Administrator
|
Hi Art. BanLas said first that "age and weight, which both are categorical (4 categories for both variables)." But in the next paragraph: "...age and BMI are independent variables and the measures of eating disorder pathology are dependent variables". So I assume the explanatory variables are actually Age and BMI, and that BanLas has either carved each of them into 4 categories, or only has access to them in that form. BanLas -- do you have the raw data for both Age and BMI? If so, why do you want to carve them into categories? Generally speaking, this is a bad idea. You use up degrees of freedom needlessly, and you throw away information (which results in loss of power). Where possible, continuous variables should be treated as continuous. HTH.
--
Bruce Weaver bweaver@lakeheadu.ca http://sites.google.com/a/lakeheadu.ca/bweaver/ "When all else fails, RTFM." PLEASE NOTE THE FOLLOWING: 1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above. 2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/). |
|
Art: all 4 measures differ in that they measure distinct facets of eating disorder pathology, i.e. one measures psychosocial impairment, another restrained eating, etc. Some of the measures can indeed be separated to produce an overall or global score of eating disorder pathology, but for this study I wish to treat them separately, as there are some important theoretical and empirical differences among them that I wish to elucidate.
Bruce: both age and BMI (sorry I said weight at first) are measured continuously. However, my grouping of both BMI and age follows certain clinical conventions, and I therefore want to proceed with categorising, to see whether there are any significant differences in, for instance, restrained eating between normal-weight and overweight women. For these sorts of group analyses, age and BMI are IVs and the eating disorder pathology measures are DVs. The original question was to what extent my skewed data and high frequency of zero scores affect parametric tests, and what potential alternatives there are. -Lasse |
|
Administrator
|
Regarding clinical conventions about imposing cut-points on a continuous variable, I understand that clinicians may need to sort people into categories ultimately--e.g., treat or don't treat. But from a statistical point of view, I think one should delay that categorization as long as possible.

Here's a simple example. Suppose you have a simple two-variable situation with BMI as the explanatory variable, and one of your scales as the outcome variable. If you carve BMI into the usual categories, but treat the outcome variable as continuous, you'll be doing a one-way ANOVA. In the ANOVA model, the fitted value for any individual is the mean of the category they belong to. So two people who differ quite a bit in BMI, but who fall within the same category, will have the same fitted value in this model. On the other hand, two people who differ by only a tiny amount, but who happen to fall in two different categories, could have substantially different fitted values of Y. Do you really want a model like that?

<soapbox> What I would prefer is to fit a simple linear regression model, with X = actual BMI. THEN, if the conventional BMI categories are needed, draw vertical lines on the X-axis to indicate the conventional cut-points, and apply the cut-points to the fitted values. This makes more sense to me. </soapbox>

HTH.
--
Bruce Weaver |
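The contrast described in the post above, category means versus a straight line in actual BMI, can be made concrete with a small sketch. It uses standard-library Python on made-up numbers; the cut-points at 18.5, 25 and 30 are an assumption about what "the usual categories" means, and `bmi_band`, `band_mean_fit` and `linear_fit` are hypothetical helpers.

```python
from statistics import mean

def bmi_band(bmi):
    """Assumed conventional BMI bands with cut-points at 18.5, 25 and 30."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"

def band_mean_fit(x, y):
    """One-way ANOVA fitted values: each person gets the mean of their band."""
    by_band = {}
    for xi, yi in zip(x, y):
        by_band.setdefault(bmi_band(xi), []).append(yi)
    band_means = {band: mean(values) for band, values in by_band.items()}
    return [band_means[bmi_band(xi)] for xi in x]

def linear_fit(x, y):
    """Simple linear regression fitted values (least squares)."""
    mx, my = mean(x), mean(y)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    intercept = my - slope * mx
    return [intercept + slope * xi for xi in x]

# Made-up data: BMI values and scores on one hypothetical outcome scale.
bmi = [19.0, 24.8, 25.1, 29.5]
score = [0.5, 1.5, 1.6, 2.6]

anova_fitted = band_mean_fit(bmi, score)
line_fitted = linear_fit(bmi, score)

for row in zip(bmi, anova_fitted, line_fitted):
    print(row)
```

With these numbers the people at BMI 19.0 and 24.8 get the same ANOVA fitted value, while the people at 24.8 and 25.1 are separated by a jump in the ANOVA fit and by almost nothing in the linear fit.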
|
I see your point regarding categorising data, and I will consider doing something similar to what you suggested (but with both BMI and age as predictors)! But the question still remains: how will the regression model perform when the DV is such a skewed score, with median and mode values of 0?
|
|
Again, I suggest you look into the two options I posted earlier. |
|
In reply to this post by BanLas
Lasse,
I would advise against fitting a general linear model (e.g., ANOVA) if you are observing a large percentage of zeros along with high skew. If the response options represent counts, then I would consider fitting a zero-inflated count model (e.g., zero-inflated Poisson, zero-inflated negative binomial), possibly adjusting the log-likelihood function to account for an upper truncation if your counts cannot go above a specified value (e.g., 7). If you want to model four count dependent variables simultaneously, then you would enhance the model to account for a multivariate response (e.g., multivariate zero-inflated Poisson, possibly with an upper truncation). Unfortunately, I do not believe that SPSS is capable of fitting this type of model. One could certainly fit such a model via the NLMIXED procedure in SAS. There are other options, such as dichotomizing the dependent variables, as suggested by others. If you were to dichotomize the dependent variables, then you might consider fitting a GEE/generalized linear model. Alternatively, you might find that it makes sense to collapse a couple of the adjacent categories and then consider fitting some sort of multivariate ordered logits model. Based on the information you have provided, I cannot say which is the optimal approach.
You have received, and will likely continue to receive, different types of recommendations. You should really consider reaching out to a statistician in your area [to whom you can provide the entire picture with respect to your data and research questions] in order for him/her to help you make some of these important decisions.
Best wishes,
Ryan
|
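To make the zero-inflated Poisson mentioned above a little more concrete, here is a small sketch of its probability function and log-likelihood in standard-library Python. The counts and the parameter values (`pi` for the share of structural zeros, `lam` for the Poisson rate) are made up for illustration and are not fitted; the upper truncation and the multivariate extension are not shown.

```python
import math

def zip_pmf(k, pi, lam):
    """P(Y = k) for a zero-inflated Poisson with inflation share pi and rate lam."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        return pi + (1 - pi) * poisson   # structural zeros plus Poisson zeros
    return (1 - pi) * poisson

def zip_log_likelihood(counts, pi, lam):
    """Log-likelihood of independent counts under the zero-inflated Poisson."""
    return sum(math.log(zip_pmf(k, pi, lam)) for k in counts)

# Made-up counts with 60% zeros.
counts = [0, 0, 0, 0, 0, 0, 1, 2, 3, 4]

# pi = 0 reduces the model to a plain Poisson; its rate is set to the sample mean.
poisson_ll = zip_log_likelihood(counts, pi=0.0, lam=1.0)
# Illustrative (not optimised) zero-inflated parameters.
zip_ll = zip_log_likelihood(counts, pi=0.55, lam=2.2)
print(poisson_ll, zip_ll)
```

For these counts the zero-inflated version has the higher log-likelihood even with hand-picked parameters, which is the kind of evidence a proper fit and model comparison would formalise.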
|
In reply to this post by BanLas
Lasse,
You seem to want to treat a couple of continuous variables (BMI, Age) as categorical Independent Variables, and a number of categorical pathology variables as continuous Dependent Variables.

As Bruce W. points out, there are pretty good reasons for NOT categorizing BMI and Age until you are actually in a prescriptive clinical situation (and even there the boundaries between categories are subject to change, as you must know). I've certainly had to point out to a number of clinicians that their findings based on self-imposed categorization of continuous variables will be completely useless if, as often happens, the cutpoints between categories change in the near future, as they have done quite recently for pathologies like obesity, diabetes, etc. And obviously, in a categorization of Age or BMI, there will be adjacent continuous values that fall into different categories. This both lowers the power of your analysis and highlights the inadequacy of many prescriptive clinical diagnostics.

Why not test the hypotheses that Age and BMI (as continuous, dependent variables) do not/do vary between categories of your eating disorder pathologies? That would be a more conventional ANOVA, and not so fraught with self-imposed aberrant distributions. It would be more valid, and equally interesting (I think), to see whether a certain stage of a pathology is defined by having a significantly lower/higher Age or BMI than another stage of the same pathology.

If you are required by convention, peer or supervisory pressure to categorize variables which clearly ought to be regarded as continuous, then perhaps look into a wholly categorical analysis. Perhaps log-linear models?

regards, Ian |
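As a pointer to what a wholly categorical analysis involves, the simplest log-linear model for a two-way table is the independence model, whose fitted counts are the familiar row total times column total over the grand total. The sketch below, in standard-library Python with made-up counts and hypothetical helper names, computes those fitted counts and the Pearson chi-square statistic; it is a starting point only, not a full log-linear analysis.

```python
def expected_counts(table):
    """Expected cell counts under independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def pearson_chi_square(table):
    """Pearson chi-square statistic comparing observed and expected counts."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

# Made-up counts: rows = BMI band (normal, overweight),
# columns = pathology (absent, present).
table = [[60, 40],
         [40, 60]]

chi2 = pearson_chi_square(table)
dof = (len(table) - 1) * (len(table[0]) - 1)
print(chi2, dof)
```

Adding age band as a third factor turns this into a three-way table, which is where log-linear models with interaction terms become useful.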
