SPSSX Discussion

ANCOVA with a nominal variable only relevant to a subset of data

weasel

ANCOVA with a nominal variable only relevant to a subset of data

Have a complex ANCOVA I want to run. A simplified example to illustrate the question I have:

Suppose Strength is a function of age, gender (male/female) and, AMONG MALES, diet (beef, fish, or vegetarian). I don't care about diet in females, and those cells are blank for females in the data set.

If I were doing this through dummy coding, I would code this using four dummy variables to account for gender and testosterone level, and the model would be as follows:

strength = B1(female) + B2(MaleBeef) + B3(MaleFish) + B4(MaleVeg) +B5(age)

To ask if sex matters, I would test mean strength of females vs weighted average of the 3 male categories.
To test if diet matters, I want to test the null hypothesis that the mean strength in the 3 male categories is equal.
To ask if age matters, I would use a straightforward t-test of whether B5 = 0.

Can I do this using GLM without having to manually dummy code everything? My concern is that if I put the categorical variable diet in the fixed factors, I am not sure how SPSS will handle it, because this variable is only relevant to the male subset and females all have missing values. I don't think there is an option to exclude pairwise for GLM. If I were dummy-coding manually, females would get a 1 for "female" and zeroes for all the male categories.

Also, how do I ask SPSS to test the specific hypothesis I am asking (if I'm correct, does B2=B3=B4, and does B1 = weighted average of B2,B3,B4)?
I have tried searching online. Do I need to use the LMATRIX or KMATRIX subcommand (which I have not done before)?
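
For reference, the two hypotheses in the question can be written out as linear contrasts on the coefficients of the model above. This is only a sketch: the weights w_j are an assumption about what "weighted average" means here (each male group's share of the male sample, with n_j the size of that group), and nothing below is SPSS syntax.

```latex
% Diet effect among males (2 df): all three male-group coefficients are equal
H_0^{\text{diet}}:\quad B_2 - B_3 = 0 \quad\text{and}\quad B_2 - B_4 = 0

% Sex effect (1 df): females vs. the weighted average of the male groups
H_0^{\text{sex}}:\quad B_1 - \left(w_2 B_2 + w_3 B_3 + w_4 B_4\right) = 0,
\qquad w_j = \frac{n_j}{n_2 + n_3 + n_4}
```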

The actual model I have is a bit more complex than above, in that the female category is also broken down into subgroups not relevant to men, that I want to compare with each other, so there would actually be more dummy variables (for example, femaleTall vs femaleShort), but to get to the core of the problem I've posed the question as above.
Thanks for the help.

Bruce Weaver

Re: ANCOVA with a nominal variable only relevant to a subset of data

Administrator

The Nabble archive for the list shows that this was posted on Feb 07, 2016 at 1:30pm. But it also says, "This post has NOT been accepted by the mailing list yet."

Anyway, I think I would try using the REGRESSION procedure to estimate the model hierarchically, as follows:

Step 1: Strength = b0 + b1*Age + b2*Female
Step 2: Strength = b0 + b1*Age + b2*Female + b3*Beef + b4*Fish

In the step 1 model, the t-test on b2 gives you a test of females vs males (controlling for age). And the test on the change in Rsq from step 1 to step 2 gives a 2-df test of the null hypothesis that model fit does not improve when you split the males into 3 dietary groups.

And of course, variables Female, Beef & Fish are indicator variables (1=Yes, 0=No).
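
The two-step logic above can be sketched outside SPSS to show where the 2-df test comes from. The following is a minimal illustration in plain Python, not the REGRESSION procedure itself: the 13 data rows are invented for the example, and `sse_of_fit` is a hypothetical helper that solves the normal equations directly.

```python
def sse_of_fit(X, y):
    """Residual sum of squares after least-squares regression of y on the columns of X."""
    k = len(X[0])
    # Augmented normal equations [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda row: abs(A[row][c]))
        A[c], A[p] = A[p], A[c]
        piv = A[c][c]
        A[c] = [v / piv for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    coef = [row[k] for row in A]
    return sum((yi - sum(bj * xj for bj, xj in zip(coef, r))) ** 2
               for r, yi in zip(X, y))

# Invented data: (strength, age, Female, Beef, Fish); male vegetarians are all zeros
data = [
    (51, 48, 1, 0, 0), (58, 59, 1, 0, 0), (58, 51, 1, 0, 0), (49, 35, 1, 0, 0),
    (45, 51, 0, 1, 0), (62, 30, 0, 1, 0), (55, 42, 0, 1, 0),
    (50, 31, 0, 0, 1), (47, 45, 0, 0, 1), (53, 38, 0, 0, 1),
    (32, 60, 0, 0, 0), (41, 44, 0, 0, 0), (38, 52, 0, 0, 0),
]
y = [float(d[0]) for d in data]
X1 = [[1.0, d[1], d[2]] for d in data]               # Step 1: constant, Age, Female
X2 = [[1.0, d[1], d[2], d[3], d[4]] for d in data]   # Step 2: adds Beef, Fish

n = len(y)
mean_y = sum(y) / n
sst = sum((v - mean_y) ** 2 for v in y)
sse1, sse2 = sse_of_fit(X1, y), sse_of_fit(X2, y)
r2_1, r2_2 = 1 - sse1 / sst, 1 - sse2 / sst

q = 2                # predictors added at Step 2
df_error = n - 5     # Step 2 estimates 5 coefficients (b0..b4)
f_change = ((r2_2 - r2_1) / q) / ((1 - r2_2) / df_error)
print(f"R-sq step 1 = {r2_1:.3f}, step 2 = {r2_2:.3f}")
print(f"F change = {f_change:.3f} on ({q}, {df_error}) df")
```

The F for the change in R-sq is the same quantity as the difference in residual sums of squares divided by its df, over the Step 2 residual mean square. A p-value would still have to come from an F table or from the SPSS output.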

HTH.

--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING:
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).

Maguin, Eugene

Re: ANCOVA with a nominal variable only relevant to a subset of data

Am I missing something? As I understand the claim, the male by diet interaction is expected to be significant and, maybe, within that interaction there is expected to be some sort of ranking among the diet categories. But doesn't this require diet data for ALL respondents--women as well as men?

Gene Maguin

Bruce Weaver

Re: ANCOVA with a nominal variable only relevant to a subset of data

Administrator

Hi Gene. I believe the OP wants to carry out two tests (controlling for age in both cases):

1. Female vs Male.
2. Test of the null hypothesis that there is no effect of diet among males.

I think the hierarchical regression approach I suggested does exactly that. I don't see anything suggesting the need for an interaction term, which is fortunate, because as you suggest, to determine if the effect of diet varies by sex, one would have to have females in the 3 dietary conditions too.

Cheers,
Bruce

Rich Ulrich

Re: ANCOVA with a nominal variable only relevant to a subset of data

In reply to this post by weasel

I can't guess how much the "simplified" version changes things, but I see
difficulties that have not been mentioned. The design is "incomplete" in
that you have scores that depend on sex. That gives you confounding by
sex that can be awkward to eliminate for the matter of testing. But the
measures are confounded beyond that.

Your text mentions testosterone, which your model does not. Sex and T. are
no doubt strongly confounded, on top of physical size and strength being
confounded with sex. But I would also want to look at the scales of measurement,
and whether they are apt to be skewed and non-homogeneous across the range:

Testosterone and strength both would seem potentially problematic.
Not only are outliers likely, but the extra variance for both will occur for males.
Is T one of those hormones that is best accommodated by taking the log?

Then there is age: As they age, people get stronger and then they get weaker.
Does your age range leave you satisfied with a linear trend?

Diet is shown as three dummies: Are these exclusive categories that add to 1?

With /so much/ confounding by sex, including the choice of measures, I am sure
that I would want to try my regressions for single-sex samples at the start. That also
lessens the urgency for taking transformations, by reducing the range of scores.

That does leave out the comparison of M vs F. Well, look at those results when you
get them. What is it that you really want to compare? Which regression lines? Which
intercepts?

Using GLM seems like "an interesting exercise" -- Can you code up the tests so that
they /exactly/ replicate the results from the single-sample regressions? (Probably
not, if the tests share error terms.) Can you code up the tests so that they /very nearly/
replicate the separate, robust results?

--
Rich Ulrich

Ryan

Re: ANCOVA with a nominal variable only relevant to a subset of data

I had some similar concerns about fitting a traditional single regression model, given the imbalance. It is possible to simultaneously estimate gender-specific regressions in a single model.

Ryan

weasel

Re: ANCOVA with a nominal variable only relevant to a subset of data

In reply to this post by Bruce Weaver

Bruce, thanks so much for your reply. I think the significance of the F change between the two steps will be equivalent to a test of whether the subcategories for men differ significantly.

bwgriffin

Re: ANCOVA with a nominal variable only relevant to a subset of data

In reply to this post by weasel

Hi weasel -

I don't think the analysis you wish to perform is possible since the data are not available for females, or more specifically, since the data are blank/missing (or possibly coded as zero).

You can perform this comparison:

strength = B1(female) + B2(age)

and B1 will provide an adjusted test of the sex difference, controlling for age.

However, this model is not possible or will be incorrect:

strength = B1(female) + B2(MaleBeef) + B3(MaleFish) + B4(MaleVeg) +B5(age)

If the diet data is blank (missing) in SPSS for females, then females will be automatically dropped from the analysis due to missing data, so the model reduces to this:

strength = B2(MaleBeef) + B3(MaleFish) + B4(MaleVeg) +B5(age)

Note the absence of B1(female).

If you included values of 0 for the diet categories for females in the SPSS data file (which would be a mistake), then you would not have the model you think you have. Instead, you would have a model that compares males and females to whatever your diet reference category happens to be, using incorrect female diet data. It will be misleading and incorrect.

I think the only solution is to compare diet among males only.

You also asked about the need for dummy coding if using GLM - no, that is not necessary, just enter diet (coded as 1, 2, 3 or similar) as a factor and GLM will handle the coding.

Bruce Weaver

Re: ANCOVA with a nominal variable only relevant to a subset of data

Administrator

As weasel described things in the original post, the data file would look something like this (with Y = some measure of strength):

 Y  Age  F  Beef  Fish  Veg
51   48  1     0     0    0
58   59  1     0     0    0
58   51  1     0     0    0
45   51  0     1     0    0
50   31  0     0     1    0
32   60  0     0     0    1
etc.

If Step 1 includes Age and F (Female) as predictors of Y, adding any two of Beef, Fish or Veg to the model on Step 2 asks if the fit of the model improves when the males are treated as 3 distinct dietary groups rather than as a single group. Suppose Beef and Fish are added to the model on Step 2. Males who had Veg will have 0s for all 3 of the indicator variables in the model (F, Beef and Fish). Females will have F=1, Beef = 0 and Fish = 0, and they will not be dropped from the analysis.
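
To make the coding concrete, here is a small sketch of how that layout could be derived from a raw file in which diet is blank for females. It is plain Python rather than SPSS syntax, the field names (`sex`, `diet`, and so on) are hypothetical, and the six rows are the ones from the example table.

```python
# Raw layout as weasel described it: diet is blank (None) for females
raw = [
    {"y": 51, "age": 48, "sex": "F", "diet": None},
    {"y": 58, "age": 59, "sex": "F", "diet": None},
    {"y": 58, "age": 51, "sex": "F", "diet": None},
    {"y": 45, "age": 51, "sex": "M", "diet": "beef"},
    {"y": 50, "age": 31, "sex": "M", "diet": "fish"},
    {"y": 32, "age": 60, "sex": "M", "diet": "veg"},
]

def code_row(row):
    """Turn one raw record into the four 0/1 indicators; a blank diet codes as 0, not missing."""
    return {
        "Y": row["y"],
        "Age": row["age"],
        "F": int(row["sex"] == "F"),
        "Beef": int(row["diet"] == "beef"),
        "Fish": int(row["diet"] == "fish"),
        "Veg": int(row["diet"] == "veg"),
    }

coded = [code_row(r) for r in raw]

# Listwise deletion only removes cases with a missing value on a model variable
complete = [c for c in coded if None not in c.values()]

for c in coded:
    print(c)
print(f"{len(complete)} of {len(coded)} cases have complete data")
```

Because every case ends up with exactly one of F, Beef, Fish and Veg equal to 1, the four indicators define mutually exclusive groups. One caveat under this sketch: a male whose diet was genuinely unknown would also code as all zeros, so such cases would need to be set to missing separately.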

As Rich and Ryan have already pointed out, sex and diet are indeed confounded here. Some posters seem to be saying that because of that, one cannot even make the comparisons of interest to weasel. I, on the other hand, think he can make those comparisons. However, he must take great care in how any differences among the 4 groups (Females and the 3 Male dietary groups) are interpreted. Because sex and diet are confounded, there is no way to know how much of a difference to attribute to sex and how much to diet, or to some combination of the two. I suspect weasel knows that, and still wants to make the comparisons. I also think that quite often, comparisons of interest in the real world are confounded to some degree, and that the goal is not always to tease apart the contributions of each individual variable.

Now I'll don my flame-proof suit and duck down behind my desk. ;-)

bwgriffin

Re: ANCOVA with a nominal variable only relevant to a subset of data

Hi Bruce --

You won't need a fire suit because of me -- I try NOT to inflame folks!

The original post contained this:

"I don't care about diet in females, and those cells are blank for females in the data set."

If the cells are truly blank, then data from females will be discarded by SPSS as missing so the model reduces to males only.

If the cells are not blank but instead contain zeros like the data example you posted, then the issue, I think, will be with hypothesis testing.

The regression estimates -- I think -- will provide correct partial estimates, or in this case pairwise comparisons between sexes and among diets, but the standard errors for those diet estimates (and t and F tests among diets) will be incorrect (I think!).

The reason is that including female data increases the sample size without providing any more information about diet. Tests among diets (or squared semi-partial correlations for model increases) will then be based on this larger sample size, so the model will appear to have more power than is actually there.

In short, I think including female data that contributes nothing to diet information artificially increases sample size and power for comparisons among male diets. If the sample size is large, then the effects will likely be trivial, but if n is small, hypothesis testing results for diet comparisons could be misleading.

Rich Ulrich

Re: ANCOVA with a nominal variable only relevant to a subset of data

Your first mention of power, below, is okay. The next paragraph misses it.

You say, "the model will appear to have more power than is actually there."
Yes, it will /appear/ to have more power if you look only at the N; it will not
have more power, though, if you look at the reduced size of the effect, and note
that the female N will contribute nothing. That is: If half the sample is
female, then the measured /effect/ will be half of what it would measure in
males alone ... assuming, of course, that the confounding-by-sex does not
affect the effect size.

Including females will contribute no (appropriate) information for the male-diet
contrasts, and (therefore) they contribute nothing to the power. Presumably,
you make sure to take out a Sex effect. When you get a test that is proper for
the diet, a /proper/ power analysis would be based on the N for Males alone.

That is: If you were predicting the power of your analysis, you would have to take
into account the fact that there are Female cases diluting the observable effect.
You only obtain extra power by including /informative/ cases ... not by merely
increasing the N.
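
A small numeric sketch may help separate the two points. To keep the arithmetic exact it leaves Age out of the model, which is a simplifying assumption; with a common Age slope fitted to both sexes, the diet sum of squares would no longer match the males-only value exactly. The scores are invented.

```python
def ss_within(groups):
    """Sum of squared deviations of each score from its own group mean."""
    total = 0.0
    for g in groups:
        m = sum(g) / len(g)
        total += sum((v - m) ** 2 for v in g)
    return total

# Invented strength scores
females = [51, 58, 58, 49]
beef, fish, veg = [45, 62, 55], [50, 47, 53], [32, 41, 38]
males = beef + fish + veg

# Males only: one-way comparison of the three diets
ss_diet_males = ss_within([males]) - ss_within([beef, fish, veg])
r2_change_males = ss_diet_males / ss_within([males])
df_error_males = len(males) - 3

# Females included: Step 1 = sex only, Step 2 = sex plus diet within males
sse_step1 = ss_within([females, males])
sse_step2 = ss_within([females, beef, fish, veg])
ss_diet_all = sse_step1 - sse_step2
r2_change_all = ss_diet_all / ss_within([females + males])
df_error_all = len(females) + len(males) - 4

print(f"SS for diet: males only = {ss_diet_males:.1f}, with females = {ss_diet_all:.1f}")
print(f"R-sq change: males only = {r2_change_males:.3f}, with females = {r2_change_all:.3f}")
print(f"Error df:    males only = {df_error_males}, with females = {df_error_all}")
```

In this setup the sum of squares attributed to diet is the same with or without the females, while the R-sq change shrinks because the total variation now includes them. What the females do change is the error term: their within-group variation and degrees of freedom are pooled into it.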

--
Rich Ulrich

Bruce Weaver

Re: ANCOVA with a nominal variable only relevant to a subset of data

Administrator

In reply to this post by Bruce Weaver

Here is a further thought on the confounding. Earlier, assuming the data file looks something like this...

Y Age F Beef Fish Veg
51 48 1 0 0 0
58 59 1 0 0 0
58 51 1 0 0 0
45 51 0 1 0 0
50 31 0 0 1 0
32 60 0 0 0 1
etc.

...I suggested a hierarchical regression model with two steps, as follows:

Step 1: Strength = b0 + b1*Age + b2*Female + error
Step 2: Strength = b0 + b1*Age + b2*Female + b3*Beef + b4*Fish + error

As has been noted several times, there is obvious confounding of Sex and Diet. But after further consideration, I now believe that the confounding is limited to the F vs M comparison in Step 1. If that t-test is significant, there is no way to know how much of the difference is due to sex and how much is due to diet.

The null hypothesis that Diet (within males) has no effect is tested via the 2-df F-test for the change in R-sq from Step 1 to Step 2. If that F-test is significant, the interpretation is clear and unambiguous: There are differences among the 3 (male) dietary groups. I do not see any possibility of confounding with Sex in that test.
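The R-sq change test described above can be sketched outside SPSS as follows. This is a minimal illustration, not SPSS output: the first six data rows are the ones shown in the post, the remaining rows are made up, and `ols_rss` is a hypothetical helper that fits by the normal equations. The statistic is F = ((RSS1 - RSS2) / 2) / (RSS2 / (n - 5)), which is algebraically the same as the F for the change in R-sq.

```python
def ols_rss(X, y):
    """Residual sum of squares from a least-squares fit (normal equations)."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

# Columns: Y, Age, F, Beef, Fish. Male vegetarians are the reference group.
# First six rows are from the post; the rest are made up (assumption).
data = [
    (51, 48, 1, 0, 0), (58, 59, 1, 0, 0), (58, 51, 1, 0, 0),
    (45, 51, 0, 1, 0), (50, 31, 0, 0, 1), (32, 60, 0, 0, 0),
    (49, 40, 0, 1, 0), (47, 62, 0, 1, 0), (53, 45, 0, 0, 1),
    (55, 38, 0, 0, 1), (36, 52, 0, 0, 0), (30, 44, 0, 0, 0),
    (54, 42, 1, 0, 0),
]
y = [row[0] for row in data]
step1 = [[1, age, f] for _, age, f, _, _ in data]
step2 = [[1, age, f, beef, fish] for _, age, f, beef, fish in data]

rss1 = ols_rss(step1, y)   # Step 1: Age + Female
rss2 = ols_rss(step2, y)   # Step 2: adds Beef and Fish
n = len(y)
df_num = 2                 # two indicators added on Step 2
df_den = n - 5             # n minus the 5 Step 2 coefficients
f_change = ((rss1 - rss2) / df_num) / (rss2 / df_den)
print(f_change, df_num, df_den)
```

The resulting value would be referred to an F distribution with (2, n - 5) degrees of freedom. Note that the females stay in both steps and so contribute to the error term and to the Age slope.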

p.s. - Another issue (or maybe I should say "can of worms") we have not discussed thus far is that some folks are very opposed to using ANCOVA when the groups are not the result of random assignment. ;-)

Bruce Weaver wrote

As weasel described things in the original post, the data file would look something like this (with Y = some measure of strength):

Y Age F Beef Fish Veg
51 48 1 0 0 0
58 59 1 0 0 0
58 51 1 0 0 0
45 51 0 1 0 0
50 31 0 0 1 0
32 60 0 0 0 1
etc.

If Step 1 includes Age and F (Female) as predictors of Y, adding any two of Beef, Fish or Veg to the model on Step 2 asks if the fit of the model improves when the males are treated as 3 distinct dietary groups rather than as a single group. Suppose Beef and Fish are added to the model on Step 2. Males who had Veg will have 0s for all 3 of the indicator variables in the model (F, Beef and Fish). Females will have F=1, Beef = 0 and Fish = 0, and they will not be dropped from the analysis.
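The coding just described can be written out as a small sketch. The function name `indicators` and the string labels are hypothetical, chosen only for illustration:

```python
def indicators(sex, diet):
    """Return (F, Beef, Fish), with male vegetarians as the reference group."""
    if sex == "F":
        # Diet is not used for females, so a blank diet value is harmless here.
        return (1, 0, 0)
    return (0, int(diet == "beef"), int(diet == "fish"))

# One example record per group; None stands for a blank diet cell.
records = [("F", None), ("M", "beef"), ("M", "fish"), ("M", "veg")]
coded = [indicators(sex, diet) for sex, diet in records]
print(coded)
```

Every record gets a complete set of indicator values, so no case has missing data on the predictors and none would be excluded for that reason.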

As Rich and Ryan have already pointed out, sex and diet are indeed confounded here. Some posters seem to be saying that because of that, one cannot even make the comparisons of interest to weasel. I, on the other hand, think he can make those comparisons. However, he must take great care in how any differences among the 4 groups (Females and the 3 Male dietary groups) are interpreted. Because sex and diet are confounded, there is no way to know how much of a difference to attribute to sex and how much to diet, or to some combination of the two. I suspect weasel knows that, and still wants to make the comparisons. I also think that quite often, comparisons of interest in the real world are confounded to some degree, and that the goal is not always to tease apart the contributions of each individual variable.

Now I'll don my flame-proof suit and duck down behind my desk. ;-)

bwgriffin wrote

Hi weasel -

I don't think the analysis you wish to perform is possible since the data are not available for females, or more specifically, since the data are blank/missing (or possibly coded as zero).

You can perform this comparison:

strength = B1(female) + B2(age)

and b1 will provide an adjusted test of sex difference controlling for age.

However, this model is not possible or will be incorrect

strength = B1(female) + B2(MaleBeef) + B3(MaleFish) + B4(MaleVeg) +B5(age)

If the diet data is blank (missing) in SPSS for females, then females will be automatically dropped from the analysis due to missing data, so the model reduces to this:

strength = B2(MaleBeef) + B3(MaleFish) + B4(MaleVeg) +B5(age)

Note absence of b1(female).
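The dropping of cases described here is ordinary listwise deletion, which can be mimicked generically as below. This is a sketch of the idea only, not of SPSS internals; the records and field names are made up for illustration.

```python
# Made-up records (assumption); None stands for a blank diet cell.
rows = [
    {"sex": "F", "diet": None,   "age": 48, "strength": 51},
    {"sex": "F", "diet": None,   "age": 59, "strength": 58},
    {"sex": "M", "diet": "beef", "age": 51, "strength": 45},
    {"sex": "M", "diet": "fish", "age": 31, "strength": 50},
    {"sex": "M", "diet": "veg",  "age": 60, "strength": 32},
]

# Listwise deletion: keep a case only if every model variable is present.
complete = [r for r in rows if all(v is not None for v in r.values())]

sexes_left = {r["sex"] for r in complete}
print(len(complete), sexes_left)
```

Once every female is removed, sex is constant in the remaining cases, so there is nothing left for a female coefficient to estimate.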

If you included the values of 0 for diet categories for females in the SPSS data file (which would be a mistake), then you will not have the model you think you have. Instead, you will have a model that compares males and females to whatever your diet reference category happens to be, using incorrect female diet data. It will be misleading and incorrect.

I think the only solution is to compare diet among males only.

You also asked about the need for dummy coding if using GLM - no, that is not necessary, just enter diet (coded as 1, 2, 3 or similar) as a factor and GLM will handle the coding.