SPSSX Discussion

Re: Sample Means

Posted by Spousta Jan on Dec 21, 2006; 3:05pm
URL: http://spssx-discussion.165.s1.nabble.com/Sample-Means-tp1072828p1072843.html

Just one very basic remark: Chi-square tests weren't created for testing
differences in means, but for testing differences in shapes of discrete
distributions. It is always better to use scissors for cutting paper and
not for screwing :-)

In other words, it is possible to have a very significant Chi-square test
and zero difference in means.
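
A small numeric sketch of this point. The two answer distributions below are
invented for illustration (they are not Samir's data). Both are symmetric
around 4 on a 7-point scale, so their means are identical, yet the Pearson
chi-square statistic for the 2 x 7 table lies far beyond 12.592, the 5%
critical value for 6 degrees of freedom.

```python
# Hypothetical counts of answers 1..7 in two groups of 500 (not real data).
group_a = [100, 50, 50, 100, 50, 50, 100]   # flat, heavy at the extremes
group_b = [10, 40, 100, 200, 100, 40, 10]   # peaked in the middle

def mean_from_counts(counts):
    """Mean of a 1..k scale given the count of each answer."""
    return sum(score * c for score, c in enumerate(counts, start=1)) / sum(counts)

def chi_square_homogeneity(row1, row2):
    """Pearson chi-square statistic for a 2 x k table of counts."""
    n1, n2 = sum(row1), sum(row2)
    total = n1 + n2
    stat = 0.0
    for o1, o2 in zip(row1, row2):
        col = o1 + o2
        e1, e2 = col * n1 / total, col * n2 / total   # expected counts
        stat += (o1 - e1) ** 2 / e1 + (o2 - e2) ** 2 / e2
    return stat

mean_a = mean_from_counts(group_a)
mean_b = mean_from_counts(group_b)
chi2 = chi_square_homogeneity(group_a, group_b)
CRITICAL_5PCT_DF6 = 12.592   # chi-square critical value, df = 7 - 1 = 6

print(mean_a, mean_b)              # the two means
print(chi2 > CRITICAL_5PCT_DF6)    # whether the shapes differ at the 5% level
```

The test reacts to the difference in shape and says nothing about the means.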

Jan

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Arthur Kramer
Sent: Thursday, December 21, 2006 3:36 PM
To: [hidden email]
Subject: Re: Sample Means

Samir,

If I were you I would separate the students from the other subjects in
the dataset. Make 2 new files--one with students only; one without
students--and create a new variable in each file that identifies which
file it is and populate that field with an identifier for the file. That
is not difficult; it can be done with a "select if" syntax statement,
using whatever you have that identifies students, or with a
drop-down from "data" on the toolbar. Just remember to save any new
files under new names so you don't destroy your original file. (I
don't use version 14 yet, so I don't know how to work with the two files
together; I would merge these two files, using another new file name so
as not to destroy the files I just created.) Then I could do any test with
these two groups and they will be independent of each other. I still think
that with files of this size any difference in means will be
statistically significant, and effect size is the way to go with this
type of analysis.
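
A minimal sketch of the two-group comparison and the effect size, written in
Python rather than SPSS and using only the standard library. The two short
lists are invented for illustration; the function names are hypothetical, and
the choice of the Welch form of the t statistic and of Cohen's d with a pooled
standard deviation is an assumption, not something stated in this thread.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

def welch_t(a, b):
    """Welch t statistic for two independent groups (unequal variances allowed)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

# Hypothetical 7-point answers, invented for illustration only.
students = [2, 3, 3, 4, 2, 3, 1, 4, 3, 4]
non_students = [4, 5, 3, 4, 5, 4, 3, 5, 4, 4]

d = cohens_d(students, non_students)
t = welch_t(students, non_students)
# Normal approximation to the two-sided p-value; reasonable only for large
# groups such as 500 vs. 500, not for the ten-case toy lists used here.
p_approx = 2 * (1 - NormalDist().cdf(abs(t)))
print(round(d, 2), round(t, 2), p_approx)
```

With groups of 500 the p-value will be tiny for almost any visible
difference, which is why d is the more informative number here.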

The chi-square is a good way to go because your data are essentially
non-parametric (i.e., ordinal) in nature, unless you can document that
the "space" between 1 and 2, 3 and 4, etc. are based on a common metric
among all of your subjects.

Arthur Kramer, Ph.D.
Director of Institutional Research
New Jersey City University
-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Samir Omerovic
Sent: Thursday, December 21, 2006 3:01 AM
To: [hidden email]
Subject: Re: Sample Means

Hi again to all,

Thank you all for contributing to this discussion. I must admit that
I had difficulty following some of the posts, so my reply comes late.
Anyway, after reading all the posts I realized that I did not find the
solution to my problem, or rather I found so many that I cannot pick the
one that works best. Independent t-test, z-test, one-sample test,
Cohen's d... they all seem not to work for me, since I have not read about
these tests being used in a situation like mine.
Let me ask this again: I have a survey done with 1000 respondents, and
among these 1000 I have 500 students (the other 500 are not
students). The total of 1000 respondents has a mean value of X,X and
the 500 students have a mean value of Y,Y. I am wondering if there is a test
that can tell me if these two mean values are significantly different.
My problem lies in the fact that the mean X,X (1000 respondents) has
been calculated with the 500 students included, so the two are obviously
dependent.
One of my friends suggested the following: since I have answers on a
7-point scale, I could maybe use Chi-square, taking the frequencies of
answers of the 1000 respondents as expected values and the frequencies of
answers of the 500 students as test values. The problem is the same:
the frequencies of the 500 students are included in the frequencies of the
1000 respondents. And is it OK to use Chi-square here, or not at all?
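
A short arithmetic sketch of the overlap problem. The overall mean is a
weighted average of the student and non-student means, so the gap between
"all respondents" and "students" is a fixed fraction of the gap between
non-students and students, and the latter two groups do not overlap. The
numbers are the ones from the original question in this thread (overall mean
3.5, student mean 2.9); the function name is hypothetical.

```python
def complement_mean(n_total, mean_total, n_sub, mean_sub):
    """Mean of the cases outside the subgroup, recovered from the totals."""
    n_rest = n_total - n_sub
    return (n_total * mean_total - n_sub * mean_sub) / n_rest

n_total, mean_total = 1000, 3.5       # all respondents
n_students, mean_students = 500, 2.9  # students only

mean_rest = complement_mean(n_total, mean_total, n_students, mean_students)
share_rest = (n_total - n_students) / n_total

# Identity: total - students == share_rest * (non-students - students)
gap_vs_total = mean_total - mean_students
gap_vs_rest = mean_rest - mean_students
print(mean_rest, gap_vs_total, share_rest * gap_vs_rest)
```

Testing students against non-students therefore answers the same question
without the dependence.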

So... I do not know what to do.

Thanks once more to all

Samir

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Spousta Jan
Sent: Wednesday, December 20, 2006 5:20 PM
To: [hidden email]
Subject: Re: Sample Means

Yes, you are right, Stephen, from a general point of view. It is really
interesting how many different scenarios can be hidden in such a simple
situation. But if we return to the original question...

> what test should I use if I need to assess the significance of the mean
> of some subgroup in comparison to the mean of the total population. For
> example I have 1000 respondents that answered a question on a 7-point
> scale. Their mean/average is 3,5. I have 500 students among those 1000
> respondents with mean/average 2,9. I want to know if these values are
> significantly different.

...then it seems to me (taking into account both my and Samir's somewhat
limited knowledge of English) that the 1000 persons - probably a sample
from a much bigger population - are both students and non-students, and
that these 500 just happened to be students - they answered Yes when
asked "Are you a student?", while the others answered No. Therefore some
of the scenarios seem not applicable here.

But the real problem is, of course, how to help Samir in his first steps
over the deep swamp of applied statistics and not confuse him even more
than needed...

Greetings,

Jan

-----Original Message-----
From: Statisticsdoc [mailto:[hidden email]]
Sent: Wednesday, December 20, 2006 4:57 PM
To: [hidden email]
Cc: Spousta Jan
Subject: Re: Sample Means

Stephen Brand
www.statisticsdoc.com

Jan,

I think there is still some confusion about the research question that
Samir is trying to answer, and even the sampling design. I will try to
cover the various possibilities.

Your posting is quite correct if we assume that there are only 500
students in the population of 1,000 cases (i.e., the 1,000 cases are
made up of 500 students and 500 non-students). If in fact the larger
population of 1000 cases contains only 500 students, then there is no
need to utilize inferential statistics - there is nothing to infer.
There is no null hypothesis to test, both means are constants, and what
you say is correct. Samir may still be interested in knowing whether
the difference between two subpopulation means is interesting and
meaningful, but that is not a question of statistical significance in
the sense of testing an inference about population parameters from sample
statistics. As another poster pointed out, computing the effect size
would be informative.

On the other hand, if the 500 students comprise a sample of students
that was drawn in some way from a larger population, the additional
procedures are justified (either the z-statistic or the one-sample
t-test).

One possibility is that the 500 students were a sample drawn from the
population of 1,000. That is, there are 1,000 students, and Samir has
drawn a subsample of 500 of them. Samir may be interested in knowing
whether the sampling process that he used was unbiased. Assuming that
he knows not only the mean but the population standard deviation from
the population of 1,000, he can compute the distribution of sampling
means with n=500 and apply the z-statistic to calculate the likelihood
of obtaining the observed sample mean if the sampling process was random
and unbiased.
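
A hedged sketch of that z calculation. The population standard deviation of
1.8 is a made-up value (the thread never states it), and the function name is
hypothetical. The optional finite population correction is an addition not
mentioned in the post; it matters when 500 cases are drawn without
replacement from only 1,000.

```python
from math import sqrt
from statistics import NormalDist

def z_for_sample_mean(sample_mean, pop_mean, pop_sd, n, pop_size=None):
    """z statistic for a sample mean when the population SD is known.

    If pop_size is given, the standard error is shrunk by the finite
    population correction for sampling without replacement.
    """
    se = pop_sd / sqrt(n)
    if pop_size is not None:
        se *= sqrt((pop_size - n) / (pop_size - 1))
    return (sample_mean - pop_mean) / se

POP_SD = 1.8  # hypothetical value; the thread never states the SD

z_plain = z_for_sample_mean(2.9, 3.5, POP_SD, n=500)
z_fpc = z_for_sample_mean(2.9, 3.5, POP_SD, n=500, pop_size=1000)
p_two_sided = 2 * NormalDist().cdf(-abs(z_plain))
print(z_plain, z_fpc, p_two_sided)
```

A large |z| would suggest that the 500 were not a random, unbiased draw from
the 1,000.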

Another possibility is that the sample of 500 cases was drawn from some
population other than the 1,000 cases. This is the scenario that I had
in mind when I posted that the one-sample t-test would be justified.
In this instance, Samir would be interested in testing the hypothesis
that the sample of 500 students was drawn from a population whose mean
was equal to the mean of the population of 1,000 cases.

My conclusion is that Richard and I are both right, and so are you.

Cheers,

Stephen Brand

P.S. I think that I might use this example as an extra-credit question
on my next stats exam :)

Jan Spousta Wrote:

Now it is my turn to support Richard a bit :-)

If the 1,000 cases were the whole available population and 500 of them
were students, then the one-sample procedure would still be
_unjustified_, because then both the population mean and the
subpopulation mean are constants and it is nonsense to test the
difference between two constants. If the two are different, then the
difference is always "significant" in the exact meaning of the word.

The interesting case is when the sampled population is _almost_ the
whole available population (e.g. five students from the 500 are
missing), but then the statistics start to be rather complicated and you
still cannot use the "standard" techniques under Compare Means in SPSS.
Ask Marta, she will tell you...

Jan

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Statisticsdoc
Sent: Tuesday, December 19, 2006 10:42 PM
To: [hidden email]
Subject: Sample Means

Stephen Brand
www.statisticsdoc.com

Richard,

Thanks for citing me on both sides of this discussion :) Let me say a
little more about why I would accept that 1,000 cases can constitute a
population, and under what conditions.

It is not too hard to imagine population definitions that encompass
small numbers of people (e.g., all of the left-handed residents of the
town of Exeter, Rhode Island; the Fall 2006 intake of a small college).

The question of whether you accept that 1,000 cases make up a population
depends on the definition of the population. If these 1,000 cases are
all of the potential members of the population, then the mean of those
cases constitutes the population mean. Whatever random processes might
have influenced the mean score of that population, that score is the
population parameter. We are not trying to estimate a parameter of a
wider population from which we have obtained the 1,000 cases. In this
instance, the one-sample procedure is justified.

Granted, you might say that the left-handed residents of Exeter, or the
2006 intake of a small college, constitute a sub-set of your population
of interest, but then I think that you have to allow that these cases do
not exhaust the potential membership of the population (which might
constitute the left-handed population of Rhode Island, or the various
cohorts of potential incoming first year students), and then your means
become sample statistics, not population parameters. BTW, in this
instance, the Exeter sample is not a very random one :)

It all depends on where the boundaries of the population are drawn.

Best,

Stephen Brand

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Richard Ristow
Sent: Tuesday, December 19, 2006 2:59 PM
To: [hidden email]
Subject: Re: Significant difference - Means

To weigh in with two comments:

At 03:54 AM 12/19/2006, Spousta Jan wrote:

>The error of that 3.5 is about sqrt(1/1000) = 0.03 while the error of
>2.9 for students is about sqrt(1/500) = 0.045. That is both errors are
>of the same order of magnitude and the population error cannot be
>neglected in this case.

I'd like to second, and emphasize, this. Jan is clearly right here,
where the two groups are the same size. However, the same thing holds
when the sizes are quite different.

First, the t-test algorithm correctly allows for the increased precision
in measuring the mean in the larger group. Replacing it by a constant
only 'gains' you a little precision you don't really have.

Second, inequality of group size matters less than one might think.
Roughly, precision goes as the square root of sample size. (Under 'nice'
conditions, that's exact: standard error of estimate goes as the inverse
square root of sample size.) That means increasing the sample size ten-fold
leaves the SEE still about 1/3 of the size it had - quite a long way from
letting it be considered a constant.
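
The square-root rule can be checked with a few lines of arithmetic. This is a
sketch assuming a unit standard deviation, as implied by Jan's sqrt(1/n)
figures; the function name is hypothetical.

```python
from math import sqrt

def standard_error(sd, n):
    """Standard error of a mean: SD divided by the square root of n."""
    return sd / sqrt(n)

SD = 1.0  # unit SD, as implied by Jan's sqrt(1/n) figures

se_total = standard_error(SD, 1000)     # all respondents
se_students = standard_error(SD, 500)   # students only
# Effect of a ten-fold larger group on the standard error
ratio_tenfold = standard_error(SD, 10 * 500) / standard_error(SD, 500)

print(round(se_total, 3), round(se_students, 3), round(ratio_tenfold, 3))
```

The ratio is 1/sqrt(10), so even a group ten times larger keeps roughly a
third of the uncertainty.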

And at 10:42 AM 12/19/2006, Statisticsdoc (Stephen Brand) wrote:

>If your population consists of the 1000 students, then the mean of 3.5
>is a population parameter, and you would be justified in using the
>one-sample t-test suggested by John.

(This won't be quite fair to Stephen Brand, who'd also written "Formally
one should test the null hypothesis that the two samples have the same
mean, by using the independent groups t-test.")

There's a philosophical position, which I agree with, that will hardly
ever accept something like "[my] population consists of the 1000
students." The argument is that, even if those 1,000 students are all
you've ever seen or ever will see, their observed values constitute a
set generated by an underlying random mechanism, and that randomness
must be allowed for in estimation exactly as if you were aware of
100,000 similar students.

('Generated by an underlying random mechanism' is sometimes expressed as
'drawn from a conceptually infinite population.' However, while this is
technically accurate, I don't blame anyone who considers a 'conceptually
infinite population' a very odd notion.)

--
For personalized and experienced consulting in statistics and research
design, visit www.statisticsdoc.com