Hello,
I’m working with a data set with 1600 observations from a total population of about 3200. The variables are mostly categorical, and mostly dichotomous. The data are made up of two strata, one of which is an entire population and the other a random sample. The first stratum makes up about 80% of the data and 36% of the entire population. The random sample stratum contains four components which potentially could have been strata as well, but weren’t sampled that way. The sampling plan was determined externally. Some findings will have to compare the five components (the stratum with the entire population and the four components within the other stratum). Others will be for the entire sample.
I first thought that the way to handle the data was to weight, and I used the formula given by Maletta (2007, available online), which divides the proportion of the entire population made up by each stratum by the proportion of the sample made up by that stratum. This reduces the influence of the first stratum while expanding the influence of the second. When I examine the differences among the five components using the SPSS Complex Samples procedure, the results come with acceptable margins of error, less than 5%.
Is this the best way to look at the data, given that I have a complete population for one stratum and relatively small samples from the four components that make up the other one? That is, should I honor the completeness of the data in the first stratum?
The write-up will be a report, rather than an article, and the audience is professional, but not necessarily research minded.
Many thanks,
Henry
Henry Ilian, Ph.D. ACS Office of Quality Improvement 150 William Street, 17th Fl New York, NY 10038 (212) 227-5414
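The weighting rule described in the post (each stratum's share of the population divided by its share of the sample) can be sketched in Python. This is only an illustration: the stratum counts are hypothetical placeholders rather than the study's figures, and `stratum_weights` is a name made up for this sketch.

```python
def stratum_weights(pop_sizes, sample_sizes):
    """Weight per stratum: (N_h / N) / (n_h / n)."""
    if pop_sizes.keys() != sample_sizes.keys():
        raise ValueError("population and sample must list the same strata")
    total_pop = sum(pop_sizes.values())
    total_sample = sum(sample_sizes.values())
    return {
        h: (pop_sizes[h] / total_pop) / (sample_sizes[h] / total_sample)
        for h in pop_sizes
    }


# Hypothetical counts, assumed for illustration only.
pop_sizes = {"census_stratum": 1200, "sampled_stratum": 2000}
sample_sizes = {"census_stratum": 1200, "sampled_stratum": 400}

weights = stratum_weights(pop_sizes, sample_sizes)

# With this rule the weighted case count equals the original sample size.
weighted_n = sum(weights[h] * sample_sizes[h] for h in weights)

for h, w in weights.items():
    print(f"{h}: weight = {w:.3f}")
print(f"weighted n = {weighted_n:.1f}")
```

With these placeholder counts the fully enumerated stratum is weighted down and the sampled stratum is weighted up, which is the behaviour the post describes.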
It always makes me nervous when someone tries to talk about the "total population" because (a) 99.9% of all research has to treat even a "population" as if it were a sample; (b) 99% of all questions that I have seen (sent to Lists) were wrong-headed where they tried to use "population statistics"; and (c) almost every audience expects to see the testing done in the ordinary way.
Population statistics, the ones with zero total variance, are most often applicable to incoming data on election eve. Beyond that, they arise in certain roles of administering limited resources, where the measurements can be assumed to have been made with (almost) no error....
Very rarely, "population statistics" do apply to research which is undertaken to draw inferences. Are you sure you have justification for it? (And, if so, your presentation *must* give a clear emphasis to the unusual choice.)
In this case, the sample description that follows is not one that I can parse. It seems that the "total population" is 3200, and the "data set" is 1600. However, the "first stratum" is "an entire population" which further is described as "80% of the data and 36% of the entire population". Despite having two evocations of "entire population", that *almost* parses: If the 80% of 1600 is a complete sampling of one characteristic, then it would be 40% of the 3200, rather than 36%. Is this merely a round-off error in your presentation?
- If you are truly comparing to "complete data" in the first stratum, which, further, is measured without error, then you might want a strategy that compares the other means to the fixed values that are the means for the first stratum. I don't know what the options are for doing that.
--
Rich Ulrich
Date: Fri, 23 May 2014 16:16:55 +0000
From: [hidden email]
Subject: Stratum containing entire population plus stratum containing a sample
To: [hidden email]
In reply to this post by henryilian
Rich, thanks for your response.
The reason for the difference between my 36% and your 40% is that I carelessly misstated the total population. It’s slightly more than 3500. Thanks for catching the error. The population is divided into five units of differing sizes. The data set contains roughly 1600 observations in two strata. The first stratum contains nearly the entire population of the largest of the five units. Actually, it’s 93%, so there is some sampling error, although in practical terms, I’m not sure how much difference that amount would make. The other stratum is a random sample of the rest. I’m primarily working with proportions of dichotomous variables.
Again, the problem is: can I do justice to the nearly complete sample from the one unit, while also reporting on the results for the entire sample? If that seems self-contradictory, it may be, and it may truly be a matter of one or the other. But I do want to make the most of the data I have, and I also want to avoid reporting misleading findings.
Henry
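The 36% versus 40% discrepancy discussed above is simple arithmetic, and a short Python check makes it explicit. The inputs are the rounded figures quoted in the thread (about 1600 observations, about 80% in the first stratum, a population of about 3200 as first stated and slightly more than 3500 as corrected), so the results are approximate; `share_of_population` is a name made up for this sketch.

```python
def share_of_population(n_obs, population):
    """Fraction of the population covered by n_obs observations."""
    return n_obs / population


sample_size = 1600               # "roughly 1600 observations"
first_stratum_fraction = 0.80    # "about 80% of the data"
first_stratum_n = sample_size * first_stratum_fraction

# Population as first stated, then as corrected (rounded to 3500 here).
share_as_first_stated = share_of_population(first_stratum_n, 3200)
share_as_corrected = share_of_population(first_stratum_n, 3500)

print(f"first stratum n: {first_stratum_n:.0f}")
print(f"share of 3200: {share_as_first_stated:.1%}")
print(f"share of 3500: {share_as_corrected:.1%}")
```

The first figure reproduces the 40% that Rich computed; the second comes out close to the 36% originally reported.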
- I will say directly: "misleading findings" is what we are likely to have on hand from anyone who is asking questions about using finite-population statistics. The proper uses are apt to be well-established in an area. Do you have a good literature available? Or, are you ground-breaking with the statistical approach?
I don't know how many complex-sampling experts we have reading the list. But I think you need to be more forthcoming about the study if you expect more concrete advice. What is your data about? What is a statement that you might make about the data?
--
Rich Ulrich
Date: Fri, 23 May 2014 19:05:27 +0000
From: [hidden email]
Subject: Re: Stratum containing entire population plus stratum containing a sample
To: [hidden email]
In reply to this post by henryilian
Rich,
The sample is unusual enough that I didn’t think to look for literature. I just assumed that there wouldn’t be any on this particular type of problem. I can only talk about the study in a general way, but the results will be used to guide decision makers in where to direct limited resources, either for training staff or for changing policies. For these kinds of decisions, small differences in, say, the proportion of times one task was accomplished as opposed to another don’t offer much guidance; only large ones do. So if one necessary task was accomplished 50% of the time, and another was accomplished 55% of the time, that basically counts as the same, especially if there was a third task, equally necessary, that was accomplished only 25% of the time. The issue here is to get the proportions correct and not let them be skewed by one stratum. Most of the variables will be reported as frequencies (in some cases, I’ll be cross-tabbing), and the areas with the strongest deficits will be those that are targeted.
Thanks,
Henry
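The concern about an overall proportion being skewed by one stratum can be illustrated with a small Python sketch that compares a pooled (unweighted) proportion with a stratified one, where each stratum's proportion is weighted by its share of the population. All counts are hypothetical; the 55% and 25% rates simply echo the example percentages in the post and are not findings. The function names are made up for this sketch.

```python
def pooled_proportion(successes, sample_sizes):
    """Unweighted proportion over all sampled cases."""
    return sum(successes.values()) / sum(sample_sizes.values())


def stratified_proportion(successes, sample_sizes, pop_sizes):
    """Sum over strata of (N_h / N) * p_h."""
    total_pop = sum(pop_sizes.values())
    return sum(
        (pop_sizes[h] / total_pop) * (successes[h] / sample_sizes[h])
        for h in pop_sizes
    )


# Hypothetical counts, assumed for illustration only.
pop_sizes = {"census_stratum": 1200, "sampled_stratum": 2000}
sample_sizes = {"census_stratum": 1200, "sampled_stratum": 400}
successes = {"census_stratum": 660, "sampled_stratum": 100}  # 55% and 25%

pooled = pooled_proportion(successes, sample_sizes)
stratified = stratified_proportion(successes, sample_sizes, pop_sizes)

print(f"pooled:     {pooled:.4f}")
print(f"stratified: {stratified:.4f}")
```

In this made-up case the pooled figure sits well above the stratified one because the fully enumerated stratum dominates the sample, which is the kind of skew the weighting is meant to remove.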
In reply to this post by henryilian
Rich,
It occurred to me, in thinking about what you had to say, that I missed your point about finite-population statistics. I also am not sure I understand what you are telling me. I know the size of the population and am using a weighting formula that takes it into account. From that point on, I'm using the SPSS Complex Samples procedures, and I confess I don't know whether or not they assume a finite population. My other choice would be to use the SPSS procedures that don't involve complex samples, but wouldn't that also give misleading results given the imbalance between strata?
Could you elaborate on your warning? I'd like to understand it.
Many thanks,
Henry
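To make the finite-population question more concrete, here is a minimal Python sketch of how a finite population correction changes the margin of error for a proportion. It uses one common textbook form of the correction, sqrt((N - n) / (N - 1)), applied to the usual normal-approximation standard error. It is not a description of what the SPSS Complex Samples procedures do, and the numbers are hypothetical apart from the 93% sampling fraction mentioned in the thread.

```python
import math


def margin_of_error(p, n, population=None, z=1.96):
    """Normal-approximation margin of error for a proportion.

    If `population` is given, apply the textbook finite population
    correction sqrt((N - n) / (N - 1)) to the standard error.
    """
    se = math.sqrt(p * (1 - p) / n)
    if population is not None:
        se *= math.sqrt((population - n) / (population - 1))
    return z * se


# Hypothetical unit: 930 of 1000 cases observed (a 93% sampling fraction).
p_hat, n, big_n = 0.5, 930, 1000

moe_infinite = margin_of_error(p_hat, n)
moe_finite = margin_of_error(p_hat, n, population=big_n)
moe_census = margin_of_error(p_hat, big_n, population=big_n)

print(f"without correction: {moe_infinite:.4f}")
print(f"with correction:    {moe_finite:.4f}")
print(f"full census:        {moe_census:.4f}")
```

Under these assumptions the correction shrinks the margin of error substantially when most of a unit has been observed, and it drops to zero for a complete census, which is the "zero variance" situation Rich cautions about.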