Hi,
I want to compare demographic variables for veterans in my 4-month study sample with demographic data from the VA for all veterans who attended the emergency clinic during the month after my study ended. I have the mean age, proportion of males, etc. for both the study sample and for all veterans, but I don't know how to do a statistical test looking for significant differences. Thanks for any help.
Jan
For continuous variables, I assume you want to use a t-test. If you have the mean, SD and N for each group, you can perform a one-way ANOVA (with the ONEWAY procedure), using a matrix data file as input. E.g.,

MATRIX DATA VARIABLES=Group ROWTYPE_ Score /FACTORS=Group.
BEGIN DATA
1 N 56
2 N 71
1 MEAN 22.98
2 MEAN 25.78
1 STDDEV 8.79
2 STDDEV 9.08
END DATA.
ONEWAY Score BY Group /MATRIX=IN(*).

If you want to report it as a t-test, t = SQRT(F).

For categorical variables, I assume you want a chi-square test of association. Again, all you need is the cell counts for the contingency table. The trick is to use WEIGHT CASES. Here's an example where the demographic variable has 3 levels. If you have only 2 levels, delete the rows with DEMOGVAR = 3. If you have more levels, add more rows as needed. And obviously, replace the CELL_COUNT values with the ones from your table.

DATA LIST LIST /demogvar (f2.0) group (f2.0) cell_count (f5.0).
BEGIN DATA.
1 1 9
2 1 17
3 1 22
1 2 21
2 2 13
3 2 8
END DATA.
WEIGHT BY cell_count.  /* This is the key line.
CROSSTABS demogvar BY group /CELLS = COUNT EXP /STATISTICS = CHISQ.

HTH.
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/
"When all else fails, RTFM."
Or, for t-tests, you can use the SPSSINC SUMMARY TTEST extension command, which also has a dialog box interface. The command and its required Python programmability plugin can be downloaded from SPSS Developer Central, www.spss.com/devcentral.
HTH,
Jon Peck
Senior Software Engineer, IBM
312-651-3435
Colleagues,

This is not an SPSS question (at least not yet).

I am seeking advice on the appropriate test for comparing two non-independent samples when the non-independence cannot be modeled.

The proportions are drawn from the same employee population (~700, response rate of ~50%), surveyed one year apart. An example of an actual comparison is 98.4% vs. 96.1% between time 1 and time 2.

The problem, as I see it, is that the two samples are not independent, but there is no ID, so neither a dependent t-test nor a mixed model can be used. I did find a test for comparing proportions from two independent groups.

What is the risk of violating the assumption of independence? Inflated Type I error?
As far as I know there is no appropriate test for this situation, but I thought I'd check with minds greater than mine...
Thank you,
John
Not sure I'm a greater mind, but here goes:

(1) Simple stuff first: if you are doing t-tests, the general formula for the t-test is the following:

Obtained t = (M1 - M2) / sqrt[VarErr1 + VarErr2 - 2*r*SE1*SE2]

where M1 = mean of group 1, M2 = mean of group 2, VarErr1 = error variance of group 1, VarErr2 = error variance of group 2, r = Pearson r between group 1 and group 2 values, SE1 = standard error of group 1, SE2 = standard error of group 2, and 2 = constant (the number 2).

If you cannot calculate r, you have to assume that it is equal to zero, which makes the t-test denominator = sqrt[VarErr1 + VarErr2]. This denominator will be larger than the one you would get if r were known. The good news is that if the t-test is significant under the assumption of r = 0.00, then it has to be significant if you can calculate r (NOTE: r is typically a positive value -- a negative r should cause you to re-examine your data). The bad news is that if the t-test is non-significant, it could be so because there is no real difference, or because you failed to find a significant difference because you could not adjust (reduce) your denominator appropriately.

So, treating your data as independent groups makes the test more conservative, or less powerful. I am open to correction on these points.

(2) It seems to me that you should be able to get an estimate of the Pearson r through bootstrapping or some other simulation procedure. If there is a positive correlation between time 1 and time 2, then, assuming data consisting only of 0 and 1, time 1 zeros should co-occur with time 2 zeros at a greater-than-chance level, and the same holds for ones, even if they are not matched up properly. I haven't thought this through, but perhaps someone more familiar with bootstrapping with correlation has more wisdom.
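For anyone who wants to try formula (1) from summary statistics alone, here is a minimal SPSS sketch. All of the input values are made up for illustration; substitute your own means, error variances, and assumed r.

* Hypothetical summary values -- replace with your own.
DATA LIST FREE / m1 m2 varerr1 varerr2 r.
BEGIN DATA
25.0 23.5 1.44 1.69 0.00
END DATA.
COMPUTE se1 = SQRT(varerr1).
COMPUTE se2 = SQRT(varerr2).
* Obtained t per the formula above; with r = 0 this reduces to the independent-groups denominator.
COMPUTE tval = (m1 - m2) / SQRT(varerr1 + varerr2 - 2*r*se1*se2).
EXECUTE.
LIST m1 m2 r tval.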
-Mike Palij
New York University
As usual, Mike is giving solid advice. The only point at which I did a (small) double-take was where he said, "If you cannot calculate 'r', you have to assume that it is equal to zero." I was reminded of situations where I did not know the value of some parameter and was therefore advised to do a "sensitivity analysis" -- i.e., do the computation a few times with a range of plausible values plugged in for the unknown parameter, to determine how sensitive the result is to the value of that parameter. That was when I was working for people doing medical research, and it seemed that the term "sensitivity analysis" was well known in those circles. I had never heard of it prior to that (my background to that point had been in experimental psychology). The obvious potential problem here is convincing other people that the values you plugged in are plausible. ;-)
HTH.
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/
"When all else fails, RTFM."
As usual, Bruce gives good advice. ;-) What he is suggesting, I believe, might be something like having the Pearson r set to values of 0.00, 0.10, 0.20, 0.30, and so on, and seeing what happens to the obtained t-value (i.e., does it become statistically significant). One might also use Cohen's recommended values for small, medium, and large effect sizes for r (his "Power Primer" article would have the values -- a Google Scholar search may turn up the article if one doesn't have access to PsycInfo/Articles).

I hesitated making such a recommendation because I think it may be possible to come up with an empirically derived value for the Pearson r given the data (though this might be difficult). Ultimately, what one decides to do depends upon the question(s) one wants to answer and how important it is to get to the "truth" of the situation.

-Mike Palij
New York University
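A minimal SPSS sketch of that kind of sensitivity analysis, sweeping an assumed r across a grid while treating the two proportions as means of 0/1 variables. The n and the grid of r values are made-up placeholders, and the z-based p-value leans on the normal approximation, so treat this as a sketch rather than a definitive implementation.

* Sweep an assumed r across a grid; replace p1, p2, and n with your own values.
DATA LIST FREE / r (F4.2).
BEGIN DATA
0.00 0.10 0.20 0.30 0.40 0.50
END DATA.
COMPUTE p1 = 0.984.
COMPUTE p2 = 0.961.
COMPUTE n = 350.
COMPUTE se1 = SQRT(p1*(1 - p1)/n).
COMPUTE se2 = SQRT(p2*(1 - p2)/n).
* Dependent-samples z, using the same denominator correction as the t formula above.
COMPUTE zval = (p1 - p2) / SQRT(se1**2 + se2**2 - 2*r*se1*se2).
COMPUTE pval = 2*(1 - CDF.NORMAL(ABS(zval), 0, 1)).
EXECUTE.
LIST r zval pval.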
What I probably didn't make clear enough in my previous post is that when performing a sensitivity analysis, one would usually have some kind of data from somewhere (e.g., the literature, pilot data) on which to base one's guess about what range of values is plausible.
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/
"When all else fails, RTFM."
Not to question others' posts on this sample overlap (covariance) problem (in large part because I deal almost exclusively in the realm of proportions derived from surveys, so I'm not sufficiently clued up on the theory behind the later posts), but it might be helpful to approach this from the perspective of the normal approximation to the binomial. That approximation -- the slightly problematic response rate and proportions close to 100% (or zero for the complements, q, of the proportions of interest) aside -- seems defensible given your sample sizes of ~700.

There is also a minor complication relating to the unit of interest: employees. Obviously, one year on, the exact same employees will not comprise the 2nd sample, i.e., the 2nd sample does not completely "overlap" with the 1st. But for the time being, let's just ignore that nuisance...

In particular, an emphasis only on proportions means you need no information about correlation per se; the discrete proportions from both samples, along with an estimate of the overlap proportion (the proportion in common to both samples), are sufficient. Specifically, the covariance situation you describe is complete sample overlap in the form of a panel/longitudinal survey. Assuming the usual 95% confidence level (social science, which I sense is appropriate given employee data?), the margin of error (MOE) formula for the difference between p1 and p2 (the proportions of interest from the 1st and 2nd samples, respectively) is:

MOE(p1 - p2) = 1.96 x SQRT{(1/n)[p1(1-p1) + p2(1-p2) - 2(p12 - p1*p2)]}

where p12 = the overlap proportion, i.e., the proportion in common to both the 1st and 2nd samples.

Importantly, the covariance term (p12 - p1*p2) is NOT guaranteed to be positive, essentially because there is no requirement for p1 and p2 to vary inversely, positively, or at all (this should not be confused with the underlying reality that the two samples overlap from the perspective of common RESPONDENTS, but not necessarily common RESPONSES).

You can relatively easily model the MOE impact of different p1, p2, and p12 values (with appropriate limitations: if p1 = 0.9 and p2 = 0.8, the overlap proportion p12 cannot be less than 0.7 -- i.e., ALL of the p2 = 0.8 in common with the p1 = 0.9 -- nor logically can it be higher than 0.8) to confirm that under certain circumstances the MOE can be higher, lower, or the same relative to independent samples, for which the covariance term disappears.

Some references, in particular Kish, who describes solutions to a range of sample overlap/covariance scenarios:

- Kish, L. (1965). Survey Sampling. (See especially chapters 12.4 and 12.10.)
- Franklin, C. (2007). The 'Margin of Error' for Differences in Polls. [Online]. Available: http://spicyscandals.com/MOEFranklin.pdf. [3 August 2010].
- Scott, A.J., and Seber, G.A.F. (1983). Difference of Proportions From the Same Survey. The American Statistician, 37, 319-320.
- Worcester, R., and Downham, J. (Eds.) (1986). Consumer Market Research Handbook. 3rd Edition. London: McGraw-Hill.
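To make the formula concrete, here is a minimal SPSS sketch comparing the overlap-adjusted MOE with the independent-samples version. The p12 and n values are made-up placeholders (p12 must lie between p1 + p2 - 1 and the smaller of p1 and p2); substitute your own figures.

* Made-up illustrative values -- replace with your own.
DATA LIST FREE / p1 p2 p12 n.
BEGIN DATA
0.984 0.961 0.950 350
END DATA.
COMPUTE moe_over = 1.96 * SQRT((1/n)*(p1*(1 - p1) + p2*(1 - p2) - 2*(p12 - p1*p2))).
* Independent-samples version: the covariance term drops out.
COMPUTE moe_ind = 1.96 * SQRT((1/n)*(p1*(1 - p1) + p2*(1 - p2))).
EXECUTE.
LIST p1 p2 p12 moe_over moe_ind.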
Pete McMillen
Senior Research Adviser
Strategic Analysis & Research | Strategy, Policy & Planning
Department of Corrections | Ara Poutama Aotearoa