comparing groups in two different datasets


J McClure
Hi,
I want to compare demographic variables for veterans in my 4-month study
sample with demographic data from the VA for all veterans who attended
the emergency clinic in the month after my study ended.  I have the
mean age, proportion of males, etc. for both the study sample and for all
veterans, but I don't know how to do a statistical test looking for
significant differences.
Thanks for any help.
Jan

Re: comparing groups in two different datasets

Bruce Weaver
Administrator
For continuous variables, I assume you want to use a t-test.  If you have the mean, SD and N for each group, you can perform a one-way ANOVA (with the ONEWAY procedure), using a matrix data file as input.  E.g.,

MATRIX DATA VARIABLES=Group ROWTYPE_ Score /FACTORS=Group.
BEGIN DATA
1 N 56
2 N 71
1 MEAN 22.98
2 MEAN 25.78
1 STDDEV 8.79
2 STDDEV 9.08
END DATA.

ONEWAY Score BY Group /MATRIX IN(*).

If you want to report it as a t-test, t = SQRT(F).
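
As a cross-check, here is a minimal sketch that computes the pooled
two-sample t directly from the same summary statistics (the numbers are
the ones in the MATRIX DATA example above):

DATA LIST LIST /n1 m1 sd1 n2 m2 sd2.
BEGIN DATA
56 22.98 8.79 71 25.78 9.08
END DATA.
COMPUTE sp2 = ((n1 - 1)*sd1**2 + (n2 - 1)*sd2**2) / (n1 + n2 - 2).  /* Pooled variance.
COMPUTE t = (m1 - m2) / SQRT(sp2 * (1/n1 + 1/n2)).
COMPUTE df = n1 + n2 - 2.
COMPUTE p = 2 * (1 - CDF.T(ABS(t), df)).  /* Two-tailed p-value.
FORMATS t p (F8.4).
LIST t df p.

With these numbers t is about -1.75, and squaring it should reproduce
the F from the ONEWAY run above.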

For categorical variables, I assume you want a chi-square test of association.  Again, all you need is the cell counts for the contingency table.  The trick is to use WEIGHT CASES.  Here's an example where the demographic variable has 3 levels.  If you have only 2 levels, delete the rows with DEMOGVAR = 3.  If you have more levels, add more rows as needed.  And obviously, replace the CELL_COUNT values with the ones from your table.

DATA LIST LIST /demogvar (f2.0) group (f2.0) cell_count (f5.0) .
BEGIN DATA.
1 1 9
2 1 17
3 1 22
1 2 21
2 2 13
3 2 8
END DATA.

weight by cell_count.  /* This is the key line.
crosstabs demogvar by group /cells = count exp /stat = chisq.

HTH.

Re: comparing groups in two different datasets

Jon K Peck
Or, for t-tests, you can use the SPSSINC SUMMARY TTEST extension command, which also has a dialog box interface.  The command and the Python programmability plug-in it requires can be downloaded from SPSS Developer Central, www.spss.com/devcentral.

HTH,

Jon Peck
Senior Software Engineer, IBM
[hidden email]
312-651-3435

non-SPSS: appropriate statistical test

J P-6
Colleagues,

This is not an SPSS question (at least not yet).

I am seeking advice on the appropriate test for comparing two non-independent samples when the non-independence cannot be modeled.

The proportions are drawn from the same employee population (~700, response rate ~50%), surveyed one year apart. An example of an actual comparison is 98.4% vs. 96.1% between time 1 and time 2.

The problem, as I see it, is that the two samples are not independent, but there is no ID variable, so neither a dependent t-test nor a mixed model can be used. I have found a test for comparing proportions from two independent groups.

What is the risk of violating the assumption of independence? An inflated Type I error rate?

As far as I know there is no appropriate test for this situation, but I thought I'd check with minds greater than mine...

Thank you,

John

Re: non-SPSS: appropriate statistical test

Mike
Not sure I'm a greater mind but here goes:
 
(1) Simple stuff first:  if you are doing t-tests, the general formula
for the t-test is the following:

Obtained t = (M1 - M2) / sqrt[VarErr1 + VarErr2 - 2*r*SE1*SE2]

where M1 = mean of group 1, M2 = mean of group 2, VarErr1 = error
variance (squared standard error) of group 1, VarErr2 = error variance
of group 2, r = the Pearson r between group 1 and group 2 values,
SE1 = standard error of group 1, SE2 = standard error of group 2, and
2 = the constant 2.
 
If you cannot calculate r, you have to assume that it is equal to zero,
which makes the t-test denominator = sqrt[VarErr1 + VarErr2].  This
denominator will be larger than the denominator when a (positive) r is
known.  The good news is that if the t-test is significant under the
assumption of r = 0.00, then it has to be significant if you can
calculate r (NOTE: r is typically a positive value -- a negative r
should cause you to re-examine your data).  The bad news is that if the
t-test is non-significant, it could be because there is no real
difference, or because you failed to find a significant difference since
you could not adjust (reduce) the denominator appropriately.

So, treating your data as independent groups makes the test more
conservative, or less powerful.  I am open to correction on these points.
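
Here is a minimal sketch of that formula in SPSS syntax, with made-up
summary statistics purely to illustrate the computation (the error
variances are the squared SEs):

DATA LIST LIST /m1 se1 m2 se2 r.
BEGIN DATA
50 2.0 46 2.2 0.40
END DATA.
COMPUTE t = (m1 - m2) / SQRT(se1**2 + se2**2 - 2*r*se1*se2).
FORMATS t (F8.3).
LIST m1 m2 r t.

With these numbers, t is about 1.73 at r = 0.40 versus about 1.35 at
r = 0; the cross term only shrinks the denominator when r is positive.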
 
(2) It seems to me that you should be able to get an estimate of the
Pearson r through bootstrapping or some other simulation procedure.
If there is a positive correlation between time 1 and time 2, then,
assuming data consisting only of 0s and 1s, time 1 zeros should co-occur
with time 2 zeros at a greater-than-chance level, and the same holds for
ones, even if they are not matched up properly.  I haven't thought this
through, but perhaps someone more familiar with bootstrapping and
correlation has more wisdom.
 
-Mike Palij
New York University
 
 
 

Re: non-SPSS: appropriate statistical test

Bruce Weaver
Administrator
As usual, Mike is giving solid advice.  The only point at which I did a (small) double-take was where he said, "If you cannot calculate r, you have to assume that it is equal to zero."  I was reminded of situations where I did not know the value of some parameter and was therefore advised to do a "sensitivity analysis", i.e., do the computation a few times with a range of plausible values plugged in for the unknown parameter, to determine how sensitive the result is to the value of that parameter.  That was when I was working for people doing medical research, and the term "sensitivity analysis" seemed to be well known in those circles; I had never heard of it prior to that (my background to that point had been in experimental psychology).  The obvious potential problem here is convincing other people that the values you plugged in are plausible.  ;-)

HTH.



Re: non-SPSS: appropriate statistical test

Mike
As usual, Bruce gives good advice. ;-)  What he is suggesting, I
believe, is something like setting the Pearson r to values of 0.00,
0.10, 0.20, 0.30, and so on, and seeing what happens to the obtained
t-value (i.e., does it become statistically significant?).  One might
also use Cohen's recommended values for small, medium, and large effect
sizes for r (his "Power Primer" article has the values -- a Google
Scholar search may turn up the article if one doesn't have access to
PsycINFO).

I hesitated to make such a recommendation because I think it may be
possible to come up with an empirically derived value for the Pearson r
given the data (though this might be difficult).  Ultimately, what one
decides to do depends upon the question(s) one wants to answer and how
important it is to get to the "truth" of the situation.
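
A minimal sketch of that grid in SPSS syntax.  The proportions are the
ones John quoted (98.4% vs. 96.1%); the n of 350 and the paired-style
degrees of freedom are assumptions (roughly 700 employees x ~50%
response), so substitute the real figures:

DATA LIST LIST /r.
BEGIN DATA
0.00
0.10
0.20
0.30
0.40
0.50
END DATA.
COMPUTE p1 = 0.984.   /* Time 1 proportion, from John's example.
COMPUTE p2 = 0.961.   /* Time 2 proportion.
COMPUTE n = 350.      /* Assumed: ~700 population x ~50% response.
COMPUTE se1 = SQRT(p1*(1 - p1)/n).
COMPUTE se2 = SQRT(p2*(1 - p2)/n).
COMPUTE t = (p1 - p2) / SQRT(se1**2 + se2**2 - 2*r*se1*se2).
COMPUTE sig = 2 * (1 - CDF.T(ABS(t), n - 1)).   /* Two-tailed p.
FORMATS t sig (F8.4).
LIST r t sig.

With these made-up inputs the comparison falls just short of p < .05 at
r = 0 and becomes significant once r reaches about 0.2.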

-Mike Palij
New York University
[hidden email]



Re: non-SPSS: appropriate statistical test

Bruce Weaver
Administrator
What I probably didn't make clear enough in my previous post is that when performing a sensitivity analysis, one would usually have some kind of data from somewhere (e.g., the literature, pilot data) on which to base one's guess about what range of values is plausible.



FW: Re: non-SPSS: appropriate statistical test

MCMILLEN, Pete (WELLHO)
Not to question others' posts on this sample overlap (covariance)
problem (in large part because I deal almost exclusively in proportions
derived from surveys, so I'm not sufficiently clued up on the theory
behind the later posts), but it might be helpful to approach this from
the perspective of the normal approximation to the binomial.  That seems
defensible given sample sizes of around 700, the slightly problematic
response rate and proportions close to 100% (or zero for the
complements, q, of the proportions of interest) aside.  There is also a
minor complication relating to the unit of interest, employees:
obviously, one year on I would envisage that the exact same employees
will not in fact comprise the 2nd sample, i.e. the 2nd sample does not
completely "overlap" with the 1st.  But for the time being let's just
ignore that nuisance parameter...

In particular, a focus only on proportions means you need no information
about the correlation per se: the discrete proportions from the two
samples, along with an estimate of the overlap proportion (the
proportion in common to both samples), are sufficient.

Specifically, the covariance situation you describe is complete sample
overlap in the form of a panel/longitudinal survey.

Assuming the usual 95% confidence level (which I sense is appropriate
for social science / employee data), the margin of error (MOE) formula
for the difference between p1 and p2 (the proportions of interest from
the 1st and 2nd samples, respectively) is:

MOE(p1 - p2) = 1.96 x SQRT{ (1/n) * [ p1(1-p1) + p2(1-p2) - 2(p12 - p1*p2) ] },

where p12 = the overlap proportion, i.e. the proportion in common to
both the 1st and 2nd samples.

Importantly, the covariance term (p12 - p1*p2) is NOT guaranteed to be
positive, essentially because there is no requirement for p1 and p2 to
vary inversely, positively, or at all (this should not be confused with
the underlying reality that the two samples overlap from the perspective
of common RESPONDENTS, but not necessarily common RESPONSES).  You can
relatively easily model the MOE impact of different p1, p2, and p12
values (with appropriate limitations: if p1 = 0.9 and p2 = 0.8, the
overlap proportion p12 cannot be less than 0.7, nor logically can it be
higher than 0.8, since at most ALL of the p2 = 0.8 are in common with
the p1 = 0.9), to confirm that under certain circumstances the MOE can
be higher, lower, or the same relative to independent samples -- for
which the covariance term disappears.
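
A minimal sketch of that modelling exercise in SPSS syntax; p1 = 0.9
and p2 = 0.8 are the illustrative values from the paragraph above, and
n = 350 is an arbitrary sample size:

DATA LIST LIST /p12.
BEGIN DATA
0.70
0.72
0.75
0.78
0.80
END DATA.
COMPUTE p1 = 0.90.
COMPUTE p2 = 0.80.
COMPUTE n = 350.
COMPUTE moe = 1.96 * SQRT((1/n) * (p1*(1 - p1) + p2*(1 - p2) - 2*(p12 - p1*p2))).
FORMATS moe (F8.4).
LIST p12 moe.

When p12 = p1*p2 (here 0.72) the covariance term is zero and the MOE
matches the independent-samples case; greater overlap shrinks it, less
overlap inflates it.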

Some references, in particular Kish, who describes solutions to a range
of sample overlap/covariance scenarios:
- Kish, L. (1965). Survey Sampling. (See especially sections 12.4 and
12.10.)
- Franklin, C. (2007). The 'Margin of Error' for Differences in Polls.
[Online]. Available: http://spicyscandals.com/MOEFranklin.pdf [3 August
2010].
- Scott, A. J., & Seber, G. A. F. (1983). Difference of Proportions From
the Same Survey. The American Statistician, 37, 319-320.
- Worcester, R., & Downham, J. (Eds.) (1986). Consumer Market Research
Handbook (3rd ed.). London: McGraw-Hill.


Pete McMillen
Senior Research Adviser
Strategic Analysis & Research | Strategy, Policy & Planning
Department of Corrections | Ara Poutama Aotearoa
Mayfair House | 44-52 The Terrace | Private Box 1206 | Wellington
Ext 68052 | DDI +64 4 460 3052 | Fax +64 4 460 3214
