Confidence Interval for R-square: R2.exe (Steiger&Fouladi) vs Bootstrapping

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Confidence Interval for R-square: R2.exe (Steiger&Fouladi) vs Bootstrapping

Marta García-Granero
Hi everybody:

This is more a theoretical question, rather than a SPSS related one
(skip it safely if you are not interested in CI for R-square).

There is a program for computing confidence intervals for R2 (called
R2.exe)

http://www.interchange.ubc.ca/steiger/r2.htm

BTW, the link to the program inside the page is wrong (points to a non
existent ftp site), and the correct one is:

http://www.interchange.ubc.ca/steiger/r2.zip

Since my students are kind of alergic to DOS programs, I have been
trying to write code to compute CI for R2 using bootstrap. TO my
surprise, the 95% bootstrap CI is quite different to the one
Steiger&Fouladi program gives (narrower and the lower limit is higher
than it should). I have been trying to adjust the limits of the
bootstrap CI to make it match (more or less) the one R2.exe gives. I
had to replace percentyle 2.5 by percentyle 0.7 in order to adjust the
lower limit (see an example dataset and code below).

My question is: should I leave the percentyles as they are (2.5 &
97.5) and simply discuss with my students that bootstrapped CIs will
be narrower than the ones given by R2.exe, or should I replace
percentyle 2.5 by 0.7 and still call the bootstrap results a "95% CI"?

--
Regards,
Marta Garcia-Granero, PhD
Statistician

_______________________________________________________________________
CODE:

* Sample dataset (from 'Statistics at Square One', BMJ on-line book) *.
DATA LIST LIST/height deadsp (2 F8.0).
BEGIN DATA
110  44
116  31
124  43
129  45
131  56
138  79
142  57
150  56
153  58
155  92
156  78
159  64
164  88
168 112
174 101
END DATA.
VARIABLE LABEL height 'Height (cm)' /deadsp 'Dead space (ml)'.

REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT deadsp
  /METHOD=ENTER height  .

* Using R2.exe: 95%CI Lower Limit: 0.33029; Upper Limit: 0.88794 *.

PRESERVE.
SET SEED=29147290.
SET MXLOOPS=1000.
MATRIX.
PRINT /TITLE='Confidence Interval for R-square in Simple Linear Regression'.
GET data /VAR=height deadsp /MISSING OMIT.
* Sample statistics *.
COMPUTE n=NROW(data).
COMPUTE mean=CSUM(data)/n.
COMPUTE variance=(CSSQ(data)-n&*(mean&**2))/(n-1).
COMPUTE x=data(:,1).
COMPUTE y=data(:,2).
COMPUTE covxy=((T(x)*y)-n*mean(1)*mean(2))/(n-1).
COMPUTE r=covxy/SQRT(variance(1)*variance(2)).
PRINT r**2
 /FORMAT='F8.3'
 /TITLE='Sample R-square'.
COMPUTE b=covxy/variance(1).
COMPUTE a=mean(2)-b*mean(1).
PRINT {a,b}
 /FORMAT='F8.3'
 /CLABEL='a','b'
 /TITLE='Regression line'.
* Bootstraping R2 *.
COMPUTE lowersum=0.
COMPUTE uppersum=0.
COMPUTE k=1000. /* Number of bootsamples (change it if you want, increase MXLOOPS then) *.
COMPUTE bootR2=MAKE(k,1,0).
COMPUTE bootsamp=MAKE(n,2,0).
LOOP h=1 TO 10.
- LOOP i=1 TO k. /* Extracting bootstrap samples *.
-  LOOP j= 1 TO n.
-   COMPUTE flipcoin=1+TRUNC(n*UNIFORM(1,1)).
-   COMPUTE bootsamp(j,:)=data(flipcoin,:)).
-  END LOOP.
-  COMPUTE mean=CSUM(bootsamp)/n.
-  COMPUTE variance=(CSSQ(bootsamp)-n&*(mean&**2))/(n-1).
 - COMPUTE x=bootsamp(:,1).
-  COMPUTE y=bootsamp(:,2).
-  COMPUTE covxy=((T(x)*y)-n*mean(1)*mean(2))/(n-1).
-  COMPUTE r=covxy/SQRT(variance(1)*variance(2)).
-  COMPUTE bootR2(i)=r&**2.
- END LOOP.
* Ordered array: sorting algorithm by R Ristow & J Peck *.
- COMPUTE sortedR2=bootR2.
- COMPUTE sortedR2(GRADE(bootR2))=bootR2.
* NP confidence interval *.
- COMPUTE lower=sortedR2(k*0.007).
- COMPUTE upper=sortedR2(1+k*0.975).
- COMPUTE lowersum=lowersum+lower.
- COMPUTE uppersum=uppersum+upper.
END LOOP.
PRINT {lowersum/10,uppersum/10}
 /FORMAT='F8.4'
 /TITLE='R2 CI: Mean of 10 bootstrap runs (1000 reps. each)'
 /CLABEL='LowerCI','UpperCI'.
END MATRIX.
RESTORE.
Reply | Threaded
Open this post in threaded view
|

Re: Confidence Interval for R-square: R2.exe (Steiger&Fouladi) vs Bootstrapping

Meyer, Gregory J
Hi Marta,

I'm not sophisticated enough with MATRIX commands to follow all that you've done. However, I don't see where you centered the bootstrap distribution on R^2-Adjusted (what Steiger calls P^2). It appears that the distribution is centered on R^2 itself.

The population parameter estimate, P^2, is 1 - [(1 - R^2)(N - 1)]/(N - K), where K is the number of variables in the equation (i.e., 2 for your example).

Could this be the difficulty?

Greg

| -----Original Message-----
| From: SPSSX(r) Discussion [mailto:[hidden email]]
| On Behalf Of Marta García-Granero
| Sent: Monday, June 19, 2006 4:56 AM
| To: [hidden email]
| Subject: Confidence Interval for R-square: R2.exe
| (Steiger&Fouladi) vs Bootstrapping
|
| Hi everybody:
|
| This is more a theoretical question, rather than a SPSS related one
| (skip it safely if you are not interested in CI for R-square).
|
| There is a program for computing confidence intervals for R2 (called
| R2.exe)
|
| http://www.interchange.ubc.ca/steiger/r2.htm
|
| BTW, the link to the program inside the page is wrong (points to a non
| existent ftp site), and the correct one is:
|
| http://www.interchange.ubc.ca/steiger/r2.zip
|
| Since my students are kind of alergic to DOS programs, I have been
| trying to write code to compute CI for R2 using bootstrap. TO my
| surprise, the 95% bootstrap CI is quite different to the one
| Steiger&Fouladi program gives (narrower and the lower limit is higher
| than it should). I have been trying to adjust the limits of the
| bootstrap CI to make it match (more or less) the one R2.exe gives. I
| had to replace percentyle 2.5 by percentyle 0.7 in order to adjust the
| lower limit (see an example dataset and code below).
|
| My question is: should I leave the percentyles as they are (2.5 &
| 97.5) and simply discuss with my students that bootstrapped CIs will
| be narrower than the ones given by R2.exe, or should I replace
| percentyle 2.5 by 0.7 and still call the bootstrap results a "95% CI"?
|
| --
| Regards,
| Marta Garcia-Granero, PhD
| Statistician
|
| ______________________________________________________________
| _________
| CODE:
|
| * Sample dataset (from 'Statistics at Square One', BMJ
| on-line book) *.
| DATA LIST LIST/height deadsp (2 F8.0).
| BEGIN DATA
| 110  44
| 116  31
| 124  43
| 129  45
| 131  56
| 138  79
| 142  57
| 150  56
| 153  58
| 155  92
| 156  78
| 159  64
| 164  88
| 168 112
| 174 101
| END DATA.
| VARIABLE LABEL height 'Height (cm)' /deadsp 'Dead space (ml)'.
|
| REGRESSION
|   /STATISTICS COEFF OUTS R ANOVA
|   /DEPENDENT deadsp
|   /METHOD=ENTER height  .
|
| * Using R2.exe: 95%CI Lower Limit: 0.33029; Upper Limit: 0.88794 *.
|
| PRESERVE.
| SET SEED=29147290.
| SET MXLOOPS=1000.
| MATRIX.
| PRINT /TITLE='Confidence Interval for R-square in Simple
| Linear Regression'.
| GET data /VAR=height deadsp /MISSING OMIT.
| * Sample statistics *.
| COMPUTE n=NROW(data).
| COMPUTE mean=CSUM(data)/n.
| COMPUTE variance=(CSSQ(data)-n&*(mean&**2))/(n-1).
| COMPUTE x=data(:,1).
| COMPUTE y=data(:,2).
| COMPUTE covxy=((T(x)*y)-n*mean(1)*mean(2))/(n-1).
| COMPUTE r=covxy/SQRT(variance(1)*variance(2)).
| PRINT r**2
|  /FORMAT='F8.3'
|  /TITLE='Sample R-square'.
| COMPUTE b=covxy/variance(1).
| COMPUTE a=mean(2)-b*mean(1).
| PRINT {a,b}
|  /FORMAT='F8.3'
|  /CLABEL='a','b'
|  /TITLE='Regression line'.
| * Bootstraping R2 *.
| COMPUTE lowersum=0.
| COMPUTE uppersum=0.
| COMPUTE k=1000. /* Number of bootsamples (change it if you
| want, increase MXLOOPS then) *.
| COMPUTE bootR2=MAKE(k,1,0).
| COMPUTE bootsamp=MAKE(n,2,0).
| LOOP h=1 TO 10.
| - LOOP i=1 TO k. /* Extracting bootstrap samples *.
| -  LOOP j= 1 TO n.
| -   COMPUTE flipcoin=1+TRUNC(n*UNIFORM(1,1)).
| -   COMPUTE bootsamp(j,:)=data(flipcoin,:)).
| -  END LOOP.
| -  COMPUTE mean=CSUM(bootsamp)/n.
| -  COMPUTE variance=(CSSQ(bootsamp)-n&*(mean&**2))/(n-1).
|  - COMPUTE x=bootsamp(:,1).
| -  COMPUTE y=bootsamp(:,2).
| -  COMPUTE covxy=((T(x)*y)-n*mean(1)*mean(2))/(n-1).
| -  COMPUTE r=covxy/SQRT(variance(1)*variance(2)).
| -  COMPUTE bootR2(i)=r&**2.
| - END LOOP.
| * Ordered array: sorting algorithm by R Ristow & J Peck *.
| - COMPUTE sortedR2=bootR2.
| - COMPUTE sortedR2(GRADE(bootR2))=bootR2.
| * NP confidence interval *.
| - COMPUTE lower=sortedR2(k*0.007).
| - COMPUTE upper=sortedR2(1+k*0.975).
| - COMPUTE lowersum=lowersum+lower.
| - COMPUTE uppersum=uppersum+upper.
| END LOOP.
| PRINT {lowersum/10,uppersum/10}
|  /FORMAT='F8.4'
|  /TITLE='R2 CI: Mean of 10 bootstrap runs (1000 reps. each)'
|  /CLABEL='LowerCI','UpperCI'.
| END MATRIX.
| RESTORE.
|
Reply | Threaded
Open this post in threaded view
|

Re: Confidence Interval for R-square: R2.exe (Steiger&Fouladi) vs Bootstrapping

Marta García-Granero
Hi Gregory,

Although it is true I had forgotten to bootstrap adjusted R2
(Rho-square) instead of sample R2 (thanks!), this doesn't explain the
discrepancies between both methods of estimating CI for Rho-square. I
still have to modify the percentyles (1.2 instead of 2.5 & 98 instead
of 97.5) to get results more consistent with those R2.exe gives (see
below). Therefore, my question is still the same: should I just indicate
the differences or adjust the percentyles?...

* Sample dataset (from 'Statistics at Square One', BMJ on-line book) *.
DATA LIST FREE/height deadsp (2 F8.0).
BEGIN DATA
110  44 116  31 124  43 129  45  131  56
138  79 142  57 150  56 153  58  155  92
156  78 159  64 164  88 168 112  174 101
END DATA.
VARIABLE LABEL height 'Height (cm)' /deadsp 'Dead space (ml)'.

REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT deadsp
  /METHOD=ENTER height  .

* Using R2.exe: 95%CI Lower Limit: 0.33029; Upper Limit: 0.88794 *.

PRESERVE.
SET SEED=29147290.
SET MXLOOPS=1000.
MATRIX.
PRINT /TITLE='Confidence Interval for Rho-square in Simple Linear Regression'.
GET data /VAR=height deadsp /MISSING OMIT.
* Sample statistics *.
COMPUTE n=NROW(data).
COMPUTE mean=CSUM(data)/n.
COMPUTE variance=(CSSQ(data)-n&*(mean&**2))/(n-1).
COMPUTE x=data(:,1).
COMPUTE y=data(:,2).
COMPUTE covxy=((T(x)*y)-n*mean(1)*mean(2))/(n-1).
COMPUTE r=covxy/SQRT(variance(1)*variance(2)).
COMPUTE b=covxy/variance(1).
COMPUTE a=mean(2)-b*mean(1).
PRINT {a,b}
 /FORMAT='F8.3'
 /CLABEL='a','b'
 /TITLE='Regression line'.
PRINT {r**2,1-(1-r**2)*(n-1)/(n-2)}
 /FORMAT='F8.3'
 /CLABEL='R-Square','P-Square'
 /TITLE='Sample & Population (Adjusted) R-square'.
* Bootstraping PR2 *.
COMPUTE k=1000.
COMPUTE bootR2  =MAKE(k,1,0).
COMPUTE bootsamp=MAKE(n,2,0).
COMPUTE lowersum=0.
COMPUTE uppersum=0.
LOOP h=1 TO 10.   /* 10 runs (to average them later) *.
- LOOP i=1 TO k.  /* Extracting k bootstrap samples *.
-  LOOP j= 1 TO n./* with sample size n *.
-   COMPUTE flipcoin=1+TRUNC(n*UNIFORM(1,1)).
-   COMPUTE bootsamp(j,:)=data(flipcoin,:)).
-  END LOOP.
-  COMPUTE mean=CSUM(bootsamp)/n.
-  COMPUTE variance=(CSSQ(bootsamp)-n&*(mean&**2))/(n-1).
-  COMPUTE x=bootsamp(:,1).
-  COMPUTE y=bootsamp(:,2).
-  COMPUTE covxy=((T(x)*y)-n*mean(1)*mean(2))/(n-1).
-  COMPUTE r=covxy/SQRT(variance(1)*variance(2)).
-  COMPUTE bootR2(i)=1-(1-r**2)*(n-1)/(n-2).
- END LOOP.
* Ordered array: sorting algorithm by R Ristow & J Peck *.
- COMPUTE sortedR2=bootR2.
- COMPUTE sortedR2(GRADE(bootR2))=bootR2.
* NP confidence interval *.
- COMPUTE lower=sortedR2(k*0.012).
- COMPUTE upper=sortedR2(1+k*0.98).
- COMPUTE lowersum=lowersum+lower.
- COMPUTE uppersum=uppersum+upper.
END LOOP.
PRINT {lowersum/10,uppersum/10}
 /FORMAT='F8.4'
 /TITLE='Rho-Square CI: Mean of 10 bootstrap runs (1000 reps. each)'
 /CLABEL='LowerCI','UpperCI'.
END MATRIX.
RESTORE.

MGJ> I'm not sophisticated enough with MATRIX commands to follow
MGJ> all that you've done. However, I don't see where you centered the
MGJ> bootstrap distribution on R^2-Adjusted (what Steiger calls P^2).
MGJ> It appears that the distribution is centered on R^2 itself.

MGJ> The population parameter estimate, P^2, is 1 - [(1 - R^2)(N
MGJ> - 1)]/(N - K), where K is the number of variables in the equation
MGJ> (i.e., 2 for your example).

Regards

Marta
Reply | Threaded
Open this post in threaded view
|

Re: Confidence Interval for R-square: R2.exe (Steiger&Fouladi) vs Bootstrapping

Meyer, Gregory J
Hi again Marta,

Interesting problem. I tinkered with your syntax and replaced the SET SEED=29147290 command with SET SEED=Random. Then I ran 10 iterations of your bootstrap. Here's what I found.

As you reported, using the R2.exe program, the 95% CI limits should be:
Lower:  Upper:
    .3303        .8879

What we get from your syntax is:
Lower:  Upper:
    .3306    .8861
    .3377    .8906
    .3358    .8888
    .3412    .8875
    .3356    .8911
    .3169    .8898
    .3288    .8886
    .3316    .8867
    .3396    .8809
    .3363    .8861
    .3391    .8885

Ave:
    .3339        .8877
SD:
    .0069    .0028

It seems that what appeared to be fixed deviations from the R2.exe program in your bootstrap was really the consequence of having your random number seed fixed at a specific value. Using a fluctuating random seed it becomes clear that despite the large bootstrap samples, there's still a good amount of variability in the CI estimates from one run to the next. The R2.exe findings are within less than one SD of the mean from the 10 estimates from your bootstrap so it looks like both programs are targeting the same parameters.

The Steiger & Fouladi program allows 2 options for computing the CI; one quicker and less exact and the other slower but more precise. If the R2.exe results are based on the latter, perhaps it would be useful to increase your bootstrap parameters to something like h = 100 and k = 10000. On the other hand, it may be that your program is more precise and theirs more variable. I don't recall the parameters they used to generate their estimates. Hope this helps.

Greg

| -----Original Message-----
| From: SPSSX(r) Discussion [mailto:[hidden email]]
| On Behalf Of Marta García-Granero
| Sent: Monday, June 19, 2006 12:28 PM
| To: [hidden email]
| Subject: Re: Confidence Interval for R-square: R2.exe
| (Steiger&Fouladi) vs Bootstrapping
|
| Hi Gregory,
|
| Although it is true I had forgotten to bootstrap adjusted R2
| (Rho-square) instead of sample R2 (thanks!), this doesn't explain the
| discrepancies between both methods of estimating CI for Rho-square. I
| still have to modify the percentyles (1.2 instead of 2.5 & 98 instead
| of 97.5) to get results more consistent with those R2.exe gives (see
| below). Therefore, my question is still the same: should I
| just indicate
| the differences or adjust the percentyles?...
|
| * Sample dataset (from 'Statistics at Square One', BMJ
| on-line book) *.
| DATA LIST FREE/height deadsp (2 F8.0).
| BEGIN DATA
| 110  44 116  31 124  43 129  45  131  56
| 138  79 142  57 150  56 153  58  155  92
| 156  78 159  64 164  88 168 112  174 101
| END DATA.
| VARIABLE LABEL height 'Height (cm)' /deadsp 'Dead space (ml)'.
|
| REGRESSION
|   /STATISTICS COEFF OUTS R ANOVA
|   /DEPENDENT deadsp
|   /METHOD=ENTER height  .
|
| * Using R2.exe: 95%CI Lower Limit: 0.33029; Upper Limit: 0.88794 *.
|
| PRESERVE.
| SET SEED=29147290.
| SET MXLOOPS=1000.
| MATRIX.
| PRINT /TITLE='Confidence Interval for Rho-square in Simple
| Linear Regression'.
| GET data /VAR=height deadsp /MISSING OMIT.
| * Sample statistics *.
| COMPUTE n=NROW(data).
| COMPUTE mean=CSUM(data)/n.
| COMPUTE variance=(CSSQ(data)-n&*(mean&**2))/(n-1).
| COMPUTE x=data(:,1).
| COMPUTE y=data(:,2).
| COMPUTE covxy=((T(x)*y)-n*mean(1)*mean(2))/(n-1).
| COMPUTE r=covxy/SQRT(variance(1)*variance(2)).
| COMPUTE b=covxy/variance(1).
| COMPUTE a=mean(2)-b*mean(1).
| PRINT {a,b}
|  /FORMAT='F8.3'
|  /CLABEL='a','b'
|  /TITLE='Regression line'.
| PRINT {r**2,1-(1-r**2)*(n-1)/(n-2)}
|  /FORMAT='F8.3'
|  /CLABEL='R-Square','P-Square'
|  /TITLE='Sample & Population (Adjusted) R-square'.
| * Bootstraping PR2 *.
| COMPUTE k=1000.
| COMPUTE bootR2  =MAKE(k,1,0).
| COMPUTE bootsamp=MAKE(n,2,0).
| COMPUTE lowersum=0.
| COMPUTE uppersum=0.
| LOOP h=1 TO 10.   /* 10 runs (to average them later) *.
| - LOOP i=1 TO k.  /* Extracting k bootstrap samples *.
| -  LOOP j= 1 TO n./* with sample size n *.
| -   COMPUTE flipcoin=1+TRUNC(n*UNIFORM(1,1)).
| -   COMPUTE bootsamp(j,:)=data(flipcoin,:)).
| -  END LOOP.
| -  COMPUTE mean=CSUM(bootsamp)/n.
| -  COMPUTE variance=(CSSQ(bootsamp)-n&*(mean&**2))/(n-1).
| -  COMPUTE x=bootsamp(:,1).
| -  COMPUTE y=bootsamp(:,2).
| -  COMPUTE covxy=((T(x)*y)-n*mean(1)*mean(2))/(n-1).
| -  COMPUTE r=covxy/SQRT(variance(1)*variance(2)).
| -  COMPUTE bootR2(i)=1-(1-r**2)*(n-1)/(n-2).
| - END LOOP.
| * Ordered array: sorting algorithm by R Ristow & J Peck *.
| - COMPUTE sortedR2=bootR2.
| - COMPUTE sortedR2(GRADE(bootR2))=bootR2.
| * NP confidence interval *.
| - COMPUTE lower=sortedR2(k*0.012).
| - COMPUTE upper=sortedR2(1+k*0.98).
| - COMPUTE lowersum=lowersum+lower.
| - COMPUTE uppersum=uppersum+upper.
| END LOOP.
| PRINT {lowersum/10,uppersum/10}
|  /FORMAT='F8.4'
|  /TITLE='Rho-Square CI: Mean of 10 bootstrap runs (1000 reps. each)'
|  /CLABEL='LowerCI','UpperCI'.
| END MATRIX.
| RESTORE.
|
| MGJ> I'm not sophisticated enough with MATRIX commands to follow
| MGJ> all that you've done. However, I don't see where you centered the
| MGJ> bootstrap distribution on R^2-Adjusted (what Steiger calls P^2).
| MGJ> It appears that the distribution is centered on R^2 itself.
|
| MGJ> The population parameter estimate, P^2, is 1 - [(1 - R^2)(N
| MGJ> - 1)]/(N - K), where K is the number of variables in the equation
| MGJ> (i.e., 2 for your example).
|
| Regards
|
| Marta
|
Reply | Threaded
Open this post in threaded view
|

Re: Confidence Interval for R-square: R2.exe (Steiger&Fouladi) vs Bootstrapping

Meyer, Gregory J
In reply to this post by Marta García-Granero
Marta, just a quick clarification. The first values listed below from your program were obtained with the seed you specified. The following 10 were based on a random seed. So the M and SD were based on 11 runs, not 10. Hope that wasn't confusing.

Greg

| -----Original Message-----
| From: Meyer, Gregory J
| Sent: Monday, June 19, 2006 6:40 PM
| To: 'Marta García-Granero'; '[hidden email]'
| Subject: RE: Re: Confidence Interval for R-square: R2.exe
| (Steiger&Fouladi) vs Bootstrapping
|
| Hi again Marta,
|
| Interesting problem. I tinkered with your syntax and replaced
| the SET SEED=29147290 command with SET SEED=Random. Then I
| ran 10 iterations of your bootstrap. Here's what I found.
|
| As you reported, using the R2.exe program, the 95% CI limits
| should be:
| Lower:        Upper:
|     .3303      .8879
|
| What we get from your syntax is:
| Lower:        Upper:
|     .3306    .8861
|     .3377    .8906
|     .3358    .8888
|     .3412    .8875
|     .3356    .8911
|     .3169    .8898
|     .3288    .8886
|     .3316    .8867
|     .3396    .8809
|     .3363    .8861
|     .3391    .8885
|
| Ave:
|     .3339      .8877
| SD:
|     .0069    .0028
|
| It seems that what appeared to be fixed deviations from the
| R2.exe program in your bootstrap was really the consequence
| of having your random number seed fixed at a specific value.
| Using a fluctuating random seed it becomes clear that despite
| the large bootstrap samples, there's still a good amount of
| variability in the CI estimates from one run to the next. The
| R2.exe findings are within less than one SD of the mean from
| the 10 estimates from your bootstrap so it looks like both
| programs are targeting the same parameters.
|
| The Steiger & Fouladi program allows 2 options for computing
| the CI; one quicker and less exact and the other slower but
| more precise. If the R2.exe results are based on the latter,
| perhaps it would be useful to increase your bootstrap
| parameters to something like h = 100 and k = 10000. On the
| other hand, it may be that your program is more precise and
| theirs more variable. I don't recall the parameters they used
| to generate their estimates. Hope this helps.
|
| Greg
|
| | -----Original Message-----
| | From: SPSSX(r) Discussion [mailto:[hidden email]]
| | On Behalf Of Marta García-Granero
| | Sent: Monday, June 19, 2006 12:28 PM
| | To: [hidden email]
| | Subject: Re: Confidence Interval for R-square: R2.exe
| | (Steiger&Fouladi) vs Bootstrapping
| |
| | Hi Gregory,
| |
| | Although it is true I had forgotten to bootstrap adjusted R2
| | (Rho-square) instead of sample R2 (thanks!), this doesn't
| explain the
| | discrepancies between both methods of estimating CI for
| Rho-square. I
| | still have to modify the percentyles (1.2 instead of 2.5 &
| 98 instead
| | of 97.5) to get results more consistent with those R2.exe gives (see
| | below). Therefore, my question is still the same: should I
| | just indicate
| | the differences or adjust the percentyles?...
| |
| | * Sample dataset (from 'Statistics at Square One', BMJ
| | on-line book) *.
| | DATA LIST FREE/height deadsp (2 F8.0).
| | BEGIN DATA
| | 110  44 116  31 124  43 129  45  131  56
| | 138  79 142  57 150  56 153  58  155  92
| | 156  78 159  64 164  88 168 112  174 101
| | END DATA.
| | VARIABLE LABEL height 'Height (cm)' /deadsp 'Dead space (ml)'.
| |
| | REGRESSION
| |   /STATISTICS COEFF OUTS R ANOVA
| |   /DEPENDENT deadsp
| |   /METHOD=ENTER height  .
| |
| | * Using R2.exe: 95%CI Lower Limit: 0.33029; Upper Limit: 0.88794 *.
| |
| | PRESERVE.
| | SET SEED=29147290.
| | SET MXLOOPS=1000.
| | MATRIX.
| | PRINT /TITLE='Confidence Interval for Rho-square in Simple
| | Linear Regression'.
| | GET data /VAR=height deadsp /MISSING OMIT.
| | * Sample statistics *.
| | COMPUTE n=NROW(data).
| | COMPUTE mean=CSUM(data)/n.
| | COMPUTE variance=(CSSQ(data)-n&*(mean&**2))/(n-1).
| | COMPUTE x=data(:,1).
| | COMPUTE y=data(:,2).
| | COMPUTE covxy=((T(x)*y)-n*mean(1)*mean(2))/(n-1).
| | COMPUTE r=covxy/SQRT(variance(1)*variance(2)).
| | COMPUTE b=covxy/variance(1).
| | COMPUTE a=mean(2)-b*mean(1).
| | PRINT {a,b}
| |  /FORMAT='F8.3'
| |  /CLABEL='a','b'
| |  /TITLE='Regression line'.
| | PRINT {r**2,1-(1-r**2)*(n-1)/(n-2)}
| |  /FORMAT='F8.3'
| |  /CLABEL='R-Square','P-Square'
| |  /TITLE='Sample & Population (Adjusted) R-square'.
| | * Bootstraping PR2 *.
| | COMPUTE k=1000.
| | COMPUTE bootR2  =MAKE(k,1,0).
| | COMPUTE bootsamp=MAKE(n,2,0).
| | COMPUTE lowersum=0.
| | COMPUTE uppersum=0.
| | LOOP h=1 TO 10.   /* 10 runs (to average them later) *.
| | - LOOP i=1 TO k.  /* Extracting k bootstrap samples *.
| | -  LOOP j= 1 TO n./* with sample size n *.
| | -   COMPUTE flipcoin=1+TRUNC(n*UNIFORM(1,1)).
| | -   COMPUTE bootsamp(j,:)=data(flipcoin,:)).
| | -  END LOOP.
| | -  COMPUTE mean=CSUM(bootsamp)/n.
| | -  COMPUTE variance=(CSSQ(bootsamp)-n&*(mean&**2))/(n-1).
| | -  COMPUTE x=bootsamp(:,1).
| | -  COMPUTE y=bootsamp(:,2).
| | -  COMPUTE covxy=((T(x)*y)-n*mean(1)*mean(2))/(n-1).
| | -  COMPUTE r=covxy/SQRT(variance(1)*variance(2)).
| | -  COMPUTE bootR2(i)=1-(1-r**2)*(n-1)/(n-2).
| | - END LOOP.
| | * Ordered array: sorting algorithm by R Ristow & J Peck *.
| | - COMPUTE sortedR2=bootR2.
| | - COMPUTE sortedR2(GRADE(bootR2))=bootR2.
| | * NP confidence interval *.
| | - COMPUTE lower=sortedR2(k*0.012).
| | - COMPUTE upper=sortedR2(1+k*0.98).
| | - COMPUTE lowersum=lowersum+lower.
| | - COMPUTE uppersum=uppersum+upper.
| | END LOOP.
| | PRINT {lowersum/10,uppersum/10}
| |  /FORMAT='F8.4'
| |  /TITLE='Rho-Square CI: Mean of 10 bootstrap runs (1000 reps. each)'
| |  /CLABEL='LowerCI','UpperCI'.
| | END MATRIX.
| | RESTORE.
| |
| | MGJ> I'm not sophisticated enough with MATRIX commands to follow
| | MGJ> all that you've done. However, I don't see where you
| centered the
| | MGJ> bootstrap distribution on R^2-Adjusted (what Steiger
| calls P^2).
| | MGJ> It appears that the distribution is centered on R^2 itself.
| |
| | MGJ> The population parameter estimate, P^2, is 1 - [(1 - R^2)(N
| | MGJ> - 1)]/(N - K), where K is the number of variables in
| the equation
| | MGJ> (i.e., 2 for your example).
| |
| | Regards
| |
| | Marta
| |
|
Reply | Threaded
Open this post in threaded view
|

Re: Confidence Interval for R-square: R2.exe (Steiger&Fouladi) vs Bootstrapping

Marta García-Granero
In reply to this post by Meyer, Gregory J
Hi Greg

MGJ> Interesting problem. I tinkered with your syntax and
MGJ> replaced the SET SEED=29147290 command with SET SEED=Random.

The reason for having a fixed seed is that my students need to have
all the same answer (because the will have to select the "correct"
answer as part of a multiple choice exam, I know this is not the best
way, but they have really simple minds, you know). I plan to explain
the difference between a random and a fixed seed, make them modify the
code and see what happens, but I need to have "standardized" results
in order to examine them.

MGJ>  Then I ran 10 iterations of your bootstrap. Here's what I found.

I did that too (random seed and several runs).

MGJ> As you reported, using the R2.exe program, the 95% CI limits should be:
MGJ> Lower:   Upper:
MGJ> .3303    .8879

MGJ> What we get from your syntax is:
MGJ>     Lower:   Upper:
MGJ>     .3306    .8861
MGJ>     .3377    .8906
MGJ>     .3358    .8888
MGJ>     .3412    .8875
MGJ>     .3356    .8911
MGJ>     .3169    .8898
MGJ>     .3288    .8886
MGJ>     .3316    .8867
MGJ>     .3396    .8809
MGJ>     .3363    .8861
MGJ>     .3391    .8885

MGJ> Ave:
MGJ>     .3339    .8877
MGJ> SD:
MGJ>     .0069    .0028

MGJ> It seems that what appeared to be fixed deviations from the
MGJ> R2.exe program in your bootstrap was really the consequence of
MGJ> having your random number seed fixed at a specific value.

If you have taken a look at the end of the matrix program, you'll have
seen that the lower & upper limits are computed using in fact
percentyles 1.2 (instead of 2.5) and 98 (instead of 97.5):

- COMPUTE lower=sortedR2(k*0.012).
- COMPUTE upper=sortedR2(1+k*0.98).

If I set the lower & upper limits to the theoretically correct values,
I get very different results.

- COMPUTE lower=sortedR2(k*0.025).
- COMPUTE upper=sortedR2(1+k*0.975).

Adding and extra nested loop to get 10 different CI (based on the
average of 10 runs of 1000 boostrapped samples), I get the following:

Sample & Population (Adjusted) R-square
 R-Square P-Square
     .716     .694

Rho-Square CI: 10 Means of 10 bootstrap runs (1000 reps. each)
   LowerCI  UpperCI
     .3972    .8806
     .4072    .8808
     .4120    .8803
     .4057    .8806
     .4013    .8836
     .4065    .8825
     .4107    .8782
     .4109    .8796
     .3998    .8796
     .4150    .8816

Mean .4066    .8807
SD   .0058    .0015
Min  .3972    .8782
Max  .4150    .8836

As you can see, the upper limit is systematically slightly lower than
the one reportad by Steiger&Fouladi, and the lower limit is always
much higher than the one reported by S&F. That's what I was talking
about when I said that my boostrap CI were biased.

MGJ> The Steiger & Fouladi program allows 2 options for
MGJ> computing the CI; one quicker and less exact and the other slower
MGJ> but more precise. If the R2.exe results are based on the latter,
MGJ> perhaps it would be useful to increase your bootstrap parameters
MGJ> to something like h = 100 and k = 10000.

Ouch! That will take really long, I can't do that with my students
(the University has a lot -really a lot- of computers, and the
configuration is quite basic: only 256 Mb of RAM and some are a bit
old: low speed processor). Nonetheless, I'll try that at home, with my
own computer (perhaps while I'm cooking, or something like that).

I had been using the speed method in R2.exe. Using the algorithm to
maximize accuracy, the results are these:

Lower Limit: 0.33036
Upper Limit: 0.88852

Quite close to the ones obtained with the speed method:

Lower Limit: 0.33029
Upper Limit: 0.88794

MGJ> On the other hand, it may be that your program is more precise
MGJ> and theirs more variable. I don't recall the parameters they used
MGJ> to generate their estimates.

Michael Healy suggested that I tested different datasets, that it
could be due to this particular dataset (small sample size and perhaps
some characteristics of its distribution). I'm going to do that now,
also, I think I'll leave the limits in their theoretically correct
places (percentyles 2.5 & 97.5).

Thanks a lot for your interest in this little problem.

Marta
Reply | Threaded
Open this post in threaded view
|

Confidence Interval for R-square: R2.exe (S&F) vs Bootstrapping. Epilogue

Marta García-Granero
In reply to this post by Meyer, Gregory J
Hi everybody:

This message closes (hopefully) this thread.

Conclusions:

1) The first dataset was *partially* responsible of the discrepancies
   between bootstrapping & R2.exe. Thanks to Michael Healy for pointing
   that possibility. Anyway, with a different (bigger, residuals
   normally distributed, and linear relationship OK), I still find a
   small discrepancy at the lower limit (not important, but persistent
   along different runs). I can live with that.
2) Eliminating spurious precision helps a bit (thanks Art for that
   private comment).
3) Changing a bit the boostrapping conditions (increasing k, the
   number of repetitions) improved the results (thanks Greg). I have
   also eliminated the 10 runs and average of the 10 lower & upper
   limits (since I get more stable results after increasing k to
   10000). I use now a random seed, since the differences in these
   conditions are alwyas below the second decimal place.

The problem is that the code now takes around 15 seconds to run (and
my computer has a faster processor and more RAM than the average
computer at the University). I daren't think what will happen when I
write code for boostrapping Rho-square from multiple regression
models...

This is the final code, with a new dataset:

* Sample dataset (random sample of 50 cases from Rosner's dataset FEV&Smoke):
  Residuals follow a normal distribution, there is also no evidence of lack
  of linear relationship *.

DATA LIST FREE/fev(F8.3) hgt(F8.1).
BEGIN DATA
1.987 58.5 1.735 54.0 2.604 61.5 2.980 60.0 3.000 65.5
1.776 51.0 2.531 58.0 3.842 69.0 1.751 58.0 1.698 54.5
1.697 59.0 2.288 61.5 3.114 64.5 2.135 58.5 1.759 53.0
2.048 64.5 1.658 53.0 1.789 52.0 3.004 64.0 2.503 63.0
2.316 59.5 1.704 51.0 1.606 57.5 1.165 47.0 2.164 60.0
2.639 63.0 1.728 56.5 2.303 57.0 2.382 62.0 1.535 55.0
1.514 52.0 2.524 64.0 3.490 67.0 2.292 63.0 2.889 64.0
2.957 64.5 2.250 58.0 2.633 62.0 2.417 62.5 4.273 72.5
2.751 63.0 3.774 67.0 3.169 64.0 2.704 61.0 3.255 66.0
2.901 59.5 3.680 67.0 3.022 61.5 3.780 70.0 2.822 69.5
END DATA.
VARIABLE LABEL hgt 'Height (inches)' /fev 'FEV (l)'.

REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT fev
  /METHOD=ENTER hgt  .

* Using R2.exe:   95%CI Lower Limit: 0.63; Upper Limit: 0.86 *.
* Using Boostrap: 95%CI Lower Limit: 0.65; Upper Limit: 0.86 *.

* Bootstrapping conditions: k=10000; random seed *.
PRESERVE.
SET MXLOOPS=10000.
MATRIX.
PRINT /TITLE='Confidence Interval for Rho-square in Simple Linear Regression'.
GET data /VAR=hgt fev /MISSING OMIT.
* Sample statistics *.
COMPUTE n=NROW(data).
COMPUTE mean=CSUM(data)/n.
COMPUTE variance=(CSSQ(data)-n&*(mean&**2))/(n-1).
COMPUTE x=data(:,1).
COMPUTE y=data(:,2).
COMPUTE covxy=((T(x)*y)-n*mean(1)*mean(2))/(n-1).
COMPUTE r=covxy/SQRT(variance(1)*variance(2)).
COMPUTE b=covxy/variance(1).
COMPUTE a=mean(2)-b*mean(1).
PRINT {a,b}
 /FORMAT='F8.3'
 /CLABEL='a','b'
 /TITLE='Regression line'.
PRINT {r**2,1-(1-r**2)*(n-1)/(n-2)}
 /FORMAT='F8.2'
 /CLABEL='R-Square','P-Square'
 /TITLE='Sample R-square & Adjusted (Population) Rho-square'.
* Bootstraping conditions *.
COMPUTE k=10000. /* Nr of bootsamples (if you increase it, increase also MXLOOPS) *.
PRINT k
 /FORMAT='F8.0'
 /TITLE='Boostrapping conditions: k (Nr. reps.)'.
* Bootstrapping P2 (Rho-square) *.
COMPUTE bootP2=MAKE(k,1,0).
COMPUTE bootsamp=MAKE(n,2,0).
LOOP i=1 TO k.   /* Extracting K bootstrap samples *.
- LOOP j= 1 TO n. /* of sample size n (sampling with replacement) *.
-  COMPUTE flipcoin=1+TRUNC(n*UNIFORM(1,1)). /* Random number between 1 & n  *.
-  COMPUTE bootsamp(j,:)=data(flipcoin,:)).  /* A case is selected at random *.
- END LOOP.                                  /* Until n cases are selected   *.
* Statistics for every bootsample *.
- COMPUTE mean=CSUM(bootsamp)/n.
- COMPUTE variance=(CSSQ(bootsamp)-n&*(mean&**2))/(n-1).
- COMPUTE x=bootsamp(:,1).
- COMPUTE y=bootsamp(:,2).
- COMPUTE covxy=((T(x)*y)-n*mean(1)*mean(2))/(n-1).
- COMPUTE r=covxy/SQRT(variance(1)*variance(2)).
- COMPUTE bootP2(i)=1-(1-r**2)*(n-1)/(n-2).
END LOOP.
* Ordered array: sorting algorithm by R Ristow & J Peck *.
COMPUTE sortedP2=bootP2.
COMPUTE sortedP2(GRADE(bootP2))=bootP2.
* NP confidence interval (percentyles 2.5 & 97.5) *.
COMPUTE lower=sortedP2(k*0.025).
COMPUTE upper=sortedP2(1+k*0.975).
PRINT {lower,upper}
 /FORMAT='F8.2'
 /TITLE='Rho-Square Bootstraped CI:'
 /CLABEL='LowerCI','UpperCI'.
END MATRIX.