Statistics with errors in the x-variable


Statistics with errors in the x-variable

Rudobeck, Emil (LLU)
One of the often ignored assumptions of regression is that the predictor variable cannot have any errors. Since mixed models incorporate the assumptions of regression, I am assuming SPSS MIXED also requires precise predictor measurements. Admittedly, I have had a hard time finding all the exact assumptions of mixed models, since they are not covered even in the most recent book by Brady West. As a result, I am wondering which statistical tests are appropriate when the data have both x and y errors (predictor and response variables). And which of these tests can be done in SPSS?

Some research turned up total/generalized least squares as the answer, but it doesn't seem that SPSS has this option (the R plug-in allows partial least squares). And I am not sure whether the robust regression or bootstrap regression available in SPSS would address the issue either. Suggestions or solutions would be appreciated.

A further possible complication is data that is nonnormal or nonlinear, or both.

Emil
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Re: Statistics with errors in the x-variable

Rich Ulrich
I have never worried much about errors-in-variables because it
does not affect the testing.  My concern was with the tests, and
I never took exact coefficients too seriously.  Neither did anyone else
in my particular area; we paid attention to the testing.  If the tests
are your concern, then do not worry.
 
A method that takes e-in-v  into account will produce a different
estimate of coefficients.  The notion here is the same as "correcting
for attenuation" when looking at simple Pearson r's.  That is not
something that most people do.  Or expect.
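
The attenuation described above is easy to check by simulation. The following is a sketch in Python (numpy only; all variances are made up for illustration) of the classical result that the OLS slope shrinks by the reliability ratio var(true x) / (var(true x) + var(error)):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100_000
beta = 2.0
sigma_x, sigma_e = 1.0, 0.5          # SD of true x and of the x measurement error

x_true = rng.normal(0, sigma_x, n)
y = beta * x_true + rng.normal(0, 1, n)
x_obs = x_true + rng.normal(0, sigma_e, n)   # x recorded with error

# OLS slope of y on the error-contaminated x
b_naive = np.cov(x_obs, y)[0, 1] / np.var(x_obs, ddof=1)

# Classical attenuation: E[b_naive] is roughly beta * reliability, where
# reliability = var(true x) / (var(true x) + var(error))
reliability = sigma_x**2 / (sigma_x**2 + sigma_e**2)

print(b_naive)                 # near beta * reliability = 2 * 0.8 = 1.6
print(b_naive / reliability)   # "corrected for attenuation", near 2
```

The slope's test against zero still rejects comfortably in this setup, which is the sense in which the testing is less affected than the coefficient itself.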

"Normal" is an assumption that applies to the residuals.  It is the big
outliers, or correlated outliers, that affect the robustness of the testing.

"Nonlinearity" is often misunderstood as a character of the predictor,
whereas it should be applied to the relationship between predictor and
outcome.  I find it useful to think of the "equal-interval" relationship,
where equal-intervals of changes in the predictor should result in equal-
intervals of changes in the outcome.

--
Rich Ulrich




Re: Statistics with errors in the x-variable

Rudobeck, Emil (LLU)
Hi Rich,

When you say it does not affect the tests, do you mean that the statistical results would be identical whether errors-in-variables are ignored or included? I am wondering if there are any references or simulations to this end. When I was searching the scientific literature (biology), I also found that authors always ignored predictor errors, even excluding horizontal error bars, but biological papers aren't a benchmark of good statistics, so I wasn't sure whether that approach was correct.

There is much to be said about residual normality, since SPSS only outputs conditional residuals for MIXED, yet West says that normality should be assessed using the studentized or standardized residuals/eBLUPs.

Interesting point on the definition of nonlinearity. It would seem this definition will always be satisfied automatically, unless one of the axes is categorical data that's treated as continuous without proper transformation. I've seen some colleagues make a graph using x-axis values that are not equidistant in terms of measurement (20V, 60V, 100V...) yet are plotted and analyzed as such (1, 2, 3, ...). This is essentially a rank transformation. For categorical analyses this doesn't matter, except when these values are repeated measures and the users are trying to establish polynomial relationships with rANOVA.
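
The rank-transformation distortion is easy to demonstrate with a toy example. The sketch below uses hypothetical levels 10, 20, 40, 80 (any clearly unequal spacing would do) and shows that a relationship that is exactly linear on the measured scale acquires spurious curvature when the levels are coded 1, 2, 3, 4:

```python
import numpy as np

# Hypothetical unequally spaced levels analyzed as ranks 1, 2, 3, 4
levels = np.array([10.0, 20.0, 40.0, 80.0])
ranks = np.array([1.0, 2.0, 3.0, 4.0])
y = 0.5 * levels          # exactly linear in the measured scale

# Quadratic fits on each coding of the x axis
coef_levels = np.polyfit(levels, y, 2)
coef_ranks = np.polyfit(ranks, y, 2)

print(coef_levels[0])   # ~0: no quadratic term on the true scale
print(coef_ranks[0])    # clearly nonzero: curvature created by the rank coding
```

A polynomial contrast on the rank coding would flag this same spurious quadratic trend.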

Emil



Re: Statistics with errors in the x-variable

Mike
In reply to this post by Rudobeck, Emil (LLU)
Issues of errors-in-x and errors-in-y are covered briefly in the
following source:

Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement,
design, and analysis: An integrated approach. Psychology Press.
NOTE: LEA originally published the text in 1991, but Psychology
Press, which bought the LEA catalog, re-issued a non-updated
version in 2013, which is why one may see 2013 as the
publication year, as on books.google.com:
https://books.google.com/books?hl=en&lr=&id=WXt_NSiqV7wC&oi=fnd&pg=PR2&dq=pedhazur+schmelkin&ots=7svoK7egOQ&sig=o1I40WF9mrHUqnO_kkoHySHzy3I#v=onepage&q=pedhazur%20schmelkin&f=false

Quoting from page 391:

|(for detailed discussions, see Blalock, Wells, & Carter, 1970;
|Bohrnstedt & Carter, 1971; Cochran, 1968, 1970; Linn & Werts,
|1982). Unlike simple regression analysis, random measurement
|errors in multiple regression may lead to either overestimation or
|underestimation of regression coefficients. Further, the biasing
|effects of measurement errors are not limited to the estimation
|of the regression coefficient for the variable being measured but
|affect also estimates of regression coefficients for other variables
|correlated with the variable in question. Thus, estimates of regression
|coefficients for variables measured with high reliability may be
|biased as a result of their correlations with variables measured
|with low reliability.
|
|Generally speaking, the lower the reliabilities of the measures
|used and the higher the intercorrelations among the variables,
|the more adverse the biasing effects of measurement errors.
|Under such circumstances, regression coefficients should be
|interpreted with great circumspection. Caution is particularly
|called for when attempting to interpret magnitudes of standardized
|regression coefficients as indicating the relative importance
|of the variables with which they are associated. It would be wiser
|to refrain from such attempts altogether when measurement
|errors are prevalent.
|
|Thus far, we have not dealt with the effects of errors in the
|measurement of the dependent variable. Such errors do not lead
|to bias in the estimation of the unstandardized regression
|coefficient (b). They do, however, lead to the attenuation of the
|correlation between the independent and the dependent variable,
|hence, to the attenuation of the standardized regression
|coefficient (beta).(footnote18) Because 1 - r^2 (or 1 - R^2 in
|multiple regression analysis) is part of the error term, it can
|be seen that measurement errors in the dependent variable
|reduce the sensitivity of the statistical analysis.
|
|Of various approaches and remedies for managing the magnitude
|of the errors and of taking into account their impact
|on the estimation of model parameters, probably the most
|promising are those incorporated in structural equation modeling
|(SEM). Chapters 23 and 24 are devoted to analytic approaches
|for such models, where it is also shown how measurement
|errors are taken into account when estimating the parameters
|of the model.
|
|Although approaches to managing measurement errors are useful,
|greater benefits would be reaped if researchers were to pay
|more attention to the validity and reliability of measures; if they
|directed their efforts towards optimizing them instead of attempting
|to counteract adverse effects of poorly conceived and poorly
|constructed measures.
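
The spillover described in the quote above (measurement error in one predictor biasing the coefficient of a correlated, reliably measured predictor) can be reproduced with a short simulation. This is a sketch with made-up parameters, using numpy:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Two correlated true predictors; only x1 will be measured with error
x1 = rng.normal(0, 1, n)
x2 = 0.7 * x1 + rng.normal(0, np.sqrt(1 - 0.49), n)   # corr(x1, x2) ~ 0.7
y = 1.0 * x1 + 1.0 * x2 + rng.normal(0, 1, n)         # both true betas = 1

x1_obs = x1 + rng.normal(0, 0.8, n)   # unreliable measurement of x1

# OLS with the contaminated x1: both coefficients are biased,
# including the coefficient of the perfectly measured x2
X = np.column_stack([np.ones(n), x1_obs, x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]
print(b[1], b[2])   # b1 attenuated well below 1; b2 inflated above 1
```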

In psychology it has been traditional to use SEM to construct
a measurement model for the predictors X (true value + error)
and to relate the latent variables to each other and to the outcome
or dependent variable (Y or, if Y is measured with error, its
latent variable).  For an example using AMOS, see:
http://www.spss.com.hk/amos/measurement_error_application.htm

The answer to the question "can one ignore measurement error
in the x-variables" depends on how different the results are between
analyses that ignore them (traditional) and analyses that
incorporate them (e.g., SEM).

For a more extensive presentation on the role of measurement error
in linear and nonlinear models, see:
Carroll, R. J., Ruppert, D., Stefanski, L. A., & Crainiceanu, C. M. (2006).
Measurement error in nonlinear models: A modern perspective. CRC Press.
Parts can be previewed on books.google.com; see:
https://books.google.com/books?id=9kBx5CPZCqkC&pg=PA52&dq=%22multiple+regression%22+%22measurement+error%22&hl=en&sa=X&ved=0ahUKEwiZoIqT0sPMAhUEHx4KHWg1CCoQ6AEIMDAD#v=onepage&q=%22multiple%20regression%22%20%22measurement%20error%22&f=false

Finally, Bayesian methods have also been employed to
deal with this problem and examples of doing this in R are
available on the R-bloggers' website; see:
http://www.r-bloggers.com/bayesian-type-ii-regression/
and
http://www.r-bloggers.com/errors-in-variables-models-in-stan/

-Mike Palij
New York University
[hidden email]




Re: Statistics with errors in the x-variable

Rudobeck, Emil (LLU)
Thanks Mike. As expected, the solution isn't simple. I will need to read up on SEM; luckily I have AMOS. The Bayesian approach will be a much bigger leap. I found out that Deming regression, which can also be used, has been submitted as an enhancement request for SPSS; I don't know when they will incorporate it. I also found out that two-stage least squares regression (2SLS) is an alternative to SEM and is available in SPSS. If anyone here knows the differences between the 2SLS and SEM approaches, it would be interesting to find out. I will have to find some good sources on the practical application of SPSS and AMOS for either 2SLS or SEM.
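
For what it's worth, the Deming slope has a simple closed form, so it can be computed outside SPSS while the enhancement request is pending. A sketch in Python (not an SPSS feature; `delta` is the assumed ratio of the y-error variance to the x-error variance, with `delta = 1` giving orthogonal regression):

```python
import numpy as np

def deming_slope(x, y, delta=1.0):
    """Closed-form Deming regression slope.

    delta = var(y-error) / var(x-error); delta = 1 is orthogonal regression.
    """
    sxx = np.var(x, ddof=1)
    syy = np.var(y, ddof=1)
    sxy = np.cov(x, y)[0, 1]
    d = syy - delta * sxx
    return (d + np.sqrt(d * d + 4.0 * delta * sxy**2)) / (2.0 * sxy)

# Simulated data with equal error variance in x and y (true slope = 2)
rng = np.random.default_rng(2)
n = 50_000
x_true = rng.normal(0, 1, n)
x = x_true + rng.normal(0, 0.5, n)
y = 2.0 * x_true + rng.normal(0, 0.5, n)

b_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
b_deming = deming_slope(x, y, delta=1.0)
print(b_ols, b_deming)   # OLS attenuated (~1.6); Deming near 2
```

The catch is that delta has to come from outside the data (e.g., replicate measurements), which is the same information problem SEM solves with multiple indicators.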

If one has to run SEM to show that the results aren't much different from regression, then I don't see the point, since for a publication both tests would need to be reported. As such, it would make sense to do only SEM to begin with and simplify the findings.

There is a third approach which doesn't seem incorrect to me: if you are experimentally measuring Y11 and Y12 in the same animal, then Y21 and Y22 in another animal, all with the same predefined X (without error), then instead of graphing the means of Y11, Y21 vs Y12, Y22 (which would produce horizontal error), you could simply use ratios to get rid of the horizontal error. In this case it would be the means of Y11/Y12 and Y21/Y22 vs X (no measurement error). The main issue with the latter approach is that ratios are more difficult to explain than a simple Y vs X graph. If I'm overlooking something statistically, let me know.

Emil

Re: Statistics with errors in the x-variable

Jon Peck
I was going to mention 2SLS, which is based on instrumental variables.  IVs were the original (as far as I know) method for dealing with errors-in-variables problems.  Besides the built-in 2SLS command in Statistics, there is an extension command called STATS EQNSYSTEM that provides a variety of estimators for equation systems.

Also, assuming that you have some idea of the error variance, you can explore the sensitivity of your estimates and tests by adding random noise to the variables and seeing how the results change.  Generally you will find that multicollinearity exacerbates the effect of errors in variables.
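
The noise-injection check described above might look something like the following sketch (toy data standing in for the real file; `assumed_error_sd` is whatever error SD you are willing to assume for the predictor):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data in place of the real file (true slope = 1.5)
n = 500
x = rng.normal(0, 1, n)
y = 1.5 * x + rng.normal(0, 1, n)

def slope(x, y):
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Re-estimate many times after adding random noise with the
# assumed error SD to the predictor
assumed_error_sd = 0.5
slopes = [slope(x + rng.normal(0, assumed_error_sd, n), y) for _ in range(1000)]

print(slope(x, y))       # estimate from the data as given
print(np.mean(slopes))   # systematically attenuated toward zero
```

The spread of `slopes` also gives a rough sense of how fragile the estimate is at that assumed error level.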

On Thursday, May 5, 2016, Rudobeck, Emil (LLU) <[hidden email]> wrote:
Thanks Mike. As expected, the solution isn't simple. I will need to read up on SEM. Luckily I have AMOS. The Bayesian approach will be a much bigger leap. I found out that Deming regression, which can also be used, has been submitted as an enhancement request for SPSS. I don't know when they will incorporate it. I also found out that 2 stage least squares regression (2SLS) is an alternative to SEM and is available in SPSS. If anyone here knows the differences between the 2SLS and SEM approaches, it would be interesting to find out. I will have to find some good sources on the practical application of SPSS and AMOS for using either 2SLS or SEM.

If one has to run SEM to show that the results aren't much different than regression, then I don't see the point there since for a publication both tests would need to be reported. As such, it would make sense to do only SEM to begin with and simplify the findings.

There is a third approach which to me doesn't seem to be incorrect: if you are experimentally measuring Y11, Y12 in the same animal, then Y21 and Y22 in another animal, all with the same predefined X (without error), instead of graphing the means of Y11, Y21 vs Y12, Y22 (which would result in horizontal error), another approach would be to simply use ratios to get rid of the horizontal error. So in this case it would be the means of Y11/Y12, Y21/Y22 vs X (no measurement error). The main issue with the latter approach is that ratios are more difficult to explain than a simple Y v X graph. If I'm overlooking something statistically, let me know.

Emil
________________________________________
From: Mike Palij [<a href="javascript:;" onclick="_e(event, &#39;cvml&#39;, &#39;mp26@nyu.edu&#39;)">mp26@...]
Sent: Thursday, May 05, 2016 12:30 PM
To: Rudobeck, Emil (LLU); <a href="javascript:;" onclick="_e(event, &#39;cvml&#39;, &#39;SPSSX-L@LISTSERV.UGA.EDU&#39;)">SPSSX-L@...
Cc: Michael Palij
Subject: Re: Statistics with errors in the x-variable

Issues of errors-in-x and errors-in-y are covered briefly in the
following source:

Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement,
design, and analysis: An integrated approach. Psychology Press.
NOTE: LEA originally published the text in 1991 but the
Psychology Press which bought the LEA catalog has re-issued
a non-updated version in 2013 which is why one may see 2013
as the publication year, as on books.google.com:
https://books.google.com/books?hl=en&lr=&id=WXt_NSiqV7wC&oi=fnd&pg=PR2&dq=pedhazur+schmelkin&ots=7svoK7egOQ&sig=o1I40WF9mrHUqnO_kkoHySHzy3I#v=onepage&q=pedhazur%20schmelkin&f=false

Quoting from page 391:

|(for detailed discussions, see Blalock, Wells, & Carter, 1970;
|Bohrnstedt & Carter, 1971; Cochran, 1968, 1970; Linn & Werts,
|1982). Unlike simple regression analysis, random measurement
|errors in multiple regression may lead to either overestimation or
|underestimation of regression coefficients. Further, the biasing
|effects of measurement errors are not limited to the estimation
|of the regression coefficient for the variable being measured but
|affect also estimates of regression coefficients for other variables
|correlated with the variable in question. Thus, estimates of regression
|coefficients for variables measured with high reliability may be
|biased as a result of their correlations with variables measured
|with low reliability.
|
|Generally speaking, the lower the reliabilities of the measures
|used and the higher the intercorrelations among the variables,
|the more adverse the biasing effects of measurement errors.
|Under such circumstances, regression coefficients should be
|interpreted with great circumspection. Caution is particularly
|called for when attempting to interpret magnitudes of standardized
|regression coefficients as indicating the relative importance
|of the variables with which they are associated. It would be wiser
|to refrain from such attempts altogether when measurement
|errors are prevalent.
|
|Thus far, we have not dealt with the effects of errors in the
|measurement of the dependent variable. Such errors do not lead
|to bias in the estimation of the unstandardized regression
|coefficient (b). They do, however, lead to the attenuation of the
|correlation between the independent and the dependent variable,
|hence, to the attenuation of the standardized regression
|coefficient (beta).(footnote18) Becaute 1 - r^2 (or 1 - R^2 in
|multiple regression analysis) is part of the error term, it can
|be seen that measurement errors in the dependent variable
|reduce the sensitivity of the statistical analysis.
|
|Of various approaches and remedies for managing the magnitude
|of the errors and of taking into taking account of their impact
|on the estimation of model parameters, probably the most
|promising are those incorporated in structural equation modeling
|(SEM). Chapters 23 and 24 are devoted to analytic approaches
|for such models, where it is also shown how measurement
|errors are taken into account when estimating the parameters
|of the model.
|
|Although approaches to managing measurement errors are useful,
|greater benefits would be reaped if researchers were to pay
|more attention to the validity and reliability of measures; if they
|directed their efforts towards optimizing them instead of attempting
|to counteract adverse effects of poorly conceived and poorly
|constructed measures.

In psychology it has been traditional to use SEM to construct
a measurement model for the predictors X (true value + error)
and relating the latent variables to each other and the outcome
or dependent variable (Y or, if Y is measured with error, it's
latent variable).  For an example using AMOS, see:
http://www.spss.com.hk/amos/measurement_error_application.htm

The answer to the question "can one ignore measurement error
in the x-variables" depends on how different the results are between
analyses that ignore it (traditional) and analyses that incorporate
it (e.g., SEM).

For a more extensive presentation on the role of measurement error
in linear and nonlinear models, see:
Carroll, R. J., Ruppert, D., Stefanski, L. A., & Crainiceanu, C. M.
(2006).
Measurement error in nonlinear models: a modern perspective. CRC press.
Parts can be previewed on books.google.com; see:
https://books.google.com/books?id=9kBx5CPZCqkC&pg=PA52&dq=%22multiple+regression%22+%22measurement+error%22&hl=en&sa=X&ved=0ahUKEwiZoIqT0sPMAhUEHx4KHWg1CCoQ6AEIMDAD#v=onepage&q=%22multiple%20regression%22%20%22measurement%20error%22&f=false

Finally, Bayesian methods have also been employed to
deal with this problem and examples of doing this in R are
available on the R-bloggers' website; see:
http://www.r-bloggers.com/bayesian-type-ii-regression/
and
http://www.r-bloggers.com/errors-in-variables-models-in-stan/

-Mike Palij
New York University
mp26@...







=====================
To manage your subscription to SPSSX-L, send a message to
<a href="javascript:;" onclick="_e(event, &#39;cvml&#39;, &#39;LISTSERV@LISTSERV.UGA.EDU&#39;)">LISTSERV@... (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD


--
Jon K Peck
[hidden email]



Re: Statistics with errors in the x-variable

Rich Ulrich
In reply to this post by Rudobeck, Emil (LLU)
Emil, to take your paragraphs in order:

1.  The /test/ results (but not the coefficients) will be robust when ignoring e-in-v.
In some cases, this result is obvious from inspection of the computation of the e-in-v
error terms, which simply translate the robust tests and pretend that the errors still apply
to the new coefficients -- despite some extra (and not error-less) manipulation.  And if you
are considering SEM (or AMOS), they are famous for not providing meaningful tests ... so you
compare your /set/ of models and have to be happy if the best one seems much better than the
others.  They are for estimation of coefficients and alternate paths, not for testing the minimal
existence of an effect.

2.  "Residuals" matter because it is their sums of squares that form the chi-squared variate
that makes up the ANOVA F-test.  They don't have to be really great, but you don't want one or
two outliers contributing half the sum of squares.  If your sample is small, you don't have much
power for  /testing/  normality;  if your sample is large enough, you don't have much concern,
because the F-test will still be pretty good.  (Pay more attention to heterogeneity, or correlation.)
Consider what is being measured... does it seem like equal intervals (with the outcome in mind)?

3.  Nonlinearity.  Apparently you missed my meaning entirely.  Obviously, a linear equation will
produce equal predictions for equal intervals.  But:  Does common sense tell you that is realistic? 
If you don't have a particular outcome in mind, consider the "latent factor" that is supposed to be
measured by your score.

There is a fairly big difference in status (outcome) between scoring 3 errors (failing?) on a dementia scale
versus 0 errors (healthy) out of 31, where there is very little difference between scoring 20 versus 23
(seriously dysfunctional).  For "20 versus 23", you might seriously wonder if the patient would vary by
that much if you re-tested a few hours later.  For the particular scale I have in mind, for a sample that
spanned the range of scores,  I think I recommended using the square root of the number of errors. 
Almost any model building with that score /ought/  to be concerned with that latent factor, and not with
the count of errors.  [Actually, the protocol-scoring reported scores to 31-- which created an unfortunate
bias toward "demented" when a patient was deaf/blind/whatever  and could not be scored on some item.]

Weekly evaluation of a new psychiatric treatment is not done "equal interval" in time from the start of
treatment.  Doing followups at  (4 days, a week, 2 weeks, 4 weeks, 8 weeks)  will be an approximation of
equal intervals for outcome which will not be perfect; but it will be economical, it will avoid the negative
effects of "overtesting", and it will be far more linear in changes than counting days as equal.  Clinical investigators
typically /choose/  their intervals to represent what they expect to approximate equal amounts of change, up to
the maintenance phase.   If you know the "linearity" that the clinician expects, it makes sense to build your
default model on the spacing and then, if you want, test for the departure from the expected linearity.

--
Rich Ulrich


Date: Thu, 5 May 2016 18:02:29 +0000
From: [hidden email]
Subject: Re: Statistics with errors in the x-variable
To: [hidden email]

Hi Rich,

When you say it does not affect the tests, do you mean that the statistical results would be identical whether errors-in-variables are ignored or included? I am wondering if there are any references or simulations to this end. When I was searching the scientific literature (biology), I also found that authors always ignored predictor errors, even excluded horizontal error bars, but biological scientific papers aren't a benchmark of good statistics, so I wasn't sure if that approach was correct.

Much to be said about the residual normality, since SPSS only outputs conditional residuals for MIXED, yet West says that normality should be assessed using the studentized or standardized residuals/eBLUPs.

Interesting point on the definition of nonlinearity. It would seem that definition will always be satisfied automatically, unless one of the axes is categorical data that's treated as continuous without proper transformation. I've seen some colleagues make a graph using x-axis values that are not equidistant in terms of measurement (20V, 60V, 100V...) yet are plotted and analyzed as (1, 2, 3, ...). This is essentially a rank transformation. For categorical analyses this doesn't matter, except when these values are repeated measures and the users are trying to establish polynomial relationships with rANOVA.
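The rank-coding issue can be illustrated with a toy example (a sketch in Python/NumPy; the voltage levels are hypothetical, with a fourth level invented to make the spacing unequal): a response that is exactly linear in volts acquires a spurious quadratic trend once the levels are coded 1, 2, 3, 4.

```python
import numpy as np

# Hypothetical stimulus levels, NOT equally spaced in volts.
volts = np.array([20.0, 60.0, 100.0, 300.0])
rank = np.array([1.0, 2.0, 3.0, 4.0])      # how they are often (mis)coded
y = 0.5 * volts                            # response exactly linear in volts

# Leading (quadratic) coefficient against the true spacing: ~0.
c_volts = np.polyfit(volts, y, 2)[0]
# Against the ranks: a spurious curvature term appears.
c_rank = np.polyfit(rank, y, 2)[0]

print(c_volts, c_rank)
```

The quadratic coefficient is essentially zero when the real voltages are used, and clearly nonzero under rank coding -- exactly the artifact that matters for polynomial contrasts in rANOVA.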

Emil


From: Rich Ulrich [[hidden email]]
Sent: Thursday, May 05, 2016 10:19 AM
To: Rudobeck, Emil (LLU); SPSS list
Subject: RE: Statistics with errors in the x-variable

I have never worried much about errors-in-variables because it
does not affect the testing.  I was concerned about the tests, and
never took exact coefficients too seriously.  Neither did anyone else
in my particular area; we paid attention to the testing.  If the tests
are your concern, then do not be worried.
 
A method that takes e-in-v  into account will produce a different
estimate of coefficients.  The notion here is the same as "correcting
for attenuation" when looking at simple Pearson r's.  That is not
something that most people do.  Or expect.
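The correction for attenuation mentioned above has a simple closed form (Spearman's formula: divide the observed correlation by the square root of the product of the two reliabilities); a sketch in Python with made-up numbers:

```python
import math

def disattenuate(r_xy, rel_x, rel_y):
    """Spearman's correction for attenuation: estimate the correlation
    between the error-free (true) scores from the observed correlation
    and the reliabilities of the two measures."""
    return r_xy / math.sqrt(rel_x * rel_y)

# Hypothetical values: observed r = .42, reliabilities .80 and .70.
r_true = disattenuate(0.42, 0.80, 0.70)
print(round(r_true, 3))
```

With these invented inputs the "corrected" correlation is about .56 -- noticeably larger than the observed .42, which is exactly why most people report the observed value and leave the correction alone.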

"Normal" is an assumption that applies to the residuals.  It is the big
outliers, or correlated outliers, that affect the robustness of the testing.

"Nonlinearity" is often misunderstood as a character of the predictor,
whereas it should be applied to the relationship between predictor and
outcome.  I find it useful to think of the "equal-interval" relationship,
where equal-intervals of changes in the predictor should result in equal-
intervals of changes in the outcome.

--
Rich Ulrich




Re: Statistics with errors in the x-variable

Mike
In reply to this post by Jon Peck
I think that there is some confusion about (a) instrumental variables
and
(b) 2SLS analysis.

Let me suggest the following chapter by Ken Bollen:
Bollen, K. A. (2012). Instrumental variables in sociology and the social
sciences. Annual Review of Sociology, 38, 37-72.

In his Figure 1, he provides the common definition of Instrumental
Variables (IV), namely, there is a covariance/correlation between
a predictor X' and the MODEL error e.  This can occur even if
X is not a latent variable (i.e., X = Xi/Ksi + epsilon-x, in LISREL
notation).

In his Figure 2, Bollen identifies several conditions where X
can be correlated with epsilon-y, the model error:

(1) Figure 2b represents the "measurement error in X" that we
have been discussing so far. OLS regression uses the empirical X,
which combines Xi + epsilon-x -- because Xi is correlated with Y,
epsilon-x will become correlated with epsilon-y (see page 39).
Creating the appropriate measurement model for X, that is,
Xi and epsilon-x, allows one to use Xi in the regression, and
epsilon-x now stands alone, independent of all other entities.

(2) Figure 2a represents a model where empirical X and Y
have a feedback relationship (reciprocal causation) and both
have epsilon terms that are correlated and influence their
associated empirical indicators (i.e., epsilon-x is causally related
to X and epsilon-y is causally related to Y).

(3) Figure 2d assumes X is a lagged version of Y (a measure
of Y at a prior time), which induces an autoregressive relationship
between epsilon-x and epsilon-y, and each affects X and Y
similarly to Figure 2a.

(4) Figure 2c assumes that a variable L is omitted but is causally
related to X and Y.  L's relationship to Y is expressed through
epsilon-y.

So, 2SLS can be used to correct for the correlation between
epsilon-x and epsilon-y, but measurement error in x is just one
situation where this occurs -- one has to determine whether
one's data represent the model in Figure 2b or one of the other
models, which will require different solutions.

Ken Bollen has been studying this situation for a while and he
has suggested various alternative analyses (he does not identify
software package solutions, so one would have to write the
code to identify the appropriate model and then modify the
regression appropriately).  On page 59 Bollen cites one of his
papers where he proposed a 2-stage analysis strategy that can
assist in determining the number of instrumental variables to
use; he reviews other methods that can be used to check one's
model.  His Table 1 (page 64) contains a list of references,
which method of analysis was used (e.g., SEM, 2SLS, etc.)
to evaluate a model.

This was published in 2012 but one might want to look at
an earlier paper by Bollen where he argues for 2-stage analyses:
Bollen, K. A. (1996). An alternative two stage least
squares (2SLS) estimator for latent variable equations.
Psychometrika, 61(1), 109-121.

Bollen has published post 2012 papers that one might also
want to look at:

Bollen, K. A., & Pearl, J. (2013). Eight myths about causality
and structural equation models. In Handbook of causal analysis
for social research (pp. 301-328). Springer Netherlands.

Bollen, K. A., Kolenikov, S., & Bauldry, S. (2014). Model-Implied
Instrumental Variable—Generalized Method of Moments (MIIV-GMM)
Estimators for Latent Variable Models. Psychometrika, 79(1), 20-50.

-Mike Palij
New York University
[hidden email]


----- Original Message -----
From: Jon Peck
To: [hidden email]
Sent: Thursday, May 05, 2016 10:34 PM
Subject: Re: Statistics with errors in the x-variable

I was going to mention 2sls, which is based on instrumental variables.
IVs were the original (as far as I know) method for dealing with errors
in variables problems.  Besides the built-in 2sls command in Statistics,
there is an extension command called stats eqnsystem that provides a
variety of estimators for equation systems.

Also, assuming that you have some idea of the error variance, you can
explore the effect on your estimates and tests by adding random noise to
the variables to see how it affects the results.  Generally you will
find that multicollinearity exacerbates the effect of errors in
variables.
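That sensitivity check can be scripted outside SPSS as well; a hypothetical sketch in Python/NumPy (the 0.9 collinearity and the 0.5 error SD are invented numbers), showing both the attenuation of the perturbed coefficient and how its correlated partner absorbs the lost signal:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Two correlated predictors: collinearity makes e-in-v effects worse.
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=np.sqrt(1 - 0.81), size=n)
y = x1 + x2 + rng.normal(size=n)            # both true slopes = 1

def ols(X, y):
    """OLS slope coefficients (intercept dropped)."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]

b_clean = ols(np.column_stack([x1, x2]), y)

# Perturb x1 with noise at an assumed error SD and refit repeatedly.
coefs = np.array([
    ols(np.column_stack([x1 + rng.normal(scale=0.5, size=n), x2]), y)
    for _ in range(200)
])

print("clean fit:      ", np.round(b_clean, 2))
print("mean noisy fit: ", np.round(coefs.mean(axis=0), 2))
```

With these invented numbers the coefficient on the perturbed x1 collapses well below its true value of 1 while the coefficient on its collinear partner x2 rises above 1 -- the correlated variable soaks up the signal that the measurement error destroys.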

--------------------------------------------
On Thursday, May 5, 2016, Rudobeck, Emil (LLU) <[hidden email]>
wrote:

Thanks Mike. As expected, the solution isn't simple. I will need to read
up on SEM. Luckily I have AMOS. The Bayesian approach will be a much
bigger leap. I found out that Deming regression, which can also be used,
has been submitted as an enhancement request for SPSS. I don't know when
they will incorporate it. I also found out that 2 stage least squares
regression (2SLS) is an alternative to SEM and is available in SPSS. If
anyone here knows the differences between the 2SLS and SEM approaches,
it would be interesting to find out. I will have to find some good
sources on the practical application of SPSS and AMOS for using either
2SLS or SEM.
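While waiting on that enhancement request, Deming regression is simple enough to compute by hand from summary statistics; a sketch in Python/NumPy (simulated data, and it assumes the ratio delta of the y-error variance to the x-error variance is known):

```python
import numpy as np

def deming(x, y, delta=1.0):
    """Deming regression.  `delta` is the assumed ratio of the y-error
    variance to the x-error variance (1.0 = orthogonal regression).
    Unlike OLS, both variables are allowed to carry measurement error."""
    sxx = np.var(x, ddof=1)
    syy = np.var(y, ddof=1)
    sxy = np.cov(x, y, ddof=1)[0, 1]
    slope = (syy - delta * sxx +
             np.sqrt((syy - delta * sxx) ** 2 + 4 * delta * sxy ** 2)
             ) / (2 * sxy)
    return slope, y.mean() - slope * x.mean()

rng = np.random.default_rng(2)
t = rng.normal(0.0, 2.0, 2_000)                # error-free truth
x = t + rng.normal(0.0, 1.0, 2_000)            # equal error in x ...
y = 1.5 * t + rng.normal(0.0, 1.0, 2_000)      # ... and in y

b_ols = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b_dem, _ = deming(x, y, delta=1.0)
print(f"OLS {b_ols:.2f} vs Deming {b_dem:.2f} (truth 1.5)")
```

On this simulated data OLS lands near the attenuated 1.2 while Deming recovers the true 1.5; the catch in practice is justifying the assumed delta.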

If one has to run SEM to show that the results aren't much different
than regression, then I don't see the point there since for a
publication both tests would need to be reported. As such, it would make
sense to do only SEM to begin with and simplify the findings.

There is a third approach which to me doesn't seem to be incorrect: if
you are experimentally measuring Y11, Y12 in the same animal, then Y21
and Y22 in another animal, all with the same predefined X (without
error), instead of graphing the means of Y11, Y21 vs Y12, Y22 (which
would result in horizontal error), another approach would be to simply
use ratios to get rid of the horizontal error. So in this case it would
be the means of Y11/Y12, Y21/Y22 vs X (no measurement error). The main
issue with the latter approach is that ratios are more difficult to
explain than a simple Y v X graph. If I'm overlooking something
statistically, let me know.


Re: Statistics with errors in the x-variable

Rudobeck, Emil (LLU)
In reply to this post by Rich Ulrich
I have not been able to find any references about safely ignoring measurement errors and still achieving unbiased results. Tellinghuisen's Monte Carlo simulations showed that OLS should be used if both X and Y are homoscedastic, which is not easy to satisfy in biological situations. But that's still different because under such conditions the coefficients are accurate, not just the test results. I don't really understand how the coefficients can be inaccurate but the test results accurate when significance testing compares those very same coefficients. Does anyone have any references, especially Monte Carlo simulations? Maybe I'm not using the right keywords in my searches.
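A tiny Monte Carlo sketch (Python/NumPy, invented numbers -- not one of the published simulations asked about) shows one way to reconcile the two claims: under the null of no effect, the noisy x is still just a variable unrelated to y, so the slope test keeps its nominal size; the attenuation bias only pulls a real coefficient toward zero, which costs power rather than validity.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 50, 4_000
t_crit = 2.0106          # two-sided 5% critical value for t(48)

def t_stat(x, y):
    """t statistic for the slope in simple OLS regression."""
    b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    a = y.mean() - b * x.mean()
    resid = y - a - b * x
    se = np.sqrt(resid @ resid / (len(y) - 2) /
                 ((len(x) - 1) * np.var(x, ddof=1)))
    return b / se

# Under H0 (true slope = 0), measurement error in x cannot manufacture
# an effect: the rejection rate stays near the nominal 5%.
rejections = 0
for _ in range(reps):
    x_obs = rng.normal(size=n) + rng.normal(size=n)  # x plus heavy error
    y = rng.normal(size=n)                           # unrelated to x
    rejections += abs(t_stat(x_obs, y)) > t_crit
rate = rejections / reps
print(f"type I error rate: {rate:.3f}")
```

The rejection rate comes out close to 5% despite error variance as large as the true x variance; rerunning with a genuine slope would show the cost instead as a biased coefficient and reduced power.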

For understanding the results or publishing them, the coefficients themselves are rather important to give some idea about the effect size, even if not officially calculated/standardized.

Rich, your point about latent variables and their relationship to linearity seems to be more about the biological theory. If one is using a particular scale or method (hence theory) that has been developed by prior scientists, then any nonlinearity is valid based on that method. One would need to come up with a new method and scale/relationship if there is reason to believe the latent variables are not properly represented. But in either case, even nonlinear data can sometimes be transformed, purely mathematically, into a linear counterpart before worrying about nonlinear analyses, be it a square root or some other transformation that works. If I understood your example, the approach there was mathematical as well -- the theory of how errors should be scaled or measured, or the use of a different measurement system, was not addressed.


Here is the link to the article again, in case the hyperlink above doesn't work: http://www.ncbi.nlm.nih.gov/pubmed/20577693




Re: Statistics with errors in the x-variable

Jon Peck
In reply to this post by Mike
The use of instrumental variables for errors in the independent variables has a long history - back to 1945.  While 2SLS is focused on dealing with the problem of correlation between endogenous regressors and the error term(s) in a set of equations, the use of instrumental variables is not confined to this situation.  In a simultaneous-equations setting, instruments are taken from the exogenous variables in all the equations of the model.  Absent such a model, selection of instruments may be more ad hoc, but variables that are correlated with the systematic portion of the regressors and with neither the equation error terms nor the measurement error are suitable.  Accuracy requires strong correlation between the instruments and the systematic part of the regressors.

I'm having trouble at the moment logging in to JSTOR to access good online references, but on my shelf I have Malinvaud, Statistical Methods of Econometrics; Chapter 10, Linear Models with Errors in Variables, section 7, discusses the mathematical details and properties of the IV estimator.

The 2SLS procedure in Statistics allows you to specify the instrumental variables along with the dependent and predictor variables without explicitly specifying other equations of the model.
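The two-stage logic can be seen in a few lines (simulated data; a sketch of the estimator, not the SPSS 2SLS procedure itself): the instrument z is built to correlate with the systematic part of the regressor but with neither the measurement error nor the equation error, so cov(z, y) / cov(z, x) recovers the true slope.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000

t = rng.normal(size=n)                   # systematic (error-free) part
z = 0.8 * t + rng.normal(size=n)         # instrument: related to t only
x = t + rng.normal(size=n)               # observed regressor, with error
y = 2.0 * t + rng.normal(size=n)         # true slope = 2

# OLS on the noisy x is attenuated toward zero.
b_ols = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

# IV / 2SLS: stage 1 projects x on z, stage 2 regresses y on the fitted
# values; for one regressor this collapses to cov(z, y) / cov(z, x).
b_iv = np.cov(z, y, ddof=1)[0, 1] / np.cov(z, x, ddof=1)[0, 1]

print(f"OLS: {b_ols:.2f}   IV/2SLS: {b_iv:.2f}   (truth: 2.0)")
```

The hard part in real data is, of course, finding a z with those properties -- which is where Bollen's model-implied instruments come in.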

On Fri, May 6, 2016 at 11:24 AM, Mike Palij <[hidden email]> wrote:
I think that there is some confusion about (a) instrumental variables and
(b) 2SLS analysis.

Let me suggest the following chapter by Ken Bollen:
Bollen, K. A. (2012). Instrumental variables in sociology and the social
sciences. Annual Review of Sociology, 38, 37-72.

In his Figure 1, he provide the common definition of Instrumental
Variables (IV), namely, there is a covariance/correlation between
a predictor X' and the MODEL error e.  This can occur evem if
X is not a latent variable (i.e., X = Xi/Ksi + epsilon-x, in LISREL
notiation).

In his Figure 2, Bollwns identifies several conditions where X
can be correlated with epsilon-y, the model error:

(1) Figure 2b represents the "measurement error in X" that we
have been disccusiong so far. OLS regression uses the empirical X
which combine Xi + epsilon-x -- becaise Xi is correlated with Y,
epsilon-x will become correlated with epsilon-y (see page 39).
Creating the appropriate measurement model for X, that is,
Xi and epsilon-x, allows one to use Xi in the regression and
epsilon-x now stands alone, independent of all other entities.

(2) Figure 2a represents a model where empirical X and Y
have a feedback relationship (reciprical causation) and both
have epsilon terms that are correlated and influence their
associated empirical indicators (i.e., epsilon-x is causally related
to X and epsilon-y is causally related to Y).

(3) Figure 2d assumes X is a lagged version of Y (a measure
of Y at a prior time) which induced an autoregressive relationship
between epsilon-x and expsilong-y, and each affect X and Y
similar to that in Figure 2a.

(4) Figure 2c assumes that a variable L is omitted but is causally
related to X and Y.  L relationship to Y is expressed through
epsilon-x.

So, 2SLS can be used to correct for the correlation between
epsilon-x and epsilon-y, but measurement error in X is just one
situation where this occurs -- one has to determine whether
one's data represent the model in Figure 2b or one of the other
models, which will require different solutions.

Ken Bollen has been studying this situation for a while and
has suggested various alternative analyses (he does not identify
software package solutions, so one would have to write the
code to identify the appropriate model and then modify the
regression appropriately).  On page 59 Bollen cites one of his
papers where he proposed a 2-stage analysis strategy that can
assist in determining the number of instrumental variables to
use; he also reviews other methods that can be used to check one's
model.  His Table 1 (page 64) contains a list of references, along
with the method of analysis (e.g., SEM, 2SLS, etc.) used
to evaluate each model.

This was published in 2012 but one might want to look at
an earlier paper by Bollen where he argues for 2-stage analyses:
Bollen, K. A. (1996). An alternative two stage least
squares (2SLS) estimator for latent variable equations.
Psychometrika, 61(1), 109-121.

Bollen has published post 2012 papers that one might also
want to look at:

Bollen, K. A., & Pearl, J. (2013). Eight myths about causality
and structural equation models. In Handbook of causal analysis
for social research (pp. 301-328). Springer Netherlands.

Bollen, K. A., Kolenikov, S., & Bauldry, S. (2014). Model-Implied
Instrumental Variable—Generalized Method of Moments (MIIV-GMM)
Estimators for Latent Variable Models. Psychometrika, 79(1), 20-50.

-Mike Palij
New York University
[hidden email]


----- Original Message ----- From: Jon Peck
To: [hidden email]
Sent: Thursday, May 05, 2016 10:34 PM
Subject: Re: Statistics with errors in the x-variable

I was going to mention 2SLS, which is based on instrumental variables. IVs were the original (as far as I know) method for dealing with errors-in-variables problems.  Besides the built-in 2SLS command in Statistics, there is an extension command called stats eqnsystem that provides a variety of estimators for equation systems.

Also, assuming that you have some idea of the error variance, you can explore the effect on your estimates and tests by adding random noise to the variables to see how it affects the results.  Generally you will find that multicollinearity exacerbates the effect of errors in variables.
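That noise-injection check can be sketched in a few lines; the noise levels and data here are illustrative assumptions, not a prescription. Re-estimate the slope after contaminating x with increasing amounts of random error:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

def slope(xv, yv):
    """OLS slope of yv on xv."""
    return np.cov(xv, yv)[0, 1] / np.var(xv, ddof=1)

# Re-estimate after injecting increasing amounts of noise into x.
results = {}
for sd in (0.0, 0.5, 1.0):
    x_noisy = x + rng.normal(scale=sd, size=n)
    results[sd] = slope(x_noisy, y)
print(results)   # the slope shrinks as the injected error variance grows
```

Each added increment of error variance attenuates the estimate further, which is the sensitivity pattern the suggestion above is meant to reveal.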

--------------------------------------------

On Thursday, May 5, 2016, Rudobeck, Emil (LLU) <[hidden email]> wrote:

Thanks Mike. As expected, the solution isn't simple. I will need to read up on SEM. Luckily I have AMOS. The Bayesian approach will be a much bigger leap. I found out that Deming regression, which can also be used, has been submitted as an enhancement request for SPSS. I don't know when they will incorporate it. I also found out that 2 stage least squares regression (2SLS) is an alternative to SEM and is available in SPSS. If anyone here knows the differences between the 2SLS and SEM approaches, it would be interesting to find out. I will have to find some good sources on the practical application of SPSS and AMOS for using either 2SLS or SEM.
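For reference while the enhancement request is pending: Deming regression has a closed-form slope once the ratio of the two error variances is assumed. A hedged numpy sketch follows; the function name and simulated data are illustrative, and this is not an SPSS implementation.

```python
import numpy as np

def deming_slope(x, y, delta=1.0):
    """Deming regression slope.

    delta is the assumed ratio of y-error variance to x-error variance;
    delta = 1 gives orthogonal regression."""
    sxx = np.var(x, ddof=1)
    syy = np.var(y, ddof=1)
    sxy = np.cov(x, y, ddof=1)[0, 1]
    return (syy - delta * sxx
            + np.sqrt((syy - delta * sxx) ** 2 + 4 * delta * sxy ** 2)) / (2 * sxy)

rng = np.random.default_rng(7)
n = 20_000
xi = rng.normal(scale=2.0, size=n)
x = xi + rng.normal(size=n)          # equal error variance in x ...
y = 1.5 * xi + rng.normal(size=n)    # ... and in y, so delta = 1 is appropriate

b_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
b_dem = deming_slope(x, y, delta=1.0)
print(b_ols, b_dem)   # OLS attenuated toward 1.2; Deming close to the true 1.5
```

Note that delta has to be supplied from outside knowledge of the two measurement procedures; the method is only as good as that assumption.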

If one has to run SEM to show that the results aren't much different than regression, then I don't see the point there since for a publication both tests would need to be reported. As such, it would make sense to do only SEM to begin with and simplify the findings.

There is a third approach which to me doesn't seem to be incorrect: if you are experimentally measuring Y11, Y12 in the same animal, then Y21 and Y22 in another animal, all with the same predefined X (without error), instead of graphing the means of Y11, Y21 vs Y12, Y22 (which would result in horizontal error), another approach would be to simply use ratios to get rid of the horizontal error. So in this case it would be the means of Y11/Y12, Y21/Y22 vs X (no measurement error). The main issue with the latter approach is that ratios are more difficult to explain than a simple Y v X graph. If I'm overlooking something statistically, let me know.





--
Jon K Peck
[hidden email]

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Re: Statistics with errors in the x-variable

Mike
Hi Jon,

With all due respect I have to ask the following questions:

(1)  Did you read any of the references that I listed in my previous
posts? Specifically:

Bollen, K. A. (1996). An alternative two stage least
squares (2SLS) estimator for latent variable equations.
Psychometrika, 61(1), 109-121.

I assume that you are up on the current literature and would not
rely on out-of-date references.

(2) On the IBM SPSS website that provides information on the 2SLS
procedure, they list as the source for the algorithms two books by
Theil from 1953; see:
http://www.ibm.com/support/knowledgecenter/SSLVMB_20.0.0/com.ibm.spss.statistics.help/alg_2sls_references.htm?lang=sl
I note that this is for ver 20 of SPSS, but I have to ask: have there
been no updates to the algorithms, or to how to solve these problems,
since 1953?  My reading of Bollen and others suggests that there may
have been.

SIDENOTE:  One reason I ask about the use of possibly outdated
algorithms is because I recently became aware that AMOS uses a
formula for the "Modification Indexes" or Lagrangian multipliers that
originally comes from LISREL V and which was replaced by a newer
algorithm in LISREL VI.  The SEM software EQS, LISREL, and SAS
Calis all use the newer algorithm and provide the same values for the
modification indexes but AMOS provides different (lower) values.
Tabachnick and Fidell (5th edition) show how all of these programs
handle a common dataset but do not comment on why AMOS gives
different results.  Examination of the AMOS manual would lead one
to think that it also uses the newer algorithm that LISREL VI and
other programs use, but this is clearly wrong.  Which leads one to
wonder (a) why no one has clearly explained why this is being done,
and that one should expect results that do not agree with other
software, and (b) who decides to keep this as a "feature" instead
of updating the software?

(3) I note that the most recent edition of the text by Malinvaud that
you refer to below is 1980 (3rd ed.) according to WorldCat.  I can't
make out what the C10 reference is.  You seem to imply that there have
been no new developments in these areas since 1980 (though Bollen
and others appear to imply otherwise).  Is this true?

Just wondering.

-Mike Palij
New York University
[hidden email]


----- Original Message -----
From: Jon Peck
To: Mike Palij
Cc: [hidden email]
Sent: Friday, May 06, 2016 3:30 PM
Subject: Re: Statistics with errors in the x-variable


Instrumental variables for errors in the independent variables have a
long history - back to 1945.  While 2SLS is focused on dealing with the
problem of correlation between endogenous regressors and the error
term(s) in a set of equations, the use of instrumental variables is not
confined to this situation.  In a simultaneous-equations setting,
instruments are taken from exogenous variables in all the equations of
the model.  Absent such a model, selection of instruments may be more ad
hoc, but variables that are correlated with the systematic portion of
the regressors and with neither the equation error terms nor the
measurement error are suitable.  Accuracy requires strong correlation
between the instruments and the systematic part of the regressors.


I'm having trouble at the moment logging in to JSTOR to access good
online references, but on my shelf I have Malinvaud, Statistical Methods
of Econometrics, and C10, Linear Models with Errors in Variables,
section 7 discusses the mathematical details and properties of the IV
estimator.


The 2SLS procedure in Statistics allows you to specify the instrumental
variables along with the dependent and predictor variables without
explicitly specifying other equations of the model.


--

Jon K Peck
[hidden email]


Re: Statistics with errors in the x-variable

Rich Ulrich
In reply to this post by Rudobeck, Emil (LLU)
For understanding or publishing results in clinical studies like those I participated in, almost
everyone gives the (potentially) biased coefficients and is comfortable with them.  You
want to do what is conventional in your area, if there is a convention.

UNBIASED/ attenuation.
Suppose the IQ of identical twins is correlated at about 0.94.  That is usually sufficient to say.
However, a theoretical work might take into account the "true score" variability of the
IQ test, and report that this IQ is (say) 0.97 "when corrected for attenuation".   Note that I am
describing the r and not the regression coefficient, which your citation favors.   Also, what
I have used -- I have decided that the correlation between two scales is essentially 1.0, based
on inter-correlations at one time and across a short time.  So I don't want to look at the scales
as if they were distinct.  But I would not attempt to put a confidence limit on the 0.97  or on the 1.00.
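The "corrected for attenuation" figure comes from Spearman's classical formula, true-score r = observed r divided by the square root of the product of the two reliabilities. A tiny sketch, with made-up reliabilities rather than values from any actual twin study:

```python
import math

def disattenuate(r_xy, r_xx, r_yy):
    """Spearman's correction for attenuation:
    estimated true-score correlation = r_xy / sqrt(r_xx * r_yy)."""
    return r_xy / math.sqrt(r_xx * r_yy)

# Illustrative numbers only:
print(round(disattenuate(0.94, 0.97, 0.97), 3))   # observed 0.94 rises to ~0.969
print(round(disattenuate(0.94, 0.94, 0.94), 3))   # with reliability 0.94, corrected r is 1.0
```

The second call shows how an observed 0.94 can plausibly be read as "essentially 1.0" once test unreliability is taken into account, which is the kind of judgment described above.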

What I said about ignoring measurement errors is, "Do it, if the purpose is testing, and /not/
unbiased coefficients."  On the other hand, I never mentioned 2SLS.  If you add information from
elsewhere, it is possible that you get tighter tests.  I am not sure whether your citation suggests
that one method discussed in that abstract gives tighter tests after re-estimating x:  I can imagine
that working in the instance of extremely high r's, but not in general.

NONLINEARITY.  I really don't much understand your comments or your objections.  Yes, I'm all for
using natural units when they make sense.  Yes, I'm very familiar with having PIs come to me with
arbitrary scores or measures that deserve transformation in order to make sense ... and, at the
same time, fix problems that we might otherwise detect with subtle testing.

--
Rich Ulrich


Date: Fri, 6 May 2016 18:44:49 +0000
From: [hidden email]
Subject: Re: Statistics with errors in the x-variable
To: [hidden email]

I have not been able to find any references about safely ignoring measurement errors and still achieving unbiased results. Tellinghuisen's Monte Carlo simulations showed that OLS should be used if both X and Y are homoscedastic, which is not easy to satisfy in biological situations. But that's still different because under such conditions the coefficients are accurate, not just the test results. I don't really understand how the coefficients can be inaccurate but the test results accurate, when significance testing compares those very same coefficients. Does anyone have any references, especially Monte Carlo simulations? Maybe I'm not using the right keywords in my searches.

For understanding the results or publishing them, the coefficients themselves are rather important to give some idea about the effect size, even if not officially calculated/standardized.

Rich, your point about latent variables and their relationship to linearity seems to be more about the biological theory. If one is using a particular scale or method (hence theory), which has been developed by prior scientists, then any nonlinearity is valid based on that method. One needs to come up with a new method and a scale/relationship, if there is belief that the latent variables are not properly represented. But in either case, even nonlinear data can sometimes be transformed, purely mathematically, into a linear counterpart before worrying about nonlinear analyses, be it a square root or some other transformation that works. If I understood your example, the approach there was mathematical as well - the theory of how errors should be scaled or measured, or the use of a different measurement system, was not addressed.


Here is the link to the article again, in case the hyperlink above doesn't work: http://www.ncbi.nlm.nih.gov/pubmed/20577693



From: Rich Ulrich [[hidden email]]
Sent: Friday, May 06, 2016 12:20 AM
To: Rudobeck, Emil (LLU); SPSS list
Subject: RE: Statistics with errors in the x-variable

Emil, to take your paragraphs in order:

1.  The /test/ results (but not the coefficients) will be robust when ignoring e-in-v.
In some cases, this result is obvious from inspection of the computation of the e-in-v
error terms, which simply translate the robust tests and pretend that the errors still apply
to the new coefficients -- despite some extra (and not error-less) manipulation.  And if you
are considering SEM (or AMOS), they are famous for not providing meaningful tests ... so you
compare your /set/ of models and have to be happy if the best one seems much better than the
others.  They are for estimation of coefficients and alternate paths, not for testing the minimal
existence of an effect.
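That first point can be checked with a small Monte Carlo; this is my illustrative sketch, not a result from the thread's references. Under H0 the measurement error in x cannot manufacture an association, so the t-test keeps its nominal size, while under an alternative the slope is attenuated but still detectably nonzero.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 100, 2000
T_CRIT = 1.984   # approx. two-sided 5% critical t for df = n - 2 = 98

def slope_t(x, y):
    """t statistic for the slope in simple OLS of y on x."""
    xc, yc = x - x.mean(), y - y.mean()
    b = (xc @ yc) / (xc @ xc)
    resid = yc - b * xc
    se = np.sqrt((resid @ resid) / (n - 2) / (xc @ xc))
    return b / se

def reject_rate(beta):
    """Share of 5%-level rejections when y = beta*xi + noise,
    but the regression uses xi measured with error."""
    hits = 0
    for _ in range(reps):
        xi = rng.normal(size=n)
        x_obs = xi + rng.normal(size=n)     # errors in the x-variable
        y = beta * xi + rng.normal(size=n)
        hits += int(abs(slope_t(x_obs, y)) > T_CRIT)
    return hits / reps

rate_h0 = reject_rate(0.0)
rate_h1 = reject_rate(0.5)
print(rate_h0, rate_h1)   # roughly 0.05 under H0; far higher under the alternative
```

So the test of "is there any effect at all" keeps its size despite the errors in x; what the measurement error costs you is the magnitude of the coefficient, exactly as argued above.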

"Residuals" matter because it is their sums of squares that form the chi-squared variate
that makes up the ANOVA F-test.  They don't have to be really great, but you don't want one or
two outliers contributing half the sum of squares.  If your sample is small, you don't have much
power for  /testing/  normality;  if your sample is large enough, you don't have much concern,
because the F-test will still be pretty good.  (Pay more attention to heterogeneity, or correlation.)
Consider what is being measured... does it seem like equal intervals (with the outcome in mind)?

3.  Nonlinearity.  Apparently you missed my meaning entirely.  Obviously, a linear equation will
produce equal predictions for equal intervals.  But:  Does common sense tell you that is realistic? 
If you don't have a particular outcome in mind, consider the "latent factor" that is supposed to be
measured by your score.

There is a fairly big difference in status (outcome) between scoring 3 errors (failing?) on a dementia scale
versus 0 errors (healthy) out of 31, where there is very little difference between scoring 20 versus 23
(seriously dysfunctional).  For "20 versus 23", you might seriously wonder if the patient would vary by
that much if you re-tested a few hours later.  For the particular scale I have in mind, for a sample that
spanned the range of scores,  I think I recommended using the square root of the number of errors. 
Almost any model building with that score /ought/  to be concerned with that latent factor, and not with
the count of errors.  [Actually, the protocol-scoring reported scores to 31-- which created an unfortunate
bias toward "demented" when a patient was deaf/blind/whatever  and could not be scored on some item.]

Weekly evaluation of a new psychiatric treatment is not done "equal interval" in time from the start of
treatment.  Doing followups at  (4 days, a week, 2 weeks, 4 weeks, 8 weeks)  will be an approximation of
equal intervals for outcome which will not be perfect; but it will be economical, it will avoid the negative
effects of "overtesting", and it will be far more linear in changes than counting days as equal.  Clinical investigators
typically /choose/ their intervals to represent what they expect to approximate equal amounts of change, up to
the maintenance phase.   If you know the "linearity" that the clinician expects, it makes sense to build your
default model on the spacing and then, if you want, test for the departure from the expected linearity.

--
Rich Ulrich

[strip, earlier notes]

