detecting linear combinations/high correlations in a data set

detecting linear combinations/high correlations in a data set

M-24
Hi - I've got a large dataset (over 500 variables, 150K rows) and would like to detect

a) variables that are highly correlated with one another
b) linear combinations of variables likely to cause conditioning problems or correlation matrices that are not positive definite.

Whether or not I sample, the CORRELATIONS procedure won't take more than 100 variables and wouldn't help with b), so I'm working with FACTOR and /EXTRACTION PC.

Question:
---------

Before chiseling the wheel, does someone have the code handy to produce the linear combination coefficients of the input variables leading to singularities? Thanks.

Marc.


Re: detecting linear combinations/high correlations in a data set

Art Kendall
When I pseudorandomly generate 150 cases with 550 variables, I of course get singularities.

Please describe the nature of your data. Then we may be able to make suggestions.
Are these some sort of repeated measures, e.g., items intended to be in scales, prices over time, energy at different wave-lengths, etc?

RELIABILITY can be useful for tracking down singularities. Open a new instance of SPSS. Copy the syntax below to a syntax file. Click <run>. Click <all>.
Then go back to the syntax and put fewer items into the scale. Finally, try using just 150. You will see that the SMC (squared multiple correlation) column now has entries, but they are all 1.000. You can edit the RELIABILITY syntax to produce the whole correlation matrix, but in this instance that would be futile.

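* Simulate 150 cases on 550 independent normal variables (more variables than cases).
new file.
input program.
vector x (550,f3).
loop id = 1 to 150.
loop #p = 1 to 550.
compute x(#p) = rnd(rv.normal(50,10)).
end loop.
end case.
end loop.
end file.
end input program.
* Request all item and scale summaries, including the SMC column.
reliability variables= x1 to x550
  /scale (bigbunch) = x1 to x550
  /SUMMARY =all.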
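
A short note on why the simulated data above cannot avoid singularity. This is a standard linear-algebra bound rather than something stated in the post, and it assumes no variable is constant:

```latex
% n = number of cases, p = number of variables,
% X_c = the n x p data matrix with each column centered at its mean
\operatorname{rank}(R) = \operatorname{rank}(X_c) \le \min(p,\; n - 1)
% with n = 150 cases and p = 550 variables:
\operatorname{rank}(R) \le 149 < 550 \quad\Longrightarrow\quad \det R = 0
```

The same bound applies when the scale is cut to 150 items, since 149 < 150, which is consistent with the SMC column showing 1.000 throughout.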

Art Kendall
Social Research Consultants

Re: detecting linear combinations/high correlations in a data set

E. Bernardo
In reply to this post by M-24
Hi Marc, Art, etc.

What do you mean by "singularities"?
 
Thank you.
Eins

--- On Saturday, 6 June 2009, 12:04 PM, Art Kendall wrote: [quoted message snipped]

Re: detecting linear combinations/high correlations in a data set

Art Kendall
If you run the syntax I posted, you will get a message that the determinant of the correlation matrix is zero. This means that at least one variable is perfectly predictable from one or more of the other variables, and so the matrix cannot be inverted. This happens when there are more variables than cases, when a variable is used two or more times, when dummy variables are used and there are as many dummies as there are values of a nominal variable, etc.
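
To make the dummy-variable case concrete, here is a minimal sketch in plain Python (not SPSS syntax, and not code from this thread). The `group` variable and its values are invented for illustration, and the helper functions are hypothetical names written for this example only.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def corr_matrix(columns):
    """Correlation matrix of a list of columns."""
    return [[pearson(a, b) for b in columns] for a in columns]

def determinant(m, tol=1e-12):
    """Determinant by Gaussian elimination with partial pivoting.

    A pivot smaller than tol is treated as exactly zero.
    """
    a = [row[:] for row in m]
    n = len(a)
    det = 1.0
    for k in range(n):
        pivot = max(range(k, n), key=lambda i: abs(a[i][k]))
        if abs(a[pivot][k]) < tol:
            return 0.0
        if pivot != k:
            a[k], a[pivot] = a[pivot], a[k]
            det = -det
        det *= a[k][k]
        for i in range(k + 1, n):
            f = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= f * a[k][j]
    return det

# Hypothetical nominal variable with three values, coded as three dummies.
group = [1, 2, 3, 1, 2, 3, 1, 1, 2, 3]
dummies = [[1.0 if g == v else 0.0 for g in group] for v in (1, 2, 3)]

det_all = determinant(corr_matrix(dummies))       # as many dummies as values
det_drop = determinant(corr_matrix(dummies[:2]))  # one dummy left out
print(det_all, det_drop)
```

With all three dummies the determinant comes out as zero, because the three columns always sum to 1. Leaving one dummy out gives a clearly nonzero determinant.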


Art Kendall
Social Research

Re: detecting linear combinations/high correlations in a data set

M-24
Thanks for the code & advice on RELIABILITY.

The data in question are census-based data at the zip level. All values are proportions, with no missing values. There are several groups of variables (ethnicity, household descriptors, and such). The highest correlations are around 0.98. One shortcut is to sample down the data, work through these variable groups by listing pairs of variables with a correlation coefficient > 0.95, and decide which variables to drop. A PC analysis within these variable families can help get an idea of redundancies. A canonical correlation analysis would probably help too when checking across groups.
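
The pair-listing shortcut can be sketched as follows. This is an illustrative Python example under stated assumptions, not SPSS syntax: the variable names and values are made up, `high_corr_pairs` is a hypothetical helper, and the 0.95 cutoff is the one mentioned above, applied to the absolute correlation.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def high_corr_pairs(data, threshold=0.95):
    """Return (name_a, name_b, r) for every pair with |r| above threshold,
    strongest correlations first."""
    pairs = []
    for (name_a, col_a), (name_b, col_b) in combinations(data.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) > threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Hypothetical zip-level proportions; names and values are invented.
data = {
    "pct_owner": [0.62, 0.55, 0.71, 0.40, 0.83, 0.47],
    "pct_renter": [0.38, 0.45, 0.29, 0.60, 0.17, 0.53],
    "pct_under18": [0.21, 0.25, 0.19, 0.30, 0.22, 0.18],
}
flagged = high_corr_pairs(data)
for name_a, name_b, r in flagged:
    print(name_a, name_b, r)
```

Only the owner/renter pair is flagged here, since those two columns are complements of each other. Checking every pair among 500 variables means roughly 125,000 correlations, so running this on a sample of rows first, as suggested above, keeps it quick.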

SAS's PROC PRINCOMP would, if I remember correctly, print a linear combination of the columns when it detected a singularity, and that's the kind of fast diagnostic output I had in mind at this point.
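
One way to get that kind of diagnostic is to compute a basis for the null space of the covariance matrix: each basis vector holds the coefficients of a linear combination of the variables that is constant across all cases. The sketch below is plain Python, not SPSS or SAS output. The column names and data are hypothetical, `covariance` and `null_space` are helper names invented here, and the absolute tolerance `tol` is an assumption that would need tuning to the scale of real data.

```python
def covariance(columns):
    """Sample covariance matrix of a list of equal-length columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centered]
            for ci in centered]

def null_space(m, tol=1e-9):
    """Basis of {c : m c = 0}, found by reduced row echelon form."""
    a = [row[:] for row in m]
    rows, cols = len(a), len(a[0])
    pivots = []
    r = 0
    for c in range(cols):
        if r == rows:
            break
        p = max(range(r, rows), key=lambda i: abs(a[i][c]))
        if abs(a[p][c]) < tol:
            continue  # no pivot: column c depends on earlier columns
        a[r], a[p] = a[p], a[r]
        pv = a[r][c]
        a[r] = [v / pv for v in a[r]]
        for i in range(rows):
            if i != r:
                f = a[i][c]
                a[i] = [vi - f * vr for vi, vr in zip(a[i], a[r])]
        pivots.append(c)
        r += 1
    basis = []
    for free in (c for c in range(cols) if c not in pivots):
        vec = [0.0] * cols
        vec[free] = 1.0
        for row, pc in enumerate(pivots):
            vec[pc] = -a[row][free]
        basis.append(vec)
    return basis

# Hypothetical proportions: the first three columns sum to 1 in every row.
names = ["pct_a", "pct_b", "pct_c", "pct_other"]
columns = [
    [0.5, 0.2, 0.3, 0.6, 0.1],
    [0.3, 0.5, 0.3, 0.1, 0.4],
    [0.2, 0.3, 0.4, 0.3, 0.5],
    [0.7, 0.1, 0.4, 0.9, 0.3],
]
basis = null_space(covariance(columns))
for vec in basis:
    terms = [f"{c:+.3f}*{name}" for c, name in zip(vec, names) if abs(c) > 1e-6]
    print(" ".join(terms), "= constant")
```

For this toy data the single basis vector has coefficient 1 on each of the first three columns and 0 on the fourth, recovering the sum-to-one constraint. Row reduction only finds exact (to tolerance) dependencies; for near-singular, ill-conditioned cases, the eigenvectors belonging to the smallest eigenvalues of the correlation matrix are the usual diagnostic.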

Regards,

Marc.


--- On Sun, 7 Jun 2009 07:37:46 -0400, Art Kendall wrote: [quoted message snipped]

Re: detecting linear combinations/high correlations in a data set

Art Kendall
If I understand correctly, what you have is compositional data. Each subset sums to 1.00, so the correlation matrix of each subset will fail to have an inverse.
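
The reasoning behind that statement can be written out in a few lines. This is a standard derivation added for illustration; the symbols are introduced here and are not from the thread:

```latex
% x_1, ..., x_k: the proportions in one subset, with x_1 + ... + x_k = 1 for every case
\sum_{i=1}^{k} \operatorname{Cov}(x_i, x_j)
  = \operatorname{Cov}\Bigl(\sum_{i=1}^{k} x_i,\; x_j\Bigr)
  = \operatorname{Cov}(1,\, x_j) = 0 \qquad \text{for every } j
% so the covariance matrix \Sigma of the subset satisfies
\Sigma \mathbf{1} = \mathbf{0} \quad\Longrightarrow\quad \det \Sigma = 0
% and the correlation matrix R = D^{-1} \Sigma D^{-1}, with D = \operatorname{diag}(s_1, \dots, s_k), satisfies
R\,(s_1, \dots, s_k)^{\top} = D^{-1} \Sigma \mathbf{1} = \mathbf{0}
```

Here the s_i are the standard deviations of the variables in the subset. Dropping any one variable from the subset removes this exact dependency, although the remaining variables can still be highly correlated.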

If I understand correctly, correspondence analysis can deal with compositional data. An N-battery canonical correlation might be useful if you dropped one variable from each subset. I have not done this, but I have heard that CATEGORIES can do this.

People from the Leiden group are sometimes on this list. You may also want to post asking about analyzing several kinds of compositional data on the Classification Society list.
http://lists.sunysb.edu/index.cgi?A0=CLASS-L


Art Kendall
Social Research Consultants
