I have not had the opportunity to follow this discussion on the Relationship between Sets of Dependent and Independent Variables, so apologies if this has already been raised. I pick up from the last postings:

In classical inference, the sample size must always outnumber the number of variables by a factor of several. This is to avoid singularity and the resulting spurious correlations that give a false sense of a good fit.

However, in pattern recognition, chemometrics and related studies, the number of variables usually far exceeds the number of cases. Furthermore, you can't discard some of these variables because of what they represent. Methods such as SIMCA and 4 stage partial least squares have been developed in chemometrics to deal with such situations. Very early papers on these include Jain and Dubes (1978), "Feature definition in pattern recognition with small sample sizes", Pattern Recognition, Vol. 10, pp. 85-97, and Albano and Blomqvist (1981), "Pattern recognition by means of disjoint principal components models" (in a conference proceeding). I do not have recent refs just now, but there are many.

There is a joke about statisticians and sample sizes: A fire broke out and several academics were consulted for a way to deal with the fire. The Statistician replied that one fire was just too few to enable him to derive a solution. His recommendation was that more fires should be lit and only then could a solution be modeled.

N Forcheh
University of Botswana

(Disclaimer: there are many variants of this joke and I do not know the originator)

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Hector Maletta
Sent: 22 August 2006 19:10
To: [hidden email]
Subject: Re: Relationship between Sets of Dependent and Independent Variables

It is all the same. You may use, for instance, factor analysis to derive a scale representing the 18 variables of one of the sets, one scale for women and another similar scale for men, but the scale for women will be built with data from a sample of 60 women, i.e.
3.33 women per variable, and that is hardly statistically significant. Take any other sample of 60 women from the same population and the results are likely to be completely different. The margin of error in the factor loadings and the regression coefficients will be very wide, especially if some of the variables are not very highly correlated (r>0.90 or r>0.95) with some of the others.

Data, as the old saying goes, can always be tortured till they confess. But you had better not. There are better ways to get to the truth.

Hector

-----Original Message-----
From: Susan M. Sereika [mailto:[hidden email]]
Sent: Tuesday, August 22, 2006 1:45 PM
To: 'Hector Maletta'
Subject: RE: Relationship between Sets of Dependent and Independent Variables

Dear Hector:

I agree that sample size is problematic. Is there anything that can be salvaged from this? Would it be reasonable to present the work as exploratory? Or perhaps to apply principal components analysis to derive a smaller number of derived variables and conduct the analyses with these derived variables using regression analysis?

Thank you very much for your thoughts on this.

Sincerely,
Susan

-----Original Message-----
From: Hector Maletta [mailto:[hidden email]]
Sent: Tuesday, August 22, 2006 12:09 PM
To: 'Susan M. Sereika'
Subject: RE: Relationship between Sets of Dependent and Independent Variables

Now the situation is clearer, and the answer more definitely negative. The 36 variables are far too many for just 200 cases (below 6 per variable), let alone for 30% of them, i.e. for about 60 women (which is about 1.7 cases per variable, when the old rule of thumb, now discredited for insufficiency, was at least 10; nowadays far more than 10 is usually required, depending on the variance of variables and the strength of the relationship).

Hector

-----Original Message-----
From: Susan M. Sereika [mailto:[hidden email]]
Sent: Tuesday, August 22, 2006 12:54 PM
To: 'Hector Maletta'
Subject: RE: Relationship between Sets of Dependent and Independent Variables

Dear Hector:

Thank you for your very quick and thoughtful reply. The investigation is theoretically driven for the most part with respect to the relationship between the two sets of variables. The idea that the relationship may vary by gender/sex is a little more exploratory, although there is some literature to support some relationships.

Each set of dependent and independent variables consists of 18 variables, and the variables are subscale scores believed to measure two concepts: beliefs about depression (18 variables) and coping (18 variables). The initial investigation focused on just examining the relationships between the two variable sets and, given the complexity of the data, canonical correlation analysis (CCA) was used. Then the investigation was expanded to consider differences between men and women, and a CCA was conducted within each gender subsample. The smaller subsample sizes are problematic, especially for the male subsample.

Sincerely,
Susan

-----Original Message-----
From: Hector Maletta [mailto:[hidden email]]
Sent: Monday, August 21, 2006 12:05 PM
To: 'Susan M. Sereika'; [hidden email]
Subject: RE: Relationship between Sets of Dependent and Independent Variables

You state your friend's goals in a very sketchy way, so it is very difficult to give an opinion. For instance, how many variables are involved? 200 cases may be way too few if the variables happen to be (even moderately) numerous. Is he/she interested in bivariate or multivariate relations between these variables?
For instance, one may be interested in crossing pairs of variables such as X BY Z BY sex, to see whether the association/correlation of X and Z varies with sex. This may be feasible with 200 cases (140 women, 60 men), but with, say, K variables there would be K*(K-1)/2 pairs of variables, which rapidly goes into the hundreds or the thousands as K grows. For K=50, there are 1225 pairs of variables to consider. If one is interested in models involving many variables, such as regression, the number of possible models grows exponentially and, besides, the small number of cases in the sample rapidly becomes a limitation.

Another consideration is whether your friend has any theory or conceptual approach or problem-oriented goal when facing these data, or is just exploring blindly around. What is he/she looking for? Just mining around for any kind of non-random-looking patterns, like an astronomer searching for signs of extra-terrestrial intelligence among random electromagnetic cosmic noise, or like John Nash, he of the beautiful mind, parsing newspapers in the worst of his madness? In a sample of 200 she/he may find many promising patterns, but they may be nothing but sample flukes.

Hector

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Susan M. Sereika
Sent: Monday, August 21, 2006 10:44 AM
To: [hidden email]
Subject: Relationship between Sets of Dependent and Independent Variables

Dear Listserv members:

A colleague is interested in examining the relationship between two sets of variables (dependent variables and independent variables). Additionally, she would like to investigate whether the relationship varies between men and women. The total sample size is somewhat moderate (about 200 participants), with 70% being women. What might be a good approach to use when analyzing these data in light of the objectives? Any suggestions are most appreciated.

Sincerely,
Susan Sereika
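For anyone who wants to check the arithmetic quoted in this thread, here is a minimal Python sketch. The helper names `n_pairs` and `cases_per_variable` are made up for illustration and do not come from any of the posts; the inputs are the figures mentioned above (K=50 variables, 200 cases, a subsample of 60, sets of 18 and 36 variables).

```python
from math import comb


def n_pairs(k):
    """Number of distinct variable pairs among k variables: k*(k-1)/2."""
    return comb(k, 2)


def cases_per_variable(n_cases, n_vars):
    """Ratio of cases to variables, the quantity the rule of thumb refers to."""
    return n_cases / n_vars


print(n_pairs(50))                            # 1225
print(round(cases_per_variable(200, 36), 2))  # 5.56
print(round(cases_per_variable(60, 36), 2))   # 1.67
print(round(cases_per_variable(60, 18), 2))   # 3.33
```

All of these ratios fall well below the "at least 10 cases per variable" rule of thumb mentioned in the thread.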
At 12:47 PM 8/23/2006, FORCHEH, N. (DR.) wrote:
>In classical inference, the sample size must always outnumber the
>number of variables by several factors. This is to avoid singularity
>and the resulting spurious correlations that give a false sense of a
>good fit.
>
>However, in Pattern recognition, Chemometrics and related studies, the
>number of variables usually far exceeds number of cases. Methods such
>as SIMCA and 4 stage partial least squares have been developed in
>Chemometrics to deal with such situations.

To display ignorance and curiosity: can you give us an example?

I bounced around the Web about SIMCA a little. Nothing seemed to make the claim that it could handle many more variables than there are cases, or even any more. From the descriptions I saw, it's based on PCA, which certainly doesn't allow extracting more components than there are cases. Classical statistics or not, it's hard for me to see how you can estimate, from any data, more degrees of freedom than it has.

It looked like some chemometrics cases involve looking for patterns in spectra and the like. Now, if a 'case' is a full spectrum rather than a single numeric value, you may well be able to estimate many more variables than there are 'cases', because a 'case' is actually a large number of observations.

That's the best I could do in a quick scan. Can you give an instance of estimating a "number of variables usually far exceeding the number of cases," using SIMCA or otherwise?

-Many thanks,
Richard Ristow
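The point about PCA not yielding more components than there are cases can be illustrated with a small pure-Python sketch. This is not SIMCA and not taken from any post: the data are random integers, the sizes (6 cases, 40 variables) are arbitrary assumptions, and the function names are hypothetical. It relies only on the fact that the number of non-trivial principal components equals the rank of the column-centered data matrix, which is at most the number of cases minus one.

```python
from fractions import Fraction
import random


def centered(rows):
    """Subtract each column's mean, as PCA does before extracting components."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(Fraction(r[j]) for r in rows) / n for j in range(p)]
    return [[Fraction(r[j]) - means[j] for j in range(p)] for r in rows]


def matrix_rank(rows):
    """Exact rank by Gaussian elimination over Fractions (no rounding error)."""
    m = [list(r) for r in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current pivot position with a nonzero entry.
        pivot = next((i for i in range(rank, n_rows) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate the entries below the pivot.
        for i in range(rank + 1, n_rows):
            factor = m[i][col] / m[rank][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[rank])]
        rank += 1
        if rank == n_rows:
            break
    return rank


random.seed(0)
n_cases, n_vars = 6, 40  # far more variables than cases (assumed sizes)
data = [[random.randint(0, 100) for _ in range(n_vars)] for _ in range(n_cases)]
rank = matrix_rank(centered(data))
print(f"{n_cases} cases, {n_vars} variables, rank of centered data: {rank}")
```

Whatever the number of variables, the printed rank cannot exceed `n_cases - 1`, so at most that many components carry any variance in this setting.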