Correction: I meant to say AHEX36, not A36.

Date: Wed, 29 Jun 2011 20:23:37 -0600
From: [hidden email]
Subject: Re: match file with string variable
To: [hidden email]

Right. The problem couldn't be trailing blanks, but it could be nonprinting characters such as tabs or the popular French non-breaking space character, which look like spaces and can differ between the two files. If you really want to figure out what the problem was, change the variable formats to A36. Then you can see the numerical codes for those "blanks". A true blank would be hex 20. In code page mode, a non-breaking space would be A0.
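[A minimal sketch of that hex inspection, assuming a hypothetical key variable named key declared as A18 (AHEX widths are double the A width, hence AHEX36):]

* Display the raw byte codes behind the string values (hypothetical variable).
formats key (ahex36).
list variables=key /cases=from 1 to 10.
* hex 20 = true blank; a0 = code-page non-breaking space.
* Restore the readable format afterwards.
formats key (a18).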
Hi,
I have a friend who is currently working on 1.8 million samples of data. He is trying to use logistic regression with a dependent variable holding "Event" and "Non-Event" records. However, in his Hosmer-Lemeshow goodness of fit, the p-value is significant, meaning "observed not equal predicted". May I know the following:
I appreciate any advice on this issue. Thanks.

Dorraj Oet
It may generate syntax.
Mark Webb
In reply to this post by E. Bernardo
See the VARIANCE function under COMPUTE, and SELECT IF or FILTER!!!
----
Please reply to the list and not to my personal email.
Those desiring my consulting or training services, please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
In reply to this post by E. Bernardo
Hello,
Simply use the standard deviation:
data list free / x (f) y (f) z (f).
begin data
1 1 1
0 1 0
1 1 1
0 0 0
1 1 0
end data.
compute same = sd(x to z) eq 0.

Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
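[To actually drop the flagged cases, a follow-up along the lines of David Marso's SELECT IF hint would be, as a sketch (same is the flag computed above):]

* Keep only cases whose values differ across x, y and z.
select if (same = 0).
execute.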
In reply to this post by Jarrod Teo-2
On 30/06/2011 7:21, DorraJ Oet wrote:
Hi Dorraj:

Sometimes, a missing predictor in a model can cause a lack of fit. Since you don't give any information on the model I can't help a lot, but here are some general guidelines/ideas:

a) Check for interaction among predictors (those that make sense; don't start throwing everything into the model).
b) Check whether any discarded variable is important (I don't mean significant, but relevant); you might have left out an important confounder. For instance, any model for cardiac events that leaves out smoking habit is probably faulty (even if smoking is non-significant). I hope you didn't construct the model using stepwise methods, did you?
c) Look for non-linear relationships (U- or J-shaped, like the ones observed very often with BMI or age on the risks of certain diseases) by adding squared terms, as sketched below.

A significant p-value doesn't imply that the lack of fit is important; it only means that there is a less than 1 in 20 chance that the observed result in the sample is due to chance alone. With big sample sizes, significant p-values are usually associated with trivial effects. Better than the p-value, examine the differences between observed and predicted values in the 10 categories (the contingency table below the p-value table): plot the observed vs. the expected values, and see whether the relationship looks linear and the points align along the y=x line.

HTH,
Marta GG
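[A minimal sketch of points (a) and (c) in syntax, assuming a hypothetical binary outcome event and hypothetical predictors age and smoke:]

* Squared term for a suspected non-linear age effect, plus an age-by-smoking interaction.
compute age2 = age**2.
compute agesmoke = age*smoke.
logistic regression variables event with age smoke age2 agesmoke
  /print=goodfit.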
In reply to this post by Mark Webb-5
Hi Eins,
Try this if you are only doing it based on one variable. Replace v1 with the variable you want.

Warmest regards
Dorraj Oet

* This portion identifies the duplicate cases.
sort cases by v1.
compute flag=0.
if (v1=lag(v1)) flag=1.
freq flag.

* This portion takes out the duplicate cases identified by flag, using select cases; check that the selection is done properly.
select if (flag = 0).
freq flag.
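[Since the thread's subject asks about equal values on all variables, one hedged extension of the same idea, assuming hypothetical variables v1, v2 and v3, is to sort on every variable and keep only the first case of each identical group (roughly what the Data > Identify Duplicate Cases dialog pastes as syntax):]

* Flag the first case in each group of identical v1-v3 values (hypothetical names).
sort cases by v1 v2 v3.
match files /file=* /by v1 v2 v3 /first=primary.
* Keep one representative per group, dropping the duplicates.
select if primary = 1.
execute.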
In reply to this post by Jarrod Teo-2
Dorraj,

The Hosmer-Lemeshow test is about comparing observed to predicted probabilities in successive deciles of increasing predicted probability. In essence, it partitions the dataset into 10 segments according to the predicted probability of the event, then measures the proportion of events in each decile to estimate the observed probability of the event in that decile. These observed and expected differences are then added up to get a chi-square statistic. In this case you are not looking for a large chi-square (indicating "low probability of the results being due to chance") but for a small chi-square (indicating that the predicted and observed probabilities are similar). (The usual chi-square compares observed frequencies to the frequencies that would arise by chance alone, but in this case the predicted probabilities are not derived from chance; they are predicted by the model.) If the proportion of events in each decile matches the average predicted proportion of events in that decile, then your model is behaving well. To improve the result you could change your model by including more predictors, or interactions, or whatever else the matter requires.

In this regard let me add an additional thought. The criterion used by the H-L test is whether the relative frequency of the event (in each decile) matches the predicted frequency. It is NOT about individual events happening to individual cases with high probability. For example, if the predicted probability in the first decile is (on average) 0.10, a good model will find that the event happened in about 10% of the cases in that decile. This does not mean that case #1 or case #k in that decile can be individually predicted to have or not have suffered the event. The match (and the prediction) is between the relative frequency in the group (the decile) and the average predicted frequency in the decile, not about the individual outcome for each case coinciding with its predicted individual probability. Within a decile where predicted probabilities vary, say, from 0.4 to 0.6, and the average probability is 0.54, the test measures whether the proportion of actual events is more or less 54%, but does not ascertain whether an individual with p=0.58 is more likely to suffer the event than another individual with p=0.42, both within the same decile. In fact, this approach does not make individual predictions of that sort.

I think this is a good take on the notion of probability (and predicted probability), and that is the reason why I do not generally use the "classification table" in which individual events are compared with their probabilities being above or below 0.5. It is perfectly possible that events happen to individuals with low probability, and fail to happen to individuals with high probability: if your predicted (group) probability matches the actual relative frequency, you're OK.

Hector
In reply to this post by Jarrod Teo-2
The first thing that everyone should note is that a cutoff of p=0.05 is nearly always irrelevant when N=1.8 million. Get serious, folks! It is something of a truism that "nothing is really normal," so that when N grows enough, a test for normality will eventually fail. The same is generally true elsewhere; that is, "nothing is really logistic." In fact, very often we are willing to ignore the differences between Normal and Logistic and fit one model or the other, treating the potential differences as irrelevant. Interested in infinitesimal effects, are we? Not often. Occasionally, but not often.

The second thing, separate from the first, is that data sets with N=1.8 million *usually* have an underlying order or system or categories... or several. These make the data inhomogeneous on the large scale, so that "proper tests" have a reduced d.f. that matches (perhaps) the number of categories. Or something.

Either way, every proper, sensible approach starts with the effect size of whatever you are interested in. One generic adaptation for effect size is to translate the statistic, with its p-value, into the minimum N that would yield p=0.05. For instance, if *your* test is a chi-squared with 1 d.f. and a value of 384, then the 5% cutoff of 3.84 implies that your result would be just at the 5% level if your N were 1/100 of your 1.8 million, since the chi-squared test statistic, for a given effect size, increases directly with N. So H-L would reject if there were "only" 18,000 cases fitting this way. My personal criteria have never been adjusted to samples of that size, but that would seem to me like a pretty good fit, regardless of the criterion. If the H-L GOF has 9 d.f., or something else, you would find the corresponding cutoff and adjust your calculations accordingly.

I conclude that there is a good chance that a proper look at the H-L test will show that it is not serious. But I guess from the sample size that a simple logistic model probably will not be a very complete or accurate model.

-- Rich Ulrich
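[Rich's translation to a minimum N is one line of arithmetic; a sketch in syntax, where the chi-square value, d.f. and total N are the hypothetical placeholders from his example:]

data list free / chisq_obs (f8.2) df (f4.0) n_total (f10.0).
begin data
384 1 1800000
end data.
* Critical value at p=0.05 for the given d.f., then the smallest N at which
* this effect size would just reach significance.
compute chisq_crit = idf.chisq(0.95, df).
compute n_min = n_total * (chisq_crit / chisq_obs).
formats chisq_crit (f8.2) n_min (f10.0).
list.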
In reply to this post by Rich Ulrich
Rich, I think you got it wrong, but I may be wrong myself. In my view, the Hosmer-Lemeshow test is NOT a statistical significance test such as the ordinary significance tests you run to ascertain whether a value is "significantly different from zero". The H-L test is applied in logistic regression, no matter what the sample size, to ascertain one and only one thing: whether the observed proportions of events are similar to the predicted probabilities of occurrence.

This test starts by sorting the cases by predicted probability and splitting them into deciles, i.e. subgroups comprising 10 percent of total cases each. The first decile, for instance, may group the lowest ten percent of cases, with predicted probabilities ranging from zero to, say, 0.24, with an average of 0.14; the second decile may comprise another ten percent of cases with predicted probabilities above 0.24 and up to 0.29, with an average predicted probability of 0.27; and so on. Hosmer-Lemeshow compares these average probabilities (0.14, 0.27 and so on) to the actual proportion of events that occurred within each group. The index is a sum of terms of the form (O-E)^2/E. If the sum happens to be zero, the predicted probabilities and the observed relative frequencies coincide perfectly. If the sum is larger, there are discrepancies between observed relative frequencies and predicted probabilities (the discrepancies may happen in any of the deciles). One wants the discrepancies to be low, i.e. one wants the sum to be as low as possible.

In my case, besides using the sum as a chi-square and then applying the chi-square distribution to find out whether the value of the H-L indicator is "significant" (which is of course a function of the number of cases in the sample), I prefer observing a graph showing observed proportions and predicted probabilities in the ten deciles. I recently completed a study based on more than one million households (census data from Bolivia). Of course, even small values of H-L were sometimes "statistically significant", in the sense of being "too large, given the size of the sample, to have arisen by mere chance", though the sheer number of cases caused that not to happen too frequently. However, I preferred to look at the graph to see where the (usually small) differences were more noticeable, in the lower or the middle or the higher deciles, and whether one or two deciles concentrated the differences or the differences were similar across deciles. (By the way, SPSS does not produce that graph, only the observed and predicted proportions, from which one can build the graph in Excel.)

These tests do not test whether a logistic model is appropriate, or the "goodness of fit" of the model to data. But a model predicting different probabilities should be able to produce predicted probabilities (for groupings of people) that somehow match the observed proportions. In my case they matched quite well. Of course, this does not enable you to guess which particular individuals will suffer the event: probabilities (at least in this context) are an attribute of the group. In H-L, the groupings are simply constructed by ordering the predicted probabilities from low to high. But one could use the same approach for different groupings of predictor variables. Suppose the predictors are gender, age and education level (with several age groups and several levels of education); this could generate a number of groups, each with individuals of the same sex, same age group and same education level. Within those groups, predicted probabilities would be equal or very similar, and one can assess whether the observed proportions of events within those groups are close to the predicted probabilities. If the groupings are based on ALL the predictors, the predicted probabilities within each group will be uniform; if some predictor is left out, there might be some variability in predicted probabilities within each group (as within the deciles in the H-L test); however, one works with the AVERAGE predicted probability within each group and compares those averages with the actual proportion of events.

Individual prediction is not possible: if everyone in a group has a predicted probability of, say, about 0.75, you may expect that one quarter of them do not get the event and three quarters do; there is no way to identify in advance which individuals will suffer the event, just as knowing you are in a group with a 75% risk of lung cancer does not tell you whether you or your neighbour will actually get lung cancer. Winston Churchill (fat, heavy drinker and chain smoker) was at high risk of early death all along his long life, till he died of old age a few months before turning 90. He had the same risk as plenty of other people in his risk groups along his life, but it was others who died while he was among the lucky few survivors.

Hector
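[The decile table and plot Hector describes can be assembled with standard syntax; a minimal sketch, assuming a hypothetical binary outcome event and predictors x1 and x2:]

* Save predicted probabilities, cut them into deciles, then compare
* observed event proportions with average predicted probabilities.
logistic regression variables event with x1 x2
  /save pred(phat).
rank variables=phat /ntiles(10) into decile.
aggregate /outfile=* /break=decile
  /obs_prop = mean(event)
  /pred_prop = mean(phat)
  /n = n.
list.
graph /scatterplot(bivar)=pred_prop with obs_prop.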
Hector,
The H-L test is a goodness-of-fit test, as you describe. I don't think I said anything contrary to that, and I don't think I said anything that is irrelevant when it comes to the effects of huge N.

I grant that for ordinary, small N, there is little power for discerning whether the logistic model is the "correct" one, contrasted with other ways of modeling the approach to a dichotomy. But probit models, based on the normal, were a frequent alternative to the logistic before about 1988, when better computerization of the logistic arrived. Fortunately for us all, the cases that truly deserve probit are rarer than the other cases. Using the wrong model (logistic versus other) *ought* to show up as a poorer fit in the tails, with excessive deviations that will be captured by a goodness-of-fit test. I grant that this is subtle; I expect that other problems are more likely to be detected in the usual run of things.

By coincidence, there was a similar problem posted today in the Usenet group sci.stat.consult. I hope that Brendan Halpin won't mind my re-posting his reply here.

***from s.s.c.
Newsgroups: sci.stat.consult
Subject: Re: Hosmer Lemeshaw Test and Large Samples
Date: Thu, 30 Jun 2011 15:26:32 +0100

I thought the H-L test was out of favour these days, even with H & L. See http://www.biostat.wustl.edu/archives/html/s-news/1999-04/msg00147.html for an explanation, and note that Harrell has implemented this new test for R in his rms package.

Mind you, other comments suggest the problems with the H-L test are in the direction of failing to detect lack of fit, so this may not help you!

Brendan
--
Brendan Halpin, Department of Sociology, University of Limerick, Ireland
***end cite.
In reply to this post by Jon K Peck
Dear all,
thanks for the comments and suggestions. I think Jon is right. On an 11-hour plane trip I was able to figure out the issue (in a different way than Jon).

Recap: Initially I had a key variable in two files to match. This didn't work. At first I thought it was a leading blank: a LIST command showed the string key variable values with an extra position (not a real blank, but a character with a small dot in the middle, not a period), apparently a non-printing one. This character is not visible in the data editor or the syntax editor. I copied the character to the syntax file to have it removed with LTRIM. At first I was confused because the copied character did not show up in the syntax, until I ran the syntax and it did show up in the output.

Where the character came from I do not know. I downloaded the data using Twitter's API into a csv file. In an ASCII editor nothing shows up. After importing the text into SPSS it suddenly appears. I have SPSS running in Unicode mode.

Still, it is solved for now, but I will check using Jon Peck's suggestion of changing the string variable to the AHEX36 format.

thanks to all,
Maurice

--
___________________________________________________________________
Maurice Vergeer
Department of communication, Radboud University (www.ru.nl)
PO Box 9104, NL-6500 HE Nijmegen, The Netherlands
Visiting Professor Yeungnam University, Gyeongsan, South Korea

Webspace
www.mauricevergeer.nl
http://blog.mauricevergeer.nl/
www.journalisteninhetdigitaletijdperk.nl
maurice.vergeer (skype)
___________________________________________________________________
In reply to this post by Rich Ulrich
Of course, Rich, you did not say anything contrary to my comments. You just happened to be the one commenting before me in this thread.

Hector
I did a little simulation. I generated a normally distributed variable (theta) as a function of two independent normally distributed IVs. I then created a new dichotomous variable by selecting values of theta greater than one. I did this 1000 times with samples of 500, 50,000, and 100,000. The Hosmer-Lemeshow test was significant 4.1% of the time when n=500, but 33.5% of the time when n was 50,000, and 66.6% of the time when n was 100,000. So it does appear that the test becomes more sensitive when n is large. Thus, very small deviations from the model may result in significant lack of fit when the sample size is large. This is similar to the problem in SEM, where large samples are much more likely to show lack of fit than small samples. Yes, there is a significant lack of fit, but is it substantial enough to invalidate the findings of the logistic model?

Dr. Paul R. Swank, Professor
Children's Learning Institute
University of Texas Health Science Center-Houston
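[One replication of a simulation along these lines can be sketched as follows; the coefficients, error term and threshold are illustrative assumptions, not necessarily Dr. Swank's exact setup:]

set seed = 12345.
* Generate two independent normal IVs for one replication of n=50000.
input program.
loop #i = 1 to 50000.
compute x1 = rv.normal(0,1).
compute x2 = rv.normal(0,1).
end case.
end loop.
end file.
end input program.
execute.
* Latent variable, dichotomized at theta > 1.
compute theta = x1 + x2 + rv.normal(0,1).
compute event = (theta > 1).
* GOODFIT requests the Hosmer-Lemeshow test.
logistic regression variables event with x1 x2
  /print=goodfit.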
Hi Dr Paul,
Thanks for the simulation. This is what I suspected as well: as the sample grows larger and larger, H-L becomes significant. However, may I boldly say that with a large data file, data-mining models (maybe CHAID or C5.0, for profiling) could be a better choice than statistical models? By the way, thank you Hector and Rich; those were interesting discussions.

Warmest regards
Dorraj Oet
In reply to this post by Maurice Vergeer
Dear all,
just a short note with some new info on the problem discussed in this thread some time ago. I had the same problem with a new data set. Taking up Jon's advice, I formatted the string variable as AHEX36. The data showed that the recurring code at the beginning of the string was EFBBBF. Googling this code, I found a Wikipedia page suggesting this is a byte order mark (BOM) in Unicode. I am not sure whether Twitter's API delivers the data in such a way or whether SPSS inserts it when importing a csv file. I think and hope the former, not the latter.

best wishes
Maurice
The BOM indicates that the characters in the file are encoded in UTF-8 (a form of Unicode). This would be coming from the Twitter API. Statistics should recognize that and read the data appropriately. SPSS Statistics will write a BOM when saving a csv file if it is in Unicode mode. If there are any extended characters in the file, you should use Unicode mode in Statistics when reading it.
Jon Peck (no "h") Senior Software Engineer, IBM [hidden email] new phone: 720-342-5621 From: Maurice Vergeer <[hidden email]> To: Jon K Peck/Chicago/IBM@IBMUS Cc: [hidden email] Date: 10/29/2011 02:07 AM Subject: Re: match file with string variable Sent by: [hidden email] Dear all, just a short note on some new info, given the problem discussed in this thread some time ago , I had the same problem with a new data set. Taking up Jon's advice I formatted the string variable as AHEX36. The data showed that the recurring code at the beginning of the string was EFBBBF. Googling this code I found a wikipedia page suggesting this seems to be a byte order mark in unicode. I am not sure whether Twitter's API delevers the data in such a way or whether SPSS inserts it when importing a csv file. I think and hope the former, not the latter. best wishes Maurice On Fri, Jul 1, 2011 at 5:09 AM, Maurice Vergeer <[hidden email]> wrote: > Dear all, > > thanks for the comments and suggestions. I think John is right. > On a 11 hour plane trip I was able to figure out the issue (in a different > way than John). > > Recap: > Initially I I had a key variable in two files to match. this didn't work. > At first I thought it was a leading blank: a list command showed the string > key variable values with an extra position (not a real blank but with a > small dot in the middel, not a period. Apparently a non-printing one. This > character is not visible in the data editor or the syntax editor: I copied > the character to the syntax file to have it removed with LTRIM. At first I > was confused because copying the character did not show up in the syntax, > until I ran the syntax and it did show up in the output. > > Where the character came from I do not know. I downloaded the data using > Twitter's API into a cvs file. In an ASCII editor nothing shows up. After > importing the text into SPSS it suddenly appears. I have SPSS running in > unicode mode. > > Still, it is solved for now, but will check using Jon Peck's suggesting > changing the string variable to the Ahex36 format. > > thanks to all, > Maurice > > > On Thu, Jun 30, 2011 at 12:41, Jon K Peck <[hidden email]> wrote: >> >> C >> Correction: I meant to say AHEX36, not A36. >> >> ________________________________ >> Date: Wed, 29 Jun 2011 20:23:37 -0600 >> From: [hidden email] >> Subject: Re: match file with string variable >> To: [hidden email] >> >> Right. The problem couldn't be trailing blanks, but it could be >> nonprinting characters such as tabs or the popular French non-breaking space >> character that look like spaces and are different between the two files. If >> you really want to figure out what the problem was, change the variable >> formats to A36. Then you can see the numerical codes for those "blanks". A >> true blank would be hex 20. In code page mode, a non-breaking space would >> be A0. >> > > > > -- > ___________________________________________________________________ > Maurice Vergeer > Department of communication, Radboud University (www.ru.nl) > PO Box 9104, NL-6500 HE Nijmegen, The Netherlands > > Visiting Professor Yeungnam University, Gyeongsan, South Korea > > Recent publications: > -Vergeer, M., Eisinga, R. & Franses, Ph.H. (forthcoming). Supply and demand > effects in television viewing. A time series analysis. Communications - The > European Journal of Communication Research. > -Vergeer, M. Lim, Y.S. Park, H.W. (forthcoming). Mediated Relations: New > Methods to study Online Social Capital. Asian Journal of Communication. 
> -Vergeer, M., Hermans, L., & Sams, S. (forthcoming). Online social networks > and micro-blogging in political campaigning: The exploration of a new > campaign tool and a new campaign style. Party Politics. > -Pleijter, A., Hermans, L. & Vergeer, M. (forthcoming). Journalists and > journalism in the Netherlands. In D. Weaver & L. Willnat, The Global > Journalist in the 21st Century. London: Routledge. > > Webspace > www.mauricevergeer.nl > http://blog.mauricevergeer.nl/ > www.journalisteninhetdigitaletijdperk.nl > maurice.vergeer (skype) > ___________________________________________________________________ > > > > > > -- ___________________________________________________________________ Maurice Vergeer Department of communication, Radboud University (www.ru.nl) PO Box 9104, NL-6500 HE Nijmegen, The Netherlands Visiting Professor Yeungnam University, Gyeongsan, South Korea Recent publications: -Vergeer, M. Lim, Y.S. Park, H.W. (2011). Mediated Relations: New Methods to study Online Social Capital. Asian Journal of Communication, 21(5), 430-449. -Vergeer, M., Hermans, L., & Sams, S. (forthcoming). Online social networks and micro-blogging in political campaigning: The exploration of a new campaign tool and a new campaign style. Party Politics. -Pleijter, A., Hermans, L. & Vergeer, M. (forthcoming). Journalists and journalism in the Netherlands. In D. Weaver & L. Willnat, The Global Journalist in the 21st Century. London: Routledge. -Vergeer, M., Eisinga, R. & Franses, Ph.H. (forthcoming). Supply and demand effects in television viewing. A time series analysis. Communications - The European Journal of Communication Research. Webspace www.mauricevergeer.nl http://blog.mauricevergeer.nl/ www.journalisteninhetdigitaletijdperk.nl maurice.vergeer (skype) ___________________________________________________________________ |