match file with string variable


match file with string variable

Jon K Peck
Correction: I meant to say AHEX36, not A36.



Date: Wed, 29 Jun 2011 20:23:37 -0600
From: [hidden email]
Subject: Re: match file with string variable
To: [hidden email]

Right.  The problem couldn't be trailing blanks, but it could be nonprinting characters such as tabs or the popular French non-breaking space character that look like spaces and are different between the two files.  If you really want to figure out what the problem was, change the variable formats to A36.  Then you can see the numerical codes for those "blanks".  A true blank would be hex 20.  In code page mode, a non-breaking space would be A0.
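The hex-inspection idea can be mirrored outside SPSS. A minimal Python sketch (illustrative only, not SPSS syntax) that prints the numeric codes behind characters that all render as blanks:

```python
# Reveal the numeric codes behind characters that all look like blanks.
def show_codes(s):
    """Return the hex code for each character in s."""
    return [format(ord(ch), "02X") for ch in s]

true_blank = " "    # ordinary space, hex 20
nbsp = "\xa0"       # non-breaking space, hex A0 in code-page mode
tab = "\t"          # tab, hex 09

print(show_codes(true_blank + nbsp + tab))  # ['20', 'A0', '09']
```

In SPSS itself, switching the variable format to AHEX exposes the same codes directly in the Data Editor.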



Hosmer-Lemeshow Goodness of Fit (Logistic Regression)

Jarrod Teo-2
Hi,

I have a friend who is currently working on 1.8 million samples of data. He is trying to use Logistic Regression with a dependent variable holding "Event" and "Non-Event" records.

However, in his Hosmer-Lemeshow Goodness of Fit, the p-value is significant meaning "observed not equal predicted". 

May I know the following:

  1. Is interaction a culprit in this case?
  2. Is it also because of the large sample size that the Hosmer-Lemeshow Goodness of Fit's p-value is < 0.05?
  3. What would be a solution to this? Should we ignore the Hosmer-Lemeshow Goodness of Fit's p-value being < 0.05 because the sample size is large?
  4. Should we look at other models then?

I appreciate any advice on this issue.

Thanks.
Dorraj Oet

How to delete cases with equal values on all variables

E. Bernardo
Do you know the syntax that deletes cases with equal values on all variables?

Thank you for your help.

Eins

Re: How to delete cases with equal values on all variables

Mark Webb-5
You may want to use the GUI "Identify Duplicate Cases" to flag those with equal values and then delete these.
It may generate syntax.
Mark Webb

On 2011/06/30 08:44 AM, Eins Bernardo wrote:
Do you know the syntax that delete cases with equal values on all variables?

Thank you for your help.

Eins
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD

Re: How to delete cases with equal values on all variables

David Marso
Administrator
In reply to this post by E. Bernardo
See VAR function under COMPUTE and SELECT IF or FILTER!!!
----
Eins Bernardo wrote
Do you know the syntax that delete cases with equal values on all variables?

Thank you for your help.

Eins
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"

Re: How to delete cases with equal values on all variables

Albert-Jan Roskam
In reply to this post by E. Bernardo

Hello,

 

Simply use the standard deviation:

data list free / x (f) y (f) z (f).
begin data
1 1 1
0 1 0
1 1 1
0 0 0
1 1 0
end data.
compute same = sd(x to z) eq 0.
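The same flag can be sketched in Python (illustrative, not SPSS): a zero standard deviation within a row means every value in that row is identical.

```python
# Mirror of "compute same = sd(x to z) eq 0": flag rows whose values
# are all equal (a set with one distinct element has zero spread).
rows = [(1, 1, 1), (0, 1, 0), (1, 1, 1), (0, 0, 0), (1, 1, 0)]

same = [int(len(set(row)) == 1) for row in rows]
print(same)  # [1, 0, 1, 1, 0]
```

Selecting the cases where the flag is 0 would then drop the all-equal rows.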


 
Cheers!!
Albert-Jan


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



From: Eins Bernardo <[hidden email]>
To: [hidden email]
Sent: Thu, June 30, 2011 8:44:29 AM
Subject: [SPSSX-L] How to delete cases with equal values on all variables

Do you know the syntax that delete cases with equal values on all variables?

Thank you for your help.

Eins

Re: Hosmer-Lemeshow Goodness of Fit (Logistic Regression)

Marta Garcia-Granero
In reply to this post by Jarrod Teo-2
On 30/06/2011 7:21, DorraJ Oet wrote:
I have a friend who is currently working on 1.8 million samples of data. He is trying to use Logistic Regression with a dependent variable holding "Event" and "Non-Event" records.

However, in his Hosmer-Lemeshow Goodness of Fit, the p-value is significant meaning "observed not equal predicted". 

May I know the following:

  1. Is interaction a culprit in this case?
  2. Is it also because of the large sample size that the Hosmer-Lemeshow Goodness of Fit's p-value is < 0.05?
  3. What would be a solution to this? Should we ignore the Hosmer-Lemeshow Goodness of Fit's p-value being < 0.05 because the sample size is large?
  4. Should we look at other models then?

Hi Dorraj:

Sometimes, a missing predictor in a model can cause a lack of fit. Since you don't give any information on the model I can't help a lot, but here are some general guidelines/ideas:

a) Check for interaction among predictors (those that make sense; don't start throwing everything into the model).

b) Check whether any discarded variable is important (I don't mean significant, but relevant); you might have left out an important confounder. For instance, any model for cardiac events that leaves out smoking habit is probably faulty, even if smoking is nonsignificant. I hope you didn't construct the model using stepwise methods, did you?

c) Look for nonlinear relationships (U- or J-shaped, like those observed very often with BMI or age on the risks of certain diseases) by adding squared terms.

A significant p-value doesn't imply that the lack of fit is important; it only means there is a less than 1-in-20 chance that the observed result in the sample is due to chance alone. With a big sample size, significant p-values are often attached to trivial effects. Rather than the p-value, examine the differences between observed and predicted values in the 10 categories (the contingency table below the p-value table): plot the observed versus the expected values, and see whether the relationship looks linear and the points align along the y = x line.
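That comparison of the two columns of the contingency table can be sketched in Python (the numbers below are made up purely for illustration):

```python
# Per-decile observed event proportions vs. mean predicted probabilities
# (hypothetical values). If the model fits, the points (expected, observed)
# lie close to the y = x line, so the per-decile gaps are all small.
observed = [0.05, 0.09, 0.14, 0.20, 0.28, 0.37, 0.48, 0.60, 0.73, 0.88]
expected = [0.04, 0.10, 0.15, 0.21, 0.27, 0.38, 0.47, 0.61, 0.72, 0.89]

max_gap = max(abs(o - e) for o, e in zip(observed, expected))
print(round(max_gap, 3))  # 0.01
```

A scatter plot of these pairs against the y = x line makes the same check visually.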

HTH,
Marta GG

Re: How to delete cases with equal values on all variables

Jarrod Teo-2
In reply to this post by Mark Webb-5
Hi Eins,

Try this if you are only doing it based on one variable. Replace v1 with the variable you want.

Warmest regards
Dorraj Oet

*****This portion identifies the duplicate cases*****.

sort cases by v1.
compute flag=0.
if (v1=lag(v1)) flag=1.
freq flag.

*****This portion takes out the duplicate cases identified by flag using SELECT IF, then checks that the selection was done properly*****.

select if (flag = 0).
freq flag.
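The same sort-and-lag logic, sketched in Python for comparison (illustrative, not SPSS). Note the direction of the filter: keeping the unflagged cases deletes the duplicates, while keeping the flagged ones would retain only the duplicates.

```python
# Sort by the key, flag a case as a duplicate when it equals the
# previous case, then keep only the unflagged (first) occurrences.
cases = [3, 1, 2, 1, 3, 3]
cases.sort()                                    # sort cases by v1.
flags = [0] + [int(cases[i] == cases[i - 1])    # if (v1 = lag(v1)) flag = 1.
               for i in range(1, len(cases))]
kept = [v for v, f in zip(cases, flags) if f == 0]
print(kept)  # [1, 2, 3]
```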


Date: Thu, 30 Jun 2011 08:57:10 +0200
From: [hidden email]
Subject: Re: How to delete cases with equal values on all variables
To: [hidden email]

You may want to use the GUI "Identify Duplicate Cases" to flag those with equal values and then delete these.
It may generate syntax.
Mark Webb

On 2011/06/30 08:44 AM, Eins Bernardo wrote:
Do you know the syntax that delete cases with equal values on all variables?

Thank you for your help.

Eins

Re: Hosmer-Lemeshow Goodness of Fit (Logistic Regression)

Hector Maletta
In reply to this post by Jarrod Teo-2

Dorraj,

The Hosmer-Lemeshow test compares observed to predicted probabilities in successive deciles of increasing predicted probability. In essence, it partitions the dataset into 10 segments according to the predicted probability of the event, then measures the proportion of events in each decile to estimate the observed probability of the event in each decile. The differences between these observed and expected values are then combined into a chi-square statistic. In this case you are not looking for a large chi-square (indicating "low probability of the results being due to chance") but for a small chi-square (indicating that the predicted and observed probabilities are similar). (The usual chi-square compares observed frequencies to those that would arise by chance alone, but here the predicted probabilities are not derived from chance; they are predicted by the model.) If the proportion of events in each decile matches the average predicted proportion of events in each decile, then your model is behaving well. If not, you can improve the result by changing your model: including more predictors, or interactions, or whatever else the matter requires.
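The decile construction described above can be sketched in Python (an illustrative simplification, not the exact SPSS computation):

```python
# Partition cases into groups of increasing predicted probability and
# compare the mean predicted probability with the observed event rate.
def hl_table(probs, events, groups=10):
    pairs = sorted(zip(probs, events))   # order by predicted probability
    n = len(pairs)
    table = []
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(e for _, e in chunk) / len(chunk)
        table.append((mean_pred, obs_rate))
    return table
```

A well-behaved model shows `mean_pred` close to `obs_rate` in every decile; summing per-decile terms of the form (O-E)^2/E over this table gives the flavor of the H-L statistic.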

 

In this regard let me add an additional thought. The criterion used by H-L test is whether the relative frequency of the event (in each decile) matches the predicted frequency. It is NOT about individual events happening to individual cases with high probability. For example, if the predicted probability in the first decile is (on average) 0.10, a good model will find that the event happened in about 10% of the cases in that decile. This does not mean that case #1 or case #k in that decile can be individually predicted to have or to not have suffered the event. The match (and the prediction) is for the relative frequency in the group (the decile) with the average predicted frequency in the decile, not about the individual outcome for each case coinciding with its predicted individual probability. Within a decile where predicted probabilities vary, say, from 0.4 to 0.6, and the average probability is 0.54, the test measures whether the proportion of actual events is more or less 54%, but does not ascertain whether an individual with p=0.58 is more likely to suffer the event than another individual with p=0.42, both within the same decile. In fact, this approach does not make individual predictions of the sort.

 

I think this is a good take on the notion of probability (and predicted probability), and that is the reason why I do not generally use the “classification table” in which individual events are compared with their probabilities being above or below 0.5. It is perfectly possible that the events happen to individuals with low probability, and fail to happen to individuals with high probability: if your predicted (group) probability matches the actual relative frequency, you’re OK.

 

Hector

 

From: SPSSX(r) Discussion [mailto:[hidden email]] On behalf of DorraJ Oet
Sent: Thursday, June 30, 2011 02:21
To: [hidden email]
Subject: Hosmer-Lemeshow Goodness of Fit (Logistic Regression)

 

Hi,

 

I have a friend who is currently working on 1.8 million samples of data. He is trying to use Logistic Regression with a dependent variable holding "Event" and "Non-Event" records.

 

However, in his Hosmer-Lemeshow Goodness of Fit, the p-value is significant meaning "observed not equal predicted". 

 

May I know the following:

 

  1. Is interaction a culprit in this case?
  2. Is it also because of the large sample size that the Hosmer-Lemeshow Goodness of Fit's p-value is < 0.05?
  3. What would be a solution to this? Should we ignore the Hosmer-Lemeshow Goodness of Fit's p-value being < 0.05 because the sample size is large?
  4. Should we look at other models then?

 

I appreciate any advice on this issue.

 

Thanks.

Dorraj Oet



Re: Hosmer-Lemeshow Goodness of Fit (Logistic Regression)

Rich Ulrich
In reply to this post by Jarrod Teo-2
The first thing that everyone should note is that a cutoff  "p= 0.05" is
nearly always irrelevant when N=1.8 million for the error.   Get Serious, Folks!
It is sort of a truism that "Nothing is really normal" so that when N grows
enough, a test for normality will eventually fail.  The same is generally true
elsewhere, that is, "Nothing is really logistic".  In fact, very often we are
willing to ignore the differences between Normal and Logistic and create
one model or the other, assuming potential differences as irrelevant.

Interested in infinitesimal effects, are we? - not often.
Occasionally, but not often.

The second thing, separate from the first, is that data sets with N=1.8 million *usually* have an underlying order or system or categories... or several.
These make the data inhomogeneous on the large scale, so that "proper tests"
have a reduced d.f. that matches (perhaps) the number of categories.  Or something.

Either way, every proper, sensible approach starts with the Effect Size of whatever
you are interested in.  One generic adaptation for Effect Size is to translate the
statistic, with its p-value, into the minimum N that would yield p=0.05.  For instance,
if  *your*  "test"  is a chi-squared with 1 d.f., and a value of 384,  then the 5% cutoff
of 3.84  implies that your result would be "just-at" the 5% level if your N was 1/100
of your 1.8 million...  since the chi-squared test statistic, for a given effect size,
increases directly with N.  So H-L  would reject if there were "only" 18,000 cases
fitting this way.  My personal criteria have never been adjusted to samples of that
size, but it would seem to me like a pretty good fit, regardless of the criterion.
- If H-L GOF  has 9 d.f., or something else, you would find the cutoff size, and
adjust your calculations accordingly.
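The "minimum N" translation is simple arithmetic; a quick sketch:

```python
# For a fixed effect size, a chi-squared statistic grows roughly in
# proportion to N, so the N at which the same effect would be "just
# significant" is N * (critical value / observed statistic).
def min_significant_n(n, chi2_value, critical_value):
    return n * critical_value / chi2_value

# Worked example from the text: chi-squared = 384 on 1 d.f. (5% cutoff
# 3.84) observed with N = 1.8 million is just significant at ~1/100 of N.
print(round(min_significant_n(1_800_000, 384.0, 3.84)))  # 18000
```

For a different d.f. (the H-L GOF statistic typically has 8 or 9), substitute the corresponding critical value.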

I conclude that there is a good chance that a proper look at the H-L test will
show that it is not serious.  But I guess from the sample size that a simple logistic
model probably will not be a very complete or accurate model.

--
Rich Ulrich


Date: Thu, 30 Jun 2011 05:21:08 +0000
From: [hidden email]
Subject: Hosmer-Lemeshow Goodness of Fit (Logistic Regression)
To: [hidden email]

Hi,

I have a friend who is currently working on 1.8 million samples of data. He is trying to use Logistic Regression with a dependent variable holding "Event" and "Non-Event" records.

However, in his Hosmer-Lemeshow Goodness of Fit, the p-value is significant meaning "observed not equal predicted". 

May I know the following:

  1. Is interaction a culprit in this case?
  2. Is it also because of the large sample size that the Hosmer-Lemeshow Goodness of Fit's p-value is < 0.05?
  3. What would be a solution to this? Should we ignore the Hosmer-Lemeshow Goodness of Fit's p-value being < 0.05 because the sample size is large?
  4. Should we look at other models then?

I appreciate any advice on this issue.

Thanks.
Dorraj Oet

Automatic reply: Hosmer-Lemeshow Goodness of Fit (Logistic Regression)

Valerie Villella
Thank you for your email. Please note that I am away on vacation, returning Tuesday, July 19, 2011. I will respond to my emails upon my return. If this is an urgent matter please contact Dan Buchanan at [hidden email] or at 905-851-8821 ext. 229.


Re: Hosmer-Lemeshow Goodness of Fit (Logistic Regression)

Hector Maletta
In reply to this post by Rich Ulrich

Rich,

I think you got it wrong, but I can be wrong myself.

In my view, the Hosmer Lemeshow test is NOT a statistical significance test such as the ordinary significance tests you run to ascertain whether a value is “significantly different from zero”.

The H-L test is a test applied in logistic regression, no matter the sample size, to ascertain one and only one thing: whether the observed proportions of events are similar to the predicted probabilities of occurrence. The test starts by sorting the cases by predicted probability and splitting them into deciles, i.e. subgroups each comprising 10 percent of the total cases. The first decile, for instance, may group the lowest ten percent of cases, with predicted probabilities ranging from zero to, say, 0.24, with an average of 0.14; the second decile may comprise another ten percent of cases with predicted probabilities above 0.24 and up to 0.29, with an average predicted probability of 0.27; and so on. Hosmer-Lemeshow compares these average probabilities (0.14, 0.27 and so on) to the actual proportion of events that occurred within each group. The index is a sum of terms of the form (O-E)^2/E. If the sum happens to be zero, the predicted probabilities and observed relative frequencies coincide perfectly. The larger the sum, the larger the discrepancies between observed relative frequencies and predicted probabilities (the discrepancies may occur in any of the deciles). One wants the discrepancies, and therefore the sum, to be as low as possible.

In my own work, besides using the sum as a chi-square and applying the chi-square distribution to find out whether the value of the H-L indicator is "significant" (which is of course a function of the number of cases in the sample), I prefer observing a graph showing observed proportions and predicted probabilities in the ten deciles. I recently completed a study based on more than one million households (Census data from Bolivia). Of course, even small values of H-L were sometimes "statistically significant", in the sense of being "too large, given the size of the sample, to have arisen by mere chance", though the sheer number of cases caused that not to happen too frequently. However, I preferred to look at the graph to see where the (usually small) differences were more noticeable (in the lower, middle or higher deciles), and whether one or two deciles concentrated the differences or the differences were similar across deciles. (By the way, SPSS does not produce that graph, only the observed and predicted proportions, from which one can build the graph in Excel.)

 

These tests do not test whether a logistic model is appropriate, or the “goodness of fit” of the model to data. But a model predicting different probabilities should be able to produce predicted probabilities (for groupings of people) that somehow match the observed proportions. In my case they matched quite well. Of course, this does not enable you to guess which particular individuals will suffer the event: probabilities (at least in this context) are an attribute of the group.

In HL, the groupings are simply constructed by ordering the predicted probabilities from low to high. But one could use the same approach for different groupings of predictor variables. Suppose the predictors are gender, age and education level (with several age groups and several levels of education); this could generate a number of groups, each with individuals of the same sex, same age group and same education level. Within those groups, predicted probabilities would be equal or very similar, and one can assess whether the observed proportions of events within those groups are close to the predicted probabilities. If the groupings are based on ALL the predictors, the predicted probabilities within each group will be uniform; if some predictor is left out, there might be some variability in predicted probabilities within each group (as within the deciles in the H/L test), however one works with the AVERAGE predicted probability within each group, and compares those averages with the actual proportion of events.

 

Individual prediction is not possible: if everyone in a group has a predicted probability of, say, about 0.75, you may expect that one quarter of them do not get the event and three quarters do; there is no way to identify in advance which individuals will suffer the event, just as knowing you are in a group with a 75% risk of lung cancer does not allow you to know whether you or your neighbour will actually have lung cancer. Winston Churchill (fat, heavy drinker and chain smoker) was at a high risk of early death all along his long life, till he died of old age a few months before turning 90. He had the same risk as plenty of other people in his risk groups along his life, but it was others who died while he was among the lucky few survivors.

 

Hector

 

From: SPSSX(r) Discussion [mailto:[hidden email]] On behalf of Rich Ulrich
Sent: Thursday, June 30, 2011 18:02
To: [hidden email]
Subject: Re: Hosmer-Lemeshow Goodness of Fit (Logistic Regression)

 

The first thing that everyone should note is that a cutoff  "p= 0.05" is
nearly always irrelevant when N=1.8 million for the error.   Get Serious, Folks!
It is sort of a truism that "Nothing is really normal" so that when N grows
enough, a test for normality will eventually fail.  The same is generally true
elsewhere, that is, "Nothing is really logistic".  In fact, very often we are
willing to ignore the differences between Normal and Logistic and create
one model or the other, assuming potential differences as irrelevant.

Interested in infinitesimal effects, are we? - not often.
Occasionally, but not often.

The second thing, separate from the first, is that data sets with N=1.8 million *usually* have an underlying order or system or categories... or several.
These make the data inhomogeneous on the large scale, so that "proper tests"
have a reduced d.f. that matches (perhaps) the number of categories.  Or something.

Either way, every proper, sensible approach starts with the Effect Size of whatever
you are interested in.  One generic adaptation for Effect Size is to translate the
statistic, with its p-value, into the minimum N that would yield p=0.05.  For instance,
if  *your*  "test"  is a chi-squared with 1 d.f., and a value of 384,  then the 5% cutoff
of 3.84  implies that your result would be "just-at" the 5% level if your N was 1/100
of your 1.8 million...  since the chi-squared test statistic, for a given effect size,
increases directly with N.  So H-L  would reject if there were "only" 18,000 cases
fitting this way.  My personal criteria have never been adjusted to samples of that
size, but it would seem to me like a pretty good fit, regardless of the criterion.
- If H-L GOF  has 9 d.f., or something else, you would find the cutoff size, and
adjust your calculations accordingly.

I conclude that there is a good chance that a proper look at the H-L test will
show that it is not serious.  But I guess from the sample size that a simple logistic
model probably will not be a very complete or accurate model.

--
Rich Ulrich


Date: Thu, 30 Jun 2011 05:21:08 +0000
From: [hidden email]
Subject: Hosmer-Lemeshow Goodness of Fit (Logistic Regression)
To: [hidden email]

Hi,

 

I have a friend who is currently working on 1.8 million samples of data. He is trying to use Logistic Regression with a dependent variable holding "Event" and "Non-Event" records.

 

However, in his Hosmer-Lemeshow Goodness of Fit, the p-value is significant meaning "observed not equal predicted". 

 

May I know the following:

 

  1. Is interaction a culprit in this case?
  2. Is it also because of the large sample size that the Hosmer-Lemeshow Goodness of Fit's p-value is < 0.05?
  3. What would be a solution to this? Should we ignore the Hosmer-Lemeshow Goodness of Fit's p-value being < 0.05 because the sample size is large?
  4. Should we look at other models then?

 

I appreciate any advice on this issue.

 

Thanks.

Dorraj Oet



Re: Hosmer-Lemeshow Goodness of Fit (Logistic Regression)

Rich Ulrich
Hector,
The H-L test is a Goodness of Fit test, as you describe.  I don't think
that I said anything contrary to that, and I don't think I said anything
that is irrelevant when it comes to the effects of huge N.

I grant that for ordinary, small N, there is little power for discerning whether
the Logistic model is the "correct" one, contrasted with other ways of modeling
the approach to a dichotomy.  But Probit models, based on the normal, were
a frequent alternative to the logistic before about 1988, when better
computerization of the logistic arrived.  Fortunately for us all, the cases that
truly deserve Probit are rarer than the other cases.

Using the wrong model (Logistic versus other) *ought*  to show up as a
poorer fit in the tails -- with excessive deviations that will be captured
by a goodness-of-fit test.  I grant that this is subtle; I expect that other
problems are more likely to be detected in the usual run of things.

By coincidence, there was a similar problem posted today in the Usenet
group, sci.stat.consult.  I hope that Brendan Halpin won't mind my
re-posting his Reply, here.

***from s.s.c.
Newsgroups: sci.stat.consult
Subject: Re: Hosmer Lemeshaw Test and Large Samples
Date: Thu, 30 Jun 2011 15:26:32 +0100
Lines: 17
Message-ID: <[hidden email]>

I thought the H-L test was out of favour these days, even with H & L.

See
http://www.biostat.wustl.edu/archives/html/s-news/1999-04/msg00147.html
for an explanation, and note that Harrell has implemented this new test
for R in his rms package.

Mind you, other comments suggest the problems with the H-L test are in
the direction of failing to detect lack of fit, so this may not help
you!

Brendan
--
Brendan Halpin,   Department of Sociology,   University of Limerick,   Ireland
***end cite.




Date: Thu, 30 Jun 2011 18:54:49 -0300
From: [hidden email]
Subject: Re: Hosmer-Lemeshow Goodness of Fit (Logistic Regression)
To: [hidden email]

Rich,

I think you got it wrong, but I can be wrong myself.

In my view, the Hosmer Lemeshow test is NOT a statistical significance test such as the ordinary significance tests you run to ascertain whether a value is “significantly different from zero”.

The H-L test is a test applied in logistic regression, no matter the sample size, to ascertain one and only one thing: whether the observed proportions of events are similar to the predicted probabilities of occurrence. The test starts by sorting the cases by predicted probability and splitting them into deciles, i.e. subgroups each comprising 10 percent of the total cases. The first decile, for instance, may group the lowest ten percent of cases, with predicted probabilities ranging from zero to, say, 0.24, with an average of 0.14; the second decile may comprise another ten percent of cases with predicted probabilities above 0.24 and up to 0.29, with an average predicted probability of 0.27; and so on. Hosmer-Lemeshow compares these average probabilities (0.14, 0.27 and so on) to the actual proportion of events that occurred within each group. The index is a sum of terms of the form (O-E)^2/E. If the sum happens to be zero, the predicted probabilities and observed relative frequencies coincide perfectly. The larger the sum, the larger the discrepancies between observed relative frequencies and predicted probabilities (the discrepancies may occur in any of the deciles). One wants the discrepancies, and therefore the sum, to be as low as possible.

In my own work, besides using the sum as a chi-square and applying the chi-square distribution to find out whether the value of the H-L indicator is "significant" (which is of course a function of the number of cases in the sample), I prefer observing a graph showing observed proportions and predicted probabilities in the ten deciles. I recently completed a study based on more than one million households (Census data from Bolivia). Of course, even small values of H-L were sometimes "statistically significant", in the sense of being "too large, given the size of the sample, to have arisen by mere chance", though the sheer number of cases caused that not to happen too frequently. However, I preferred to look at the graph to see where the (usually small) differences were more noticeable (in the lower, middle or higher deciles), and whether one or two deciles concentrated the differences or the differences were similar across deciles. (By the way, SPSS does not produce that graph, only the observed and predicted proportions, from which one can build the graph in Excel.)

 

These tests do not test whether a logistic model is appropriate, or the “goodness of fit” of the model to data. But a model predicting different probabilities should be able to produce predicted probabilities (for groupings of people) that somehow match the observed proportions. In my case they matched quite well. Of course, this does not enable you to guess which particular individuals will suffer the event: probabilities (at least in this context) are an attribute of the group.

In HL, the groupings are simply constructed by ordering the predicted probabilities from low to high. But one could use the same approach for different groupings of predictor variables. Suppose the predictors are gender, age and education level (with several age groups and several levels of education); this could generate a number of groups, each with individuals of the same sex, same age group and same education level. Within those groups, predicted probabilities would be equal or very similar, and one can assess whether the observed proportions of events within those groups are close to the predicted probabilities. If the groupings are based on ALL the predictors, the predicted probabilities within each group will be uniform; if some predictor is left out, there might be some variability in predicted probabilities within each group (as within the deciles in the H/L test); however, one works with the AVERAGE predicted probability within each group, and compares those averages with the actual proportion of events.
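A sketch of that covariate-pattern grouping, using hypothetical records (in a real analysis the predicted probabilities would come from the fitted model, not be typed in):

```python
from collections import defaultdict

# Hypothetical records: (gender, age_group, education, predicted_p, event)
cases = [
    ("F", "young", "primary",   0.10, 0),
    ("F", "young", "primary",   0.12, 0),
    ("F", "young", "primary",   0.11, 1),
    ("M", "old",   "secondary", 0.60, 1),
    ("M", "old",   "secondary", 0.62, 1),
    ("M", "old",   "secondary", 0.58, 0),
]

# Group cases by their full covariate pattern.
groups = defaultdict(list)
for gender, age, edu, p, y in cases:
    groups[(gender, age, edu)].append((p, y))

# Compare the AVERAGE predicted probability in each group with the
# observed proportion of events in that group.
for key, members in sorted(groups.items()):
    avg_p = sum(p for p, _ in members) / len(members)
    obs = sum(y for _, y in members) / len(members)
    print(key, round(avg_p, 3), round(obs, 3))
```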

 

Individual prediction is not possible: if everyone in a group has a predicted probability of, say, about 0.75, you may expect that one quarter of them do not get the event, and three quarters do; there is no way to identify in advance which individuals will suffer the event, just as knowing you are in a group with 75% risk of lung cancer does not allow you to know whether you or your neighbour will actually have lung cancer. Winston Churchill (fat, heavy drinker and chain smoker) was at a high risk of early death all along his long life, till he died of old age a few months after turning 90. He had the same risk as plenty of other people in his risk groups along his life, but it was others who died while he was among the lucky few survivors.

 

Hector

 
[snip, previous]


Reply | Threaded
Open this post in threaded view
|

Re: match file with string variable

Maurice Vergeer
In reply to this post by Jon K Peck
Dear all,

thanks for the comments and suggestions. I think Jon is right.
On an 11-hour plane trip I was able to figure out the issue (in a different way than Jon).

Recap:
Initially I had a key variable in two files to match. This didn't work. At first I thought it was a leading blank: a LIST command showed the string key variable values with an extra position (not a real blank, but with a small dot in the middle, not a period). Apparently a non-printing one. This character is not visible in the data editor or the syntax editor: I copied the character to the syntax file to have it removed with LTRIM. At first I was confused because the copied character did not show up in the syntax, until I ran the syntax and it did show up in the output.

Where the character came from I do not know. I downloaded the data using Twitter's API into a csv file. In an ASCII editor nothing shows up. After importing the text into SPSS it suddenly appears. I have SPSS running in Unicode mode.

Still, it is solved for now, but I will check using Jon Peck's suggestion of changing the string variable to the AHEX36 format.

thanks to all,
Maurice


On Thu, Jun 30, 2011 at 12:41, Jon K Peck <[hidden email]> wrote:
Correction: I meant to say AHEX36, not A36.



Date: Wed, 29 Jun 2011 20:23:37 -0600
From: [hidden email]

Subject: Re: match file with string variable

Right.  The problem couldn't be trailing blanks, but it could be nonprinting characters such as tabs or the popular French non-breaking space character that look like spaces and are different between the two files.  If you really want to figure out what the problem was, change the variable formats to A36.  Then you can see the numerical codes for those "blanks".  A true blank would be hex 20.  In code page mode, a non-breaking space would be A0.
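The inspection Jon describes can be reproduced outside SPSS in a few lines of Python; the `ahex` helper below is a hypothetical stand-in for the AHEX format, dumping each character's code point in hex:

```python
def ahex(s):
    """Rough analogue of SPSS's AHEX format: each character as hex."""
    return " ".join(f"{ord(ch):02X}" for ch in s)

key_a = "AB 1"            # ordinary space (hex 20)
key_b = "AB\u00a01"       # non-breaking space (A0): looks identical on screen
print(ahex(key_a))        # → 41 42 20 31
print(ahex(key_b))        # → 41 42 A0 31
print(key_a == key_b)     # → False: why a match on this key would fail
```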





--
___________________________________________________________________
Maurice Vergeer
Department of communication, Radboud University  (www.ru.nl)
PO Box 9104, NL-6500 HE Nijmegen, The Netherlands

Visiting Professor Yeungnam University, Gyeongsan, South Korea

Recent publications:
-Vergeer, M., Eisinga, R. & Franses, Ph.H. (forthcoming). Supply and demand effects in television viewing. A time series analysis. Communications - The European Journal of Communication Research.
-Vergeer, M. Lim, Y.S. Park, H.W. (forthcoming). Mediated Relations: New Methods to study Online Social Capital. Asian Journal of Communication.
-Vergeer, M., Hermans, L., & Sams, S. (forthcoming). Online social networks and micro-blogging in political campaigning: The exploration of a new campaign tool and a new campaign style. Party Politics.
-Pleijter, A., Hermans, L. & Vergeer, M. (forthcoming). Journalists and journalism in the Netherlands. In D. Weaver & L. Willnat, The Global Journalist in the 21st Century. London: Routledge.

Webspace
www.mauricevergeer.nl
http://blog.mauricevergeer.nl/
www.journalisteninhetdigitaletijdperk.nl
maurice.vergeer (skype)
___________________________________________________________________





Reply | Threaded
Open this post in threaded view
|

Re: Hosmer-Lemeshow Goodness of Fit (Logistic Regression)

Hector Maletta
In reply to this post by Rich Ulrich

Of course, Rich, you did not say anything contrary to my comments. You just happened to be the one commenting before me in this thread.

Hector

 

De: SPSSX(r) Discussion [mailto:[hidden email]] En nombre de Rich Ulrich
Enviado el: Thursday, June 30, 2011 23:11
Para: [hidden email]
Asunto: Re: Hosmer-Lemeshow Goodness of Fit (Logistic Regression)

 

Hector,
The H-L test is a Goodness of Fit test, as you describe.  I don't think
that I said anything contrary to that, and I don't think I said anything
that is irrelevant when it comes to the effects of huge N.

I grant that for ordinary, small N, there is little power for discerning whether
the Logistic model is the "correct" one, contrasted to other ways of modeling
the approach to a dichotomy.  But Probit models, based on the normal, were
a frequent alternative to the logistic, before about 1988 when better
computerization of the logistic arrived.  Fortunately for us all, the cases that
truly deserve Probit are rarer than the other cases. 

Using the wrong model (Logistic versus other) *ought*  to show up as a
poorer fit in the tails -- with excessive deviations that will be captured
by a goodness-of-fit test.  I grant that this is subtle; I expect that other
problems are more likely to be detected in the usual run of things.

By coincidence, there was a similar problem posted today in the Usenet
group, sci.stat.consult.  I hope that Brendan Halpin won't mind my
re-posting his Reply, here.

***from s.s.c.
Newsgroups: sci.stat.consult
Subject: Re: Hosmer Lemeshaw Test and Large Samples
Date: Thu, 30 Jun 2011 15:26:32 +0100
Lines: 17
Message-ID: <[hidden email]>

I thought the H-L test was out of favour these days, even with H & L.

See
http://www.biostat.wustl.edu/archives/html/s-news/1999-04/msg00147.html
for an explanation, and note that Harrell has implemented this new test
for R in his rms package.

Mind you, other comments suggest the problems with the H-L test are in
the direction of failing to detect lack of fit, so this may not help
you!

Brendan
--
Brendan Halpin,   Department of Sociology,   University of Limerick,   Ireland
***end cite.



Date: Thu, 30 Jun 2011 18:54:49 -0300
From: [hidden email]
Subject: Re: Hosmer-Lemeshow Goodness of Fit (Logistic Regression)
To: [hidden email]

Rich,

I think you got it wrong, but I can be wrong myself.

In my view, the Hosmer Lemeshow test is NOT a statistical significance test such as the ordinary significance tests you run to ascertain whether a value is “significantly different from zero”.

[snip, previous]

 

Hector

 
[snip, previous]

 



Reply | Threaded
Open this post in threaded view
|

Re: Hosmer-Lemeshow Goodness of Fit (Logistic Regression)

Swank, Paul R

I did a little simulation. I generated a normally distributed variable (theta) as a function of two independent normally distributed IVs. I then created a new dichotomous variable by selecting values of theta greater than one. I did this 1000 times with samples of 500, 50000, and 100000. The Hosmer-Lemeshow test was significant 4.1% of the time when n=500, but 33.5% when n was 50000, and 66.6% of the time when n was 100,000. So it does appear that the test becomes more sensitive when n is large. Thus, very small deviations from the model may result in significant lack of fit when the sample size is large. This is similar to the problem in SEM, where large samples are much more likely to show lack of fit than small samples. Yes, there is a significant lack of fit. But is it substantial enough to invalidate the findings of the logistic model?
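The growth in the simulation is what the form of the statistic predicts: for a fixed miscalibration, each bin's (O-E)^2/E term scales linearly with n. A tiny deterministic sketch (hypothetical numbers, a single bin only):

```python
def hl_term(n, p_true, p_model):
    """Contribution of a single Hosmer-Lemeshow bin of n cases when the
    model predicts rate p_model but events actually occur at rate p_true:
    (O-E)^2/E summed over events and non-events."""
    o1, e1 = n * p_true, n * p_model   # observed vs expected events
    o0, e0 = n - o1, n - e1            # observed vs expected non-events
    return (o1 - e1) ** 2 / e1 + (o0 - e0) ** 2 / e0

# The same tiny miscalibration (predict 0.30, observe 0.31) at two sizes:
small = hl_term(500, 0.31, 0.30)
large = hl_term(50_000, 0.31, 0.30)
print(round(large / small, 1))  # → 100.0: the statistic grows linearly with n
```

So a deviation too small to matter substantively will cross any fixed chi-square threshold once n is large enough.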

 

Dr. Paul R. Swank,

Professor

Children's Learning Institute

University of Texas Health Science Center-Houston

 

[snip, previous]

Reply | Threaded
Open this post in threaded view
|

Re: Hosmer-Lemeshow Goodness of Fit (Logistic Regression)

Jarrod Teo-2
Hi Dr Paul,

Thanks for the simulation. This was what I suspected as well: as the sample grows larger, HL becomes significant. However, can I boldly say that with a large data file, data mining models (maybe CHAID or C5.0 for profiling) could be a better choice than statistical models?

By the way, thank you Hector and Rich. That was an interesting discussion.

Warmest regards
Dorraj Oet


[snip, previous]

Reply | Threaded
Open this post in threaded view
|

Re: match file with string variable

Maurice Vergeer
In reply to this post by Maurice Vergeer
Dear all,

just a short note on some new info: given the problem discussed in
this thread some time ago, I had the same problem with a new data
set.
Taking up Jon's advice I formatted the string variable as AHEX36.
The data showed that the recurring code at the beginning of the string
was EFBBBF. Googling this code I found a Wikipedia page suggesting
this is the UTF-8 byte order mark. I am not sure whether
Twitter's API delivers the data in such a way or whether SPSS inserts
it when importing a csv file. I think and hope the former, not the
latter.
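For what it is worth, EF BB BF is indeed the UTF-8 encoding of U+FEFF, the byte order mark; a quick Python check (with a hypothetical field value) shows how it survives a plain decode and how the `utf-8-sig` codec strips it:

```python
# EF BB BF at the start of a field is the UTF-8 encoding of U+FEFF,
# the byte order mark; utf-8-sig strips it, plain utf-8 keeps it.
raw = b"\xef\xbb\xbfusername"
kept = raw.decode("utf-8")         # BOM survives as an invisible '\ufeff'
stripped = raw.decode("utf-8-sig") # BOM removed
print(kept == "username")          # → False
print(stripped == "username")      # → True
print(f"{ord(kept[0]):X}")         # → FEFF, matching the AHEX output
```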

best wishes
Maurice


On Fri, Jul 1, 2011 at 5:09 AM, Maurice Vergeer <[hidden email]> wrote:

> Dear all,
>
> thanks for the comments and suggestions. I think Jon is right.
> On an 11-hour plane trip I was able to figure out the issue (in a
> different way than Jon did).
>
> Recap:
> Initially I had a key variable in two files to match; this didn't work.
> At first I thought it was a leading blank: a LIST command showed the
> string key variable values with an extra position (not a real blank but
> a character with a small dot in the middle, not a period; apparently a
> non-printing one). This character is not visible in the data editor or
> the syntax editor: I copied the character to the syntax file to have it
> removed with LTRIM. At first I was confused because the copied character
> did not show up in the syntax, until I ran the syntax and it did show up
> in the output.
>
> Where the character came from I do not know. I downloaded the data using
> Twitter's API into a csv file. In an ASCII editor nothing shows up. After
> importing the text into SPSS it suddenly appears. I have SPSS running in
> Unicode mode.
>
> Still, it is solved for now, but I will check using Jon Peck's suggestion
> of changing the string variable to the AHEX36 format.
>
> thanks to all,
> Maurice
>
>
> On Thu, Jun 30, 2011 at 12:41, Jon K Peck <[hidden email]> wrote:
>>
>> Correction: I meant to say AHEX36, not A36.
>>
>> ________________________________
>> Date: Wed, 29 Jun 2011 20:23:37 -0600
>> From: [hidden email]
>> Subject: Re: match file with string variable
>> To: [hidden email]
>>
>> Right.  The problem couldn't be trailing blanks, but it could be
>> nonprinting characters such as tabs or the popular French non-breaking space
>> character that look like spaces and are different between the two files.  If
>> you really want to figure out what the problem was, change the variable
>> formats to A36.  Then you can see the numerical codes for those "blanks".  A
>> true blank would be hex 20.  In code page mode, a non-breaking space would
>> be A0.
>>
>
>
>
> --
> ___________________________________________________________________
> Maurice Vergeer
> Department of Communication, Radboud University (www.ru.nl)
> PO Box 9104, NL-6500 HE Nijmegen, The Netherlands
>
> Visiting Professor Yeungnam University, Gyeongsan, South Korea
>
> Recent publications:
> -Vergeer, M., Eisinga, R. & Franses, Ph.H. (forthcoming). Supply and demand
> effects in television viewing. A time series analysis. Communications - The
> European Journal of Communication Research.
> -Vergeer, M. Lim, Y.S. Park, H.W. (forthcoming). Mediated Relations: New
> Methods to study Online Social Capital. Asian Journal of Communication.
> -Vergeer, M., Hermans, L., & Sams, S. (forthcoming). Online social networks
> and micro-blogging in political campaigning: The exploration of a new
> campaign tool and a new campaign style. Party Politics.
> -Pleijter, A., Hermans, L. & Vergeer, M. (forthcoming). Journalists and
> journalism in the Netherlands. In D. Weaver & L. Willnat, The Global
> Journalist in the 21st Century. London: Routledge.
>
> Webspace
> www.mauricevergeer.nl
> http://blog.mauricevergeer.nl/
> www.journalisteninhetdigitaletijdperk.nl
> maurice.vergeer (skype)
> ___________________________________________________________________
>
>
>
>
>
>



--
___________________________________________________________________
Maurice Vergeer
Department of Communication, Radboud University (www.ru.nl)
PO Box 9104, NL-6500 HE Nijmegen, The Netherlands

Visiting Professor Yeungnam University, Gyeongsan, South Korea

Recent publications:
-Vergeer, M. Lim, Y.S. Park, H.W. (2011). Mediated Relations: New
Methods to study Online Social Capital. Asian Journal of
Communication, 21(5), 430-449.
-Vergeer, M., Hermans, L., & Sams, S. (forthcoming). Online social
networks and micro-blogging in political campaigning: The exploration
of a new campaign tool and a new campaign style. Party Politics.
-Pleijter, A., Hermans, L. & Vergeer, M. (forthcoming). Journalists
and journalism in the Netherlands. In D. Weaver & L. Willnat, The
Global Journalist in the 21st Century. London: Routledge.
-Vergeer, M., Eisinga, R. & Franses, Ph.H. (forthcoming). Supply and
demand effects in television viewing. A time series analysis.
Communications - The European Journal of Communication Research.

Webspace
www.mauricevergeer.nl
http://blog.mauricevergeer.nl/
www.journalisteninhetdigitaletijdperk.nl
maurice.vergeer (skype)
___________________________________________________________________

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: match file with string variable

Jon K Peck
The BOM indicates that the characters in the file are encoded in UTF-8 (a form of Unicode). This would be coming from the Twitter API. Statistics should recognize it and read the data appropriately. SPSS Statistics will write a BOM when saving a csv file if it is in Unicode mode. If there are any extended characters in the file, you should use Unicode mode in Statistics when reading it.

Jon Peck (no "h")
Senior Software Engineer, IBM
[hidden email]
new phone: 720-342-5621
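This decoder behaviour can be reproduced outside SPSS; a minimal Python sketch (the one-line CSV is made up) showing that a codec which honours the BOM keeps it out of the first field, while a plain UTF-8 decode leaves it glued to the first key:

```python
import csv
import io

# Hypothetical CSV as an API might deliver it: the raw bytes
# start with the UTF-8 byte order mark EF BB BF.
raw = b"\xef\xbb\xbfuser,followers\njdoe,42\n"

# Decoded as plain utf-8, the BOM survives as U+FEFF attached to the
# first field name -- the invisible "extra position" seen in the editor:
plain = raw.decode("utf-8")
assert plain.startswith("\ufeff")

# The utf-8-sig codec recognizes and drops a leading BOM,
# so the field names (and any key values) come out clean:
rows = list(csv.reader(io.StringIO(raw.decode("utf-8-sig"))))
assert rows[0] == ["user", "followers"]
assert rows[1] == ["jdoe", "42"]
```

In other words, whether the BOM ends up inside your key variable depends entirely on whether the importing program treats it as an encoding signature or as data.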




[snip, previous]