Estimating Missing Dependent Variables

Estimating Missing Dependent Variables

Aaron Kreider
I want to use the results of a logistic regression to estimate values
for cases where the dependent variable is missing. I have a dataset
which has 10,000 cases with dependent variables and another 10,000 cases
that do not. How can I do this?

I'm hoping to avoid writing a long mathematical formula as my
coefficients and significant variables are going to change often.

Aaron

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: Estimating Missing Dependent Variables

Hector Maletta
Logistic regression estimates the probability (or, equivalently, the odds)
that a binary variable takes the value 1 rather than 0. It does not estimate
whether the value in a particular case will be 0 or 1.
Practitioners often use a cutoff point to assign a predicted value to the
dependent variable, e.g. assign the value 1 to a particular case if the
estimated probability of the outcome being 1 is higher than 0.5, and 0
otherwise.
In your case, you may estimate the logistic equation for the first set of
cases, and then apply the same equation to the second set, to generate the
probability of the event for each case. SPSS allows you to save this
probability as a new variable in your dataset, with the SAVE subcommand of
LOGISTIC REGRESSION.
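As a rough illustration of what the saved probability and the cutoff rule amount to, here is a minimal Python sketch. The intercept, the coefficients, the variable names and the two example cases are all made up for illustration; in practice SPSS computes the probabilities itself from the fitted model.

```python
import math

# Hypothetical fitted model: these numbers are invented for illustration only.
INTERCEPT = -1.5
COEFFICIENTS = {"age": 0.03, "education": 0.4}


def predicted_probability(case):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    linear = INTERCEPT + sum(b * case[name] for name, b in COEFFICIENTS.items())
    return 1.0 / (1.0 + math.exp(-linear))


def classify(probability, cutoff=0.5):
    """Assign 1 when the estimated probability exceeds the cutoff, else 0."""
    return 1 if probability > cutoff else 0


# Cases whose dependent variable is missing; only the predictors are known.
unlabeled_cases = [
    {"age": 25, "education": 3},
    {"age": 20, "education": 1},
]

for case in unlabeled_cases:
    p = predicted_probability(case)
    print(case, round(p, 3), classify(p))
```

The cutoff is kept as a separate argument because it is a choice made by the analyst, not something the model estimates.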
This is common practice, and would raise few eyebrows. However, I think it
is wrong at a philosophical level. Probability, I think, is not about
individual cases but about groups or populations. When you predict that
people with Agegroup 20-29 and sex=1 and education=3 would have
probability=0.7 of voting for candidate Obama, you are simply saying that
about 70% of all people with those characteristics will vote for Obama,
but you know nothing about the vote of Mrs Jones or Mr Smith, even if they
share all those characteristics: they might be in the 70% voting for Obama,
or in the other 30%. Just the same, you know there is a 4/6 probability that
a die shows a value greater than 2, but you know nothing at all about the
next throw: it may show any number from 1 to 6. Its "probability" is
indeterminate: it will turn into a concrete number (with probability 1) or
remain indeterminate (all numbers possible). The probability just says that
about 4/6 or 2/3 of any large number of dice throws would show a number
larger than 2. You are speaking of large numbers, and that's the reason why
the fundamental theorem of statistics is known to its buddies as the Law of
Large Numbers. All statistics is about large numbers. For statisticians,
individuals are disposable, and indeterminate.
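The large-numbers point can be tried out with a small simulation. This is only a sketch: the event is taken to be a throw of 3 or more on a fair die (probability 4/6), and the seed and the numbers of throws are arbitrary choices.

```python
import random


def share_of_hits(n_throws, rng):
    """Fraction of throws showing 3 or more (probability 4/6 for a fair die)."""
    hits = sum(1 for _ in range(n_throws) if rng.randint(1, 6) >= 3)
    return hits / n_throws


rng = random.Random(2009)  # fixed seed so the run is repeatable

# A single throw is either a hit or a miss; nothing in between.
print("one throw:", rng.randint(1, 6))

# The relative frequency settles near 4/6 only as the number of throws grows.
for n in (10, 1_000, 100_000):
    print(n, "throws:", round(share_of_hits(n, rng), 3))
```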
This also has practical implications. Suppose you are allocating fellowships
at college entrance, based on a number of variables predicting college
success. Your donors do not want the money wasted on college dropouts. So
you analyse past students, with and without fellowships, and come up with a
log reg equation to predict the probability of scholarly success, thus
allowing you to give the money only to the best prospects.
You will keep your donors happy, until they discover you rejected Albert
Einstein, a terrible prospect by any measure at the age of 17. Instead you
accepted a lot of dull guys able to pass SAT after SAT, most of whom
will end up as clerks or salesmen, will never discover anything more
transcendental than a cheating wife or husband, and will quickly forget
everything they ever learnt in college about matter or energy.
Your equation, in short, tells you the truth for relatively large groups of
people sharing certain features, but not necessarily for specific cases.

Hector




Re: Estimating Missing Dependent Variables

Hector Maletta

Jon, I will repeat your comments separately from my main text for the sake of clarity.

1. Regarding my idea that probability is about groups, and not about individuals, you write:

"[>>>Peck, Jon] I don't think this stand has anything to do with logistic regression, per se.  I think you would make the same argument about ordinary regression and any other individual forecasting technique?  Would you agree?

So then, how would you do prediction for individual cases, be they people, countries, or any other observation unit?  If faced with a decision problem, would you disallow statistical models altogether?"

Jon, it is different in the case of linear regression. Log reg does not predict individual values, but probabilities. My point is that probabilities are essentially a property of groups (groups of events, populations, and so on). There are certain concepts of probability (e.g. de Finetti's subjective concept of probability) that may assign something called "probability" to a single case, but the most widely accepted meaning of probability is the "frequentist" one, which identifies probability with the relative frequency of an event or attribute within a given population, group or set. In linear (or for that matter nonlinear) regression one predicts the value of a variable for each individual case, which is modelled as the sum of a linear/nonlinear function plus a random error, using an algorithm that minimizes the sum of the squared errors. In that case you are actually estimating individual values.

 

2. Later I mention the example of a model predicting college success based on SAT, and failing to select Einstein. You reply:

« [>>>Peck, Jon] I think this stand does not follow from your previous argument.  It's simply an assertion that basing predictions on SAT scores is a bad model.  You could logically have measures of creativity and other factors that might be better predictors. »

Of course, but it was just an example. Even if you include creativity or whatever, what you are doing in the example (assigning scholarships to candidates) is TRYING TO MINIMIZE THE PROPORTION OF FAILURES in the GROUP of people you are analyzing. You are not predicting the outcome of individual cases. Within a subgroup with a certain probability of success (e.g. people sharing the same gender, education, SAT score, creativity, and all other predictors), some individuals will ultimately succeed, some will fail, and the information you have is exactly the same for all of them: as far as you know, the individual outcome WITHIN THE GROUP is indeterminate. That was my point. The probability is predicated of the group, not of the individual. The Dean of Admissions will probably minimize the number or proportion of dropouts among students granted a scholarship, but (1) cannot tell in advance who among the beneficiaries will drop out, and (2) cannot tell in advance whether there is someone else who was not a beneficiary but would have been (with hindsight) a better choice. He deals with groups, not individuals.

 

I hope this clarifies the issue.

 

Hector

Re: Estimating Missing Dependent Variables

Peck, Jon

[>>>Peck, Jon] Well, you are estimating expected values in ordinary regression.  p is also an estimated expected value (of a 0/1 outcome).  But in both cases that still leaves the decision problem to be solved.  Who would you admit?  Or, in classic terms, given the probability of rain, when would you carry an umbrella?
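
For reference, the identity behind this remark can be written out. The notation ($Y$ for the 0/1 outcome, $x$ for the predictors, $p(x)$ for the probability of a 1) is introduced here only for illustration:

```latex
\begin{align*}
E[Y \mid x] &= 0 \cdot P(Y = 0 \mid x) + 1 \cdot P(Y = 1 \mid x) \\
            &= P(Y = 1 \mid x) = p(x).
\end{align*}
```

On this reading the fitted probability plays the same role as the fitted mean in ordinary regression; neither one says which outcome a given case will have.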

[>>>Peck, Jon] [snip]


Re: Estimating Missing Dependent Variables

Hector Maletta


 

Jon, the probability p is not an expected value of the outcome. It is a probability. The expected values are only two: call them 0 or 1, or A and B, or whatever alphanumeric symbols you choose for the two outcomes (either your event happens or it does not). On the other hand, p is a real number between 0 and 1. It is the probability of getting one of the values, say the value A (the plane crashes in mid-Atlantic, or this candidate drops out of college). The term probability is best interpreted as « the proportion of cases with the outcome A » within a given set of cases with such and such characteristics.

Now, regarding decision, that's a different story altogether. Once you know the probability of an event, you can take a number of possible decisions about it (not only one), and your decision can be based on any of a number of possible criteria. You may, for instance, adopt a (possibly arbitrary) cutoff point like p=0.5, and assign all cases with p>0.5 to Group A and all others to Group not-A; but that would be, well, your decision. You may assign a cutoff value for some decisions, and a different cutoff value for other decisions. Someone else could adopt some other criterion for each decision, such as p>0.9 or p>0.2 instead of your 0.5, and there might be some valid reasons for any such decision rules. For instance, you may adopt the prudential decision to build a costly hurricane levee in New Orleans on a very small (say 1/50) probability of a Katrina-force hurricane in the next 5 years (and refrain from building it if the probability is 1/51 or less). A new agricultural herbicide is rejected if the probability of human mortality among handlers is one in a million or more, but it is allowed with a slightly lower probability such as p=0.0000009. With the current population of farm workers in the US, this would imply about three deaths per year among US farm hands (far fewer than the 350 people in the US who die annually by drowning in their own bathtub, but three people are always three people). And so on. Such are the decisions we make every day, taking our chances, like taking a plane from Rio to Paris or crossing a busy street. Some of us are brave, some are more cowardly. The decision rule we follow on each occasion does not depend on the method whereby the value of p for each event is estimated. It is a separate matter altogether.
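
A small Python sketch of the two points in this paragraph: the decision rule is separate from the estimated probability, and the expected number of events in a group is simply probability times group size. The function name and the farm-worker count are assumptions; the count was picked only so that the result has the order of magnitude mentioned above.

```python
def decide(probability, threshold):
    """Act (build the levee, reject the herbicide, ...) when the risk reaches the threshold."""
    return probability >= threshold


# Same estimated probability, different decision rules, different decisions.
p_hurricane = 1 / 50
print(decide(p_hurricane, threshold=1 / 50))  # a cautious planner acts
print(decide(p_hurricane, threshold=1 / 10))  # a less cautious one does not

# Expected number of events in a group is probability times group size.
# The group size is an assumption, not a figure taken from any source.
assumed_farm_workers = 3_000_000
p_death = 0.0000009
print(round(assumed_farm_workers * p_death, 1))
```

Changing the threshold changes the decision without touching the probability estimate, which is the separation argued for here.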

Do you agree with this?

 

Hector

 


Re: Estimating Missing Dependent Variables

Hector Maletta
In reply to this post by Peck, Jon

Moreover, Jon,

Besides the decision question being different from the question of the meaning and estimation of the probability, in any case your decision will affect entire groups or categories of people; you would be unable to say anything about how the die will roll in each particular case within each category of people sharing the same probability. Some people with high probability will not suffer the event, while some with very low probability will.

You, for instance, may decide to give a college scholarship (or student loan) to all candidates whose predicted probability of college success is over 0.8. Some of them will actually succeed, some will not, and you do not have a clue about which is which. By the same token, you as a doctor must decide whether or not to give Treatment A to patients suffering from a certain painful disease which (if untreated) would probably kill them in a few years. The treatment is effective in most cases, but in some cases it has lethal side effects that would kill the patient instantly; that unfortunate outcome is never certain: its probability p depends on some predictors.  You (or the medical profession) give the treatment to all patients whose p is lower than, say, 1%. But you do not know in advance whether your next patient (estimated p=0.009) will be cured, or whether she is instead a member of the unlucky lot that will be instantly killed by the treatment. It's kinda Russian roulette at that point, only the gun has 100 chambers, of which 99 are empty and one has a bullet. Knowing the probability for a category of people tells you nothing about the individual fate of each patient, just as (in Russian roulette) knowing that you have 1 chance in 6 tells you nothing about how things will turn out the next time you pull the trigger.

Hector

Re: Estimating Missing Dependent Variables

Ruben Geert van den Berg
In reply to this post by Aaron Kreider
Dear Aaron,
 
I'm not sure whether this is the easiest option with SPSS, but you could
 
1) merge your files with ADD FILES
2) Compute near zero weights for cases with missing dependents (in the example: 1E-25). If your data are unweighted, assign unity case weights to the other cases. Otherwise, copy the existing weights for those
3) replace the missing dependent values with zeros
4) Weight your sample
5) Run Logistic regression and save the predicted probabilities.
 
The trick is that the cases with missing dependent values will get predicted probabilities because the missing values have been replaced with zeros. However, because they have near-zero weights, they won't influence the regression results substantially.

A completely different approach could be to use OMS to create a new dataset containing the betas and use string manipulations to convert this into syntax that uses only the independent variables and the regression equation to calculate predicted probabilities.
 
For example, try:
 
* Create test data.

DATASET CLOSE ALL.
NEW FILE.
SET SEED 123456.
INPUT PROGRAM.
LOOP #I=1 TO 15.
COMPUTE ID=#I.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
EXECUTE.

COMPUTE Dep=RV.BERNOULLI(.5).

DO REPEAT Vars=Ind_1 TO Ind_4.
COMPUTE Vars=RV.UNIFORM(1,10).
END REPEAT.

* Make the dependent variable missing for cases 11 through 15.
DO IF $CASENUM GE 11.
RECODE Dep (ELSE=SYSMIS).
END IF.

EXECUTE.

* End create data.

* Regression without incomplete cases.

LOGISTIC REGRESSION Dep
/METHOD = ENTER Ind_1 Ind_2 Ind_3 Ind_4
/SAVE = PRED
/CRITERIA = PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .

* Assign case weights: 1 for complete cases, near zero for filled-in cases.

COMPUTE Weight=1.
DO IF MISSING(Dep).
COMPUTE Dep=0.
COMPUTE Weight=1E-25.
END IF.
EXECUTE.

WEIGHT BY Weight.

* Regression with completed cases.

LOGISTIC REGRESSION Dep
/METHOD = ENTER Ind_1 Ind_2 Ind_3 Ind_4
/SAVE = PRED
/CRITERIA = PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .

* Replace previously missing values with predicted probabilities if desired.

DO IF Weight=1E-25.
COMPUTE Dep=PRE_2.
END IF.
EXECUTE.
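
The claim that near-zero weights leave the estimates essentially unchanged can also be checked outside SPSS. The Python sketch below is not SPSS's algorithm: it is a hand-rolled Newton-Raphson fit with one predictor, and the data, the function name and the weight handling are assumptions made for illustration.

```python
import math


def fit_weighted_logistic(xs, ys, weights, iterations=25):
    """Newton-Raphson fit of p = 1 / (1 + exp(-(b0 + b1 * x))) with case weights."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, w in zip(xs, ys, weights):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += w * (y - p)          # weighted gradient, intercept
            g1 += w * (y - p) * x      # weighted gradient, slope
            v = w * p * (1.0 - p)
            h00 += v                   # weighted information matrix entries
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Invented data: ten complete cases and three cases with a missing dependent.
complete_x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
complete_y = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]
missing_x = [2.5, 5.5, 8.5]

# Fit on the complete cases only.
reference = fit_weighted_logistic(complete_x, complete_y, [1.0] * len(complete_x))

# Fit on all cases, with zeros filled in and near-zero weights for those cases.
all_x = complete_x + missing_x
all_y = complete_y + [0] * len(missing_x)
all_w = [1.0] * len(complete_x) + [1e-25] * len(missing_x)
trick = fit_weighted_logistic(all_x, all_y, all_w)

print("complete cases only:   ", reference)
print("with near-zero weights:", trick)

# Predicted probabilities for the cases whose dependent was missing.
b0, b1 = trick
for x in missing_x:
    print(x, round(1.0 / (1.0 + math.exp(-(b0 + b1 * x))), 3))
```

The two fits should agree to many decimal places, since the filled-in cases contribute almost nothing to the weighted sums.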




 



 


Sequencing Syntax Problem

Guerrero, Rodrigo
In reply to this post by Hector Maletta
Hello,

I have a data sequence problem that seems to be beyond my syntax
ability.  I hope you can help.

This is the page view data I currently have.

EntryPage       NextPage        Page     PreviousPage   ExitPage
Page C                  Page F  Page D  Page F
Page C  Page E  Page G  Page A  Page F
Page C  Page G  Page A  Page C  Page F
Page C  Page A  Page C                  Page F
Page C  Page D  Page E  Page G  Page F
Page C  Page F  Page D  Page E  Page F

I want to get it to look like this which would identify the sequence of
page views.

EntryPage       NextPage        Page     PreviousPage   ExitPage
Page C                  Page C                  Page F
Page C  Page G  Page A  Page C  Page F
Page C  Page E  Page G  Page A  Page F
Page C  Page D  Page E  Page G  Page F
Page C  Page F  Page D  Page E  Page F
Page C                  Page F  Page D  Page F

This is driving me crazy because I cannot get the original file in the
right order and it is not a quick sort to identify the sequence.

Thanks for your help.



RG

Rodrigo A. Guerrero | Director Of Marketing Research and Analysis | The
Scooter Store | 830.627.4317



reformatting frequency tables

J P-6
Dear list,
 
I need to produce frequency reports for about 300 tables where the information (counts and simple percentages) is presented in rows instead of columns, which is the default. Then I need to move that into Excel as a table so it can be edited some more. It would be great if I could just somehow turn this into a data file. I am using SPSS V15.

Years ago I used scripts and macros, but those skills are now quite rusty and this report is due yesterday!
 
Any help is greatly appreciated.
 
John
 
 
 
 

 
 


Re: Sequencing Syntax Problem

Maguin, Eugene
In reply to this post by Guerrero, Rodrigo
Rodrigo,

Two questions. 1) It looks like you have six columns of data but only five
column labels. Did you omit something or am I missing something?

2) What is the set of rules that you used to transform the input data set to
the output dataset? I don't get it but then I've never worked with page view
sequence data.

Gene Maguin




Re: Estimating Missing Dependent Variables

Peck, Jon
In reply to this post by Hector Maletta

A few more last comments below.

 


From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Hector Maletta
Sent: Monday, June 08, 2009 9:46 PM
To: [hidden email]
Subject: Re: [SPSSX-L] Estimating Missing Dependent Variables

 

Moreover, Jon,

Besides the decision question being different from the question of meaning and estimation of the probability, in any case your decision will affect entire groups or categories of people; you would be unable to say anything about how the die will roll in each particular case within each category of people sharing the same probability. Some people with high probability will not suffer the event, while some with very low probability will.

[>>>Peck, Jon] And in the same way, in an ordinary conditional expectation model, some units will have higher than expected outcomes and some lower, and a regression equation cannot say any more about which are which given equal expectation.

You, for instance, may decide to give a college scholarship (or student loan) to all candidates whose predicted probability of college success is over 0.8. Some of them will actually succeed, some will not, and you do not have a clue about which is which. By the same token, you as a doctor must decide whether or not to give Treatment A to patients suffering from a certain painful disease which (if untreated) would probably kill them in a few years. The treatment is effective in most cases, but in some cases it has lethal side effects that would kill the patient instantly; that unfortunate outcome is never certain: its probability p depends on some predictors.  You (or the medical profession) give the treatment to all patients whose p is lower than, say, 1%. But you do not know in advance whether your next patient (estimated p=0.009) will be cured, or whether she is instead a member of the unlucky lot that will be instantly killed by the treatment. It’s kinda Russian roulette at that point, only the gun has 100 holes, of which 99 are empty and one has a bullet. Knowing the probability of a category of people tells you nothing about the individual fate of each patient, just as (in Russian roulette) knowing that you have 1 chance in 6 tells you nothing about how things will turn out the next time you pull the trigger.

[>>>Peck, Jon] Isn't it equivalent to say that a very good p function (in the extreme, one predicting probabilities of 0 or 1) is like a regression equation with a very high fit and small error?  If there is inherent randomness, no model is going to capture that (but some vigorous overfitting might make it seem otherwise :-( )
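
The cutoff rule discussed above (award the scholarship when the predicted probability exceeds 0.8) can be sketched in a few lines of Python. This is only an illustration: the intercept, coefficients, predictor meanings and candidate values are all made up, and the function name is hypothetical rather than anything from SPSS. The rule classifies each case, but the probability itself still says nothing about how any single candidate will actually turn out.

```python
import math

def predicted_probability(intercept, coefs, values):
    """Logistic-model probability for one case: 1 / (1 + exp(-(a + sum(b_i * x_i))))."""
    linear = intercept + sum(b * x for b, x in zip(coefs, values))
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical coefficients and cases, invented purely for illustration.
INTERCEPT = -2.0
COEFS = [0.9, 1.4]                      # e.g. a grade average and a test score
CANDIDATES = {"A": [3.5, 1.2], "B": [2.0, 0.1], "C": [3.0, 0.6]}
CUTOFF = 0.8                            # the scholarship threshold from the example

decisions = {}
for name, x in CANDIDATES.items():
    p = predicted_probability(INTERCEPT, COEFS, x)
    # Store the estimated probability and the yes/no decision it implies.
    decisions[name] = (round(p, 3), p > CUTOFF)

for name, (p, award) in decisions.items():
    print(name, p, "award" if award else "decline")
```

Moving the cutoff changes who is selected but not the estimated probabilities, which is the separation between the decision question and the estimation question made in the exchange above.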

Hector


Re: Estimating Missing Dependent Variables

Hector Maletta

Last comments indeed. In all statistical estimates there is a margin of error, granted, Jon. But in regression you actually estimate the value, and your THEORETICAL model includes the error: y=a+bX+e. The actual event is SUPPOSED to obey two kinds of causes, one systematic (X) and one random (e). Thus you obtain an estimate of the expected value (a+bX) and an estimate of the error component (e); the least squares algorithm chooses “a” and “b” such that the sum of the squared error components “e” is minimized.

Instead, in logistic regression your theoretical model for an individual is totally different. It is deterministic. Either she is in State A or State B. No middle way. Either she drops out of college or she doesn’t. Either she is alive or she is dead. There is no “probability” or “error” in the model for individuals, only the two stark discrete states. What you predict, what you are really dealing with, is not the state of each individual, but the probability (read: EXPECTED RELATIVE FREQUENCY) of one of the states IN A GROUP OF INDIVIDUALS. Your margin of error would be in this relative frequency: for instance, when you throw 100 coins, the expected frequency of tails is 50, but perhaps you get 49 or 52, and that is your margin of error. By the same token, if your model predicts (for some group sharing a certain combination of predictors’ values) a probability of 0.20, perhaps the proportion of actual events turns out to be 0.22 or 0.18. The error is in the prediction of the relative frequency, not in the estimation of each individual outcome, because you are not predicting individual outcomes in that kind of model.
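
The relative-frequency reading above can be checked with a small simulation. The sketch below assumes groups of 100 individuals who all share a predicted probability of 0.20 and draws many such groups; the group size, number of groups, seed and function name are arbitrary illustrative choices. It shows only how the observed proportion scatters around 0.20 from group to group.

```python
import random
import statistics

def simulate_group_proportions(p, group_size, n_groups, seed=0):
    """Draw n_groups groups; return the observed event proportion in each."""
    rng = random.Random(seed)           # fixed seed so the run is repeatable
    proportions = []
    for _ in range(n_groups):
        events = sum(1 for _ in range(group_size) if rng.random() < p)
        proportions.append(events / group_size)
    return proportions

proportions = simulate_group_proportions(p=0.20, group_size=100, n_groups=1000)
mean_prop = statistics.mean(proportions)       # close to the predicted 0.20
sd_prop = statistics.pstdev(proportions)       # spread of the observed proportions
theoretical_sd = (0.20 * 0.80 / 100) ** 0.5    # binomial standard error, 0.04

print(round(mean_prop, 3), round(sd_prop, 3), round(theoretical_sd, 3))
```

With a standard error of 0.04, observed proportions such as 0.18 or 0.22 are well within the ordinary range for a group of this size.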

Hector




Re: Estimating Missing Dependent Variables

Gerard M. Keogh
Hector,

I think things are not as clear cut as you say.
In a logistic model the probabilities come from a logistic distribution;
forget about error, it only confuses things.  Estimation involves fitting a
logistic distribution to get estimated probabilities.
Similarly, in a normal model the data values themselves are normally
distributed and estimation involves fitting a normal dist to the y values.
For Bernoulli data as you describe, the key thing is that the logit transforms
the [0,1] probability scale to [-inf, +inf]. For normal data this
transformation is the identity.
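
The logit transformation mentioned above, and its inverse, can be written out directly. This is a minimal sketch using only the Python standard library; the function names are illustrative. Note that the endpoints 0 and 1 themselves have no finite logit, so the function below accepts only values strictly between them.

```python
import math

def logit(p):
    """Map a probability in (0, 1) to the whole real line: log(p / (1 - p))."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Map any real number back to a probability: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

for p in (0.001, 0.2, 0.5, 0.8, 0.999):
    z = logit(p)
    # Round trip: inv_logit(logit(p)) recovers p up to floating-point error.
    print(p, round(z, 3), round(inv_logit(z), 3))
```

Probabilities near 0 map to large negative values and probabilities near 1 to large positive values, which is what lets a linear predictor range freely while the fitted probability stays between 0 and 1.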

Gerard

Re: Estimating Missing Dependent Variables

Hector Maletta
Gerard,
The form of the assumed distribution of probabilities (normal or logistic or
whatever) is irrelevant to the point I am making. In the case of a
dichotomous variable treated with log reg, you do not "fit a logistic
distribution" to the distribution of subjects over the two values of the
variable: the two values remain two values, people sit only at the two
values with nobody in between, and the logistic distribution pertains to the
probability of one of the values in a group of people, which varies from
group to group as other variable(s) vary. In the case of a continuous
variable, the normal distribution is not required for the dependent variable
itself: linear regression only assumes that the errors (the "e" in the
equation) are normally distributed around the regression line, which is not
quite the same. You do not "fit a normal distribution" to the distribution
of subjects over the dependent variable in a linear regression. And
moreover, all of this has nothing to do with my point.
Probably the difficulty in "seeing" my point is that many of us have long
been considering probabilities as attributes of individuals. Often
we see variables, dichotomous or interval, as arguments of an underlying
probability distribution whereby each value of the variable and the
individual cases having that value are assigned a probability. So one
imagines the logistic distribution as representing a low (or zero)
probability for individuals not suffering the event, a high (or unit)
probability for individuals having the event, and intermediate probabilities
for the fictitious individuals in between. But this is all imaginary, and a
bit quaint, because there are no intermediate situations, and there is no
objective or empirical "probability" property in individuals. It is, at
best, a construct you assign to individuals after measuring it in groups. If
you adopt a frequentist view of probability, which is the one with better
mathematical foundations and sounder empirical basis, things become easier
and interpretations more natural.

Hector

Re: Estimating Missing Dependent Variables

Hector Maletta
The mechanics is as you say. I am talking about the substantive meaning of
the operation.
On second thought, I correct a previous statement I made: also in the case
of linear regression what you actually predict is the EXPECTED value of a
GROUP of people sharing certain value(s) of predictor variable(s), i.e. the
average value of Y for that group of people. Individuals are distributed
around the expected value, and may have any value of Y, with a presumed
normal distribution around the expected value, and you do not have a clue
about where a particular individual will be. That is, also in this case the
prediction is about a group, not about individuals.
What you are saying in a linear regression is more or less this: "If
you take a certain number of women of age 40 with 12 years of education and
an income of $35,000, their expected average net wealth is $52,000". As a
subsidiary prediction you are also saying that "The actual net wealths of
the various women in the group are normally distributed around a mean of
$52,000 with SD=SE, where SE is also estimated by the regression model".
Given a particular woman, Ms Smith, with the given characteristics, you
cannot say anything more about what her net wealth is. However, in
linear regression you partition the entire sample in many subgroups
(ultimately as many as cases, if individual values are not repeated) and
predict a value of Y for each combination of predictor values, and
separately you predict a probability distribution of cases in the subgroup
around the predicted value of the variable.
In a logistic regression, instead, if you take a group of 100 women with the
same characteristics of the previous example, you may predict that about 43
of them are currently married, and the remaining 57 unmarried; for another
group of women with different characteristics, you predict that 77 will be
married. But you do not predict the marital status of Ms Smith, a member of
the first group, or Ms Jones, a member of the second group, nor the
distribution of the members around the "predicted value" because you do not
predict any "value".
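
The group-level prediction in the logistic example can be made concrete with binomial arithmetic. The sketch below takes the figures from the text (100 women, predicted probabilities of 0.43 and 0.77) and computes the expected number married together with its standard deviation, on the assumption that the women in a group are independent cases sharing the same probability. The function name is illustrative.

```python
import math

def expected_count_and_sd(p, n):
    """Expected number of events and its binomial standard deviation for n cases."""
    return n * p, math.sqrt(n * p * (1.0 - p))

# Figures from the example: two groups of 100 women each.
first_mean, first_sd = expected_count_and_sd(0.43, 100)
second_mean, second_sd = expected_count_and_sd(0.77, 100)

print(round(first_mean), round(first_sd, 2))
print(round(second_mean), round(second_sd, 2))
```

The prediction is therefore "about 43, give or take roughly 5" for the first group as a whole; it does not say whether Ms Smith is among the 43.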
About the practical implications of thinking about probabilities in frequentist or
non-frequentist terms, you may like to see Gerd Gigerenzer (2000), Adaptive
Thinking: Rationality in the Real World, OUP (e.g. pp. 17-19 and elsewhere).
Hector
-----Original Message-----
From: Gerard M. Keogh [mailto:[hidden email]]
Sent: 11 June 2009 10:49
To: Hector Maletta
Subject: Re: Estimating Missing Dependent Variables

Hector,

I'm not hung up on this!

.... You do not "fit a normal distribution" to the distribution
of subjects over the dependent variable in a linear regression....

I fit a normal dist to the dependent var for each subj conditional on the
indep vars - E[y|x].
This is just integration which marginalises out the y variable giving y_bar
= mu(x).
It's just a mechanical process with no hidden meaning.

For the logistic we use g(E[y|x]) where g is the link function or logit.

And yes, it's a construct but no different to the normal data case where g
= identity - no different because it's all just integration of weighted
pdf's.
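
Written out, the link-function view described here looks as follows. This is a sketch for a single predictor x, reusing the a and b of the y=a+bX+e equation earlier in the thread; the notation is an assumption made for illustration, not Gerard's own.

```latex
\begin{aligned}
g\bigl(\mathrm{E}[y \mid x]\bigr) &= a + b\,x \\
\text{normal data:}\quad g(\mu) &= \mu
  \quad\Rightarrow\quad \mathrm{E}[y \mid x] = a + b\,x \\
\text{Bernoulli data:}\quad g(p) &= \log\frac{p}{1-p}
  \quad\Rightarrow\quad \mathrm{E}[y \mid x] = \frac{1}{1 + e^{-(a + b\,x)}}
\end{aligned}
```

In both cases the modelled quantity is the conditional mean E[y|x]; only the link g differs.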

food for thought though!

Gerard





             Hector Maletta
             <hmaletta@fiberte
             l.com.ar>                                                  To
             Sent by:                  [hidden email]
             "SPSSX(r)                                                  cc
             Discussion"
             <SPSSX-L@LISTSERV                                     Subject
             .UGA.EDU>                 Re: Estimating Missing Dependent
                                       Variables

             11/06/2009 14:20


             Please respond to
              Hector Maletta
             <hmaletta@fiberte
                 l.com.ar>






Gerard,
The form of the assumed distribution of probabilities (normal or logistic or
whatever) is irrelevant to the point I am making. In the case of a
dichotomous variable treated with logistic regression, you do not "fit a
logistic distribution" to the distribution of subjects over the two values
of the variable: the two values remain two values, people sit only at the
two values with nobody in between, and the logistic distribution pertains to
the probability of one of the values in a group of people, which varies from
group to group as other variable(s) vary. In the case of a continuous
variable, the normal distribution is not required for the dependent variable
itself: linear regression only assumes that the errors (the "e" in the
equation) are normally distributed around the regression line, which is not
quite the same. You do not "fit a normal distribution" to the distribution
of subjects over the dependent variable in a linear regression. And
moreover, all of this has nothing to do with my point.
Probably the difficulty in "seeing" my point is that many of us have long
considered probabilities as attributes of individuals. Often
we see variables, dichotomous or interval, as arguments of an underlying
probability distribution whereby each value of the variable, and the
individual cases having that value, are assigned a probability. So one
imagines the logistic distribution as representing a low (or zero)
probability for individuals not suffering the event, a high (or unit)
probability for individuals having the event, and intermediate probabilities
for the fictitious individuals in between. But this is all imaginary, and a
bit quaint, because there are no intermediate situations, and there is no
objective or empirical "probability" property in individuals. It is, at
best, a construct you assign to individuals after measuring it in groups. If
you adopt a frequentist view of probability, which is the one with better
mathematical foundations and sounder empirical basis, things become easier
and interpretations more natural.

Hector


-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Gerard M. Keogh
Sent: 11 June 2009 06:29
To: [hidden email]
Subject: Re: Estimating Missing Dependent Variables

Hector,

I think things are not as clear cut as you say.
In a logistic model the probabilities come from a logistic distribution -
forget about error, it only confuses things. Estimation involves fitting a
logistic distribution to get estimated probabilities.
Similarly, in a normal model the data values themselves are normally
distributed and estimation involves fitting a normal dist to the y values.
For Bernoulli data as you describe, the key thing is that the logit
transforms the [0,1] probability scale to [-inf, +inf]. For normal data this
transformation is the identity.
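
A minimal sketch of the two link functions mentioned here, in Python. The function names are made up for illustration and do not come from SPSS or from the post. It only shows how the logit stretches probabilities onto the whole real line, while the identity link leaves values unchanged.

```python
import math

def logit(p):
    """Map a probability strictly between 0 and 1 to the real line."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Map any real number back to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def identity(mu):
    """Link used for normal data: the mean is modelled directly."""
    return mu

# Probabilities near 0 or 1 are pushed far out on the logit scale.
for p in (0.01, 0.2, 0.5, 0.8, 0.99):
    print(p, round(logit(p), 3))
```

The endpoints 0 and 1 themselves are excluded from `logit` because they correspond to minus and plus infinity.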

Gerard






From: Hector Maletta <hmaletta@fibertel.com.ar>
Sent by: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
Sent: 10/06/2009 19:28
To: [hidden email]
Subject: Re: Estimating Missing Dependent Variables
Please respond to: Hector Maletta <hmaletta@fibertel.com.ar>

Last comments indeed. In all statistical estimates there is a margin of
error, granted, Jon. But in regression you actually estimate the value, and
your THEORETICAL model includes the error: y=a+bX+e. The actual event is
SUPPOSED to obey two kinds of causes, one systematic (X) and one random
(e). Thus you obtain an estimate of the expected value (a+bX) and an
estimate of the error component (e); the least squares algorithm chooses
“a” and “b” such that the sum of the squared error components “e” is
minimized.
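
The least squares step described above can be sketched in a few lines of Python. The data values are made up for illustration, and `fit_line` and `residuals` are hypothetical helper names, not SPSS functions. The closed-form formulas are the usual ones for a single predictor.

```python
def fit_line(xs, ys):
    """Return (a, b) minimizing the sum of squared errors for y = a + b*x + e."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def residuals(xs, ys, a, b):
    """Estimated error component e for each case."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # made-up illustrative data

a, b = fit_line(xs, ys)
e = residuals(xs, ys, a, b)
sse = sum(r ** 2 for r in e)
print(round(a, 4), round(b, 4), round(sse, 4))
```

Each case gets both a fitted value a + b*x and its own residual, which is the contrast with the logistic case drawn in the next paragraph.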
In logistic regression, instead, your theoretical model for an individual is
totally different. It is deterministic. Either she is in State A or State
B. No middle way. Either she drops out of college or she doesn’t. Either
she is alive or she is dead. There is no “probability” or “error” in the
model for individuals, only the two stark discrete states. What you
predict, what you are dealing with indeed, is not the state of each
individual, but the probability (read: EXPECTED RELATIVE FREQUENCY) of one
of the states IN A GROUP OF INDIVIDUALS. Your margin of error would be in
this relative frequency: for instance when you throw 100 coins, the expected
frequency of tails is 50, but perhaps you get 49 or 52, and that is your
margin of error. By the same token, if your model predicts (for some group
sharing a certain combination of predictors’ values) a probability of 0.20,
perhaps the proportion of actual events turns out to be 0.22 or 0.18. The
error is in the prediction of the relative frequency, not in the estimation
of each individual outcome, because you are not predicting individual
outcomes in that kind of model.
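
To put rough numbers on that margin of error, here is a small Python sketch using the standard binomial formulas. The group size of 400 is an assumption chosen for illustration; the post gives no group size. The function names are hypothetical, and independence between cases is assumed.

```python
import math

def count_sd(n, p):
    """Standard deviation of the number of events among n independent cases."""
    return math.sqrt(n * p * (1.0 - p))

def proportion_se(n, p):
    """Standard error of the observed proportion of events in a group of size n."""
    return math.sqrt(p * (1.0 - p) / n)

# 100 fair coins: 50 tails expected, with a typical deviation of about 5.
coins_expected = 100 * 0.5
coins_sd = count_sd(100, 0.5)

# Hypothetical group of 400 cases with predicted probability 0.20.
group_se = proportion_se(400, 0.20)

print(coins_expected, coins_sd)
print(round(group_se, 4))
```

Under these assumptions an observed proportion of 0.18 or 0.22 sits about one standard error away from the predicted 0.20.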
Hector

From: Peck, Jon [mailto:[hidden email]]
Sent: 10 June 2009 13:55
To: Hector Maletta; [hidden email]
Subject: RE: Re: [SPSSX-L] Estimating Missing Dependent Variables

A few more last comments below.


From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of
Hector Maletta
Sent: Monday, June 08, 2009 9:46 PM
To: [hidden email]
Subject: Re: [SPSSX-L] Estimating Missing Dependent Variables

Moreover, Jon,
Besides the decision question being different from the question of meaning
and estimation of the probability, in any case your decision will affect
entire groups or categories of people; you would be unable to say anything
about how the die will roll in each particular case within each category of
people sharing the same probability. Some people with high probability will
not suffer the event, while some with very low probability will.
[>>>Peck, Jon] And in the same way, in an ordinary conditional expectation
model, some units will have higher than expected outcomes and some lower,
and a regression equation cannot say any more about which are which given
equal expectation.
You, for instance, may decide to give a college scholarship (or student
loan) to all candidates whose predicted probability of college success is
over 0.8. Some of them will actually succeed, some will not, and you do not
have a clue about which is which. By the same token, you as a doctor must
decide on giving or not giving Treatment A to patients suffering a certain
painful disease which (if untreated) would probably kill them in a few
years. The treatment is effective in most cases, but in some cases it has
lethal side effects that would kill the patient instantly; that unfortunate
outcome is never certain: its probability p depends on some predictors.
You (or the medical profession) give the treatment to all patients whose p
is lower than, say, 1%. But you do not know in advance whether your next
patient (estimated p=0.009) will be cured, or whether she is instead a
member of the unlucky lot that will be instantly killed by the treatment.
It’s kinda Russian roulette at that point, only the gun has 100 chambers, of
which 99 are empty and one has a bullet. Knowing the probability of a
category of people tells you nothing about the individual fate of each
patient, just as (in Russian roulette) knowing that you have 1 chance in 6
tells you nothing about how things will turn out the next time you pull the
trigger.
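
The contrast between the group and the individual in this example can be illustrated with simple arithmetic in Python. The figure of 1000 treated patients is an assumption made here for illustration, the function names are hypothetical, and patients are treated as independent.

```python
def expected_events(n, p):
    """Expected number of events among n independent cases with risk p."""
    return n * p

def prob_at_least_one(n, p):
    """Probability that at least one of n independent cases has the event."""
    return 1.0 - (1.0 - p) ** n

p = 0.009   # estimated risk for the patient in the example
n = 1000    # hypothetical number of patients treated at that risk

print(expected_events(n, p))              # about 9 deaths expected in the group
print(round(prob_at_least_one(n, p), 4))  # some death in the group is near certain
print(prob_at_least_one(1, p))            # for one patient it is still just p
```

For the group, the outcome is predictable within a margin; for the next single patient, the calculation returns only the same 0.009 and says nothing about which way it will go.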
[>>>Peck, Jon] Isn't it equivalent to say that a very good p function, in
the extreme one predicting probabilities of 0 or 1, is like a regression
equation with a very high fit/small error?  If there is inherent randomness,
no model is going to capture that (but some vigorous overfitting might make
it seem otherwise :( )
Hector