Mean of an unknown number of variables

Mean of an unknown number of variables

Greg Chulsky
I am writing code that will, when done, be run as a batch operation on
over 1,000 files.  At one point in the code, I need to create a new
variable whose value is the mean of all the other variables on the data
sheet.  The problem is that each file contains a different number of
variables.  This is where something like COMPUTE M0=MEAN(ALL) would have
been wonderful--except that, alas, ALL can't be used that way.  Is there
any way to pull this off?

Thanks!

-Greg

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Re: Mean of an unknown number of variables

Peck, Jon
If you know the first and last variables, you can use TO in the argument:
compute mean = mean(a to z).

You could also do this with programmability:
begin program.
import spss
c = spss.GetVariableCount()
v0 = spss.GetVariableName(0)
vn = spss.GetVariableName(c-1)
cmd = "compute overallmean = mean(%(v0) TO %(vn)s)" % locals()
spss.Submit(cmd)
end program.

For more sophisticated filtering, you could use the spssaux.VariableDict class.

This requires at least SPSS 14 and the Python programmability plug-in.

HTH,
Jon Peck




Re: Mean of an unknown number of variables

Peck, Jon
Sorry, missing an s.  Corrected below.

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Peck, Jon
Sent: Tuesday, August 05, 2008 2:52 PM
To: [hidden email]
Subject: Re: [SPSSX-L] Mean of an unknown number of variables

If you know the first and last variables, you can use TO in the argument:
compute mean = mean(a to z).

You could also do this with programmability:
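begin program.
import spss
# Look up the first and last variables of the active dataset by position.
c = spss.GetVariableCount()
v0 = spss.GetVariableName(0)
vn = spss.GetVariableName(c-1)
# Build a COMPUTE over the positional range "first TO last" and run it.
cmd = "compute overallmean = mean(%(v0)s TO %(vn)s)" % locals()
spss.Submit(cmd)
end program.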

For more sophisticated filtering, you could use the spssaux.VariableDict class.

This requires at least SPSS 14 and the Python programmability plug-in.

HTH,
Jon Peck
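
A minimal sketch of the "more sophisticated filtering" idea, written without spssaux so it can be read on its own. It builds the COMPUTE command from an explicit list of names instead of a positional TO range, skipping string variables and the target variable itself. The helper name build_mean_command and the sample variable list are hypothetical. Inside SPSS the (name, type) pairs would presumably come from spss.GetVariableName(i) and spss.GetVariableType(i), assuming the convention that type 0 means numeric and a positive value is a string width.

```python
def build_mean_command(variables, target="overallmean"):
    """Return a COMPUTE command averaging every numeric variable.

    variables: list of (name, type_code) pairs, where type_code 0 means
    numeric and a positive value is a string width (assumed convention).
    """
    # Keep numeric variables only, and never average the target into itself.
    names = [name for name, type_code in variables
             if type_code == 0 and name.lower() != target.lower()]
    if not names:
        raise ValueError("no numeric variables to average")
    return "compute %s = mean(%s)." % (target, ", ".join(names))


# Hypothetical file layout: a string ID, two numeric items, and a
# target variable left over from an earlier run.
sample = [("id", 8), ("q1", 0), ("q2", 0), ("overallmean", 0)]
command = build_mean_command(sample)
print(command)
```

The resulting string could then be passed to spss.Submit in the same way as the TO version. Listing names explicitly keeps the result independent of where the target variable happens to sit in the file.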




Re: Mean of an unknown number of variables

Greg Chulsky
Thank you!  The code makes sense to me, but when I run it, "overallmean"
ends up equaling not the mean, but the last variable.  I cannot for my
life understand why it's behaving like that.

-Greg


Opinions about validity of Predictive Analytics programs?

mpirritano
Hello spssers,

I know that SPSS has a predictive analytics module. I've also been
exposed to predictive analytic programs that make use of actuarial data
to predict risk in healthcare settings. What do statisticians think of
these models? Let me explain my own motivation for asking the question.

As an experimental psychologist I have seen a lot of research that tries
to explain human behavior. This research is often held up as exemplary
if it can explain 40 or 60 percent of the variance in that behavior. I
think 60 percent in many areas is probably all but unheard of. Now how
do these predictive analytic techniques describe the degree of explained
variance? I asked someone who works for a statistical package software
company and they told me that there was nothing akin to r squared in
these packages. Not to mention the fact that the back end (actual
calculations) of these techniques is not realistically understandable to
99% of the individuals that use them. So somehow statisticians have
developed these incredibly accurate ways of predicting future behaviors,
while the field of psychology plows on unaware of these successes?
Seems unlikely.

To me it just seems like software companies are playing into the myth
that statistics can magically tell you what you want to increase your
profits. They present 'testimonials', the last refuge of a scoundrel, to
support their claims. Is this not a case of absolute power leading to
absolute corruption?

All joking aside, does anyone have an opinion about this? As a lowly
peon I'm not sure if my opinion is valid or if I'm missing something
basic.

Thanks
Matt

Matthew Pirritano, Ph.D.
Research Analyst IV
Orange County Health Care Agency
(714) 834-6011


Python to identify and change length of long strings

King Douglas
I'm using SPSS 15. 

I have a need to automate a process that will identify strings of length greater than 255 and reduce them to length = 255.

Now if my input files were never to change, I could do this once in SPSS syntax and I'd be done.  However, the input variables are likely to change from time to time.

A Python workaround would be a very useful tool.

Thanks,

King Douglas
American Airlines Customer Research
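
A minimal sketch of how such a step might be automated, under stated assumptions. The helper name truncate_commands and the sample variable list are hypothetical. The (name, type) pairs are assumed to follow the spss.GetVariableType convention of 0 for numeric and the declared width for strings. The generated syntax uses ALTER TYPE, which I believe arrived in a release later than SPSS 15, so on 15 an equivalent STRING / COMPUTE / RENAME VARIABLES sequence would have to be generated instead.

```python
MAX_WIDTH = 255


def truncate_commands(variables, max_width=MAX_WIDTH):
    """Return one ALTER TYPE command per string wider than max_width.

    variables: list of (name, type_code) pairs; type_code is 0 for numeric
    and the declared width for strings (assumed convention).
    """
    return ["ALTER TYPE %s (A%d)." % (name, max_width)
            for name, type_code in variables
            if type_code > max_width]


# Hypothetical file: one numeric ID, two overlong strings, one short string.
sample = [("respid", 0), ("comment", 1000), ("city", 40), ("verbatim", 256)]
commands = truncate_commands(sample)
print("\n".join(commands))
```

Because the list of variables is read from each file at run time, nothing has to be edited when the input variables change.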


Re: Opinions about validity of Predictive Analytics programs?

Hector Maletta
In reply to this post by mpirritano
I do not know much about software for predictive analytics, but could offer just a two-cents bit on some of the more general issues raised in your message:

Probabilities based on actuarial data in general, although often used to predict individual outcomes, are best understood in frequentist terms, i.e. as predicated on populations, rather than individuals. In other terms, when the outcome of a prediction is, say, a 0.6 probability of survival for 5 years after a heart attack, this simply means that 6 out of every 10 patients with such and such characteristics are predicted to survive for 5 years; the fate of any individual patient with such characteristics, say Mr John Smith, is radically indeterminate: he could die tomorrow or live for another 40 years; any individual outcome for Mr Smith is compatible with the prediction of a 0.6 probability of surviving 5 years, which refers to the entire group that includes Mr Smith and other similar patients, but not to any individual in particular.
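
A small simulation may make the population-versus-individual point concrete. It is only a sketch: the function name simulate_survival, the group size, and the seed are made up, and each patient is simply assumed to survive independently with probability 0.6.

```python
import random


def simulate_survival(n_patients, p_survive=0.6, seed=0):
    """Draw one survive/die outcome per patient with a fixed probability."""
    rng = random.Random(seed)
    return [rng.random() < p_survive for _ in range(n_patients)]


outcomes = simulate_survival(10000)
rate = sum(outcomes) / len(outcomes)

# The group-level frequency settles close to 0.6 ...
print("group survival rate: %.3f" % rate)
# ... while each individual outcome is simply True or False.
print("first five individual outcomes:", outcomes[:5])
```

The 0.6 shows up only as a frequency across the group; no single entry in the list is "0.6 alive".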

Regarding variance explained, the matter is likewise. A particular model explains, say, 60% of the squared differences between individuals in a particular variable; the other 40% of variability is due to other factors alien to the model. The prediction produced by the model, explaining such proportion of total variance with a sample of given size, will have a confidence interval, which means that on average, say, 95% of individual cases will fall within the confidence interval (which in that case will be +/- 2 standard errors around the predicted value, the SE being a function of variance and sample size). A particular individual (our Mr Smith again) could possibly be inside or outside the confidence interval: nothing hinders him from being miles away from the predictive curve. If the model prediction for patients like Mr Smith is a life expectancy of 8 years, Mr Smith himself may ultimately survive for another 40 years, or just for a few hours: both are compatible with the prediction. But if you have 1000 guys like Mr Smith, and your study is good, you may bet that about 95% of them will live for 8 years +/- 2 standard errors.

Now, all this is true of classical parametric analyses such as linear regression and related procedures. Predictive analytics often uses nonparametric techniques like Cox survival models. Being nonparametric, these procedures do not assume normally distributed errors, as linear regression does. However, some approximate measures of fit and statistical significance do exist even for those procedures, and stats software packages usually provide them. One key point to remember is that complex models with many predictors require large samples to do a proper job with small margins of error. This is doubly true for nonparametric models, because their margins of error are unknown and probably larger. Many empirical studies of that kind are based on small samples, and therefore the results can easily contradict those of other similar studies, just by sample fluke.

A final word: predictive models do not try to EXPLAIN behavior, but to PREDICT it. There is a huge difference.

Hector


Re: Opinions about validity of Predictive Analytics programs?

David Hitchin
In reply to this post by mpirritano
Quoting "Pirritano, Matthew" <[hidden email]>:
>  I know that SPSS has a predictive analytics module. I've also been
> exposed to predictive analytic programs that make use of actuarial
> data to predict risk in healthcare settings. etc
>
> All joking aside, does anyone have an opinion about this? As a lowly
> peon I'm not sure if my opinion is valid or if I'm missing something
> basic.
>
As always, Hector Maletta has written some very sensible words in reply,
but I have a few comments to add.

The first is that predictive equations work on the assumption that the
world has not changed between the formulation of the model and the
calculation of risk in the future.

Next, when a very large proportion of a population behaves in a similar
way, then trivial predictions seem to have great predictive power, e.g.
if only 1 person in 1000 gets a rare disease, then the trivial
prediction that all 1000 people will remain healthy achieves an
accuracy of 99.9% - but is completely useless at identifying the one
person who may need treatment. (The statistics for screening programmes
which attempt to identify cancers generally identify so many false
positives that they are of questionable value).
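
The arithmetic of the rare-disease example can be checked in a few lines. This is just an illustration of the numbers quoted above (1 case in 1000, everyone predicted healthy); the variable names are arbitrary.

```python
# One sick person (True) among 1000, as in the example above.
actual = [True] + [False] * 999
# Trivial rule: predict that everyone stays healthy.
predicted = [False] * 1000

# Accuracy: share of people whose predicted status matches reality.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Sensitivity: share of the truly sick who were flagged as sick.
true_positives = sum(a and p for a, p in zip(actual, predicted))
sensitivity = true_positives / sum(actual)

print("accuracy:    %.1f%%" % (100 * accuracy))
print("sensitivity: %.1f%%" % (100 * sensitivity))
```

Accuracy rewards the 999 easy cases, while sensitivity shows that the one person who matters was missed.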

Often models are fitted on the basis of limited data, with no subjects
at all observed under some of the combinations of circumstances, so
predictions are inappropriate when new individuals are observed with
those characteristics.

Finally, fitted models need proper validation. A common method is to
construct a "hits and misses table", i.e. each observation is checked
against the model to see what the outcome should be, and this is
compared with the known outcome. The flaw here is testing the model on
the same data which was used to set its parameters. Jack-knifing and
bootstrapping methods can be used to reduce this bias, e.g. no
observation is used at the same time for fitting the model and testing it.
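
A minimal sketch of the difference between a "hits and misses" table built on the fitting data and a leave-one-out check, which is the jackknife-style idea of never testing a case against a model that has seen it. The data are made up, and a one-nearest-neighbour rule stands in for whatever model is actually being validated.

```python
def nearest_label(x, training):
    """Predict the label of the training point closest to x (1-NN)."""
    return min(training, key=lambda pair: abs(pair[0] - x))[1]


# Made-up (score, outcome) pairs; the two classes overlap on purpose.
data = [(1.0, 0), (2.1, 0), (3.0, 1), (4.2, 0), (5.0, 1), (6.1, 1)]

# Resubstitution: every case is tested against a model that already saw it.
resub_hits = sum(nearest_label(x, data) == y for x, y in data)

# Leave-one-out: each case is predicted from the other cases only.
loo_hits = 0
for i, (x, y) in enumerate(data):
    training = data[:i] + data[i + 1:]
    loo_hits += nearest_label(x, training) == y

print("resubstitution hits: %d of %d" % (resub_hits, len(data)))
print("leave-one-out hits:  %d of %d" % (loo_hits, len(data)))
```

With this rule the resubstitution table is perfect by construction, because each case is its own nearest neighbour, which is exactly why it says nothing about new cases.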


Re: Opinions about validity of Predictive Analytics programs?

mpirritano
Are there any reviews and/or critiques of predictive modeling that
anyone might recommend?

Thanks for the feedback!

Matthew Pirritano, Ph.D.
Research Analyst IV
Orange County Health Care Agency
(714) 834-6011

Re: Python to identify and change length of long strings

Peck, Jon
In reply to this post by King Douglas
Why do you need this?  If the issue is reading long strings into old versions of SPSS, this is done automatically by SPSS already.  An overly long string will show up as multiple string variables when that is needed (and magically reunite when read in a newer version as long as you don't mess with the parts).

HTH,
Jon Peck


Re: Python to identify and change length of long strings

King Douglas-2
In reply to this post by King Douglas
Hey, Jon,
Thanks for jumping in.  This is related to a problem I'm having with an online survey vendor.  I posted a separate thread on a problem I'm trying to diagnose, where that vendor's survey data files, when exported to SPSS, can be viewed but not saved by SPSS 13 users.

The vendor's application will also, without warning, sometimes change the length of existing string variables.  The files in question must be merged into monthly and quarterly data files, so two string variables with the same name and unequal lengths cause an error.

I'm looking for a way to automatically convert previously unidentified changes to string lengths to the appropriate lengths.

My hunch is that the nifty SPSS feature you describe will not solve my problem in this case.

Cheers,

King
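
A minimal sketch of one way to plan the width fixes before a merge, under stated assumptions. The helper name harmonize_commands and the file names are hypothetical, and the per-file widths are assumed to have been collected already (for instance from spss.GetVariableType, which is assumed to return the declared width for string variables). The generated syntax uses ALTER TYPE, which I believe is only available in releases newer than SPSS 15.

```python
def harmonize_commands(files):
    """Plan ALTER TYPE commands that equalize string widths across files.

    files: dict mapping file name -> {variable name: string width}.
    Returns {file name: [commands]} widening each string variable to the
    largest width seen for that name in any file.
    """
    # First pass: the widest declaration of each variable across all files.
    widest = {}
    for widths in files.values():
        for name, width in widths.items():
            widest[name] = max(width, widest.get(name, 0))
    # Second pass: one command per variable that is narrower than that.
    return {
        fname: ["ALTER TYPE %s (A%d)." % (name, widest[name])
                for name, width in sorted(widths.items())
                if width < widest[name]]
        for fname, widths in files.items()
    }


# Hypothetical monthly files where the vendor changed one width.
monthly = {
    "jan.sav": {"comment": 200, "city": 40},
    "feb.sav": {"comment": 255, "city": 40},
}
plan = harmonize_commands(monthly)
print(plan)
```

Widening to the largest width is a design choice that avoids losing text; truncating everything to the smallest width would need only a min in place of the max.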
