MANOVA vs multiple ANOVAs


MANOVA vs multiple ANOVAs

William Dudley WNDUDLEY

I have results from a study in which I ran two univariate ANOVAs, each with the same three binary independent variables
but with different dependent variables.

UNIANOVA Math  BY schoolgrp gender status
  /METHOD=SSTYPE(3)
  /INTERCEPT=INCLUDE
  /CRITERIA=ALPHA(0.05)
  /DESIGN=schoolgrp gender status gender*schoolgrp schoolgrp*status gender*status
    gender*schoolgrp*status.

UNIANOVA verbal  BY schoolgrp gender status
  /METHOD=SSTYPE(3)
  /INTERCEPT=INCLUDE
  /CRITERIA=ALPHA(0.05)
  /DESIGN=schoolgrp gender status gender*schoolgrp schoolgrp*status gender*status
    gender*schoolgrp*status.

The main and interaction effects are different for the two models.  A reviewer wants us to rerun the analyses as a MANOVA.
It seems in this case that the main purpose of the MANOVA would be to control for Type I error inflation.
That would look like this:

GLM  Math Verbal  BY schoolgrp gender status
  /METHOD=SSTYPE(3)
  /INTERCEPT=INCLUDE
  /PRINT=DESCRIPTIVE ETASQ OPOWER PARAMETER
  /CRITERIA=ALPHA(.05)
  /DESIGN=schoolgrp gender status gender*schoolgrp schoolgrp*status gender*status
    gender*schoolgrp*status.


It has been a while since I worked with MANOVA, and I wonder if I can use some type of
LMATRIX contrast to examine the univariate results within the overall multivariate model.
That is, I would like to estimate the full model and then get each of my univariate results for math and then verbal - the three main effects and the interactions.

Thanks in advance for any help

Bill


William N. Dudley, PhD
Associate Dean for Research
The School of Health and Human Performance Office of Research
The University of North Carolina at Greensboro
126 HHP Building, PO Box 26170
Greensboro, NC 27402-6170
VOICE 336.256.2475
FAX 336.334.3238

Re: MANOVA vs multiple ANOVAs

Burleson,Joseph A.

Bill:

 

You could use LMATRIX, but you can do that with GLM whether univariate or multivariate.

 

Still, the multivariate run will give you the means for each DV. And the COMPARE keyword on EMMEANS can assist in interpreting the interaction patterns.

 

LMATRIX is very useful nonetheless.

 

Joe Burleson
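
A minimal sketch of this combined approach (untested; the two EMMEANS tables are only examples, and any term in the design could be compared the same way). Note that a multivariate GLM run prints the univariate tests of between-subjects effects for each DV alongside the multivariate tests, so the per-DV main effects and interactions come out of the single run:

GLM Math Verbal BY schoolgrp gender status
  /METHOD=SSTYPE(3)
  /INTERCEPT=INCLUDE
  /EMMEANS=TABLES(schoolgrp*gender) COMPARE(schoolgrp)
  /EMMEANS=TABLES(gender*status) COMPARE(gender)
  /CRITERIA=ALPHA(.05)
  /DESIGN=schoolgrp gender status gender*schoolgrp schoolgrp*status gender*status
    gender*schoolgrp*status.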

 



Loop using Leave...re-initialize at break points

King Douglas
Folks,

Below is a greatly simplified sample of an aggregated data file and syntax that works for the file as a whole.  What I need to do is compute year-to-date (YTD) means across Period by City and Year.

I have no trouble computing YTD.  My problem is getting the two LEAVE variables to re-initialize for each new combination of City and Year.  I'm open to using Python or other SPSS solutions.

DATA LIST FREE
  /CITY (A3) YEAR PERIOD VarXMean VarXN.
BEGIN DATA.
SFO 2007 1 71  300
SFO 2007 2 74  400
SFO 2007 3 75  250
SFO 2008 1 74  350
SFO 2008 2 74  400
SFO 2008 3 75  350
DFW 2007 1 82  300
DFW 2007 2 74  400
DFW 2007 3 75  250
DFW 2008 1 84  350
DFW 2008 2 84  400
DFW 2008 3 85  350
NYC 2007 1 62  300
NYC 2007 2 74  400
NYC 2007 3 85  250
NYC 2008 1 74  350
NYC 2008 2 64  400
NYC 2008 3 75  350
END DATA.

* This works for the file as a whole, but I can't figure out how to get the LEAVE variables to re-initialize for each new combination of City and Year.

COMPUTE Expanded = Expanded + VarXMean*VarXN.
LEAVE Expanded.
COMPUTE Cum_N = Cum_N + VarXN.
LEAVE Cum_N.
COMPUTE YTD = Expanded/Cum_N.

EXE.

To further complicate the issue, I have six variables in the same file for which I need YTD computed at the same break points, so I'm using the syntax above inside a DO REPEAT statement.

Hope someone is kind enough to come to my assistance.  :>)

Thanks very much,

King Douglas
American Airlines
Reply | Threaded
Open this post in threaded view
|

Re: Loop using Leave...re-initialize at break points

Maguin, Eugene
King,

It'd seem like Aggregate would be the natural choice for what you're doing
after first computing the VarXMean*VarXN product. Does Aggregate not work
for you in this problem?

Gene Maguin
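
A sketch of what that might look like (assuming MODE=ADDVARIABLES, which requires SPSS 14 or later); note, as King points out below, that this yields one total per City/Year group rather than a running year-to-date figure:

COMPUTE Expanded = VarXMean*VarXN.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES
  /BREAK=City Year
  /Sum_Exp = SUM(Expanded)
  /Sum_N = SUM(VarXN).
COMPUTE GrpMean = Sum_Exp/Sum_N.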




Re: Loop using Leave...re-initialize at break points

King Douglas
Gene,

I don't see how Aggregate can help.  For each City/Year/Period combination, the YTD figure has to be the grand mean of that period plus the preceding periods...therefore the YTD must be "weighted" by sample size per period.

In any case, I'm interested in the SPSS data manipulation challenge of how to re-initialize the LEAVE variables at each break point and loop through the computations.  I admit to being a tyro at LOOP/END LOOP, although I have a few LOOP snippets that I use from time to time.  This is a new problem.

Thanks very much for your input...and maybe you are right and I just can't see it.

King


Re: Loop using Leave...re-initialize at break points

Maguin, Eugene
King,

Well, maybe I'm misunderstanding what you want to do. Two questions.
1) Do you want to come out of the computation with the same number of cases
as you went into it with? In the example, you go in with 18 and you come out
with 18?
2) Do you want each case to show a running total or a total to date?

If 1) and 2) are yes, then I agree; Aggregate won't work. How's this for a
sample calculation?

CITY YEAR PERIOD VarXMean VarXN VarXMean*VarXN CumSum
SFO 2007 1 71  300  21300  21300
SFO 2007 2 74  400  29600  50900
SFO 2007 3 75  250  18750  69650
SFO 2008 1 74  350  25900  25900
SFO 2008 2 74  400  29600  55500
SFO 2008 3 75  350  26250  81750

If this is what you want, I'd do it this way. (I acknowledge that there are
probably other command sequences.)

Do if ($casenum eq 1).
+  compute cumsum=VarXMean*VarXN.
Else if (city eq lag(city) and year eq lag(year)).
+  compute cumsum=VarXMean*VarXN+lag(cumsum).
Else. /* city and/or year change.
+  compute cumsum=VarXMean*VarXN.
End if.


You know, I think you could use your code if you just reset the value of
your accumulation variables when you switched cities and/or years. Basically,
that's what I do in the Else clause, except that you'd set the accumulation
variables to 0.0.

Gene Maguin




Re: Loop using Leave...re-initialize at break points

hillel vardi
In reply to this post by King Douglas
Shalom

Using either ADD FILES or MATCH FILES you can flag the first and last case
in a group. Based on that you can re-initialize your cumulated variables.

Here is a small example.

sort cases by City Year Period.
add files file=* /by City Year /first=first.
if first eq 1 Expanded = 0.
if first eq 1 Cum_N = 0.
COMPUTE Expanded = Expanded + VarXMean*VarXN.
COMPUTE Cum_N = Cum_N + VarXN.
LEAVE Expanded Cum_N.
COMPUTE YTD = Expanded/Cum_N.



Hillel Vardi
BGU

Re: Loop using Leave...re-initialize at break points

King Douglas
In reply to this post by Maguin, Eugene
Gene,

Just getting ready to post my solution, I see yours, which is not dissimilar.

All I need to do is re-initialize the LEAVE variables whenever the break variable changes, which depends on the sort, of course.  I tried a variation on this earlier that didn't work (of course).

DATA LIST FREE
/CITY (A3) YEAR PERIOD VarXMean VarXN.
BEGIN DATA.
SFO 2007 1 71  300
SFO 2007 2 74  400
SFO 2007 3 75  250
SFO 2008 1 74  350
SFO 2008 2 74  400
SFO 2008 3 75  350
DFW 2007 1 82  300
DFW 2007 2 74  400
DFW 2007 3 75  250
DFW 2008 1 84  350
DFW 2008 2 84  400
DFW 2008 3 85  350
NYC 2007 1 62  300
NYC 2007 2 74  400
NYC 2007 3 85  250
NYC 2008 1 74  350
NYC 2008 2 64  400
NYC 2008 3 75  350
END DATA.

* Different sorts yield different results.

SORT CASES BY Year City Period.

* Initialize the LEAVE variables computed below.

NUMERIC Expanded Cum_N (f5.2).

* Set the initial value for each LEAVE variable to zero at every break.
* (Testing Year as well guards against the same city falling on both sides
* of a year boundary; LAG is missing on the first case, so the reset fires
* there too.)

DO IF (City NE LAG(City) OR Year NE LAG(Year)).
COMPUTE Expanded = 0.
COMPUTE Cum_N = 0.
END IF.

* Compute Year-to-Date.

COMPUTE Expanded = Expanded + VarXMean*VarXN.
LEAVE Expanded.
COMPUTE Cum_N = Cum_N + VarXN.
LEAVE Cum_N.
COMPUTE YTD = Expanded/Cum_N.
EXE.

Thanks Very, Very Much for your help.

King Douglas
American Airlines Customer Research


Re: Loop using Leave...re-initialize at break points

Art Kendall
In reply to this post by King Douglas
See if this does what you want.  No need for macros, Python, or LEAVE.

Art Kendall
Social Research Consultants
DATA LIST FREE
  /CITY (A3) YEAR PERIOD VarXMean VarXN.
BEGIN DATA.
SFO 2007 1 71  300
SFO 2007 2 74  400
SFO 2007 3 75  250
SFO 2008 1 74  350
SFO 2008 2 74  400
SFO 2008 3 75  350
DFW 2007 1 82  300
DFW 2007 2 74  400
DFW 2007 3 75  250
DFW 2008 1 84  350
DFW 2008 2 84  400
DFW 2008 3 85  350
NYC 2007 1 62  300
NYC 2007 2 74  400
NYC 2007 3 85  250
NYC 2008 1 74  350
NYC 2008 2 64  400
NYC 2008 3 75  350
END DATA.
*this is where you would have something like.
*do repeat
  expanded = a list of variables
 /cum_expanded = a list of variables
 /cum_n = a list of variables
 /YTD = a list of variables.


COMPUTE Expanded =VarXMean*VarXN.
do if  (city eq lag(city) and period gt lag(period)).
COMPUTE cum_Expanded = lag(cum_expanded) + (VarXMean*VarXN).
COMPUTE Cum_N    = lag(cum_n) + VarXN.
ELSE  .
COMPUTE cum_expanded =Expanded .
COMPUTE Cum_N =  VarXN.
end if.
COMPUTE YTD = cum_Expanded/Cum_N.
* end repeat.
EXE.
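
For the six-variable case King describes, the commented-out DO REPEAT above might be filled in like this (a sketch using hypothetical names Var1Mean/Var1N and Var2Mean/Var2N; extend each list to all six variables):

DO REPEAT vmean = Var1Mean Var2Mean
  /vn = Var1N Var2N
  /cexp = cum_exp1 cum_exp2
  /cn = cum_n1 cum_n2
  /vytd = ytd1 ytd2.
DO IF (city EQ LAG(city) AND period GT LAG(period)).
+  COMPUTE cexp = LAG(cexp) + vmean*vn.
+  COMPUTE cn = LAG(cn) + vn.
ELSE.
+  COMPUTE cexp = vmean*vn.
+  COMPUTE cn = vn.
END IF.
COMPUTE vytd = cexp/cn.
END REPEAT.
EXECUTE.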



Re: Loop using Leave...re-initialize at break points

King Douglas
Art,

Some people see the forest, some only trees.

Thanks for the parsimonious and cool solution to my problem.  As usual, I learned something from you in addition to the solution.

King


Logistic Regression

<R. Abraham>

I have 3 questions on predictive modeling:

1. I am building a logistic regression model with about 480 predictors. The 'training' sample has about 18000 records with about 3000 responders. I would like to know how many significant predictors the model can have. Is there any suggested number of significant variables that a model can have?

2. Can a predictive model perform better on the 'validation' sample than on the 'training' sample? The results on my validation sample are at least 15% better than on the 'training' sample in the top deciles.

3. Does the "Kolmogorov–Smirnov test" help in finding out how much the 'Validation' sample results can differ from the 'Training' sample results? If so, can someone give me some pointers on how to perform the test?

Thanks.
R. Abraham



Re: Logistic Regression

Ornelas, Fermin-2

This answer is on (1) and (2). There is no magic number for the set of predictor variables in a model, but once you clean the data and the model itself you could have between 7 and 18 predictors. That was my experience.

 

It is possible for a model to perform better or worse in a validation sample than in a training sample. However, to ensure that the model performs equally well, you need to make sure that your descriptive statistics are similar in the validation and training samples. If the performance difference is large, that could pose a problem when implementing the model, particularly if the performance is worse, which is not your case.

 

Regarding the test, I cannot give input off the top of my head for fear of getting some uncomfortable feedback. But I used to graph a Lorenz curve plotting both the training and validation samples and calculate the lift.
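
For question 3, a sketch of one possibility (the variable names are hypothetical): stack the predicted probabilities from both samples in one file, with pred_prob holding the model score and sample coded 1=training, 2=validation; the two-sample Kolmogorov-Smirnov test then compares the two score distributions:

NPAR TESTS
  /K-S= pred_prob BY sample(1,2)
  /MISSING ANALYSIS.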

 


From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of <R. Abraham>
Sent: Wednesday, April 22, 2009 1:37 PM
To: [hidden email]
Subject: Logistic Regression

 


I have 2 questions on Predictive Modeling:

1. I am building a logistic regression model with about 480 predictors. The 'training' sample has about 18000 records with about 3000 responders. I would like to know how many significant predictors can the model have? Is there any suggested number of significant variables that a model can have?

2. Can a predictive model perform better on the 'validation' sample than that seen in the 'Training' sample. The test results on my validation sample performs atleast 15% better than the 'train' sample in the prediction in the top deciles.

3. Does the "Kolmogorov–Smirnov test" help in finding out how much the 'Validation' sample results can differ from the 'Training' sample results? If so, can someone give me some pointers on how to perform the test?

Thanks.
R. Abraham



NOTICE: This e-mail (and any attachments) may contain PRIVILEGED OR CONFIDENTIAL information and is intended only for the use of the specific individual(s) to whom it is addressed. It may contain information that is privileged and confidential under state and federal law. This information may be used or disclosed only in accordance with law, and you may be subject to penalties under law for improper use or further disclosure of the information in this e-mail and its attachments. If you have received this e-mail in error, please immediately notify the person named above by reply e-mail, and then delete the original e-mail. Thank you.

Re: Logistic Regression

Hector Maletta
In reply to this post by <R. Abraham>

The number of records is, I think, irrelevant if the sample consists of 3000 subjects. Sample size is 3000. If the sample is a random sample, there are ways to judge the marginal increase in significance due to the addition of one more predictor.

Now, using 480 predictors seems a bit of an overkill strategy. Is there any actual theory with 480 mutually independent additive factors jointly influencing the probability (or the odds) of occurrence of your event? Or are you simply shooting in the dark? I bet you can obtain a reasonably good model with just a smaller number of judiciously chosen predictors. Choosing judiciously may indeed involve some initial shooting in the dark, until you find out which are the very best predictors. Choose the best and ignore the rest, unless you have good theoretical reasons to include them all.

Besides, remember that the outcome of logistic regression is NOT the prediction of individual outcomes, but the prediction of population proportions. You may throw a coin 1000 times, predict heads 50% of the time and tails the other 50%, and fail miserably on every single throw; but still the coin will show heads 50% of the time even if your predictions of individual throws mostly fail. Each particular throw is indeterminate, but the population of 1000 throws will have about 500 heads and 500 tails (you simply do not know which).

 

Hector

 


From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of <R. Abraham>
Sent: 22 April 2009 17:37
To: [hidden email]
Subject: Logistic Regression

 


I have 2 questions on Predictive Modeling:

1. I am building a logistic regression model with about 480 predictors. The 'training' sample has about 18000 records with about 3000 responders. I would like to know how many significant predictors can the model have? Is there any suggested number of significant variables that a model can have?

2. Can a predictive model perform better on the 'validation' sample than that seen in the 'Training' sample. The test results on my validation sample performs atleast 15% better than the 'train' sample in the prediction in the top deciles.

3. Does the "Kolmogorov–Smirnov test" help in finding out how much the 'Validation' sample results can differ from the 'Training' sample results? If so, can someone give me some pointers on how to perform the test?

Thanks.
R. Abraham


Re: Logistic Regression

<R. Abraham>

Hector,
I started with about 700 variables (including demographic, lifestyle, census, and household cluster variables). By eliminating variables deemed not useful or less useful, along with correlated variables, I brought it down to somewhere around 375; the rest (480 minus 375) were dummy variables for the categorical variables. So I am really not shooting in the dark. But I can't think of eliminating further variables, unless I start eliminating even faintly correlated variables (say 0.4), or start running a series of regression models with different sets of variables and eliminating the insignificant ones from all the models, thereby reducing the number of variables to be included in the final model. I already do that to a certain extent; not sure if it's advisable, but it does work.

My final model is quite stable, and in fact it performs better in the validation sample when tested. However, the number of significant variables in my final model is somewhere around 40. I know that modelers usually suggest around 10-18 variables, similar to what Fermin suggested in the earlier post. I get the same number of variables when I run the regression model using the 'Enter' method in SPSS, and also when running it in SAS (Stepwise). The Forward:LR stepwise in SPSS takes days to complete, so I am still waiting for its results. I can limit the number of significant model variables in SPSS using the stepwise method, by selecting the appropriate step in the model.

But since I am getting good results with the 40-variable model that I already have, I was wondering if it's OK to accept it.

And any suggestions on my third question regarding the K-S test?

Thank you so much.

R. Abraham




"Hector Maletta" <[hidden email]>

04/22/2009 04:57 PM

To
<[hidden email]>, <[hidden email]>
cc
Subject
RE: Logistic Regression





The number of records are, I think, irrelevant if the sample consists of 3000 subjects. Sample size is 3000. If the sample was a random sample, there are ways to judge the marginal increase in significance due to the addition of one more predictor.
Now, using 480 predictors seems a bit of an overkill strategy. Is there any actual theory with 480 mutually independent additive factors jointly influencing the probability (or the odds) of occurrence of your event? Or you’re simply shooting in the dark? I bet you can obtain a reasonably good model with just a smaller number of judiciously chosen predictors. Choosing judiciously may indeed involve some initial shooting in the dark, until you find out what are the very best predictors. Choose the best and ignore the rest, unless you have good theoretical reasons to include them all.
Besides, remember that the outcome of logistic regression is NOT the prediction of individual outcomes, but the prediction of population proportions. You may throw a coin 1000 times and predict heads 50% of the times and tails the other 50% of the throws, and miserably fail every time; but still the coin will show heads 50% of the times even if your prediction of individual throws mostly fail. Each particular throw is indeterminate, but the population of 1000 throws will have about 500 heads and 500 tails (you simply do not know which).
 
Hector
 



From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of <R. Abraham>
Sent:
22 April 2009 17:37
To:
[hidden email]
Subject:
Logistic Regression

 

I have 2 questions on Predictive Modeling:


1. I am building a logistic regression model with about 480 predictors. The 'training' sample has about 18000 records with about 3000 responders. I would like to know how many significant predictors can the model have? Is there any suggested number of significant variables that a model can have?


2. Can a predictive model perform better on the 'validation' sample than that seen in the 'Training' sample. The test results on my validation sample performs atleast 15% better than the 'train' sample in the prediction in the top deciles.


3. Does the "
Kolmogorov–Smirnov test" help in finding out how much the 'Validation' sample results can differ from the 'Training' sample results? If so, can someone give me some pointers on how to perform the test?

Thanks.

R. Abraham



Re: Logistic Regression

Martin P. Holt-2
In reply to this post by Hector Maletta
Dear R. Abraham,
 
To answer your question, there are guidelines, but not everyone is agreed on which should be used. The one I've heard most favoured is that you should have at least 10 of the rarer of the two outcomes for every variable considered. That's not the same as "for every variable in the final model." (In your case the rarer outcome is the roughly 3000 responders, so that guideline would allow about 300 candidate variables.) And then, usually, it is said, "But 15 would be better." I have seen it as low as 5, but I definitely wouldn't go there! I'm just illustrating that there is no hard-and-fast rule.
 
Best Wishes,
Martin Holt

Re: Logistic Regression

Hector Maletta
In reply to this post by <R. Abraham>

If you have a model with 40 significant coefficients, just take it and discard the other 440 variables.

 

But even if your model with 40 variables performs well, it may be the case that the last 5 or 10 of those 40 predictors add very little to the result. They are certainly statistically significant, in the sense that at the 5% level you can reject the hypothesis that they are zero in the population, but the coefficients may still be quite small, and when multiplied by a variable with low absolute values they may modify the result by an almost imperceptible amount. If this is the case, you may try a leaner model without those last variables. Unless, of course, there are strong theoretical reasons to have them in the model.

 

Hector

 

 

 

 



Re: Logistic Regression

Ergul, Emel A.
In reply to this post by Ornelas, Fermin-2
OK -
I remember from journal reviewers that the total number of predictors for LR can be
at most the number of events divided by 10. They say that over this number the model
becomes unstable... How about that?




Re: Logistic Regression

Johnny Amora
In reply to this post by <R. Abraham>
Hi Abraham,
May I know why you only use logistic regression? Why did you not use other predictive models such as C5.0, CHAID, C&RT, neural networks, etc.? The predictive modeling literature says there are lots of instances where the models I have mentioned can perform better than logistic regression. Such models are available in SPSS Clementine (now called PASW Modeler).

-Johnny


Re: Logistic Regression

Soley, Bonita (HQ-LF010)
Just as a note - if your budget can't handle Clementine (I refuse to bow to the new nomenclature), the tree models are also available as an add-on (Decision Trees) or the stand-alone "AnswerTree" program, which is much less expensive and viable, although they don't offer quite as much as Clementine.
 
I've used this type of predictive analysis quite a bit, and actually did it "manually" before the software came out (printed on those nice big green and white computer printout sheets - remember those!?!!??!)
 
Bonita
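
An untested sketch of what a CHAID run looks like in syntax with the Decision Trees add-on (the response and predictor names here are hypothetical; [n] and [s] mark nominal and scale measurement levels):

TREE responder [n] BY age [s] income [s] segment [n]
  /METHOD TYPE=CHAID.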



Re: Logistic Regression

<R. Abraham>
In reply to this post by Hector Maletta

That makes sense. So you suggest eliminating from the model the significant variables at the bottom with negligible betas, and rerunning it. I am going to try that.

R. Abraham




"Hector Maletta" <[hidden email]>

04/22/2009 06:41 PM

To
<[hidden email]>, <[hidden email]>
cc
Subject
RE: Logistic Regression





If you have a model with 40 significant coefficients, just take it and discard the other 440 variables.
 
But even if your model with 40 variables performs well, it may be the case that the last 5 or 10 of those 40 predictors add very little to the result. They are certainly statistically significant in the sense that the probability of their being zero in the population is lower than 5%, but they may still be quite low, and when multiplied by a variable with low absolute values they may modify the result by an almost imperceptible amount. If this is the case, you may try a leaner model without those last variables. Unless, of course, there are strong theoretical reasons to have them in the model.
 
Hector
 
 
 
 



From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of <R. Abraham>
Sent:
22 April 2009 18:39
To:
[hidden email]
Subject:
Re: Logistic Regression

 

Hector,

I started with about 700 variables (includes demographic, lifestyles, census, household cluster variables). By eliminating variables not deemed useful/less useful, and correlated variables, I brought it down to somewhere around 375, the rest (480-375) were dummy variables of categorical variables. So I am really not shooting in the dark. But can't think of further eliminating variables, unless I start eliminating even faintly correlated variables (say 0.4). Or start running a series of regression models with different sets of variables and eliminate the insignificant ones from all the models, thereby reducing the number of variables to be included in the final model? I already do that to a certain extent, not sure if it's advisable but it does work.


My final model is quite stable, and infact it performs better in the validation sample when tested. However the number of significant variables in my final model is somewhere around 40. I know that modelers usually suggest around 10-18 variables similar to what Fermin in the earlier post suggested. I get the same number of variables when I run the regression model using the 'Enter' method in SPSS, and also when running it in SAS (Stepwise). The Forward:LR Stepwise in SPSS takes days to complete, so still waiting for its results. I can limit the number of significant model variables in SPSS using the Stepwise method, by selecting the appropriate Step in the Model.


But since I am getting good results with the model with about 40 variables that I already have, I was thinking if it's Ok to accept it.


And any suggestions on my third question regarding the K-S test?


Thank you so much.


R. Abraham


"Hector Maletta" <[hidden email]>

04/22/2009 04:57 PM


To
<[hidden email]>, <[hidden email]>
cc
 
Subject
RE: Logistic Regression

 


   





The number of records are, I think, irrelevant if the sample consists of 3000 subjects. Sample size is 3000. If the sample was a random sample, there are ways to judge the marginal increase in significance due to the addition of one more predictor.

Now, using 480 predictors seems a bit of an overkill strategy. Is there any actual theory with 480 mutually independent additive factors jointly influencing the probability (or the odds) of occurrence of your event? Or you’re simply shooting in the dark? I bet you can obtain a reasonably good model with just a smaller number of judiciously chosen predictors. Choosing judiciously may indeed involve some initial shooting in the dark, until you find out what are the very best predictors. Choose the best and ignore the rest, unless you have good theoretical reasons to include them all.

Besides, remember that the outcome of logistic regression is NOT the prediction of individual outcomes, but the prediction of population proportions. You may throw a coin 1000 times and predict heads 50% of the times and tails the other 50% of the throws, and miserably fail every time; but still the coin will show heads 50% of the times even if your prediction of individual throws mostly fail. Each particular throw is indeterminate, but the population of 1000 throws will have about 500 heads and 500 tails (you simply do not know which).

 
Hector

 

 



From:
SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of <R. Abraham>
Sent:
22 April 2009 17:37
To:
[hidden email]
Subject:
Logistic Regression

 


I have 2 questions on Predictive Modeling:


1. I am building a logistic regression model with about 480 predictors. The 'training' sample has about 18000 records with about 3000 responders. I would like to know how many significant predictors can the model have? Is there any suggested number of significant variables that a model can have?


2. Can a predictive model perform better on the 'validation' sample than that seen in the 'Training' sample. The test results on my validation sample performs atleast 15% better than the 'train' sample in the prediction in the top deciles.


3. Does the "
Kolmogorov–Smirnov test" help in finding out how much the 'Validation' sample results can differ from the 'Training' sample results? If so, can someone give me some pointers on how to perform the test?

Thanks.

R. Abraham

12