Performing regression on groups of items

Steve Wagner-3
Hello,



I have an application where I would like to perform 1,000 or more simple
regressions for a historical file of sales transactions by item number.  The
dependent variable will be the sales amount (actually, LN(Sales)); the
independent variable will be time (1/day to be exact).  Each item number
will have a varying number of days in which sales occur.  I would like to
automatically take in the full transaction file (say, 500,000 records) and
extract a file that includes the regression statistics for each of the item
numbers (say, 5,000 records).  I have tried this under Statistica using
their grouping function but found that the output workbook blew up after 800
or so regressions (too large a size), requiring me to divide the input file.



1)      Is there a way, within SPSS, to automatically perform regressions
and extract the regression statistics based on a grouping variable?

2)      If so, are there any scaling concerns, as with Statistica?

3)      If there is not a way, do you know of any other statistical program
that could support this (like S Plus)?



Thanks, and looking forward to your input.



Steve



___________________________________
Steve Wagner
Senior Manager, Supply Chain Design
CHAINalytics

2500 Cumberland Parkway, Suite 550
Atlanta, GA 30339 USA
o: 715.757.2200
m: 920.737.2742
f: 770.456.5393
[hidden email]
www.chainalytics.com

Re: Performing regression on groups of items

Art Kendall-2
Look at SPLIT FILE.

Art Kendall
Social Research Consultants


Re: Performing regression on groups of items

Peck, Jon
In reply to this post by Steve Wagner-3
This is an interesting question.  The first part, running a regression for each item number, is simply a matter of using SPLIT FILE on the item number, assuming that the data are sorted by item number.

The second part, capturing the output, is a matter of using OMS or the Regression OUTFILE or MATRIX OUT subcommands to capture the regression output.

The interesting part of this problem is scalability.  I can't say how many splits can be handled before SPSS runs out of memory.  Experimentation would be required, but it looks like 100 splits would be about what you have with 500,000 cases and 5,000 per split, and that should be no problem.  1000 is probably okay, too.

However, if you can't get everything at once, you could run blocks of splits and merge the resulting files.
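Running blocks of splits could be scripted along the following lines.  This is only a sketch in plain Python that builds the SPSS syntax text for each block; it does not call SPSS itself.  The variable name item, the output directory, the regstats_N.sav file names, and the helper names chunk and block_syntax are all hypothetical, and it assumes the item numbers are numeric and the file is already sorted by item.

```python
def chunk(items, block_size):
    """Split a sorted list of item numbers into consecutive blocks."""
    return [items[i:i + block_size] for i in range(0, len(items), block_size)]


def block_syntax(block, block_no, item_var="item", out_dir="c:/temp"):
    """Build SPSS syntax that runs the split regressions for one block.

    Assumes the block covers a contiguous range of the sorted item numbers,
    so a RANGE() test selects exactly the items in the block.
    """
    lo, hi = min(block), max(block)
    lines = [
        # TEMPORARY limits the selection and the split to the next procedure.
        "temporary.",
        f"select if (range({item_var}, {lo}, {hi})).",
        f"split file by {item_var}.",
        # Hypothetical output file name, one per block.
        f"regression /dependent y /enter x "
        f"/outfile covb('{out_dir}/regstats_{block_no}.sav').",
    ]
    return "\n".join(lines)


# Hypothetical item numbers, three items per block.
items = sorted({101, 102, 105, 110, 120, 131, 140})
blocks = chunk(items, 3)
syntax = [block_syntax(b, n) for n, b in enumerate(blocks, start=1)]
for text in syntax:
    print(text)
    print()
```

Each generated string could then be submitted as one job, and the per-block files combined afterwards, for example with ADD FILES.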

Just using regression in this way will produce all the output in the Viewer, which may bog things down, although, again, 100 regressions isn't much.  Even 1000 isn't that much output as long as you choose just the regression statistics you need.

But another way to approach this would be to use the external driver (xd) mode available with programmability since SPSS 14.  This dispenses with the entire user interface, which has two performance benefits: first, with no Data Editor or Viewer window there is less memory overhead, and, second, jobs run faster.  We have had some users report dramatically faster times using this mode.  In xd mode, you would use a little bit of Python code to drive SPSS.  It could be as simple as this code running in a Python shell or IDE.
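import spss
# Turn off the echo of procedure output to the Python shell.
spss.SetOutput("off")
# varname is the grouping (item number) variable; the data must be sorted by it.
spss.Submit(r"""get file ....
split file by varname.
regression /dependent y /enter x /outfile covb('c:/temp/regstats.sav').""")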


and you are done.  The SetOutput call prevents all of the regression text output from flowing back to your shell window, although capturing that text is another way to get the regression output.
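To spot-check the saved statistics for a handful of items outside SPSS, the per-item fit can be sketched in plain Python.  This is not SPSS code and not the REGRESSION procedure itself, just an ordinary least-squares fit of LN(sales) on the time predictor computed separately for each item.  The record layout (item, x, sales) and the function name fit_by_item are assumptions made for the illustration, with x being whatever time predictor is used in the real file.

```python
import math
from collections import defaultdict


def fit_by_item(records):
    """Fit ln(sales) = intercept + slope * x separately for each item.

    records: iterable of (item, x, sales) tuples with sales > 0.
    Returns {item: (intercept, slope, n)}; items with fewer than two
    records, or with no variation in x, are skipped.
    """
    groups = defaultdict(list)
    for item, x, sales in records:
        groups[item].append((x, math.log(sales)))

    fits = {}
    for item, points in groups.items():
        n = len(points)
        if n < 2:
            continue  # a line cannot be fitted to a single point
        mean_x = sum(x for x, _ in points) / n
        mean_y = sum(y for _, y in points) / n
        sxx = sum((x - mean_x) ** 2 for x, _ in points)
        if sxx == 0:
            continue  # slope is undefined when x never changes
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
        slope = sxy / sxx
        fits[item] = (mean_y - slope * mean_x, slope, n)
    return fits


# Made-up records: item "A" follows ln(sales) = 2 + 3 * x exactly.
records = [
    ("A", 1.0, math.exp(5.0)),
    ("A", 0.5, math.exp(3.5)),
    ("A", 0.25, math.exp(2.75)),
    ("B", 1.0, 10.0),  # only one record, so no fit is produced
]
fits = fit_by_item(records)
print(fits)
```

Comparing a few of these intercepts and slopes with the coefficients SPSS writes out is a quick way to confirm that the splits were applied as intended.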

HTH,
Jon Peck

-----Original Message-----
From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Steve Wagner
Sent: Friday, July 20, 2007 4:04 PM
To: [hidden email]
Subject: [SPSSX-L] Performing regression on groups of items
