REDUCING PROCESSING TIME

Bruce Colton
I cut processing off when it hadn't completed after 2 hours.  The syntax I used follows.

The cases are hourly, with 24 x 50 = 1200 cases for 50 days.  The data is arranged from the most recent day, day 1 (cases 1-24), to the oldest, day 50 (cases 1177-1200).  My objective: for the first 30 days of the data, compute an average based on the prior 20 days.  For example, for day 1, the average for any given hour of the day is

average = fc<day1>*.05*(node<day2>/fc<day2> + ... + node<day21>/fc<day21>)

and for day 2,

average = fc<day2>*.05*(node<day3>/fc<day3> + ... + node<day22>/fc<day22>)

(the .05 is 1/20, one over the window length).  Variables are day, hr, fc, and node1 to node1500.

I attempt to do it all in one pass, storing 20 days' worth of values in the vector day_2, thus creating far too many variables, which is probably the source of the problem.  day_2 has a FIFO structure: as each new case is read I shift the slots and write over the day dropping out of the window.  Question: might there be a better way to do this, possibly with multiple passes, that is more efficient?  Richard mentioned in an e-mail that SPSS is efficient at processing a 'long' data structure.  Would it make sense, then, to split the data vertically, yielding two files with the same number of cases and each containing approximately half the variables, process them separately, and then rejoin the results case by case?

Any help/suggestions/recommendations are greatly appreciated - in the meantime, I'll pursue the long data structure idea.
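
A minimal sketch of what that vertical split might look like (the half1/half2 file names are hypothetical; day, hr, and fc are carried in both halves):

get file = 'c:\users\bruce\documents\spsstestdata_1500.sav'.
save outfile = 'c:\users\bruce\documents\half1.sav' /keep = day hr fc var00001 to var00750.
get file = 'c:\users\bruce\documents\spsstestdata_1500.sav'.
save outfile = 'c:\users\bruce\documents\half2.sav' /keep = day hr fc var00751 to var01500.
* ... run the averaging on each half, writing half1_out.sav and half2_out.sav ...
match files file = 'c:\users\bruce\documents\half1_out.sav'
 /file = 'c:\users\bruce\documents\half2_out.sav'.
* match files with no BY subcommand pairs the two files case by case,
* restoring the full variable width.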

SYNTAX

get file = 'c:\users\bruce\documents\spsstestdata_1500.sav'.
*file handle data_xout name = 'c:\users\bruce\documents\data_xout1.sav'.
*  ----------.
*  keep day 2 for weighted ave calc.
compute daymnth=xdate.mday(day).
do if lag(daymnth) ne daymnth.
compute day_1=day_1+1.
end if.
*leave day_1.

*  -----------------.
* day_1 numbers the day of each case, from most recent (day_1 = 0) to oldest; in a 50-day dataset the last day has day_1 = 49.
*  -----------------.
*   ---------.
vector loadpaste(720)
      /node = var00001 to var01500
      /day_2(720000)
      /weightedave(36000)
      /nodeforecast(1500).
leave loadpaste1 to loadpaste720 weightedave1 to weightedave36000 nodeforecast1 to nodeforecast1500
      day_21 to day_2720000 day_1.
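* day_2 holds a 20-slot FIFO window: element = slot*36000 + node# + 1500*(hr-1),
* where 36000 = 1500 nodes x 24 hours, and 20 slots x 36000 = 720,000 elements.
* leave keeps these values across cases (and initializes them to 0 on case 1).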
*  ---------------.
*  process day1.
*  --------------.
do if $casenum < 25.
compute loadpaste(hr) = fc.
loop #i=1 to 1500.
compute day_2(#i + 1500*(hr-1))=0.
end loop.
end if.
*  ----------------.
* process days 2 thru 20.
* -----------------.
do if $casenum >24 and $casenum <481.
compute loadpaste(day_1*24 + hr)=fc.
loop #i=1 to 1500.
compute day_2(day_1*36000 + #i + 1500*(hr-1)) = node(#i).
compute weightedave(#i + 1500*(hr-1))=weightedave(#i + 1500*(hr-1)) + node(#i)/fc.
end loop.
end if.
*  ----------------.
* process days 21 thru 30 -  loadpaste.
* -----------------.
do if $casenum >480 and $casenum <721.
compute loadpaste(day_1*24 + hr)=fc.
end if.
*  -----------.
*  write forecasts for days 1 thru 30 while reading cases for days 21 thru 50.
*   ----------.
do if $casenum >480 and $casenum <1201.
loop #i=1 to 1500.
*compute day_2(day_1*36000 + #i + 1500*(hr-1)) = node(#i).
compute weightedave(#i + 1500*(hr-1)) = weightedave(#i + 1500*(hr-1)) + node(#i)/fc -
                                        day_2(0*36000 + #i + 1500*(hr-1))/loadpaste((day_1-20)*24 + hr).
compute nodeforecast(#i) = loadpaste((day_1-20)*24 + hr)*.05*weightedave(#i + 1500*(hr-1)).
end loop.
xsave outfile='c:\users\bruce\documents\data_xout1.sav'
  /keep hr nodeforecast1 to nodeforecast1500.
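* xsave writes its cases during this same data pass; the output file is not
* completed until the transformations run (the execute at the end).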
*  -----------------.
*  shift the 20-day window: slot 0 drops out, slots move down one, and the new day's values go into slot 19.
*   --------------.
loop #i = 1 to 1500.
loop #d = 0 to 18.
compute day_2(#d*36000 + #i + 1500*(hr-1)) = day_2((#d+1)*36000 + #i + 1500*(hr-1)).
end loop.
compute day_2(19*36000 + #i + 1500*(hr-1)) = node(#i).
end loop.
end if.
execute.


Data looks like:

day            hr   fc   node1  node2  node3  node4  node5
17-Nov-2008     1   23      45     12     17     25     77
17-Nov-2008     2   18      41     99     77     88     77
...
18-Nov-2008     1   66      33     45     22     44     11

etc.


Re: REDUCING PROCESSING TIME

Maguin, Eugene
Bruce,

I found it really, really hard to follow your explanation. I hope that others
have more success. So, let's start over. I picture your dataset as looking
pretty much like this:

Date      hour  node1 ... node1500
6/10/08      0     12 ...       18
etc.
6/10/08     23     23 ...        0
6/11/08      0     87 ...       65
etc.
7/30/08     23      4 ...       87

What you want is:

>> My objective: for the first 30 days of the data, compute average based on
>> prior 20 days.

Based on this statement, you can 1) throw away data for days 31 thru
50--because you specified only the first 30 days; 2) aggregate the data
across hours to get a 'day' dataset--because hours don't matter, only days.
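
For step 2, a minimal sketch of that aggregation (assuming the date variable
is named date; the dnode target names are hypothetical):

aggregate outfile = *
 /break = date
 /dnode1 to dnode1500 = sum(node1 to node1500).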

Therefore, your dataset is reduced to 30 rows and 1 (date) + 1500 columns.
If you think about your output dataset, it will have 11 rows and 1 (date) +
1500 columns. Why 11 rows? Well, you are doing a 20-day moving average. That
can't be done for dates before, in my example data, 6/29/08, because 6/29 is
the first day with 19 prior days, 6/29 itself being the 20th. The 30th and
last day in your dataset is 7/9. So, 11 rows.

I think I'd structure the problem as a DO REPEAT issue. However, there are
probably other ways also. The place where the DO REPEAT might fail is in the
use of the LAG function. I think it will work, but I haven't tried that
combination for a long while. It also might be that you'll need multiple DO
REPEAT passes if 1500 variables exceeds the DO REPEAT command capacity.
Anyway, something like this:

Do if (date ge date.mdy(6,29,2008)).
Do repeat x=node1 to node1500/
   y=cum1 to cum1500.
+  compute y=lag(x,1)+lag(x,2)+lag(x,3)+lag(x,4)+lag(x,5)+lag(x,6)+
      lag(x,7)+lag(x,8)+lag(x,9)+lag(x,10)+lag(x,11)+lag(x,12)+lag(x,13)+
      lag(x,14)+lag(x,15)+lag(x,16)+lag(x,17)+lag(x,18)+lag(x,19)+lag(x,20).
End repeat.
End if.

Gene Maguin


Re: REDUCING PROCESSING TIME

Hal 9000
Here's the solution:

do if Richard_Ristow_Suggested.
compute Solution_1.
else if Other_Good_Idea.
compute Solution_234.
end if.

Sorry about that, Gene; your solutions are fantastic as well.

A pre/post dataset mock-up with variable descriptions and the gist of
what you're working on would be really interesting and helpful.

Best,
-Gary

On Wed, Jul 16, 2008 at 10:33 AM, Gene Maguin <[hidden email]> wrote:

> [snip]


Re: REDUCING PROCESSING TIME

Richard Ristow
In reply to this post by Bruce Colton
At 09:47 AM 7/16/2008, Bruce Colton wrote:

>[My] cases are hourly, with 24 x 50 = 1200 cases for 50 days.  The
>data is arranged from most recent day, or day 1(cases 1-24) to
>oldest day(day50 - cases 1177-1200).  My objective: for the first 30
>days of the data, compute average based on prior 20 days.  For
>example, for day1, average for any given hour of the day
>
>average = fc<day1>*.05*(node<day2>/fc<day2> + ... + node<day21>/fc<day21>).
>
>For day2,
>
>average = fc<day2>*.05*(node<day3>/fc<day3> + ... + node<day22>/fc<day22>).
>
>Variables are day, hr, fc, node1 - node1500.
>
>Richard mentioned in an e-mail that SPSS is efficient processing a
>'long' data structure.  Would it make sense then to split the data vertically,

A 'long' structure following the logical structure of the data would
probably make sense. But, like Gene, I don't understand what you want
to do. First, your data. You write,

>Data looks like:
>
>                day   hr   fc   node1 node2 node3 node4 [...]
>17-Nov-2008     1    23   45    12    17     25   77
>17-Nov-2008     2    18   41    99    77     88   77
>...
>18-Nov-2008     1    66   33    45    22     44   11

What does it mean for 17-Nov-2008 to be both day 1 and day 2? And in
the record for 18 Nov, what does hr=66 (i.e., not 1-24) mean?

Going on, it looks like 'fc' and variables 'node1'-'node1500' are
measured values, all taken simultaneously at one hour on one day. You
say you want to calculate, for day2

>average = fc<day2>*.05*(node<day3>/fc<day3> + ... + node<day22>/fc<day22>).

Is this meant to calculate one average (over 20 days, for the same
hour), or 1500 averages, that is, one for each of the 1500 'nodeX'
values (over 20 days, for the same hour)?

Finally, you write,

>I attempt to do it all in one pass, storing 20 days worth of values
>in [vector] day_2, thus creating too many variables, and probably
>the source of the problem.

That'll give you 20*1500=30,000 variables. That's well within SPSS's
inherent capacity, but very likely enough to slow processing badly.
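
For what it's worth, a minimal sketch of the 'long' restructure (the nodeval
and nodenum names are hypothetical):

varstocases
 /make nodeval from node1 to node1500
 /index = nodenum
 /keep = day hr fc.
sort cases by nodenum hr day.

That yields one case per (day, hour, node), 1,800,000 cases in all, after
which each 20-day window lives in consecutive cases per node and hour (e.g.,
under SPLIT FILE by nodenum and hr) rather than in 720,000 day_2 variables.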

-Onward, ever onward,
  Richard
