SHIFT VALUES vs CREATE vs LAG+CREATE

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

SHIFT VALUES vs CREATE vs LAG+CREATE

David Marso
Administrator
I have been working on a client project with a relatively large file (12 million records) with multiple records per individual.  Crucial to the project is building lags and leads for various variables in the file and building logic around comparisons to current  record values.  Noticing that CREATE with LAG was taking FOREVER I decided to explore SHIFT VALUES as an alternative.
Using 5,000,000 simulated data points I determined the following.
CREATE is quite a bit faster that SHIFT VALUES
LAG followed by CREATE is much slower than both.

Source      Mean (sec)  SD  N
Create 10.90 .69 5
LagCreate 23.87 .57 5
Shift        13.88 .44 5


CODE:

DEFINE !MakeData (!POS !TOKENS(1))
MATRIX.
SAVE MAKE(!1) / OUTFILE * / VARIABLES x.
END MATRIX.
DATASET NAME raw.
MATRIX .
SAVE {1} /OUTFILE * / VARIABLES y.
END MATRIX.
DATASET NAME dummy.
!ENDDEFINE .

DEFINE !logTime (!POS !TOKENS(1))
DATASET ACTIVATE dummy.
COMPUTE !1=$TIME.
FORMATS !1(DATETIME20).
DESCRIPTIVES VARIABLES !1 /STATISTICS MEAN.
DELETE VARIABLES !1.
!ENDDEFINE.


DEFINE !Run (!POS !TOKENS(1)/!POS !TOKENS(1))
NEW FILE.
DATASET CLOSE raw.
DATASET CLOSE dummy.
SET WORKSPACE=10000000.
!DO !I=1 !TO !2
!logTime Base
!Makedata !1.
!logTime Generate.
DATASET ACTIVATE raw.
SHIFT VALUES VARIABLE=X RESULT=@X LEAD=1 /VARIABLE =X RESULT=LX LAG=1.
!logTime Shift.
DATASET ACTIVATE raw.
CREATE @X2=LEAD(x,1) /lxe=LAG(x,1).
!logtime Create.
DATASET ACTIVATE raw.
COMPUTE lx3=LAG(x,1).
CREATE @X3=LEAD(x,1).
!logtime LagCreate.
!DOEND
!ENDDEFINE .


* OMS.
DEFINE !Benchmark (Size !TOKENS(1) / NReps !TOKENS(1) )
NEW FILE.
DATASET CLOSE ALL.
DATASET DECLARE benchmark .
OMS
  /SELECT TABLES
  /IF COMMANDS=['Descriptives'] SUBTYPES=['Descriptive Statistics']
  /DESTINATION FORMAT=SAV NUMBERED=TableNumber_
   OUTFILE=benchmark VIEWER=NO.
OMS
  /SELECT TABLES
  /IF COMMANDS=['Create']
  /DESTINATION VIEWER=NO.
!Run !Size !Nreps.
OMSEND .
OMSEND.
DATASET ACTIVATE benchmark.
DELETE VARIABLES Command_ Subtype_ Label_ .
SELECT IF VAR1 NE 'Valid N (listwise)'.
COMPUTE delta=Mean-LAG(Mean).
DATASET DECLARE aggbench.
AGGREGATE OUTFILE aggbench
  / BREAK Var1
  / MeanSeconds=MEAN(delta)
  / SDSeconds=SD(Delta)
  / Count=N(Delta).
SELECT IF ANY(VAR1,'Create','LagCreate','Shift').
EXECUTE.
!ENDDEFINE.
!Benchmark Size=5000000 Nreps=5 .


Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
Reply | Threaded
Open this post in threaded view
|

Re: SHIFT VALUES vs CREATE vs LAG+CREATE

Jon Peck
CREATE sometimes requires data in memory, depending on the calculation (certainly for FFT and IFFT and probably for T4253H.  I am not sure about the other functions, but I remember that the memory requirement was one of the issues that lead to the creation of SHIFT VALUES back when memory was limited).  It was designed for time series problems.  In fact, it uses the time series mini-central system.  It does not use the transformation subsystem.

The regular LAG function, being part of the transformation subsystem, is interpreted, which carries the overhead of interpreting general formulas.  It's possible that compiled transformations, which are only available with Statistics Server, would make a big difference.

SHIFT VALUES was created mainly to deal with leads and to handle split file situations more conveniently than LAG. However, it is not technically a transformation, i. e., it does not use the transformation subsystem, so it requires its own data pass, and it does not keep data in memory (other than the minimum to do the shifting).  However, SHIFT VALUES can process any number of variables on a single data pass without extra interpretation overhead.  SHIFT VALUES propagates the meta data to the output variables (except for custom attributes :-)).

On Fri, Mar 18, 2016 at 12:32 PM, David Marso <[hidden email]> wrote:
I have been working on a client project with a relatively large file (12
million records) with multiple records per individual.  Crucial to the
project is building lags and leads for various variables in the file and
building logic around comparisons to current  record values.  Noticing that
CREATE with LAG was taking FOREVER I decided to explore SHIFT VALUES as an
alternative.
Using 5,000,000 simulated data points I determined the following.
CREATE is quite a bit faster that SHIFT VALUES
LAG followed by CREATE is much slower than both.

Source      Mean (sec)  SD  N
Create  10.90   .69     5
LagCreate       23.87   .57     5
Shift           13.88   .44     5


CODE:

DEFINE !MakeData (!POS !TOKENS(1))
MATRIX.
SAVE MAKE(!1) / OUTFILE * / VARIABLES x.
END MATRIX.
DATASET NAME raw.
MATRIX .
SAVE {1} /OUTFILE * / VARIABLES y.
END MATRIX.
DATASET NAME dummy.
!ENDDEFINE .

DEFINE !logTime (!POS !TOKENS(1))
DATASET ACTIVATE dummy.
COMPUTE !1=$TIME.
FORMATS !1(DATETIME20).
DESCRIPTIVES VARIABLES !1 /STATISTICS MEAN.
DELETE VARIABLES !1.
!ENDDEFINE.


DEFINE !Run (!POS !TOKENS(1)/!POS !TOKENS(1))
NEW FILE.
DATASET CLOSE raw.
DATASET CLOSE dummy.
SET WORKSPACE=10000000.
!DO !I=1 !TO !2
!logTime Base
!Makedata !1.
!logTime Generate.
DATASET ACTIVATE raw.
SHIFT VALUES VARIABLE=X RESULT=@X LEAD=1 /VARIABLE =X RESULT=LX LAG=1.
!logTime Shift.
DATASET ACTIVATE raw.
CREATE @X2=LEAD(x,1) /lxe=LAG(x,1).
!logtime Create.
DATASET ACTIVATE raw.
COMPUTE lx3=LAG(x,1).
CREATE @X3=LEAD(x,1).
!logtime LagCreate.
!DOEND
!ENDDEFINE .


* OMS.
DEFINE !Benchmark (Size !TOKENS(1) / NReps !TOKENS(1) )
NEW FILE.
DATASET CLOSE ALL.
DATASET DECLARE benchmark .
OMS
  /SELECT TABLES
  /IF COMMANDS=['Descriptives'] SUBTYPES=['Descriptive Statistics']
  /DESTINATION FORMAT=SAV NUMBERED=TableNumber_
   OUTFILE=benchmark VIEWER=NO.
OMS
  /SELECT TABLES
  /IF COMMANDS=['Create']
  /DESTINATION VIEWER=NO.
!Run !Size !Nreps.
OMSEND .
OMSEND.
DATASET ACTIVATE benchmark.
DELETE VARIABLES Command_ Subtype_ Label_ .
SELECT IF VAR1 NE 'Valid N (listwise)'.
COMPUTE delta=Mean-LAG(Mean).
DATASET DECLARE aggbench.
AGGREGATE OUTFILE aggbench
  / BREAK Var1
  / MeanSeconds=MEAN(delta)
  / SDSeconds=SD(Delta)
  / Count=N(Delta).
SELECT IF ANY(VAR1,'Create','LagCreate','Shift').
EXECUTE.
!ENDDEFINE.
!Benchmark Size=5000000 Nreps=5 .






-----
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"
--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/SHIFT-VALUES-vs-CREATE-vs-LAG-CREATE-tp5731773.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD



--
Jon K Peck
[hidden email]

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD