Administrator
|
I have been working on a client project with a relatively large file (12 million records) with multiple records per individual. Crucial to the project is building lags and leads for various variables in the file and building logic around comparisons to current record values. Noticing that CREATE with LAG was taking FOREVER I decided to explore SHIFT VALUES as an alternative.
Using 5,000,000 simulated data points I determined the following. CREATE is quite a bit faster that SHIFT VALUES LAG followed by CREATE is much slower than both. Source Mean (sec) SD N Create 10.90 .69 5 LagCreate 23.87 .57 5 Shift 13.88 .44 5 CODE: DEFINE !MakeData (!POS !TOKENS(1)) MATRIX. SAVE MAKE(!1) / OUTFILE * / VARIABLES x. END MATRIX. DATASET NAME raw. MATRIX . SAVE {1} /OUTFILE * / VARIABLES y. END MATRIX. DATASET NAME dummy. !ENDDEFINE . DEFINE !logTime (!POS !TOKENS(1)) DATASET ACTIVATE dummy. COMPUTE !1=$TIME. FORMATS !1(DATETIME20). DESCRIPTIVES VARIABLES !1 /STATISTICS MEAN. DELETE VARIABLES !1. !ENDDEFINE. DEFINE !Run (!POS !TOKENS(1)/!POS !TOKENS(1)) NEW FILE. DATASET CLOSE raw. DATASET CLOSE dummy. SET WORKSPACE=10000000. !DO !I=1 !TO !2 !logTime Base !Makedata !1. !logTime Generate. DATASET ACTIVATE raw. SHIFT VALUES VARIABLE=X RESULT=@X LEAD=1 /VARIABLE =X RESULT=LX LAG=1. !logTime Shift. DATASET ACTIVATE raw. CREATE @X2=LEAD(x,1) /lxe=LAG(x,1). !logtime Create. DATASET ACTIVATE raw. COMPUTE lx3=LAG(x,1). CREATE @X3=LEAD(x,1). !logtime LagCreate. !DOEND !ENDDEFINE . * OMS. DEFINE !Benchmark (Size !TOKENS(1) / NReps !TOKENS(1) ) NEW FILE. DATASET CLOSE ALL. DATASET DECLARE benchmark . OMS /SELECT TABLES /IF COMMANDS=['Descriptives'] SUBTYPES=['Descriptive Statistics'] /DESTINATION FORMAT=SAV NUMBERED=TableNumber_ OUTFILE=benchmark VIEWER=NO. OMS /SELECT TABLES /IF COMMANDS=['Create'] /DESTINATION VIEWER=NO. !Run !Size !Nreps. OMSEND . OMSEND. DATASET ACTIVATE benchmark. DELETE VARIABLES Command_ Subtype_ Label_ . SELECT IF VAR1 NE 'Valid N (listwise)'. COMPUTE delta=Mean-LAG(Mean). DATASET DECLARE aggbench. AGGREGATE OUTFILE aggbench / BREAK Var1 / MeanSeconds=MEAN(delta) / SDSeconds=SD(Delta) / Count=N(Delta). SELECT IF ANY(VAR1,'Create','LagCreate','Shift'). EXECUTE. !ENDDEFINE. !Benchmark Size=5000000 Nreps=5 .
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
CREATE sometimes requires data in memory, depending on the calculation (certainly for FFT and IFFT and probably for T4253H. I am not sure about the other functions, but I remember that the memory requirement was one of the issues that lead to the creation of SHIFT VALUES back when memory was limited). It was designed for time series problems. In fact, it uses the time series mini-central system. It does not use the transformation subsystem. The regular LAG function, being part of the transformation subsystem, is interpreted, which carries the overhead of interpreting general formulas. It's possible that compiled transformations, which are only available with Statistics Server, would make a big difference. SHIFT VALUES was created mainly to deal with leads and to handle split file situations more conveniently than LAG. However, it is not technically a transformation, i. e., it does not use the transformation subsystem, so it requires its own data pass, and it does not keep data in memory (other than the minimum to do the shifting). However, SHIFT VALUES can process any number of variables on a single data pass without extra interpretation overhead. SHIFT VALUES propagates the meta data to the output variables (except for custom attributes :-)). On Fri, Mar 18, 2016 at 12:32 PM, David Marso <[hidden email]> wrote: I have been working on a client project with a relatively large file (12 |
Free forum by Nabble | Edit this page |