Has anyone had any success using Multiple Imputation when there are a large number of levels of a SPLIT FILE?
Try this code and see if it goes into the crapper. It fails on both version 22 and 24. Note that the combined variable Condition_Iter is a 'clever' attempt at a workaround after the two splits failed when used separately ;-((( Note also that this is a tiny fraction of the actual problem size ;-(

* Encoding: UTF-8.
INPUT PROGRAM.
LOOP Condition=1 TO 100.
LOOP Iter=1 TO 50.
LOOP GP=1 TO 2.
LOOP N=1 TO 20.
LEAVE ALL.
END CASE.
END LOOP.
END LOOP.
END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
EXECUTE.
DATASET NAME Raw.
COMPUTE DV=RV.NORMAL(0,1).
IF RV.UNIFORM(0,1) < .2 DV=$SYSMIS.
DATASET DECLARE imputed.
COMPUTE Condition_Iter=Condition * 10000 + Iter.
SPLIT FILE BY Condition_Iter.
/* This should work (crashes my system for NO good reason) */.
MULTIPLE IMPUTATION DV GP
  /IMPUTE METHOD=AUTO NIMPUTATIONS=5 MAXPCTMISSING=NONE
  /CONSTRAINTS GP (ROLE=IND)
  /CONSTRAINTS DV (ROLE=DEP)
  /MISSINGSUMMARIES NONE
  /IMPUTATIONSUMMARIES MODELS
  /OUTFILE IMPUTATIONS=imputed.
So in this example you have 5,000 splits, each of which should get a separate analysis and set of imputed datasets. That adds up to 25,000 imputed datasets. On my system the backend silently vanishes almost immediately when the MI command starts, even if the imputation method is set to NONE, which suggests an upfront memory-management error. I doubt the procedure authors ever expected such a large problem, though I see the crash with many fewer splits as well. I also noticed that some of the generated variables have an unknown measurement level before the procedure starts, which requires extra work to resolve; setting the level ahead of time (to scale) does not prevent the crash, but since the procedure is sensitive to measurement levels, they should be set first anyway. I don't see a solution for this. I have reported it to Development (with a much smaller number of splits), but I can't say anything about when it will be addressed.

On Tue, Jun 7, 2016 at 3:21 PM, David Marso <[hidden email]> wrote:
> Has anyone had any success using Multiple Imputation when there are a large
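As an aside on the measurement-level point: a minimal sketch of declaring the levels explicitly before the procedure runs, using the variable names from the test code above (the particular level assignments here are illustrative, not a fix for the crash):

* Declare measurement levels up front so MULTIPLE IMPUTATION does not
* have to scan the data to resolve unknown levels (assignments are
* illustrative).
VARIABLE LEVEL DV (SCALE) /GP (NOMINAL) /Condition_Iter (NOMINAL).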
Thank you, Jon, for your ultra-prompt reply.
I have found a workaround, for those interested. Basically: CASESTOVARS the beast, then repetitively GET the master file, SELECT one stratum, run MULTIPLE IMPUTATION, compute the stats, and combine. A sketch of the loop follows.
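A hedged macro sketch of the GET/SELECT/IMPUTE loop, using the test data from earlier in the thread. The file names, paths, and the assumption that the master file was saved as master.sav are all illustrative, not David's actual setup:

* Sketch only: loop over Condition x Iter, re-reading the saved master
* file each time, keeping one stratum, imputing, and saving the result.
DEFINE !mibysplit ()
!DO !c = 1 !TO 100
!DO !i = 1 !TO 50
GET FILE="C:/temp/master.sav".
SELECT IF (Condition = !c AND Iter = !i).
DATASET DECLARE imputed.
MULTIPLE IMPUTATION DV GP
  /IMPUTE METHOD=AUTO NIMPUTATIONS=5 MAXPCTMISSING=NONE
  /CONSTRAINTS GP (ROLE=IND) /CONSTRAINTS DV (ROLE=DEP)
  /MISSINGSUMMARIES NONE /IMPUTATIONSUMMARIES MODELS
  /OUTFILE IMPUTATIONS=imputed.
DATASET ACTIVATE imputed.
SAVE OUTFILE=!QUOTE(!CONCAT('C:/temp/imp_', !c, '_', !i, '.sav')).
DATASET CLOSE imputed.
!DOEND
!DOEND
!ENDDEFINE.
!mibysplit.

The saved per-stratum files can then be read back for the stats-and-combine step; how the results are combined is left out here since it depends on the analysis.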
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
YW. You could also use the SPSSINC SPLIT DATASET and SPSSINC PROCESS FILES extension commands to iterate over all the splits in a way that is more general than regular split-file processing. The pattern these two extensions address comes up a lot.

On Tue, Jun 7, 2016 at 5:19 PM, David Marso <[hidden email]> wrote:
> Thank you, Jon, for your ultra-prompt reply.
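For anyone who wants to try this pattern, a rough sketch follows. The keyword spellings and the JOB_INPUTFILE file handle are written from memory of the extension dialogs and should be treated as assumptions; check the installed extensions' help for the exact syntax:

* Step 1 (sketch): write one .sav file per value of the split variable,
* plus a text file listing the files created (paths are illustrative).
SPSSINC SPLIT DATASET SPLITVAR=Condition_Iter
  /OUTPUT DIRECTORY="C:/temp/splits"
  /OPTIONS FILELIST="C:/temp/splits/files.txt".

* Step 2 (sketch): run a syntax file once per split file. Inside
* mi_one_split.sps the current split is assumed to be reachable through
* the JOB_INPUTFILE file handle, e.g. GET FILE=JOB_INPUTFILE.
SPSSINC PROCESS FILES FILELIST="C:/temp/splits/files.txt"
  SYNTAX="C:/temp/mi_one_split.sps"
  CONTINUEONERROR=YES.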
I need to look into these.
How much efficiency can be gained over LOOP...GET/SELECT/DOIT? I imagine quite a bit. We have something like 320 split levels across 180,000,000 records.
SPLIT DATASET and PROCESS FILES have file overhead, since SPLIT DATASET writes each group to disk and PROCESS FILES reads it back, but with SELECT IF or the equivalent, every group requires a pass over the entire dataset. With 320 groups, SPLIT DATASET needs 5 data passes to create the group files because of the 64-file limit of XSAVE, but each PROCESS FILES iteration then handles only 1/320th of the data, assuming equal split sizes. SELECT IF would read the dataset 320 times and would evaluate the condition expression 320 x 180,000,000 times. PROCESS FILES has some setup time for each group, but almost all of the execution time is spent in the Statistics backend, so there should be only a small percentage of overhead between the Python and spssengine processes.

I have never benchmarked these two approaches against each other: my main goal in creating these extensions was to generalize SPLIT FILE so that you could split over a whole batch of commands, not just one procedure at a time, but in the many-group case there ought to be a lot of time saved as well.

On Fri, Jun 10, 2016 at 8:22 AM, David Marso <[hidden email]> wrote:
> I need to look into these.
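Rough arithmetic on the record-level work for this case, assuming equal split sizes (illustrative only; it ignores per-procedure setup and I/O costs):

  SELECT IF approach:  320 passes x 180,000,000 records
                       = 57,600,000,000 records read (and as many
                         condition evaluations)
  SPLIT DATASET:       ceil(320 / 64) = 5 passes x 180,000,000
                       = 900,000,000 records read to write the groups
  PROCESS FILES:       320 jobs x (180,000,000 / 320 = 562,500 records)
                       = 180,000,000 records read in total

So the split-then-process route reads roughly 1,080,000,000 records versus 57,600,000,000, about a 53x reduction in raw record traffic.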