Has anyone had any success using Multiple Imputation when there are a large number of levels of a SPLIT FILE?
Try this code and see if it goes into the crapper. It fails on both version 22 and 24. Note that the combined variable Condition_Iter is a 'clever' attempt at a workaround after the two splits failed when used separately ;-((( Note also that this is a tiny fraction of the actual problem size ;-(

* Encoding: UTF-8.
INPUT PROGRAM.
LOOP Condition=1 TO 100.
LOOP Iter=1 TO 50.
LOOP GP=1 TO 2.
LOOP N=1 TO 20.
LEAVE ALL.
END CASE.
END LOOP.
END LOOP.
END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
EXECUTE.
DATASET NAME Raw.
COMPUTE DV=RV.NORMAL(0,1).
IF RV.UNIFORM(0,1) < .2 DV=$SYSMIS.
DATASET DECLARE imputed.
COMPUTE Condition_Iter=Condition * 10000 + Iter.
SPLIT FILE BY Condition_Iter.
/* This should work (crashes my system for NO good reason) */.
MULTIPLE IMPUTATION DV GP
  /IMPUTE METHOD=AUTO NIMPUTATIONS=5 MAXPCTMISSING=NONE
  /CONSTRAINTS GP (ROLE=IND)
  /CONSTRAINTS DV (ROLE=DEP)
  /MISSINGSUMMARIES NONE
  /IMPUTATIONSUMMARIES MODELS
  /OUTFILE IMPUTATIONS=imputed.
So in this example you have 5,000 splits, each of which should get a separate analysis and set of imputed datasets. That adds up to 25,000 imputed datasets. On my system the backend silently vanishes almost immediately when the MI command starts, even if the imputation method is set to NONE, which suggests an upfront memory-management error. I doubt the procedure authors ever expected such a large problem, though I see the crash with many fewer splits as well. I also noticed that some of the generated variables have an unknown measurement level before the procedure starts, which requires extra work to resolve; setting the level ahead of time (to scale) does not prevent the crash, but since the procedure is sensitive to measurement levels, they should be set first anyway. I don't see a solution for this. I have reported it to Development (with a much smaller number of splits), but I can't say anything about when it will be addressed.

On Tue, Jun 7, 2016 at 3:21 PM, David Marso <[hidden email]> wrote:
> Has anyone had any success using Multiple Imputation when there are a large
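As an aside on the measurement-level point: a minimal sketch of declaring the levels explicitly before the procedure runs, using the variable names from the test code above (the particular level assignments here are illustrative, not a fix for the crash):

* Declare measurement levels up front so MULTIPLE IMPUTATION does not
* have to scan the data to resolve unknown levels (assignments are
* illustrative).
VARIABLE LEVEL DV (SCALE) /GP (NOMINAL) /Condition_Iter (NOMINAL).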
Thank you, Jon, for your ultra-prompt reply.
I have found a workaround, for those interested. Basically: CASESTOVARS the beast, then repetitively GET the master file, SELECT one stratum, run MULTIPLE IMPUTATION, compute the stats, and combine. A sketch of the loop follows.
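A hedged macro sketch of the GET/SELECT/IMPUTE loop, using the test data from earlier in the thread. The file names, paths, and the assumption that the master file was saved as master.sav are all illustrative, not David's actual setup:

* Sketch only: loop over Condition x Iter, re-reading the saved master
* file each time, keeping one stratum, imputing, and saving the result.
DEFINE !mibysplit ()
!DO !c = 1 !TO 100
!DO !i = 1 !TO 50
GET FILE="C:/temp/master.sav".
SELECT IF (Condition = !c AND Iter = !i).
DATASET DECLARE imputed.
MULTIPLE IMPUTATION DV GP
  /IMPUTE METHOD=AUTO NIMPUTATIONS=5 MAXPCTMISSING=NONE
  /CONSTRAINTS GP (ROLE=IND) /CONSTRAINTS DV (ROLE=DEP)
  /MISSINGSUMMARIES NONE /IMPUTATIONSUMMARIES MODELS
  /OUTFILE IMPUTATIONS=imputed.
DATASET ACTIVATE imputed.
SAVE OUTFILE=!QUOTE(!CONCAT('C:/temp/imp_', !c, '_', !i, '.sav')).
DATASET CLOSE imputed.
!DOEND
!DOEND
!ENDDEFINE.
!mibysplit.

The saved per-stratum files can then be read back for the stats-and-combine step; how the results are combined is left out here since it depends on the analysis.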
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
YW. You could also use the SPSSINC SPLIT DATASET and SPSSINC PROCESS FILES extension commands to iterate over all the splits in a way that is more general than regular split-file processing. The pattern these two extensions address comes up a lot.

On Tue, Jun 7, 2016 at 5:19 PM, David Marso <[hidden email]> wrote:
> Thank you, Jon, for your ultra-prompt reply.
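For anyone who wants to try this pattern, a rough sketch follows. The keyword spellings and the JOB_INPUTFILE file handle are written from memory of the extension dialogs and should be treated as assumptions; check the installed extensions' help for the exact syntax:

* Step 1 (sketch): write one .sav file per value of the split variable,
* plus a text file listing the files created (paths are illustrative).
SPSSINC SPLIT DATASET SPLITVAR=Condition_Iter
  /OUTPUT DIRECTORY="C:/temp/splits"
  /OPTIONS FILELIST="C:/temp/splits/files.txt".

* Step 2 (sketch): run a syntax file once per split file. Inside
* mi_one_split.sps the current split is assumed to be reachable through
* the JOB_INPUTFILE file handle, e.g. GET FILE=JOB_INPUTFILE.
SPSSINC PROCESS FILES FILELIST="C:/temp/splits/files.txt"
  SYNTAX="C:/temp/mi_one_split.sps"
  CONTINUEONERROR=YES.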
I need to look into these.
How much efficiency can be gained over LOOP...GET/SELECT/DOIT? I imagine quite a bit. We have something like 320 split levels across 180,000,000 records.
SPLIT DATASET and PROCESS FILES have file overhead, since SPLIT DATASET writes each group to disk and PROCESS FILES reads it back, but with SELECT IF or the equivalent, every group requires a pass over the entire dataset. With 320 groups, SPLIT DATASET needs 5 data passes to create the group files because of the 64-file limit of XSAVE, but each PROCESS FILES iteration then handles only 1/320th of the data, assuming equal split sizes. SELECT IF would read the dataset 320 times and would evaluate the condition expression 320 x 180,000,000 times. PROCESS FILES has some setup time for each group, but almost all of the execution time is spent in the Statistics backend, so there should be only a small percentage of overhead between the Python and spssengine processes.

I have never benchmarked these two approaches against each other: my main goal in creating these extensions was to generalize SPLIT FILE so that you could split over a whole batch of commands, not just one procedure at a time, but in the many-group case there ought to be a lot of time saved as well.

On Fri, Jun 10, 2016 at 8:22 AM, David Marso <[hidden email]> wrote:
> I need to look into these.
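Rough arithmetic on the record-level work for this case, assuming equal split sizes (illustrative only; it ignores per-procedure setup and I/O costs):

  SELECT IF approach:  320 passes x 180,000,000 records
                       = 57,600,000,000 records read (and as many
                         condition evaluations)
  SPLIT DATASET:       ceil(320 / 64) = 5 passes x 180,000,000
                       = 900,000,000 records read to write the groups
  PROCESS FILES:       320 jobs x (180,000,000 / 320 = 562,500 records)
                       = 180,000,000 records read in total

So the split-then-process route reads roughly 1,080,000,000 records versus 57,600,000,000, about a 53x reduction in raw record traffic.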