Re: SPSS memory problems with aggregate command


Re: SPSS memory problems with aggregate command

David Preen
Hi all,

 

I have been unable to get SPSS to complete an aggregate command on a
reasonably large data file. The data file comprises ~70 million
cases/records for ~350,000 individuals, with about 12 variables. When I
run the aggregate command I get the following error message:

 

There is memory for only 139313 cases in the aggregated file.

 

>Error # 10963
>There is not enough memory for all the cases in the aggregated file. The
>aggregated file is missing some cases.  Rerun with more memory.
>This command not executed.

 

I have sorted the file to be aggregated by the break variable. In
addition, I have included the SET WORKSPACE = 2097151 command in my
syntax to attempt to boost the memory allocation of SPSS (I am using
SPSS v14.0). I have tried running the command both on my computer (P4
3.8GHz processor, 300GB hard drive with ~160GB of free space, and 3GB
of RAM) and from our departmental server. I have also changed the swap
settings on my computer to allow all free hard-drive space to be used
for extra memory, and I do not run any other applications while SPSS
is processing. However, after many hours of processing the analysis
terminates and gives me the above error message. It may also be worth
noting that all other SPSS commands seem to work without any problem
on the large data file; it is only the aggregate command that is
proving problematic.

 

Does anyone know if there is anything that can be done to circumvent
this problem, or is SPSS not capable of running aggregate commands on
data files of this size? At a pinch, I can segment the file into a
number of much smaller data files, or alternatively transfer the file
to SAS and use the PROC SUMMARY command, which seems to work OK.
However, this has proved to be a bit of a hassle, and I was hoping
there was a more efficient way of overcoming the problem. If anyone
has information on this issue it would be greatly appreciated. Thanks.

 

Kind regards,
David

 

Re: SPSS memory problems with aggregate command

Richard Ristow
At 12:23 AM 3/1/2007, David Preen wrote:

>I have been unable to get SPSS to complete an aggregate command on a
>reasonably large data file. The data file comprises ~70 million
>cases/records for ~350,000 individuals, with about 12 variables. When
>I run the aggregate command I get the following error message:
>
>There is memory for only 139313 cases in the aggregated file.
>
>>Error # 10963
>>There is not enough memory for all the cases in the aggregated file.
>>The aggregated file is missing some cases.  Rerun with more memory.
>>This command not executed.
>
>I have sorted the file to be aggregated by the break variable.

That's it. However, you must also specify the /PRESORTED subcommand for
AGGREGATE.

I was surprised as the dickens when I learned this, but AGGREGATE, by
default, builds all the output cases in memory. I understand from
SPSS, Inc., that letting it do that is significantly faster than
sorting and then using /PRESORTED.

However, if the cases are in memory, they take up memory. If you have a
great many break groups (number of input cases doesn't matter), they
can fill available memory and give you what you're seeing.

You got "there is memory for only 139,313 cases". That surprises me. I
once ran with over a million (output) cases, because I'd meant to
specify /PRESORTED but had forgotten. On a good machine, but less
capable than yours (1GB main memory), it ran painfully slowly, but it
did run.

It partly depends on how many variables are in the OUTPUT records,
since those are what are built in memory. And a few AGGREGATE
functions - MEDIAN, anyway - keep information from all *input* cases
in memory, and may take much more. /PRESORTED will help with those as
well.
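
In syntax, the combination looks something like this (the break
variable, aggregated variables, and output file are invented for
illustration; substitute your own):

* person_id, value, and the output file are placeholders.
* Sort first, then declare the file presorted so AGGREGATE
* does not have to build all the output cases in memory.
SORT CASES BY person_id.
AGGREGATE OUTFILE='C:\temp\aggregated.sav'
  /PRESORTED
  /BREAK=person_id
  /n_records=N
  /mean_value=MEAN(value).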

-Good luck,
  Richard

Re: SPSS memory problems with aggregate command

Roberts, Michael
In reply to this post by David Preen
How many variables are you aggregating by (break variables)? I have
run into a similar situation with large datasets, which resolved when
I pared the breaks down to the bare minimum (my machine is a 3.2GHz
CPU, 160GB HDD, 2GB RAM) and let the program do the sorting. If you
absolutely have to have some variables, you can often merge them back
in after aggregation, depending on their characteristics; see the
sketch below.
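
A minimal sketch of that merge-back approach (all file and variable
names here are invented for illustration):

* Aggregate with the bare-minimum break variable only.
AGGREGATE OUTFILE='C:\temp\agg.sav'
  /BREAK=person_id
  /n_records=N.
* Merge the other person-level variables back in from a
* one-case-per-person lookup file (both sorted by person_id).
MATCH FILES FILE='C:\temp\agg.sav'
  /TABLE='C:\temp\persons.sav'
  /BY person_id.
SAVE OUTFILE='C:\temp\agg_full.sav'.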

HTH


Mike


Re: SPSS memory problems with aggregate command

Egon Kraan
In reply to this post by David Preen
I have a similar problem: sometimes I get a similar error message,
and other times I get an output file in which some of the aggregated
data, after a few million records have been processed, are simply not
written out, so I get partial output.

Is there a way around this? Is the solution simply throwing more
memory at the problem? Is there a way to make SPSS write the output
cases to disk instead of memory?



Re: SPSS memory problems with aggregate command

Hector Maletta
Besides memory, one useful trick is having the file sorted according
to the aggregating variables and using the /PRESORTED subcommand in
the AGGREGATE command. This speeds things up considerably.

Hector


Re: SPSS memory problems with aggregate command

Richard Ristow
In reply to this post by Egon Kraan
At 11:35 AM 3/1/2007, Egon Kraan wrote:

>Sometimes I get a similar error message, and other times I get an
>output file where some of the aggregated data, after a few million
>records have been processed, are simply not written out, so I get
>partial output.
>
>Is there a way around this? Is the solution simply throwing more
>memory at the problem? Is there a way to make SPSS write the output
>cases to disk instead of memory?

Yes, there is: sort the cases by the break variables, and specify
subcommand /PRESORTED on AGGREGATE. I think I'd said that, as Hector
Maletta also just did. Can you visualize how AGGREGATE is going to
operate, with and without /PRESORTED?
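
To make the two modes concrete (variable and file names invented for
illustration):

* Default: AGGREGATE does its own grouping and builds all the
* output cases in memory; fast, but limited by workspace.
AGGREGATE OUTFILE='C:\temp\agg.sav'
  /BREAK=person_id
  /mean_score=MEAN(score).

* With /PRESORTED: input must already be in break order, and
* output cases can go to disk as each break group completes.
SORT CASES BY person_id.
AGGREGATE OUTFILE='C:\temp\agg.sav'
  /PRESORTED
  /BREAK=person_id
  /mean_score=MEAN(score).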

By the way, I believe the only recommended use of /PRESORTED is when
you have very many break categories - hundreds of thousands.