Speeding up aggregate

Derek Willemsen

Dear all,

 

I have a dataset that contains 16 million records. I need to count how many records there are for each combination of 4 break variables, so I use a simple AGGREGATE in ADDVARIABLES mode.

 

AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=VAR1 VAR2 VAR3 VAR4
  /N_BREAK=N.

 

The first few million records go quickly, but after about 11 million records the aggregation becomes really slow, and it takes ages to finish the last 5 million. The whole operation normally takes about 2.5 hours.

 

Is there a way to speed this process up?

 

(I have 2GB internal memory)

 

Thanks in advance!



Derek Willemsen

Re: Speeding up aggregate

ViAnn Beadle

CROSSTABS will most likely give you those counts faster, unless you really want to add the counts back to your records; I think that would take two passes of the data. Also, if the file is already sorted on the break variables, specify PRESORTED.
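
As a rough sketch of the CROSSTABS route mentioned above (the way the four variables are layered here is an assumption, not something stated in the thread), the counts could be requested like this. Note that this only prints a table of counts; it does not add a count variable to the records.

```spss
* Sketch only: count cases for every combination of the four break variables.
* VAR2 to VAR4 are used as layer variables, which is an arbitrary choice.
CROSSTABS
  /TABLES=VAR1 BY VAR2 BY VAR3 BY VAR4
  /CELLS=COUNT.
```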

 

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Derek Willemsen
Sent: Thursday, February 11, 2010 8:32 AM
To: [hidden email]
Subject: Speeding up aggregate

 

Re: Speeding up aggregate

Mike
In reply to this post by Derek Willemsen
Does it still take that long if you use a file that only has the four break
variables and perhaps a case ID? 
 
-Mike Palij
New York University
 
Re: Speeding up aggregate

Derek Willemsen

Hi Mike,

 

Thanks for your reply. I've tested it, and it still gets slow after 9-10 million records. The first 9 million were processed in a few seconds, so I was hopeful, but after a while it slowed down to a couple of hundred records per second (instead of thousands or millions).

 

Greetings,
Derek

 


From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Mike Palij
Sent: Thursday, 11 February 2010 16:53
To: [hidden email]
Subject: Re: Speeding up aggregate

 

Re: Speeding up aggregate

Garry Gelade

Derek

 

Your aggregate command isn't using the presorted option. I'd be inclined to try presorting and then aggregating to a new dataset or to a temporary file.  You could then merge the counts back to your original file if you need the variable on the original data. 
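
A minimal sketch of that approach, assuming the counts are written to a temporary file. The file name counts.sav is hypothetical, and the SORT CASES step can be dropped if the data are already in break-variable order.

```spss
* Sketch only: sort, aggregate to a separate file, then merge the counts back.
* 'counts.sav' is a hypothetical temporary file name.
SORT CASES BY VAR1 VAR2 VAR3 VAR4.

AGGREGATE
  /OUTFILE='counts.sav'
  /PRESORTED
  /BREAK=VAR1 VAR2 VAR3 VAR4
  /N_BREAK=N.

* Table lookup: every original case picks up the count for its break group.
MATCH FILES
  /FILE=*
  /TABLE='counts.sav'
  /BY VAR1 VAR2 VAR3 VAR4.
EXECUTE.
```

The aggregated file has one row per break group, so it is small compared with the 16 million original records, and the merge is a single sequential pass over two files that are both in break-variable order.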

 

Garry Gelade

Business Analytic Ltd.

 

 

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Derek Willemsen
Sent: 11 February 2010 16:18
To: [hidden email]
Subject: Re: Speeding up aggregate

 

Re: Speeding up aggregate

Derek Willemsen

Thanks, Garry, for this good advice. The file is indeed already sorted on the break variables, so I tried the PRESORTED option. That did the trick! It still takes a long time, but much less than before!
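
For anyone reading along, the original command with the PRESORTED subcommand added would look roughly like this. It is a sketch that assumes the file really is sorted on all four break variables; as far as I understand, if it is not, AGGREGATE starts a new group every time the break values change, so the counts can be wrong.

```spss
* Sketch only: same aggregate as before, but telling SPSS the data are already sorted.
* PRESORTED must come before the BREAK subcommand.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /PRESORTED
  /BREAK=VAR1 VAR2 VAR3 VAR4
  /N_BREAK=N.
```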

 

@ViAnn: I need the variable with the count for later use, so CROSSTABS is not an option.

 

Thanks all for your advice!

 

- Derek

 


From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Garry Gelade
Sent: Thursday, 11 February 2010 18:45
To: [hidden email]
Subject: Re: Speeding up aggregate

 
