SPSSX Discussion

Speeding up aggregate


Derek Willemsen

Speeding up aggregate

Dear all,

I have a dataset that contains 16 million records. I need to count how many records there are for each combination of 4 break variables, so I use a simple AGGREGATE with MODE=ADDVARIABLES.

AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=VAR1 VAR2 VAR3 VAR4
  /N_BREAK=N.

The first couple of million records go fast, but after 11 million records the aggregation gets really slow. It takes ages to finish the last 5 million records. Normally the whole operation takes about 2.5 hours.

Is there a way to speed this process up?

(I have 2 GB of internal memory.)

Thanks in advance!

Derek Willemsen

ViAnn Beadle

Re: Speeding up aggregate

CROSSTABS will most likely give you those counts faster, unless you really need the counts added back to your records; I think adding them back takes two passes of the data. Also, if the file is already sorted on the break variables, specify PRESORTED.

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Derek Willemsen
Sent: Thursday, February 11, 2010 8:32 AM
To: [hidden email]
Subject: Speeding up aggregate


Mike

Re: Speeding up aggregate

In reply to this post by Derek Willemsen

Does it still take that long if you use a file that only has the four break variables and perhaps a case ID?

-Mike Palij

New York University

[hidden email]


Derek Willemsen

Re: Speeding up aggregate

Hi Mike,

Thanks for your reply. I’ve tested it and it still gets slow after 9-10 million records. The first 9 million were processed in a few seconds, so I was hopeful, but after a while it slowed down to a couple of hundred records a second (instead of thousands or millions).

Greetings,
Derek

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Mike Palij
Sent: Thursday, 11 February 2010 16:53
To: [hidden email]
Subject: Re: Speeding up aggregate


Garry Gelade

Re: Speeding up aggregate

Derek

Your aggregate command isn't using the presorted option. I'd be inclined to try presorting and then aggregating to a new dataset or to a temporary file. You could then merge the counts back to your original file if you need the variable on the original data.
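A minimal sketch of this suggestion, assuming the counts go to a new dataset that is then merged back as a table lookup. The dataset name counts is hypothetical, and the SORT CASES step can be dropped if the file is already in break-variable order:

```spss
* Sort on the break variables (skip this if the file is already sorted).
SORT CASES BY VAR1 VAR2 VAR3 VAR4.

* Write one row per break group to a separate dataset (name is illustrative).
DATASET DECLARE counts.
AGGREGATE
  /OUTFILE='counts'
  /PRESORTED
  /BREAK=VAR1 VAR2 VAR3 VAR4
  /N_BREAK=N.

* Merge the group counts back onto every case of the active file.
MATCH FILES
  /FILE=*
  /TABLE='counts'
  /BY VAR1 VAR2 VAR3 VAR4.
EXECUTE.
```

The aggregated dataset has one case per break group, so the MATCH FILES table lookup only has to carry that small file alongside the 16 million records. Both files must be sorted on the BY variables for the merge to work.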

Garry Gelade

Business Analytic Ltd.

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Derek Willemsen
Sent: 11 February 2010 16:18
To: [hidden email]
Subject: Re: Speeding up aggregate


Derek Willemsen

Re: Speeding up aggregate

Thanks, Garry, for this good advice. The file is indeed sorted on the break variables, so I’ve tried the PRESORTED option. That did the trick! It still takes a long time, but much less than before!
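For reference, the original command with the PRESORTED subcommand added would look roughly like this. It is a sketch that assumes the active file is already sorted on VAR1 through VAR4; PRESORTED has to come before BREAK:

```spss
* Assumes the cases are already in VAR1 VAR2 VAR3 VAR4 order.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /PRESORTED
  /BREAK=VAR1 VAR2 VAR3 VAR4
  /N_BREAK=N.
```

If the file is not actually in break-variable order, PRESORTED starts a new group each time the break values change, so the counts would be wrong; run SORT CASES first when in doubt.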

@ ViAnn: I need the count variable for later use, so CROSSTABS is not an option.

Thanks all for your advice!

- Derek

From: SPSSX(r) Discussion [mailto:[hidden email]] On Behalf Of Garry Gelade
Sent: Thursday, 11 February 2010 18:45
To: [hidden email]
Subject: Re: Speeding up aggregate
