SPSSX Discussion

aggregate value without aggregation

wsu_wright

aggregate value without aggregation

We have a large file (10 million plus cases) which for some cases has a
date value (mm/dd/yyyy). We need to take the last (or max) date entry
and append it to all cases (even those which do not have a date value).
Normally we would use the aggregation procedure to do this, however, we
already have several sorts in this file and given its size, an
additional sort to aggregate for one value would be costly in time, so
we were wondering if there is another method to extract the max date &
make it a value for all cases (e.g., compute).

Thanks in advance...

David

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Maguin, Eugene

Re: aggregate value without aggregation

David,

I'm confused by some different possibilities. Is it
1) that you want to find the max date across the 10M cases and append that one
value to all 10M cases, or
2) that the 10M cases subdivide into 100K groups, say, and you want to find the
max date in each group and append that value to the cases in the group,
thus spreading the 100K values across the 10M cases?

I'm curious. How much time does a sort pass on one variable take with 10M cases?

Gene Maguin

Fry, Jonathan B.

Re: aggregate value without aggregation

In reply to this post by wsu_wright

Since release 13, you can use AGGREGATE for this with no sorting. The command might look like:

aggregate /break=groupid /lastdate = max(date).

If there is no group ID variable, you'll need to create a constant variable to use as a group ID unless you are running release 17 or later.
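
For the single-maximum case, the constant-break approach described above might be sketched as follows. This is only an illustration: `const` and `lastdate` are made-up names, `date` stands in for whatever the date variable is called, and the OUTFILE specification is written out in full on the assumption that the result should be appended to the active dataset rather than replace it.

```spss
* Sketch only. const is a made-up constant used as the break variable.
COMPUTE const = 1.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=const
  /lastdate = MAX(date).
```

Because every case has the same value of const, the whole file forms one group and every case receives the same maximum.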

Jonathan Fry
SPSS Inc.

Bruce Weaver

Re: aggregate value without aggregation

In reply to this post by Maguin, Eugene

I too was uncertain as to what exactly David was asking. But it can't be number 1, can it, because that would not require any sorting. You'd just need to compute some constant to use as the BREAK variable in the AGGREGATE. So I think it must be number 2. If it is, I can't think of any straightforward alternative to AGGREGATE that does not also require sorting the data.

Here's a somewhat complicated (and half-baked) approach that might work. ;-)

* Turn on OMS - use it to send output from MEANS to a new data set.
* I'm too lazy to work out the syntax for you right now! .

means datevar by Group / cells = max . /* no sorting required! .
OMSEND .

* Activate the dataset containing the max values of datevar .
* Use WRITE to write a series of COMPUTE commands to a syntax file .
* Activate the original dataset .
* Include the syntax file.

The syntax file created via WRITE would look something like this:

numeric maxdate (date11).
do if (group EQ 1).
- compute maxdate = {max for that group}
else if (group EQ 2).
- compute maxdate = {max for that group}
etc
end if.
exe.

Note again that this is a half-baked idea with no guarantees of success. ;-)

--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

Peck, Jon

Re: aggregate value without aggregation

Note that the AGGREGATE procedure does NOT require sorting in general.

Regards,
Jon Peck

wsu_wright

Re: aggregate value without aggregation

In reply to this post by wsu_wright

Actually, it is #1, we need to append the max date to all 10 million
cases.

The processing time is different whether one is computing a variable or
aggregating it.

The following command takes a little over 4 minutes to run through the
10 million cases:

AGG /source_date=max(source_date).

Whereas the following compute takes less than 2 minutes:

COMPUTE source_date=DATE.DMY(9,2,2009).

The advantage of the compute is that I can place it with several other
computes so that they all run as a single data pass saving time (in a
file that already runs over 20 minutes), whereas with the aggregate, it
becomes a single data pass in ADDITION to the other computes that run
later in the job.

David

Maguin, Eugene

Re: aggregate value without aggregation

David,

I don't think there is any way to do what you want without either a sort or
aggregate. The problem is how to get the max date value spread back
across all cases. The following bit of code, although untested, should
determine the maximum date value.

Let date1 be your date variable.

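* #maxdate is a scratch variable, so it keeps its value from case to case.
Do if ($casenum eq 1).
+ compute #maxdate=date1.
Else if (sysmis(#maxdate)).
+ compute #maxdate=date1.
Else if (date1 gt #maxdate).
+ compute #maxdate=date1.
End if.
Compute maxdate=#maxdate.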

All cases following the maximum value of date1 will have the correct value
for maxdate. The problem is how to get that value to the cases prior to the
case with max value for date1. One way is a sort followed by more code.
Specifically,

Sort cases by maxdate(d).

if ($casenum gt 1) maxdate=lag(maxdate).

Others have recommended using aggregate with a new variable having a
constant value and specifying presorted. I understand that aggregate is
slower than a compute, but is it slower than a sort? I don't know.

Gene Maguin

Fry, Jonathan B.

Re: aggregate value without aggregation

In reply to this post by wsu_wright

It appears that a data pass for this data set requires about two minutes unless significant computation makes it take longer. The simplest AGGREGATE command requires two data passes (one to compute the maximum date, the other to apply it to all the cases), so it takes four minutes.

You can cut that in half by adding "/presorted" to the AGGREGATE command, making the command:

AGG /presorted/max_source_date=max(source_date).

That option allows AGGREGATE to apply the maximum to the cases in a manner similar to the way COMPUTE works. There is no separate data pass for it. I don't think there is a faster way to do this.

Jonathan Fry
SPSS Inc.

Bruce Weaver

Re: aggregate value without aggregation

Perhaps I'm being dense, but how does adding /presorted remove the need for two passes? Surely the same two steps are still needed (i.e., find the max, and write it to all cases). Here's the help file entry on "Presorted".

---Start of Help File Entry---

If the data are already sorted in order by the break variables, you can reduce run time and memory requirements by using the PRESORTED subcommand.

• If specified, PRESORTED must precede BREAK. The only specification is the keyword PRESORTED. PRESORTED has no additional specifications.

• When PRESORTED is specified, the program forms an aggregate case out of each group of adjacent cases with the same values for the break variables.

• When PRESORTED is specified, if AGGREGATE is appending new variables to the active dataset rather than writing a new file or replacing the active dataset, the cases must be sorted in ascending order by the BREAK variables.

---End of Help File Entry---

It does indeed say that adding /presorted will reduce run-time, but I still don't understand how.

Also, that final point seems to contradict what you and Jon P said elsewhere in the thread about sorting not being necessary before using AGGREGATE. Or have I misunderstood?

Thanks Jon (& Jon).

wsu_wright

Re: aggregate value without aggregation

In reply to this post by wsu_wright

Bruce, Jonathan, Jon, & Gene

Thanks for all your replies, I certainly learned some things in this
exchange although like Bruce, I'm not quite clear on the role of sorting
in the aggregation command.

My original syntax was running over 4 minutes to make the single data
pass on the 10 million + cases:

SORT cases by source_date.
AGG /presorted /max_source_date=max(source_date).

As both Jonathan & Jon pointed out, the sort was not necessary since it
is not required in 17 (at least not with MAX) & not necessary in the
case of the MAX option regardless of version. I also saw during the
execution (from the case counter) that pasw was making 3 passes at the
data, one from the sort command, 2nd from the agg & 3rd from the max
option, so I was sorting the file twice. As per Jonathan's post
& Jon's earlier post about not needing the Sort command, I changed
the syntax & now the agg runs in about 2 minutes.

However, as Bruce observes, regardless of whether I state the
/presorted option, it appears agg continues to sort.

The following two commands perform in the same time (2 minutes) and the
case counter in both syntaxes shows pasw making 2 passes of the data:

AGG /presorted /max_source_date=max(source_date).

AGG /max_source_date=max(source_date).

So I'm confused too: whether I sort or not, it appears, at least from
the case counter, that pasw is making 2 passes at the data, one for
sorting & the other for creating the new factor.

David

Fry, Jonathan B.

Re: aggregate value without aggregation

It seems my memory played a trick on me with regard to the number of data passes AGGREGATE performs to match statistics back to the active dataset. The original design called for doing that in a single pass, but I apparently could not get that to work, so it does two passes.

AGGREGATE never sorts the data. With the PRESORTED option, it writes a work file much like an OUTFILE and then does the MATCH FILES merge internally. That approach requires a sorted file. Without the PRESORTED option, it builds a rapidly-searchable data structure in memory containing that same information, then, on the second data pass, looks up the correct entry for each case to match it back. That approach requires memory proportional to the number of groups.
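
To make the two modes concrete, here is a sketch for a file with groups. It is an illustration rather than a tested job: `groupid`, `max_a`, and `max_b` are placeholder names, and the SORT CASES is included only to show the precondition that PRESORTED relies on.

```spss
* Without PRESORTED, cases can be in any order and memory use grows with the number of groups.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=groupid
  /max_a = MAX(source_date).

* With PRESORTED, cases must already be in ascending order of groupid.
SORT CASES BY groupid (A).
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /PRESORTED
  /BREAK=groupid
  /max_b = MAX(source_date).
```

With a single constant break value there is only one group, so either form should give the same result without any sorting.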

There is a less convenient one-pass solution: use AGGREGATE to create a dataset containing the maximum date, and MATCH FILES to apply it. MATCH FILES is what I call an "initial transformation": you can follow it with additional transformations that will be done to the result of the match in the same data pass. Because MATCH FILES needs a key variable, you'll need to create a constant variable for this to work. The command sequence might look like:

Compute constant = 1.
Dataset declare maxdate.
Aggregate /outfile=maxdate/ break=constant
/max_source_date=max(source_date).
Match files /file=*/table=maxdate/by constant/drop constant.
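
As an illustration of the "initial transformation" point, the sketch below repeats the sequence with one further transformation queued after the MATCH FILES, so that it would run in the same data pass as the match. The variable `days_before_max` is a made-up name, and the division by 86400 assumes source_date is an ordinary SPSS date variable stored in seconds.

```spss
* Sketch only. days_before_max is a hypothetical example variable.
Compute constant = 1.
Dataset declare maxdate.
Aggregate /outfile=maxdate
  /break=constant
  /max_source_date=max(source_date).
Match files /file=* /table=maxdate /by constant /drop constant.
* This compute is carried out in the same pass as the match.
Compute days_before_max = (max_source_date - source_date) / 86400.
Execute.
```

Any other pending COMPUTE commands in the job could be placed where the last compute is, which is what lets the match share a data pass with them.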

Jonathan Fry
SPSS Inc.
