Aggregation query

classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

Aggregation query

Mark Webb-5
I have trucker data that I need to aggregate and have hit a snag.
Can somebody please point me in the right direction.
A trucker can visit any site on their route. I want to count how many different sites they visit without restructuring my data.
The 1st 2 columns are my data - the 3rd column [N_diff_sites] is what I want to compute.

trucker       site       n_diff_sites
A                2              1
A                2              1
A                2              1
B                2              2
B                1              2
C                1              4
C                2              4
C                3              4
C                4              4
etc.

Regards
-- 
Mark Webb

Line +27 (21) 786 4379
Cell +27 (72) 199 1000 [Poor reception]
Fax  +27 (86) 260 1946 

Skype       tomarkwebb 
Email       [hidden email] 
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Aggregation query

Andy W
Here is one example. This basically creates a sequential counter within trucker and site pairs and then takes the max of the counter per trucker using AGGREGATE.

************************************************.
DATA LIST FREE / trucker (A1) site res (2F1.0).
BEGIN DATA
A                2              1
A                2              1
A                2              1
B                2              2
B                1              2
C                1              4
C                2              4
C                3              4
C                4              4
END DATA.

SORT CASES BY trucker site.
COMPUTE n_diff_sites = 1.
IF site <> LAG(site) AND (trucker = LAG(trucker)) n_diff_sites = LAG(n_diff_sites) + 1.
AGGREGATE OUTFILE = * MODE=ADDVARIABLES OVERWRITE=YES
  /BREAK trucker
  /n_diff_sites = MAX(n_diff_sites).
************************************************.
Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/
Reply | Threaded
Open this post in threaded view
|

Re: Aggregation query

PRogman
In reply to this post by Mark Webb-5
Do it in steps: first aggregate to get one row for each trucker & site; second aggregate and count number of rows per trucker. Last step, matching it back may not be necessary, depending on your work.

DATA LIST list /
trucker       site       n_diff_sites (A1 F1 F1).
BEGIN DATA.
A                2              1
A                2              1
A                2              1
B                2              2
B                1              2
C                1              4
C                2              4
C                3              4
C                4              4
END DATA.
DATASET NAME ds_trucker.
DATASET DECLARE ds_ts. 
AGGREGATE 
  /OUTFILE=ds_ts 
  /BREAK=trucker site 
  /N_BREAK=N. 

DATASET ACTIVATE ds_ts. 
AGGREGATE 
  /OUTFILE=* MODE=ADDVARIABLES 
  /BREAK=trucker 
  /n_diff=N.

DATASET ACTIVATE ds_trucker.
SORT CASES BY trucker site.

MATCH FILES
  FILE = *
 /TABLE ds_ts
 /BY trucker site
 /DROP N_BREAK.

EXECUTE.
DATASET CLOSE ds_ts.

HTH, PR
Reply | Threaded
Open this post in threaded view
|

Re: Aggregation query

Albert-Jan Roskam-2
In reply to this post by Mark Webb-5
Hi,
 
aggregate outfile = * mode=addvariables /break = trucker /n_diff_sites "number of different sites" = n.
 
Regards,

Albert-Jan




~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a

fresh water system, and public health, what have the Romans ever done for us?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~




>________________________________
> From: Mark Webb <[hidden email]>
>To: [hidden email]
>Sent: Monday, November 3, 2014 2:25 PM
>Subject: [SPSSX-L] Aggregation query
>
>
>
>I have trucker data that I need to aggregate and have hit a snag.
>Can somebody please point me in the right direction.
>A trucker can visit any site on their route. I want to count how many different sites they visit without restructuring my data.
>The 1st 2 columns are my data - the 3rd column [N_diff_sites] is
      what I want to compute.

>
>trucker       site       n_diff_sites
>A                2              1
>A                2              1
>A                2              1
>B                2              2
>B                1              2
>C                1              4
>C                2              4
>C                3              4
>C                4              4
>etc.
>
>Regards
>
>--
Mark Webb Line +27 (21) 786 4379
Cell +27 (72) 199 1000 [Poor reception]
Fax  +27 (86) 260 1946  Skype       tomarkwebb
Email [hidden email]
=====================
To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
>
>

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Aggregation query

Andy W
Albert, that won't work with this example because there are duplicate trucker-site pairs in the original example (see Trucker A in the original example). If you eliminated duplicates first and then did that it would work, but I presume that goes against the original request "without restructuring the data".
Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/
Reply | Threaded
Open this post in threaded view
|

Re: Aggregation query

fjmenendez
In reply to this post by Andy W
What follows is my solution:

*****.
sORT CASES BY trucker Site(A).

MATCH FILES
  /FILE=*
  /BY trucker site
  /LAST=PrimarioÚltimo.

FILTER  BY PrimarioÚltimo.


AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=Trucker
  /Num_dif_sites2=N.

filter off.

AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES overwrite = yes
  /BREAK=Trucker
  /Num_dif_sites2=Max (num_dif_sites2).

******.

Hope it helps!

Florentino Menéndez.



On Mon, Nov 3, 2014 at 12:20 PM, Andy W <[hidden email]> wrote:
Here is one example. This basically creates a sequential counter within
trucker and site pairs and then takes the max of the counter per trucker
using AGGREGATE.

************************************************.
DATA LIST FREE / trucker (A1) site res (2F1.0).
BEGIN DATA
A                2              1
A                2              1
A                2              1
B                2              2
B                1              2
C                1              4
C                2              4
C                3              4
C                4              4
END DATA.

SORT CASES BY trucker site.
COMPUTE n_diff_sites = 1.
IF site <> LAG(site) AND (trucker = LAG(trucker)) n_diff_sites =
LAG(n_diff_sites) + 1.
AGGREGATE OUTFILE = * MODE=ADDVARIABLES OVERWRITE=YES
  /BREAK trucker
  /n_diff_sites = MAX(n_diff_sites).
************************************************.



-----
Andy W
[hidden email]
http://andrewpwheeler.wordpress.com/
--
View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Aggregation-query-tp5727769p5727770.html
Sent from the SPSSX Discussion mailing list archive at Nabble.com.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Aggregation query

Jignesh Sutar
In reply to this post by Andy W
An alternative way to achieve the same.

sort cases by trucker site.
match files file=* by =trucker site /first=UniqueSite.
aggregate outfile=* mode=addvariables /break= trucker /n_diff_sites2=sum(UniqueSite).
Reply | Threaded
Open this post in threaded view
|

Re: Aggregation query

Bruce Weaver
Administrator
In reply to this post by Andy W
If "without restructuring the data" means that the original order of the cases must be maintained, then one could tweak Andy's solution slightly to accommodate.  E.g.,

* Save the original order of the cases.
****************************** .
COMPUTE OriginalOrder = $casenum.
****************************** .

SORT CASES BY trucker site.
COMPUTE n_diff_sites = 1.
IF site NE LAG(site) AND (trucker EQ LAG(trucker))
 n_diff_sites = LAG(n_diff_sites) + 1.
AGGREGATE OUTFILE = * MODE=ADDVARIABLES OVERWRITE=YES
  /BREAK trucker
  /n_diff_sites = MAX(n_diff_sites).

* Restore the original order of the cases.
****************************** .
SORT CASES by OriginalOrder.
****************************** .
DELETE VARIABLES OriginalOrder.
FORMATS n_diff_sites (F3.0).
LIST.





Andy W wrote
Here is one example. This basically creates a sequential counter within trucker and site pairs and then takes the max of the counter per trucker using AGGREGATE.

************************************************.
DATA LIST FREE / trucker (A1) site res (2F1.0).
BEGIN DATA
A                2              1
A                2              1
A                2              1
B                2              2
B                1              2
C                1              4
C                2              4
C                3              4
C                4              4
END DATA.

SORT CASES BY trucker site.
COMPUTE n_diff_sites = 1.
IF site <> LAG(site) AND (trucker = LAG(trucker)) n_diff_sites = LAG(n_diff_sites) + 1.
AGGREGATE OUTFILE = * MODE=ADDVARIABLES OVERWRITE=YES
  /BREAK trucker
  /n_diff_sites = MAX(n_diff_sites).
************************************************.
--
Bruce Weaver
bweaver@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/

"When all else fails, RTFM."

PLEASE NOTE THE FOLLOWING: 
1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above.
2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/).
Reply | Threaded
Open this post in threaded view
|

Re: Aggregation query

Albert-Jan Roskam-2
In reply to this post by Andy W
Oops, thanks for spotting that. Maybe a filter would do the trick, too.

data list free / trucker (a1) site (f2)  n_diff_sites_desired (f2).
begin data
 A                2              1
 A                2              1
 A                2              1
 B                2              2
 B                1              2
 C                1              4
 C                2              4
 C                3              4
 C                4              4
end data.
compute f =  not ( trucker eq lag(trucker) and site eq lag(site) ).
execute.
temporary.
filter by f.
aggregate outfile = * mode=addvariables /break = trucker /n_diff_sites "number of different sites" = n.
if missing( n_diff_sites ) n_diff_sites = lag(n_diff_sites).
execute.




Regards,

Albert-Jan




~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a

fresh water system, and public health, what have the Romans ever done for us?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~





----- Original Message -----

> From: Andy W <[hidden email]>
> To: [hidden email]
> Cc:
> Sent: Monday, November 3, 2014 3:46 PM
> Subject: Re: [SPSSX-L] Aggregation query
>
> Albert, that won't work with this example because there are duplicate
> trucker-site pairs in the original example (see Trucker A in the original
> example). If you eliminated duplicates first and then did that it would
> work, but I presume that goes against the original request "without
> restructuring the data".
>
>
>
>
> -----
> Andy W
> [hidden email]
> http://andrewpwheeler.wordpress.com/
> --
> View this message in context:
> http://spssx-discussion.1045642.n5.nabble.com/Aggregation-query-tp5727769p5727773.html
> Sent from the SPSSX Discussion mailing list archive at Nabble.com.
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> [hidden email] (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD
>  

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Aggregation query

Rich Ulrich
In reply to this post by Andy W
Let us be explicit here.  The original request's phrase, "without restructuring
the data,"  should be taken as a descriptor of how the final file should look;
do whatever is needed to get a count, and if it is in an aggregated file, then
tack the count on to the end of each record.

There are several ways to do the task.

If the original file were in random order, you would need to label lines with a
line number, so that the result could be sorted back to that order.

Now, if someone said, "without creating any temporary files", that would be
requesting a silly way to do this pragmatic job - unless one is hoping for a
new programming break-through.

--
Rich Ulrich


> Date: Mon, 3 Nov 2014 06:46:01 -0800

> From: [hidden email]
> Subject: Re: Aggregation query
> To: [hidden email]
>
> Albert, that won't work with this example because there are duplicate
> trucker-site pairs in the original example (see Trucker A in the original
> example). If you eliminated duplicates first and then did that it would
> work, but I presume that goes against the original request "without
> restructuring the data".
>
>
>
> -----
> Andy W
> [hidden email]
> http://andrewpwheeler.wordpress.com/
> --
> View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Aggregation-query-tp5727769p5727773.html
> Sent from the SPSSX Discussion mailing list archive at Nabble.com.
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> [hidden email] (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Aggregation query

Jon K Peck
"Now, if someone said, "without creating any temporary files", that would be
requesting a silly way to do this pragmatic job - unless one is hoping for a
new programming break-through.
"

No new programming breakthrough required.  It could be done with a few lines of Python using sets, and not even sorting the case data would be necessary.


Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621




From:        Rich Ulrich <[hidden email]>
To:        [hidden email]
Date:        11/03/2014 11:28 AM
Subject:        Re: [SPSSX-L] Aggregation query
Sent by:        "SPSSX(r) Discussion" <[hidden email]>




Let us be explicit here.  The original request's phrase, "without restructuring
the data,"  should be taken as a descriptor of how the final file should look;
do whatever is needed to get a count, and if it is in an aggregated file, then
tack the count on to the end of each record.

There are several ways to do the task.

If the original file were in random order, you would need to label lines with a
line number, so that the result could be sorted back to that order.

Now, if someone said, "without creating any temporary files", that would be
requesting a silly way to do this pragmatic job - unless one is hoping for a
new programming break-through.

--
Rich Ulrich


> Date: Mon, 3 Nov 2014 06:46:01 -0800
> From: [hidden email]
> Subject: Re: Aggregation query
> To: [hidden email]
>
> Albert, that won't work with this example because there are duplicate
> trucker-site pairs in the original example (see Trucker A in the original
> example). If you eliminated duplicates first and then did that it would
> work, but I presume that goes against the original request "without
> restructuring the data".
>
>
>
> -----
> Andy W
> [hidden email]
>
http://andrewpwheeler.wordpress.com/
> --
> View this message in context:
http://spssx-discussion.1045642.n5.nabble.com/Aggregation-query-tp5727769p5727773.html
> Sent from the SPSSX Discussion mailing list archive at Nabble.com.
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> [hidden email] (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD

===================== To manage your subscription to SPSSX-L, send a message to LISTSERV@... (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
Reply | Threaded
Open this post in threaded view
|

Re: Aggregation query

David Marso
Administrator
In reply to this post by Albert-Jan Roskam-2

Assuming data are sorted by trucker site:
Bouncing off Albert-Jan:
data list free / trucker (a1) site (f2)  n_diff_sites_desired (f2).
begin data
 A                2              1
 A                2              1
 A                2              1
 B                2              2
 B                1              2
 C                1              4
 C                2              4
 C                3              4
 C                4              4
end data.
COMPUTE @flag@ =  not ( trucker eq lag(trucker) and site eq lag(site) ).
AGGREGATE OUTFILE * MODE=ADDVARIABLES/BREAK trucker/total=SUM(@flag@).
DELETE VARIABLES @flag@ .

Albert-Jan Roskam-2 wrote
Oops, thanks for spotting that. Maybe a filter would do the trick, too.

data list free / trucker (a1) site (f2)  n_diff_sites_desired (f2).
begin data
 A                2              1
 A                2              1
 A                2              1
 B                2              2
 B                1              2
 C                1              4
 C                2              4
 C                3              4
 C                4              4
end data.
compute f =  not ( trucker eq lag(trucker) and site eq lag(site) ).
execute.
temporary.
filter by f.
aggregate outfile = * mode=addvariables /break = trucker /n_diff_sites "number of different sites" = n.
if missing( n_diff_sites ) n_diff_sites = lag(n_diff_sites).
execute.




Regards,

Albert-Jan




~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a

fresh water system, and public health, what have the Romans ever done for us?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~





----- Original Message -----
> From: Andy W <[hidden email]>
> To: [hidden email]
> Cc:
> Sent: Monday, November 3, 2014 3:46 PM
> Subject: Re: [SPSSX-L] Aggregation query
>
> Albert, that won't work with this example because there are duplicate
> trucker-site pairs in the original example (see Trucker A in the original
> example). If you eliminated duplicates first and then did that it would
> work, but I presume that goes against the original request "without
> restructuring the data".
>
>
>
>
> -----
> Andy W
> [hidden email]
> http://andrewpwheeler.wordpress.com/
> --
> View this message in context:
> http://spssx-discussion.1045642.n5.nabble.com/Aggregation-query-tp5727769p5727773.html
> Sent from the SPSSX Discussion mailing list archive at Nabble.com.
>
> =====================
> To manage your subscription to SPSSX-L, send a message to
> [hidden email] (not to SPSSX-L), with no body text except the
> command. To leave the list, send the command
> SIGNOFF SPSSX-L
> For a list of commands to manage subscriptions, send the command
> INFO REFCARD
>  

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me.
---
"Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis."
Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?"