I have trucker data that I need to aggregate and
have hit a snag.
Can somebody please point me in the right direction. A trucker can visit any site on their route. I want to count how many different sites they visit without restructuring my data. The 1st 2 columns are my data - the 3rd column [N_diff_sites] is what I want to compute. trucker site n_diff_sites A 2 1 A 2 1 A 2 1 B 2 2 B 1 2 C 1 4 C 2 4 C 3 4 C 4 4 etc. Regards -- Mark Webb Line +27 (21) 786 4379 Cell +27 (72) 199 1000 [Poor reception] Fax +27 (86) 260 1946 Skype tomarkwebb Email [hidden email]===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Here is one example. This basically creates a sequential counter within trucker and site pairs and then takes the max of the counter per trucker using AGGREGATE.
************************************************. DATA LIST FREE / trucker (A1) site res (2F1.0). BEGIN DATA A 2 1 A 2 1 A 2 1 B 2 2 B 1 2 C 1 4 C 2 4 C 3 4 C 4 4 END DATA. SORT CASES BY trucker site. COMPUTE n_diff_sites = 1. IF site <> LAG(site) AND (trucker = LAG(trucker)) n_diff_sites = LAG(n_diff_sites) + 1. AGGREGATE OUTFILE = * MODE=ADDVARIABLES OVERWRITE=YES /BREAK trucker /n_diff_sites = MAX(n_diff_sites). ************************************************. |
In reply to this post by Mark Webb-5
Do it in steps: first aggregate to get one row for each trucker & site; second aggregate and count number of rows per trucker. Last step, matching it back may not be necessary, depending on your work.
DATA LIST list / trucker site n_diff_sites (A1 F1 F1). BEGIN DATA. A 2 1 A 2 1 A 2 1 B 2 2 B 1 2 C 1 4 C 2 4 C 3 4 C 4 4 END DATA. DATASET NAME ds_trucker. DATASET DECLARE ds_ts. AGGREGATE /OUTFILE=ds_ts /BREAK=trucker site /N_BREAK=N. DATASET ACTIVATE ds_ts. AGGREGATE /OUTFILE=* MODE=ADDVARIABLES /BREAK=trucker /n_diff=N. DATASET ACTIVATE ds_trucker. SORT CASES BY trucker site. MATCH FILES FILE = * /TABLE ds_ts /BY trucker site /DROP N_BREAK. EXECUTE. DATASET CLOSE ds_ts. HTH, PR |
In reply to this post by Mark Webb-5
Hi,
aggregate outfile = * mode=addvariables /break = trucker /n_diff_sites "number of different sites" = n. Regards, Albert-Jan ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ >________________________________ > From: Mark Webb <[hidden email]> >To: [hidden email] >Sent: Monday, November 3, 2014 2:25 PM >Subject: [SPSSX-L] Aggregation query > > > >I have trucker data that I need to aggregate and have hit a snag. >Can somebody please point me in the right direction. >A trucker can visit any site on their route. I want to count how many different sites they visit without restructuring my data. >The 1st 2 columns are my data - the 3rd column [N_diff_sites] is > >trucker site n_diff_sites >A 2 1 >A 2 1 >A 2 1 >B 2 2 >B 1 2 >C 1 4 >C 2 4 >C 3 4 >C 4 4 >etc. > >Regards > >-- Cell +27 (72) 199 1000 [Poor reception] Fax +27 (86) 260 1946 Skype tomarkwebb Email [hidden email] ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD > > ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Albert, that won't work with this example because there are duplicate trucker-site pairs in the original example (see Trucker A in the original example). If you eliminated duplicates first and then did that it would work, but I presume that goes against the original request "without restructuring the data".
|
In reply to this post by Andy W
What follows is my solution: *****. sORT CASES BY trucker Site(A). MATCH FILES  /FILE=*  /BY trucker site  /LAST=PrimarioÚltimo. FILTER BY PrimarioÚltimo. AGGREGATE  /OUTFILE=* MODE=ADDVARIABLES  /BREAK=Trucker  /Num_dif_sites2=N. filter off. AGGREGATE  /OUTFILE=* MODE=ADDVARIABLES overwrite = yes  /BREAK=Trucker  /Num_dif_sites2=Max (num_dif_sites2). ******. Hope it helps! Florentino Menéndez. On Mon, Nov 3, 2014 at 12:20 PM, Andy W <[hidden email]> wrote: Here is one example. This basically creates a sequential counter within |
In reply to this post by Andy W
An alternative way to achieve the same.
sort cases by trucker site. match files file=* by =trucker site /first=UniqueSite. aggregate outfile=* mode=addvariables /break= trucker /n_diff_sites2=sum(UniqueSite). |
Administrator
|
In reply to this post by Andy W
If "without restructuring the data" means that the original order of the cases must be maintained, then one could tweak Andy's solution slightly to accommodate. E.g.,
* Save the original order of the cases. ****************************** . COMPUTE OriginalOrder = $casenum. ****************************** . SORT CASES BY trucker site. COMPUTE n_diff_sites = 1. IF site NE LAG(site) AND (trucker EQ LAG(trucker)) n_diff_sites = LAG(n_diff_sites) + 1. AGGREGATE OUTFILE = * MODE=ADDVARIABLES OVERWRITE=YES /BREAK trucker /n_diff_sites = MAX(n_diff_sites). * Restore the original order of the cases. ****************************** . SORT CASES by OriginalOrder. ****************************** . DELETE VARIABLES OriginalOrder. FORMATS n_diff_sites (F3.0). LIST.
--
Bruce Weaver bweaver@lakeheadu.ca http://sites.google.com/a/lakeheadu.ca/bweaver/ "When all else fails, RTFM." PLEASE NOTE THE FOLLOWING: 1. My Hotmail account is not monitored regularly. To send me an e-mail, please use the address shown above. 2. The SPSSX Discussion forum on Nabble is no longer linked to the SPSSX-L listserv administered by UGA (https://listserv.uga.edu/). |
In reply to this post by Andy W
Oops, thanks for spotting that. Maybe a filter would do the trick, too.
data list free / trucker (a1) site (f2) n_diff_sites_desired (f2). begin data A 2 1 A 2 1 A 2 1 B 2 2 B 1 2 C 1 4 C 2 4 C 3 4 C 4 4 end data. compute f = not ( trucker eq lag(trucker) and site eq lag(site) ). execute. temporary. filter by f. aggregate outfile = * mode=addvariables /break = trucker /n_diff_sites "number of different sites" = n. if missing( n_diff_sites ) n_diff_sites = lag(n_diff_sites). execute. Regards, Albert-Jan ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ----- Original Message ----- > From: Andy W <[hidden email]> > To: [hidden email] > Cc: > Sent: Monday, November 3, 2014 3:46 PM > Subject: Re: [SPSSX-L] Aggregation query > > Albert, that won't work with this example because there are duplicate > trucker-site pairs in the original example (see Trucker A in the original > example). If you eliminated duplicates first and then did that it would > work, but I presume that goes against the original request "without > restructuring the data". > > > > > ----- > Andy W > [hidden email] > http://andrewpwheeler.wordpress.com/ > -- > View this message in context: > http://spssx-discussion.1045642.n5.nabble.com/Aggregation-query-tp5727769p5727773.html > Sent from the SPSSX Discussion mailing list archive at Nabble.com. > > ===================== > To manage your subscription to SPSSX-L, send a message to > [hidden email] (not to SPSSX-L), with no body text except the > command. To leave the list, send the command > SIGNOFF SPSSX-L > For a list of commands to manage subscriptions, send the command > INFO REFCARD > ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
In reply to this post by Andy W
Let us be explicit here. The original request's phrase, "without restructuring
=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
the data," should be taken as a descriptor of how the final file should look; do whatever is needed to get a count, and if it is in an aggregated file, then tack the count on to the end of each record. There are several ways to do the task. If the original file were in random order, you would need to label lines with a line number, so that the result could be sorted back to that order. Now, if someone said, "without creating any temporary files", that would be requesting a silly way to do this pragmatic job - unless one is hoping for a new programming break-through. -- Rich Ulrich > Date: Mon, 3 Nov 2014 06:46:01 -0800 > From: [hidden email] > Subject: Re: Aggregation query > To: [hidden email] > > Albert, that won't work with this example because there are duplicate > trucker-site pairs in the original example (see Trucker A in the original > example). If you eliminated duplicates first and then did that it would > work, but I presume that goes against the original request "without > restructuring the data". > > > > ----- > Andy W > [hidden email] > http://andrewpwheeler.wordpress.com/ > -- > View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Aggregation-query-tp5727769p5727773.html > Sent from the SPSSX Discussion mailing list archive at Nabble.com. > > ===================== > To manage your subscription to SPSSX-L, send a message to > [hidden email] (not to SPSSX-L), with no body text except the > command. To leave the list, send the command > SIGNOFF SPSSX-L > For a list of commands to manage subscriptions, send the command > INFO REFCARD |
"Now,
if someone said, "without creating any temporary files", that
would be
requesting a silly way to do this pragmatic job - unless one is hoping for a new programming break-through. " No new programming breakthrough required. It could be done with a few lines of Python using sets, and not even sorting the case data would be necessary. Jon Peck (no "h") aka Kim Senior Software Engineer, IBM [hidden email] phone: 720-342-5621 From: Rich Ulrich <[hidden email]> To: [hidden email] Date: 11/03/2014 11:28 AM Subject: Re: [SPSSX-L] Aggregation query Sent by: "SPSSX(r) Discussion" <[hidden email]> Let us be explicit here. The original request's phrase, "without restructuring the data," should be taken as a descriptor of how the final file should look; do whatever is needed to get a count, and if it is in an aggregated file, then tack the count on to the end of each record. There are several ways to do the task. If the original file were in random order, you would need to label lines with a line number, so that the result could be sorted back to that order. Now, if someone said, "without creating any temporary files", that would be requesting a silly way to do this pragmatic job - unless one is hoping for a new programming break-through. -- Rich Ulrich > Date: Mon, 3 Nov 2014 06:46:01 -0800 > From: [hidden email] > Subject: Re: Aggregation query > To: [hidden email] > > Albert, that won't work with this example because there are duplicate > trucker-site pairs in the original example (see Trucker A in the original > example). If you eliminated duplicates first and then did that it would > work, but I presume that goes against the original request "without > restructuring the data". > > > > ----- > Andy W > [hidden email] > http://andrewpwheeler.wordpress.com/ > -- > View this message in context: http://spssx-discussion.1045642.n5.nabble.com/Aggregation-query-tp5727769p5727773.html > Sent from the SPSSX Discussion mailing list archive at Nabble.com. > > ===================== > To manage your subscription to SPSSX-L, send a message to > [hidden email] (not to SPSSX-L), with no body text except the > command. To leave the list, send the command > SIGNOFF SPSSX-L > For a list of commands to manage subscriptions, send the command > INFO REFCARD ===================== To manage your subscription to SPSSX-L, send a message to LISTSERV@... (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD ===================== To manage your subscription to SPSSX-L, send a message to [hidden email] (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD |
Administrator
|
In reply to this post by Albert-Jan Roskam-2
Assuming data are sorted by trucker site: Bouncing off Albert-Jan: data list free / trucker (a1) site (f2) n_diff_sites_desired (f2). begin data A 2 1 A 2 1 A 2 1 B 2 2 B 1 2 C 1 4 C 2 4 C 3 4 C 4 4 end data. COMPUTE @flag@ = not ( trucker eq lag(trucker) and site eq lag(site) ). AGGREGATE OUTFILE * MODE=ADDVARIABLES/BREAK trucker/total=SUM(@flag@). DELETE VARIABLES @flag@ .
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
Free forum by Nabble | Edit this page |