Here is a somewhat general solution for attaching the closest fuzzy-matching value from file1 to file2 (using the variable merge as the key and Other1 as the data to be attached to file2). Usually a simple LAG of Other1 will be sufficient.
Note this specifically targets records in file2 which are sandwiched between two records in file1: calculate the smallest difference, then grab the value of Other1 associated with that record. What I mean by granularity is that you arbitrarily build your 'explosion' in increments of 1. What if you need 0.1 increments? In other words, building a large 'explosion' in order to do an exact match to arbitrary values is brute force, inelegant, and rather like using a sledgehammer to do framing. In retrospect, this does not look ahead enough in file1 to see whether there are clumps of file2 closer to the lead of file1 than the lag. That could be done, but I'm not going to bother at the moment. Hint: sort in reverse, repeat, and compare results. Hopefully this communicates the basic ideas, and 'exploding' data is forever vanquished from the toolbox.

INPUT PROGRAM.
LOOP caseid=1 TO 1000.
COMPUTE merge=UNIFORM(100).
COMPUTE Other1=TRUNC(UNIFORM(500)).
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
SORT CASES BY merge.
SAVE OUTFILE 'C:\Users\David\temp\temp1.sav'.

INPUT PROGRAM.
LOOP caseid=1 TO 1000.
COMPUTE merge=UNIFORM(100).
COMPUTE Other2=TRUNC(UNIFORM(500)).
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
SORT CASES BY merge.
SAVE OUTFILE 'C:\Users\David\temp\temp2.sav'.

ADD FILES FILE 'C:\Users\David\temp\temp1.sav' / IN=@1
 / FILE 'C:\Users\David\temp\temp2.sav' / IN=@2
 / BY merge.
EXE.
IF MISSING(Other1) Other1=LAG(Other1).
SHIFT VALUES VARIABLE=merge RESULT=lead_merge LEAD=1
 /VARIABLE=merge RESULT=lag_merge LAG=1
 /VARIABLE=Other1 RESULT=lead_other1 LEAD=1
 /VARIABLE=@1 RESULT=lead_@1 LEAD=1.
DO IF (@2 AND LAG(@1) AND lead_@1).
+  DO IF ABS(SUM(merge,-1*lead_merge)) LT ABS(SUM(merge,-1*lag_merge)).
+    COMPUTE otherxx=lead_other1.
+  ELSE.
+    COMPUTE otherxx=LAG(Other1).
+  END IF.
ELSE.
+  COMPUTE otherxx=Other1.
END IF.
IF $CASENUM EQ 1 AND MISSING(Other1) otherxx=lead_other1.
EXE.
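[Editor's aside: to make the 'sandwich' logic concrete outside of SPSS, here is a minimal Python sketch of the same nearest-key idea — for each file2 key, compare the file1 record just below and just above it and take the closer one. The function name and data are illustrative, not part of the original syntax.]

```python
import bisect

def nearest_match(file1_keys, file1_vals, file2_keys):
    """For each key in file2, attach the file1 value whose key is closest.
    file1_keys must be sorted ascending (as after SORT CASES BY merge)."""
    out = []
    for k in file2_keys:
        i = bisect.bisect_left(file1_keys, k)
        # candidates: the file1 record just below (lag) and just above (lead) k
        cands = []
        if i > 0:
            cands.append(i - 1)
        if i < len(file1_keys):
            cands.append(i)
        best = min(cands, key=lambda j: abs(file1_keys[j] - k))
        out.append(file1_vals[best])
    return out

# e.g. nearest_match([1.0, 4.0, 9.0], ['a', 'b', 'c'], [3.9, 8.0]) -> ['b', 'c']
```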
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
Ok, to be clear, one's "best" solution will ultimately depend on the nature of the data (e.g. how big one's files are) and the nature of the task (e.g. does it need to be done once or on a regular basis), as well as the complexity of the actual solution itself. I agree that "exploding" could be bad in some obvious circumstances, mainly when the exploded table is too big to handle, but let's be clear about those circumstances instead of denouncing it while providing an incomplete alternative in its place.
First, in regards to granularity, *this is not a problem at all* if you need to increment by a value other than an integer. Although the loop is constrained to increment by integers, if you need to increment by 0.1 or 0.01, just multiply your values to be integers and then divide the result back to the original values. Although more difficult, this approach could even be accommodated to non-linear interpolations. Again, it is a problem if the exploded file is too big to handle (in my experience, depending on the system and what your variables are, somewhere between 20-40 million records), and certainly more cumbersome if the file is very big. But note *you only need to make the exploded file and sort it once*, so even then it will likely not be that big of a deal. So, as far as I'm concerned, "granularity" is a moot point.

Let's also remember what information we do have about the OP's problem: these are *stretches of highway with given distance markers*. For an example, let's say we needed to do this for the *entire* US interstate highway system, and we had a marker every 0.01 of a mile (this would amount to a marker every 52.8 feet, well below the hectometer the OP originally specified). According to Wikipedia, <http://en.wikipedia.org/wiki/Interstate_Highway_System>, there are only 47,182 miles of interstate highways in the continental US. Exploding such a table would only result in about 4.7 million records, well within the capabilities of SPSS to handle without too much trouble. Also note again that with the exploding of the table, you only need to create the set and sort it once (or however often the reference street set is updated, at best once a year in my experience with geo-data). Now, if the OP feels like chiming in she can state how long her road set is, and we can evaluate whether exploding is feasible or not.
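[Editor's aside: the multiply-then-divide trick above can be sketched in a few lines of Python. This is purely illustrative — the function name and step values are my own, not from the thread.]

```python
# Exploding a range in fractional increments by working in scaled integers:
# multiply to integers, loop by 1, then divide back to the original units.
def explode(begin, end, step=0.1):
    scale = round(1 / step)                    # e.g. step=0.1 -> work in tenths
    b, e = round(begin * scale), round(end * scale)
    return [i / scale for i in range(b, e + 1)]

# e.g. explode(1.0, 1.5) -> [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]
```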
Now again, we don't really know the nature of the task nor the exact nature of the data, but I will give an example of cases I have used this for. Say I have a dataset of 300,000 calls to 911 that have street addresses in one medium-sized US city (for cities I have worked with, this could be between 3-10 years of CAD data). Let's also say that I have a street centerline file that has the beginning and end address ranges as well as the street, and that when exploded to form all potential addresses within said city only amounts to 100,000 potential addresses. Is the add-files solution, which will take multiple sortings of the 400,000 calls + street segments (which probably have a variety of data already attached), a better solution than the one-time sort of the calls and the exploded dataset? I don't think so, and IMO in that example the lesser evil is exploding the street segments file. In the end I don't do such tasks on a regular basis, so any solution would work, but the exploding and merging is IMO so simple it should be considered even if you got your add-and-interweave-records approach to work.

I will take a look at your code and see if I can update with how I would do it when I get a chance. I feel as if you are intentionally ignoring the complexity of the situation (we have both a begin and an end of the segment, and you don't want to match records that are past the end). But I will see if I can figure it out when I need a break from some of my other work. As you can tell I have a need for such solutions, so I'm all for better ways to do such a task. I'm currently unconvinced, though, *for my particular situation*, that this is obviously better. Can't say much about the OP's (I haven't worked with marked highways like that), but I would guess exploding is not unmanageable.

Andy
"I will take a look at your code and see if I can update with how I would do it when I get a chance. I feel as if you are intentionally ignoring the complexity of the situation"...
Hmmm, so you haven't bothered to look at my code before jumping back on the soapbox? 'Exploding' the data does not address the 'complexity' of the situation. It is NOT complex; it is utterly trivial.
David,
Feel free to reply with well-reasoned positions and evidence for your assertions. It is not a personal attack, and I have provided reasons why I believe my advice is appropriate.

Andy
In reply to this post by David Marso
Because David ruffled my feathers a bit, here is how *I would* do the add files and interweave. I'll get the soapbox out tomorrow when I get a chance to discuss the two options some more.
******************************************.
*this replicates what the street range segments look like.
set seed = 10.
input program.
loop #i = 1 to 100.
compute begin_add = 1.
compute end_add = 100.
compute street = #i.
compute segment_info = RV.NORMAL(0,10).
compute segment_uniqueid = 7 * #i.
end case.
end loop.
end file.
end input program.
dataset name segments.
execute.
*I need this info for the sort after add files.
dataset activate segments.
compute add = begin_add.
*ok lets replicate what the traffic volume table looks like.
input program.
loop #i = 1 to 100.
loop #j = 1 to 20.
compute add = TRUNC(RV.UNIFORM(1,100)).
compute traffic = RV.POISSON(5).
compute street = #i.
end case.
end loop.
end loop.
end file.
end input program.
dataset name traffic.
execute.
*now to add the two files.
dataset activate traffic.
add files file = * /in flag_traffic /file = 'segments'.
*now sort so the segments record is above the appropriate traffic records.
sort cases by street add flag_traffic.
*you need a set of temp begin & end address range variables for the next if statements to work.
IF flag_traffic = 0 #down_begin = begin_add.
IF flag_traffic = 1 #down_begin = lag(#down_begin).
IF flag_traffic = 0 #down_end = end_add.
IF flag_traffic = 1 #down_end = lag(#down_end).
IF flag_traffic = 0 #down_segmentid = segment_uniqueid.
IF flag_traffic = 1 #down_segmentid = lag(#down_segmentid).
IF flag_traffic = 0 #down_street = street.
IF flag_traffic = 1 #down_street = lag(#down_street).
*now assigning unique id if current street is the same and address is within the begin & end range.
IF flag_traffic = 1 and street = #down_street and add >= #down_begin and add <= #down_end segment_uniqueid2 = #down_segmentid.
exe.
*now get rid of extraneous rows.
select if flag_traffic = 1.
exe.
*now clean up extraneous variables and re-merge data via a table match with the street segments dataset.
******************************************.
In reply to this post by Andy W
Let's see ;-)
Basic idea at the most basic level: I believe the simplicity of the code speaks for itself. Do you still believe the 'exploding' data idea is a useful item in your toolbox?

data list free / seg_beg seg_end.
begin data
1 5
6 8
9 12
13 20
end data.
SORT CASES BY seg_beg seg_end.
COMPUTE seg_num=$CASENUM.
save outfile 'seg.sav'.

data list free / point info.
begin data
2 10.5
7 9.3
10 10.5
15 15.2
end data.
SORT CASES BY point.
COMPUTE seg_beg=point.
ADD FILES / FILE 'seg.sav' / FILE * / IN=@1 / BY seg_beg.
DO IF MISSING(seg_num).
+  COMPUTE seg_end=LAG(seg_end).
+  COMPUTE seg_num=LAG(seg_num).
END IF.
EXE.
SELECT IF @1.
EXE.
You're still ignoring the fact that segments have an end number. Your code would match "101 Main St." to a street segment for "1 to 100 Main St." if no "101 - ??? Main St." address range for "Main St." existed. I couldn't figure out how to handle this without making the explicit "down_???" variables in my code above. Also, for my reference, is the EXE before the SELECT IF necessary?
It is much simpler than I had originally envisioned; nice work. That is the first time I have seen the BY subcommand used on ADD FILES. That is a nice feature I will have to keep in the memory banks. To be honest, I use the explode for other things too (I've exploded street centerline files to submit to geocoding engines so I can build my own dictionary with x,y coordinates), but assuming the not-matching-past-the-end issue can be resolved in a simpler manner than I managed, I would say this is a better solution than exploding. Still, exploding is not, IMO, as horrendous as your caps-lock key would seem to make it.

Andy
In reply to this post by Andy W
What I would do is to build a set for each
street containing all the interval bounds and then use an interval lookup
routine for each accident address after numbering all the interval records.
Once the record number is identified, regular MATCH FILES would do
the rest, or the matching could be built into the program if there is enough
memory to hold all that data. This is reasonably straightforward
to do with Python programmability using existing features.
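[Editor's aside: a minimal Python sketch of the interval-lookup routine Jon describes — binary search on the interval begins for each street, then a bounds check against the interval end. The data structure and names are my own illustration, not Jon's implementation; note the end check also addresses Andy's concern about matching past the end of a segment.]

```python
import bisect

# One list of (begin, end, segment_id) per street, sorted by begin
# and assumed non-overlapping.
segments = {
    'Main St': [(1, 100, 'seg1'), (101, 200, 'seg2')],
}

def lookup(street, address):
    """Return the segment id whose [begin, end] interval contains address,
    or None if the address falls outside every interval on that street."""
    ivals = segments.get(street, [])
    begins = [b for b, _, _ in ivals]
    i = bisect.bisect_right(begins, address) - 1  # last interval starting <= address
    if i >= 0 and address <= ivals[i][1]:
        return ivals[i][2]
    return None

# e.g. lookup('Main St', 150) -> 'seg2'; lookup('Main St', 250) -> None
```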
Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
new phone: 720-342-5621
In reply to this post by Andy W
"Also for my reference is the exe before the select if necessary?"
It is required for the LAG functions to work correctly.
In reply to this post by Jon K Peck
As with David's original comment, I really have no idea what you are saying. IMO I've chewed this over enough, though! Better for the OP to chime in if any other clarifications or alternatives are needed.

Andy
