Here is a somewhat general solution for attaching the closest fuzzy-matching value from file1 to file2 (using the variable merge as the key and Other1 as the data to be attached to file2). Usually a simple LAG of Other1 will be sufficient.
Note this specifically targets records in file2 which are sandwiched between two records in file1: calculate the smallest difference, then grab the value of Other1 associated with that record. What I mean by granularity is that you arbitrarily build your 'explosion' in increments of 1. What if you need 0.1 increments? In other words, building a large 'explosion' in order to do an exact match to arbitrary values is brute force, inelegant, and rather like using a sledgehammer to do framing. In retrospect, this does not look ahead enough in file1 to see whether there are clumps of file2 closer to the lead of file1 than the lag. That could be done, but I'm not going to bother at the moment. Hint: sort in reverse, repeat, and compare results. Hopefully this communicates the basic ideas, and 'exploding' data is forever vanquished from the toolbox.

INPUT PROGRAM.
LOOP caseid=1 TO 1000.
COMPUTE merge=UNIFORM(100).
COMPUTE Other1=TRUNC(UNIFORM(500)).
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
SORT CASES BY merge.
SAVE OUTFILE 'C:\Users\David\temp\temp1.sav'.

INPUT PROGRAM.
LOOP caseid=1 TO 1000.
COMPUTE merge=UNIFORM(100).
COMPUTE Other2=TRUNC(UNIFORM(500)).
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
SORT CASES BY merge.
SAVE OUTFILE 'C:\Users\David\temp\temp2.sav'.

ADD FILES FILE 'C:\Users\David\temp\temp1.sav' / IN=@1
 / FILE 'C:\Users\David\temp\temp2.sav' / IN=@2
 / BY merge.
EXE.
IF MISSING(Other1) Other1=LAG(Other1).
SHIFT VALUES VARIABLE=merge RESULT=lead_merge LEAD=1
 /VARIABLE=merge RESULT=lag_merge LAG=1
 /VARIABLE=Other1 RESULT=lead_other1 LEAD=1
 /VARIABLE=@1 RESULT=lead_@1 LEAD=1.
DO IF (@2 AND LAG(@1) AND lead_@1).
+  DO IF ABS(SUM(merge,-1*lead_merge)) LT ABS(SUM(merge,-1*lag_merge)).
+    COMPUTE otherxx=lead_other1.
+  ELSE.
+    COMPUTE otherxx=LAG(Other1).
+  END IF.
ELSE.
+  COMPUTE otherxx=Other1.
END IF.
IF $CASENUM EQ 1 AND MISSING(Other1) otherxx=lead_other1.
EXE.
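[Editor's aside: to make the 'sandwich' logic concrete outside of SPSS, here is a minimal Python sketch of the same nearest-key idea — for each file2 key, compare the file1 record just below and just above it and take the closer one. The function name and data are illustrative, not part of the original syntax.]

```python
import bisect

def nearest_match(file1_keys, file1_vals, file2_keys):
    """For each key in file2, attach the file1 value whose key is closest.
    file1_keys must be sorted ascending (as after SORT CASES BY merge)."""
    out = []
    for k in file2_keys:
        i = bisect.bisect_left(file1_keys, k)
        # candidates: the file1 record just below (lag) and just above (lead) k
        cands = []
        if i > 0:
            cands.append(i - 1)
        if i < len(file1_keys):
            cands.append(i)
        best = min(cands, key=lambda j: abs(file1_keys[j] - k))
        out.append(file1_vals[best])
    return out

# e.g. nearest_match([1.0, 4.0, 9.0], ['a', 'b', 'c'], [3.9, 8.0]) -> ['b', 'c']
```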
Please reply to the list and not to my personal email.
Those desiring my consulting or training services please feel free to email me. --- "Nolite dare sanctum canibus neque mittatis margaritas vestras ante porcos ne forte conculcent eas pedibus suis." Cum es damnatorum possederunt porcos iens ut salire off sanguinum cliff in abyssum?" |
Ok, to be clear, one's "best" solution will ultimately depend on the nature of the data (e.g. how big one's files are) and the nature of the task (e.g. does it need to be done once or on a regular basis), as well as the complexity of the actual solution itself. I agree that "exploding" could be bad in some obvious circumstances, mainly when the exploded table is too big to handle, but let's be clear about those circumstances instead of denouncing it while providing an incomplete alternative in its place.
First, in regards to granularity, *this is not a problem at all* if you need to increment by a value other than an integer. Although the loop is constrained to increment by integers, if you need to increment by 0.1 or 0.01, just multiply your values to be integers and then divide the result back to the original values. Although more difficult, this approach could even be accommodated to non-linear interpolations. Again, it is a problem if the exploded file is too big to handle (in my experience, depending on the system and what your variables are, somewhere between 20-40 million records), and certainly more cumbersome if the file is very big. But note *you only need to make the exploded file and sort it once*, so even then it will likely not be that big of a deal. So, as far as I'm concerned, "granularity" is a moot point.

Let's also remember what information we do have about the OP's problem: these are *stretches of highway with given distance markers*. For an example, let's say we needed to do this for the *entire* US interstate highway system, and we had a marker every 0.01 of a mile (this would amount to a marker every 52.8 feet, well below the hectometer the OP originally specified). According to Wikipedia, <http://en.wikipedia.org/wiki/Interstate_Highway_System>, there are only 47,182 miles of interstate highways in the continental US. Exploding such a table would only result in about 4.7 million records, well within the capabilities of SPSS to handle without too much trouble. Also note again that with the exploding of the table, you only need to create the set and sort it once (or however often the reference street set is updated, at best once a year in my experience with geo-data). Now, if the OP feels like chiming in she can state how long her road set is, and we can evaluate whether exploding is feasible or not.
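[Editor's aside: the multiply-then-divide trick above can be sketched in a few lines of Python. This is purely illustrative — the function name and step values are my own, not from the thread.]

```python
# Exploding a range in fractional increments by working in scaled integers:
# multiply to integers, loop by 1, then divide back to the original units.
def explode(begin, end, step=0.1):
    scale = round(1 / step)                    # e.g. step=0.1 -> work in tenths
    b, e = round(begin * scale), round(end * scale)
    return [i / scale for i in range(b, e + 1)]

# e.g. explode(1.0, 1.5) -> [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]
```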
Now again, we don't really know the nature of the task nor the exact nature of the data, but I will give an example of cases I have used this for. Say I have a dataset of 300,000 calls to 911 that have street addresses in one medium-sized US city (for cities I have worked with, this could be between 3-10 years of CAD data). Let's also say that I have a street centerline file that has the beginning and end address ranges as well as the street, and that when exploded to form all potential addresses within said city only amounts to 100,000 potential addresses. Is the add-files solution, which will take multiple sortings of the 400,000 calls + street segments (which probably have a variety of data already attached), a better solution than the one-time sort of the calls and the exploded dataset? I don't think so, and IMO in that example the lesser evil is exploding the street segments file. In the end I don't do such tasks on a regular basis, so any solution would work, but the exploding and merging is IMO so simple it should be considered even if you got your add-and-interweave-records approach to work.

I will take a look at your code and see if I can update with how I would do it when I get a chance. I feel as if you are intentionally ignoring the complexity of the situation (we have both a begin and an end of the segment, and you don't want to match records that are past the end). But I will see if I can figure it out when I need a break from some of my other work. As you can tell I have a need for such solutions, so I'm all for better ways to do such a task. I'm currently unconvinced, though, *for my particular situation*, that this is obviously better. Can't say much about the OP's (I haven't worked with marked highways like that), but I would guess exploding is not unmanageable.

Andy
"I will take a look at your code and see if I can update with how I would do it when I get a chance. I feel as if you are intentionally ignoring the complexity of the situation"...
Hmmm, so you haven't bothered to look at my code before jumping back on the soapbox? 'Exploding' the data does not address the 'complexity' of the situation. It is NOT complex; it is utterly trivial.
David,
Feel free to reply with well-reasoned positions and evidence for your assertions. It is not a personal attack, and I have provided reasons why I believe my advice is appropriate.

Andy
In reply to this post by David Marso
Because David ruffled my feathers a bit, here is how *I would* do the add files and interweave. I'll get the soapbox out tomorrow when I get a chance to discuss the two options some more.
******************************************.
*this replicates what the street range segments look like.
set seed = 10.
input program.
loop #i = 1 to 100.
compute begin_add = 1.
compute end_add = 100.
compute street = #i.
compute segment_info = RV.NORMAL(0,10).
compute segment_uniqueid = 7 * #i.
end case.
end loop.
end file.
end input program.
dataset name segments.
execute.
*I need this info for the sort after add files.
dataset activate segments.
compute add = begin_add.
*ok lets replicate what the traffic volume table looks like.
input program.
loop #i = 1 to 100.
loop #j = 1 to 20.
compute add = TRUNC(RV.UNIFORM(1,100)).
compute traffic = RV.POISSON(5).
compute street = #i.
end case.
end loop.
end loop.
end file.
end input program.
dataset name traffic.
execute.
*now to add the two files.
dataset activate traffic.
add files file = * /in flag_traffic /file = 'segments'.
*now sort so the segments record is above the appropriate traffic records.
sort cases by street add flag_traffic.
*you need a set of temp begin & end address range variables for the next if statements to work.
IF flag_traffic = 0 #down_begin = begin_add.
IF flag_traffic = 1 #down_begin = lag(#down_begin).
IF flag_traffic = 0 #down_end = end_add.
IF flag_traffic = 1 #down_end = lag(#down_end).
IF flag_traffic = 0 #down_segmentid = segment_uniqueid.
IF flag_traffic = 1 #down_segmentid = lag(#down_segmentid).
IF flag_traffic = 0 #down_street = street.
IF flag_traffic = 1 #down_street = lag(#down_street).
*now assigning unique id if current street is the same and address is within the begin & end range.
IF flag_traffic = 1 and street = #down_street and add >= #down_begin and add <= #down_end segment_uniqueid2 = #down_segmentid.
exe.
*now get rid of extraneous rows.
select if flag_traffic = 1.
exe.
*now clean up extraneous variables and re-merge data via a table match with the street segments dataset.
******************************************.
In reply to this post by Andy W
Let's see ;-)
Basic idea at the most basic level: I believe the simplicity of the code speaks for itself. Do you still believe the 'exploding' data idea is a useful item in your toolbox?

data list free / seg_beg seg_end.
begin data
1 5
6 8
9 12
13 20
end data.
SORT CASES BY seg_beg seg_end.
COMPUTE seg_num=$CASENUM.
save outfile 'seg.sav'.

data list free / point info.
begin data
2 10.5
7 9.3
10 10.5
15 15.2
end data.
SORT CASES BY point.
COMPUTE seg_beg=point.
ADD FILES / FILE 'seg.sav' / FILE * / IN=@1 / BY seg_beg.
DO IF MISSING(seg_num).
+  COMPUTE seg_end=LAG(seg_end).
+  COMPUTE seg_num=LAG(seg_num).
END IF.
EXE.
SELECT IF @1.
EXE.
You're still ignoring the fact that segments have an end number. Your code would match "101 Main St." to a street segment for "1 to 100 Main St." if no "101 - ??? Main St." address range for "Main St." existed. I couldn't figure out how to handle this without making the explicit "down_???" variables in my code above. Also, for my reference, is the EXE before the SELECT IF necessary?
It is much simpler than I had originally envisioned; nice work. That is the first time I have seen the BY subcommand used on ADD FILES. That is a nice feature I will have to keep in the memory banks. To be honest, I use the explode for other things too (I've exploded street centerline files to submit to geocoding engines so I can build my own dictionary with x,y coordinates), but assuming the not-matching-past-the-end issue can be resolved in a simpler manner than I managed, I would say this is a better solution than exploding. Still, exploding is not, IMO, as horrendous as your caps-lock key would seem to make it.

Andy
In reply to this post by Andy W
What I would do is to build a set for each
street containing all the interval bounds and then use an interval lookup
routine for each accident address after numbering all the interval records.
Once the record number is identified, regular MATCH FILES would do
the rest, or the matching could be built into the program if there is enough
memory to hold all that data. This is reasonably straightforward
to do with Python programmability using existing features.
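[Editor's aside: a minimal Python sketch of the interval-lookup routine Jon describes — binary search on the interval begins for each street, then a bounds check against the interval end. The data structure and names are my own illustration, not Jon's implementation; note the end check also addresses Andy's concern about matching past the end of a segment.]

```python
import bisect

# One list of (begin, end, segment_id) per street, sorted by begin
# and assumed non-overlapping.
segments = {
    'Main St': [(1, 100, 'seg1'), (101, 200, 'seg2')],
}

def lookup(street, address):
    """Return the segment id whose [begin, end] interval contains address,
    or None if the address falls outside every interval on that street."""
    ivals = segments.get(street, [])
    begins = [b for b, _, _ in ivals]
    i = bisect.bisect_right(begins, address) - 1  # last interval starting <= address
    if i >= 0 and address <= ivals[i][1]:
        return ivals[i][2]
    return None

# e.g. lookup('Main St', 150) -> 'seg2'; lookup('Main St', 250) -> None
```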
Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
new phone: 720-342-5621
In reply to this post by Andy W
"Also for my reference is the exe before the select if necessary?"
It is required for the LAG functions to work correctly.
In reply to this post by Jon K Peck
As with David's original comment, I really have no idea what you are saying. IMO I've chewed this over enough, though! Better for the OP to chime in if any other clarifications or alternatives are needed.

Andy
