At 02:58 AM 1/30/2015, David Marso wrote:
>In fact, *I challenge Chuck to come up with any task doable in SAS
>that can't be done (by someone like me) more elegantly using SPSS*!!

I'm still not Chuck, but here's a SAS project that I'd find much more difficult in SPSS:

HEURISTIC NURSING-HOME MATCHING

The client had a file of records of inspections for a great many nursing homes. Inspections were nominally once per year per home, and the file covered several years. Records included inspection date and results, and a good deal of identifying information, including name and street address.

The good news was, there was a 5- or 6-character identifier (I don't recall which) identifying the homes. The bad news was, it was common for a home's identifier to be changed. (An ID used for one nursing home was never reassigned to a different one, thank goodness.)

So the file had records of one or more inspections for a large number of IDs, and a smaller (but it was unclear how much smaller) number of nursing homes. My job (omitting some pre-processing to get where I've described) was to identify when two (or more) IDs actually referred to the same home, a determination that could only be made heuristically.

The steps I took were (roughly; I'm working from memory),

A. Build a single summary record for each ID in the file. That included earliest and latest inspection date, and identifying information -- I believe I handled changes of name and address, etc., over time, by keeping the earliest and latest values found.

B. Build a file of candidate pairs -- pairs of IDs that there was reason to suspect might actually belong to the same facility. I did this by inner joining (or many-to-many merging) the file of summary records with itself several times, using equality keys like address and ZIP code that seemed likely indicators of being possibly the same facility, using PROC SQL.

C. Evaluate the candidate pairs: Combine the lists from the various PROC SQL runs into a single list, noting for each candidate pair, which of the criteria it had matched on. Heuristically, evaluate the 'quality' of the match of each pair, including the degree to which address, etc., match, and the plausibility of the time gap between the latest inspection on record for one ID and the earliest for the other. (Time between inspections should be near one year. However, an important special case was inspections recorded on the same date for both IDs.) On the basis of this evaluation, reject some pairs; accept some as highly likely; and refer a few for human analysis.

D. Transitive closure: If A is the same facility as B, and B the same as C, then A is the same as C. I did this in a macro loop, in each pass using PROC SQL to join all pairs yet found to the accepted candidate pairs, keeping the whole list of intermediate IDs. Terminate when no pairs were added that hadn't been previously found. Evaluate the final lists -- in a few cases, the A-C match was so implausible that it required rejecting a previously accepted A-B or B-C match.

The result was a list of IDs, each the lowest-sorting one on record for a recognized facility, together with all IDs that were accepted as being for that facility.

=============================================================================

Now, why SAS instead of SPSS? There was no actual decision; the shop where I was doing this had SAS, and didn't have SPSS. But suppose there had been a choice?

. One simple thing: SAS is (or at least, was) much more printer-oriented than SPSS has been for some years. In a project like this, with a path of maybe eight or ten fairly lengthy programs, it was very convenient that SAS runs produced a printer-formatted listing of the code, on which one could control page breaks for readability, with run statistics and any error messages printed on the same listing.

. PROC SQL to join the file of IDs with itself, to form candidate pairs. Now that techniques for many-to-many merge in SPSS have been worked out, that isn't such a compelling advantage (though the techniques that have been posted would need modification for the many-to-many merge of a file with itself). It'd still be easier in PROC SQL.

. The macro loop with PROC SQL for the transitive closure. I recall (though don't recall the exact reasons) that the DATA step wouldn't have worked as well for the merges; and SPSS's merge facilities are pretty analogous to what the DATA step offers. Before Python, SPSS had no way to loop sequentially, adding members to a list at each pass and terminating when no new member is found; that was pretty direct with a SAS macro.

=============================================================================

So, all in all, this project was easier in SAS -- and since it was a complicated project at best, 'easier' was very important.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
|
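For readers who want to see the shape of steps B and C from the message above, here is a minimal Python sketch of a self-join on equality keys that records which criteria each candidate pair matched on. It is not the original PROC SQL code, which is not shown in the thread. The field names (`id`, `address`, `zip`), the function name `candidate_pairs`, and the sample records are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(summaries, keys):
    """Self-join summary records on each equality key.

    Returns {(id_a, id_b): set of keys the pair matched on}, with id_a < id_b
    so that each unordered pair appears only once.
    """
    matched = defaultdict(set)
    for key in keys:
        # Group IDs by the value of this key (one "join" per key).
        groups = defaultdict(list)
        for rec in summaries:
            value = rec.get(key)
            if value:  # assumption: a missing value never matches anything
                groups[value].append(rec["id"])
        # Every pair of distinct IDs sharing a value is a candidate.
        for ids in groups.values():
            for a, b in combinations(sorted(set(ids)), 2):
                matched[(a, b)].add(key)
    return dict(matched)


# Hypothetical one-record-per-ID summaries (step A output).
summaries = [
    {"id": "A1001", "address": "12 ELM ST", "zip": "02139"},
    {"id": "B2002", "address": "12 ELM ST", "zip": "02139"},
    {"id": "C3003", "address": "9 OAK AVE", "zip": "02139"},
    {"id": "D4004", "address": "40 PINE RD", "zip": "01002"},
]

pairs = candidate_pairs(summaries, ["address", "zip"])
for pair, criteria in sorted(pairs.items()):
    print(pair, sorted(criteria))
```

The set of matched criteria attached to each pair is what a later, heuristic scoring pass (step C) would consume; the scoring rules themselves are not described in enough detail in the thread to sketch here.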
Richard, this is still basically what Bruce said earlier - I know how to do xyz in SAS but not in SPSS ergo SAS is better. I don't buy it - you were forced to use SAS and figured out some hacky solutions calling SQL in loops - is that really your argument? I think I would need to bring a drink with me to work if I needed to rewrite code that called SQL in a loop. I'm not judging - I've used hacky solutions myself for projects all the time, but let's recognize a hack when we see one as opposed to saying it is a benefit of the software.
I show examples of your step (D) here, https://andrewpwheeler.wordpress.com/2013/07/19/querying-graph-neighbors-in-spss/ (and Jon mentions in a comment at the end that someone did find the transitive closure in pure SPSS), and it covers expanding all pairs as in step (C) as well. All of the steps are tedious, but they likely could have been performed in some capacity in SPSS for a long time now. (For a Python solution to find the transitive closure using SPSS see here, https://andrewpwheeler.wordpress.com/2014/04/22/finding-subgroups-in-a-graph-using-networkx-and-spss/, it works mind-bogglingly fast on some of the fairly big graphs I have fed it. SAS now has some network routines, the operations research blog has some neat examples, http://blogs.sas.com/content/operations/, so it is possibly a simple task now in SAS. Another solution is to take powers of a binary adjacency matrix, pretty easy in MATRIX but needs a smaller number of nodes in the graph to be feasible.)

Now, I agree everything you did here can likely also be done in any major statistical software package, so this isn't an argument for SPSS over SAS (or Stata or R) either, but let's not pretend we're unbiased arbiters. I prefer SPSS because I have used it on a near daily basis since 2008, and I can do jobs in SPSS much faster than I can in other languages, but I at least recognize that the same things can be done in most other software (probably 90% of what I do could be written as a series of SQL queries without too much hassle!)

Now, you can argue about things from 15~20 years ago (I was stuck with SPSS V15 for quite some time at a job, which I believe came out in 2006, and its printing was fine, so I'm not quite sure how far you have to go back to find a version that doesn't print the output); maybe if you build a time machine you can make them pertinent to a contemporary discussion.
Pro-tip: The best software I've used for fuzzy matching is http://fril.sourceforge.net/screens.html (I know Jon has written some scripts for SPSS but I have not used them so cannot comment.) If the nursing homes had addresses, you can simply geocode them and match, presuming they are at the same or near-same location. For jobs like this, though, I've sometimes just read the names and made a master list in a spreadsheet to match the IDs (if it is a one-time job or only needs an irregular update, that can be quicker than spending a month writing hacky code that needs fixing on a regular basis.) |
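The adjacency-matrix idea mentioned in the post above can be illustrated with a small Python sketch. This is not the SPSS MATRIX code and not the linked networkx solution; it is one common variant, assumed here for illustration: add self-loops to the binary adjacency matrix and square it (Boolean multiplication) until it stops changing. The function names, node labels, and the tiny graph are hypothetical.

```python
def bool_matmul(x, y):
    """Boolean matrix product: entry (i, j) is True if some k links i to j."""
    n = len(x)
    return [
        [any(x[i][k] and y[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def reachability(adj):
    """Transitive closure by repeated Boolean squaring of (A OR I)."""
    n = len(adj)
    # Self-loops make each squaring keep the shorter paths already found.
    reach = [[bool(adj[i][j]) or i == j for j in range(n)] for i in range(n)]
    while True:
        squared = bool_matmul(reach, reach)
        if squared == reach:  # fixed point: no new connections appeared
            return reach
        reach = squared


# Hypothetical graph: A-B and B-C are linked, D is isolated.
ids = ["A", "B", "C", "D"]
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

reach = reachability(adj)
# Each row of the closure lists one node's whole group; deduplicate the rows.
groups = sorted({tuple(ids[j] for j in range(len(ids)) if row[j]) for row in reach})
print(groups)
```

The matrix has one row and one column per node and each multiplication is cubic in the node count, which is consistent with the remark that this approach needs a smaller number of nodes to be feasible.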
At 08:30 AM 2/9/2015, Andy W wrote:
>Richard, this is still basically what Bruce said earlier - I know
>how to do xyz in SAS but not in SPSS ergo SAS is better.

Andy, you do me an injustice. You quote me as saying something I did not say, quoting Bruce who attributed to yet another poster something that HE did not say.

I said, and reaffirm, that I have written some projects in SAS that I would rather do that way than in SPSS; and I have written enough in both SAS and SPSS to claim an educated -- though far from infallible -- opinion.

As for "SAS is better than SPSS", I most assuredly have not said that, and would not. Given free access to both, I would write SPSS for projects up to a (subjective) level of complexity, and SAS beyond that. And I would far, far rather start a new user in SPSS than in SAS -- I was once moved to describe SAS's philosophy as, "Of the programmers, by the programmers, and for the programmers."

>I don't buy it - you were forced to use SAS and figured out some
>hacky solutions calling SQL in loops - is that really your argument?

No, it's my description of what I did. 'Hacky', being an aesthetic judgement, is neither provable nor disprovable, but I'm surprised that you so characterize the use of PROC SQL in a macro loop.

Because it used PROC SQL instead of a DATA step? On the whole I prefer DATA step solutions myself, and since I don't have the code in front of me I'm not sure of the exact reason I didn't; I believe it was simply easier to write code to be executed an indefinite number of times, adding new variables to the output at each pass. (I needed not just the degree of the connection, but the intermediate members at all degrees and the evaluations that had been assigned to all the intermediate connections.)

Or, because it used a macro loop? But graph-closure solutions do require multiple data passes -- your "querying-graph-neighbors-in-spss" takes three, I believe. If it isn't satisfactory to terminate after a fixed number of passes, one needs a meta-programming loop; and a macro loop is the natural way to do that in SAS, as Python (now) is in SPSS.

Now, at the time there were no meta-programming loops in SPSS. As for the data passes, I'm going to have to admit I don't recall the complete argument for SQL over a DATA step; and if it could have been done in a DATA step in SAS, it could have been in a transformation program in SPSS.

As for its having been "hacky code that needs fixing on a regular basis", no; it was careful code, and went through at least a number of years of production use as new inspection records were received and added.

>Now, I agree everything you did here can also be done likely in any
>major statistical software package, so this isn't an argument for
>SPSS over SAS (or Stata or R) either, but let's not pretend we're
>unbiased arbiters.

I have not remotely pretended to be so. I do state that I am someone with significant experience in both SPSS and SAS, and that I offer an informed and considered opinion.

(You write, correctly, that there are important tools that would now be applied to the project I describe, that I did not have available at the time.)
|
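The loop structure discussed in the message above (join everything found so far to the accepted pairs, keep the intermediate IDs, stop when a pass adds nothing new) can be sketched in Python as follows. This is an illustration under stated assumptions, not the original SAS macro: the function names `close_pairs` and `facility_ids`, the IDs, and the accepted pairs are hypothetical, and the re-evaluation of implausible A-C matches is left out.

```python
def close_pairs(accepted):
    """Iterate to a fixed point over accepted same-facility pairs.

    Returns (found, passes), where found maps (start, end) to the chain of
    IDs linking them, intermediates included.
    """
    # Treat each accepted pair as symmetric.
    edges = set()
    for a, b in accepted:
        edges.add((a, b))
        edges.add((b, a))

    found = {(a, b): (a, b) for a, b in edges}
    passes = 0
    while True:
        passes += 1
        new = {}
        # One pass: join every pair found so far to the accepted pairs.
        for (start, mid), path in found.items():
            for a, b in edges:
                extends = a == mid and b != start
                if extends and (start, b) not in found and (start, b) not in new:
                    new[(start, b)] = path + (b,)
        if not new:  # terminate when a pass adds nothing new
            return found, passes
        found.update(new)


def facility_ids(found):
    """Map each ID to the lowest-sorting ID among everything linked to it."""
    members = {}
    for start, end in found:
        members.setdefault(start, {start}).add(end)
    return {i: min(group) for i, group in members.items()}


# Hypothetical accepted pairs from the evaluation step.
accepted = [("K100", "M200"), ("M200", "Z900"), ("P500", "Q600")]
found, passes = close_pairs(accepted)
rep = facility_ids(found)
print(found[("K100", "Z900")])
print(rep)
```

The number of passes is not known in advance, which is the point made above about needing a loop outside the data pass itself; keeping the full chain for each pair is what would allow a later review of which accepted link produced an implausible match.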
I recall seven years ago working with a
large SPSS Statistics customer in Europe who had implemented transitive
closure in Statistics (then just called SPSS) on what at the time were
considered large datasets. I was impressed as it was not easy to
do back then, and they had a robust and efficient solution. Now,
of course, this task and many similar ones would be much easier to do with the
improvements in Statistics since then.
Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621

From: Richard Ristow <[hidden email]>
To: [hidden email]
Date: 02/09/2015 01:13 PM
Subject: Re: [SPSSX-L] FW: SPSS Statistics Survey
Sent by: "SPSSX(r) Discussion" <[hidden email]>
|