SPSSX Discussion

SPSS Statistics Survey

Richard Ristow

Re: FW: SPSS Statistics Survey

At 02:58 AM 1/30/2015, David Marso wrote:

>In fact, *I challenge Chuck to come up with any task doable in SAS
>that can't be done (by someone like me) more elegantly using SPSS*!!

I'm still not Chuck, but here's a SAS project that I'd find much more
difficult in SPSS:

HEURISTIC NURSING-HOME MATCHING

The client had a file of records of inspections for a great many
nursing homes. Inspections were nominally once per year per home, and
the file covered several years. Records included inspection date and
results, and a good deal of identifying information, including name
and street address.

The good news was, there was a 5- or 6-character identifier (I don't
recall which) identifying the homes. The bad news was, it was common
for a home's identifier to be changed. (An ID used for one nursing
home was never reassigned to a different one, thank goodness.) So
the file had records of one or more inspections for a large number of
IDs, and a smaller (but it was unclear how much smaller) number of
nursing homes. My job (omitting some pre-processing to get where
I've described) was to identify when two (or more) IDs actually
referred to the same home, a determination that could only be made
heuristically.

The steps I took were (roughly; I'm working from memory),

A. Build a single summary record for each ID in the file. That
included earliest and latest inspection date, and identifying
information -- I believe I handled changes of name and address, etc.,
over time, by keeping the earliest and latest values found.

B. Build a file of candidate pairs -- pairs of IDs that there was
reason to suspect might actually belong to the same facility. I did
this by inner joining (or many-to-many merging) the file of summary
records with itself several times, using equality keys like address
and ZIP code that seemed likely indicators of being possibly the same
facility, using PROC SQL.

C. Evaluate the candidate pairs: Combine the lists from the various
PROC SQL runs into a single list, noting for each candidate pair,
which of the criteria it had matched on. Heuristically, evaluate the
'quality' of the match of each pair, including the degree to which
address, etc., match, and the plausibility of the time gap between
the latest inspection on record for one ID and the earliest for the
other. (Time between inspections should be near one year. However, an
important special case was inspections recorded on the same date for
both IDs.) On the basis of this evaluation, reject some pairs; accept
some as highly likely; and refer a few for human analysis.

D. Transitive closure: If A is the same facility as B, and B the same
as C, then A is the same as C. I did this in a macro loop, in each
pass using PROC SQL to join all pairs yet found to the accepted
candidate pairs, keeping the whole list of intermediate IDs.
Terminate when no pairs were added that hadn't been previously found.
Evaluate the final lists -- in a few cases, the A-C match was so
implausible that it required rejecting a previously accepted A-B or B-C match.
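
Steps B and C above (self-joining the summary file on several equality keys, then combining the lists while noting which criteria each pair matched on) can be sketched roughly as follows. This is a minimal Python illustration, not the original SAS code, which is not available; the record fields, sample IDs, and criterion labels are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical summary records, one per ID (step A); field names are assumptions.
summaries = [
    {"id": "A0001", "address": "12 Elm St", "zip": "02139", "name": "Elm Manor"},
    {"id": "B0207", "address": "12 Elm St", "zip": "02139", "name": "Elm Manor"},
    {"id": "C0555", "address": "9 Oak Ave", "zip": "02139", "name": "Oak House"},
    {"id": "D0031", "address": "9 Oak Ave", "zip": "60614", "name": "Oak House"},
]


def candidate_pairs(records, keys):
    """Self-join the records on equality of `keys`; return ID pairs, lower ID first."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in keys)].append(rec["id"])
    pairs = set()
    for ids in groups.values():
        # Every two distinct IDs sharing the same key values form a candidate pair.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# One self-join per criterion (hypothetical labels), combined into a single
# list that records which criteria each candidate pair matched on.
criteria = {"address+zip": ("address", "zip"), "name": ("name",)}
matched_on = defaultdict(set)
for label, keys in criteria.items():
    for pair in candidate_pairs(summaries, keys):
        matched_on[pair].add(label)

for pair, labels in sorted(matched_on.items()):
    print(pair, sorted(labels))
```

Grouping on the key values and pairing within each group gives the same pairs as an inner join of the file with itself, without listing each pair twice or pairing an ID with itself.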

The result was a list of IDs, each the lowest-sorting one on record
for a recognized facility, together with all IDs that were accepted
as being for that facility.
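
The loop in step D and the final list can be sketched in the same spirit: join the pairs found so far to the accepted pairs, stop when a pass adds nothing new, and key each facility by its lowest-sorting ID. This is a simplified sketch with hypothetical IDs and function names; unlike the project described above, it does not keep the intermediate IDs or the evaluations attached to each link.

```python
def close_pairs(accepted):
    """Join found pairs to the accepted pairs until a pass adds no new pair."""
    # Treat each accepted pair as symmetric: (a, b) implies (b, a).
    base = set(accepted) | {(b, a) for a, b in accepted}
    found = set(base)
    while True:
        # One pass: if (a, b) is known and (b, d) is accepted, then (a, d) follows.
        new = {(a, d) for a, b in found for c, d in base if b == c and a != d}
        new -= found
        if not new:
            return found
        found |= new


def facilities(accepted):
    """Map the lowest-sorting ID of each facility to all IDs accepted for it."""
    groups = {}
    for a, b in close_pairs(accepted):
        groups.setdefault(a, {a}).add(b)
    return {min(members): sorted(members) for members in groups.values()}


# Hypothetical accepted pairs: A-B and B-C imply A-C; D-E is a separate facility.
accepted = [("A0001", "B0207"), ("B0207", "C0555"), ("D0031", "E0900")]
print(facilities(accepted))
```

IDs that appear in no accepted pair are not listed by this sketch; in practice each would stand as its own facility.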
=============================================================================
Now, why SAS instead of SPSS? There was no actual decision; the shop
where I was doing this had SAS, and didn't have SPSS. But suppose
there had been a choice?

. One simple thing: SAS is (or at least, was) much more
printer-oriented than SPSS has been for some years. In a project
like this, with a path of maybe eight or ten fairly lengthy programs,
it was very convenient that SAS runs produced a printer-formatted
listing of the code, on which one could control page breaks for
readability, with run statistics and any error messages printed on
the same listing.

. PROC SQL to join the file of IDs with itself, to form candidate
pairs. Now that techniques for many-to-many merge in SPSS have been
worked out, that isn't such a compelling advantage (though the
techniques that have been posted would need modification for the
many-to-many merge of a file with itself). It'd still be easier in PROC SQL.

. The macro loop with PROC SQL for the transitive closure. I recall
(though don't recall the exact reasons) that the DATA step wouldn't
have worked as well for the merges; and SPSS's merge facilities are
pretty analogous to what the DATA step offers. Before Python, SPSS
had no way to loop sequentially, adding members to a list at each
pass and terminating when no new member is found; that was pretty
direct with a SAS macro.
=============================================================================
So, all in all, this project was easier in SAS -- and since it was a
complicated project at best, 'easier' was very important.

=====================
To manage your subscription to SPSSX-L, send a message to
[hidden email] (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD

Andy W

Re: FW: SPSS Statistics Survey

Richard, this is still basically what Bruce said earlier - I know how to do xyz in SAS but not in SPSS ergo SAS is better. I don't buy it - you were forced to use SAS and figured out some hacky solutions calling SQL in loops - is that really your argument? I think I would need to bring a drink with me to work if I needed to rewrite code that called SQL in a loop. I'm not judging - I've used hacky solutions myself for projects all the time, but let's recognize a hack when we see one as opposed to saying it is a benefit of the software.

I show examples of your (D) step here, https://andrewpwheeler.wordpress.com/2013/07/19/querying-graph-neighbors-in-spss/ (and Jon mentions in a comment at the end that someone did find the transitive closure in pure SPSS), and that post covers expanding all pairs as in the (C) step as well. All of the steps are tedious, but they have likely been doable in some capacity in SPSS for a long time now. (For a Python solution to find the transitive closure using SPSS see here, https://andrewpwheeler.wordpress.com/2014/04/22/finding-subgroups-in-a-graph-using-networkx-and-spss/; it works mind-bogglingly fast on some of the fairly big graphs I have fed it. SAS now has some network routines, and the operations research blog has some neat examples, http://blogs.sas.com/content/operations/, so it is possibly a simple task now in SAS. Another solution is to take powers of a binary adjacency matrix, which is pretty easy in MATRIX but needs a smaller number of nodes in the graph to be feasible.)
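
The adjacency-matrix approach mentioned at the end of the paragraph above can be illustrated with a short sketch. It is written in plain Python rather than SPSS MATRIX syntax, and the function names and sample graph are hypothetical: the Boolean matrix (with self-loops added) is squared repeatedly until it stops changing, at which point entry (i, j) is true exactly when node j can be reached from node i.

```python
def bool_matmul(x, y):
    """Boolean matrix product: entry (i, j) is True if some k links i to k and k to j."""
    n = len(x)
    return [
        [any(x[i][k] and y[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def reachability(adj):
    """Square (adjacency + identity) repeatedly until it no longer changes."""
    n = len(adj)
    # Self-loops make each squaring cover all paths up to twice the previous length.
    reach = [[bool(adj[i][j]) or i == j for j in range(n)] for i in range(n)]
    while True:
        squared = bool_matmul(reach, reach)
        if squared == reach:
            return reach
        reach = squared


# Hypothetical graph: nodes 0-1 and 1-2 are linked; node 3 is isolated.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
for row in reachability(adj):
    print([int(v) for v in row])
```

Each product costs on the order of n cubed operations and the matrix holds n squared entries, which is why the approach only suits graphs with a modest number of nodes.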

Now, I agree everything you did here can also be done likely in any major statistical software package, so this isn't an argument for SPSS over SAS (or Stata or R) either, but let's not pretend we're unbiased arbiters. I prefer SPSS because I have used it on a near daily basis since 2008, and I can do jobs in SPSS much faster than I can in other languages, but I at least recognize that the same things can be done in most other software (probably 90% of what I do could be written as a series of SQL queries without too much hassle!).

Now, you can argue about things from 15 to 20 years ago (I was stuck with SPSS V15 for quite some time at a job, which I believe came out in 2006, and its printing was fine, so I'm not quite sure how far back you have to go to find a version that doesn't print the output). Maybe if you build a time machine you can make them pertinent to a contemporary discussion.

Pro-tip: The best software I've used for fuzzy matching is http://fril.sourceforge.net/screens.html (I know Jon has written some scripts for SPSS but I have not used them so cannot comment.) If the nursing homes had addresses you can simply geocode them and match, presuming they are at the same or near-same location. For jobs like this, though, I've sometimes just read the names and made a master list in a spreadsheet to match the IDs (if it is a one-time job or only needs an irregular update, that can be quicker than spending a month writing hacky code that needs fixing on a regular basis.)

Andy W
apwheele@gmail.com
http://andrewpwheeler.wordpress.com/

Richard Ristow

Re: FW: SPSS Statistics Survey

At 08:30 AM 2/9/2015, Andy W wrote:

>Richard, this is still basically what Bruce said earlier - I know
>how to do xyz in SAS but not in SPSS ergo SAS is better.

Andy, you do me an injustice. You quote me as saying something I did
not say, quoting Bruce who attributed to yet another poster something
that HE did not say.

I said, and reaffirm, that I have written some projects in SAS that I
would rather do that way than in SPSS; and I have written enough in
both SAS and SPSS to claim an educated -- though far from infallible
-- opinion.

As for "SAS is better than SPSS", I most assuredly have not said
that, and would not. Given free access to both, I would write SPSS
for projects up to a (subjective) level of complexity, and SAS beyond
that. And I would far, far rather start a new user in SPSS than in
SAS -- I was once moved to describe SAS's philosophy as, "Of the
programmers, by the programmers, and for the programmers."

>I don't buy it - you were forced to use SAS and figured out some
>hacky solutions calling SQL in loops - is that really your argument?

No, it's my description of what I did. 'Hacky', being an aesthetic
judgement, is neither provable nor disprovable, but I'm surprised
that you so characterize the use of PROC SQL in a macro
loop. Because it used PROC SQL instead of a DATA step? On the whole
I prefer DATA step solutions myself, and since I don't have the code
in front of me I'm not sure of the exact reason I didn't; I believe
it was simply easier to write code to be executed an indefinite
number of times, adding new variables to the output at each pass. (I
needed not just the degree of the connection, but the intermediate
members at all degrees and the evaluations that had been assigned to
all the intermediate connections.)

Or, because it used a macro loop? But graph-closure solutions do
require multiple data passes -- your
"querying-graph-neighbors-in-spss" takes three, I believe. If it
isn't satisfactory to terminate after a fixed number of passes, one
needs a meta-programming loop; and a macro loop is the natural way
to do that in SAS, as Python (now) is in SPSS.

Now, at the time there were no meta-programming loops in SPSS. As
for the data passes, I'm going to have to admit I don't recall the
complete argument for SQL over a DATA step; and if it could have been
done in a DATA step in SAS, it could have been in a transformation
program in SPSS.

As for its having been "hacky code that needs fixing on a regular
basis", no; it was careful code, and went through at least a number
of years of production use as new inspection records were received and added.

>Now, I agree everything you did here can also be done likely in any
>major statistical software package, so this isn't an argument for
>SPSS over SAS (or Stata or R) either, but let's not pretend we're
>unbiased arbiters.

I have not remotely pretended to be so. I do state that I am someone
with significant experience in both SPSS and SAS, and that I offer an
informed and considered opinion.

(You write, correctly, that there are important tools that would now
be applied to the project I describe, that I did not have available
at the time.)

Jon K Peck

Re: FW: SPSS Statistics Survey

I recall seven years ago working with a large SPSS Statistics customer in Europe who had implemented transitive closure in Statistics (then just called SPSS) on what at the time were considered large datasets. I was impressed as it was not easy to do back then, and they had a robust and efficient solution. Now, of course, this task and many similar would be much easier to do with the improvements in Statistics since then.

Jon Peck (no "h") aka Kim
Senior Software Engineer, IBM
[hidden email]
phone: 720-342-5621
