I want to carry out a few data validation checks on my data. I have 7 validation checks so far. It’s check number E2 that I find weird. Any advice would be welcome.
string E1 E2 E3 E4 E5 E6 E7 (A1).
compute TFN = number(e416, F12.0).
if missing(n384) or (n384 eq 0) E1 = '1'.
if ((n384 - n381) NE n558) E2 = '2'.
if (n558 gt n384) E3 = '3'.
if any(e490, '401', '410') and ((missing(n558)) or (n558 EQ 0)) E4 = '4'.
if any(e490, '401', '410') and (e488 EQ 'ZZZZZZZZZZ') E5 = '5'.
if any(e490, '401', '410') and missing(TFN) E6 = '6'.
if any(e490, '501', '510') and (n558 GT 0) E7 = '7'.
select if (E1 EQ '1') OR (E2 EQ '2') OR (E3 EQ '3') OR (E4 EQ '4') OR (E5 EQ '5') OR (E6 EQ '6') OR (E7 EQ '7').
list vars = year sem E313 E307 e490 n384 n381 n529 n558 tfn e488 E1 E2 E3 E4 E5 E6 E7.

The relevant columns of the output (realigned) are:

e490      n384      n381     n529     n558 E1 E2 E3 E4 E5 E6 E7
 230   2252.01   2000.00    63.00   252.01     2
 410     38.00       .00      .00    38.00                 6
 410     76.00       .00      .00    76.00                 6
 410    190.00       .00      .00   190.00                 6
 410     76.00       .00      .00    76.00                 6
 410     95.00     38.00      .00    57.00                 6

What the output is saying in regard to E2 is that case 1 is in error and case 6 is fine. Sure, record 6 is good: 95 - 38 = 57 and the value of n558 is also 57, therefore no error and no error is flagged. But look at case 1: 2252.01 - 2000 = 252.01 and the value of n558 is also 252.01 (i.e. identical values), yet E2 shows it as an error by displaying '2' against the record.

The following were used because the data had been held as text, even though the values were numbers:

compute n384 = number(e384, F12.2).
compute n381 = number(e381, F12.2).
compute n529 = number(e529, F12.2).
compute n558 = number(e558, F12.2).

I don't know if that's relevant to the issue.
Calculate the actual difference. Due to the nature of floating point arithmetic, numbers with a decimal part often do not have an exact representation in floating point. You would need a very small fuzz factor in the test.
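For example, a minimal sketch of what a fuzzed E2 check could look like (the .005 tolerance is an assumption, not part of the original code: half a cent, since these appear to be two-decimal currency amounts; any threshold comfortably larger than the representation error and smaller than a real discrepancy would do):

if (ABS((n384 - n381) - n558) GT .005) E2 = '2'.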
Jon,

I've tried to find how the floating point numbers are represented internally in SPSS but couldn't find anything. Wouldn't it be a good idea to declare a print format for n384, n381, and n558 of, say, F20.15 and print out the case with e490=230? If what you say below is true, we'd expect some number(s) beyond the 2nd decimal place, and printing out more decimal values should reveal it. Though this seems reasonable to me, I think that the error may be elsewhere, but it is hard to determine where because (a) there is distracting junk in the code (for example, in LIST: year, sem, tfn; there are no values for these), and/or (b) there may be syntax elsewhere that is the source of the problem.

By the way, why are there seven E vars instead of one E with seven values? The variables E1-E7 are going to be sparse unless a case can have more than one E variable. It certainly confuses the output below (even after putting it in Notepad; there is only one column).

-Mike Palij
New York University
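A sketch of the diagnostic being suggested here, using the F20.15 format and the e490 value from the posts above (this assumes e490 is the string variable shown in the listing; adjust the comparison if it is numeric):

temporary.
select if (e490 EQ '230').
formats n384 n381 n558 (F20.15).
list n384 n381 n558.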
Got it licked!
Trying to trick the system is tricky. My calculated value of zero actually equals 0.0000000000291, which is not the same as 0. Multiplying the values by 100 to remove the decimal component didn't really help; as you might surmise, even though 2252.01 x 100 = 225201, in SPSS it is represented as 225201.00000000003. However, when I added TRUNC everything worked out just fine. The following works:

if ((TRUNC(n384*100) - TRUNC(n381*100) - TRUNC(n558*100)) ne 0) E2 = '2'.
In the long run you will be better off with something like:

IF (ABS(n384 - n381 - n558) GT .00000000001) E2='2'.
"2252.01 x 100 = 225201 in spss it is represented as 225201.00000000003"

It might just as easily have been represented as 225200.999999999999999, and then you would be screwed.
In reply to this post by Ron0z
A few points on this.

- The variable format has no effect on the actual internal value of a number. It just says how to display the number or, on input, how to interpret the digits.
- SPSS, like other statistical software, stores numbers internally as double precision floating point values. That means 53 bits of precision plus an exponent, which gives approximately 16 significant figures. The numbers are expressed in terms of powers of 2.

Running this code shows exactly what is happening here. Using scientific notation exposes the actual values, although technically there is about half an additional digit not shown. When 2252.01 is converted to binary, the stored value is 2.25201000000000022e+03 (shown with a little extra precision as calculated in Python code).

data list list/n384 n381 n558(3f8.2).
begin data
2252.01 2000.00 252.01
end data.
compute diff = n384 - n381.
compute test = diff - n558.
format n384 n381 n558 diff test(e23.16).
list.

n384                      n381                      n558                      diff                      test
2.252010000000000E+003    2.000000000000000E+003    2.520100000000000E+002    2.520100000000002E+002    2.273736754432321E-013

This link goes into all the gory details of floating point and the IEEE 754 standard.

Many years ago, we considered adding some fuzz to the equality test in COMPUTE et al to reduce confusion, but anything we could do just moved the seemingly paradoxical behavior to elsewhere in the floating point number system.
Jon,

I notice that in the options table there is an entry for 'fuzzbits' and it is adjustable. It says it is for the RND and TRUNC functions. I'm curious how RND and TRUNC are functionally different from (a < / > / = b).

Gene Maguin
In reply to this post by Jon Peck
Boy, does this bring back memories. When I was an undergrad in the early 1970s I was a psych major and had a minor in bio and computer science -- running punched cards on an IBM mainframe (which took hours to get through the system, meaning it took hours for me to find the errors I committed and rerun the program, which was madness). Later in the 1970s I was in grad school in exp/cog psych and was running SPSS and BMDP on a Sperry UNIVAC 1100, which used 7-bit bytes for ASCII instead of the 8-bit bytes of EBCDIC on the IBM. Wilkinson & Dallal published their American Statistician paper in 1977 on "Accuracy of Sample Moments Calculations etc.", which pointed out that most statistical software packages gave incorrect results because of (a) the algorithm used, and (b) the use of single precision 32-bit representation. This ties into the presentation that Jon Peck makes for double precision below.

Quoting from W&D:
|Ling (1974) listed five algorithms for calculating the sum of
|squares about the sample mean. The first of these is the
|desk calculator:

NOTE: Greek symbols don't paste properly into the mail client I use, but the desk calculator formula involves the sum of x^2.

|Of the five algorithms, it is the least reliable for use on a
|computer with a small word size due to its susceptibility
|to truncation error. Anscombe (1967) has shown that for a
|computer like the IBM 370 with a four-byte, 32-bit word,
|24 of which are used to store a mantissa (approximately
|7.2 decimal digits), S sub 11 [standard deviation] cannot be
|computed with any accuracy for the numbers 9001, 9002, 9003.
|Following the example through,
|
|  X sub 1i    X^2 sub 1i    X^2 sub 1i (first 24 binary digits)
|  9001        81018001      81018000
|  9002        81036004      81036000
|  9003        81054009      81054008
|
|Therefore, to the first 24 binary digits,
|  sum of X^2 sub 1 = 243108000.
|Again, to the first 24 binary digits,
|  3 X-bar^2 = 243108000, so
|  sum of X^2 sub 1 - 3 X-bar^2 = 0.

[NOTE: see the original article for the actual formulas.]

|The same result can be obtained in decimal arithmetic by ignoring
|any digit beyond the seventh. The three consecutive integers need
|not be as large as 9000 for complete loss of accuracy; numbers
|smaller than 3000 will do.
Because different software packages used different algorithms as well as different precision (double precision helps minimize the problem), whether the results one got were correct or wrong depended upon the algorithm the package used and the precision used (and whether it could be changed from single to double).

BMDP used single precision (32-bit) representation but used the mean deviation formula to calculate the sum of squares, so it avoided overflow/underflow errors that would produce truncated numbers and, hence, incorrect results. BMDP was the only package to provide correct results with single precision. However, BMDP required reading the data twice: first to calculate the appropriate statistics like the mean, and second to subtract the mean from each value, which meant the processing took longer.

SPSS in 1977 (the IBM mainframe version) could calculate the mean and standard deviation correctly but not correlations. As W&D point out, this was due to SPSS using single precision for some calculations and double precision for others. At this time SPSS didn't provide an option to do double precision calculations throughout. Other packages (DATATEXT, OSIRIS) which provided an option to use double precision were able to get all of the statistics correct when this option was used. Subsequent to this article, SPSS would use double precision in its calculations on IBM machines, but I'm not sure what year or version of SPSS this was. Not sure if they changed the algorithm.
W&D note the Control Data Corp (CDC) computers used 60-bit words and the desk calculator algorithm would provide correct results there (I assume that all packages running on CDC machines made use of the 60-bit representation). NYU used to have a CDC machine in addition to its IBM mainframes in the early 1980s and it is telling that Elazar Pedhazur used both IBM and CDC machines for analyses -- Pedhazur did his Ph.D. under Fred Kerlinger and would become part of the Ed psych faculty in NYU's School of Ed. In Pedhazur's early books on multiple regression he would show results from both CDC and IBM versions of software.

Sidenote: in the 1940s the clinical psychology program was located in the NYU School of Ed, not in the NYU School of Arts & Science, and that is where Jack Cohen got his Ph.D. in clinical psychology as well as his quantitative training. After getting his Ph.D., Jack would do clinical work and research in one of the Veterans hospitals until the late '50s/early '60s, when he became part of the NYU Arts & Science psychology dept as well as doing a consulting gig at the New York State Psychiatric Institute (NYSPI or just PI), which was run by Columbia's med school's dept of psychiatry. I believe he was affiliated with the Biometrics Dept there, which is where his second wife Patricia would also work and does so still today.

Anyway, Jon's presentation below helps to make sense of the W&D presentation, but in the context of the 32-bit word. I believe that the IEEE 754 standard would help standardize operations across statistical packages.

-Mike Palij
New York University
In reply to this post by Maguin, Eugene
Fuzzbits is pretty esoteric, and generally one should leave it alone. I have copied the help below, but what might not be clear from that is that it does not affect ordinary equality calculations. They are based on the exact bits. It also does not affect the front end in places, say, in pivot tables, where a format is applied. The setting is designed to deal with noise in the least significant bits in a calculation, as in the present case. As you can see from the help, it has had a chequered history. The setting was exposed in V17 because of a few customers who had code that was very sensitive to the rounding/truncation threshold, and they wanted to ensure compatibility.

Rounding and Truncation of Numeric Values. For the RND and TRUNC functions, this setting controls the default threshold for rounding up values that are very close to a rounding boundary. The setting is specified as a number of bits and is set to 6 at install time, which should be sufficient for most applications. Setting the number of bits to 0 produces the same results as in release 10. Setting the number of bits to 10 produces the same results as in releases 11 and 12.
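To see what the setting does (and does not do), here is a minimal sketch; the value 1E-10 and the variable names are made up for illustration, and it relies only on SET FUZZBITS and TRUNC as described in the help above:

data list free / dummy.
begin data
1
end data.
compute near = 225201 - 1E-10.    /* a value a few bits below an integer boundary */
set fuzzbits=0.
compute t_strict = trunc(near).   /* with no fuzz this should truncate down to 225200 */
execute.
set fuzzbits=6.
compute t_fuzzed = trunc(near).   /* with the default fuzz, a value this close should be treated as 225201 */
execute.
formats t_strict t_fuzzed (F8.0).
list t_strict t_fuzzed.

An ordinary comparison such as (near EQ 225201) is false under either setting, which is Jon's point: FUZZBITS touches RND and TRUNC only.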
In reply to this post by Mike
Interesting history, Mike. I don't know when SPSS became fully double precision, but it probably would have been no later than the introduction of SPSSX, which was around 1981-2. That's before I joined SPSS.

At least as big an issue as single/double precision is how the arithmetic calculations are done. Even something as apparently simple as a sum can run amok if you just use the obvious method when the numbers vary widely in magnitude. Here is a link to the Kahan algorithm used by Statistics.
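Not Kahan's algorithm itself, but a minimal sketch of the problem it solves (the 1E17 value is just an illustration; any gap beyond the roughly 16 significant digits of a double behaves the same way):

data list free / dummy.
begin data
1
end data.
compute big   = 1E17.
compute small = 1.
compute naive = (big + small) - big.   /* comes out 0, not 1: the small term is below the resolution of a double at this magnitude */
formats naive (F8.2).
list naive.

Compensated (Kahan) summation carries the lost low-order part along in a separate correction term so that it is not simply thrown away.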
In reply to this post by Mike
Mike
Punched cards? At least you could see the text along the top!

When writing the Salford Survey Suite in the mid-1960s the computer was an English Electric KDF9 (64K RAM and as big as a small house: input and output on 8-hole paper tape). I first encountered cards at LSE in 1970, when data, programs and JCL were needed on actual cards to feed into SDTAB (a program for Survey Data Tabulation) and MUTOS (a program for spreading out data supplied by fieldwork agencies on multi-punched cards). Other software we tried included BMDP and OSIRIS, but local programs abounded elsewhere. For survey processing before SPSS (and personal computers) see the slide-show http://surveyresearch.weebly.com/uploads/2/9/9/8/2998485/slides_1_-_history_and_background.ppt

SPSS was only implemented in the UK in 1970 after some political infighting at Edinburgh in which an in-house proposal was defeated (see: http://surveyresearch.weebly.com/uploads/2/9/9/8/2998485/appendix_1_how_spss_came_to_the_uk.pdf). Edinburgh was one of a handful of universities then allowed to buy American computers. My first encounter with SPSS was a completely disastrous blind date (http://surveyresearch.weebly.com/my-first-time-in-at-the-spss-deep-end-1972.html) but perseverance paid off. It was a revelation and spread like wildfire among university social science departments, not least because the manual was available in academic book-stores and the command language was not Fortran, but English. At last we were free of constraint by the mathematicians, statisticians and programmers guarding their precious machines, many of whom criticised SPSS for their own reasons, including inefficient programming and statistical inaccuracy. We just ignored them.

I have used it ever since, starting at LSE in 1972 (data and setups still on actual 80-column cards, data later on magnetic tape) using the CDC 2000 and, when SPSS was available for other manufacturers, the ICL 2900. From 1976 (still on cards: card-punch on another campus a mile away) on ICL 1900, then from 1982 (when my unit acquired a single card-punch) and from the mid-1980s (Oh joy! A new lab with 16 VDUs, 4 fast servers and 2 lineprinters: no more cards!) on Dec-10, Dec-20 and finally a Vax cluster (but still with JCL in the form of EDT and VMS). Even then the social scientists weren't consulted about their needs, but were dependent on what Librarianship and Information Sciences had ordered. SPSS for Windows offers ultimate freedom from all constraints except personal incompetence.

NB: the links above are taken from my 2006 presentation to ASSESS (SPSS users in Europe) http://surveyresearch.weebly.com/old-dog-old-tricks-using-spss-syntax-to-beat-the-mouse-trap.html
That's a term from the past. I spent several years using a VAX and grew to love it with an affection no other system has come close to equaling. (Tough hardware too.) A mainframe where SPSS batch jobs could be scheduled (many tasks running simultaneously); VAXMate, which was their version of a PC that could see the mainframe drive as a virtual drive (huge disk space); a programmable editor (TPU?); and my greatest love - DCL, which permitted so many office automation processes to be created with an elegance that makes DOS look like a toy.