I want to carry out a few data validation checks on my data. I have 7 validation checks so far. It’s check number E2 that I find weird. Any advice would be welcome.
string E1 E2 E3 E4 E5 E6 E7 (A1).
compute TFN = number(e416, F12.0).
if missing(n384) or (n384 eq 0) E1 = '1'.
if ((n384 - n381) NE n558) E2 = '2'.
if (n558 gt n384) E3 = '3'.
if any(e490, '401', '410') and ((missing(n558)) or (n558 EQ 0)) E4 = '4'.
if any(e490, '401', '410') and (e488 EQ 'ZZZZZZZZZZ') E5 = '5'.
if any(e490, '401', '410') and missing(TFN) E6 = '6'.
if any(e490, '501', '510') and (n558 GT 0) E7 = '7'.
select if (E1 EQ '1') OR (E2 EQ '2') OR (E3 EQ '3') OR (E4 EQ '4') OR (E5 EQ '5') OR (E6 EQ '6') OR (E7 EQ '7').
list vars = year sem E313 E307 e490 n384 n381 n529 n558 tfn e488 E1 E2 E3 E4 E5 E6 E7.

The relevant columns of the output (realigned) are:

e490      n384      n381     n529     n558 E1 E2 E3 E4 E5 E6 E7
 230   2252.01   2000.00    63.00   252.01     2
 410     38.00       .00      .00    38.00                 6
 410     76.00       .00      .00    76.00                 6
 410    190.00       .00      .00   190.00                 6
 410     76.00       .00      .00    76.00                 6
 410     95.00     38.00      .00    57.00                 6

What the output is saying in regard to E2 is that case 1 is in error and case 6 is fine. Sure, record 6 is good: 95 - 38 = 57 and the value of n558 is also 57, therefore no error and no error is flagged. But look at case 1: 2252.01 - 2000 = 252.01 and the value of n558 is also 252.01 (i.e. identical values), yet E2 shows it as an error by displaying '2' against the record.

The following were used because the data had been held as text, even though the values were numbers:

compute n384 = number(e384, F12.2).
compute n381 = number(e381, F12.2).
compute n529 = number(e529, F12.2).
compute n558 = number(e558, F12.2).

I don't know if that's relevant to the issue.
Calculate the actual difference. Due to the nature of floating point arithmetic, numbers with a decimal part often do not have an exact representation in floating point. You would need a very small fuzz factor in the test.
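For example, a minimal sketch of what a fuzzed E2 check could look like (the .005 tolerance is an assumption, not part of the original code: half a cent, since these appear to be two-decimal currency amounts; any threshold comfortably larger than the representation error and smaller than a real discrepancy would do):

if (ABS((n384 - n381) - n558) GT .005) E2 = '2'.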
Jon,

I've tried to find how the floating point numbers are represented internally in SPSS but couldn't find anything. Wouldn't it be a good idea to declare a print format for n384, n381, and n558 of, say, F20.15 and print out the case with e490=230? If what you say below is true, we'd expect some number(s) beyond the 2nd decimal place, and printing out more decimal values should reveal it. Though this seems reasonable to me, I think that the error may be elsewhere, but it is hard to determine where because (a) there is distracting junk in the code (for example, in LIST: year, sem, tfn; there are no values for these), and/or (b) there may be syntax elsewhere that is the source of the problem.

By the way, why are there seven E vars instead of one E with seven values? The variables E1-E7 are going to be sparse unless a case can have more than one E variable. It certainly confuses the output below (even after putting it in Notepad; there is only one column).

-Mike Palij
New York University
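A sketch of the diagnostic being suggested here, using the F20.15 format and the e490 value from the posts above (this assumes e490 is the string variable shown in the listing; adjust the comparison if it is numeric):

temporary.
select if (e490 EQ '230').
formats n384 n381 n558 (F20.15).
list n384 n381 n558.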
Got it licked!
Trying to trick the system is tricky. My calculated value of zero actually equals 0.0000000000291, which is not the same as 0. Multiplying the values by 100 to remove the decimal component didn't really help; as you might surmise, even though 2252.01 x 100 = 225201, in SPSS it is represented as 225201.00000000003. However, when I added TRUNC everything worked out just fine. The following works:

if ((TRUNC(n384*100) - TRUNC(n381*100) - TRUNC(n558*100)) ne 0) E2 = '2'.
In the long run you will be better off with something like:

IF (ABS(n384 - n381 - n558) GT .00000000001) E2='2'.
"2252.01 x 100 = 225201 in spss it is represented as 225201.00000000003"

It might just as easily have been represented as 225200.999999999999999, and then you would be screwed.
In reply to this post by Ron0z
A few points on this.

- The variable format has no effect on the actual internal value of a number. It just says how to display the number or, on input, how to interpret the digits.
- SPSS, like other statistical software, stores numbers internally as double precision floating point values. That means 53 bits of precision plus an exponent, which gives approximately 16 significant figures. The numbers are expressed in terms of powers of 2.

Running this code shows exactly what is happening here. Using scientific notation exposes the actual values, although technically there is about half an additional digit not shown. When 2252.01 is converted to binary, the stored value is 2.25201000000000022e+03 (shown with a little extra precision as calculated in Python code).

data list list/n384 n381 n558(3f8.2).
begin data
2252.01 2000.00 252.01
end data.
compute diff = n384 - n381.
compute test = diff - n558.
format n384 n381 n558 diff test(e23.16).
list.

n384                      n381                      n558                      diff                      test
2.252010000000000E+003    2.000000000000000E+003    2.520100000000000E+002    2.520100000000002E+002    2.273736754432321E-013

This link goes into all the gory details of floating point and the IEEE 754 standard.

Many years ago, we considered adding some fuzz to the equality test in COMPUTE et al to reduce confusion, but anything we could do just moved the seemingly paradoxical behavior to elsewhere in the floating point number system.
Jon,

I notice that in the options table there is an entry for 'fuzzbits' and it is adjustable. It says it is for the RND and TRUNC functions. I'm curious how RND and TRUNC are functionally different from (a < / > / = b).

Gene Maguin
In reply to this post by Jon Peck
Boy, does this bring back memories. When I was an undergrad in the early 1970s I was a psych major and had a minor in bio and computer science -- running punched cards on an IBM mainframe (which took hours to get through the system, meaning it took hours for me to find the errors I committed and rerun the program, which was madness). Later in the 1970s I was in grad school in exp/cog psych and was running SPSS and BMDP on a Sperry UNIVAC 1100, which used 7-bit bytes for ASCII instead of the 8-bit bytes of EBCDIC on the IBM. Wilkinson & Dallal published their American Statistician paper in 1977 on "Accuracy of Sample Moments Calculations etc.", which pointed out that most statistical software packages gave incorrect results because of (a) the algorithm used, and (b) the use of single precision 32-bit representation. This ties into the presentation that Jon Peck makes for double precision below.

Quoting from W&D:
|Ling (1974) listed five algorithms for calculating the sum of
|squares about the sample mean. The first of these is the
|desk calculator:

NOTE: Greek symbols don't paste properly into the mail client I use, but the desk calculator formula involves the sum of x^2.

|Of the five algorithms, it is the least reliable for use on a
|computer with a small word size due to its susceptibility
|to truncation error. Anscombe (1967) has shown that for a
|computer like the IBM 370 with a four-byte, 32-bit word,
|24 of which are used to store a mantissa (approximately
|7.2 decimal digits), S sub 11 [standard deviation] cannot be
|computed with any accuracy for the numbers 9001, 9002, 9003.
|Following the example through,
|
|  X sub 1i    X^2 sub 1i    X^2 sub 1i (first 24 binary digits)
|  9001        81018001      81018000
|  9002        81036004      81036000
|  9003        81054009      81054008
|
|Therefore, to the first 24 binary digits,
|  sum of X^2 sub 1 = 243108000.
|Again, to the first 24 binary digits,
|  3 X-bar^2 = 243108000, so
|  sum of X^2 sub 1 - 3 X-bar^2 = 0.

[NOTE: see the original article for the actual formulas.]

|The same result can be obtained in decimal arithmetic by ignoring
|any digit beyond the seventh. The three consecutive integers need
|not be as large as 9000 for complete loss of accuracy; numbers
|smaller than 3000 will do.
Because different software packages used different algorithms as well as different precision (double precision helps minimize the problem), whether the results one got were correct or wrong depended upon the algorithm the package used and the precision used (and whether it could be changed from single to double).

BMDP used single precision (32-bit) representation but used the mean deviation formula to calculate the sum of squares, so it avoided overflow/underflow errors that would produce truncated numbers and, hence, incorrect results. BMDP was the only package to provide correct results with single precision. However, BMDP required reading the data twice: first to calculate the appropriate statistics like the mean, and second to subtract the mean from each value, which meant the processing took longer.

SPSS in 1977 (the IBM mainframe version) could calculate the mean and standard deviation correctly but not correlations. As W&D point out, this was due to SPSS using single precision for some calculations and double precision for others. At this time SPSS didn't provide an option to do double precision calculations throughout. Other packages (DATATEXT, OSIRIS) which provided an option to use double precision were able to get all of the statistics correct when this option was used. Subsequent to this article, SPSS would use double precision in its calculations on IBM machines, but I'm not sure what year or version of SPSS this was. Not sure if they changed the algorithm.
W&D note the Control Data Corp (CDC) computers used 60-bit words and the desk calculator algorithm would provide correct results there (I assume that all packages running on CDC machines made use of the 60-bit representation). NYU used to have a CDC machine in addition to its IBM mainframes in the early 1980s and it is telling that Elazar Pedhazur used both IBM and CDC machines for analyses -- Pedhazur did his Ph.D. under Fred Kerlinger and would become part of the Ed psych faculty in NYU's School of Ed. In Pedhazur's early books on multiple regression he would show results from both CDC and IBM versions of software.

Sidenote: in the 1940s the clinical psychology program was located in the NYU School of Ed, not in the NYU School of Arts & Science, and that is where Jack Cohen got his Ph.D. in clinical psychology as well as his quantitative training. After getting his Ph.D., Jack would do clinical work and research in one of the Veterans hospitals until the late '50s/early '60s, when he became part of the NYU Arts & Science psychology dept as well as doing a consulting gig at the New York State Psychiatric Institute (NYSPI or just PI), which was run by Columbia's med school's dept of psychiatry. I believe he was affiliated with the Biometrics Dept there, which is where his second wife Patricia would also work and does so still today.

Anyway, Jon's presentation below helps to make sense of the W&D presentation, but in the context of the 32-bit word. I believe that the IEEE 754 standard would help standardize operations across statistical packages.

-Mike Palij
New York University
In reply to this post by Maguin, Eugene
Fuzzbits is pretty esoteric, and generally one should leave it alone. I have copied the help below, but what might not be clear from that is that it does not affect ordinary equality calculations. They are based on the exact bits. It also does not affect the front end in places, say, in pivot tables, where a format is applied. The setting is designed to deal with noise in the least significant bits in a calculation, as in the present case. As you can see from the help, it has had a chequered history. The setting was exposed in V17 because of a few customers who had code that was very sensitive to the rounding/truncation threshold, and they wanted to ensure compatibility.

Rounding and Truncation of Numeric Values. For the RND and TRUNC functions, this setting controls the default threshold for rounding up values that are very close to a rounding boundary. The setting is specified as a number of bits and is set to 6 at install time, which should be sufficient for most applications. Setting the number of bits to 0 produces the same results as in release 10. Setting the number of bits to 10 produces the same results as in releases 11 and 12.
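To see what the setting does (and does not do), here is a minimal sketch; the value 1E-10 and the variable names are made up for illustration, and it relies only on SET FUZZBITS and TRUNC as described in the help above:

data list free / dummy.
begin data
1
end data.
compute near = 225201 - 1E-10.    /* a value a few bits below an integer boundary */
set fuzzbits=0.
compute t_strict = trunc(near).   /* with no fuzz this should truncate down to 225200 */
execute.
set fuzzbits=6.
compute t_fuzzed = trunc(near).   /* with the default fuzz, a value this close should be treated as 225201 */
execute.
formats t_strict t_fuzzed (F8.0).
list t_strict t_fuzzed.

An ordinary comparison such as (near EQ 225201) is false under either setting, which is Jon's point: FUZZBITS touches RND and TRUNC only.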
In reply to this post by Mike
Interesting history, Mike. I don't know when SPSS became fully double precision, but it probably would have been no later than the introduction of SPSSX, which was around 1981-2. That's before I joined SPSS.

At least as big an issue as single/double precision is how the arithmetic calculations are done. Even something as apparently simple as a sum can run amok if you just use the obvious method when the numbers vary widely in magnitude. Here is a link to the Kahan algorithm used by Statistics.
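Not Kahan's algorithm itself, but a minimal sketch of the problem it solves (the 1E17 value is just an illustration; any gap beyond the roughly 16 significant digits of a double behaves the same way):

data list free / dummy.
begin data
1
end data.
compute big   = 1E17.
compute small = 1.
compute naive = (big + small) - big.   /* comes out 0, not 1: the small term is below the resolution of a double at this magnitude */
formats naive (F8.2).
list naive.

Compensated (Kahan) summation carries the lost low-order part along in a separate correction term so that it is not simply thrown away.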
In reply to this post by Mike
Mike
Punched cards? At least you could see the text along the top!

When writing the Salford Survey Suite in the mid-1960s the computer was an English Electric KDF9 (64K RAM and as big as a small house: input and output on 8-hole paper tape). I first encountered cards at LSE in 1970, when data, programs and JCL were needed on actual cards to feed into SDTAB (a program for Survey Data Tabulation) and MUTOS (a program for spreading out data supplied by fieldwork agencies on multi-punched cards). Other software we tried included BMDP and OSIRIS, but local programs abounded elsewhere. For survey processing before SPSS (and personal computers) see the slide-show http://surveyresearch.weebly.com/uploads/2/9/9/8/2998485/slides_1_-_history_and_background.ppt

SPSS was only implemented in the UK in 1970 after some political infighting at Edinburgh in which an in-house proposal was defeated (see: http://surveyresearch.weebly.com/uploads/2/9/9/8/2998485/appendix_1_how_spss_came_to_the_uk.pdf). Edinburgh was one of a handful of universities then allowed to buy American computers. My first encounter with SPSS was a completely disastrous blind date (http://surveyresearch.weebly.com/my-first-time-in-at-the-spss-deep-end-1972.html) but perseverance paid off. It was a revelation and spread like wildfire among university social science departments, not least because the manual was available in academic book-stores and the command language was not Fortran, but English. At last we were free of constraint by the mathematicians, statisticians and programmers guarding their precious machines, many of whom criticised SPSS for their own reasons, including inefficient programming and statistical inaccuracy. We just ignored them.

I have used it ever since, starting at LSE in 1972 (data and setups still on actual 80-column cards, data later on magnetic tape) using the CDC 2000 and, when SPSS was available for other manufacturers, the ICL 2900. From 1976 (still on cards: card-punch on another campus a mile away) on ICL 1900, then from 1982 (when my unit acquired a single card-punch) and from the mid-1980s (Oh joy! A new lab with 16 VDUs, 4 fast servers and 2 lineprinters: no more cards!) on Dec-10, Dec-20 and finally a Vax cluster (but still with JCL in the form of EDT and VMS). Even then the social scientists weren't consulted about their needs, but were dependent on what Librarianship and Information Sciences had ordered. SPSS for Windows offers ultimate freedom from all constraints except personal incompetence.

NB: the links above are taken from my 2006 presentation to ASSESS (SPSS users in Europe) http://surveyresearch.weebly.com/old-dog-old-tricks-using-spss-syntax-to-beat-the-mouse-trap.html
That's a term from the past. I spent several years using a VAX and grew to love it with an affection no other system has come close to equaling. (Tough hardware too.) A mainframe where SPSS batch jobs could be scheduled (many tasks running simultaneously); VAXMate, which was their version of a PC that could see the mainframe drive as a virtual drive (huge disk space); a programmable editor (TPU?); and my greatest love - DCL, which permitted so many office automation processes to be created with an elegance that makes DOS look like a toy.