

On 18-Nov-05, at 12:18 PM, Mike Pall wrote:

>> On the other hand, I've certainly seen C compilers (though I
>> admit not for a long time) which would cheerfully optimize away the
>> (a)!=(a) check (which certainly should be optimized away if a is an
>> integer type.)
>
> It's clearly a violation of the standard to optimize this
> comparison away for floating point numbers. Abandon all
> hope that such a compiler gets the other subtle issues
> of FP arithmetic right.

Which standard would that be? :) All I see in the C standard is that the value of a comparison or equality operator is "1 if the specified relation is true and 0 if it is false". Even C99 does not mandate the use of IEEE-754 floating point, and it is not intrinsic to floating point that either (1) there is such a thing as "not a number" or (2) if there is, that it tests unequal to itself.

OK, to be fair, C99 does say that if the implementation purports to implement IEEE-754 floating point (by defining __STDC_IEC_559__), then it has to make == and != work that way. On the other hand, in that case, it also has to define isunordered(x, y). And, curiously, gcc (at least on x86) *does* inline isunordered(x, x) even though it does not inline isnan(x) (which is semantically identical). Go figure.

(gcc does not seem to define __STDC_IEC_559__, though. Perhaps the implementation isn't considered complete yet. So it's under no obligation to honour the definition of ==. See below.)
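For anyone who wants to check their own compiler, here is a minimal sketch, assuming a C99 <math.h>. The function names claims_iec559 and nan_tests_agree are made up for illustration; they are not part of any library.

```c
#include <math.h>

/* Returns 1 if the implementation claims IEC 60559 (IEEE-754)
   conformance by defining __STDC_IEC_559__, 0 otherwise. */
int claims_iec559(void) {
#ifdef __STDC_IEC_559__
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if the three NaN tests discussed here give the same
   verdict for x. Hypothetical helper, for illustration only. */
int nan_tests_agree(double x) {
    int by_cmp       = (x != x);
    int by_isnan     = isnan(x) != 0;
    int by_unordered = isunordered(x, x) != 0;
    return by_cmp == by_isnan && by_isnan == by_unordered;
}
```

On an implementation that follows IEEE-754 for comparisons, nan_tests_agree() should return 1 for both 0.0 and a NaN; what claims_iec559() returns depends on the compiler and library in use.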

So I timed the following three little snippets in a hard loop:

#include <math.h>   /* isunordered(), isnan() */

double uno(double x) {
  if (isunordered(x, x)) return 0;
  else return x + 1;
}

double isn(double x) {
  if (isnan(x)) return 0;
  else return x + 1;
}

double cmp(double x) {
  if (x != x) return 0;
  else return x + 1;
}

using 0.0 and some NaN as the argument. Results (nanoseconds per iteration, timed with 100,000,000 iterations) follow. (Remember when you could benchmark without counting zeros? :)

NaN |xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 107.8
  0.0      |xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx     101
x != x
  NaN      |xxxxxxxxxx     19.6
  0.0      |xxxxxxxxxxxx   24.5
isunordered(x, x)
  NaN      |xxxxxxxxxxxx   24.2
  0.0      |xxxxxxxxxx     19.0

Conclusion: on gcc/x86 (and possibly other platforms), the best test is 'isunordered(x,x)'; it's apparently faster in the common case, and it is semantically correct.
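The timing harness itself is not shown in this message. The following is only a sketch of how such a hard loop might be written, assuming clock() is precise enough for the iteration count chosen. The name ns_per_call is hypothetical, and calling through a function pointer prevents inlining, so this is not necessarily measuring the same thing as the figures above.

```c
#include <math.h>
#include <time.h>

double uno(double x) {
    if (isunordered(x, x)) return 0;
    else return x + 1;
}

/* Average nanoseconds per call of f(arg) over n iterations.
   Hypothetical helper; the original harness is unknown. */
double ns_per_call(double (*f)(double), double arg, long n) {
    volatile double in = arg;   /* forces a fresh load each iteration */
    volatile double sink = 0;   /* keeps the calls from being discarded */
    clock_t start = clock();
    for (long i = 0; i < n; i++)
        sink = f(in);
    clock_t end = clock();
    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC * 1e9 / (double)n;
}
```

Something like ns_per_call(uno, 0.0, 100000000L) would then give one row of the chart; the NaN row needs a NaN argument, for instance the NAN macro from <math.h> where it is available.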

Now, gcc offers the interesting optimization flag -ffast-math. You should never use this flag. We all know that, right? But I suppose people do. Anyway, I tried it. Two interesting things arise:

First, with -ffast-math, gcc optimizes away the 'x != x' test. Of course, it never claimed to be IEEE-754 compliant, so it's allowed to do that.

Second, with -ffast-math, gcc also optimizes away the call to isunordered() (!).

Finally, -ffast-math is anything but fast when presented with NaNs. I ran the same tests as above, but the results won't fit on the width of the email:

isnan(x)
  NaN      |-----------------------------------------------------> 4247
  0.0      |xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx    104
x != x
  NaN      |---->        3933* (wrong answer)
  0.0      |xxxxxxx        13.2
isunordered(x, x)
  NaN      |---->        3933* (wrong answer)
  0.0      |xxxxxxx        13.1

So optimising away the tests does produce the right answer faster when the argument is 0.0. But why is it so slow to produce the wrong answer for a NaN? And why does the slowness affect a non-inlined call to isnan() as well?

The answer to the first question is, of course, the pathetic handling of NaNs by Pentiums. Since the check for NaN has been removed from the code, the addition NaN+1.0 takes place; this is disastrously slow on a Pentium 4.

Examination of the assembly shows that the isnan(x) case is very similar. In a vain attempt to squeeze every microcycle out of the Pentium 4, gcc moves the addition to *before* the test of the result of isnan(x). That is, it does the addition regardless of whether it needs the value, because "it can't hurt". One presumes that it wouldn't have done that had it been a division instead of an addition, and it certainly does not do it without -ffast-math, presumably because in that case it knows that the addition might change an fp exception flag. But the joke's on gcc in the end; the "unnecessary but harmless" addition ends up costing a 40x slowdown.
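In source terms, the reordering amounts to something like the following. This is a hand-written illustration of the transformation described above, not gcc's actual output, and isn_hoisted is a made-up name.

```c
#include <math.h>

/* As written: the addition runs only when x is not a NaN. */
double isn(double x) {
    if (isnan(x)) return 0;
    else return x + 1;
}

/* Roughly what the -ffast-math code does: add first, test afterwards. */
double isn_hoisted(double x) {
    double sum = x + 1;       /* executed even when x is a NaN */
    if (isnan(x)) return 0;   /* the sum is simply thrown away here */
    return sum;
}
```

Both functions return the same values for every input; the only difference is that the second one always pays for the addition, which is where the NaN penalty comes from.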