- Subject: Re: Setting Float Precision in Lua.c
- From: KHMan <keinhong@...>
- Date: Thu, 7 Jun 2018 10:04:59 +0800
On 6/6/2018 8:53 PM, Albert Chan wrote:
>> Why don't you compile your binaries for SSE2 only? Even easier, just compile to 64-bit binaries? Surprising you mentioned Windows uses extended precision by default when there is x64 on every 64-bit capable Intel/AMD/other chip... and has been so for many, many years already.
> I already picked 53-bit rounding.
> I use my own laptop's behavior just as an example, same with fsum.lua.
> David Gay's dtoa.c strtod may be a better example.
> With 53-bit rounding, it optimized away common cases:
> = 123456789 / 1e20 -- both numbers exactly represented in double
> = 1.23456789e-012 -- division guaranteed correct rounding
Here is a different approach (the long story approach):
(It was bubbling in my brain so I had to type it out. If you don't
understand this, then I really cannot help any further.)
Say, all values are on a line.
A double-precision float actually represents a number that lies anywhere on
a segment on that line. It may be exactly the value of the
representation, but it can also be a little more, or a little
less. All those values in a segment need to be shoehorned into one
binary representation. It's a single binary representation, yet
the values can all be different. It's an approximation.
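A small sketch of the segment idea, assuming Lua numbers are IEEE 754 doubles and the C library's strtod rounds decimal literals correctly (both are assumptions about the build, not something stated above). Several different decimal literals fall inside the same segment and so become the same double, while a literal just past the segment's edge becomes the neighbouring double:

```lua
-- Three literals that fall in the same segment: all become one double.
local a = 0.1
local b = 0.1000000000000000055511151231257827  -- leading digits of the value actually stored
local c = 0.10000000000000001                   -- slightly more than 0.1, still the same segment
print(a == b, a == c)

-- A literal that falls in the next segment: a different double.
local d = 0.10000000000000002
print(a == d)

-- Printing 17 significant digits shows the stored value is not exactly 0.1.
print(string.format("%.17g", a))
```

The equality tests compare binary representations, so "true" here means "same segment", not "same real number".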
The examples you keep offering imply exact numbers, that is, they
are points on the line. Then in the examples, the arithmetic
operation is performed, and the FPU should round and hit another
point on the line. There is an expectation of mathematical
perfection or mathematical elegance.
When we work with actual numbers instead of ideal examples, we
always understand that when operations are performed, the result
values hardly ever hit the exact points on the line that equal a
binary representation. Instead, the result value is close, within
the segment which has that binary representation. So there is
error, and error usually accumulates.
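A minimal sketch of that accumulation, assuming plain IEEE 754 double arithmetic with 53-bit rounding (an SSE2 or 64-bit build; an x87 build using extended precision may behave differently). Each addition lands somewhere inside a segment rather than on its exact point, and after ten additions the sum is no longer the double nearest to 1:

```lua
-- Add 0.1 ten times. 0.1 is not exactly representable, so every
-- step rounds, and the small errors do not cancel out.
local sum = 0
for i = 1, 10 do
  sum = sum + 0.1
end

print(sum == 1.0)                   -- false under the assumptions above
print(string.format("%.17g", sum))  -- shows how far off the sum is

-- The usual engineering answer: compare with a tolerance that
-- reflects the quality of the inputs (1e-9 is an arbitrary choice here).
print(math.abs(sum - 1.0) < 1e-9)   -- true
```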
Since a binary representation really means a segment of possible
values on the value line, when we do arithmetic with two segments,
we end up with a bigger segment. We can have many combinations of
operands and result within those segments and they are all valid
for the binary representation. But how correct are those values?
Normally we know the quality of our inputs, and they carry far fewer
than 16 digits of precision, so we can usually manage the
errors in our calculations.
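The "bigger segment" point can be sketched with a toy interval type. The helpers interval, add and width are hypothetical names made up for this illustration, and the sketch ignores the extra rounding of the endpoint arithmetic itself:

```lua
-- Hypothetical helpers: treat a value as a segment {lo, hi}.
local function interval(value, err)
  return { lo = value - err, hi = value + err }
end

-- Adding two segments: the result must cover every possible sum.
local function add(a, b)
  return { lo = a.lo + b.lo, hi = a.hi + b.hi }
end

local function width(a)
  return a.hi - a.lo
end

-- Two measured inputs, good to about three decimal places.
local x = interval(2.5, 0.001)
local y = interval(1.25, 0.002)
local z = add(x, y)

-- The width of the sum is roughly the two input widths added together.
print(width(x), width(y), width(z))
```

With inputs this coarse, the segment that comes from the binary representation (around 1e-16 relative) is negligible next to the segment that comes from the measurement.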
But some people are of the notion that when arithmetic is done on
two points on the value line, the result should hit an exact point
when such a situation arises. It appears that some people have the
first mental model (segments), others have the second mental model
(points). But if we keep thinking about all those exact points on
the line, then the problem is that values next to those points
cannot be shoehorned into beautifully exact and artificial
If we want exact calculations all the time, just use floats to hold
integers. We can assume the integers are exact, as points on the
value line. We also need to do things that don't mess up this
model. But once the result has a fraction, for example when a
division is done, that value is most likely no longer exactly
representable. It's an approximation.
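A short sketch of where the integer model holds and where it stops, again assuming IEEE 754 doubles. Integers are exact only up to 2^53, and a division that produces a fraction usually leaves the set of exact points:

```lua
local big = 2^53

-- Inside the exact range, integer arithmetic behaves like points on the line.
print(big - 1 + 1 == big)   -- true
print(6 / 3 == 2)           -- true: the quotient is an integer, so it is exact

-- Past 2^53 the model breaks: 2^53 + 1 has no representation of its own.
print(big + 1 == big)       -- true

-- A division with a fractional result is an approximation.
print(string.format("%.17g", 1 / 10))
print(1 / 10 + 2 / 10 == 3 / 10)   -- false
```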
As non-mathematicians, we work with regular numbers or data all
the time; they get processed, and the end value is approximated
by the resulting binary representation. Those values do not hit
the points on the value line that are exactly the value of the
binary representations. But we have about 16 decimal digits to work
with, so we format the result properly for user consumption by
rounding to far fewer than 16 digits of precision. This is why I
mentioned the concepts of engineering compromises versus
So it's no problem for most of us. But if mathematicians keep
thinking about ideal situations and keep trying to hit exact
points on the value line, then they should keep on doing so and
not bother the rest of us about it.
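For what it's worth, the formatting step mentioned earlier (rounding to far fewer than 16 digits for display) looks like this in practice. This is only a sketch assuming IEEE 754 doubles; the choice of 10 digits is arbitrary:

```lua
-- The raw results carry representation error in the last digits.
local raw = 0.1 + 0.2
print(string.format("%.17g", raw))  -- all the digits, error included

-- Formatting to 10 significant digits hides the error from the user.
print(string.format("%.10g", raw))  -- 0.3

-- The same applies to accumulated error.
local total = 0
for i = 1, 10 do
  total = total + 0.1
end
print(string.format("%.10g", total))  -- 1
```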
[snip snip snip]
Kein-Hong Man (esq.)