# A Glimpse into Floating-Point Accuracy 31

Posted by **Loren Shure**,

There are frequent posts on the MATLAB newsgroup as well as lots of questions posed to Technical Support about the floating point accuracy of MATLAB. Many people think they have found a bug when some seemingly simple arithmetic doesn't give the intuitive answer. The issue that arises has to do with the finite number of bits that a computational device uses to store and operate on values.

### Contents

### Calculator Demo

Have you ever impatiently waited for an elevator and continued pushing the `UP` button in the hopes that the elevator will arrive faster? If so, you may also have impatiently pushed buttons repetitively
on a calculator (assuming you know what a calculator is!). It's not just MATLAB that computes values in a way that seems to
defy our expectations sometimes.

```
clear
format long
```

Here's what I tried. I put the number 7 into my calculator. I think it calculates in single precision, but I'm not sure. It shows 7 digits after the decimal point when I take the square root, and doing it several times, I get

sqrts7inSingle = single([7 2.6457513 1.6265765 1.275373])

sqrts7inSingle = 7.0000000 2.6457512 1.6265765 1.2753730

Next, I take the final number and square it 3 times, yielding

squaringSqrt7inSingle = single([1.275373 1.6265762 2.6457501 6.99999935])

squaringSqrt7inSingle = 1.2753730 1.6265762 2.6457500 6.9999995

Compare the difference between the original and final values.

sqrts7inSingle(1)-squaringSqrt7inSingle(end)

ans = 4.7683716e-007

This number turns out to be very interesting in this context. Compare it to the distance between a single precision value of 7 and its next closest value, also know as the floating-point relative accuracy.

eps(single(7))

ans = 4.7683716e-007

We get the same value! So, we can demonstrate the effects of finite storage size for values on a handheld calculator. Let's try the equivalent in MATLAB now.

### Accuracy in MATLAB

Let's do a similar experiment in MATLAB using single precision so we have a similar situation to the one when I used the calculator. In this case, I am going to keep taking the square root until the value reaches 1.

num = single(7); count = 0; tol = eps('single'); dnum = sqrt(num); while abs(dnum-1)>=tol count = count+1; dnum = sqrt(dnum); end count

count = 23

So, it took 23 iterations before we reached the number 1. But now we're in a funny situation. There is no way to square exactly 1 and ever reach our original value of 7.

### Using Double Precision

Let's repeat the same experiment from the calculator in MATLAB using double precision.

num = 7; sqrtd = sqrt(num); sqrtd(2) = sqrt(sqrtd(1)); sqrtd(3) = sqrt(sqrtd(2)) prods(1) = sqrtd(end); prods(2) = prods(1)^2; prods(3) = prods(2)^2 finalNum = prods(3)^2

sqrtd = 2.64575131106459 1.62657656169779 1.27537310685845 prods = 1.27537310685845 1.62657656169779 2.64575131106459 finalNum = 7.00000000000001

Find the difference between `finalNum` and `num` and compare to the proper relative floating-point accuracy:

diffNum = finalNum-num acc = eps(num)*6 diffNum-acc

diffNum = 5.329070518200751e-015 acc = 5.329070518200751e-015 ans = 0

Why did I use 6 when I calculated `acc`? Because I performed 6 floating point operations to get the my final answer, 3 `sqrt` applications, followed by squaring the answers 3 times.

### Typical MATLAB Pitfall

The most common MATLAB pitfall I run across is when users check equality of values generated from the `:` operator. The code starts innocently enough. I've actually turned the logic around here a bit, so we will print out values in the
loop where the loop counter may not be exactly what the user expects.

format short for ind = 0:.1:1; if ind ~= fix(10*ind)/10 disp(ind - fix(10*ind)/10) end end

5.5511e-017 1.1102e-016 1.1102e-016

And then the question is, why do *any* of the values print out? It's because computers can't represent all numbers exactly given a fixed storage size such as double
precision.

### Conclusions

This was a very simplified explanation about floating point, but one, I hope, that is easily understood. Let me know.

### References

Here are some pointers to more resources.

Published with MATLAB® 7.2

**Category:**- Common Errors

### Note

Comments are closed.

## 31 CommentsOldest to Newest

**1**of 31

Hi,

I have just found this floating point precision “bug” in an extremely simple operation:

e.g. if i type:

>> 0.3 == 0.1*3

ans =

0

It would be extremely useful if Matlab would give out a warning for such cases, because this can really throw you off track.

**2**of 31

Ilya,

This is not a bug, but due to limited accuracy. What kind of warning would you expect, one in the editor perhaps?

–Loren

**3**of 31

I have similar floating-point precision problem in my program. For instance;

a=0.1+0.2-0.3

a =

5.551115123125783e-017

I want “a” to be “0” not that precise. How am I going to fix this problem in my code? If anyone could help me, I really appreciate.

**4**of 31

Bashar-

You are asking for the impossible from floating point. Please read the links in the post to the FAQ, etc.

–Loren

**5**of 31

Bashar, you could define a tolerance then round your computations accordingly. You might check out File Exchange submission 4261.

http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=4261&objectType=file

(test HTML link: <a href=”http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=4261&objectType=file”round2)

**6**of 31

matlab must be doing some pretty strange stuff internally for the following: 0.1+0.2-0.3 = 5.551115123125783e-017

My question is why have the matlab dev team not sorted this out in 2008! Mathematica, Java, C# and C++ do not suffer such FP errors for such simple operations (with more complex operations then a certain degree of error is perhaps acceptable?) – Matlab seems to be alone in still have these really basic errors.

Matlab dev team care to comment?

**7**of 31

Hugh-

This is not an error. If MATLAB were built to get that answer to be 0, we’d have to do a bunch of operations to sort the data in the “right” order. How would MATLAB actually know that IS the way the values should be accumulated from the real-world? You can have MATLAB perform calculations in whatever order you wish by using parentheses. Or introduce a function that sorts the data appropriately and then does the sums and differences.

As long as there is no way to perfectly represent some values, floating point calculations are subject to round-off error. As for the other languages, it depends on how the code is written so they get exactly 0. If the constants are built into the program rather than user-supplied at run-time, it’s possible that a C++ compiler will reorder the operations on constants. And Mathematica might be doing the calculations using extended precision (but I don’t know that for certain).

–Loren

**8**of 31

Hugh—The C program below prints “Not equal” when compiled with gcc and run on Linux. An equivalent Java program also prints “Not equal”.

#include <stdio.h> int main(int nargin, char *argv[]) { double a = 0.1; double b = 0.3; if ( (a+a+a) != b ) { fprintf(stdout, "Not equal\n"); } else { fprintf(stdout, "Equal\n"); } return(0); }

**9**of 31

And the same for C++. This program prints “Not equal”:

#include <iostream> int main() { double a = 0.1; double b = 0.3; if ( (a+a+a) != b ) { std::cout << "Not equal\n"; } else { std::cout << "Equal\n"; } return(0); }

**10**of 31

I’m really surprised with the following result. What I’m doing wrong?

>> format hex

>> x = hex2dec(‘8D6F7897′) .* hex2dec(’41C64E6D’) + hex2dec(‘6073’)

x =

43c22b6e94e57f5d

>> uint64(x)

ans =

2456dd29cafeba00

>>

The expected result is: 2456dd29cafebaBE

I can do that operation on the Windows calculator, but MATLAB just can’t?

**11**of 31

John-

(I updated this response as I had cruft in my code.)

MATLAB is doing the calculation correctly in floating point which uses 52 bits for the mantissa, 1 bit for the sign, and the remaining bits for the exponent. Converting the result to uint64 retains the floating point accuracy.

>> x1 = hex2dec('8D6F7897') x1 = 2.3729e+009 >> x2 = hex2dec('41C64E6D') x2 = 1.1035e+009 >> x3 = hex2dec('6073') x3 = 24691 >> x4 = x1.*x2+x3; >> x5 = uint64(x4) x5 = 2456dd29cafeba00

–Loren

**12**of 31

Thank you for your reply Loren.

Please note: you are still getting the wrong result

2456dd29cafeba00 <- Wrong

The correct one ends with BE, not 00

2456dd29cafebabe <- Correct

I’ve posted about this issue on the newsgroup:

http://www.mathworks.com/matlabcentral/newsreader/view_thread/173330

While I’ve been suggested a workaround which solves the problem for my specific requirements, the MATLAB integer handling limitation remains.

Best regards.

**13**of 31

John-

See the answers in the newgroup thread you mention for much more discussion and possible ways to get beyond the highest 53 bits. See this info from the documentation: http://www.mathworks.com/access/helpdesk/help/techdoc/ref/format.html and particularly look at example 3 where it says:

“The hexadecimal display corresponds to the internal representation of the value. It is not the same as the hexadecimal notation in the C programming language.”

–Loren

**14**of 31

John, the main issue as I see it is that we don’t support arithmetic on ‘uint64’ yet. I expect that’s already on our to-do list somewhere. Obviously going through floating point isn’t cutting it for your application. As you say, you have some workarounds. I’ve cobbled up some m code to multiply and add uint32 values and keep the low bits, so you’d do something like low32add(low32mult(a,b),c) to get the lower 32 bits of a*b + c. Also, check out the fixedpoint toolbox.

—

Mike

**15**of 31

Thanks Loren and Mike. I’m amazed at the great and friendly support on my problem.

So far I’ve decided to take the numeric workaround by John D’Errico and Roger Stafford. My algorithm will greatly benefit from using straight numeric arithmetics, plus their workaround is compatible with required bit-wise operations.

I’ll check the fi toolbox, no doubt ;)

Thanks for your time, it’s much appreciated.

**16**of 31

Hi,

This could sound a little amateur but here is what i dont understand;

I perfectly figured the eps concept, but since we get this error most of the times why cant we work on say 14 significant digits to be sure. I mean we can round-off the 15. digit for all calculations and that ensures the correct answer. Is there such a way available in Matlab?

**17**of 31

Banu-

We could rearchitect MATLAB to do that, and it would slow everything down intolerably. In addition, who’s to say that 100*eps is the right cutoff?

–Loren

**18**of 31

Loren,

Thanks for the guide, but i am confused with the following result;

>> 1-(cos(1/8)^2+sin(1/8)^2)

ans =

1.1102e-016

>> ans/eps(1)

ans =

0.5000

Eps is supposed to be the min length to point 1. How come half of it appears?

**19**of 31

Addendum: having played with Cleve’s floatgui, I see that my answer is basically correct. From his book: “within each binary interval 2^e \leq x \leq 2^{e+1}, the numbers are equally spaced with an increment of 2^{e-t}” (t = number of bits to store the mantissa). So numbers in [1,2] are separated by eps, but those in [1/2,1] are separated by eps/2.

**20**of 31

arda,

EPS is not the min length to point 1. It is the distance from 1 to the _next largest_ floating point number. You’re correct that the next _largest_ number after 1 is 1+eps(1). However, the next _smallest_ number before 1 is not 1-eps(1) but is actually 1-(eps(1)/2).

You can see this by looking at the hexadecimal representation of these two numbers and one. To do this, use the HEX option for the FORMAT command. The first three hex digits are the sign and the exponent of the number, the last thirteen are the mantissa.

```
% Tell MATLAB to display the numbers in hex format
format hex
% Look at the number 1
one = 1
% This is the next largest number
% The last bit of the mantissa has been incremented
nextLargest = one+eps(one)
% This is the next smallest number
% We decremented the last bit of the mantissa
% which required borrowing from the exponent
nextSmallest = one - (eps(one)/2)
% Is nextSmallest actually less than 1?
isSmaller = nextSmallest < one
% Reset the format back to the default
format
```

Cleve's newsletter article, linked at the end of Loren's blog posting, describes the hexadecimal format displayed by FORMAT HEX.

**21**of 31

@arda: I’ve encountered this before and I *think* the answer is related to how fp numbers are distributed: the distance between each exactly-representable fp number gets bigger as the numbers get bigger. OK, so eps(1) is defined as the distance between 1 and the next (ie higher) fp number. Since you’re doing subtraction, I think you’re getting the distance between 1 and the previous (ie lower) fp number, which, apparently, is eps/2.

Have a look at floatgui.m from Cleve Moler’s NCM book: http://www.mathworks.com/moler/chapters.html

**22**of 31

Thanks for the reply,

I’m interested in cos() errors in Matlab and wanted to see the major error. I know these errors are mostly because floating points, and they are all look like floating point errors;

>> x=0:1/2^15:0.3;

>> plot(x,cos(x).^2+sin(x).^2-1,’.’)

But i do want to see the exact reason for every number and then maybe come up with a solution. See i also found some interesting things;

Take a look at the cos() function which is said to be used in Matlab. I have cross checked and ensure that this is the code Matlab uses (http://www.netlib.org/fdlibm/k_cos.c).

Inside the code the constants are given as forexample C3=2.48015872894767294178e-05 and this is 1/8! (http://en.wikipedia.org/wiki/Cosine). However the difference is;

>> 1/factorial(8)-2.48015872894767294178e-05

ans =

1.2111e-014

These are clearly different numbers! And what is more interesting is that errors appear more when i try to use exact factorial values. Even with higher series..

**23**of 31

arda,

The coefficient C3 used by FDLIBM’s implementation of COS is NOT exactly 1/factorial(8). See this Cleve’s Corner article for a discussion of how the FDLIBM implementation of SIN was designed; the implementation of COS was designed in much the same way.

Quoting from that article:

*The six coefficients are close to, but not exactly equal to, the power series coefficients 1/3!, 1/5!, …, 1/13! . They minimize the maximum relative error, | (sin() – p())/sin() |, over the interval. Six terms are enough to make this approximation error less than 2^(-52), which is the roundoff error involved when all the terms, and the sum, are less than one.*

That article is titled “The Tetragamma Function and Numerical Craftsmanship” — and there’s definitely some craftmanship involved in the implementation of the trig functions.

**24**of 31

Steve,

Thanks for the help. For a footnote, I investigated the trigonometric errors for both correct and incorrect values. It turns out to be coincidence of correct values (not incorrect ones) :)

**25**of 31

As mentioned by Loren Matlab uses 56 bits for mantisa. Can you please answer, How many bits are used for exponent, and what is the base is it just 2, or 16, or something else.

I tested the following code, hoping mantisa would become zero, but it doesn’t.

Can you pl comment?

n = 1;

for i = 1:1000

n = n *(1/2)

end

**26**of 31

Naveen-

See this comment: http://blogs.mathworks.com/loren/2006/08/23/a-glimpse-into-floating-point-accuracy/#comment-29617 for mantissa, etc. details.

–loren

**27**of 31

Just a critical remark.

When I used a Borland compiler round 1995, it knew the data type “extended” which let us store a number from 1E-500 to 1E2000.

In Maple, which is also less expensive than Matlab, it is possible to increase the precision to a virtually arbitrary accuracy.

Why is that, that Matlab ( which costs at least as much as Maple and advertises itself to be a “numerical laboratorium” in contrast to Maple, which is a symbolic engine ) is the issue of under- and overflow not solved satisfactionary in 2011 ?! I am quite pissed off.

**28**of 31

nablaQuadrat-

Could you please explain what you mean about the issue of under- and over-flow not being solved? MATLAB conforms to the IEEE standard and does what the standard says, to the best of my knowledge.

–Loren

**29**of 31

Hello there, I have this question.

While calculating sums of squares during a minimization routine, I end up with numbers of the form 0.002+0.203i.

Is this some special format? Or have I managed to produce

a …complex valued minimum for my problem?

(Which will either make me famous or embarassed).

Thank you for your time.

**30**of 31

Chris-

You must indeed have a complex valued minimum. You need to check out your objective function carefully, make sure it can’t be called out of range, is normalized correctly, there can be no accumulation of small imaginary parts that could build up, etc.

–Loren

**31**of 31

Thank you!

## Recent Comments