# Bug in Half-Precision Floating Point Object

Posted by **Cleve Moler**,

My post on May 8 was about "half-precision" and "quarter-precision" arithmetic. I also added code for objects `fp16` and `fp8` to Cleve's Laboratory. A few days ago I heard from Pierre Blanchard and my good friend Nick Higham at the University of Manchester about a serious bug in the constructors for those objects.

### Contents

#### The Bug

Let

```
format long
e = eps(fp16(1))
```

e = 9.765625000000000e-04

The value of `e` is

1/1024

ans = 9.765625000000000e-04

This is the relative precision of half-precision floating point numbers, which is the spacing of half-precision numbers in the interval between 1 and 2. So in binary the next number after 1 is

disp(binary(1+e))

0 01111 0000000001

And the last number before 2 is

disp(binary(2-e))

0 01111 1111111111

The three fields displayed are the sign, which is one bit, the exponent, which has five bits and the fraction, which has ten bits.

So far, so good. The bug shows up when I try to convert any number between `2-e` and `2` to half-precision. There aren't any more half-precision numbers between those limits. The values in the lower half of the interval should round down to `2-e` and the values in the upper half should round up to `2`. The round-to-even convention says that the midpoint, `2-e/2`, should round to `2`.

But I wasn't careful about how I did the rounding. I just used the MATLAB `round` function, which doesn't follow the round-to-even convention. Worse yet, I didn't check to see if the fraction rounded up into the exponent field. I tried to do everything in one statement.

dbtype oldfp16 48:49

48 u = bitxor(uint16(round(1024*f)), ... 49 bitshift(uint16(e+15),10));

For values between `2-e/2` and `2`, the `round(1024*f)` is `1024`, which requires eleven bits. The `bitxor` then clobbers the exponent field. I won't show the result here. If you have the May half-precision object on your machine, give it a try.

This doesn't just happen for values a little bit less than 2, it happens close to any power of 2.

#### The Fix

We need a proper round-to-nearest-even function.

dbtype fp16 31

31 rndevn = @(s) round(s-(rem(s,2)==0.5));

Then don't try to do it all at once.

dbtype fp16 50:56

50 % normal 51 t = uint16(rndevn(1024*f)); 52 if t==1024 53 t = uint16(0); 54 e = e+1; 55 end 56 u = bitxor(t, bitshift(uint16(e+15),10));

It turns out that the branch for denormals is OK, once `round` is replaced by `rndeven`. The exponent for denormals is all zeros, so when the fraction encroaches it produces the correct result.

A similar fix is required for the quarter-precision constructor, `fp8`.

#### Cleve's Laboratory

I am updating the code on the MATLAB Central File Exchange. Only `@fp16/fp16` and `@fp8/fp8` are affected. (Give me a few days to complete the update process.)

#### Thanks

Thanks to Pierre and Nick.

Get the MATLAB code

Published with MATLAB® R2017a

**Category:**- Numerical Analysis,
- Precision

### Note

Comments are closed.

## Recent Comments