My post on May 8 was about "half-precision" and "quarter-precision" arithmetic. I also added code for objects fp16 and fp8 to Cleve's Laboratory. A few days ago I heard from Pierre Blanchard and my good friend Nick Higham at the University of Manchester about a serious bug in the constructors for those objects.
format long e = eps(fp16(1))
e = 9.765625000000000e-04
The value of e is
ans = 9.765625000000000e-04
This is the relative precision of half-precision floating point numbers, which is the spacing of half-precision numbers in the interval between 1 and 2. So in binary the next number after 1 is
0 01111 0000000001
And the last number before 2 is
0 01111 1111111111
The three fields displayed are the sign, which is one bit, the exponent, which has five bits and the fraction, which has ten bits.
So far, so good. The bug shows up when I try to convert any number between 2-e and 2 to half-precision. There aren't any more half-precision numbers between those limits. The values in the lower half of the interval should round down to 2-e and the values in the upper half should round up to 2. The round-to-even convention says that the midpoint, 2-e/2, should round to 2.
But I wasn't careful about how I did the rounding. I just used the MATLAB round function, which doesn't follow the round-to-even convention. Worse yet, I didn't check to see if the fraction rounded up into the exponent field. I tried to do everything in one statement.
dbtype oldfp16 48:49
48 u = bitxor(uint16(round(1024*f)), ... 49 bitshift(uint16(e+15),10));
For values between 2-e/2 and 2, the round(1024*f) is 1024, which requires eleven bits. The bitxor then clobbers the exponent field. I won't show the result here. If you have the May half-precision object on your machine, give it a try.
This doesn't just happen for values a little bit less than 2, it happens close to any power of 2.
We need a proper round-to-nearest-even function.
dbtype fp16 31
31 rndevn = @(s) round(s-(rem(s,2)==0.5));
Then don't try to do it all at once.
dbtype fp16 50:56
50 % normal 51 t = uint16(rndevn(1024*f)); 52 if t==1024 53 t = uint16(0); 54 e = e+1; 55 end 56 u = bitxor(t, bitshift(uint16(e+15),10));
It turns out that the branch for denormals is OK, once round is replaced by rndeven. The exponent for denormals is all zeros, so when the fraction encroaches it produces the correct result.
A similar fix is required for the quarter-precision constructor, fp8.
I am updating the code on the MATLAB Central File Exchange. Only @fp16/fp16 and @fp8/fp8 are affected. (Give me a few days to complete the update process.)
Thanks to Pierre and Nick.
要发表评论，请点击 此处 登录到您的 MathWorks 帐户或创建一个新帐户。