A year and a half ago I wrote a post about "half precision" 16-bit floating point arithmetic, Moler on fp16. I followed this with a bug fix, bug in fp16. Both posts were about fp16, defined in IEEE standard 754. This is only one of 15 possible 16-bit formats. In this post I am going to consider all 15.
There is also interest in a new format, bfloat16. A recent post by Nick Higham, compares the two, Higham on fp16 and bfloat16. Nick mentions the interest in the two formats by Intel, AMD, NVIDIA, Arm and Google. These formats are two out of the 15.
A floating point format is characterized by two parameters, p, the number of bits in the fraction, and q, the number of bits in the exponent. For half precision, we always have p+q = 15. This leaves one bit for the sign.
The two formats of most interest are the IEEE standard fp16 with p = 10 and the new bfloat16 with p = 7. The new format has three more bits in the exponent and three fewer bits in the fraction than the standard. This increased range at the expense of precision is proving useful in machine learning and image processing.
My new MATLAB® object is an elaboration of fp16, so I named it vfp16. Here is its help entry.
vfp16. Constructor for variable format 16-bit floating point object. y = vfp16(x) is an array, the same size as x, of uint16s. Each element is packed with p fraction bits, 15-p exponent bits and one sign bit. A single value of the precision, p, is associated with the entire array. Any integer value of p in the range 0 <= p <= 15 is allowed, although the extreme values are of questionable utility. The default precision is p = 10 for IEEE standard fp16. y = vfp16(x,p) has precision p without changing the working precision. Three key-value pairs may be set: vfp16('precision',p) sets the working precision to p. vfp16('subnormals','on'/'off) controls gradual underflow. vfp16('fma','off'/'on') controls fused multiply adds. Up to three key-value pairs are allowed in a single call to vfp16. Two formats exist in hardware: vfp16('fp16') sets p = 10, subnormals = on, fma = off (the default). vfp16('bfloat16') sets p = 7, subnormals = off, fma = on. vfp16('precision') is the current working precision. vfp16('subnormals') is the current status of gradual underflow. vfp16('fma') is the current status of fused multiply adds. u = packed(y) is the uint16s in y. p = precision(y) is the value for the entire array y. See also: vfp16/single, http://blogs.mathworks.com/cleve/2019/01/16. Reference page in Doc Center doc vfp16
The key attributes of variable format half precision are displayed in the following chart, vfp16_anatomy. The extreme exponent range makes it necessary to use the vpa arithmetic of the Symbolic Math Toolbox™ to compute vfp16_anatomy.
- p is the precision, the number of bits in the fraction.
- bias is the range of the exponent.
- eps is the distance from 1 to the next larger vfp16 number.
- realmax is the largest vfp16 number.
- realmin is the smallest normalized vfp16 number.
- tiny is the smallest subnormal vfp16 number.
p bias eps realmax realmin tiny [ 1, 8191, 0.5, 8.181e2465, 3.667e-2466, 1.834e-2466] [ 2, 4095, 0.25, 9.138e1232, 3.83e-1233, 9.575e-1234] [ 3, 2047, 0.125, 3.03e616, 1.238e-616, 1.547e-617] [ 4, 1023, 0.0625, 1.742e308, 2.225e-308, 1.391e-309] [ 5, 511, 0.03125, 1.32e154, 2.983e-154, 9.323e-156] [ 6, 255, 0.01562, 1.149e77, 3.454e-77, 5.398e-79] [ 7, 127, 0.007812, 3.39e38, 1.175e-38, 9.184e-41] [ 8, 63, 0.003906, 1.841e19, 2.168e-19, 8.47e-22] [ 9, 31, 0.001953, 4.291e9, 9.313e-10, 1.819e-12] [ 10, 15, 0.0009766, 65500.0, 6.104e-5, 5.96e-8] [ 11, 7, 0.0004883, 255.9, 0.01562, 7.629e-6] [ 12, 3, 0.0002441, 16.0, 0.25, 6.104e-5] [ 13, 1, 0.0001221, 4.0, 1.0, 0.0001221] [ 14, 0, 6.104e-5, 2.0, 2.0, 0.0001221] [ 15, NaN, 3.052e-5, NaN, NaN, NaN]
Here is the binary display of vfp16(x,p) as p varies for an x between 2 and 4. This is the same output as the animated calculator below.
format compact format long x = 10/3
x = 3.333333333333333
for p = 0:15 y = binary(vfp16(x,p)); fprintf('%5d %18s\n',p,y) end
0 0 100000000000001 1 0 10000000000000 1 2 0 1000000000000 11 3 0 100000000000 101 4 0 10000000000 1011 5 0 1000000000 10101 6 0 100000000 101011 7 0 10000000 1010101 8 0 1000000 10101011 9 0 100000 101010101 10 0 10000 1010101011 11 0 1000 10101010101 12 0 100 101010101011 13 0 10 1010101010101 14 0 1 10000000000000 15 0 101010101010101
The upper triangle in the output is the biased exponent field and the lower triangle is the fraction. At p = 0 there is no room for a fraction and at p = 15 there is no room for an exponent. Consequently, these formats are not very useful.
Here are the results when these values are converted back to doubles. As the precision increases the error is cut in half at each step. The last two values of p show failure. The important values of p are 7 and 10.
disp(' p y x-y') for p = 0:15 y = double(vfp16(x,p)); fprintf('%5d %18.13f %18.13f\n',p,y,x-y) end
p y x-y 0 4.0000000000000 -0.6666666666667 1 3.0000000000000 0.3333333333333 2 3.5000000000000 -0.1666666666667 3 3.2500000000000 0.0833333333333 4 3.3750000000000 -0.0416666666667 5 3.3125000000000 0.0208333333333 6 3.3437500000000 -0.0104166666667 7 3.3281250000000 0.0052083333333 8 3.3359375000000 -0.0026041666667 9 3.3320312500000 0.0013020833333 10 3.3339843750000 -0.0006510416667 11 3.3330078125000 0.0003255208333 12 3.3334960937500 -0.0001627604167 13 3.3332519531250 0.0000813802083 14 NaN NaN 15 2.6666259765625 0.6667073567708
With eight bits in the exponent, the new bfloat16 has a significant advantage over the standard fp16 when it comes to conversion between single and half precision. The sign and exponent fields of the half precision word are the same as single precision, so the half precision word is just the front half of the single precision word. For example,
format hex disp(single(pi)); disp(packed(vfp16(pi,7)))
The cores of many algorithms for matrix computation often involve one of two fundamental vector operations, "dot" and "daxpy". Let x and y be column vectors of length n and let a be a scalar. The extended dot product is
x'*y + a
The so-called elementary vector operation, or "daxpy" for "double precision a times x plus y" is
a*x + y
Both involve loops of length n around a multiplication followed by an addition. Many modern computer architectures have fused multiply add instructions, FMA, where this operation is a single instruction. Moreover, the multiplication produces a result in twice the working precision and the addition is done with that higher precision. The bfloat16 specification includes FMA. With our vfp16 method FMA can be turned off and on. It is off by default.
I wrote a blog post about floating point denormals several years ago. In the revision of IEEE 754 denormals were renamed subnormals. Whatever they are called, they are relatively rare with the large range of bfloat16, so they can be turned off.
All the properties of bfloat16 can be obtained with the statement
which sets precision = 7, subnormals = off and fma = on.
The defaults can be restored with
which sets precision = 10, subnormals = on and fma = off .
This animation shows how I've added @vdp16 to the calculator that I mentioned in blog posts about half precision and roman numerals. When the radio button for the word size 16 is selected, a slider appears that allows you to select the precision. The button is labeled "16", followed by the precision. Moving the slider up increases the number of bits in the fraction and consequently increases the precision, while moving the slider down decreases p and the precision.
I could make a variable format quarter precision object, but none of the other formats are useful. And variable format single and double precision objects have lots of bits, but little else to offer.
I'm not done with this. I am still in the process of extending the linear algebra functions to variable format. I hope to report on that in a couple of weeks. In the meantime, I'll post Version 4.20 of Cleve's Laboratory with what I have done so far.
To leave a comment, please click here to sign in to your MathWorks Account or create a new one.