Variable Format Half Precision Floating Point Arithmetic

A year and a half ago I wrote a post about "half precision" 16-bit floating point arithmetic, Moler on fp16. I followed this with a bug fix, bug in fp16. Both posts were about fp16, defined in IEEE standard 754. This is only one of 15 possible 16-bit formats. In this post I am going to consider all 15.

There is also interest in a new format, bfloat16. A recent post by Nick Higham, compares the two, Higham on fp16 and bfloat16. Nick mentions the interest in the two formats by Intel, AMD, NVIDIA, Arm and Google. These formats are two out of the 15.



A floating point format is characterized by two parameters, p, the number of bits in the fraction, and q, the number of bits in the exponent. For half precision, we always have p+q = 15. This leaves one bit for the sign.

The two formats of most interest are the IEEE standard fp16 with p = 10 and the new bfloat16 with p = 7. The new format has three more bits in the exponent and three fewer bits in the fraction than the standard. This increased range at the expense of precision is proving useful in machine learning and image processing.


My new MATLAB® object is an elaboration of fp16, so I named it vfp16. Here is its help entry.

   help @vfp16/vfp16
  vfp16.  Constructor for variable format 16-bit floating point object.
  y = vfp16(x) is an array, the same size as x, of uint16s.   Each
      element is packed with p fraction bits, 15-p exponent bits and
      one sign bit.  A single value of the precision, p, is associated
      with the entire array.
      Any integer value of p in the range 0 <= p <= 15 is allowed,
      although the extreme values are of questionable utility.
      The default precision is p = 10 for IEEE standard fp16.
  y = vfp16(x,p) has precision p without changing the working precision.
  Three key-value pairs may be set:
     vfp16('precision',p) sets the working precision to p.
     vfp16('subnormals','on'/'off) controls gradual underflow.
     vfp16('fma','off'/'on') controls fused multiply adds.
     Up to three key-value pairs are allowed in a single call to vfp16.
  Two formats exist in hardware:
  vfp16('fp16') sets p = 10, subnormals = on, fma = off (the default).
  vfp16('bfloat16') sets p = 7, subnormals = off, fma = on.
  vfp16('precision') is the current working precision.
  vfp16('subnormals') is the current status of gradual underflow.
  vfp16('fma') is the current status of fused multiply adds.
  u = packed(y) is the uint16s in y.
  p = precision(y) is the value for the entire array y. 
  See also: vfp16/single,

    Reference page in Doc Center
       doc vfp16


The key attributes of variable format half precision are displayed in the following chart, vfp16_anatomy. The extreme exponent range makes it necessary to use the vpa arithmetic of the Symbolic Math Toolbox™ to compute vfp16_anatomy.

  • p is the precision, the number of bits in the fraction.
  • bias is the range of the exponent.
  • eps is the distance from 1 to the next larger vfp16 number.
  • realmax is the largest vfp16 number.
  • realmin is the smallest normalized vfp16 number.
  • tiny is the smallest subnormal vfp16 number.
   p  bias        eps     realmax      realmin         tiny
[  1, 8191,       0.5, 8.181e2465, 3.667e-2466, 1.834e-2466]
[  2, 4095,      0.25, 9.138e1232,  3.83e-1233, 9.575e-1234]
[  3, 2047,     0.125,   3.03e616,  1.238e-616,  1.547e-617]
[  4, 1023,    0.0625,  1.742e308,  2.225e-308,  1.391e-309]
[  5,  511,   0.03125,   1.32e154,  2.983e-154,  9.323e-156]
[  6,  255,   0.01562,   1.149e77,   3.454e-77,   5.398e-79]
[  7,  127,  0.007812,    3.39e38,   1.175e-38,   9.184e-41]
[  8,   63,  0.003906,   1.841e19,   2.168e-19,    8.47e-22]
[  9,   31,  0.001953,    4.291e9,   9.313e-10,   1.819e-12]
[ 10,   15, 0.0009766,    65500.0,    6.104e-5,     5.96e-8]
[ 11,    7, 0.0004883,      255.9,     0.01562,    7.629e-6]
[ 12,    3, 0.0002441,       16.0,        0.25,    6.104e-5]
[ 13,    1, 0.0001221,        4.0,         1.0,   0.0001221]
[ 14,    0,  6.104e-5,        2.0,         2.0,   0.0001221]
[ 15,  NaN,  3.052e-5,        NaN,         NaN,         NaN]


Here is the binary display of vfp16(x,p) as p varies for an x between 2 and 4. This is the same output as the animated calculator below.

   format compact
   format long

   x = 10/3
x =
   for p = 0:15
       y = binary(vfp16(x,p));
       fprintf('%5d   %18s\n',p,y)
    0   0 100000000000001 
    1   0 10000000000000 1
    2   0 1000000000000 11
    3   0 100000000000 101
    4   0 10000000000 1011
    5   0 1000000000 10101
    6   0 100000000 101011
    7   0 10000000 1010101
    8   0 1000000 10101011
    9   0 100000 101010101
   10   0 10000 1010101011
   11   0 1000 10101010101
   12   0 100 101010101011
   13   0 10 1010101010101
   14   0 1 10000000000000
   15   0  101010101010101

The upper triangle in the output is the biased exponent field and the lower triangle is the fraction. At p = 0 there is no room for a fraction and at p = 15 there is no room for an exponent. Consequently, these formats are not very useful.

Here are the results when these values are converted back to doubles. As the precision increases the error is cut in half at each step. The last two values of p show failure. The important values of p are 7 and 10.

   disp('    p           y                  x-y')
   for p = 0:15
       y = double(vfp16(x,p));
       fprintf('%5d %18.13f   %18.13f\n',p,y,x-y)
    p           y                  x-y
    0    4.0000000000000     -0.6666666666667
    1    3.0000000000000      0.3333333333333
    2    3.5000000000000     -0.1666666666667
    3    3.2500000000000      0.0833333333333
    4    3.3750000000000     -0.0416666666667
    5    3.3125000000000      0.0208333333333
    6    3.3437500000000     -0.0104166666667
    7    3.3281250000000      0.0052083333333
    8    3.3359375000000     -0.0026041666667
    9    3.3320312500000      0.0013020833333
   10    3.3339843750000     -0.0006510416667
   11    3.3330078125000      0.0003255208333
   12    3.3334960937500     -0.0001627604167
   13    3.3332519531250      0.0000813802083
   14                NaN                  NaN
   15    2.6666259765625      0.6667073567708

Conversion to Single

With eight bits in the exponent, the new bfloat16 has a significant advantage over the standard fp16 when it comes to conversion between single and half precision. The sign and exponent fields of the half precision word are the same as single precision, so the half precision word is just the front half of the single precision word. For example,

   format hex

Fused Multiply Add

The cores of many algorithms for matrix computation often involve one of two fundamental vector operations, "dot" and "daxpy". Let x and y be column vectors of length n and let a be a scalar. The extended dot product is

x'*y + a

The so-called elementary vector operation, or "daxpy" for "double precision a times x plus y" is

a*x + y

Both involve loops of length n around a multiplication followed by an addition. Many modern computer architectures have fused multiply add instructions, FMA, where this operation is a single instruction. Moreover, the multiplication produces a result in twice the working precision and the addition is done with that higher precision. The bfloat16 specification includes FMA. With our vfp16 method FMA can be turned off and on. It is off by default.


I wrote a blog post about floating point denormals several years ago. In the revision of IEEE 754 denormals were renamed subnormals. Whatever they are called, they are relatively rare with the large range of bfloat16, so they can be turned off.

All the properties of bfloat16 can be obtained with the statement


which sets precision = 7, subnormals = off and fma = on.

The defaults can be restored with


which sets precision = 10, subnormals = on and fma = off .


This animation shows how I've added @vdp16 to the calculator that I mentioned in blog posts about half precision and roman numerals. When the radio button for the word size 16 is selected, a slider appears that allows you to select the precision. The button is labeled "16", followed by the precision. Moving the slider up increases the number of bits in the fraction and consequently increases the precision, while moving the slider down decreases p and the precision.

I could make a variable format quarter precision object, but none of the other formats are useful. And variable format single and double precision objects have lots of bits, but little else to offer.

To be continued

I'm not done with this. I am still in the process of extending the linear algebra functions to variable format. I hope to report on that in a couple of weeks. In the meantime, I'll post Version 4.20 of Cleve's Laboratory with what I have done so far.

Published with MATLAB® R2018b



To leave a comment, please click here to sign in to your MathWorks Account or create a new one.