# Variable Format Half Precision Floating Point Arithmetic

A year and a half ago I wrote a post about "half precision" 16-bit floating point arithmetic, Moler on fp16. I followed this with a bug fix, bug in fp16. Both posts were about *fp16*, defined in IEEE standard 754. This is only one of 15 possible 16-bit formats. In this post I am going to consider all 15.

There is also interest in a new format, *bfloat16*. A recent post by Nick Higham, compares the two, Higham on fp16 and bfloat16. Nick mentions the interest in the two formats by Intel, AMD, NVIDIA, Arm and Google. These formats are two out of the 15.

### Contents

#### Formats

A floating point format is characterized by two parameters, p, the number of bits in the fraction, and q, the number of bits in the exponent. For half precision, we always have p+q = 15. This leaves one bit for the sign.

The two formats of most interest are the IEEE standard *fp16* with p = 10 and the new *bfloat16* with p = 7. The new format has three more bits in the exponent and three fewer bits in the fraction than the standard. This increased range at the expense of precision is proving useful in machine learning and image processing.

#### vfp16

My new MATLAB® object is an elaboration of `fp16`, so I named it `vfp16`. Here is its `help` entry.

```
help @vfp16/vfp16
```

vfp16. Constructor for variable format 16-bit floating point object. y = vfp16(x) is an array, the same size as x, of uint16s. Each element is packed with p fraction bits, 15-p exponent bits and one sign bit. A single value of the precision, p, is associated with the entire array. Any integer value of p in the range 0 <= p <= 15 is allowed, although the extreme values are of questionable utility. The default precision is p = 10 for IEEE standard fp16. y = vfp16(x,p) has precision p without changing the working precision. Three key-value pairs may be set: vfp16('precision',p) sets the working precision to p. vfp16('subnormals','on'/'off) controls gradual underflow. vfp16('fma','off'/'on') controls fused multiply adds. Up to three key-value pairs are allowed in a single call to vfp16. Two formats exist in hardware: vfp16('fp16') sets p = 10, subnormals = on, fma = off (the default). vfp16('bfloat16') sets p = 7, subnormals = off, fma = on. vfp16('precision') is the current working precision. vfp16('subnormals') is the current status of gradual underflow. vfp16('fma') is the current status of fused multiply adds. u = packed(y) is the uint16s in y. p = precision(y) is the value for the entire array y. See also: vfp16/single, http://blogs.mathworks.com/cleve/2019/01/16. Reference page in Doc Center doc vfp16

#### Anatomy

The key attributes of variable format half precision are displayed in the following chart, `vfp16_anatomy`. The extreme exponent range makes it necessary to use the `vpa` arithmetic of the Symbolic Math Toolbox™ to compute `vfp16_anatomy`.

`p`is the precision, the number of bits in the fraction.`bias`is the range of the exponent.`eps`is the distance from 1 to the next larger`vfp16`number.`realmax`is the largest`vfp16`number.`realmin`is the smallest normalized`vfp16`number.`tiny`is the smallest subnormal`vfp16`number.

vfp16_anatomy

p bias eps realmax realmin tiny [ 1, 8191, 0.5, 8.181e2465, 3.667e-2466, 1.834e-2466] [ 2, 4095, 0.25, 9.138e1232, 3.83e-1233, 9.575e-1234] [ 3, 2047, 0.125, 3.03e616, 1.238e-616, 1.547e-617] [ 4, 1023, 0.0625, 1.742e308, 2.225e-308, 1.391e-309] [ 5, 511, 0.03125, 1.32e154, 2.983e-154, 9.323e-156] [ 6, 255, 0.01562, 1.149e77, 3.454e-77, 5.398e-79] [ 7, 127, 0.007812, 3.39e38, 1.175e-38, 9.184e-41] [ 8, 63, 0.003906, 1.841e19, 2.168e-19, 8.47e-22] [ 9, 31, 0.001953, 4.291e9, 9.313e-10, 1.819e-12] [ 10, 15, 0.0009766, 65500.0, 6.104e-5, 5.96e-8] [ 11, 7, 0.0004883, 255.9, 0.01562, 7.629e-6] [ 12, 3, 0.0002441, 16.0, 0.25, 6.104e-5] [ 13, 1, 0.0001221, 4.0, 1.0, 0.0001221] [ 14, 0, 6.104e-5, 2.0, 2.0, 0.0001221] [ 15, NaN, 3.052e-5, NaN, NaN, NaN]

#### Example

Here is the binary display of `vfp16(x,p)` as `p` varies for an `x` between 2 and 4. This is the same output as the animated calculator below.

format compact format long x = 10/3

x = 3.333333333333333

for p = 0:15 y = binary(vfp16(x,p)); fprintf('%5d %18s\n',p,y) end

0 0 100000000000001 1 0 10000000000000 1 2 0 1000000000000 11 3 0 100000000000 101 4 0 10000000000 1011 5 0 1000000000 10101 6 0 100000000 101011 7 0 10000000 1010101 8 0 1000000 10101011 9 0 100000 101010101 10 0 10000 1010101011 11 0 1000 10101010101 12 0 100 101010101011 13 0 10 1010101010101 14 0 1 10000000000000 15 0 101010101010101

The upper triangle in the output is the biased exponent field and the lower triangle is the fraction. At `p = 0` there is no room for a fraction and at `p = 15` there is no room for an exponent. Consequently, these formats are not very useful.

Here are the results when these values are converted back to doubles. As the precision increases the error is cut in half at each step. The last two values of `p` show failure. The important values of `p` are 7 and 10.

disp(' p y x-y') for p = 0:15 y = double(vfp16(x,p)); fprintf('%5d %18.13f %18.13f\n',p,y,x-y) end

p y x-y 0 4.0000000000000 -0.6666666666667 1 3.0000000000000 0.3333333333333 2 3.5000000000000 -0.1666666666667 3 3.2500000000000 0.0833333333333 4 3.3750000000000 -0.0416666666667 5 3.3125000000000 0.0208333333333 6 3.3437500000000 -0.0104166666667 7 3.3281250000000 0.0052083333333 8 3.3359375000000 -0.0026041666667 9 3.3320312500000 0.0013020833333 10 3.3339843750000 -0.0006510416667 11 3.3330078125000 0.0003255208333 12 3.3334960937500 -0.0001627604167 13 3.3332519531250 0.0000813802083 14 NaN NaN 15 2.6666259765625 0.6667073567708

#### Conversion to Single

With eight bits in the exponent, the new *bfloat16* has a significant advantage over the standard *fp16* when it comes to conversion between single and half precision. The sign and exponent fields of the half precision word are the same as single precision, so the half precision word is just the front half of the single precision word. For example,

```
format hex
disp(single(pi));
disp(packed(vfp16(pi,7)))
```

40490fdb 4049

#### Fused Multiply Add

The cores of many algorithms for matrix computation often involve one of two fundamental vector operations, "dot" and "daxpy". Let `x` and `y` be column vectors of length `n` and let `a` be a scalar. The extended dot product is

x'*y + a

The so-called elementary vector operation, or "daxpy" for "double precision a times x plus y" is

a*x + y

Both involve loops of length n around a multiplication followed by an addition. Many modern computer architectures have fused multiply add instructions, FMA, where this operation is a single instruction. Moreover, the multiplication produces a result in twice the working precision and the addition is done with that higher precision. The *bfloat16* specification includes FMA. With our `vfp16` method FMA can be turned off and on. It is off by default.

#### Subnormals

I wrote a blog post about floating point denormals several years ago. In the revision of IEEE 754 denormals were renamed subnormals. Whatever they are called, they are relatively rare with the large range of *bfloat16*, so they can be turned off.

All the properties of *bfloat16* can be obtained with the statement

```
vfp16('bfloat16')
```

which sets `precision = 7`, `subnormals = off` and `fma = on`.

The defaults can be restored with

```
vfp16('fp16')
```

which sets `precision = 10`, `subnormals = on` and `fma = off` .

#### Calculator

This animation shows how I've added `@vdp16` to the calculator that I mentioned in blog posts about half precision and roman numerals. When the radio button for the word size 16 is selected, a slider appears that allows you to select the precision. The button is labeled "16", followed by the precision. Moving the slider up increases the number of bits in the fraction and consequently increases the precision, while moving the slider down decreases `p` and the precision.

I could make a variable format quarter precision object, but none of the other formats are useful. And variable format single and double precision objects have lots of bits, but little else to offer.

#### To be continued

I'm not done with this. I am still in the process of extending the linear algebra functions to variable format. I hope to report on that in a couple of weeks. In the meantime, I'll post Version 4.20 of Cleve's Laboratory with what I have done so far.

**Category:**- Numerical Analysis,
- Precision

## Comments

To leave a comment, please click here to sign in to your MathWorks Account or create a new one.