“Half Precision” 16-bit Floating Point Arithmetic 3

Posted by Cleve Moler,

The floating point arithmetic format that requires only 16 bits of storage is becoming increasingly popular. Also known as half precision or binary16, the format is useful when memory is a scarce resource.

Contents

Background

The IEEE 754 standard, published in 1985, defines formats for floating point numbers that occupy 32 or 64 bits of storage. These formats are known as binary32 and binary64, or more frequently as single and double precision. For many years MATLAB used only double precision and it remains our default format. Single precision has been added gradually over the last several years and is now also fully supported.

A revision of IEEE 754, published in 2008, defines a floating point format that occupies only 16 bits. Known as binary16, it is primarily intended to reduce storage and memory bandwidth requirements. Since it provides only "half" precision, its use for actual computation is problematic. An interesting discussion of its utility as an image processing format with increased dynamic range is provided by Industrial Light and Magic. Hardware support for half precision is now available on many processors, including the GPU in the Apple iPhone 7. Here is a link to an extensive article about half precision on the NVIDIA GeForce GPU.

Floating point anatomy

The format of a floating point number is characterized by two parameters, p, the number of bits in the fraction and q, the number of bits in the exponent. I will consider four precisions, quarter, half, single, and double. The quarter-precision format is something that I just invented for this blog post; it is not standard and actually not very useful.

The four pairs of characterizing parameters are

   p = [4, 10, 23, 52];
   q = [3, 5, 8, 11];

With these values of p and q, and with one more bit for the sign, the total number of bits in the word, w, is a power of two.

   w = p + q + 1
w =
     8    16    32    64

Normalized numbers

Most floating point numbers are normalized, and are expressed as

$$ x = \pm (1+f)2^e $$

The fraction $f$ is in the half open interval

$$ 0 \leq f < 1 $$

The binary representation of $f$ requires at most p bits. In other words $2^p f$ is an integer in the range

$$ 0 \leq 2^p f < 2^p $$

The exponent $e$ is an integer in the range

$$ -b+1 \leq e \leq b $$

The quantity $b$ is both the largest exponent and the bias.

$$ b = 2^{q-1} - 1 $$

   b = 2.^(q-1)-1
b =
           3          15         127        1023

The fractional part of a normalized number is $1+f$, but only $f$ needs to be stored. That leading $1$ is known as the hidden bit.

Subnormal

There are two values of the exponent $e$ for which the biased exponent, $e+b$, reaches the smallest and largest values possible to represent in q bits. The smallest is

$$ e + b = 0 $$

The corresponding floating point numbers do not have a hidden leading bit. These are the subnormal or denormal numbers.

$$ x = \pm f 2^{-b} $$

Infinity and Not-A-Number

The largest possible biased exponent is

$$ e + b = 2^q-1 $$.

Quantities with this exponent field represent infinities and NaN, or Not-A-Number.

The percentage of floating point numbers that are exceptional because they are subnormal, infinity or NaN increases as the precision decreases. Exceptional exponents are only $2$ values out of $2^q$. For double precision this is $2/2^{11}$, which is less than a tenth of a percent, but for half precision it is $2/2^5$, which is more than 6 percent. And fully one-fourth of all my toy quarter precision floating point numbers are exceptional.

Encode the sign bit with s = 0 for nonnegative and s = 1 for negative. And encode the exponent with an offsetting bias, b. Then a floating point number can be packed in w bits with

  x = [s e+b 2^p*f]

Precision and range

epsilon

If a real number cannot be expressed with a binary expansion requiring at most p bits, it must be approximated by a floating point number that does have such a binary representation. This is roundoff error. The important quantity characterizing precision is machine epsilon, or eps. In MATLAB, eps(x) is the distance from x to the next larger (in absolute value) floating point number. With no argument, eps is simply the difference between 1 and the next larger floating point number.

    format shortg
    eps = 2.^(-p)
eps =
       0.0625   0.00097656   1.1921e-07   2.2204e-16

This tells us that double precision is good for about 16 decimal digits of accuracy, single for about 7 decimal digits, half for about 3, and quarter for barely more than one.

realmax

If a real number, or the result of an arithmetic operation, is too large to be represented, it overflows and is replaced infinity. The largest floating point number that does not overflow is

    realmax = 2.^b.*(2-eps)
realmax =
         15.5        65504   3.4028e+38  1.7977e+308

realmin

Underflow and representation of very small numbers is more complicated. The smallest normalized floating point number is

    realmin = 2.^(-b+1)
realmin =
         0.25   6.1035e-05   1.1755e-38  2.2251e-308

tiny

But there are numbers smaller than realmin. IEEE 754 introduced the notion of gradual underflow and denormal numbers. In the 2008 revised standard their name was changed to subnormal.

Think of roundoff in numbers near underflow. Before 754 floating point numbers had the disconcerting property that x and y could be unequal, but their difference could underflow so x-y becomes 0. With 754 the gap between 0 and realmin is filled with numbers whose spacing is the same as the spacing between realmin and 2*realmin. I like to call this spacing, and the smallest subnormal number, tiny.

    tiny = realmin.*eps
tiny =
     0.015625   5.9605e-08   1.4013e-45  4.9407e-324

Floating point integers

flintmax

It is possible to do integer arithmetic with floating point numbers. I like to call such numbers flints. When we write the numbers $3$ and $3.0$, they are different descriptions of the same integer, but we think of one as fixed point and the other as floating point. The largest flint is flintmax.

    flintmax = 2./eps
flintmax =
           32         2048   1.6777e+07   9.0072e+15

Technically all the floating point numbers larger than flintmax are integers, but the spacing between them is larger than one, so it is not safe to use them for integer arithmetic. Only integer-valued floating point numbers between 0 and flintmax are allowed to be called flints.

Table

Let's collect all these anatomical characteristics together in a new MATLAB table.

   T = [w; p; q; b; eps; realmax; realmin; tiny; flintmax];

   T = table(T(:,1), T(:,2), T(:,3), T(:,4), ...
        'variablenames',{'quarter','half','single','double'}, ...
        'rownames',{'w','p','q','b','eps','realmax','realmin', ...
                    'tiny','flintmax'});
   disp(T)
                quarter        half         single        double   
                ________    __________    __________    ___________
    w                  8            16            32             64
    p                  4            10            23             52
    q                  3             5             8             11
    b                  3            15           127           1023
    eps           0.0625    0.00097656    1.1921e-07     2.2204e-16
    realmax         15.5         65504    3.4028e+38    1.7977e+308
    realmin         0.25    6.1035e-05    1.1755e-38    2.2251e-308
    tiny        0.015625    5.9605e-08    1.4013e-45    4.9407e-324
    flintmax          32          2048    1.6777e+07     9.0072e+15

fp8 and fp16

Version 3.1 of Cleve's Laboratory includes code for objects @fp8 and @fp16 that begin to provide full implementations of quarter-precision and half-precision arithmetic.

The methods currently provided are

   methods(fp16)
Methods for class fp16:

abs         eps         isfinite    mrdivide    rem         subsref     
binary      eq          le          mtimes      round       svd         
ctranspose  fix         lt          ne          sign        tril        
diag        fp16        lu          norm        single      triu        
disp        ge          max         plus        size        uminus      
display     gt          minus       realmax     subsasgn    
double      hex         mldivide    realmin     subsindex   

These provide only partial implementations because the arithmetic is not done on the compact forms. We cheat. For each individual scalar operation, the operands are unpacked from their short storage into old fashioned doubles. The operation is then carried out by existing double precision code and the results returned to the shorter formats. This simulates the reduced precision and restricted range, but requires relatively little new code.

All of the work is done in the constructors @fp8/fp8.m and @fp16/fp16.m and what we might call the "deconstructors" @fp8/double.m and @fp16/double.m. The constructors convert ordinary floating point numbers to reduced precision representations by packing as many of the 32 or 64 bits as will fit into 8 or 16 bit words. The deconstructors do the reverse by unpacking things.

Once these methods are available, almost everything else is trivial. The code for most operations is like this one for the overloaded addition.

    type @fp16/plus.m
function z = plus(x,y)
   z = fp16(double(x) + double(y));
end

Wikipedia test suite

The Wikipedia page about half-precision includes several 16-bit examples with the sign, exponent, and fraction fields separated. I've added a couple more.

  0 01111 0000000000 = 1
  0 00101 0000000000 = 2^-10 = eps
  0 01111 0000000001 = 1+eps = 1.0009765625 (next smallest float after 1)
  1 10000 0000000000 = -2
  0 11110 1111111111 = 65504 (max half precision) = 2^15*(2-eps)
  0 00001 0000000000 = 2^-14 = r ~ 6.10352e-5 (minimum positive normal)
  0 00000 1111111111 = r*(1-eps) ~ 6.09756e-5 (maximum subnormal)
  0 00000 0000000001 = r*eps ~ 5.96046e-8 (minimum positive subnormal)
  0 00000 0000000000 ~ r*eps/2 (underflow to zero)
  0 00000 0000000000 = 0
  1 00000 0000000000 = -0
  0 11111 0000000000 = infinity
  1 11111 0000000000 = -infinity
  0 11111 1111111111 = NaN
  0 01101 0101010101 = 0.333251953125 ~ 1/3

This provides my test suite for checking fp16 operations on scalars.

   clear
   zero = fp16(0);
   one = fp16(1);
   eps = eps(one);
   r = realmin(one);

   tests = {'1','eps','1+eps','-2','2/r*(2-eps)', ...
            'r','r*(1-eps)','r*eps','r*eps/2', ...
            'zero','-zero','1/zero','-1/zero','zero/zero','1/3'};

Let's run the tests.

   for t = tests(:)'
       x = eval(t{:});
       y = fp16(x);
       z = binary(y);
       w = double(y);
       fprintf('  %18s  %04s %19.10g %19.10g    %s\n', ...
               z,hex(y),double(x),w,t{:})
   end
  0 01111 0000000000  3C00                   1                   1    1
  0 00101 0000000000  1400        0.0009765625        0.0009765625    eps
  0 01111 0000000001  3C01         1.000976563         1.000976563    1+eps
  1 10000 0000000000  C000                  -2                  -2    -2
  0 11110 1111111111  7BFF               65504               65504    2/r*(2-eps)
  0 00001 0000000000  0400     6.103515625e-05     6.103515625e-05    r
  0 00000 1111111111  03FF     6.097555161e-05     6.097555161e-05    r*(1-eps)
  0 00000 0000000001  0001     5.960464478e-08     5.960464478e-08    r*eps
  0 00000 0000000001  0001     5.960464478e-08     5.960464478e-08    r*eps/2
  0 00000 0000000000  0000                   0                   0    zero
  0 00000 0000000000  0000                   0                   0    -zero
  0 11111 0000000000  7C00                 Inf                 Inf    1/zero
  1 11111 0000000000  FC00                -Inf                -Inf    -1/zero
  1 11111 1111111111  FFFF                 NaN                 NaN    zero/zero
  0 01101 0101010101  3555        0.3333333333        0.3332519531    1/3

Matrix operations

Most of the methods in @fp8 and @fp16 handle matrices. The 4-by-4 magic square from Durer's Melancholia II provides my first example.

   clear
   format short
   M = fp16(magic(4))
 
M = 
 
    16     2     3    13
     5    11    10     8
     9     7     6    12
     4    14    15     1
 

Let's see how the packed 16-bit elements look in binary.

   B = binary(M)
B = 
  4×4 string array
  Columns 1 through 3
    "0 10011 0000000000"    "0 10000 0000000000"    "0 10000 1000000000"
    "0 10001 0100000000"    "0 10010 0110000000"    "0 10010 0100000000"
    "0 10010 0010000000"    "0 10001 1100000000"    "0 10001 1000000000"
    "0 10001 0000000000"    "0 10010 1100000000"    "0 10010 1110000000"
  Column 4
    "0 10010 1010000000"
    "0 10010 0000000000"
    "0 10010 1000000000"
    "0 01111 0000000000"

Check that the row sums are all equal. This matrix-vector multiply can be done exactly with the flints in the magic square.

   e = fp16(ones(4,1))
   Me = M*e
 
e = 
 
     1
     1
     1
     1
 
 
Me = 
 
    34
    34
    34
    34
 

fp16 backslash

I've overloaded mldivide, so I can solve linear systems and compute inverses. The actual computation is done by lutx, a "textbook" function that I wrote years ago, long before this half-precision project. But now the MATLAB object system insures that every individual arithmetic operation is done on unpacked fp16 numbers.

Let's generate a 5-by-5 matrix with random two-digit integer entries.

   A = fp16(randi(100,5,5))
 
A = 
 
    76    71    83    44    49
    75     4    70    39    45
    40    28    32    77    65
    66     5    96    80    71
    18    10     4    19    76
 

I am going to use fp16 backslash to invert A. So I need the identity matrix in half precision.

   I = fp16(eye(5))
 
I = 
 
     1     0     0     0     0
     0     1     0     0     0
     0     0     1     0     0
     0     0     0     1     0
     0     0     0     0     1
 

Now the overloaded backslash calls lutx to compute the inverse.

   X = A\I
 
X = 
 
   -0.0044    0.0346    0.0125   -0.0254   -0.0046
    0.0140   -0.0116    0.0026   -0.0046   -0.0002
    0.0071   -0.0176   -0.0200    0.0238    0.0008
   -0.0060   -0.0020    0.0200    0.0006   -0.0125
    0.0003   -0.0052   -0.0072    0.0052    0.0174
 

Compute the residual.

   AX = A*X
   R = I - AX
 
AX = 
 
    1.0000   -0.0011   -0.0002   -0.0007   -0.0001
   -0.0001    0.9990   -0.0001   -0.0007   -0.0001
   -0.0001   -0.0005    1.0000   -0.0003   -0.0002
   -0.0002   -0.0011   -0.0001    0.9995   -0.0002
   -0.0000   -0.0001    0.0001   -0.0001    1.0000
 
 
R = 
 
         0    0.0011    0.0002    0.0007    0.0001
    0.0001    0.0010    0.0001    0.0007    0.0001
    0.0001    0.0005         0    0.0003    0.0002
    0.0002    0.0011    0.0001    0.0005    0.0002
    0.0000    0.0001   -0.0001    0.0001         0
 

Both AX and R are what I expect from arithmetic that is accurate to only about three decimal digits.

Although I get a different random A every time I publish this blog post, I expect that it has a modest condition number.

   kappa = cond(A)
kappa =
   15.7828

Since A is not badly conditioned, I can invert the computed inverse and expect to get close to the original integer matrix.

   Z = X\I
 
Z = 
 
   76.1250   71.0000   83.1250   44.1250   49.1250
   75.1250    4.0234   70.1875   39.1250   45.1250
   40.0625   28.0000   32.0625   77.0000   65.0625
   66.1250    5.0234   96.1875   80.1250   71.1250
   18.0156   10.0000    4.0156   19.0156   76.0000
 

fp16 SVD

I have just nonchalantly computed cond(A). But cond isn't on the list of overload methods for fp16. I was pleasantly surprised to find that matlab\matfun\cond.m quietly worked on this new datatype. Here is the core of that code.

   dbtype cond 34:43, dbtype cond 47
34    if p == 2
35        s = svd(A);
36        if any(s == 0)   % Handle singular matrix
37            c = Inf(class(A));
38        else
39            c = max(s)./min(s);
40            if isempty(c)
41                c = zeros(class(A));
42            end
43        end

47    end

So it is correctly using the singular value decomposition, and I have svd overloaded. The SVD computation is handled by a 433 line M-file, svdtx, that, like lutx, was written before fp16 existed.

Let's compute the SVD again.

   [U,S,V] = svd(A)
 
U = 
 
   -0.5210   -0.4841    0.6802   -0.0315    0.1729
   -0.4260   -0.2449   -0.3572   -0.4561   -0.6504
   -0.4058    0.4683    0.1633    0.6284   -0.4409
   -0.5786    0.0268   -0.5620    0.1532    0.5703
   -0.2174    0.6968    0.2593   -0.6104    0.1658
 
 
S = 
 
  267.5000         0         0         0         0
         0   71.1875         0         0         0
         0         0   55.5000         0         0
         0         0         0   37.3750         0
         0         0         0         0   16.9531
 
 
V = 
 
   -0.4858   -0.3108   -0.0175   -0.3306   -0.7471
   -0.2063   -0.2128    0.9238    0.2195    0.1039
   -0.5332   -0.5205   -0.2920   -0.0591    0.5967
   -0.4534    0.2891   -0.2050    0.7993   -0.1742
   -0.4812    0.7095    0.1384   -0.4478    0.2126
 

Reconstruct A from its half precision SVD. It's not too shabby.

   USVT = U*S*V'
 
USVT = 
 
   75.9375   71.0000   83.0625   44.0313   49.0000
   75.0000    4.0117   70.0625   38.9688   45.0000
   40.0313   28.0469   32.0313   77.0625   65.0625
   66.0000    4.9688   96.0625   80.0000   71.0000
   18.0313   10.0234    4.0156   19.0313   76.0000
 

Finally, verify that we've been working all this time with fp16 objects.

   whos
  Name       Size            Bytes  Class     Attributes

  A          5x5               226  fp16                
  AX         5x5               226  fp16                
  B          4x4              1576  string              
  I          5x5               226  fp16                
  M          4x4               208  fp16                
  Me         4x1               184  fp16                
  R          5x5               226  fp16                
  S          5x5               226  fp16                
  U          5x5               226  fp16                
  USVT       5x5               226  fp16                
  V          5x5               226  fp16                
  X          5x5               226  fp16                
  Z          5x5               226  fp16                
  e          4x1               184  fp16                
  kappa      1x1                 8  double              

Calculator

I introduced a calculator in my blog post about Roman numerals. Version 3.1 of Cleve's Laboratory also includes a fancier version of the calculator that computes in four different precisions -- quarter, half, single, and double -- and displays the results in four different formats -- decimal, hexadecimal, binary, and Roman.

I like to demonstrate the calculator by clicking on the keys

    1 0 0 0 / 8 1 =

because the decimal expansion is a repeating .123456790.

Thanks

Thanks to MathWorkers Ben Tordoff, Steve Eddins, and Kiran Kintali who provided background and pointers to work on half precision.


Get the MATLAB code

Published with MATLAB® R2017a

3 CommentsOldest to Newest

Mark L. Stone replied on : 1 of 3

My SPAM filter question is “What is 6 + 8?”. Fortunately, my quarter precision calculator was able to handle that, albeit with precious little to spare.

I’m glad you mentioned that, Mark. It caused me to notice an inconsistency in my table summarizing the anatomical parameters of floating point number systems. The table for quarter precision says that flintmax is 32, because that’s 2/eps. But overflow happens above realmax, which is just 15.5. So flintmax should really be
min(2/eps,floor(flintmax))
That revision is only applicable for quarter precision.

I don’t know what totals the spam protection generates.

Add A Comment

What is 6 + 10?

Preview: hide