{"id":2459,"date":"2017-05-08T16:00:05","date_gmt":"2017-05-08T21:00:05","guid":{"rendered":"https:\/\/blogs.mathworks.com\/cleve\/?p=2459"},"modified":"2018-06-30T19:43:17","modified_gmt":"2018-07-01T00:43:17","slug":"half-precision-16-bit-floating-point-arithmetic","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/cleve\/2017\/05\/08\/half-precision-16-bit-floating-point-arithmetic\/","title":{"rendered":"&#8220;Half Precision&#8221; 16-bit Floating Point Arithmetic"},"content":{"rendered":"<div class=\"content\"><!--introduction--><p>The floating point arithmetic format that requires only 16 bits of storage is becoming increasingly popular.  Also known as <i>half precision<\/i> or <i>binary16<\/i>, the format is useful when memory is a scarce resource.<\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#090a8a25-1ad3-4d45-8c77-fca1e7bc68c1\">Background<\/a><\/li><li><a href=\"#8c5eead6-1f30-44ac-9cc7-e12a96c16417\">Floating point anatomy<\/a><\/li><li><a href=\"#ed12f876-2348-4f2e-92a8-1f086ad8a55e\">Precision and range<\/a><\/li><li><a href=\"#1e420b34-dab8-48c0-a1fd-015b27ea6b90\">Floating point integers<\/a><\/li><li><a href=\"#0ea70488-c769-4b2a-bf5c-df782780a66c\">Table<\/a><\/li><li><a href=\"#a45f8322-e167-4947-b44f-1ecbdb56e8f1\">fp8 and fp16<\/a><\/li><li><a href=\"#ad57b760-cfa1-456c-a0bd-df3d2aa6476b\">Wikipedia test suite<\/a><\/li><li><a href=\"#73869dd0-cb7a-48a1-9ee2-4292e8fae072\">Matrix operations<\/a><\/li><li><a href=\"#7893ea1a-0d29-4ba4-89f1-da8c5f4c70b9\">fp16 backslash<\/a><\/li><li><a href=\"#56f81913-9deb-47d9-8794-8fe544f55b45\">fp16 SVD<\/a><\/li><li><a href=\"#25922c0d-7a18-4c9c-bf40-201f9ae34473\">Calculator<\/a><\/li><li><a href=\"#a3037622-0c8e-4528-b060-e03bed068e09\">Thanks<\/a><\/li><\/ul><\/div><h4>Background<a name=\"090a8a25-1ad3-4d45-8c77-fca1e7bc68c1\"><\/a><\/h4><p>The IEEE 754 standard, published in 1985, defines formats for floating point numbers that occupy 32 or 64 bits of storage. 
These formats are known as <i>binary32<\/i> and <i>binary64<\/i>, or more frequently as <i>single<\/i> and <i>double precision<\/i>.  For many years MATLAB used only double precision and it remains our default format.  Single precision has been added gradually over the last several years and is now also fully supported.<\/p><p>A revision of IEEE 754, published in 2008, defines a floating point format that occupies only 16 bits.  Known as <i>binary16<\/i>, it is primarily intended to reduce storage and memory bandwidth requirements. Since it provides only \"half\" precision, its use for actual computation is problematic.  An interesting discussion of its utility as an image processing format with increased dynamic range is provided by <a href=\"http:\/\/www.openexr.com\/index.html\">Industrial Light and Magic<\/a>. Hardware support for half precision is now available on many processors, including the GPU in the <a href=\"http:\/\/www.realworldtech.com\/apple-custom-gpu\/\">Apple iPhone 7<\/a>. Here is a link to an extensive article about half precision on the <a href=\"http:\/\/www.anandtech.com\/show\/10325\/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review\/5\">NVIDIA GeForce GPU<\/a>.<\/p><h4>Floating point anatomy<a name=\"8c5eead6-1f30-44ac-9cc7-e12a96c16417\"><\/a><\/h4><p>The format of a floating point number is characterized by two parameters, <tt>p<\/tt>, the number of bits in the fraction and <tt>q<\/tt>, the number of bits in the exponent.  I will consider four precisions, <i>quarter<\/i>, <i>half<\/i>, <i>single<\/i>, and <i>double<\/i>.  
The quarter-precision format is something that I just invented for this blog post; it is not standard and actually not very useful.<\/p><p>The four pairs of characterizing parameters are<\/p><pre class=\"codeinput\">   p = [4, 10, 23, 52];\r\n<\/pre><pre class=\"codeinput\">   q = [3, 5, 8, 11];\r\n<\/pre><p>With these values of <tt>p<\/tt> and <tt>q<\/tt>, and with one more bit for the sign, the total number of bits in the word, <tt>w<\/tt>, is a power of two.<\/p><pre class=\"codeinput\">   w = p + q + 1\r\n<\/pre><pre class=\"codeoutput\">w =\r\n     8    16    32    64\r\n<\/pre><p><b>Normalized numbers<\/b><\/p><p>Most floating point numbers are <i>normalized<\/i>, and are expressed as<\/p><p>$$ x = \\pm (1+f)2^e $$<\/p><p>The <i>fraction<\/i> $f$ is in the half open interval<\/p><p>$$ 0 \\leq f &lt; 1 $$<\/p><p>The binary representation of $f$ requires at most <tt>p<\/tt> bits. In other words $2^p f$ is an integer in the range<\/p><p>$$ 0 \\leq 2^p f &lt; 2^p $$<\/p><p>The <i>exponent<\/i> $e$ is an integer in the range<\/p><p>$$ -b+1 \\leq e \\leq b $$<\/p><p>The quantity $b$ is both the largest exponent and the <tt>bias<\/tt>.<\/p><p>$$ b = 2^{q-1} - 1 $$<\/p><pre class=\"codeinput\">   b = 2.^(q-1)-1\r\n<\/pre><pre class=\"codeoutput\">b =\r\n           3          15         127        1023\r\n<\/pre><p>The fractional part of a normalized number is $1+f$, but only $f$ needs to be stored.  That leading $1$ is known as the <i>hidden bit<\/i>.<\/p><p><b>Subnormal<\/b><\/p><p>There are two values of the exponent $e$ for which the biased exponent, $e+b$, reaches the smallest and largest values possible to represent in <tt>q<\/tt> bits.  The smallest is<\/p><p>$$ e + b = 0 $$<\/p><p>The corresponding floating point numbers do not have a hidden leading bit.  
These are the <i>subnormal<\/i> or <i>denormal<\/i> numbers.<\/p><p>$$ x = \\pm f 2^{-b} $$<\/p><p><b>Infinity and Not-A-Number<\/b><\/p><p>The largest possible biased exponent is<\/p><p>$$ e + b = 2^q-1 $$.<\/p><p>Quantities with this exponent field represent <i>infinities<\/i> and <i>NaN<\/i>, or <i>Not-A-Number<\/i>.<\/p><p>The percentage of floating point numbers that are exceptional because they are subnormal, infinity or NaN increases as the precision decreases.  Exceptional exponents are only $2$ values out of $2^q$.  For double precision this is $2\/2^{11}$, which is less than a tenth of a percent, but for half precision it is $2\/2^5$, which is more than 6 percent.  And fully one-fourth of all my toy quarter precision floating point numbers are exceptional.<\/p><p>Encode the sign bit with <tt>s = 0<\/tt> for nonnegative and <tt>s = 1<\/tt> for negative.  And encode the exponent with an offsetting bias, <tt>b<\/tt>. Then a floating point number can be packed in <tt>w<\/tt> bits with<\/p><pre>  x = [s e+b 2^p*f]<\/pre><h4>Precision and range<a name=\"ed12f876-2348-4f2e-92a8-1f086ad8a55e\"><\/a><\/h4><p><b>epsilon<\/b><\/p><p>If a real number cannot be expressed with a binary expansion requiring at most <tt>p<\/tt> bits, it must be approximated by a floating point number that does have such a binary representation.  This is <i>roundoff error<\/i>. The important quantity characterizing precision is <i>machine epsilon<\/i>, or <tt>eps<\/tt>.  In MATLAB, <tt>eps(x)<\/tt> is the distance from <tt>x<\/tt> to the next larger (in absolute value) floating point number.  
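This rounding behavior is easy to reproduce outside MATLAB. Here is a minimal Python sketch of half-precision <tt>eps</tt> (my own illustration, not part of the code in this post); the standard <tt>struct</tt> module's <tt>'e'</tt> format character is IEEE 754 binary16.

```python
import struct

def fp16(x):
    # Round a double to the nearest binary16 value by packing and unpacking.
    # The struct format character 'e' is IEEE 754 half precision.
    return struct.unpack('<e', struct.pack('<e', x))[0]

eps_half = 2.0 ** -10  # eps for half precision

# 1 + eps is representable and is the next float after 1 ...
assert fp16(1.0 + eps_half) == 1.0009765625
# ... while an increment of eps/2 is absorbed by round-to-nearest-even.
assert fp16(1.0 + eps_half / 2) == 1.0
```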
With no argument, <tt>eps<\/tt> is simply the difference between <tt>1<\/tt> and the next larger floating point number.<\/p><pre class=\"codeinput\">    format <span class=\"string\">shortg<\/span>\r\n    eps = 2.^(-p)\r\n<\/pre><pre class=\"codeoutput\">eps =\r\n       0.0625   0.00097656   1.1921e-07   2.2204e-16\r\n<\/pre><p>This tells us that double precision is good for about 16 decimal digits of accuracy, single for about 7 decimal digits, half for about 3, and quarter for barely more than one.<\/p><p><b>realmax<\/b><\/p><p>If a real number, or the result of an arithmetic operation, is too large to be represented, it <i>overflows<\/i> and is replaced by <i>infinity<\/i>. The largest floating point number that does not overflow is<\/p><pre class=\"codeinput\">    realmax = 2.^b.*(2-eps)\r\n<\/pre><pre class=\"codeoutput\">realmax =\r\n         15.5        65504   3.4028e+38  1.7977e+308\r\n<\/pre><p><b>realmin<\/b><\/p><p><i>Underflow<\/i> and representation of very small numbers are more complicated. The smallest normalized floating point number is<\/p><pre class=\"codeinput\">    realmin = 2.^(-b+1)\r\n<\/pre><pre class=\"codeoutput\">realmin =\r\n         0.25   6.1035e-05   1.1755e-38  2.2251e-308\r\n<\/pre><p><b>tiny<\/b><\/p><p>But there are numbers smaller than <tt>realmin<\/tt>.  IEEE 754 introduced the notion of <i>gradual underflow<\/i> and <i>denormal<\/i> numbers.  In the 2008 revised standard their name was changed to <i>subnormal<\/i>.<\/p><p>Think of roundoff in numbers near underflow.  Before 754, floating point numbers had the disconcerting property that <tt>x<\/tt> and <tt>y<\/tt> could be unequal, but their difference could underflow so <tt>x-y<\/tt> becomes <tt>0<\/tt>. With 754, the gap between <tt>0<\/tt> and <tt>realmin<\/tt> is filled with numbers whose spacing is the same as the spacing between <tt>realmin<\/tt> and <tt>2*realmin<\/tt>. 
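Gradual underflow is easy to watch in action. In this Python sketch (again my illustration, using the standard <tt>struct</tt> module's binary16 format, not the <tt>fp16</tt> object described later), <tt>x</tt> and <tt>y</tt> are distinct half-precision numbers whose difference survives as a subnormal instead of vanishing:

```python
import struct

def fp16(x):
    # Nearest binary16 value, via the struct module's half-precision format.
    return struct.unpack('<e', struct.pack('<e', x))[0]

realmin = 2.0 ** -14   # smallest normalized half-precision number
tiny = 2.0 ** -24      # realmin*eps, the subnormal spacing

# x and y are unequal, and thanks to gradual underflow their
# difference is a subnormal instead of collapsing to zero.
x = fp16(realmin + tiny)
y = realmin
assert x != y
assert fp16(x - y) == tiny

# Half of tiny, however, underflows all the way to zero.
assert fp16(tiny / 2) == 0.0
```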
I like to call this spacing, and the smallest subnormal number, <tt>tiny<\/tt>.<\/p><pre class=\"codeinput\">    tiny = realmin.*eps\r\n<\/pre><pre class=\"codeoutput\">tiny =\r\n     0.015625   5.9605e-08   1.4013e-45  4.9407e-324\r\n<\/pre><h4>Floating point integers<a name=\"1e420b34-dab8-48c0-a1fd-015b27ea6b90\"><\/a><\/h4><p><b>flintmax<\/b><\/p><p>It is possible to do integer arithmetic with floating point numbers. I like to call such numbers <i>flints<\/i>.  When we write the numbers $3$ and $3.0$, they are different descriptions of the same integer, but we think of one as fixed point and the other as floating point. The largest flint is <tt>flintmax<\/tt>.<\/p><pre class=\"codeinput\">    flintmax = 2.\/eps\r\n<\/pre><pre class=\"codeoutput\">flintmax =\r\n           32         2048   1.6777e+07   9.0072e+15\r\n<\/pre><p>Technically all the floating point numbers larger than <tt>flintmax<\/tt> are integers, but the spacing between them is larger than one, so it is not safe to use them for integer arithmetic. 
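The flint boundary shows up in any binary16 implementation. A small Python check (illustrative only, using the <tt>struct</tt> module's binary16 format):

```python
import struct

def fp16(x):
    # Nearest binary16 value ('e' is IEEE 754 half precision).
    return struct.unpack('<e', struct.pack('<e', x))[0]

flintmax = 2048  # 2/eps for half precision

# Every integer from 0 through flintmax survives the round trip exactly ...
assert all(fp16(float(n)) == n for n in range(flintmax + 1))

# ... but just beyond, the spacing is 2, so 2049 is not representable.
assert fp16(2049.0) == 2048.0
assert fp16(2050.0) == 2050.0
```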
Only integer-valued floating point numbers between <tt>0<\/tt> and <tt>flintmax<\/tt> are allowed to be called flints.<\/p><h4>Table<a name=\"0ea70488-c769-4b2a-bf5c-df782780a66c\"><\/a><\/h4><p>Let's collect all these anatomical characteristics together in a new MATLAB <tt>table<\/tt>.<\/p><pre class=\"codeinput\">   T = [w; p; q; b; eps; realmax; realmin; tiny; flintmax];\r\n\r\n   T = table(T(:,1), T(:,2), T(:,3), T(:,4), <span class=\"keyword\">...<\/span>\r\n        <span class=\"string\">'variablenames'<\/span>,{<span class=\"string\">'quarter'<\/span>,<span class=\"string\">'half'<\/span>,<span class=\"string\">'single'<\/span>,<span class=\"string\">'double'<\/span>}, <span class=\"keyword\">...<\/span>\r\n        <span class=\"string\">'rownames'<\/span>,{<span class=\"string\">'w'<\/span>,<span class=\"string\">'p'<\/span>,<span class=\"string\">'q'<\/span>,<span class=\"string\">'b'<\/span>,<span class=\"string\">'eps'<\/span>,<span class=\"string\">'realmax'<\/span>,<span class=\"string\">'realmin'<\/span>, <span class=\"keyword\">...<\/span>\r\n                    <span class=\"string\">'tiny'<\/span>,<span class=\"string\">'flintmax'<\/span>});\r\n   disp(T)\r\n<\/pre><pre class=\"codeoutput\">                quarter        half         single        double   \r\n                ________    __________    __________    ___________\r\n    w                  8            16            32             64\r\n    p                  4            10            23             52\r\n    q                  3             5             8             11\r\n    b                  3            15           127           1023\r\n    eps           0.0625    0.00097656    1.1921e-07     2.2204e-16\r\n    realmax         15.5         65504    3.4028e+38    1.7977e+308\r\n    realmin         0.25    6.1035e-05    1.1755e-38    2.2251e-308\r\n    tiny        0.015625    5.9605e-08    1.4013e-45    4.9407e-324\r\n    flintmax          32          2048    1.6777e+07     
9.0072e+15\r\n<\/pre><h4>fp8 and fp16<a name=\"a45f8322-e167-4947-b44f-1ecbdb56e8f1\"><\/a><\/h4><p>Version 3.1 of <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/59085-cleve-laboratory\">Cleve's Laboratory<\/a> includes code for objects <tt>@fp8<\/tt> and <tt>@fp16<\/tt> that begin to provide full implementations of quarter-precision and half-precision arithmetic.<\/p><p>The methods currently provided are<\/p><pre class=\"codeinput\">   methods(fp16)\r\n<\/pre><pre class=\"codeoutput\">\r\nMethods for class fp16:\r\n\r\nabs         eps         isfinite    mrdivide    rem         subsref     \r\nbinary      eq          le          mtimes      round       svd         \r\nctranspose  fix         lt          ne          sign        tril        \r\ndiag        fp16        lu          norm        single      triu        \r\ndisp        ge          max         plus        size        uminus      \r\ndisplay     gt          minus       realmax     subsasgn    \r\ndouble      hex         mldivide    realmin     subsindex   \r\n\r\n<\/pre><p>These provide only partial implementations because the arithmetic is not done on the compact forms.  We cheat.  For each individual scalar operation, the operands are unpacked from their short storage into old fashioned doubles.  The operation is then carried out by existing double precision code and the results returned to the shorter formats. This simulates the reduced precision and restricted range, but requires relatively little new code.<\/p><p>All of the work is done in the constructors <tt>@fp8\/fp8.m<\/tt> and <tt>@fp16\/fp16.m<\/tt> and what we might call the \"deconstructors\" <tt>@fp8\/double.m<\/tt> and <tt>@fp16\/double.m<\/tt>. The constructors convert ordinary floating point numbers to reduced precision representations by packing as many of the 32 or 64 bits as will fit into 8 or 16 bit words.  
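In outline, the packing and its reverse look something like the following. This is a hypothetical Python rendering of the idea (not the actual <tt>@fp16</tt> MATLAB code): pack into 16 bits, then split out the <tt>s</tt>, <tt>e+b</tt>, and <tt>2^p*f</tt> fields.

```python
import struct

def fp16_fields(x):
    # Pack x into the 16-bit binary16 format, then split out the fields:
    # 1 sign bit, q = 5 exponent bits, p = 10 fraction bits.
    (bits,) = struct.unpack('<H', struct.pack('<e', x))
    s = bits >> 15             # sign
    eb = (bits >> 10) & 0x1f   # biased exponent, e + b, with b = 15
    f = bits & 0x3ff           # the stored fraction, 2^p * f
    return s, eb, f

def fields_to_double(s, eb, f):
    # The "deconstructor": rebuild an ordinary double from the fields.
    # (Infinities and NaNs, eb == 31, are ignored in this sketch.)
    if eb == 0:                                  # subnormal, no hidden bit
        value = (f / 1024.0) * 2.0 ** -14
    else:                                        # normalized, hidden bit 1
        value = (1.0 + f / 1024.0) * 2.0 ** (eb - 15)
    return -value if s else value

assert fp16_fields(1.0) == (0, 15, 0)            # 0 01111 0000000000
assert fields_to_double(*fp16_fields(-2.0)) == -2.0
assert fields_to_double(*fp16_fields(65504.0)) == 65504.0
```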
The deconstructors do the reverse by unpacking things.<\/p><p>Once these methods are available, almost everything else is trivial. The code for most operations is like this one for the overloaded addition.<\/p><pre class=\"codeinput\">    type <span class=\"string\">@fp16\/plus.m<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\nfunction z = plus(x,y)\r\n   z = fp16(double(x) + double(y));\r\nend\r\n<\/pre><h4>Wikipedia test suite<a name=\"ad57b760-cfa1-456c-a0bd-df3d2aa6476b\"><\/a><\/h4><p>The <a href=\"http:\/\/en.wikipedia.org\/wiki\/Half-precision_floating-point_format\">Wikipedia page<\/a> about half-precision includes several 16-bit examples with the sign, exponent, and fraction fields separated. I've added a couple more.<\/p><pre>  0 01111 0000000000 = 1\r\n  0 00101 0000000000 = 2^-10 = eps\r\n  0 01111 0000000001 = 1+eps = 1.0009765625 (smallest float larger than 1)\r\n  1 10000 0000000000 = -2\r\n  0 11110 1111111111 = 65504 (max half precision) = 2^15*(2-eps)\r\n  0 00001 0000000000 = 2^-14 = r ~ 6.10352e-5 (minimum positive normal)\r\n  0 00000 1111111111 = r*(1-eps) ~ 6.09756e-5 (maximum subnormal)\r\n  0 00000 0000000001 = r*eps ~ 5.96046e-8 (minimum positive subnormal)\r\n  0 00000 0000000000 ~ r*eps\/2 (underflow to zero)\r\n  0 00000 0000000000 = 0\r\n  1 00000 0000000000 = -0\r\n  0 11111 0000000000 = infinity\r\n  1 11111 0000000000 = -infinity\r\n  0 11111 1111111111 = NaN\r\n  0 01101 0101010101 = 0.333251953125 ~ 1\/3<\/pre><p>This provides my test suite for checking <tt>fp16<\/tt> operations on scalars.<\/p><pre class=\"codeinput\">   clear\r\n   zero = fp16(0);\r\n   one = fp16(1);\r\n   eps = eps(one);\r\n   r = realmin(one);\r\n\r\n   tests = {<span class=\"string\">'1'<\/span>,<span class=\"string\">'eps'<\/span>,<span class=\"string\">'1+eps'<\/span>,<span class=\"string\">'-2'<\/span>,<span class=\"string\">'2\/r*(2-eps)'<\/span>, <span class=\"keyword\">...<\/span>\r\n            <span class=\"string\">'r'<\/span>,<span 
class=\"string\">'r*(1-eps)'<\/span>,<span class=\"string\">'r*eps'<\/span>,<span class=\"string\">'r*eps\/2'<\/span>, <span class=\"keyword\">...<\/span>\r\n            <span class=\"string\">'zero'<\/span>,<span class=\"string\">'-zero'<\/span>,<span class=\"string\">'1\/zero'<\/span>,<span class=\"string\">'-1\/zero'<\/span>,<span class=\"string\">'zero\/zero'<\/span>,<span class=\"string\">'1\/3'<\/span>};\r\n<\/pre><p>Let's run the tests.<\/p><pre class=\"codeinput\">   <span class=\"keyword\">for<\/span> t = tests(:)'\r\n       x = eval(t{:});\r\n       y = fp16(x);\r\n       z = binary(y);\r\n       w = double(y);\r\n       fprintf(<span class=\"string\">'  %18s  %04s %19.10g %19.10g    %s\\n'<\/span>, <span class=\"keyword\">...<\/span>\r\n               z,hex(y),double(x),w,t{:})\r\n   <span class=\"keyword\">end<\/span>\r\n<\/pre><pre class=\"codeoutput\">  0 01111 0000000000  3C00                   1                   1    1\r\n  0 00101 0000000000  1400        0.0009765625        0.0009765625    eps\r\n  0 01111 0000000001  3C01         1.000976563         1.000976563    1+eps\r\n  1 10000 0000000000  C000                  -2                  -2    -2\r\n  0 11110 1111111111  7BFF               65504               65504    2\/r*(2-eps)\r\n  0 00001 0000000000  0400     6.103515625e-05     6.103515625e-05    r\r\n  0 00000 1111111111  03FF     6.097555161e-05     6.097555161e-05    r*(1-eps)\r\n  0 00000 0000000001  0001     5.960464478e-08     5.960464478e-08    r*eps\r\n  0 00000 0000000001  0001     5.960464478e-08     5.960464478e-08    r*eps\/2\r\n  0 00000 0000000000  0000                   0                   0    zero\r\n  0 00000 0000000000  0000                   0                   0    -zero\r\n  0 11111 0000000000  7C00                 Inf                 Inf    1\/zero\r\n  1 11111 0000000000  FC00                -Inf                -Inf    -1\/zero\r\n  1 11111 1111111111  FFFF                 NaN                 NaN    zero\/zero\r\n  0 
01101 0101010101  3555        0.3333333333        0.3332519531    1\/3\r\n<\/pre><h4>Matrix operations<a name=\"73869dd0-cb7a-48a1-9ee2-4292e8fae072\"><\/a><\/h4><p>Most of the methods in <tt>@fp8<\/tt> and <tt>@fp16<\/tt> handle matrices. The 4-by-4 magic square from Durer's Melencolia I provides my first example.<\/p><pre class=\"codeinput\">   clear\r\n   format <span class=\"string\">short<\/span>\r\n   M = fp16(magic(4))\r\n<\/pre><pre class=\"codeoutput\"> \r\nM = \r\n \r\n    16     2     3    13\r\n     5    11    10     8\r\n     9     7     6    12\r\n     4    14    15     1\r\n \r\n<\/pre><p>Let's see how the packed 16-bit elements look in binary.<\/p><pre class=\"codeinput\">   B = binary(M)\r\n<\/pre><pre class=\"codeoutput\">B = \r\n  4&times;4 string array\r\n  Columns 1 through 3\r\n    \"0 10011 0000000000\"    \"0 10000 0000000000\"    \"0 10000 1000000000\"\r\n    \"0 10001 0100000000\"    \"0 10010 0110000000\"    \"0 10010 0100000000\"\r\n    \"0 10010 0010000000\"    \"0 10001 1100000000\"    \"0 10001 1000000000\"\r\n    \"0 10001 0000000000\"    \"0 10010 1100000000\"    \"0 10010 1110000000\"\r\n  Column 4\r\n    \"0 10010 1010000000\"\r\n    \"0 10010 0000000000\"\r\n    \"0 10010 1000000000\"\r\n    \"0 01111 0000000000\"\r\n<\/pre><p>Check that the row sums are all equal.  This matrix-vector multiply can be done exactly with the flints in the magic square.<\/p><pre class=\"codeinput\">   e = fp16(ones(4,1))\r\n   Me = M*e\r\n<\/pre><pre class=\"codeoutput\"> \r\ne = \r\n \r\n     1\r\n     1\r\n     1\r\n     1\r\n \r\n \r\nMe = \r\n \r\n    34\r\n    34\r\n    34\r\n    34\r\n \r\n<\/pre><h4>fp16 backslash<a name=\"7893ea1a-0d29-4ba4-89f1-da8c5f4c70b9\"><\/a><\/h4><p>I've overloaded <tt>mldivide<\/tt>, so I can solve linear systems and compute inverses.  The actual computation is done by <tt>lutx<\/tt>, a \"textbook\" function that I wrote years ago, long before this half-precision project.  
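The same unpack-compute-repack cheat works in any language with a binary16 conversion. A Python sketch of the idea (my illustration, not the <tt>@fp16<\/tt> code), doing the arithmetic in double and rounding the result back to half precision:

```python
import struct

def fp16(x):
    # Round a double to the nearest binary16 value: the "repack" step.
    return struct.unpack('<e', struct.pack('<e', x))[0]

def fp16_add(x, y):
    # The scheme used by the overloaded plus: unpack to doubles,
    # operate in double precision, round the result back down.
    return fp16(x + y)

# About three decimal digits survive:
assert fp16(1.0 / 3.0) == 0.333251953125

# And small addends can be absorbed entirely by the final rounding.
assert fp16_add(1024.0, 0.25) == 1024.0   # spacing at 1024 is 1
```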
But now the MATLAB object system insures that every individual arithmetic operation is done on unpacked <tt>fp16<\/tt> numbers.<\/p><p>Let's generate a 5-by-5 matrix with random two-digit integer entries.<\/p><pre class=\"codeinput\">   A = fp16(randi(100,5,5))\r\n<\/pre><pre class=\"codeoutput\"> \r\nA = \r\n \r\n    76    71    83    44    49\r\n    75     4    70    39    45\r\n    40    28    32    77    65\r\n    66     5    96    80    71\r\n    18    10     4    19    76\r\n \r\n<\/pre><p>I am going to use <tt>fp16<\/tt> backslash to invert <tt>A<\/tt>. So I need the identity matrix in half precision.<\/p><pre class=\"codeinput\">   I = fp16(eye(5))\r\n<\/pre><pre class=\"codeoutput\"> \r\nI = \r\n \r\n     1     0     0     0     0\r\n     0     1     0     0     0\r\n     0     0     1     0     0\r\n     0     0     0     1     0\r\n     0     0     0     0     1\r\n \r\n<\/pre><p>Now the overloaded backslash calls <tt>lutx<\/tt> to compute the inverse.<\/p><pre class=\"codeinput\">   X = A\\I\r\n<\/pre><pre class=\"codeoutput\"> \r\nX = \r\n \r\n   -0.0044    0.0346    0.0125   -0.0254   -0.0046\r\n    0.0140   -0.0116    0.0026   -0.0046   -0.0002\r\n    0.0071   -0.0176   -0.0200    0.0238    0.0008\r\n   -0.0060   -0.0020    0.0200    0.0006   -0.0125\r\n    0.0003   -0.0052   -0.0072    0.0052    0.0174\r\n \r\n<\/pre><p>Compute the residual.<\/p><pre class=\"codeinput\">   AX = A*X\r\n   R = I - AX\r\n<\/pre><pre class=\"codeoutput\"> \r\nAX = \r\n \r\n    1.0000   -0.0011   -0.0002   -0.0007   -0.0001\r\n   -0.0001    0.9990   -0.0001   -0.0007   -0.0001\r\n   -0.0001   -0.0005    1.0000   -0.0003   -0.0002\r\n   -0.0002   -0.0011   -0.0001    0.9995   -0.0002\r\n   -0.0000   -0.0001    0.0001   -0.0001    1.0000\r\n \r\n \r\nR = \r\n \r\n         0    0.0011    0.0002    0.0007    0.0001\r\n    0.0001    0.0010    0.0001    0.0007    0.0001\r\n    0.0001    0.0005         0    0.0003    0.0002\r\n    0.0002    0.0011    0.0001    0.0005    
0.0002\r\n    0.0000    0.0001   -0.0001    0.0001         0\r\n \r\n<\/pre><p>Both <tt>AX<\/tt> and <tt>R<\/tt> are what I expect from arithmetic that is accurate to only about three decimal digits.<\/p><p>Although I get a different random <tt>A<\/tt> every time I publish this blog post, I expect that it has a modest condition number.<\/p><pre class=\"codeinput\">   kappa = cond(A)\r\n<\/pre><pre class=\"codeoutput\">kappa =\r\n   15.7828\r\n<\/pre><p>Since <tt>A<\/tt> is not badly conditioned, I can invert the computed inverse and expect to get close to the original integer matrix.<\/p><pre class=\"codeinput\">   Z = X\\I\r\n<\/pre><pre class=\"codeoutput\"> \r\nZ = \r\n \r\n   76.1250   71.0000   83.1250   44.1250   49.1250\r\n   75.1250    4.0234   70.1875   39.1250   45.1250\r\n   40.0625   28.0000   32.0625   77.0000   65.0625\r\n   66.1250    5.0234   96.1875   80.1250   71.1250\r\n   18.0156   10.0000    4.0156   19.0156   76.0000\r\n \r\n<\/pre><h4>fp16 SVD<a name=\"56f81913-9deb-47d9-8794-8fe544f55b45\"><\/a><\/h4><p>I have just nonchalantly computed <tt>cond(A)<\/tt>.  But <tt>cond<\/tt> isn't on the list of overload methods for <tt>fp16<\/tt>.  I was pleasantly surprised to find that <tt>matlab\\matfun\\cond.m<\/tt> quietly worked on this new datatype. Here is the core of that code.<\/p><pre class=\"codeinput\">   dbtype <span class=\"string\">cond<\/span> <span class=\"string\">34:43<\/span>, dbtype <span class=\"string\">cond<\/span> <span class=\"string\">47<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\n34    if p == 2\r\n35        s = svd(A);\r\n36        if any(s == 0)   % Handle singular matrix\r\n37            c = Inf(class(A));\r\n38        else\r\n39            c = max(s).\/min(s);\r\n40            if isempty(c)\r\n41                c = zeros(class(A));\r\n42            end\r\n43        end\r\n\r\n47    end\r\n<\/pre><p>So it is correctly using the singular value decomposition, and I have <tt>svd<\/tt> overloaded.  
The SVD computation is handled by a 433 line M-file, <tt>svdtx<\/tt>, that, like <tt>lutx<\/tt>, was written before <tt>fp16<\/tt> existed.<\/p><p>Let's compute the SVD again.<\/p><pre class=\"codeinput\">   [U,S,V] = svd(A)\r\n<\/pre><pre class=\"codeoutput\"> \r\nU = \r\n \r\n   -0.5210   -0.4841    0.6802   -0.0315    0.1729\r\n   -0.4260   -0.2449   -0.3572   -0.4561   -0.6504\r\n   -0.4058    0.4683    0.1633    0.6284   -0.4409\r\n   -0.5786    0.0268   -0.5620    0.1532    0.5703\r\n   -0.2174    0.6968    0.2593   -0.6104    0.1658\r\n \r\n \r\nS = \r\n \r\n  267.5000         0         0         0         0\r\n         0   71.1875         0         0         0\r\n         0         0   55.5000         0         0\r\n         0         0         0   37.3750         0\r\n         0         0         0         0   16.9531\r\n \r\n \r\nV = \r\n \r\n   -0.4858   -0.3108   -0.0175   -0.3306   -0.7471\r\n   -0.2063   -0.2128    0.9238    0.2195    0.1039\r\n   -0.5332   -0.5205   -0.2920   -0.0591    0.5967\r\n   -0.4534    0.2891   -0.2050    0.7993   -0.1742\r\n   -0.4812    0.7095    0.1384   -0.4478    0.2126\r\n \r\n<\/pre><p>Reconstruct <tt>A<\/tt> from its half precision SVD.  
It's not too shabby.<\/p><pre class=\"codeinput\">   USVT = U*S*V'\r\n<\/pre><pre class=\"codeoutput\"> \r\nUSVT = \r\n \r\n   75.9375   71.0000   83.0625   44.0313   49.0000\r\n   75.0000    4.0117   70.0625   38.9688   45.0000\r\n   40.0313   28.0469   32.0313   77.0625   65.0625\r\n   66.0000    4.9688   96.0625   80.0000   71.0000\r\n   18.0313   10.0234    4.0156   19.0313   76.0000\r\n \r\n<\/pre><p>Finally, verify that we've been working all this time with <tt>fp16<\/tt> objects.<\/p><pre class=\"codeinput\">   whos\r\n<\/pre><pre class=\"codeoutput\">  Name       Size            Bytes  Class     Attributes\r\n\r\n  A          5x5               226  fp16                \r\n  AX         5x5               226  fp16                \r\n  B          4x4              1576  string              \r\n  I          5x5               226  fp16                \r\n  M          4x4               208  fp16                \r\n  Me         4x1               184  fp16                \r\n  R          5x5               226  fp16                \r\n  S          5x5               226  fp16                \r\n  U          5x5               226  fp16                \r\n  USVT       5x5               226  fp16                \r\n  V          5x5               226  fp16                \r\n  X          5x5               226  fp16                \r\n  Z          5x5               226  fp16                \r\n  e          4x1               184  fp16                \r\n  kappa      1x1                 8  double              \r\n\r\n<\/pre><h4>Calculator<a name=\"25922c0d-7a18-4c9c-bf40-201f9ae34473\"><\/a><\/h4><p>I introduced a <tt>calculator<\/tt> in my blog post about <a href=\"https:\/\/blogs.mathworks.com\/cleve\/2017\/04\/24\/\">Roman numerals<\/a>. 
Version 3.1 of <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/59085-cleve-laboratory\">Cleve's Laboratory<\/a> also includes a fancier version of the calculator that computes in four different precisions -- quarter, half, single, and double -- and displays the results in four different formats -- decimal, hexadecimal, binary, and Roman.<\/p><p>I like to demonstrate the calculator by clicking on the keys<\/p><pre>    1 0 0 0 \/ 8 1 =<\/pre><p>because the decimal expansion is a repeating <tt>.123456790<\/tt>.<\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/cleve\/files\/calculator_movie.gif\" alt=\"\"> <\/p><h4>Thanks<a name=\"a3037622-0c8e-4528-b060-e03bed068e09\"><\/a><\/h4><p>Thanks to MathWorkers Ben Tordoff, Steve Eddins, and Kiran Kintali who provided background and pointers to work on half precision.<\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_bea37f3aee0e469aaa02e22e2581ad50() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='bea37f3aee0e469aaa02e22e2581ad50 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' bea37f3aee0e469aaa02e22e2581ad50';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2017 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_bea37f3aee0e469aaa02e22e2581ad50()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2017a<br><\/p><\/div><!--\r\nbea37f3aee0e469aaa02e22e2581ad50 ##### SOURCE BEGIN #####\r\n%% \"Half Precision\" 16-bit Floating Point Arithmetic\r\n% The floating point arithmetic format that requires only 16 bits of\r\n% storage is becoming increasingly popular.  Also known as\r\n% _half precision_ or _binary16_, the format is useful when memory\r\n% is a scarce resource.\r\n\r\n%% Background\r\n% The IEEE 754 standard, published in 1985, defines formats for \r\n% floating point numbers that occupy 32 or 64 bits of storage.\r\n% These formats are known as _binary32_ and _binary64_, or more\r\n% frequently as _single_ and _double precision_.  For many years\r\n% MATLAB used only double precision and it remains our default\r\n% format.  
Single precision has been added gradually\r\n% over the last several years and is now also fully supported.\r\n\r\n%%\r\n% A revision of IEEE 754, published in 2008, defines a floating point\r\n% format that occupies only 16 bits.  Known as _binary16_, it is\r\n% primarily intended to reduce storage and memory bandwidth requirements.\r\n% Since it provides only \"half\" precision, its use for actual computation\r\n% is problematic.  An interesting discussion of its utility as an image\r\n% processing format with increased dynamic range is provided by\r\n% <http:\/\/www.openexr.com\/index.html Industrial Light and Magic>.\r\n% Hardware support for half precision is now available on many processors,\r\n% including the GPU in the\r\n% <http:\/\/www.realworldtech.com\/apple-custom-gpu\/ Apple iPhone 7>.\r\n% Here is a link to an extensive article about half precision on the\r\n% <http:\/\/www.anandtech.com\/show\/10325\/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review\/5\r\n% NVIDIA GeForce GPU>.\r\n\r\n%% Floating point anatomy\r\n% The format of a floating point number is characterized by two\r\n% parameters, |p|, the number of bits in the fraction and |q|, the number\r\n% of bits in the exponent.  I will consider four precisions,\r\n% _quarter_, _half_, _single_, and _double_.  
The quarter-precision\r\n% format is something that I just invented for this blog post;\r\n% it is not standard and actually not very useful.\r\n\r\n%%\r\n% The four pairs of characterizing parameters are\r\n\r\n   p = [4, 10, 23, 52];\r\n   \r\n%%\r\n\r\n   q = [3, 5, 8, 11];\r\n\r\n\r\n   \r\n%%\r\n% With these values of |p| and |q|, and with one more bit for the sign, \r\n% the total number of bits in the word, |w|, is a power of two.\r\n\r\n   w = p + q + 1\r\n   \r\n%%\r\n% *Normalized numbers*\r\n%\r\n% Most floating point numbers are _normalized_, and are expressed as\r\n% \r\n% $$ x = \\pm (1+f)2^e $$\r\n\r\n%%\r\n% The _fraction_ $f$ is in the half open interval\r\n%\r\n% $$ 0 \\leq f < 1 $$\r\n%\r\n\r\n%%\r\n% The binary representation of $f$ requires at most |p| bits.\r\n% In other words $2^p f$ is an integer in the range\r\n%\r\n% $$ 0 \\leq 2^p f < 2^p $$\r\n\r\n%%\r\n%\r\n% The _exponent_ $e$ is an integer in the range\r\n%\r\n% $$ -b+1 \\leq e \\leq b $$\r\n%\r\n\r\n%%\r\n% The quantity $b$ is both the largest exponent and the |bias|.\r\n%\r\n% $$ b = 2^{q-1} - 1 $$\r\n\r\n   b = 2.^(q-1)-1\r\n   \r\n%%\r\n% The fractional part of a normalized number is $1+f$, but only $f$\r\n% needs to be stored.  That leading $1$ is known as the _hidden bit_.\r\n  \r\n%%\r\n% *Subnormal*\r\n%\r\n% There are two values of the exponent $e$ for which the biased exponent,\r\n% $e+b$, reaches the smallest and largest values possible to represent\r\n% in |q| bits.  The smallest is\r\n%\r\n% $$ e + b = 0 $$\r\n%\r\n% The corresponding floating point numbers do not have a hidden leading\r\n% bit.  
These are the _subnormal_ or _denormal_ numbers.\r\n%\r\n% $$ x = \\pm f 2^{-b} $$\r\n%\r\n\r\n%%\r\n% *Infinity and Not-A-Number*\r\n%\r\n% The largest possible biased exponent is\r\n% \r\n% $$ e + b = 2^q-1 $$.\r\n%\r\n% Quantities with this exponent field represent _infinities_ and\r\n% _NaN_, or _Not-A-Number_.\r\n\r\n%%\r\n% The percentage of floating point numbers that are exceptional\r\n% because they are subnormal, infinity or NaN increases as the\r\n% precision decreases.  Exceptional exponents are only $2$ values\r\n% out of $2^q$.  For double precision this is $2\/2^{11}$, which is\r\n% less than a tenth of a percent, but for half precision it is $2\/2^5$,\r\n% which is more than 6 percent.  And fully one-fourth of all my toy\r\n% quarter precision floating point numbers are exceptional.\r\n\r\n%% \r\n% Encode the sign bit with |s = 0| for nonnegative and |s = 1| for\r\n% negative.  And encode the exponent with an offsetting bias, |b|.\r\n% Then a floating point number can be packed in |w| bits with\r\n%\r\n%    x = [s e+b 2^p*f]\r\n    \r\n%% Precision and range\r\n% *epsilon*\r\n%\r\n% If a real number cannot be expressed with a binary expansion requiring\r\n% at most |p| bits, it must be approximated by a floating point number\r\n% that does have such a binary representation.  This is _roundoff error_.\r\n% The important quantity characterizing precision is _machine epsilon_,\r\n% or |eps|.  In MATLAB, |eps(x)| is the distance from |x| to the\r\n% next larger (in absolute value) floating point number.  
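For readers following along outside MATLAB: NumPy's `np.spacing` is a close analogue of `eps(x)` as just described, returning the gap from `x` to the next larger floating point number in `x`'s own precision. This is an illustrative sketch of mine, not code from the post; `np.float16` is NumPy's implementation of the same binary16 format.

```python
import numpy as np

# np.spacing(x) plays the role of MATLAB's eps(x): the distance from x
# to the next larger floating point number in x's own precision.

# The gap just above 1 in half precision is 2^-10 ...
assert float(np.spacing(np.float16(1))) == 2.0**-10

# ... and the gap grows with magnitude: just above 2048 it is 2,
# so consecutive integers are no longer representable up there.
assert float(np.spacing(np.float16(2048))) == 2.0
```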
With no\r\n% argument, |eps| is simply the difference between |1| and the next larger\r\n% floating point number.\r\n\r\n    format shortg\r\n    eps = 2.^(-p)\r\n    \r\n%%\r\n% This tells us that double precision is good for about 16 decimal digits \r\n% of accuracy, single for about 7 decimal digits, half for about 3, and\r\n% quarter for barely more than one.\r\n            \r\n%%\r\n% *realmax*\r\n%\r\n% If a real number, or the result of an arithmetic operation, is too\r\n% large to be represented, it _overflows_ and is replaced by _infinity_.\r\n% The largest floating point number that does not overflow is\r\n\r\n    realmax = 2.^b.*(2-eps)\r\n    \r\n%% \r\n% *realmin*\r\n%\r\n% _Underflow_ and representation of very small numbers are more complicated.\r\n% The smallest normalized floating point number is\r\n\r\n    realmin = 2.^(-b+1)\r\n    \r\n%% \r\n% *tiny*\r\n%\r\n% But there are numbers smaller than |realmin|.  IEEE 754 introduced the\r\n% notion of _gradual underflow_ and _denormal_ numbers.  In the 2008\r\n% revised standard their name was changed to _subnormal_.\r\n\r\n%%\r\n% Think of roundoff in numbers near underflow.  Before 754 floating\r\n% point numbers had the disconcerting property that |x| and |y| could\r\n% be unequal, but their difference could underflow so |x-y| becomes |0|.\r\n% With 754 the gap between |0| and |realmin| is filled with numbers whose\r\n% spacing is the same as the spacing between |realmin| and |2*realmin|.\r\n% I like to call this spacing, and the smallest subnormal number, |tiny|.\r\n\r\n    tiny = realmin.*eps\r\n    \r\n%% Floating point integers\r\n% *flintmax*\r\n% \r\n% It is possible to do integer arithmetic with floating point numbers.\r\n% I like to call such numbers _flints_.  
When we write the numbers $3$\r\n% and $3.0$, they are different descriptions of the same integer, but\r\n% we think of one as fixed point and the other as floating point.\r\n% The largest flint is |flintmax|.\r\n\r\n    flintmax = 2.\/eps\r\n    \r\n%%\r\n% Technically all the floating point numbers larger than |flintmax|\r\n% are integers, but the spacing between them is larger than one, so it\r\n% is not safe to use them for integer arithmetic. Only integer-valued\r\n% floating point numbers between |0| and |flintmax| are allowed to be\r\n% called flints.\r\n\r\n    \r\n%% Table\r\n% Let's collect all these anatomical characteristics together in a\r\n% new MATLAB |table|.\r\n\r\n   T = [w; p; q; b; eps; realmax; realmin; tiny; flintmax];\r\n\r\n   T = table(T(:,1), T(:,2), T(:,3), T(:,4), ...\r\n        'variablenames',{'quarter','half','single','double'}, ...\r\n        'rownames',{'w','p','q','b','eps','realmax','realmin', ...\r\n                    'tiny','flintmax'});\r\n   disp(T)\r\n   \r\n%% fp8 and fp16\r\n% Version 3.1 of\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/59085-cleve-laboratory\r\n% Cleve's Laboratory> includes code for objects |@fp8| and |@fp16|\r\n% that begin to provide full implementations of quarter-precision and\r\n% half-precision arithmetic.\r\n\r\n%%\r\n% The methods currently provided are\r\n\r\n   methods(fp16)\r\n\r\n%%\r\n% These provide only partial implementations because the arithmetic is not\r\n% done on the compact forms.  We cheat.  For each individual scalar\r\n% operation, the operands are unpacked from their short storage into\r\n% old fashioned doubles.  
The operation is then carried out by existing\r\n% double precision code and the results returned to the shorter formats.\r\n% This simulates the reduced precision and restricted range, but requires\r\n% relatively little new code.\r\n\r\n%%\r\n% All of the work is done in the constructors |@fp8\/fp8.m| and\r\n% |@fp16\/fp16.m| and what we might call the \"deconstructors\" \r\n% |@fp8\/double.m| and |@fp16\/double.m|.\r\n% The constructors convert ordinary floating point\r\n% numbers to reduced precision representations by packing as many of\r\n% the 32 or 64 bits as will fit into 8 or 16 bit words.  The deconstructors\r\n% do the reverse by unpacking things.\r\n\r\n%%\r\n% Once these methods are available, almost everything else is trivial.\r\n% The code for most operations is like this one for the overloaded\r\n% addition.\r\n\r\n    type @fp16\/plus.m\r\n   \r\n%% Wikipedia test suite\r\n% The <http:\/\/en.wikipedia.org\/wiki\/Half-precision_floating-point_format\r\n% Wikipedia page> about half-precision includes several 16-bit examples\r\n% with the sign, exponent, and fraction fields separated.\r\n% I've added a couple more.\r\n%\r\n%    0 01111 0000000000 = 1\r\n%    0 00101 0000000000 = 2^-10 = eps\r\n%    0 01111 0000000001 = 1+eps = 1.0009765625 (smallest float larger than 1)\r\n%    1 10000 0000000000 = -2 \r\n%    0 11110 1111111111 = 65504 (max half precision) = 2^15*(2-eps)\r\n%    0 00001 0000000000 = 2^-14 = r ~ 6.10352e-5 (minimum positive normal)\r\n%    0 00000 1111111111 = r*(1-eps) ~ 6.09756e-5 (maximum subnormal)\r\n%    0 00000 0000000001 = r*eps ~ 5.96046e-8 (minimum positive subnormal)\r\n%    0 00000 0000000000 ~ r*eps\/2 (underflow to zero)\r\n%    0 00000 0000000000 = 0 \r\n%    1 00000 0000000000 = -0\r\n%    0 11111 0000000000 = infinity\r\n%    1 11111 0000000000 = -infinity\r\n%    0 11111 1111111111 = NaN\r\n%    0 01101 0101010101 = 0.333251953125 ~ 1\/3\r\n   \r\n%%\r\n% This provides my test suite for checking |fp16| operations on 
scalars.\r\n\r\n   clear\r\n   zero = fp16(0);\r\n   one = fp16(1);\r\n   eps = eps(one);\r\n   r = realmin(one);\r\n\r\n   tests = {'1','eps','1+eps','-2','2\/r*(2-eps)', ...\r\n            'r','r*(1-eps)','r*eps','r*eps\/2', ...\r\n            'zero','-zero','1\/zero','-1\/zero','zero\/zero','1\/3'};\r\n\r\n%%\r\n% Let's run the tests.\r\n\r\n   for t = tests(:)'\r\n       x = eval(t{:});\r\n       y = fp16(x);\r\n       z = binary(y);\r\n       w = double(y);\r\n       fprintf('  %18s  %04s %19.10g %19.10g    %s\\n', ...\r\n               z,hex(y),double(x),w,t{:})\r\n   end\r\n\r\n%% Matrix operations\r\n% Most of the methods in |@fp8| and |@fp16| handle matrices.\r\n% The 4-by-4 magic square from Durer's Melencolia I provides my first\r\n% example.\r\n\r\n   clear\r\n   format short\r\n   M = fp16(magic(4))\r\n   \r\n%%\r\n% Let's see how the packed 16-bit elements look in binary.\r\n\r\n   B = binary(M)\r\n \r\n%%\r\n% Check that the row sums are all equal.  This matrix-vector multiply\r\n% can be done exactly with the flints in the magic square.\r\n\r\n   e = fp16(ones(4,1))\r\n   Me = M*e\r\n   \r\n%% fp16 backslash\r\n% I've overloaded |mldivide|, so I can solve linear systems and\r\n% compute inverses.  The actual computation is done by |lutx|,\r\n% a \"textbook\" function that I wrote years ago, long before this\r\n% half-precision project.  
But now the MATLAB object system insures\r\n% that every individual arithmetic operation is done on unpacked\r\n% |fp16| numbers.\r\n\r\n%%\r\n% Let's generate a 5-by-5 matrix with random two-digit integer entries.\r\n\r\n   A = fp16(randi(100,5,5))\r\n   \r\n%%\r\n% I am going to use |fp16| backslash to invert |A|.\r\n% So I need the identity matrix in half precision.\r\n\r\n   I = fp16(eye(5))\r\n \r\n%%\r\n% Now the overloaded backslash calls |lutx| to compute the inverse.\r\n   \r\n   X = A\\I   \r\n   \r\n%%\r\n% Compute the residual.\r\n\r\n   AX = A*X\r\n   R = I - AX\r\n   \r\n%%\r\n% Both |AX| and |R| are what I expect from arithmetic that is\r\n% accurate to only about three decimal digits.\r\n\r\n%%\r\n% Although I get a different random |A| every time I publish this blog\r\n% post, I expect that it has a modest condition number.\r\n\r\n   kappa = cond(A)\r\n   \r\n%%\r\n% Since |A| is not badly conditioned, I can invert the computed inverse\r\n% and expect to get close to the original integer matrix.\r\n\r\n   Z = X\\I\r\n      \r\n%% fp16 SVD\r\n% I have just nonchalantly computed |cond(A)|.  But |cond| isn't on the\r\n% list of overload methods for |fp16|.  I was pleasantly surprised to \r\n% find that |matlab\\matfun\\cond.m| quietly worked on this new datatype.\r\n% Here is the core of that code.\r\n\r\n   dbtype cond 34:43, dbtype cond 47\r\n   \r\n%%\r\n% So it is correctly using the singular value decomposition, and I have\r\n% |svd| overloaded.  The SVD computation is handled by a 433 line M-file,\r\n% |svdtx|, that, like |lutx|, was written before |fp16| existed.\r\n\r\n%%\r\n% Let's compute the SVD again.\r\n\r\n   [U,S,V] = svd(A)\r\n   \r\n%%\r\n% Reconstruct |A| from its half precision SVD.  
It's not too shabby.\r\n\r\n   USVT = U*S*V'\r\n     \r\n%%\r\n% Finally, verify that we've been working all this time\r\n% with |fp16| objects.\r\n\r\n   whos\r\n\r\n%% Calculator\r\n% I introduced a |calculator| in my blog post about\r\n% <https:\/\/blogs.mathworks.com\/cleve\/2017\/04\/24\/ Roman numerals>.\r\n% Version 3.1 of\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/59085-cleve-laboratory\r\n% Cleve's Laboratory> also includes a fancier version of the calculator\r\n% that computes in four different precisions -- quarter, half, single, and\r\n% double -- and displays the results in four different formats -- decimal,\r\n% hexadecimal, binary, and Roman.\r\n\r\n%%\r\n% I like to demonstrate the calculator by clicking on the keys\r\n%\r\n%      1 0 0 0 \/ 8 1 =\r\n%\r\n% because the decimal expansion is a repeating |.123456790|.\r\n%\r\n% <<calculator_movie.gif>>\r\n\r\n%% Thanks\r\n% Thanks to MathWorkers Ben Tordoff, Steve Eddins, and Kiran Kintali\r\n% who provided background and pointers to work on half precision.\r\n    \r\n\r\n##### SOURCE END ##### bea37f3aee0e469aaa02e22e2581ad50\r\n-->","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/cleve\/files\/calculator_movie.gif\" onError=\"this.style.display ='none';\" \/><\/div><!--introduction--><p>The floating point arithmetic format that requires only 16 bits of storage is becoming increasingly popular.  Also known as <i>half precision<\/i> or <i>binary16<\/i>, the format is useful when memory is a scarce resource.... 
<a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/cleve\/2017\/05\/08\/half-precision-16-bit-floating-point-arithmetic\/\">read more >><\/a><\/p>","protected":false},"author":78,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[5,6,16,7,30],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/2459"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/users\/78"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/comments?post=2459"}],"version-history":[{"count":4,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/2459\/revisions"}],"predecessor-version":[{"id":2484,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/2459\/revisions\/2484"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/media?parent=2459"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/categories?post=2459"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/tags?post=2459"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}