{"id":4392,"date":"2019-01-16T04:32:19","date_gmt":"2019-01-16T09:32:19","guid":{"rendered":"https:\/\/blogs.mathworks.com\/cleve\/?p=4392"},"modified":"2019-01-16T17:04:59","modified_gmt":"2019-01-16T22:04:59","slug":"variable-format-half-precision-floating-point-arithmetic","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/cleve\/2019\/01\/16\/variable-format-half-precision-floating-point-arithmetic\/","title":{"rendered":"Variable Format Half Precision Floating Point Arithmetic"},"content":{"rendered":"<div class=\"content\"><!--introduction--><p>A year and a half ago I wrote a post about \"half precision\" 16-bit floating point arithmetic, <a href=\"https:\/\/blogs.mathworks.com\/cleve\/2017\/05\/08\/half-precision-16-bit-floating-point-arithmetic\">Moler on fp16<\/a>. I followed this with a bug fix, <a href=\"https:\/\/blogs.mathworks.com\/cleve\/2017\/12\/20\/bug-in-half-precision-floating-point-object\/\">bug in fp16<\/a>. Both posts were about <i>fp16<\/i>, defined in IEEE standard 754. This is only one of 15 possible 16-bit formats. In this post I am going to consider all 15.<\/p><p>There is also interest in a new format, <i>bfloat16<\/i>. A recent post by Nick Higham, compares the two, <a href=\"https:\/\/nickhigham.wordpress.com\/2018\/12\/03\/half-precision-arithmetic-fp16-versus-bfloat16\/\">Higham on fp16 and bfloat16<\/a>.  Nick mentions the interest in the two formats by Intel, AMD, NVIDIA, Arm and Google.  
These formats are two out of the 15.<\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#492fb44c-d5e2-4b4f-8ff2-6c61297641f5\">Formats<\/a><\/li><li><a href=\"#9f1b5f1d-2605-4c4f-a9a1-186de3ca4bc7\">vfp16<\/a><\/li><li><a href=\"#84fd2483-6ecf-4b50-b783-92b0938e3a57\">Anatomy<\/a><\/li><li><a href=\"#8cc9ed7f-71cb-4bdd-97df-d434797a2562\">Example<\/a><\/li><li><a href=\"#0f6852c1-df93-4300-8367-0e0f2f19a345\">Conversion to Single<\/a><\/li><li><a href=\"#018a233b-1b56-499b-ab66-9bf4f542c8b0\">Fused Multiply Add<\/a><\/li><li><a href=\"#34f9c391-3e9b-4d1f-9824-e168a3a9de94\">Subnormals<\/a><\/li><li><a href=\"#56f81515-c908-4865-add9-ad866ef1d639\">Calculator<\/a><\/li><li><a href=\"#3b9f416a-5752-45d1-a801-c9bf17b91dc2\">To be continued<\/a><\/li><\/ul><\/div><h4>Formats<a name=\"492fb44c-d5e2-4b4f-8ff2-6c61297641f5\"><\/a><\/h4><p>A floating point format is characterized by two parameters, p, the number of bits in the fraction, and q, the number of bits in the exponent. For half precision, we always have p+q = 15.  This leaves one bit for the sign.<\/p><p>The two formats of most interest are the IEEE standard <i>fp16<\/i> with p = 10 and the new <i>bfloat16<\/i> with p = 7.  The new format has three more bits in the exponent and three fewer bits in the fraction than the standard.  This increased range at the expense of precision is proving useful in machine learning and image processing.<\/p><h4>vfp16<a name=\"9f1b5f1d-2605-4c4f-a9a1-186de3ca4bc7\"><\/a><\/h4><p>My new MATLAB&reg; object is an elaboration of <tt>fp16<\/tt>, so I named it <tt>vfp16<\/tt>.  Here is its <tt>help<\/tt> entry.<\/p><pre class=\"codeinput\">   help <span class=\"string\">@vfp16\/vfp16<\/span>\r\n<\/pre><pre class=\"codeoutput\">  vfp16.  Constructor for variable format 16-bit floating point object.\r\n \r\n  y = vfp16(x) is an array, the same size as x, of uint16s.   Each\r\n      element is packed with p fraction bits, 15-p exponent bits and\r\n      one sign bit.  
A single value of the precision, p, is associated\r\n      with the entire array.\r\n \r\n      Any integer value of p in the range 0 &lt;= p &lt;= 15 is allowed,\r\n      although the extreme values are of questionable utility.\r\n      The default precision is p = 10 for IEEE standard fp16.\r\n \r\n  y = vfp16(x,p) has precision p without changing the working precision.\r\n  \r\n  Three key-value pairs may be set:\r\n     vfp16('precision',p) sets the working precision to p.\r\n     vfp16('subnormals','on'\/'off') controls gradual underflow.\r\n     vfp16('fma','off'\/'on') controls fused multiply adds.\r\n     Up to three key-value pairs are allowed in a single call to vfp16.\r\n \r\n  Two formats exist in hardware:\r\n  vfp16('fp16') sets p = 10, subnormals = on, fma = off (the default).\r\n  vfp16('bfloat16') sets p = 7, subnormals = off, fma = on.\r\n \r\n  vfp16('precision') is the current working precision.\r\n  vfp16('subnormals') is the current status of gradual underflow.\r\n  vfp16('fma') is the current status of fused multiply adds.\r\n  u = packed(y) is the uint16s in y.\r\n  p = precision(y) is the value for the entire array y. \r\n \r\n  See also: vfp16\/single,\r\n            http:\/\/blogs.mathworks.com\/cleve\/2019\/01\/16.\r\n\r\n    Reference page in Doc Center\r\n       doc vfp16\r\n\r\n\r\n<\/pre><h4>Anatomy<a name=\"84fd2483-6ecf-4b50-b783-92b0938e3a57\"><\/a><\/h4><p>The key attributes of variable format half precision are displayed in the following chart, <tt>vfp16_anatomy<\/tt>.  
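<p>Each row of <tt>vfp16_anatomy<\/tt> follows from the usual floating point formulas: with q = 15-p exponent bits, bias = 2^(q-1)-1, eps = 2^-p, realmax = (2-2^-p)*2^bias, realmin = 2^(1-bias), and tiny = 2^(1-bias-p). Here is a quick cross-check of the two hardware rows, in ordinary double precision Python rather than MATLAB; it is an illustration, not code from the <tt>vfp16<\/tt> object, and the formulas degenerate at p = 15, matching the NaNs in the chart.<\/p>

```python
# Cross-check of the vfp16_anatomy chart from the standard
# floating point formulas.  Plain Python doubles suffice here --
# an illustration, not code taken from the vfp16 object.
def anatomy(p):
    q = 15 - p                     # exponent bits; one bit is left for the sign
    bias = 2**(q - 1) - 1          # exponent bias
    eps = 2.0**(-p)                # distance from 1 to the next larger number
    realmax = (2 - 2.0**(-p)) * 2.0**bias   # largest finite value
    realmin = 2.0**(1 - bias)      # smallest normalized value
    tiny = 2.0**(1 - bias - p)     # smallest subnormal value
    return bias, eps, realmax, realmin, tiny

print(anatomy(10))   # fp16 row:     15, ~0.0009766, 65504, ~6.104e-5, ~5.96e-8
print(anatomy(7))    # bfloat16 row: 127, ~0.007812, ~3.39e38, ~1.175e-38, ~9.184e-41
```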
The extreme exponent range makes it necessary to use the <tt>vpa<\/tt> arithmetic of the Symbolic Math Toolbox&#8482; to compute <tt>vfp16_anatomy<\/tt>.<\/p><div><ul><li><tt>p<\/tt> is the precision, the number of bits in the fraction.<\/li><li><tt>bias<\/tt> is the exponent bias; exponents range from 1-bias to bias.<\/li><li><tt>eps<\/tt> is the distance from 1 to the next larger <tt>vfp16<\/tt> number.<\/li><li><tt>realmax<\/tt> is the largest <tt>vfp16<\/tt> number.<\/li><li><tt>realmin<\/tt> is the smallest normalized <tt>vfp16<\/tt> number.<\/li><li><tt>tiny<\/tt> is the smallest subnormal <tt>vfp16<\/tt> number.<\/li><\/ul><\/div><pre class=\"codeinput\">   vfp16_anatomy\r\n<\/pre><pre class=\"codeoutput\">   p  bias        eps     realmax      realmin         tiny\r\n[  1, 8191,       0.5, 8.181e2465, 3.667e-2466, 1.834e-2466]\r\n[  2, 4095,      0.25, 9.138e1232,  3.83e-1233, 9.575e-1234]\r\n[  3, 2047,     0.125,   3.03e616,  1.238e-616,  1.547e-617]\r\n[  4, 1023,    0.0625,  1.742e308,  2.225e-308,  1.391e-309]\r\n[  5,  511,   0.03125,   1.32e154,  2.983e-154,  9.323e-156]\r\n[  6,  255,   0.01562,   1.149e77,   3.454e-77,   5.398e-79]\r\n[  7,  127,  0.007812,    3.39e38,   1.175e-38,   9.184e-41]\r\n[  8,   63,  0.003906,   1.841e19,   2.168e-19,    8.47e-22]\r\n[  9,   31,  0.001953,    4.291e9,   9.313e-10,   1.819e-12]\r\n[ 10,   15, 0.0009766,    65500.0,    6.104e-5,     5.96e-8]\r\n[ 11,    7, 0.0004883,      255.9,     0.01562,    7.629e-6]\r\n[ 12,    3, 0.0002441,       16.0,        0.25,    6.104e-5]\r\n[ 13,    1, 0.0001221,        4.0,         1.0,   0.0001221]\r\n[ 14,    0,  6.104e-5,        2.0,         2.0,   0.0001221]\r\n[ 15,  NaN,  3.052e-5,        NaN,         NaN,         NaN]\r\n<\/pre><h4>Example<a name=\"8cc9ed7f-71cb-4bdd-97df-d434797a2562\"><\/a><\/h4><p>Here is the binary display of <tt>vfp16(x,p)<\/tt> as <tt>p<\/tt> varies for an <tt>x<\/tt> between 2 and 4.  
This is the same output as the animated calculator below.<\/p><pre class=\"codeinput\">   format <span class=\"string\">compact<\/span>\r\n   format <span class=\"string\">long<\/span>\r\n\r\n   x = 10\/3\r\n<\/pre><pre class=\"codeoutput\">x =\r\n   3.333333333333333\r\n<\/pre><pre class=\"codeinput\">   <span class=\"keyword\">for<\/span> p = 0:15\r\n       y = binary(vfp16(x,p));\r\n       fprintf(<span class=\"string\">'%5d   %18s\\n'<\/span>,p,y)\r\n   <span class=\"keyword\">end<\/span>\r\n<\/pre><pre class=\"codeoutput\">    0   0 100000000000001 \r\n    1   0 10000000000000 1\r\n    2   0 1000000000000 11\r\n    3   0 100000000000 101\r\n    4   0 10000000000 1011\r\n    5   0 1000000000 10101\r\n    6   0 100000000 101011\r\n    7   0 10000000 1010101\r\n    8   0 1000000 10101011\r\n    9   0 100000 101010101\r\n   10   0 10000 1010101011\r\n   11   0 1000 10101010101\r\n   12   0 100 101010101011\r\n   13   0 10 1010101010101\r\n   14   0 1 10000000000000\r\n   15   0  101010101010101\r\n<\/pre><p>The upper triangle in the output is the biased exponent field and the lower triangle is the fraction.  At <tt>p = 0<\/tt> there is no room for a fraction and at <tt>p = 15<\/tt> there is no room for an exponent. Consequently, these formats are not very useful.<\/p><p>Here are the results when these values are converted back to doubles. As the precision increases the error is cut in half at each step. The last two values of <tt>p<\/tt> show failure. 
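<p>The halving of the error is easy to see directly: for x between 2 and 4 a p-bit fraction gives a spacing of 2^(1-p) between adjacent values, so round-to-nearest can be simulated in one line. Here is a sketch in Python; it is an illustration of the rounding, not the <tt>vfp16<\/tt> code itself.<\/p>

```python
# Simulated round-to-nearest for 2 <= x < 4, where a p-bit fraction
# gives a spacing (ulp) of 2**(1-p) between adjacent values.
# An illustration only -- not how the vfp16 object works internally.
x = 10/3
for p in (7, 10, 13):
    ulp = 2.0**(1 - p)
    y = round(x / ulp) * ulp
    print(p, y, x - y)   # the error shrinks by half for each extra bit
```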
The important values of <tt>p<\/tt> are 7 and 10.<\/p><pre class=\"codeinput\">   disp(<span class=\"string\">'    p           y                  x-y'<\/span>)\r\n   <span class=\"keyword\">for<\/span> p = 0:15\r\n       y = double(vfp16(x,p));\r\n       fprintf(<span class=\"string\">'%5d %18.13f   %18.13f\\n'<\/span>,p,y,x-y)\r\n   <span class=\"keyword\">end<\/span>\r\n<\/pre><pre class=\"codeoutput\">    p           y                  x-y\r\n    0    4.0000000000000     -0.6666666666667\r\n    1    3.0000000000000      0.3333333333333\r\n    2    3.5000000000000     -0.1666666666667\r\n    3    3.2500000000000      0.0833333333333\r\n    4    3.3750000000000     -0.0416666666667\r\n    5    3.3125000000000      0.0208333333333\r\n    6    3.3437500000000     -0.0104166666667\r\n    7    3.3281250000000      0.0052083333333\r\n    8    3.3359375000000     -0.0026041666667\r\n    9    3.3320312500000      0.0013020833333\r\n   10    3.3339843750000     -0.0006510416667\r\n   11    3.3330078125000      0.0003255208333\r\n   12    3.3334960937500     -0.0001627604167\r\n   13    3.3332519531250      0.0000813802083\r\n   14                NaN                  NaN\r\n   15    2.6666259765625      0.6667073567708\r\n<\/pre><h4>Conversion to Single<a name=\"0f6852c1-df93-4300-8367-0e0f2f19a345\"><\/a><\/h4><p>With eight bits in the exponent, the new <i>bfloat16<\/i> has a significant advantage over the standard <i>fp16<\/i> when it comes to conversion between single and half precision.  The sign and exponent fields of the half precision word are the same as single precision, so the half precision word is just the front half of the single precision word.  
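<p>Outside MATLAB, the same truncation is a two-line experiment in Python using only the standard <tt>struct<\/tt> module: the first two bytes of the big-endian single are the <i>bfloat16<\/i> bit pattern. (Simple truncation is shown here; rounding the discarded bits is also common in practice.)<\/p>

```python
import math
import struct

# A single precision value packed big-endian is four bytes;
# keeping just the first two bytes gives the bfloat16 bit pattern.
s = struct.pack('>f', math.pi)
print(s.hex())      # 40490fdb -- single precision pi
print(s[:2].hex())  # 4049     -- bfloat16 pi, the front half of the word
```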
For example,<\/p><pre class=\"codeinput\">   format <span class=\"string\">hex<\/span>\r\n   disp(single(pi));\r\n   disp(packed(vfp16(pi,7)))\r\n<\/pre><pre class=\"codeoutput\">   40490fdb\r\n   4049\r\n<\/pre><h4>Fused Multiply Add<a name=\"018a233b-1b56-499b-ab66-9bf4f542c8b0\"><\/a><\/h4><p>The cores of many algorithms for matrix computation often involve one of two fundamental vector operations, \"dot\" and \"daxpy\".  Let <tt>x<\/tt> and <tt>y<\/tt> be column vectors of length <tt>n<\/tt> and let <tt>a<\/tt> be a scalar. The extended dot product is<\/p><pre class=\"language-matlab\">x'*y + a\r\n<\/pre><p>The so-called elementary vector operation, or \"daxpy\" for \"double precision a times x plus y\" is<\/p><pre class=\"language-matlab\">a*x + y\r\n<\/pre><p>Both involve loops of length n around a multiplication followed by an addition.  Many modern computer architectures have fused multiply add instructions, FMA, where this operation is a single instruction. Moreover, the multiplication produces a result in twice the working precision and the addition is done with that higher precision.  The <a href=\"https:\/\/software.intel.com\/sites\/default\/files\/managed\/40\/8b\/bf16-hardware-numerics-definition-white-paper.pdf\"><i>bfloat16<\/i> specification<\/a> includes FMA.  With our <tt>vfp16<\/tt> method FMA can be turned off and on.  It is off by default.<\/p><h4>Subnormals<a name=\"34f9c391-3e9b-4d1f-9824-e168a3a9de94\"><\/a><\/h4><p>I wrote a blog post about <a href=\"https:\/\/blogs.mathworks.com\/cleve\/2014\/07\/21\/floating-point-denormals-insignificant-but-controversial-2\/\">floating point denormals<\/a> several years ago.  In the revision of IEEE 754 denormals were renamed subnormals.  
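<p>The subnormals switch is easy to illustrate. For p = 7 the smallest normalized number is 2^-126; with subnormals off, anything smaller simply flushes to zero, while with subnormals on the gap down to zero is filled in steps of the smallest subnormal, 2^-133. Here is a sketch in Python with a hypothetical helper; it is not code from the <tt>vfp16<\/tt> object.<\/p>

```python
# Sketch of gradual underflow for the bfloat16-like case p = 7.
# Hypothetical helper, not taken from the vfp16 object.
REALMIN = 2.0**-126   # smallest normalized number for p = 7
TINY = 2.0**-133      # smallest subnormal number for p = 7

def underflow(x, subnormals=True):
    if abs(x) >= REALMIN:
        return x           # normal range, nothing to do
    if not subnormals:
        return 0.0         # flush to zero
    # gradual underflow: round to a multiple of the smallest subnormal
    return round(x / TINY) * TINY

print(underflow(2.0**-130, subnormals=True))   # kept as a subnormal
print(underflow(2.0**-130, subnormals=False))  # flushed to zero
```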
Whatever they are called, they are relatively rare with the large range of <i>bfloat16<\/i>, so they can be turned off.<\/p><p>All the properties of <i>bfloat16<\/i> can be obtained with the statement<\/p><pre class=\"codeinput\">   vfp16(<span class=\"string\">'bfloat16'<\/span>)\r\n<\/pre><p>which sets <tt>precision = 7<\/tt>, <tt>subnormals = off<\/tt> and <tt>fma = on<\/tt>.<\/p><p>The defaults can be restored with<\/p><pre class=\"codeinput\">   vfp16(<span class=\"string\">'fp16'<\/span>)\r\n<\/pre><p>which sets <tt>precision = 10<\/tt>, <tt>subnormals = on<\/tt> and <tt>fma = off<\/tt>.<\/p><h4>Calculator<a name=\"56f81515-c908-4865-add9-ad866ef1d639\"><\/a><\/h4><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"http:\/\/blogs.mathworks.com\/cleve\/files\/calculator_movie2.gif\" alt=\"\"> <\/p><p>This animation shows how I've added <tt>@vfp16<\/tt> to the calculator that I mentioned in blog posts about <a href=\"https:\/\/blogs.mathworks.com\/cleve\/2017\/05\/08\/half-precision-16-bit-floating-point-arithmetic\">half precision<\/a> and <a href=\"https:\/\/blogs.mathworks.com\/cleve\/2017\/04\/24\/a-roman-numeral-object-with-arithmetic-matrices-and-a-clock\">roman numerals<\/a>.  When the radio button for the word size 16 is selected, a slider appears that allows you to select the precision. The button is labeled \"16\", followed by the precision. Moving the slider up increases the number of bits in the fraction and consequently the precision, while moving the slider down decreases <tt>p<\/tt> and the precision.<\/p><p>I could make a variable format quarter precision object, but none of the other formats are useful.  And variable format single and double precision objects have lots of bits, but little else to offer.<\/p><h4>To be continued<a name=\"3b9f416a-5752-45d1-a801-c9bf17b91dc2\"><\/a><\/h4><p>I'm not done with this.  I am still in the process of extending the linear algebra functions to variable format.  
I hope to report on that in a couple of weeks.  In the meantime, I'll post <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/59085-cleve_s-laboratory\">Version 4.20 of Cleve's Laboratory<\/a> with what I have done so far.<\/p><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\">Published with MATLAB&reg; R2018b<br><\/p><\/div>","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"http:\/\/blogs.mathworks.com\/cleve\/files\/calculator_movie2.gif\" onError=\"this.style.display ='none';\" \/><\/div><!--introduction--><p>A year and a half ago I wrote a post about \"half precision\" 16-bit floating point arithmetic, <a href=\"https:\/\/blogs.mathworks.com\/cleve\/2017\/05\/08\/half-precision-16-bit-floating-point-arithmetic\">Moler on fp16<\/a>. I followed this with a bug fix, <a href=\"https:\/\/blogs.mathworks.com\/cleve\/2017\/12\/20\/bug-in-half-precision-floating-point-object\/\">bug in fp16<\/a>. Both posts were about <i>fp16<\/i>, defined in IEEE standard 754. This is only one of 15 possible 16-bit formats. In this post I am going to consider all 15.... 
<a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/cleve\/2019\/01\/16\/variable-format-half-precision-floating-point-arithmetic\/\">read more >><\/a><\/p>","protected":false},"author":78,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[16,7],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/4392"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/users\/78"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/comments?post=4392"}],"version-history":[{"count":5,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/4392\/revisions"}],"predecessor-version":[{"id":4408,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/4392\/revisions\/4408"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/media?parent=4392"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/categories?post=4392"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/tags?post=4392"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}