{"id":2490,"date":"2017-05-22T15:48:49","date_gmt":"2017-05-22T20:48:49","guid":{"rendered":"https:\/\/blogs.mathworks.com\/cleve\/?p=2490"},"modified":"2017-05-22T15:48:49","modified_gmt":"2017-05-22T20:48:49","slug":"quadruple-precision-128-bit-floating-point-arithmetic","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/cleve\/2017\/05\/22\/quadruple-precision-128-bit-floating-point-arithmetic\/","title":{"rendered":"Quadruple Precision, 128-bit Floating Point Arithmetic"},"content":{"rendered":"<div class=\"content\"><!--introduction--><p>The floating point arithmetic format that occupies 128 bits of storage is known as <i>binary128<\/i> or <i>quadruple precision<\/i>. This blog post describes an implementation of quadruple precision programmed entirely in the MATLAB language.<\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/cleve\/files\/screen_shot.png\" alt=\"\"> <\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#0703d33c-5814-4576-b26e-e2ce365e8dfc\">Background<\/a><\/li><li><a href=\"#0ead9d42-dc3b-4a65-a8b5-cb1bb79074ed\">Beyond double<\/a><\/li><li><a href=\"#af23e250-fbe8-476e-a6eb-70d8f6bd12da\">Floating point anatomy<\/a><\/li><li><a href=\"#cca96534-8a4c-4c8a-b10d-0677b7e65d8e\">Precision and range<\/a><\/li><li><a href=\"#cb2dceea-993a-4496-8c32-8484aabe6c5f\">Floating point integers<\/a><\/li><li><a href=\"#aa7bc39b-6d1e-482f-91be-18607d3b95cf\">Table<\/a><\/li><li><a href=\"#7b575c04-ef9c-466e-a586-ca699e81b772\">fp128<\/a><\/li><li><a href=\"#79cebb37-dd75-4952-a45a-c7524d9c6620\">Examples<\/a><\/li><li><a href=\"#d23480fa-8ddc-42d4-9f62-6fe3f640c79d\">Matrix operations<\/a><\/li><li><a href=\"#26310ffb-dd99-4b37-8763-dc1151c64cad\">Quadruple precision backslash<\/a><\/li><li><a href=\"#7400bdf6-20e8-403b-b6b9-6f1e89ca240a\">Quadruple precision SVD<\/a><\/li><li><a href=\"#88791357-8813-4ab2-b887-80e7f4965d42\">Rosser matrix<\/a><\/li><li><a 
href=\"#d748cfa7-86db-4ba5-999d-2232962a9f19\">Postscript<\/a><\/li><\/ul><\/div><h4>Background<a name=\"0703d33c-5814-4576-b26e-e2ce365e8dfc\"><\/a><\/h4><p>The IEEE 754 standard, published in 1985, defines formats for floating point numbers that occupy 32 or 64 bits of storage. These formats are known as <i>binary32<\/i> and <i>binary64<\/i>, or more frequently as <i>single<\/i> and <i>double precision<\/i>.  For many years MATLAB used only double precision and it remains our default format.  Single precision has been added gradually over the last several years and is now also fully supported.<\/p><p>A revision of IEEE 754, published in 2008, defines two more floating point formats.  One, <i>binary16<\/i> or <i>half precision<\/i>, occupies only 16 bits and was the subject of my <a href=\"https:\/\/blogs.mathworks.com\/cleve\/2017\/05\/08\/half-precision-16-bit-floating-point-arithmetic\/\">previous blog post<\/a>.  It is primarily intended to reduce storage and memory bandwidth requirements.  Since it provides only \"half\" precision, its use for actual computation is problematic.<\/p><p>The other new format introduced in IEEE 754-2008 is <i>binary128<\/i> or <i>quadruple precision<\/i>.  It is intended for situations where the accuracy or range of double precision is inadequate.<\/p><p>I see two descriptions of quadruple precision software implementations on the Web.<\/p><div><ul><li><a href=\"http:\/\/www.advanpix.com\/2013\/01\/20\/fast-quadruple-precision-computations\/\">ADVANPIX Multiprecision Toolbox for MATLAB<\/a><\/li><\/ul><\/div><div><ul><li><a href=\"http:\/\/gcc.gnu.org\/onlinedocs\/libquadmath.pdf\">GCC Quad-Precision Math Library<\/a><\/li><\/ul><\/div><p>I have not used either package, but judging by their Web pages, they both appear to be complete and well supported.<\/p><p>The MATLAB Symbolic Math Toolbox provides <tt>vpa<\/tt>, arbitrary precision decimal floating point arithmetic, and <tt>sym<\/tt>, exact rational arithmetic. 
Both provide accuracy and range well beyond quadruple precision, but do not specifically support the 128-bit IEEE format.<\/p><p>My goal here is to describe a prototype of a MATLAB object, <tt>fp128<\/tt>, that implements quadruple precision with code written entirely in the MATLAB language.  It is not very efficient, but it does allow experimentation with the 128-bit format.<\/p><h4>Beyond double<a name=\"0ead9d42-dc3b-4a65-a8b5-cb1bb79074ed\"><\/a><\/h4><p>There are other floating point formats beyond double precision. <i>Long double<\/i> usually refers to the 80-bit extended precision floating point registers available with the Intel x86 architecture and described as <i>double extended<\/i> in IEEE 754.  This provides the same exponent range as quadruple precision, but much less accuracy.<\/p><p><i>Double double<\/i> refers to the use of a pair of double precision values.  The exponent field and sign bit of the second double are ignored, so this is effectively a 116-bit format.  Both the exponent range and the precision are more than double but less than quadruple.<\/p><h4>Floating point anatomy<a name=\"af23e250-fbe8-476e-a6eb-70d8f6bd12da\"><\/a><\/h4><p>The format of a floating point number is characterized by two parameters, <tt>p<\/tt>, the number of bits in the fraction, and <tt>q<\/tt>, the number of bits in the exponent.  I will compare four precisions, <i>half<\/i>, <i>single<\/i>, <i>double<\/i>, and <i>quadruple<\/i>. 
The four pairs of characterizing parameters are<\/p><pre class=\"codeinput\">   p = [10, 23, 52, 112];\r\n<\/pre><pre class=\"codeinput\">   q = [5, 8, 11, 15];\r\n<\/pre><p>With these values of <tt>p<\/tt> and <tt>q<\/tt>, and with one more bit for the sign, the total number of bits in the word, <tt>w<\/tt>, is a power of two.<\/p><pre class=\"codeinput\">   format <span class=\"string\">shortg<\/span>\r\n   w = p + q + 1\r\n<\/pre><pre class=\"codeoutput\">w =\r\n    16    32    64   128\r\n<\/pre><p><b>Normalized numbers<\/b><\/p><p>Most floating point numbers are <i>normalized<\/i>, and are expressed as<\/p><p>$$ x = \\pm (1+f)2^e $$<\/p><p>The <i>fraction<\/i> $f$ is in the half open interval<\/p><p>$$ 0 \\leq f &lt; 1 $$<\/p><p>The binary representation of $f$ requires at most <tt>p<\/tt> bits. In other words $2^p f$ is an integer in the range<\/p><p>$$ 0 \\leq 2^p f &lt; 2^p $$<\/p><p>The <i>exponent<\/i> $e$ is an integer in the range<\/p><p>$$ -b+1 \\leq e \\leq b $$<\/p><p>The quantity $b$ is both the largest exponent and the <tt>bias<\/tt>.<\/p><p>$$ b = 2^{q-1} - 1 $$<\/p><pre class=\"codeinput\">   b = 2.^(q-1)-1\r\n<\/pre><pre class=\"codeoutput\">b =\r\n          15         127        1023       16383\r\n<\/pre><p>The fractional part of a normalized number is $1+f$, but only $f$ needs to be stored.  That leading $1$ is known as the <i>hidden bit<\/i>.<\/p><p><b>Subnormal<\/b><\/p><p>There are two values of the exponent $e$ for which the biased exponent, $e+b$, reaches the smallest and largest values possible to represent in <tt>q<\/tt> bits.  The smallest is<\/p><p>$$ e + b = 0 $$<\/p><p>The corresponding floating point numbers do not have a hidden leading bit.  
These are the <i>subnormal<\/i> or <i>denormal<\/i> numbers.<\/p><p>$$ x = \\pm f 2^{-b+1} $$<\/p><p><b>Infinity and Not-A-Number<\/b><\/p><p>The largest possible biased exponent is<\/p><p>$$ e + b = 2^q-1 $$.<\/p><p>Quantities with this exponent field represent <i>infinities<\/i> and <i>NaN<\/i>, or <i>Not-A-Number<\/i>.<\/p><p>The percentage of floating point numbers that are exceptional because they are subnormal, infinity or NaN increases as the precision decreases.  Exceptional exponents are only $2$ values out of $2^q$.  For quadruple precision this is $2\/2^{15}$, which is less than one one-hundredth of one percent.<\/p><p>Encode the sign bit with <tt>s = 0<\/tt> for nonnegative and <tt>s = 1<\/tt> for negative.  And encode the exponent with an offsetting bias, <tt>b<\/tt>. Then a floating point number can be packed in <tt>w<\/tt> bits with<\/p><pre>  x = [s e+b 2^p*f]<\/pre><h4>Precision and range<a name=\"cca96534-8a4c-4c8a-b10d-0677b7e65d8e\"><\/a><\/h4><p><b>epsilon<\/b><\/p><p>If a real number cannot be expressed with a binary expansion requiring at most <tt>p<\/tt> bits, it must be approximated by a floating point number that does have such a binary representation.  This is <i>roundoff error<\/i>. The important quantity characterizing precision is <i>machine epsilon<\/i>, or <tt>eps<\/tt>.  In MATLAB, <tt>eps(x)<\/tt> is the distance from <tt>x<\/tt> to the next larger (in absolute value) floating point number (of that class). 
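For the double precision case this definition can be checked directly in Python, where `math.ulp` and `math.nextafter` (available in Python 3.9 and later) play the role of `eps(x)`. A quick sketch:

```python
import math

# math.ulp(x) is the binary64 analogue of MATLAB's eps(x):
# the spacing from abs(x) to the next larger double precision number.
eps_double = math.ulp(1.0)

assert eps_double == 2.0**-52                             # eps, since p = 52
assert math.nextafter(1.0, math.inf) == 1.0 + eps_double  # next double after 1
assert math.ulp(2.0**100) == 2.0**(100 - 52)              # spacing scales with 2^e
```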
With no argument, <tt>eps<\/tt> is simply the difference between <tt>1<\/tt> and the next larger floating point number.<\/p><pre class=\"codeinput\">    format <span class=\"string\">shortg<\/span>\r\n    eps = 2.^(-p)\r\n<\/pre><pre class=\"codeoutput\">eps =\r\n   0.00097656   1.1921e-07   2.2204e-16   1.9259e-34\r\n<\/pre><p>This tells us that quadruple precision is good for about 34 decimal digits of accuracy, double for about 16 decimal digits, single for about 7, and half for about 3.<\/p><p><b>realmax<\/b><\/p><p>If a real number, or the result of an arithmetic operation, is too large to be represented, it <i>overflows<\/i> and is replaced by <i>infinity<\/i>. The largest floating point number that does not overflow is <tt>realmax<\/tt>. When I try to compute quadruple <tt>realmax<\/tt> with double precision, it overflows.  I will fix this up in the table to follow.<\/p><pre class=\"codeinput\">    realmax = 2.^b.*(2-eps)\r\n<\/pre><pre class=\"codeoutput\">realmax =\r\n        65504   3.4028e+38  1.7977e+308          Inf\r\n<\/pre><p><b>realmin<\/b><\/p><p><i>Underflow<\/i> and representation of very small numbers is more complicated. The smallest normalized floating point number is <tt>realmin<\/tt>.  When I try to compute quadruple <tt>realmin<\/tt> it underflows to zero.  Again, I will fix this up in the table.<\/p><pre class=\"codeinput\">    realmin = 2.^(-b+1)\r\n<\/pre><pre class=\"codeoutput\">realmin =\r\n   6.1035e-05   1.1755e-38  2.2251e-308            0\r\n<\/pre><p><b>tiny<\/b><\/p><p>But there are numbers smaller than <tt>realmin<\/tt>.  IEEE 754 introduced the notion of <i>gradual underflow<\/i> and <i>denormal<\/i> numbers.  In the 2008 revised standard their name was changed to <i>subnormal<\/i>.<\/p><p>Think of roundoff in numbers near underflow.  Before 754, floating point numbers had the disconcerting property that <tt>x<\/tt> and <tt>y<\/tt> could be unequal, but their difference could underflow, so <tt>x-y<\/tt> becomes <tt>0<\/tt>. 
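Today's double precision arithmetic shows the repair. A quick Python check (the constants here are binary64, not quadruple):

```python
import math
import sys

tiny = math.ulp(0.0)             # smallest subnormal double, 2^-1074
realmin = sys.float_info.min     # smallest normalized double, 2^-1022

x = 3 * tiny
y = 2 * tiny
assert x != y and x - y == tiny  # unequal numbers keep a nonzero difference

# Subnormals fill the gap below realmin with the same spacing as just above it.
assert realmin - tiny < realmin < realmin + tiny
```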
With 754 the gap between <tt>0<\/tt> and <tt>realmin<\/tt> is filled with numbers whose spacing is the same as the spacing between <tt>realmin<\/tt> and <tt>2*realmin<\/tt>. I like to call this spacing, and the smallest subnormal number, <tt>tiny<\/tt>.<\/p><pre class=\"codeinput\">    tiny = realmin.*eps\r\n<\/pre><pre class=\"codeoutput\">tiny =\r\n   5.9605e-08   1.4013e-45  4.9407e-324            0\r\n<\/pre><h4>Floating point integers<a name=\"cb2dceea-993a-4496-8c32-8484aabe6c5f\"><\/a><\/h4><p><b>flintmax<\/b><\/p><p>It is possible to do integer arithmetic with floating point numbers. I like to call such numbers <i>flints<\/i>.  When we write the numbers $3$ and $3.0$, they are different descriptions of the same integer, but we think of one as fixed point and the other as floating point. The largest flint is <tt>flintmax<\/tt>.<\/p><pre class=\"codeinput\">    flintmax = 2.\/eps\r\n<\/pre><pre class=\"codeoutput\">flintmax =\r\n         2048   1.6777e+07   9.0072e+15   1.0385e+34\r\n<\/pre><p>Technically all the floating point numbers larger than <tt>flintmax<\/tt> are integers, but the spacing between them is larger than one, so it is not safe to use them for integer arithmetic. Only integer-valued floating point numbers between <tt>0<\/tt> and <tt>flintmax<\/tt> are allowed to be called flints.<\/p><h4>Table<a name=\"aa7bc39b-6d1e-482f-91be-18607d3b95cf\"><\/a><\/h4><p>Let's collect all these anatomical characteristics together in a new MATLAB <tt>table<\/tt>.  
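The quadruple entries that overflowed and underflowed above can be produced exactly in Python, whose unbounded integers and `Fraction`s sidestep both problems. A sketch using only the formulas already given:

```python
from fractions import Fraction

p = {'half': 10, 'single': 23, 'double': 52, 'quadruple': 112}
q = {'half': 5, 'single': 8, 'double': 11, 'quadruple': 15}

def characteristics(fmt):
    """Exact format characteristics; Python integers cannot overflow."""
    b = 2**(q[fmt] - 1) - 1                        # bias = largest exponent
    return {
        'b':        b,
        'eps':      Fraction(1, 2**p[fmt]),        # 2^(-p)
        'realmax':  2**(b + 1) - 2**(b - p[fmt]),  # 2^b * (2 - eps), exact integer
        'realmin':  Fraction(1, 2**(b - 1)),       # 2^(-b+1)
        'tiny':     Fraction(1, 2**(b - 1 + p[fmt])),  # realmin * eps
        'flintmax': 2**(p[fmt] + 1),               # 2/eps
    }

quad = characteristics('quadruple')
# realmax ~ 1.18973e4932: 4933 decimal digits, leading digits 11897
assert len(str(quad['realmax'])) == 4933
assert str(quad['realmax']).startswith('11897')
assert characteristics('half')['realmax'] == 65504
assert float(characteristics('double')['eps']) == 2.0**-52
```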
I have now edited the output and inserted the correct quadruple precision values.<\/p><pre class=\"codeinput\">   T = [w; p; q; b; eps; realmax; realmin; tiny; flintmax];\r\n\r\n   T = table(T(:,1), T(:,2), T(:,3), T(:,4), <span class=\"keyword\">...<\/span>\r\n      <span class=\"string\">'variablenames'<\/span>,{<span class=\"string\">'half'<\/span>,<span class=\"string\">'single'<\/span>,<span class=\"string\">'double'<\/span>,<span class=\"string\">'quadruple'<\/span>}, <span class=\"keyword\">...<\/span>\r\n      <span class=\"string\">'rownames'<\/span>,{<span class=\"string\">'w'<\/span>,<span class=\"string\">'p'<\/span>,<span class=\"string\">'q'<\/span>,<span class=\"string\">'b'<\/span>,<span class=\"string\">'eps'<\/span>,<span class=\"string\">'realmax'<\/span>,<span class=\"string\">'realmin'<\/span>, <span class=\"keyword\">...<\/span>\r\n                  <span class=\"string\">'tiny'<\/span>,<span class=\"string\">'flintmax'<\/span>});\r\n\r\n   type <span class=\"string\">Table.txt<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\n                 half         single        double       quadruple \r\n              __________    __________    ___________    __________\r\n\r\n    w                 16            32             64           128\r\n    p                 10            23             52           112\r\n    q                  5             8             11            15\r\n    b                 15           127           1023         16383\r\n    eps       0.00097656    1.1921e-07     2.2204e-16    1.9259e-34\r\n    realmax        65504    3.4028e+38    1.7977e+308   1.190e+4932\r\n    realmin   6.1035e-05    1.1755e-38    2.2251e-308   3.362e-4932\r\n    tiny      5.9605e-08    1.4013e-45    4.9407e-324   6.475e-4966\r\n    flintmax        2048    1.6777e+07     9.0072e+15    1.0385e+34\r\n<\/pre><h4>fp128<a name=\"7b575c04-ef9c-466e-a586-ca699e81b772\"><\/a><\/h4><p>I am currently working on code for an object, <tt>@fp128<\/tt>, that 
could provide a full implementation of quadruple-precision arithmetic.  The methods available so far are<\/p><pre class=\"codeinput\">   methods(fp128)\r\n<\/pre><pre class=\"codeoutput\">\r\nMethods for class fp128:\r\n\r\nabs         eq          le          mtimes      realmax     subsref     \r\ncond        fp128       lt          ne          realmin     svd         \r\ndiag        frac        lu          norm        shifter     sym         \r\ndisp        ge          max         normalize   sig         times       \r\ndisplay     gt          minus       normalize2  sign        tril        \r\ndouble      hex         mldivide    plus        sqrt        triu        \r\nebias       hypot       mpower      power       subsasgn    uminus      \r\neps         ldivide     mrdivide    rdivide     subsindex   \r\n\r\n<\/pre><p>The code that I have for quadruple precision is much more complex than the code that I have for <a href=\"https:\/\/blogs.mathworks.com\/cleve\/2017\/05\/08\/half-precision-16-bit-floating-point-arithmetic\/\">half precision<\/a>.  There I am able to \"cheat\" by converting half precision numbers to doubles and relying on traditional MATLAB arithmetic.  I can't do that for quads.<\/p><p>The storage scheme for quads is described in the help entry for the constructor.<\/p><pre class=\"codeinput\">   help <span class=\"string\">@fp128\/fp128<\/span>\r\n<\/pre><pre class=\"codeoutput\">  fp128 Quad precision constructor.\r\n  z = fp128(x) has three fields.\r\n    x = s*(1+f)*2^e, where\r\n    z.s, one uint8, s = (-1)^sg = 1-2*sg, sg = (1-s)\/2.\r\n    z.e, 15 bits, biased exponent, one uint16.\r\n         b = 2^14-1 = 16383,\r\n         eb = e + b,\r\n         1 &lt;= eb &lt;= 2*b for normalized quads,\r\n         eb = 0 for subnormal quads,\r\n         eb = 2*b+1 = 32767 for infinity and NaN.\r\n    z.f, 112 bits, nonnegative fraction,\r\n         4-vector of uint64s, each with 1\/4-th of the bits,\r\n         0 &lt;= f(k) &lt; 2^28, 4*28 = 112. 
\r\n         z.f represents sum(f .* pow2), pow2 = 2.^(-28*(1:4))       \r\n\r\n    Reference page in Doc Center\r\n       doc fp128\r\n\r\n\r\n<\/pre><p>Breaking the 112-bit fraction into four 28-bit pieces makes it possible to do arithmetic operations on the pieces without worrying about integer overflow.  The core of the <tt>times<\/tt> code, which implements <tt>x.*y<\/tt>, is the convolution of the two fractional parts.<\/p><pre class=\"codeinput\">   dbtype <span class=\"string\">45:53<\/span> <span class=\"string\">@fp128\/times<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\n45            % The multiplication.\r\n46            % z.f = conv(x.f,y.f);\r\n47            % Restore hidden 1's.\r\n48            xf = [1 x.f];\r\n49            yf = [1 y.f];\r\n50            zf = zeros(1,9,'uint64');\r\n51            for k = 1:5\r\n52                zf(k:k+4) = zf(k:k+4) + yf(k)*xf;\r\n53            end\r\n<\/pre><p>The result of the convolution, <tt>zf<\/tt>, is a <tt>uint64<\/tt> vector of length nine whose elements, each a sum of up to five 56-bit products, can occupy as many as 59 bits, still comfortably within uint64 range.  It must be renormalized to fit the <tt>fp128<\/tt> storage scheme.<\/p><p>Addition and subtraction involve addition and subtraction of the fractional parts after they have been shifted so that the corresponding exponents are equal.  Again, this produces temporary vectors that must be renormalized.<\/p><p>Scalar division, <tt>y\/x<\/tt>, is done by first computing the reciprocal of the denominator, <tt>1\/x<\/tt>, and then doing one final multiplication, <tt>1\/x * y<\/tt>.  The reciprocal is computed by a few steps of Newton iteration, starting with a scaled reciprocal, <tt>1\/double(x)<\/tt>.<\/p><h4>Examples<a name=\"79cebb37-dd75-4952-a45a-c7524d9c6620\"><\/a><\/h4><p>The output for each example shows the three fields in hexadecimal -- one sign field, one biased exponent field, and one fraction field that is a vector with four entries displayed with seven hex digits. 
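The limb convolution in `times` can be mimicked in Python, where unbounded integers provide an exact product to check against. This is a sketch of the idea, not the `fp128` code itself; the limb width of 28 bits matches the constructor's storage scheme:

```python
W, NLIMBS = 28, 4          # 4 limbs of 28 bits = one 112-bit fraction

def to_limbs(f):
    """Split a 112-bit integer fraction into 28-bit limbs, most significant first."""
    return [(f >> (W * (NLIMBS - 1 - k))) & ((1 << W) - 1) for k in range(NLIMBS)]

def conv_multiply(fx, fy):
    """Multiply (1+fx)*(1+fy) by convolving limb vectors, as @fp128/times does."""
    xf = [1] + to_limbs(fx)          # restore the hidden 1's
    yf = [1] + to_limbs(fy)
    zf = [0] * 9                     # conv of two length-5 vectors
    for k in range(5):
        for j in range(5):           # in MATLAB: zf(k:k+4) = zf(k:k+4) + yf(k)*xf
            zf[k + j] += yf[k] * xf[j]
    # Renormalization step: propagate the limb weights back into one integer.
    total = 0
    for z in zf:
        total = (total << W) + z
    return total                     # equals (2^112 + fx) * (2^112 + fy)

fx = (1 << 112) // 3                 # fraction bits of two arbitrary quads
fy = (1 << 112) // 7
assert conv_multiply(fx, fy) == (2**112 + fx) * (2**112 + fy)
```

In MATLAB each `zf` element must stay below 2^64; here Python's unbounded integers make the exact cross-check trivial.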
This is followed by a 36 significant digit decimal representation.<\/p><p><b>One<\/b><\/p><pre class=\"codeinput\">   clear\r\n   format <span class=\"string\">long<\/span>\r\n   one = fp128(1)\r\n<\/pre><pre class=\"codeoutput\"> \r\none = \r\n   0 3fff  0000000  0000000  0000000  0000000\r\n   1.0\r\n<\/pre><p><b>eps<\/b><\/p><pre class=\"codeinput\">   eps = eps(one)\r\n<\/pre><pre class=\"codeoutput\"> \r\neps = \r\n   0 3f8f  0000000  0000000  0000000  0000000\r\n   0.000000000000000000000000000000000192592994438723585305597794258492732\r\n<\/pre><p><b>1 + eps<\/b><\/p><pre class=\"codeinput\">   one_plus_eps = one + eps\r\n<\/pre><pre class=\"codeoutput\"> \r\none_plus_eps = \r\n   0 3fff  0000000  0000000  0000000  0000001\r\n   1.00000000000000000000000000000000019\r\n<\/pre><p><b>2 - eps<\/b><\/p><pre class=\"codeinput\">   two_minus_eps = 2 - eps\r\n<\/pre><pre class=\"codeoutput\"> \r\ntwo_minus_eps = \r\n   0 3fff  fffffff  fffffff  fffffff  fffffff\r\n   1.99999999999999999999999999999999981\r\n<\/pre><p><b>realmin<\/b><\/p><pre class=\"codeinput\">   rmin = realmin(one)\r\n<\/pre><pre class=\"codeoutput\"> \r\nrmin = \r\n   0 0001  0000000  0000000  0000000  0000000\r\n   3.3621031431120935062626778173217526e-4932\r\n<\/pre><p><b>realmax<\/b><\/p><pre class=\"codeinput\">   rmax = realmax(one)\r\n<\/pre><pre class=\"codeoutput\"> \r\nrmax = \r\n   0 7ffe  fffffff  fffffff  fffffff  fffffff\r\n   1.18973149535723176508575932662800702e4932\r\n<\/pre><p><b>Compute 1\/10 with double, then convert to quadruple.<\/b><\/p><pre class=\"codeinput\">   dble_tenth = fp128(1\/10)\r\n<\/pre><pre class=\"codeoutput\"> \r\ndble_tenth = \r\n   0 3ffb  9999999  99999a0  0000000  0000000\r\n   0.100000000000000005551115123125782702\r\n<\/pre><p><b>Compute 1\/10 with quadruple.<\/b><\/p><pre class=\"codeinput\">   quad_tenth = 1\/fp128(10)\r\n<\/pre><pre class=\"codeoutput\"> \r\nquad_tenth = \r\n   0 3ffb  9999999  9999999  9999999  9999999\r\n   
0.0999999999999999999999999999999999928\r\n<\/pre><p><b>Double precision <tt>pi<\/tt> converted to quadruple.<\/b><\/p><pre class=\"codeinput\">   dble_pi = fp128(pi)\r\n<\/pre><pre class=\"codeoutput\"> \r\ndble_pi = \r\n   0 4000  921fb54  442d180  0000000  0000000\r\n   3.14159265358979311599796346854418516\r\n<\/pre><p><b><tt>pi<\/tt> accurate to quadruple precision.<\/b><\/p><pre class=\"codeinput\">   quad_pi = fp128(sym(<span class=\"string\">'pi'<\/span>))\r\n<\/pre><pre class=\"codeoutput\"> \r\nquad_pi = \r\n   0 4000  921fb54  442d184  69898cc  51701b8\r\n   3.1415926535897932384626433832795028\r\n<\/pre><h4>Matrix operations<a name=\"d23480fa-8ddc-42d4-9f62-6fe3f640c79d\"><\/a><\/h4><p>The 4-by-4 magic square from Durer's Melencolia I provides my first matrix example.<\/p><pre class=\"codeinput\">   clear\r\n   M = fp128(magic(4));\r\n<\/pre><p>Let's see how the 128-bit elements look in hex.<\/p><pre class=\"codeinput\">   format <span class=\"string\">hex<\/span>\r\n   M\r\n<\/pre><pre class=\"codeoutput\"> \r\nM = \r\n   0 4003  0000000  0000000  0000000  0000000\r\n   0 4001  4000000  0000000  0000000  0000000\r\n   0 4002  2000000  0000000  0000000  0000000\r\n   0 4001  0000000  0000000  0000000  0000000\r\n   0 4000  0000000  0000000  0000000  0000000\r\n   0 4002  6000000  0000000  0000000  0000000\r\n   0 4001  c000000  0000000  0000000  0000000\r\n   0 4002  c000000  0000000  0000000  0000000\r\n   0 4000  8000000  0000000  0000000  0000000\r\n   0 4002  4000000  0000000  0000000  0000000\r\n   0 4001  8000000  0000000  0000000  0000000\r\n   0 4002  e000000  0000000  0000000  0000000\r\n   0 4002  a000000  0000000  0000000  0000000\r\n   0 4002  0000000  0000000  0000000  0000000\r\n   0 4002  8000000  0000000  0000000  0000000\r\n   0 3fff  0000000  0000000  0000000  0000000\r\n<\/pre><p>Check that the row sums are all equal.  
This matrix-vector multiply can be done exactly with the flints in the magic square.<\/p><pre class=\"codeinput\">   e = fp128(ones(4,1))\r\n   Me = M*e\r\n<\/pre><pre class=\"codeoutput\"> \r\ne = \r\n   0 3fff  0000000  0000000  0000000  0000000\r\n   0 3fff  0000000  0000000  0000000  0000000\r\n   0 3fff  0000000  0000000  0000000  0000000\r\n   0 3fff  0000000  0000000  0000000  0000000\r\n \r\nMe = \r\n   0 4004  1000000  0000000  0000000  0000000\r\n   0 4004  1000000  0000000  0000000  0000000\r\n   0 4004  1000000  0000000  0000000  0000000\r\n   0 4004  1000000  0000000  0000000  0000000\r\n<\/pre><h4>Quadruple precision backslash<a name=\"26310ffb-dd99-4b37-8763-dc1151c64cad\"><\/a><\/h4><p>I've overloaded <tt>mldivide<\/tt>, so I can solve linear systems and compute inverses.  The actual computation is done by <tt>lutx<\/tt>, a \"textbook\" function that I wrote years ago, long before this quadruple-precision project, followed by the requisite solution of triangular systems.  But now the MATLAB object system ensures that every individual arithmetic operation is done with IEEE 754 quadruple precision.<\/p><p>Let's generate a 3-by-3 matrix with random two-digit integer entries.<\/p><pre class=\"codeinput\">   A = fp128(randi(100,3,3))\r\n<\/pre><pre class=\"codeoutput\"> \r\nA = \r\n   0 4002  0000000  0000000  0000000  0000000\r\n   0 4001  8000000  0000000  0000000  0000000\r\n   0 4004  b000000  0000000  0000000  0000000\r\n   0 4005  3800000  0000000  0000000  0000000\r\n   0 4005  7800000  0000000  0000000  0000000\r\n   0 4002  a000000  0000000  0000000  0000000\r\n   0 4004  c800000  0000000  0000000  0000000\r\n   0 4004  7800000  0000000  0000000  0000000\r\n   0 4000  0000000  0000000  0000000  0000000\r\n<\/pre><p>I am going to use <tt>fp128<\/tt> backslash to invert <tt>A<\/tt>. 
So I need the identity matrix in quadruple precision.<\/p><pre class=\"codeinput\">   I = fp128(eye(size(A)));\r\n<\/pre><p>Now the overloaded backslash calls <tt>lutx<\/tt>, and then solves two triangular systems to produce the inverse.<\/p><pre class=\"codeinput\">   X = A\\I\r\n<\/pre><pre class=\"codeoutput\"> \r\nX = \r\n   0 3ff7  2fd38ea  bcfb815  69cdccc  a36d8a5\r\n   1 3ff9  c595b53  8c842ee  f26189c  a0770d4\r\n   0 3ffa  c0bc8b7  4adcc40  4ea66ca  61f1380\r\n   1 3ff7  a42f790  e4ad874  c358882  7ff988e\r\n   0 3ffa  12ea8c2  3ef8c17  01c7616  5e03a5a\r\n   1 3ffa  70d4565  958740b  78452d8  f32d866\r\n   0 3ff9  2fd38ea  bcfb815  69cdccc  a36d8a7\r\n   0 3ff3  86bc8e5  42ed82a  103d526  a56452f\r\n   1 3ff6  97f9949  ba961b3  72d69d9  4ace666\r\n<\/pre><p>Compute the residual.<\/p><pre class=\"codeinput\">   AX = A*X\r\n   R = I - AX;\r\n   format <span class=\"string\">short<\/span>\r\n   RD = double(R)\r\n<\/pre><pre class=\"codeoutput\"> \r\nAX = \r\n   0 3fff  0000000  0000000  0000000  0000000\r\n   0 3f90  0000000  0000000  0000000  0000000\r\n   1 3f8d  0000000  0000000  0000000  0000000\r\n   0 0000  0000000  0000000  0000000  0000000\r\n   0 3fff  0000000  0000000  0000000  0000000\r\n   0 3f8d  8000000  0000000  0000000  0000000\r\n   1 3f8c  0000000  0000000  0000000  0000000\r\n   1 3f8d  0000000  0000000  0000000  0000000\r\n   0 3ffe  fffffff  fffffff  fffffff  ffffffb\r\nRD =\r\n   1.0e-33 *\r\n         0         0    0.0241\r\n   -0.3852         0    0.0481\r\n    0.0481   -0.0722    0.4815\r\n<\/pre><p>Both <tt>AX<\/tt> and <tt>R<\/tt> are what I expect from arithmetic that is accurate to about 34 decimal digits.<\/p><p>Although I get a different random <tt>A<\/tt> every time I publish this blog post, I expect that it has a modest condition number.<\/p><pre class=\"codeinput\">   kappa = cond(A)\r\n<\/pre><pre class=\"codeoutput\"> \r\nkappa = \r\n   0 4002  7e97c18  91278cd  8375371  7915346\r\n   
11.9560249020758193065358323606886569\r\n<\/pre><p>Since <tt>A<\/tt> is not badly conditioned, I can invert the computed inverse and expect to get close to the original integer matrix. The elements of the resulting <tt>Z<\/tt> are integers, possibly bruised with quadruple precision fuzz.<\/p><pre class=\"codeinput\">   format <span class=\"string\">hex<\/span>\r\n   Z = X\\I\r\n<\/pre><pre class=\"codeoutput\"> \r\nZ = \r\n   0 4002  0000000  0000000  0000000  0000000\r\n   0 4001  8000000  0000000  0000000  0000000\r\n   0 4004  b000000  0000000  0000000  0000004\r\n   0 4005  37fffff  fffffff  fffffff  ffffffc\r\n   0 4005  77fffff  fffffff  fffffff  ffffffe\r\n   0 4002  a000000  0000000  0000000  0000001\r\n   0 4004  c7fffff  fffffff  fffffff  ffffffc\r\n   0 4004  77fffff  fffffff  fffffff  ffffffc\r\n   0 3fff  fffffff  fffffff  fffffff  ffffffc\r\n<\/pre><h4>Quadruple precision SVD<a name=\"7400bdf6-20e8-403b-b6b9-6f1e89ca240a\"><\/a><\/h4><p>I have just nonchalantly computed <tt>cond(A)<\/tt>.  Here is the code for the overloaded <tt>cond<\/tt>.<\/p><pre class=\"codeinput\">   type <span class=\"string\">@fp128\/cond.m<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\nfunction kappa = cond(A)\r\n    sigma = svd(A);\r\n    kappa = sigma(1)\/sigma(end);\r\nend\r\n    \r\n<\/pre><p>So it is correctly using the singular value decomposition.  I also have <tt>svd<\/tt> overloaded.  The SVD computation is handled by a 433 line M-file, <tt>svdtx<\/tt>, that, like <tt>lutx<\/tt>, was written before <tt>fp128<\/tt> existed. It was necessary to modify five lines in <tt>svdtx<\/tt>.  The line<\/p><pre class=\"language-matlab\">u = zeros(n,ncu);\r\n<\/pre><p>had to be changed to<\/p><pre class=\"language-matlab\">u = fp128(zeros(n,ncu));\r\n<\/pre><p>Similarly for <tt>v<\/tt>, <tt>s<\/tt>, <tt>e<\/tt> and <tt>work<\/tt>.  I should point out that the preallocation of the arrays is inherited from the LINPACK Fortran subroutine DSVDC.  
Without it, <tt>svdtx<\/tt> would not have required any modification to work correctly in quadruple precision.<\/p><p>Let's compute the full SVD.<\/p><pre class=\"codeinput\">   [U,S,V] = svd(A)\r\n<\/pre><pre class=\"codeoutput\"> \r\nU = \r\n   1 3ffe  57d9492  76f3ea4  dc14bb3  15d42c1\r\n   1 3ffe  75a77c4  8c7b469  2cac695  59be7fe\r\n   1 3ffc  0621737  9b04c78  1c2109d  8736b46\r\n   1 3ffb  38214c0  d75c84c  4bcf5ff  f3cffd7\r\n   1 3ffb  a9281e3  e12dd3a  d632d61  c8f6e60\r\n   0 3ffe  fbbccdc  a571fa1  f5a588b  fb0d806\r\n   1 3ffe  79587db  4889548  f09ae4b  cd0150c\r\n   0 3ffe  59fae16  17bcabb  6408ba4  7b2a573\r\n   0 3ff8  cde38fc  e952ad5  8b526c2  780c2e5\r\n \r\nS = \r\n   0 4006  1f3ad79  d0b9b08  18b1444  030e4ef\r\n   0 0000  0000000  0000000  0000000  0000000\r\n   0 0000  0000000  0000000  0000000  0000000\r\n   0 0000  0000000  0000000  0000000  0000000\r\n   0 4004  a720ef6  28c6ec0  87f4c54  82dda2a\r\n   0 0000  0000000  0000000  0000000  0000000\r\n   0 0000  0000000  0000000  0000000  0000000\r\n   0 0000  0000000  0000000  0000000  0000000\r\n   0 4002  8061b9a  0e96d8c  c2ef745  9ea4c9a\r\n \r\nV = \r\n   1 3ffb  db3df03  9b5e1b3  5bf4478  0e42b0d\r\n   1 3ffe  b540007  4d4bc9e  dc9461a  0de0481\r\n   1 3ffe  03aaff4  d9cea2c  e8ee2bc  2eba908\r\n   0 3ffe  fa73e09  9ef8810  a03d2eb  46ade00\r\n   1 3ffa  b316e2f  fe9d3ae  dfa9988  fbca927\r\n   1 3ffc  184af51  f25fece  97bc0da  5ff13a2\r\n   1 3ffb  706955f  a877cbb  b63f6dd  4e2150e\r\n   0 3ffe  08fc1eb  7b86ef7  4af3c6c  732aae9\r\n   1 3ffe  b3aaead  ef356e2  7cd2937  94b22a7\r\n<\/pre><p>Reconstruct <tt>A<\/tt> from its quadruple precision SVD.  
It's not too shabby.<\/p><pre class=\"codeinput\">   USVT = U*S*V'\r\n<\/pre><pre class=\"codeoutput\"> \r\nUSVT = \r\n   0 4001  fffffff  fffffff  fffffff  fffffce\r\n   0 4001  7ffffff  fffffff  fffffff  fffffc7\r\n   0 4004  b000000  0000000  0000000  000000a\r\n   0 4005  37fffff  fffffff  fffffff  ffffff1\r\n   0 4005  77fffff  fffffff  fffffff  ffffff6\r\n   0 4002  9ffffff  fffffff  fffffff  fffffd2\r\n   0 4004  c7fffff  fffffff  fffffff  ffffff1\r\n   0 4004  77fffff  fffffff  fffffff  ffffff4\r\n   0 3fff  fffffff  fffffff  fffffff  ffffe83\r\n<\/pre><h4>Rosser matrix<a name=\"88791357-8813-4ab2-b887-80e7f4965d42\"><\/a><\/h4><p>An interesting example is provided by a classic test matrix, the 8-by-8 <a href=\"https:\/\/blogs.mathworks.com\/cleve\/2014\/01\/06\/the-rosser-matrix\/\">Rosser matrix<\/a>.  Let's compare quadruple precision computation with the exact rational computation provided by the Symbolic Math Toolbox.<\/p><p>First, generate quad and sym versions of <tt>rosser<\/tt>.<\/p><pre class=\"codeinput\">   R = fp128(rosser);\r\n   S = sym(rosser)\r\n<\/pre><pre class=\"codeoutput\">S =\r\n[  611,  196, -192,  407,   -8,  -52,  -49,   29]\r\n[  196,  899,  113, -192,  -71,  -43,   -8,  -44]\r\n[ -192,  113,  899,  196,   61,   49,    8,   52]\r\n[  407, -192,  196,  611,    8,   44,   59,  -23]\r\n[   -8,  -71,   61,    8,  411, -599,  208,  208]\r\n[  -52,  -43,   49,   44, -599,  411,  208,  208]\r\n[  -49,   -8,    8,   59,  208,  208,   99, -911]\r\n[   29,  -44,   52,  -23,  208,  208, -911,   99]\r\n<\/pre><p><tt>R<\/tt> is symmetric, but not positive definite, so its LU factorization requires pivoting.<\/p><pre class=\"codeinput\">   [L,U,p] = lutx(R);\r\n   format <span class=\"string\">short<\/span>\r\n   p\r\n<\/pre><pre class=\"codeoutput\">p =\r\n     1\r\n     2\r\n     3\r\n     7\r\n     6\r\n     8\r\n     4\r\n     5\r\n<\/pre><p><tt>R<\/tt> is singular, so with exact computation <tt>U(n,n)<\/tt> would be zero. 
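That claim is easy to confirm with exact rational elimination in Python's `Fraction` arithmetic. A sketch with no pivoting, the entries copied from `S`:

```python
from fractions import Fraction

# The 8-by-8 Rosser matrix, entries as displayed in S.
R = [[ 611,  196, -192,  407,   -8,  -52,  -49,   29],
     [ 196,  899,  113, -192,  -71,  -43,   -8,  -44],
     [-192,  113,  899,  196,   61,   49,    8,   52],
     [ 407, -192,  196,  611,    8,   44,   59,  -23],
     [  -8,  -71,   61,    8,  411, -599,  208,  208],
     [ -52,  -43,   49,   44, -599,  411,  208,  208],
     [ -49,   -8,    8,   59,  208,  208,   99, -911],
     [  29,  -44,   52,  -23,  208,  208, -911,   99]]

U = [[Fraction(a) for a in row] for row in R]
for k in range(7):                       # Gaussian elimination, no pivoting
    for i in range(k + 1, 8):
        m = U[i][k] / U[k][k]
        for j in range(k, 8):
            U[i][j] -= m * U[k][j]

pivots = [U[k][k] for k in range(8)]
assert pivots[1] == Fraction(510873, 611)  # second pivot, an exact rational
assert pivots[7] == 0                      # U(8,8) is exactly zero: R is singular
```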
With quadruple precision, the diagonal of <tt>U<\/tt> is<\/p><pre class=\"codeinput\">   format <span class=\"string\">long<\/span> <span class=\"string\">e<\/span>\r\n   diag(U)\r\n<\/pre><pre class=\"codeoutput\"> \r\nans = \r\n   611.0\r\n   836.126022913256955810147299509001582\r\n   802.209942588471107300640276546225738\r\n   99.0115741407236314604636000423592687\r\n   -710.481057851148425133280246646085002\r\n   579.272484693223512196223933017062413\r\n   -1.2455924519190846395771824210276321\r\n   0.000000000000000000000000000000215716190833522835766351129431653015\r\n<\/pre><p>The relative size of the last diagonal element is zero to almost 34 digits.<\/p><pre class=\"codeinput\">   double(U(8,8)\/U(1,1))\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n     3.530543221497919e-34\r\n<\/pre><p>Compare this with symbolic computation, which, in this case, can compute an LU decomposition with exact rational arithmetic and no pivoting.<\/p><pre class=\"codeinput\">   [L,U] = lu(S);\r\n   diag(U)\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n                 611\r\n          510873\/611\r\n    409827400\/510873\r\n    50479800\/2049137\r\n       3120997\/10302\r\n -1702299620\/3120997\r\n        255000\/40901\r\n                   0\r\n<\/pre><p>As expected, with symbolic computation <tt>U(8,8)<\/tt> is exactly zero.<\/p><p>How about SVD?<\/p><pre class=\"codeinput\">   r = svd(R)\r\n<\/pre><pre class=\"codeoutput\"> \r\nr = \r\n   1020.04901842999682384631379130551006\r\n   1020.04901842999682384631379130550858\r\n   1019.99999999999999999999999999999941\r\n   1019.90195135927848300282241090227735\r\n   999.999999999999999999999999999999014\r\n   999.999999999999999999999999999998817\r\n   0.0980486407215169971775890977220345302\r\n   0.0000000000000000000000000000000832757192990287779822645036097560521\r\n<\/pre><p>The Rosser matrix is atypical because its characteristic polynomial factors over the rationals.  
So, even though it is of degree 8, the singular values are the roots of quadratic factors.<\/p><pre class=\"codeinput\">   s = svd(S)\r\n<\/pre><pre class=\"codeoutput\">s =\r\n                  10*10405^(1\/2)\r\n                  10*10405^(1\/2)\r\n                            1020\r\n 10*(1020*26^(1\/2) + 5201)^(1\/2)\r\n                            1000\r\n                            1000\r\n 10*(5201 - 1020*26^(1\/2))^(1\/2)\r\n                               0\r\n<\/pre><p>The relative error of the quadruple precision calculation.<\/p><pre class=\"codeinput\">   double(norm(r - s)\/norm(s))\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n     9.293610246879066e-34\r\n<\/pre><p>About 33 digits.<\/p><h4>Postscript<a name=\"d748cfa7-86db-4ba5-999d-2232962a9f19\"><\/a><\/h4><p>Finally, verify that we've been working all this time with <tt>fp128<\/tt> and <tt>sym<\/tt> objects.<\/p><pre class=\"codeinput\">   whos\r\n<\/pre><pre class=\"codeoutput\">  Name       Size            Bytes  Class     Attributes\r\n\r\n  A          3x3              3531  fp128               \r\n  AX         3x3              3531  fp128               \r\n  I          3x3              3531  fp128               \r\n  L          8x8                 8  sym                 \r\n  M          4x4              6128  fp128               \r\n  Me         4x1              1676  fp128               \r\n  R          8x8             23936  fp128               \r\n  RD         3x3                72  double              \r\n  S          8x8                 8  sym                 \r\n  U          8x8                 8  sym                 \r\n  USVT       3x3              3531  fp128               \r\n  V          3x3              3531  fp128               \r\n  X          3x3              3531  fp128               \r\n  Z          3x3              3531  fp128               \r\n  ans        1x1                 8  double              \r\n  e          4x1              1676  fp128               \r\n  kappa      1x1 
              563  fp128               \r\n  p          8x1                64  double              \r\n  r          8x1              3160  fp128               \r\n  s          8x1                 8  sym                 \r\n\r\n<\/pre><script language=\"JavaScript\"> <!-- \r\n    function grabCode_d47b6ba7b4b7413ea76f76763559fea7() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='d47b6ba7b4b7413ea76f76763559fea7 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' d47b6ba7b4b7413ea76f76763559fea7';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2017 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_d47b6ba7b4b7413ea76f76763559fea7()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2017a<br><\/p><\/div><!--\r\nd47b6ba7b4b7413ea76f76763559fea7 ##### SOURCE BEGIN #####\r\n%% Quadruple Precision, 128-bit Floating Point Arithmetic\r\n% The floating point arithmetic format that occupies 128 bits of\r\n% storage is known as _binary128_ or _quadruple precision_.\r\n% This blog post describes an implementation of quadruple\r\n% precision programmed entirely in the MATLAB language.\r\n%\r\n% <<screen_shot.png>>\r\n\r\n%% Background\r\n% The IEEE 754 standard, published in 1985, defines formats for \r\n% floating point numbers that occupy 32 or 64 bits of storage.\r\n% These formats are known as _binary32_ and _binary64_, or more\r\n% frequently as _single_ and _double precision_.  For many years\r\n% MATLAB used only double precision and it remains our default\r\n% format.  
Single precision has been added gradually\r\n% over the last several years and is now also fully supported.\r\n\r\n%%\r\n% A revision of IEEE 754, published in 2008, defines two more floating\r\n% point formats.  One, _binary16_ or _half precision_, occupies only\r\n% 16 bits and was the subject of my\r\n% <https:\/\/blogs.mathworks.com\/cleve\/2017\/05\/08\/half-precision-16-bit-floating-point-arithmetic\/\r\n% previous blog post>.  It is primarily intended to reduce storage and\r\n% memory bandwidth requirements.  Since it provides only \"half\" precision,\r\n% its use for actual computation is problematic.\r\n\r\n%%\r\n% The other new format introduced in IEEE 754-2008 is _binary128_ or\r\n% _quadruple precision_.  It is intended for situations where the accuracy\r\n% or range of double precision is inadequate.\r\n\r\n%%\r\n% I see two descriptions of quadruple precision\r\n% software implementations on the Web.\r\n% \r\n% * <http:\/\/www.advanpix.com\/2013\/01\/20\/fast-quadruple-precision-computations\/\r\n% ADVANPIX Multiprecision Toolbox for MATLAB>\r\n%\r\n% * <http:\/\/gcc.gnu.org\/onlinedocs\/libquadmath.pdf\r\n% GCC Quad-Precision Math Library>\r\n\r\n%%\r\n% I have not used either package, but judging by their Web pages, they\r\n% both appear to be complete and well supported.\r\n\r\n%%\r\n% The MATLAB Symbolic Math Toolbox provides |vpa|, arbitrary precision\r\n% decimal floating point arithmetic, and |sym|, exact rational arithmetic.\r\n% Both provide accuracy and range well beyond quadruple precision, but do\r\n% not specifically support the 128-bit IEEE format.\r\n\r\n%%\r\n% My goal here is to describe a prototype of a MATLAB object, |fp128|,\r\n% that implements quadruple precision with code written entirely in the\r\n% MATLAB language.  
It is not very efficient, but it does allow \r\n% experimentation with the 128-bit format.\r\n\r\n%% Beyond double\r\n% There are other floating point formats beyond double precision.\r\n% _Long double_ usually refers to the 80-bit extended precision\r\n% floating point registers available with the Intel x86 architecture\r\n% and described as _double extended_ in IEEE 754.  This provides\r\n% the same exponent range as quadruple precision, but much less\r\n% accuracy.\r\n\r\n%%\r\n% _Double double_ refers to the use of a pair of double precision\r\n% values.  The exponent field and sign bit of the second double are\r\n% ignored, so this is effectively a 116-bit format.  Both the exponent\r\n% range and the precision are more than double but less than quadruple.\r\n\r\n%% Floating point anatomy\r\n% The format of a floating point number is characterized by two\r\n% parameters, |p|, the number of bits in the fraction and |q|, the number\r\n% of bits in the exponent.  I will compare four precisions,\r\n% _half_, _single_, _double_, and _quadruple_.\r\n% The four pairs of characterizing parameters are\r\n\r\n   p = [10, 23, 52, 112];\r\n   \r\n%%\r\n\r\n   q = [5, 8, 11, 15];\r\n\r\n\r\n   \r\n%%\r\n% With these values of |p| and |q|, and with one more bit for the sign, \r\n% the total number of bits in the word, |w|, is a power of two.\r\n\r\n   format shortg\r\n   w = p + q + 1\r\n   \r\n%%\r\n% *Normalized numbers*\r\n%\r\n% Most floating point numbers are _normalized_, and are expressed as\r\n% \r\n% $$ x = \\pm (1+f)2^e $$\r\n\r\n%%\r\n% The _fraction_ $f$ is in the half open interval\r\n%\r\n% $$ 0 \\leq f < 1 $$\r\n%\r\n\r\n%%\r\n% The binary representation of $f$ requires at most |p| bits.\r\n% In other words $2^p f$ is an integer in the range\r\n%\r\n% $$ 0 \\leq 2^p f < 2^p $$\r\n\r\n%%\r\n%\r\n% The _exponent_ $e$ is an integer in the range\r\n%\r\n% $$ -b+1 \\leq e \\leq b $$\r\n%\r\n\r\n%%\r\n% The quantity $b$ is both the largest exponent and the 
|bias|.\r\n%\r\n% $$ b = 2^{q-1} - 1 $$\r\n\r\n   b = 2.^(q-1)-1\r\n   \r\n%%\r\n% The fractional part of a normalized number is $1+f$, but only $f$\r\n% needs to be stored.  That leading $1$ is known as the _hidden bit_.\r\n  \r\n%%\r\n% *Subnormal*\r\n%\r\n% There are two values of the exponent $e$ for which the biased exponent,\r\n% $e+b$, reaches the smallest and largest values possible to represent\r\n% in |q| bits.  The smallest is\r\n%\r\n% $$ e + b = 0 $$\r\n%\r\n% The corresponding floating point numbers do not have a hidden leading\r\n% bit.  These are the _subnormal_ or _denormal_ numbers.\r\n%\r\n% $$ x = \\pm f 2^{-b+1} $$\r\n%\r\n\r\n%%\r\n% *Infinity and Not-A-Number*\r\n%\r\n% The largest possible biased exponent is\r\n% \r\n% $$ e + b = 2^q-1 $$.\r\n%\r\n% Quantities with this exponent field represent _infinities_ and\r\n% _NaN_, or _Not-A-Number_.\r\n\r\n%%\r\n% The percentage of floating point numbers that are exceptional\r\n% because they are subnormal, infinity or NaN increases as the\r\n% precision decreases.  Exceptional exponents are only $2$ values\r\n% out of $2^q$.  For quadruple precision this is $2\/2^{15}$, which is\r\n% less than one one-hundredth of one percent.\r\n%% \r\n% Encode the sign bit with |s = 0| for nonnegative and |s = 1| for\r\n% negative.  And encode the exponent with an offsetting bias, |b|.\r\n% Then a floating point number can be packed in |w| bits with\r\n%\r\n%    x = [s e+b 2^p*f]\r\n    \r\n%% Precision and range\r\n% *epsilon*\r\n%\r\n% If a real number cannot be expressed with a binary expansion requiring\r\n% at most |p| bits, it must be approximated by a floating point number\r\n% that does have such a binary representation.  This is _roundoff error_.\r\n% The important quantity characterizing precision is _machine epsilon_,\r\n% or |eps|.  
In MATLAB, |eps(x)| is the distance from |x| to the\r\n% next larger (in absolute value) floating point number (of that class).\r\n% With no argument, |eps| is simply the difference between |1| and the\r\n% next larger floating point number.\r\n\r\n    format shortg\r\n    eps = 2.^(-p)\r\n    \r\n%%\r\n% This tells us that quadruple precision is good for about 34 decimal\r\n% digits of accuracy, double for about 16 decimal digits, single for\r\n% about 7, and half for about 3.\r\n            \r\n%%\r\n% *realmax*\r\n%\r\n% If a real number, or the result of an arithmetic operation, is too\r\n% large to be represented, it _overflows_ and is replaced by _infinity_.\r\n% The largest floating point number that does not overflow is |realmax|.\r\n% When I try to compute quadruple |realmax| with double precision,\r\n% it overflows.  I will fix this up in the table to follow.\r\n\r\n    realmax = 2.^b.*(2-eps)\r\n    \r\n%% \r\n% *realmin*\r\n%\r\n% _Underflow_ and representation of very small numbers is more complicated.\r\n% The smallest normalized floating point number is |realmin|.  When I\r\n% try to compute quadruple |realmin| it underflows to zero.  Again,\r\n% I will fix this up in the table.\r\n\r\n    realmin = 2.^(-b+1)\r\n    \r\n%% \r\n% *tiny*\r\n%\r\n% But there are numbers smaller than |realmin|.  IEEE 754 introduced the\r\n% notion of _gradual underflow_ and _denormal_ numbers.  In the 2008\r\n% revised standard their name was changed to _subnormal_.\r\n\r\n%%\r\n% Think of roundoff in numbers near underflow.  
Before 754, floating\r\n% point numbers had the disconcerting property that |x| and |y| could\r\n% be unequal, but their difference could underflow, so |x-y| becomes |0|.\r\n% With 754 the gap between |0| and |realmin| is filled with numbers whose\r\n% spacing is the same as the spacing between |realmin| and |2*realmin|.\r\n% I like to call this spacing, and the smallest subnormal number, |tiny|.\r\n\r\n    tiny = realmin.*eps\r\n    \r\n%% Floating point integers\r\n% *flintmax*\r\n% \r\n% It is possible to do integer arithmetic with floating point numbers.\r\n% I like to call such numbers _flints_.  When we write the numbers $3$\r\n% and $3.0$, they are different descriptions of the same integer, but\r\n% we think of one as fixed point and the other as floating point.\r\n% The largest flint is |flintmax|.\r\n\r\n    flintmax = 2.\/eps\r\n    \r\n%%\r\n% Technically all the floating point numbers larger than |flintmax|\r\n% are integers, but the spacing between them is larger than one, so it\r\n% is not safe to use them for integer arithmetic. Only integer-valued\r\n% floating point numbers between |0| and |flintmax| are allowed to be\r\n% called flints.\r\n\r\n    \r\n%% Table\r\n% Let's collect all these anatomical characteristics together in a\r\n% new MATLAB |table|.  I have now edited the output and inserted\r\n% the correct quadruple precision values.\r\n\r\n   T = [w; p; q; b; eps; realmax; realmin; tiny; flintmax];\r\n\r\n   T = table(T(:,1), T(:,2), T(:,3), T(:,4), ...\r\n      'variablenames',{'half','single','double','quadruple'}, ...\r\n      'rownames',{'w','p','q','b','eps','realmax','realmin', ...\r\n                  'tiny','flintmax'});\r\n              \r\n   type Table.txt\r\n\r\n   \r\n%% fp128\r\n% I am currently working on code for an object, |@fp128|,\r\n% that could provide a full implementation of quadruple-precision\r\n% arithmetic.  
The methods available so far are\r\n\r\n   methods(fp128)\r\n\r\n%%\r\n% The code that I have for quadruple precision is much more complex than\r\n% the code that I have for \r\n% <https:\/\/blogs.mathworks.com\/cleve\/2017\/05\/08\/half-precision-16-bit-floating-point-arithmetic\/\r\n% half precision>.  There I am able to \"cheat\" by converting half\r\n% precision numbers to doubles and relying on traditional MATLAB\r\n% arithmetic.  I can't do that for quads.\r\n\r\n%%\r\n% The storage scheme for quads is described in the help entry for\r\n% the constructor.\r\n\r\n   help @fp128\/fp128\r\n\r\n%%\r\n% Breaking the 112-bit fraction into four 28-bit pieces makes it possible\r\n% to do arithmetic operations on the pieces without worrying about integer\r\n% overflow.  The core of the |times| code, which implements |x.*y|,\r\n% is the convolution of the two fractional parts.\r\n\r\n   dbtype 45:53 @fp128\/times\r\n   \r\n%%\r\n% The result of the convolution, |zf|, is a |uint64| vector of length\r\n% nine with 52-bit elements.  It must be renormalized to fit the\r\n% |fp128| storage scheme.\r\n\r\n%%\r\n% Addition and subtraction involve addition and subtraction of the\r\n% fractional parts after they have been shifted so that the\r\n% corresponding exponents are equal.  Again, this produces temporary\r\n% vectors that must be renormalized.\r\n\r\n%%\r\n% Scalar division, |y\/x|, is done by first computing the reciprocal\r\n% of the denominator, |1\/x|, and then doing one final multiplication,\r\n% |1\/x * y|.  
The reciprocal is computed by a few steps of Newton\r\n% iteration, starting with a scaled reciprocal, |1\/double(x)|.\r\n   \r\n%% Examples\r\n% The output for each example shows the three fields in hexadecimal REPLACE_WITH_DASH_DASH\r\n% one sign field, one biased exponent field, and one fraction field\r\n% that is a vector with four entries displayed with seven hex digits.\r\n% This is followed by a 36 significant digit decimal representation.\r\n\r\n%%\r\n% *One*\r\n\r\n   clear\r\n   format long\r\n   one = fp128(1)\r\n   \r\n%%\r\n% *eps*\r\n\r\n   eps = eps(one)\r\n   \r\n%%\r\n% *1 + eps*\r\n\r\n   one_plus_eps = one + eps\r\n   \r\n%%\r\n% *2 - eps*\r\n\r\n   two_minus_eps = 2 - eps\r\n\r\n%%\r\n% *realmin*\r\n\r\n   rmin = realmin(one)\r\n\r\n%%\r\n% *realmax*\r\n\r\n   rmax = realmax(one)\r\n\r\n%%\r\n% *Compute 1\/10 with double, then convert to quadruple.*\r\n\r\n   dble_tenth = fp128(1\/10)\r\n\r\n%%\r\n% *Compute 1\/10 with quadruple.*\r\n\r\n   quad_tenth = 1\/fp128(10)\r\n\r\n%%\r\n% *Double precision |pi| converted to quadruple.*\r\n\r\n   dble_pi = fp128(pi)\r\n   \r\n%%\r\n% *|pi| accurate to quadruple precision.*\r\n \r\n   quad_pi = fp128(sym('pi'))\r\n\r\n\r\n%% Matrix operations\r\n% The 4-by-4 magic square from Durer's Melencolia I provides my first\r\n% matrix example.\r\n\r\n   clear\r\n   M = fp128(magic(4));\r\n   \r\n%%\r\n% Let's see how the 128-bit elements look in hex.\r\n\r\n   format hex\r\n   M\r\n \r\n%%\r\n% Check that the row sums are all equal.  This matrix-vector multiply\r\n% can be done exactly with the flints in the magic square.\r\n\r\n   e = fp128(ones(4,1))\r\n   Me = M*e\r\n   \r\n%% Quadruple precision backslash\r\n% I've overloaded |mldivide|, so I can solve linear systems and\r\n% compute inverses.  The actual computation is done by |lutx|,\r\n% a \"textbook\" function that I wrote years ago, long before this\r\n% quadruple-precision project, followed by the requisite solution\r\n% of triangular systems.  
But now the MATLAB object system insures\r\n% that every individual arithmetic operation is done with IEEE 754\r\n% quadruple precision.\r\n\r\n%%\r\n% Let's generate a 3-by-3 matrix with random two-digit integer entries.\r\n\r\n   A = fp128(randi(100,3,3))\r\n   \r\n%%\r\n% I am going to use |fp128| backslash to invert |A|.\r\n% So I need the identity matrix in quadruple precision.\r\n\r\n   I = fp128(eye(size(A)));\r\n \r\n%%\r\n% Now the overloaded backslash calls |lutx|, and then solves two\r\n% triangular systems to produce the inverse.\r\n   \r\n   X = A\\I   \r\n   \r\n%%\r\n% Compute the residual.\r\n\r\n   AX = A*X\r\n   R = I - AX;\r\n   format short\r\n   RD = double(R)\r\n   \r\n%%\r\n% Both |AX| and |R| are what I expect from arithmetic that is\r\n% accurate to about 34 decimal digits.\r\n\r\n%%\r\n% Although I get a different random |A| every time I publish this blog\r\n% post, I expect that it has a modest condition number.\r\n\r\n   kappa = cond(A)\r\n   \r\n%%\r\n% Since |A| is not badly conditioned, I can invert the computed inverse\r\n% and expect to get close to the original integer matrix.\r\n% The elements of the resulting |Z| are integers, possibly bruised with\r\n% quadruple precision fuzz.\r\n\r\n   format hex\r\n   Z = X\\I\r\n      \r\n%% Quadruple precision SVD\r\n% I have just nonchalantly computed |cond(A)|.  Here is the code for\r\n% the overloaded |cond|.\r\n\r\n   type @fp128\/cond.m\r\n   \r\n%%\r\n% So it is correctly using the singular value decomposition.  I also\r\n% have |svd| overloaded.  The SVD computation is handled by a 433 line\r\n% M-file, |svdtx|, that, like |lutx|, was written before |fp128| existed.\r\n% It was necessary to modify five lines in |svdtx|.  The line\r\n%\r\n%   u = zeros(n,ncu);\r\n%\r\n% had to be changed to\r\n%\r\n%   u = fp128(zeros(n,ncu));\r\n%\r\n% Similarly for |v|, |s|, |e| and |work|.  
I should point out that the\r\n% preallocation of the arrays is inherited from the LINPACK Fortran\r\n% subroutine DSVDC.  Without it, |svdtx| would not have required any\r\n% modification to work correctly in quadruple precision.\r\n\r\n\r\n%%\r\n% Let's compute the full SVD.\r\n\r\n   [U,S,V] = svd(A)\r\n   \r\n%%\r\n% Reconstruct |A| from its quadruple precision SVD.  It's not too shabby.\r\n\r\n   USVT = U*S*V'\r\n     \r\n%% Rosser matrix\r\n% An interesting example is provided by a classic test matrix, the 8-by-8\r\n% <https:\/\/blogs.mathworks.com\/cleve\/2014\/01\/06\/the-rosser-matrix\/\r\n% Rosser matrix>.  Let's compare quadruple precision computation with\r\n% the exact rational computation provided by the Symbolic Math Toolbox.\r\n\r\n%%\r\n% First, generate quad and sym versions of |rosser|.\r\n\r\n   R = fp128(rosser);\r\n   S = sym(rosser)\r\n   \r\n%%\r\n% |R| is symmetric, but not positive definite, so its LU factorization\r\n% requires pivoting.\r\n\r\n   [L,U,p] = lutx(R);\r\n   format short\r\n   p\r\n   \r\n%%\r\n% |R| is singular, so with exact computation |U(n,n)| would be zero.\r\n% With quadruple precision, the diagonal of |U| is\r\n\r\n   format long e\r\n   diag(U)\r\n   \r\n%%\r\n% The relative size of the last diagonal element is zero to almost\r\n% 34 digits.\r\n\r\n   double(U(8,8)\/U(1,1))\r\n   \r\n%%\r\n% Compare this with symbolic computation, which, in this case, can\r\n% compute an LU decomposition with exact rational arithmetic and no\r\n% pivoting.\r\n\r\n   [L,U] = lu(S);\r\n   diag(U)\r\n   \r\n%%\r\n% As expected, with symbolic computation |U(8,8)| is exactly zero.\r\n\r\n%%\r\n% How about SVD?\r\n\r\n   r = svd(R)\r\n   \r\n%%\r\n% The Rosser matrix is atypical because its\r\n% characteristic polynomial factors over the rationals.  
So, even though\r\n% it is of degree 8, the singular values are the roots of\r\n% quadratic factors.\r\n\r\n   s = svd(S)\r\n   \r\n%%\r\n% The relative error of the quadruple precision calculation.\r\n\r\n   double(norm(r - s)\/norm(s))\r\n   \r\n%%\r\n% About 33 digits.\r\n\r\n%% Postscript\r\n% Finally, verify that we've been working all this time\r\n% with |fp128| and |sym| objects.\r\n\r\n   whos\r\n\r\n##### SOURCE END ##### d47b6ba7b4b7413ea76f76763559fea7\r\n-->","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/cleve\/files\/screen_shot.png\" onError=\"this.style.display ='none';\" \/><\/div><!--introduction--><p>The floating point arithmetic format that occupies 128 bits of storage is known as <i>binary128<\/i> or <i>quadruple precision<\/i>. This blog post describes an implementation of quadruple precision programmed entirely in the MATLAB language.... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/cleve\/2017\/05\/22\/quadruple-precision-128-bit-floating-point-arithmetic\/\">read more 
>><\/a><\/p>","protected":false},"author":78,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[4,9,16,7,30],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/2490"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/users\/78"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/comments?post=2490"}],"version-history":[{"count":1,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/2490\/revisions"}],"predecessor-version":[{"id":2491,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/2490\/revisions\/2491"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/media?parent=2490"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/categories?post=2490"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/tags?post=2490"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}