{"id":1008,"date":"2014-07-07T12:00:29","date_gmt":"2014-07-07T17:00:29","guid":{"rendered":"https:\/\/blogs.mathworks.com\/cleve\/?p=1008"},"modified":"2016-04-11T08:58:59","modified_gmt":"2016-04-11T13:58:59","slug":"floating-point-numbers","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/cleve\/2014\/07\/07\/floating-point-numbers\/","title":{"rendered":"Floating Point Numbers"},"content":{"rendered":"<div class=\"content\"><!--introduction-->This is the first part of a two-part series about the single- and double precision floating point numbers that MATLAB uses for almost all of its arithmetic operations. (This post is adapted from section 1.7 of my book <a href=\"https:\/\/www.mathworks.com\/moler\/chapters.html\">Numerical Computing with MATLAB<\/a>, published by MathWorks and SIAM.)\r\n\r\n<!--\/introduction-->\r\n<h3>Contents<\/h3>\r\n<div>\r\n<ul>\r\n\t<li><a href=\"#3578733f-0605-4967-ac91-17b688e29069\">IEEE 754-1985 Standard<\/a><\/li>\r\n\t<li><a href=\"#770ac8f3-1284-47fb-bd6b-dace193121e8\">Velvel Kahan<\/a><\/li>\r\n\t<li><a href=\"#b67a8ba0-a2e7-4749-9e95-4d2532988a04\">Single and Double Precision<\/a><\/li>\r\n\t<li><a href=\"#92e9183c-bf90-4d7f-a056-130ec0b3ff6f\">Precision versus Range<\/a><\/li>\r\n\t<li><a href=\"#c0f7dead-910c-454a-8373-636ba49fa63d\">Floating Point Format<\/a><\/li>\r\n\t<li><a href=\"#4a6961d5-8770-4ae5-b740-f38865248976\">floatgui<\/a><\/li>\r\n\t<li><a href=\"#2fef5a1d-01fb-4555-8f43-52f68a153d9d\">eps<\/a><\/li>\r\n\t<li><a href=\"#53100ae5-10b5-4011-ba6c-5d3d00c7e702\">One-tenth<\/a><\/li>\r\n\t<li><a href=\"#d20292e2-e6f1-4d46-8e81-f4a5c9209ee7\">Hexadecimal format<\/a><\/li>\r\n\t<li><a href=\"#bd7575a2-a92a-41dd-ba55-121d8e8ca093\">Golden Ratio<\/a><\/li>\r\n\t<li><a href=\"#24cb4f4d-b8a9-4c19-b22b-9d2a9f7f3812\">Computing eps<\/a><\/li>\r\n\t<li><a href=\"#1e7249e4-aba4-48e6-8db9-8806a3c61e94\">Underflow and Overflow<\/a><\/li>\r\n<\/ul>\r\n<\/div>\r\n<h4>IEEE 754-1985 Standard<a 
name=\"3578733f-0605-4967-ac91-17b688e29069\"><\/a><\/h4>\r\nIf you look carefully at the definitions of fundamental arithmetic operations like addition and multiplication, you soon encounter the mathematical abstraction known as real numbers. But actual computation with real numbers is not very practical because it involves notions like limits and infinities. Instead, MATLAB and most other technical computing environments use floating point arithmetic, which involves a finite set of numbers with finite precision. This leads to the phenomena of <i>roundoff<\/i>, <i>underflow<\/i>, and <i>overflow<\/i>. Most of the time, it is possible to use MATLAB effectively without worrying about these details, but, every once in a while, it pays to know something about the properties and limitations of floating point numbers.\r\n\r\nThirty years ago, the situation was far more complicated than it is today. Each computer had its own floating point number system. Some were binary; some were decimal. There was even a Russian computer that used trinary arithmetic. Among the binary computers, some used 2 as the base; others used 8 or 16. And everybody had a different precision. In 1985, the IEEE Standards Board and the American National Standards Institute adopted the ANSI\/IEEE Standard 754-1985 for Binary Floating-Point Arithmetic. This was the culmination of almost a decade of work by a 92-person working group of mathematicians, computer scientists, and engineers from universities, computer manufacturers, and microprocessor companies.\r\n\r\nA revision of IEEE 754-1985, known as <a href=\"http:\/\/en.wikipedia.org\/wiki\/IEEE_754-2008\">IEEE 754-2008<\/a>, was published in 2008. The revision combines the binary standards in 754 and the decimal standards in another document, 854. But as far as I can tell, the new features introduced in the revision haven't had much impact yet.\r\n<h4>Velvel Kahan<a name=\"770ac8f3-1284-47fb-bd6b-dace193121e8\"><\/a><\/h4>\r\nWilliam M. 
Kahan was the primary architect of IEEE floating point arithmetic. I've always known him by his nickname \"Velvel\". I met him in the early 1960's when I was a grad student at Stanford and he was a young faculty member at the University of Toronto. He moved from Toronto to UC Berkeley in the late 1960's and spent the rest of his career at Berkeley. He is now an emeritus professor of mathematics and of electrical engineering and computer science.\r\n\r\nVelvel is an eminent numerical analyst. Among many other accomplishments, he is the developer along with Gene Golub of the first practical algorithm for computing the singular value decomposition.\r\n\r\nIn the late 1970's microprocessor manufacturers including Intel, Motorola, and Texas Instruments were developing the chips that were to become the basis for the personal computer and eventually today's electronics. In a remarkable case of cooperation among competitors, they established a committee to develop a common standard for floating point arithmetic.\r\n\r\nVelvel was a consultant to Intel, working with John Palmer, a Stanford grad who was the principal architect of the 8087 math coprocessor. The 8087 accompanied the 8086, which was to become the CPU of the IBM PC. Kahan and Palmer convinced Intel to allow them to release the specs for their math coprocessor to the IEEE committee. These plans became the basis for the standard.\r\n\r\nVelvel received the ACM Turing Award in 1989 for \"his fundamental contributions to numerical analysis\". He was named an ACM Fellow in 1994, and was inducted into the National Academy of Engineering in 2005.\r\n<h4>Single and Double Precision<a name=\"b67a8ba0-a2e7-4749-9e95-4d2532988a04\"><\/a><\/h4>\r\nAll computers designed since 1985 use IEEE floating point arithmetic. This doesn't mean that they all get exactly the same results, because there is some flexibility within the standard. 
But it does mean that we now have a machine-independent model of how floating point arithmetic behaves.\r\n\r\nMATLAB has traditionally used the IEEE double precision format. In recent years, the single precision format has also been available. This saves space and can sometimes be faster. There is also an extended precision format, which is not precisely specified by the standard and which may be used for intermediate results by the floating point hardware. This is one source of the lack of uniformity among different machines.\r\n\r\nMost nonzero floating point numbers are normalized. This means they can be expressed as\r\n\r\n$$ x = \\pm (1 + f) \\cdot 2^e $$\r\n\r\nThe quantity $f$ is the <i>fraction<\/i> or <i>mantissa<\/i> and $e$ is the <i>exponent<\/i>. The fraction satisfies\r\n\r\n$$ 0 \\leq f &lt; 1 $$\r\n\r\nThe fraction $f$ must be representable in binary using at most 52 bits for double precision and 23 bits for single precision. In other words, for double, $2^{52} f$ is an integer in the interval\r\n\r\n$$ 0 \\leq 2^{52} f &lt; 2^{52} $$\r\n\r\nFor single, $2^{23} f$ is an integer in the interval\r\n\r\n$$ 0 \\leq 2^{23} f &lt; 2^{23} $$\r\n\r\nFor double precision, the exponent $e$ is an integer in the interval\r\n\r\n$$ -1022 \\leq e \\leq 1023 $$\r\n\r\nFor single, the exponent $e$ is in the interval\r\n\r\n$$ -126 \\leq e \\leq 127 $$\r\n<h4>Precision versus Range<a name=\"92e9183c-bf90-4d7f-a056-130ec0b3ff6f\"><\/a><\/h4>\r\nThe finiteness of $f$ is a limitation on <i>precision<\/i>. The finiteness of $e$ is a limitation on <i>range<\/i>. Any numbers that don't meet these limitations must be approximated by ones that do.\r\n<h4>Floating Point Format<a name=\"c0f7dead-910c-454a-8373-636ba49fa63d\"><\/a><\/h4>\r\nDouble precision floating point numbers are stored in a 64-bit word, with 52 bits for $f$, 11 bits for $e$, and 1 bit for the sign of the number. The sign of $e$ is accommodated by storing $e+1023$, which is between $1$ and $2^{11}-2$. 
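This layout is easy to inspect directly. Since IEEE double precision behaves the same in every language, here is a short sketch in Python rather than MATLAB (Python floats are IEEE doubles); the helper name `fields` is just for illustration. It unpacks the 64-bit word into its three fields and undoes the exponent bias:

```python
import struct

def fields(x):
    # Reinterpret the double as its raw 64-bit word.
    (word,) = struct.unpack('<Q', struct.pack('<d', x))
    sign = word >> 63                     # 1 sign bit
    e = ((word >> 52) & 0x7FF) - 1023     # 11 bits store the biased exponent e + 1023
    f = (word & ((1 << 52) - 1)) / 2**52  # 52 bits store the integer 2^52 * f
    return sign, e, f

# x = (1 + f) * 2^e, so 0.1 is stored as (1 + f) * 2^-4 with f slightly above 0.6.
sign, e, f = fields(0.1)
print(sign, e, f)   # 0 -4 0.6000000000000001
```

Reassembling `(1 + f) * 2**e` recovers the stored value exactly, which is how the normalization formula and the bit layout fit together.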
The 2 extreme values for the exponent field, $0$ and $2^{11}-1$, are reserved for exceptional floating point numbers that I will describe later.\r\n\r\nSingle precision floating point numbers are stored in a 32-bit word, with 23 bits for $f$, 8 bits for $e$, and 1 bit for the sign of the number. The sign of $e$ is accommodated by storing $e+127$, which is between $1$ and $2^{8}-2$. Again, the 2 extreme values for the exponent field are reserved for exceptional floating point numbers.\r\n\r\nThe entire fractional part of a floating point number is not $f$, but $1+f$. However, the leading $1$ doesn't need to be stored. In effect, the IEEE formats pack $33$ or $65$ bits of information into a $32$ or $64$-bit word.\r\n<h4>floatgui<a name=\"4a6961d5-8770-4ae5-b740-f38865248976\"><\/a><\/h4>\r\nMy program <tt>floatgui<\/tt>, <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/37976-numerical-computing-with-matlab\">available here<\/a>, shows the distribution of the positive numbers in a model floating point system with variable parameters. The parameter $t$ specifies the number of bits used to store $f$. In other words, $2^t f$ is an integer. The parameters $e_{min}$ and $e_{max}$ specify the range of the exponent, so $e_{min} \\leq e \\leq e_{max}$. Initially, <tt>floatgui<\/tt> sets $t = 3$, $e_{min} = -4$, and $e_{max} = 2$ and produces the following distribution.\r\n\r\n<img decoding=\"async\" src=\"https:\/\/blogs.mathworks.com\/images\/cleve\/floatguijpeg.jpeg\" alt=\"\" hspace=\"5\" vspace=\"5\" \/>\r\n\r\nWithin each binary interval $2^e \\leq x \\leq 2^{e+1}$, the numbers are equally spaced with an increment of $2^{e-t}$. If $e = 0$ and $t = 3$, for example, the spacing of the numbers between $1$ and $2$ is $1\/8$. As $e$ increases, the spacing increases.\r\n\r\nIt is also instructive to display the floating point numbers with a logarithmic scale. 
Here is the output from <tt>floatgui<\/tt> with <tt>logscale<\/tt> checked and $t = 5$, $e_{min} = -4$, and $e_{max} = 3$. With this logarithmic scale, it is more apparent that the distribution in each binary interval is the same.\r\n\r\n<img decoding=\"async\" src=\"https:\/\/blogs.mathworks.com\/images\/cleve\/logfloatgui.jpeg\" alt=\"\" hspace=\"5\" vspace=\"5\" \/>\r\n<h4>eps<a name=\"2fef5a1d-01fb-4555-8f43-52f68a153d9d\"><\/a><\/h4>\r\nA very important quantity associated with floating point arithmetic is highlighted in red by <tt>floatgui<\/tt>. MATLAB calls this quantity <tt>eps<\/tt>, which is short for <i>machine epsilon<\/i>.\r\n<div>\r\n<ul>\r\n\t<li><tt>eps<\/tt> is the distance from $1$ to the next larger floating point number.<\/li>\r\n<\/ul>\r\n<\/div>\r\nFor the <tt>floatgui<\/tt> model floating point system, <tt>eps<\/tt> = $2^{-t}$.\r\n\r\nBefore the IEEE standard, different machines had different values of <tt>eps<\/tt>. Now, for IEEE double precision the approximate decimal value of <tt>eps<\/tt> is\r\n<pre class=\"codeinput\">   format <span class=\"string\">short<\/span>\r\n   format <span class=\"string\">compact<\/span>\r\n   eps_d = eps\r\n<\/pre>\r\n<pre class=\"codeoutput\">eps_d =\r\n   2.2204e-16\r\n<\/pre>\r\nAnd for single precision,\r\n<pre class=\"codeinput\">   eps_s = eps(<span class=\"string\">'single'<\/span>)\r\n<\/pre>\r\n<pre class=\"codeoutput\">eps_s =\r\n  1.1921e-07\r\n<\/pre>\r\nEither <tt>eps\/2<\/tt> or <tt>eps<\/tt> can be called the roundoff level. The maximum relative error incurred when the result of an arithmetic operation is rounded to the nearest floating point number is <tt>eps\/2<\/tt>. The maximum relative spacing between numbers is <tt>eps<\/tt>. In either case, you can say that the roundoff level is about 16 decimal digits for double and about 7 decimal digits for single.\r\n\r\nActually <tt>eps<\/tt> is a function. 
<tt>eps(x)<\/tt> is the distance from <tt>abs(x)<\/tt> to the next larger in magnitude floating point number of the same precision as <tt>x<\/tt>.\r\n<h4>One-tenth<a name=\"53100ae5-10b5-4011-ba6c-5d3d00c7e702\"><\/a><\/h4>\r\nA frequently occurring example is provided by the simple MATLAB statement\r\n<pre class=\"language-matlab\">t = 0.1\r\n<\/pre>\r\nThe mathematical value $t$ stored in <tt>t<\/tt> is not exactly $0.1$ because expressing the decimal fraction $1\/10$ in binary requires an infinite series. In fact,\r\n\r\n$$ \\frac{1}{10} = \\frac{1}{2^4} + \\frac{1}{2^5} + \\frac{0}{2^6} +\r\n\\frac{0}{2^7} + \\frac{1}{2^8} + \\frac{1}{2^9} + \\frac{0}{2^{10}} +\r\n\\frac{0}{2^{11}} + \\frac{1}{2^{12}} + \\cdots $$\r\n\r\nAfter the first term, the sequence of coefficients $1, 0, 0, 1$ is repeated infinitely often. Grouping the resulting terms together four at a time expresses $1\/10$ in a base 16, or <i>hexadecimal<\/i>, series.\r\n\r\n$$ \\frac{1}{10} = 2^{-4} \\cdot \\left(1 + \\frac{9}{16} + \\frac{9}{16^2} +\r\n\\frac{9}{16^3} + \\frac{9}{16^{4}} + \\cdots\\right) $$\r\n\r\nFloating-point numbers on either side of $1\/10$ are obtained by terminating the fractional part of this series after $52$ binary terms, or 13 hexadecimal terms, and rounding the last term up or down. Thus\r\n\r\n$$ t_1 &lt; 1\/10 &lt; t_2 $$\r\n\r\nwhere\r\n\r\n$$ t_1 = 2^{-4} \\cdot \\left(1 + \\frac{9}{16} + \\frac{9}{16^2} +\r\n\\frac{9}{16^3} + \\cdots + \\frac{9}{16^{12}} + \\frac{9}{16^{13}}\\right) $$\r\n\r\n$$ t_2 = 2^{-4} \\cdot \\left(1 + \\frac{9}{16} + \\frac{9}{16^2} +\r\n\\frac{9}{16^3} + \\cdots + \\frac{9}{16^{12}} + \\frac{10}{16^{13}}\\right) $$\r\n\r\nIt turns out that $1\/10$ is closer to $t_2$ than to $t_1$, so $t$ is equal to $t_2$. 
In other words,\r\n\r\n$$ t = (1 + f) \\cdot 2^e $$\r\n\r\nwhere\r\n\r\n$$ f = \\frac{9}{16} + \\frac{9}{16^2} + \\frac{9}{16^3} + \\cdots\r\n+ \\frac{9}{16^{12}} + \\frac{10}{16^{13}} $$\r\n\r\n$$ e = -4 $$\r\n<h4>Hexadecimal format<a name=\"d20292e2-e6f1-4d46-8e81-f4a5c9209ee7\"><\/a><\/h4>\r\nThe MATLAB command\r\n<pre class=\"language-matlab\">format <span class=\"string\">hex<\/span>\r\n<\/pre>\r\ncauses <tt>t<\/tt> to be displayed as\r\n<pre class=\"language-matlab\">3fb999999999999a\r\n<\/pre>\r\nThe characters <tt>a<\/tt> through <tt>f<\/tt> represent the hexadecimal \"digits\" 10 through 15. The first three characters, <tt>3fb<\/tt>, give the hexadecimal representation of decimal $1019$, which is the value of the biased exponent $e+1023$ if $e$ is $-4$. The other 13 characters are the hexadecimal representation of the fraction $f$.\r\n\r\nIn summary, the value stored in <tt>t<\/tt> is very close to, but not exactly equal to, $0.1$. The distinction is occasionally important. For example, the quantity\r\n<pre class=\"language-matlab\">0.3\/0.1\r\n<\/pre>\r\nis not exactly equal to $3$ because the actual numerator is a little less than $0.3$ and the actual denominator is a little greater than $0.1$.\r\n\r\nTen steps of length <tt>t<\/tt> are not precisely the same as one step of length 1. MATLAB is careful to arrange that the last element of the vector\r\n<pre class=\"language-matlab\">0:0.1:1\r\n<\/pre>\r\nis exactly equal to $1$, but if you form this vector yourself by repeated additions of $0.1$, you will miss hitting the final $1$ exactly.\r\n<h4>Golden Ratio<a name=\"bd7575a2-a92a-41dd-ba55-121d8e8ca093\"><\/a><\/h4>\r\nWhat does the floating point approximation to the golden ratio, $\\phi$, look like?\r\n<pre class=\"codeinput\">   format <span class=\"string\">hex<\/span>\r\n   phi = (1 + sqrt(5))\/2\r\n<\/pre>\r\n<pre class=\"codeoutput\">phi =\r\n   3ff9e3779b97f4a8\r\n<\/pre>\r\nThe first hex digit, <tt>3<\/tt>, is <tt>0011<\/tt> in binary. 
The first bit is the sign of the floating point number; <tt>0<\/tt> is positive, <tt>1<\/tt> is negative. So <tt>phi<\/tt> is positive. The remaining bits of the first three hex digits contain $e+1023$. In this example, <tt>3ff<\/tt> in base 16 is $3 \\cdot 16^2 + 15 \\cdot 16 + 15 = 1023$ in decimal. So\r\n\r\n$$ e = 0 $$\r\n\r\nIn fact, any floating point number between $1.0$ and $2.0$ has $e = 0$, so its <tt>hex<\/tt> output begins with <tt>3ff<\/tt>. The other 13 hex digits contain $f$. In this example,\r\n\r\n$$ f = \\frac{9}{16} + \\frac{14}{16^2} + \\frac{3}{16^3} + \\cdots\r\n+ \\frac{10}{16^{12}} + \\frac{8}{16^{13}} $$\r\n\r\nWith these values of $f$ and $e$,\r\n\r\n$$ (1 + f) 2^e \\approx \\phi $$\r\n<h4>Computing eps<a name=\"24cb4f4d-b8a9-4c19-b22b-9d2a9f7f3812\"><\/a><\/h4>\r\nAnother example is provided by the following code segment.\r\n<pre class=\"codeinput\">   format <span class=\"string\">long<\/span>\r\n   a = 4\/3\r\n   b = a - 1\r\n   c = 3*b\r\n   e = 1 - c\r\n<\/pre>\r\n<pre class=\"codeoutput\">a =\r\n   1.333333333333333\r\nb =\r\n   0.333333333333333\r\nc =\r\n   1.000000000000000\r\ne =\r\n     2.220446049250313e-16\r\n<\/pre>\r\nWith exact computation, <tt>e<\/tt> would be zero. But with floating point, we obtain double precision <tt>eps<\/tt>. Why?\r\n\r\nIt turns out that the only roundoff occurs in the division in the first statement. The quotient cannot be exactly $4\/3$, except on that Russian trinary computer. Consequently the value stored in <tt>a<\/tt> is close to, but not exactly equal to, $4\/3$. The subtraction <tt>b = a - 1<\/tt> produces a quantity <tt>b<\/tt> whose last bit is 0. This means that the multiplication <tt>3*b<\/tt> can be done without any roundoff. The value stored in <tt>c<\/tt> is not exactly equal to $1$, and so the value stored in <tt>e<\/tt> is not 0. 
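Since only IEEE arithmetic is involved, the same four statements behave identically outside MATLAB. Here is the computation in Python, whose floats are IEEE doubles:

```python
import sys

a = 4 / 3   # the only rounded operation: 4/3 is not representable in binary
b = a - 1   # exact, and b's last bit is 0
c = 3 * b   # exact, because b's last bit is 0
e = 1 - c   # exact
print(e)                                        # 2.220446049250313e-16
print(e == 2**-52 == sys.float_info.epsilon)    # True
```

The result is exactly $2^{-52}$, the double precision `eps`.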
Before the IEEE standard, this code was used as a quick way to estimate the roundoff level on various computers.\r\n<h4>Underflow and Overflow<a name=\"1e7249e4-aba4-48e6-8db9-8806a3c61e94\"><\/a><\/h4>\r\nThe roundoff level <tt>eps<\/tt> is sometimes called \"floating point zero,\" but that's a misnomer. There are many floating point numbers much smaller than <tt>eps<\/tt>. The smallest positive normalized floating point number has $f = 0$ and $e = -1022$. The largest floating point number has $f$ a little less than $1$ and $e = 1023$. MATLAB calls these numbers <tt>realmin<\/tt> and <tt>realmax<\/tt>. Together with <tt>eps<\/tt>, they characterize the standard system.\r\n<pre>                        Double Precision\r\n                   Binary           Decimal\r\n      eps          2^(-52)          2.2204e-16\r\n      realmin      2^(-1022)        2.2251e-308\r\n      realmax      (2-eps)*2^1023   1.7977e+308<\/pre>\r\n<pre>                        Single Precision\r\n                   Binary           Decimal\r\n      eps          2^(-23)          1.1921e-07\r\n      realmin      2^(-126)         1.1755e-38\r\n      realmax      (2-eps)*2^127    3.4028e+38<\/pre>\r\nIf any computation tries to produce a value larger than <tt>realmax<\/tt>, it is said to <i>overflow<\/i>. The result is an exceptional floating point value called <i>infinity<\/i> or <tt>Inf<\/tt>. It is represented by taking $e = 1024$ and $f = 0$ and satisfies relations like <tt>1\/Inf = 0<\/tt> and <tt>Inf+Inf = Inf<\/tt>.\r\n\r\nIf any computation tries to produce a value that is undefined even in the real number system, the result is an exceptional value known as Not-a-Number, or <tt>NaN<\/tt>. Examples include <tt>0\/0<\/tt> and <tt>Inf-Inf<\/tt>. <tt>NaN<\/tt> is represented by taking $e = 1024$ and $f$ nonzero.\r\n\r\nIf any computation tries to produce a value smaller than <tt>realmin<\/tt>, it is said to <i>underflow<\/i>. 
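These limits and exceptional values are easy to probe. As a sketch in Python (whose floats are IEEE doubles), the values MATLAB calls <tt>realmax</tt>, <tt>realmin</tt>, and <tt>eps</tt> live in `sys.float_info`:

```python
import math
import sys

fi = sys.float_info
print(fi.max)   # realmax = (2-eps)*2^1023 = 1.7976931348623157e+308
print(fi.min)   # realmin = 2^(-1022)      = 2.2250738585072014e-308

# Overflow produces Inf; operations undefined even for reals produce NaN.
print(fi.max * 2)            # inf
print(1 / math.inf)          # 0.0
print(math.inf - math.inf)   # nan

# Underflow is gradual: eps*realmin is the smallest positive subnormal.
tiny = fi.epsilon * fi.min    # 2^(-52) * 2^(-1022) = 2^(-1074)
print(tiny, tiny / 2 == 0.0)  # 5e-324 True
```

Anything below half of `tiny` rounds all the way to zero, which is where the floating point system finally runs out of numbers.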
This involves one of most controversial aspects of the IEEE standard. It produces exceptional <i>denormal<\/i> or <i>subnormal<\/i> floating point numbers in the interval between <tt>realmin<\/tt> and <tt>eps*realmin<\/tt>. The smallest positive double precision subnormal number is\r\n<pre class=\"codeinput\">   format <span class=\"string\">short<\/span>\r\n   smallest_d = eps*realmin\r\n   smallest_s = eps(<span class=\"string\">'single'<\/span>)*realmin(<span class=\"string\">'single'<\/span>)\r\n<\/pre>\r\n<pre class=\"codeoutput\">smallest_d =\r\n  4.9407e-324\r\nsmallest_s =\r\n  1.4013e-45\r\n<\/pre>\r\nAny results smaller than this are set to 0.\r\n\r\nThe controversy surrounding underflow and denormals will be the subject of my next blog post.\r\n\r\n<script>\/\/ <![CDATA[\r\nfunction grabCode_e6f92e235c484f40bbb999f7eed1d2d3() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='e6f92e235c484f40bbb999f7eed1d2d3 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' e6f92e235c484f40bbb999f7eed1d2d3';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2014 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('\r\n\r\n<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\r\n\r\n\r\n\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }\r\n\/\/ ]]><\/script>\r\n<p style=\"text-align: right; font-size: xx-small; font-weight: lighter; font-style: italic; color: gray;\">\r\n<a><span style=\"font-size: x-small; font-style: italic;\">Get\r\nthe MATLAB code<noscript>(requires JavaScript)<\/noscript><\/span><\/a>\r\n\r\nPublished with MATLAB\u00ae R2014a<\/p>\r\n\r\n<\/div>\r\n<!--\r\ne6f92e235c484f40bbb999f7eed1d2d3 ##### SOURCE BEGIN #####\r\n%% Floating Point Numbers\r\n% This is the first part of a two-part series about the single- and\r\n% double precision floating point numbers that MATLAB uses for almost all\r\n% of its arithmetic operations.\r\n% (This post is adapted from section 1.7 of my book\r\n% <https:\/\/www.mathworks.com\/moler\/chapters.html % Numerical Computing with MATLAB>, published by MathWorks and SIAM.)\r\n\r\n%% IEEE 754-1985 Standard\r\n% If you look carefully at the definitions of fundamental arithmetic\r\n% operations like addition and multiplication, you soon encounter the\r\n% mathematical abstraction known as real numbers.  But actual computation\r\n% with real numbers is not very practical because it involves notions like\r\n% limits and infinities. 
Instead, MATLAB and most other technical computing\r\n% environments use floating point arithmetic, which involves a finite set of\r\n% numbers with finite precision.  This leads to the phenomena of _roundoff_,\r\n% _underflow_, and _overflow_.  Most of the time, it is possible to use MATLAB\r\n% effectively without worrying about these details, but, every once in a while,\r\n% it pays to know something about the properties and limitations of floating\r\n% point numbers.\r\n\r\n%%\r\n% Thirty years ago, the situation was far more complicated than it is today.\r\n% Each computer had its own floating point number system.  Some were binary;\r\n% some were decimal.  There was even a Russian computer that used trinary\r\n% arithmetic.  Among the binary computers, some used 2 as the base; others\r\n% used 8 or 16.  And everybody had a different precision.  In 1985, the IEEE\r\n% Standards Board and the American National Standards Institute adopted the\r\n% ANSI\/IEEE Standard 754-1985 for Binary Floating-Point Arithmetic.  This was\r\n% the culmination of almost a decade of work by a 92-person working group of\r\n% mathematicians, computer scientists, and engineers from universities,\r\n% computer manufacturers, and microprocessor companies.\r\n\r\n%%\r\n% A revision of IEEE_754-1985, known as\r\n% <http:\/\/en.wikipedia.org\/wiki\/IEEE_754-2008 IEEE 754-2008>, was published\r\n% in 2008.  The revision combines the binary standards in 754 and the decimal\r\n% standards in another document, 854.  But as far as I can tell, the new\r\n% features introduced in the revision haven't had much impact yet.\r\n\r\n%% Velvel Kahan\r\n% William M. Kahan was the primary architect of IEEE floating point arithmetic.\r\n% I've always known him by his nickname \"Velvel\".  I met him in the early 1960's\r\n% when I was a grad student at Stanford and he was a young faculty member at\r\n% the University of Toronto.  
He moved from Toronto to UC Berkeley in the late\r\n% 1960's and spent the rest of his career at Berkeley.  He is now an emeritus\r\n% professor of mathematics and of electrical engineering and computer science.\r\n\r\n%%\r\n% Velvel is an eminent numerical analyst.  Among many other accomplishments,\r\n% he is the developer along with Gene Golub of the first practical algorithm\r\n% for computing the singular value decomposition.\r\n\r\n%%\r\n% In the late 1970's microprocessor manufacturers including Intel, Motorola,\r\n% and Texas Instruments were developing the chips that were to become the\r\n% basis for the personal computer and eventually today's electronics.\r\n% In a remarkable case of cooperation among competitors, they established a\r\n% committee to develop a common standard for floating point arithmetic.\r\n\r\n%%\r\n% Velvel was a consult to Intel, working with John Palmer, a Stanford grad who\r\n% was the principal architect of the 8087 math coprocessor.  The 8087\r\n% accompanied the 8086, which was to become the CPU of the IBM PC.  Kahan and\r\n% Palmer convinced Intel to allow them to release the specs for their math\r\n% coprocessor to the IEEE committee.  These plans became the basis for the\r\n% standard.\r\n\r\n%%\r\n% Velvel received the ACM Turing Award in 1989 for \"his fundamental\r\n% contributions to numerical analysis\".  He was named an ACM Fellow in 1994,\r\n% and was inducted into the National Academy of Engineering in 2005.\r\n\r\n%% Single and Double Precision\r\n% All computers designed since 1985 use IEEE floating point arithmetic.\r\n% This doesn't mean that they all get exactly the same results, because\r\n% there is some flexibility within the standard. 
But it does mean that we now\r\n% have a machine-independent model of how floating point arithmetic behaves.\r\n\r\n%%\r\n% MATLAB has traditionally used the IEEE double precision format.\r\n% In recent years, the single precision format has also been available.\r\n% This saves space and can sometimes be faster.  There is also an extended\r\n% precision format, which is not precisely specified by the standard and\r\n% which may be used for intermediate results by the floating point hardware.\r\n% This is one source of the lack of uniformity among different machines.\r\n\r\n%%\r\n% Most nonzero floating point numbers are normalized. This means\r\n% they can be expressed as\r\n%\r\n% $$ x = \\pm (1 + f) \\cdot 2^e $$\r\n%\r\n% The quantity $f$ is the _fraction_ or _mantissa_ and $e$ is the _exponent_.\r\n% The fraction satisfies\r\n%\r\n% $$ 0 \\leq f < 1 $$\r\n%\r\n% The fraction $f$ must be representable in binary using at most 52 bits for\r\n% double precision and 23 bits for single recision.  In other words, for\r\n% double, $2^{52} f$ is an integer in the interval\r\n%\r\n% $$ 0 \\leq 2^{52} f < 2^{52} $$\r\n%\r\n% For single, $2^{23} f$ is an integer in the interval\r\n%\r\n% $$ 0 \\leq 2^{23} f < 2^{23} $$\r\n\r\n%%\r\n% For double precision, the exponent $e$ is an integer in the interval\r\n%\r\n% $$ -1022 \\leq e \\leq 1023 $$\r\n%\r\n% For single, the exponent $e$ is in the interval\r\n%\r\n% $$ -126 \\leq e \\leq 127 $$\r\n\r\n%% Precision versus Range\r\n% The finiteness of $f$ is a limitation on _precision_. The\r\n% finiteness of $e$ is a limitation on _range_.  Any numbers that\r\n% don't meet these limitations must be approximated by ones that do.\r\n\r\n%% Floating Point Format\r\n% Double precision floating point numbers are stored in a\r\n% 64-bit word, with 52 bits for $f$, 11 bits for $e$, and 1 bit for the\r\n% sign of the number. The sign of $e$ is accommodated by storing\r\n% $e+1023$, which is between $1$ and $2^{11}-2$. 
The 2 extreme values\r\n% for the exponent field, $0$ and $2^{11}-1$, are reserved for exceptional\r\n% floating point numbers that I will describe later.\r\n\r\n%%\r\n% Single precision floating point numbers are stored in a\r\n% 32-bit word, with 23 bits for $f$, 8 bits for $e$, and 1 bit for the\r\n% sign of the number. The sign of $e$ is accommodated by storing\r\n% $e+127$, which is between $1$ and $2^{8}-2$. Again, the 2 extreme values\r\n% for the exponent field are reserved for exceptional floating point numbers.\r\n\r\n%%\r\n% The entire fractional part of a floating point number is not $f$, but\r\n% $1+f$.  However, the leading $1$ doesn't need to be stored.  In effect,\r\n% the IEEE formats pack $33$ or $65$ bits of information into a $32$ or\r\n% $64$-bit word.\r\n\r\n%% floatgui\r\n% My program |floatgui|, <https:\/\/www.mathworks.com\/moler.htmlncmfilelist.html % available here>, shows the distribution of the positive\r\n% numbers in a model floating point system with variable parameters.\r\n% The parameter $t$ specifies the number of bits used to store $f$.\r\n% In other words, $2^t f$ is an integer.  The parameters $e_{min}$\r\n% and $e_{max}$ specify the range of the exponent, so\r\n% $e_{min} \\leq e \\leq e_{max}$.  Initially, |floatgui| sets\r\n% $t = 3$, $e_{min} = -4$, and $e_{max} = 2$ and produces the\r\n% following distribution.\r\n%\r\n% <<floatguijpeg.jpeg>>\r\n%\r\n% Within each binary interval $2^e \\leq x \\leq 2^{e+1}$, the numbers\r\n% are equally spaced with an increment of $2^{e-t}$.  If $e = 0$ and $t = 3$,\r\n% for example, the spacing of the numbers between $1$ and $2$ is $1\/8$.\r\n% As $e$ increases, the spacing increases.\r\n\r\n%%\r\n% It is also instructive to display the floating point numbers with\r\n% a logarithmic scale.  Here is the output from |floatgui| with |logscale|\r\n% checked and $t = 5$, $e_{min} = -4$, and $e_{max} = 3$.  
With this\r\n% logarithmic scale, it is more apparent that the distribution in each\r\n% binary interval is the same.\r\n%\r\n% <<logfloatgui.jpeg>>\r\n%\r\n\r\n%% eps\r\n% A very important quantity associated with floating point arithmetic is\r\n% highlighted in red by |floatgui|.  MATLAB calls this quantity |eps|,\r\n% which is short for _machine epsilon_.\r\n%\r\n% * |eps| is the distance from $1$ to the next larger floating point number.\r\n%\r\n% For the |floatgui| model floating point system, |eps| = $2^{-t}$.\r\n\r\n%%\r\n% Before the IEEE standard, different machines had different values of |eps|.\r\n% Now, for IEEE double precision the approximate decimal value of |eps| is\r\n\r\nformat short\r\nformat compact\r\neps_d = eps\r\n\r\n%%\r\n% And for single precision,\r\n\r\neps_s = eps('single')\r\n\r\n%%\r\n% Either |eps\/2| or |eps| can be called the roundoff level.\r\n% The maximum relative error incurred when the result of an arithmetic\r\n% operation is rounded to the nearest floating point number is |eps\/2|.\r\n% The maximum relative spacing between numbers is |eps|. In either case,\r\n% you can say that the roundoff level is about 16 decimal digits for\r\n% double and about 7 decimal digits for single.\r\n\r\n%%\r\n% Actually |eps| is a function.  |eps(x)| is the distance from |abs(x)| to the\r\n% next larger in magnitude floating point number of the same precision as |x|.\r\n\r\n%% One-tenth\r\n% A frequently occurring example is provided by the simple MATLAB statement\r\n%\r\n%   t = 0.1\r\n%\r\n% The mathematical value $t$ stored in |t| is not exactly $0.1$ because\r\n% expressing the decimal fraction $1\/10$ in binary requires an infinite\r\n% series. 
In fact,
%
% $$ \frac{1}{10} = \frac{1}{2^4} + \frac{1}{2^5} + \frac{0}{2^6} +
% \frac{0}{2^7} + \frac{1}{2^8} + \frac{1}{2^9} + \frac{0}{2^{10}} +
% \frac{0}{2^{11}} + \frac{1}{2^{12}} + \cdots $$
%
% After the first term, the sequence of coefficients $1, 0, 0, 1$ is repeated
% infinitely often.  Grouping the resulting terms together four at a time
% expresses $1/10$ in a base 16, or _hexadecimal_, series.
%
% $$ \frac{1}{10} = 2^{-4} \cdot \left(1 + \frac{9}{16} + \frac{9}{16^2} +
% \frac{9}{16^3} + \frac{9}{16^{4}} + \cdots\right) $$
%
% Floating point numbers on either side of $1/10$ are obtained by
% terminating the fractional part of this series after $52$ binary terms,
% or 13 hexadecimal terms, and rounding the last term up or down.  Thus
%
% $$ t_1 < 1/10 < t_2 $$
%
% where
%
% $$ t_1 = 2^{-4} \cdot \left(1 + \frac{9}{16} + \frac{9}{16^2} +
% \frac{9}{16^3} + \cdots + \frac{9}{16^{12}} + \frac{9}{16^{13}}\right) $$
%
% $$ t_2 = 2^{-4} \cdot \left(1 + \frac{9}{16} + \frac{9}{16^2} +
% \frac{9}{16^3} + \cdots + \frac{9}{16^{12}} + \frac{10}{16^{13}}\right) $$
%
% It turns out that $1/10$ is closer to $t_2$ than to $t_1$, so $t$ is equal
% to $t_2$.  In other words,
%
% $$ t = (1 + f) \cdot 2^e $$
%
% where
%
% $$ f = \frac{9}{16} + \frac{9}{16^2} + \frac{9}{16^3} + \cdots
% + \frac{9}{16^{12}} + \frac{10}{16^{13}} $$
%
% $$ e = -4 $$

%% Hexadecimal format
% The MATLAB command
%
%   format hex
%
% causes |t| to be displayed as
%
%   3fb999999999999a
%
% The characters |a| through |f| represent the hexadecimal "digits"
% 10 through 15.  The first three characters, |3fb|, give the
% hexadecimal representation of decimal $1019$, which is the value
% of the biased exponent $e+1023$ if $e$ is $-4$.  The other 13
% characters are the hexadecimal representation of the fraction $f$.
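%%
% The same bit pattern can be inspected outside MATLAB.  As a hedged sketch
% (assuming only the Python standard library, whose floats are the same IEEE
% 754 doubles), packing 0.1 into its eight bytes reproduces the |format hex|
% display above:

```python
import struct

# Pack 0.1 as a big-endian IEEE 754 double and show its 16 hex digits.
# This matches MATLAB's "format hex" display of t = 0.1.
t = 0.1
print(struct.pack('>d', t).hex())   # 3fb999999999999a

# float.hex() presents the same value as (1+f)*2^e:
# significand 1.999999999999a (hex), exponent -4
print(t.hex())                      # 0x1.999999999999ap-4
```

% Note the trailing |a| = 10 in the fraction: that is the rounded-up last
% hexadecimal term of $t_2$.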
%%
% In summary, the value stored in |t| is very close to, but not
% exactly equal to, $0.1$.  The distinction is occasionally important.
% For example, the quantity
%
%   0.3/0.1
%
% is not exactly equal to $3$ because the actual numerator is a little less
% than $0.3$ and the actual denominator is a little greater than $0.1$.

%%
% Ten steps of length |t| are not precisely the same as one step of length 1.
% MATLAB is careful to arrange that the last element of the vector
%
%   0:0.1:1
%
% is exactly equal to $1$, but if you form this vector yourself by
% repeated additions of $0.1$, you will miss hitting the final $1$ exactly.

%% Golden Ratio
% What does the floating point approximation to the golden ratio, $\phi$,
% look like?

format hex
phi = (1 + sqrt(5))/2

%%
% The first hex digit, |3|, is |0011| in binary.  The first bit is
% the sign of the floating point number; |0| is positive, |1| is negative.
% So |phi| is positive.  The remaining bits of the first three hex
% digits contain $e+1023$.  In this example, |3ff| in base 16 is
% $3 \cdot 16^2 + 15 \cdot 16 + 15 = 1023$ in decimal.  So
%
% $$ e = 0 $$
%
% In fact, any floating point number between $1.0$ and $2.0$ has $e = 0$, so
% its |hex| output begins with |3ff|.  The other 13 hex digits contain $f$.
% In this example,
%
% $$ f = \frac{9}{16} + \frac{14}{16^2} + \frac{3}{16^3} + \cdots
% + \frac{10}{16^{12}} + \frac{8}{16^{13}} $$
%
% With these values of $f$ and $e$,
%
% $$ (1 + f) 2^e \approx \phi $$

%% Computing eps
% Another example is provided by the following code segment.

format long
a = 4/3
b = a - 1
c = 3*b
e = 1 - c

%%
% With exact computation, |e| would be zero.  But with floating point, we
% obtain double precision |eps|.  Why?
%
% It turns out that the only roundoff occurs in the division in the first
% statement.  The quotient cannot be exactly $4/3$, except on that Russian
% trinary computer.  Consequently the value stored in |a| is close to,
% but not exactly equal to, $4/3$.
The subtraction |b = a - 1| produces
% a quantity |b| whose last bit is 0.  This means that the multiplication
% |3*b| can be done without any roundoff.  The value stored in |c| is not
% exactly equal to $1$, and so the value stored in |e| is not 0.  Before the
% IEEE standard, this code was used as a quick way to estimate the roundoff
% level on various computers.

%% Underflow and Overflow
% The roundoff level |eps| is sometimes called "floating point zero,"
% but that's a misnomer.  There are many floating point numbers much smaller
% than |eps|.  The smallest positive normalized floating point number has
% $f = 0$ and $e = -1022$.  The largest floating point number has $f$
% a little less than $1$ and $e = 1023$.  MATLAB calls these numbers
% |realmin| and |realmax|.  Together with |eps|, they characterize the
% standard system.
%
%   Double Precision
%               Binary            Decimal
%   eps         2^(-52)           2.2204e-16
%   realmin     2^(-1022)         2.2251e-308
%   realmax     (2-eps)*2^1023    1.7977e+308
%
%   Single Precision
%               Binary            Decimal
%   eps         2^(-23)           1.1921e-07
%   realmin     2^(-126)          1.1755e-38
%   realmax     (2-eps)*2^127     3.4028e+38

%%
% If any computation tries to produce a value larger than |realmax|, it is
% said to _overflow_.  The result is an exceptional floating point value
% called _infinity_ or |Inf|.  It is represented by taking $e = 1024$ and
% $f = 0$ and satisfies relations like |1/Inf = 0| and |Inf+Inf = Inf|.

%%
% If any computation tries to produce a value that is undefined even in the
% real number system, the result is an exceptional value known as
% Not-a-Number, or |NaN|.  Examples include |0/0| and |Inf-Inf|.
% |NaN| is represented by taking $e = 1024$ and $f$ nonzero.

%%
% If any computation tries to produce a value smaller than |realmin|, it is
% said to _underflow_.  This involves one of the most controversial aspects
% of the IEEE standard.  It produces exceptional _denormal_ or _subnormal_
% floating point numbers in the interval between |realmin| and |eps*realmin|.
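%%
% These rules are part of the IEEE standard itself, not anything special
% about MATLAB, so they can be demonstrated in any conforming environment.
% A minimal sketch in Python (assuming its floats are IEEE 754 doubles, as
% on all mainstream platforms; |sys.float_info| supplies the analogues of
% |eps|, |realmin|, and |realmax|):

```python
import math
import sys

realmax = sys.float_info.max   # (2-eps)*2^1023, about 1.7977e+308
realmin = sys.float_info.min   # 2^(-1022),      about 2.2251e-308
eps = sys.float_info.epsilon   # 2^(-52),        about 2.2204e-16

print(realmax * 2)             # overflow produces inf
print(1 / math.inf)            # 0.0, since 1/Inf = 0
print(math.inf - math.inf)     # undefined even for reals: nan

# Gradual underflow: subnormals fill the gap between realmin and eps*realmin.
print(eps * realmin)           # smallest positive subnormal, 5e-324
print(eps * realmin / 2)       # halving once more flushes to 0.0
```

% Note that |0/0| raises an exception in Python rather than quietly
% returning |NaN|, which is why |Inf - Inf| is used as the example here.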
% The smallest positive subnormal numbers in double and single precision are

format short
smallest_d = eps*realmin
smallest_s = eps('single')*realmin('single')

%%
% Any results smaller than these are set to 0.

%%
% The controversy surrounding underflow and denormals will be the subject of
% my next blog post.
##### SOURCE END ##### e6f92e235c484f40bbb999f7eed1d2d3 -->","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/images\/cleve\/floatguijpeg.jpeg\" onError=\"this.style.display ='none';\" \/><\/div><!--introduction-->This is the first part of a two-part series about the single- and double precision floating point numbers that MATLAB uses for almost all of its arithmetic operations. (This post is adapted from section 1.7 of my book <a href=\"https:\/\/www.mathworks.com\/moler\/chapters.html\">Numerical Computing with MATLAB<\/a>, published by MathWorks and SIAM.)\r\n\r\n<!--\/introduction-->...
<a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/cleve\/2014\/07\/07\/floating-point-numbers\/\">read more >><\/a><\/p>","protected":false},"author":78,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[4,16,8,7],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/1008"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/users\/78"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/comments?post=1008"}],"version-history":[{"count":5,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/1008\/revisions"}],"predecessor-version":[{"id":1420,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/1008\/revisions\/1420"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/media?parent=1008"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/categories?post=1008"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/tags?post=1008"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}