# Quadruple Precision, 128-bit Floating Point Arithmetic

The floating point arithmetic format that occupies 128 bits of storage is known as *binary128* or *quadruple precision*. This blog post describes an implementation of quadruple precision programmed entirely in the MATLAB language.

### Contents

#### Background

The IEEE 754 standard, published in 1985, defines formats for floating point numbers that occupy 32 or 64 bits of storage. These formats are known as *binary32* and *binary64*, or more frequently as *single* and *double precision*. For many years MATLAB used only double precision and it remains our default format. Single precision has been added gradually over the last several years and is now also fully supported.

A revision of IEEE 754, published in 2008, defines two more floating point formats. One, *binary16* or *half precision*, occupies only 16 bits and was the subject of my previous blog post. It is primarily intended to reduce storage and memory bandwidth requirements. Since it provides only "half" precision, its use for actual computation is problematic.

The other new format introduced in IEEE 754-2008 is *binary128* or *quadruple precision*. It is intended for situations where the accuracy or range of double precision is inadequate.

I see two descriptions of quadruple precision software implementations on the Web.

I have not used either package, but judging by their Web pages, they both appear to be complete and well supported.

The MATLAB Symbolic Math Toolbox provides `vpa`, arbitrary precision decimal floating point arithmetic, and `sym`, exact rational arithmetic. Both provide accuracy and range well beyond quadruple precision, but do not specifically support the 128-bit IEEE format.

My goal here is to describe a prototype of a MATLAB object, `fp128`, that implements quadruple precision with code written entirely in the MATLAB language. It is not very efficient, but is does allow experimentation with the 128-bit format.

#### Beyond double

There are other floating point formats beyond double precision. *Long double* usually refers to the 80-bit extended precision floating point registers available with the Intel x86 architecture and described as *double extended* in IEEE 754. This provides the same exponent range as quadruple precision, but much less accuracy.

*Double double* refers to the use of a pair of double precision values. The exponent field and sign bit of the second double are ignored, so this is effectively a 116-bit format. Both the exponent range and the precision are more than double but less than quadruple.

#### Floating point anatomy

The format of a floating point number is characterized by two parameters, `p`, the number of bits in the fraction and `q`, the number of bits in the exponent. I will compare four precisions, *half*, *single*, *double*, and *quadruple*. The four pairs of characterizing parameters are

p = [10, 23, 52 112];

q = [5, 8, 11, 15];

With these values of `p` and `q`, and with one more bit for the sign, the total number of bits in the word, `w`, is a power of two.

```
format shortg
w = p + q + 1
```

w = 16 32 64 128

**Normalized numbers**

Most floating point numbers are *normalized*, and are expressed as

$$ x = \pm (1+f)2^e $$

The *fraction* $f$ is in the half open interval

$$ 0 \leq f < 1 $$

The binary representation of $f$ requires at most `p` bits. In other words $2^p f$ is an integer in the range

$$ 0 \leq 2^p f < 2^p $$

The *exponent* $e$ is an integer in the range

$$ -b+1 \leq e \leq b $$

The quantity $b$ is both the largest exponent and the `bias`.

$$ b = 2^{q-1} - 1 $$

b = 2.^(q-1)-1

b = 15 127 1023 16383

The fractional part of a normalized number is $1+f$, but only $f$ needs to be stored. That leading $1$ is known as the *hidden bit*.

**Subnormal**

There are two values of the exponent $e$ for which the biased exponent, $e+b$, reaches the smallest and largest values possible to represent in `q` bits. The smallest is

$$ e + b = 0 $$

The corresponding floating point numbers do not have a hidden leading bit. These are the *subnormal* or *denormal* numbers.

$$ x = \pm f 2^{-b} $$

**Infinity and Not-A-Number**

The largest possible biased exponent is

$$ e + b = 2^q-1 $$.

Quantities with this exponent field represent *infinities* and *NaN*, or *Not-A-Number*.

The percentage of floating point numbers that are exceptional because they are subnormal, infinity or NaN increases as the precision decreases. Exceptional exponents are only $2$ values out of $2^q$. For quadruple precision this is $2/2^{15}$, which is less than a one one-thousandth of one percent.

Encode the sign bit with `s = 0` for nonnegative and `s = 1` for negative. And encode the exponent with an offsetting bias, `b`. Then a floating point number can be packed in `w` bits with

x = [s e+b 2^p*f]

#### Precision and range

**epsilon**

If a real number cannot be expressed with a binary expansion requiring at most `p` bits, it must be approximated by a floating point number that does have such a binary representation. This is *roundoff error*. The important quantity characterizing precision is *machine epsilon*, or `eps`. In MATLAB, `eps(x)` is the distance from `x` to the next larger (in absolute value) floating point number (of that class). With no argument, `eps` is simply the difference between `1` and the next larger floating point number.

```
format shortg
eps = 2.^(-p)
```

eps = 0.00097656 1.1921e-07 2.2204e-16 1.9259e-34

This tells us that quadruple precision is good for about 34 decimal digits of accuracy, double for about 16 decimal digits, single for about 7, and half for about 3.

**realmax**

If a real number, or the result of an arithmetic operation, is too large to be represented, it *overflows* and is replaced by *infinity*. The largest floating point number that does not overflow is `realmax`. When I try to compute quadruple `realmax` with double precision, it overflows. I will fix this up in the table to follow.

realmax = 2.^b.*(2-eps)

realmax = 65504 3.4028e+38 1.7977e+308 Inf

**realmin**

*Underflow* and representation of very small numbers is more complicated. The smallest normalized floating point number is `realmin`. When I try to compute quadruple `realmin` it underflows to zero. Again, I will fix this up in the table.

realmin = 2.^(-b+1)

realmin = 6.1035e-05 1.1755e-38 2.2251e-308 0

**tiny**

But there are numbers smaller than `realmin`. IEEE 754 introduced the notion of *gradual underflow* and *denormal* numbers. In the 2008 revised standard their name was changed to *subnormal*.

Think of roundoff in numbers near underflow. Before 754, floating point numbers had the disconcerting property that `x` and `y` could be unequal, but their difference could underflow, so `x-y` becomes `0`. With 754 the gap between `0` and `realmin` is filled with numbers whose spacing is the same as the spacing between `realmin` and `2*realmin`. I like to call this spacing, and the smallest subnormal number, `tiny`.

tiny = realmin.*eps

tiny = 5.9605e-08 1.4013e-45 4.9407e-324 0

#### Floating point integers

**flintmax**

It is possible to do integer arithmetic with floating point numbers. I like to call such numbers *flints*. When we write the numbers $3$ and $3.0$, they are different descriptions of the same integer, but we think of one as fixed point and the other as floating point. The largest flint is `flintmax`.

flintmax = 2./eps

flintmax = 2048 1.6777e+07 9.0072e+15 1.0385e+34

Technically all the floating point numbers larger than `flintmax` are integers, but the spacing between them is larger than one, so it is not safe to use them for integer arithmetic. Only integer-valued floating point numbers between `0` and `flintmax` are allowed to be called flints.

#### Table

Let's collect all these anatomical characteristics together in a new MATLAB `table`. I have now edited the output and inserted the correct quadruple precision values.

T = [w; p; q; b; eps; realmax; realmin; tiny; flintmax]; T = table(T(:,1), T(:,2), T(:,3), T(:,4), ... 'variablenames',{'half','single','double','quadruple'}, ... 'rownames',{'w','p','q','b','eps','realmax','realmin', ... 'tiny','flintmax'}); type Table.txt

half single double quadruple __________ __________ ___________ __________ w 16 32 64 128 p 10 23 52 112 q 5 8 11 15 b 15 127 1023 16383 eps 0.00097656 1.1921e-07 2.2204e-16 1.9259e-34 realmax 65504 3.4028e+38 1.7977e+308 1.190e+4932 realmin 6.1035e-05 1.1755e-38 2.2251e-308 3.362e-4932 tiny 5.9605e-08 1.4013e-45 4.9407e-324 6.475e-4966 flintmax 2048 1.6777e+07 9.0072e+15 1.0385e+34

#### fp128

I am currently working on code for an object, `@fp128`, that could provide a full implementation of quadruple-precision arithmetic. The methods available so far are

methods(fp128)

Methods for class fp128: abs eq le mtimes realmax subsref cond fp128 lt ne realmin svd diag frac lu norm shifter sym disp ge max normalize sig times display gt minus normalize2 sign tril double hex mldivide plus sqrt triu ebias hypot mpower power subsasgn uminus eps ldivide mrdivide rdivide subsindex

The code that I have for quadrule precision is much more complex than the code that I have for half precision. There I am able to "cheat" by converting half precision numbers to doubles and relying on traditional MATLAB arithmetic. I can't do that for quads.

The storage scheme for quads is described in the help entry for the constructor.

```
help @fp128/fp128
```

fp128 Quad precision constructor. z = fp128(x) has three fields. x = s*(1+f)*2^e, where z.s, one uint8, s = (-1)^sg = 1-2*sg, sg = (1-s)/2. z.e, 15 bits, biased exponent, one uint16. b = 2^14-1 = 16383, eb = e + b, 1 <= eb <= 2*b for normalized quads, eb = 0 for subnormal quads, eb = 2*b+1 = 32767 for infinity and NaN. z.f, 112 bits, nonnegative fraction, 4-vector of uint64s, each with 1/4-th of the bits, 0 <= f(k) < 2^28, 4*28 = 112. z.f represents sum(f .* pow2), pow2 = 2.^(-28*(1:4)) Reference page in Doc Center doc fp128

Breaking the 112-bit fraction into four 28-bit pieces makes it possible to do arithmetic operations on the pieces without worrying about integer overflow. The core of the `times` code, which implements `x.*y`, is the convolution of the two fractional parts.

dbtype 45:53 @fp128/times

45 % The multiplication. 46 % z.f = conv(x.f,y.f); 47 % Restore hidden 1's. 48 xf = [1 x.f]; 49 yf = [1 y.f]; 50 zf = zeros(1,9,'uint64'); 51 for k = 1:5 52 zf(k:k+4) = zf(k:k+4) + yf(k)*xf; 53 end

The result of the convolution, `zf`, is a `uint64` vector of length nine with 52-bit elements. It must be renormalized to the fit the `fp128` storage scheme.

Addition and subtraction involve addition and subtraction of the fractional parts after they have been shifted so that the corresponding exponents are equal. Again, this produces temporary vectors that must be renormalized.

Scalar division, `y/x`, is done by first computing the reciprocal of the denominator, `1/x`, and then doing one final multiplication, `1/x * y`. The reciprocal is computed by a few steps of Newton iteration, starting with a scaled reciprocal, `1/double(x)`.

#### Examples

The output for each example shows the three fields in hexadecimal -- one sign field, one biased exponent field, and one fraction field that is a vector with four entries displayed with seven hex digits. This is followed by a 36 significant digit decimal representation.

**One**

```
clear
format long
one = fp128(1)
```

one = 0 3fff 0000000 0000000 0000000 0000000 1.0

**eps**

eps = eps(one)

eps = 0 3f8f 0000000 0000000 0000000 0000000 0.000000000000000000000000000000000192592994438723585305597794258492732

**1 + eps**

one_plus_eps = one + eps

one_plus_eps = 0 3fff 0000000 0000000 0000000 0000001 1.00000000000000000000000000000000019

**2 - eps**

two_minus_eps = 2 - eps

two_minus_eps = 0 3fff fffffff fffffff fffffff fffffff 1.99999999999999999999999999999999981

**realmin**

rmin = realmin(one)

rmin = 0 0001 0000000 0000000 0000000 0000000 3.3621031431120935062626778173217526e-4932

**realmax**

rmax = realmax(one)

rmax = 0 7ffe fffffff fffffff fffffff fffffff 1.18973149535723176508575932662800702e4932

**Compute 1/10 with double, then convert to quadruple.**

dble_tenth = fp128(1/10)

dble_tenth = 0 3ffb 9999999 99999a0 0000000 0000000 0.100000000000000005551115123125782702

**Compute 1/10 with quadruple.**

quad_tenth = 1/fp128(10)

quad_tenth = 0 3ffb 9999999 9999999 9999999 9999999 0.0999999999999999999999999999999999928

**Double precision pi converted to quadruple.**

dble_pi = fp128(pi)

dble_pi = 0 4000 921fb54 442d180 0000000 0000000 3.14159265358979311599796346854418516

`pi` accurate to quadruple precision.

```
quad_pi = fp128(sym('pi'))
```

quad_pi = 0 4000 921fb54 442d184 69898cc 51701b8 3.1415926535897932384626433832795028

#### Matrix operations

The 4-by-4 magic square from Durer's Melancholia II provides my first matrix example.

clear M = fp128(magic(4));

Let's see how the 128-bit elements look in hex.

```
format hex
M
```

M = 0 4003 0000000 0000000 0000000 0000000 0 4001 4000000 0000000 0000000 0000000 0 4002 2000000 0000000 0000000 0000000 0 4001 0000000 0000000 0000000 0000000 0 4000 0000000 0000000 0000000 0000000 0 4002 6000000 0000000 0000000 0000000 0 4001 c000000 0000000 0000000 0000000 0 4002 c000000 0000000 0000000 0000000 0 4000 8000000 0000000 0000000 0000000 0 4002 4000000 0000000 0000000 0000000 0 4001 8000000 0000000 0000000 0000000 0 4002 e000000 0000000 0000000 0000000 0 4002 a000000 0000000 0000000 0000000 0 4002 0000000 0000000 0000000 0000000 0 4002 8000000 0000000 0000000 0000000 0 3fff 0000000 0000000 0000000 0000000

Check that the row sums are all equal. This matrix-vector multiply can be done exactly with the flints in the magic square.

e = fp128(ones(4,1)) Me = M*e

e = 0 3fff 0000000 0000000 0000000 0000000 0 3fff 0000000 0000000 0000000 0000000 0 3fff 0000000 0000000 0000000 0000000 0 3fff 0000000 0000000 0000000 0000000 Me = 0 4004 1000000 0000000 0000000 0000000 0 4004 1000000 0000000 0000000 0000000 0 4004 1000000 0000000 0000000 0000000 0 4004 1000000 0000000 0000000 0000000

#### Quadruple precision backslash

I've overloaded `mldivide`, so I can solve linear systems and compute inverses. The actual computation is done by `lutx`, a "textbook" function that I wrote years ago, long before this quadruple-precision project, followed by the requisite solution of triangular systems. But now the MATLAB object system insures that every individual arithmetic operation is done with IEEE 754 quadruple precision.

Let's generate a 3-by-3 matrix with random two-digit integer entries.

A = fp128(randi(100,3,3))

A = 0 4002 0000000 0000000 0000000 0000000 0 4001 8000000 0000000 0000000 0000000 0 4004 b000000 0000000 0000000 0000000 0 4005 3800000 0000000 0000000 0000000 0 4005 7800000 0000000 0000000 0000000 0 4002 a000000 0000000 0000000 0000000 0 4004 c800000 0000000 0000000 0000000 0 4004 7800000 0000000 0000000 0000000 0 4000 0000000 0000000 0000000 0000000

I am going to use `fp128` backslash to invert `A`. So I need the identity matrix in quadruple precision.

I = fp128(eye(size(A)));

Now the overloaded backslash calls `lutx`, and then solves two triangular systems to produce the inverse.

X = A\I

X = 0 3ff7 2fd38ea bcfb815 69cdccc a36d8a5 1 3ff9 c595b53 8c842ee f26189c a0770d4 0 3ffa c0bc8b7 4adcc40 4ea66ca 61f1380 1 3ff7 a42f790 e4ad874 c358882 7ff988e 0 3ffa 12ea8c2 3ef8c17 01c7616 5e03a5a 1 3ffa 70d4565 958740b 78452d8 f32d866 0 3ff9 2fd38ea bcfb815 69cdccc a36d8a7 0 3ff3 86bc8e5 42ed82a 103d526 a56452f 1 3ff6 97f9949 ba961b3 72d69d9 4ace666

Compute the residual.

```
AX = A*X
R = I - AX;
format short
RD = double(R)
```

AX = 0 3fff 0000000 0000000 0000000 0000000 0 3f90 0000000 0000000 0000000 0000000 1 3f8d 0000000 0000000 0000000 0000000 0 0000 0000000 0000000 0000000 0000000 0 3fff 0000000 0000000 0000000 0000000 0 3f8d 8000000 0000000 0000000 0000000 1 3f8c 0000000 0000000 0000000 0000000 1 3f8d 0000000 0000000 0000000 0000000 0 3ffe fffffff fffffff fffffff ffffffb RD = 1.0e-33 * 0 0 0.0241 -0.3852 0 0.0481 0.0481 -0.0722 0.4815

Both `AX` and `R` are what I expect from arithmetic that is accurate to about 34 decimal digits.

Although I get a different random `A` every time I publish this blog post, I expect that it has a modest condition number.

kappa = cond(A)

kappa = 0 4002 7e97c18 91278cd 8375371 7915346 11.9560249020758193065358323606886569

Since `A` is not badly conditioned, I can invert the computed inverse and expect to get close to the original integer matrix. The elements of the resulting `Z` are integers, possibly bruised with quadruple precision fuzz.

```
format hex
Z = X\I
```

Z = 0 4002 0000000 0000000 0000000 0000000 0 4001 8000000 0000000 0000000 0000000 0 4004 b000000 0000000 0000000 0000004 0 4005 37fffff fffffff fffffff ffffffc 0 4005 77fffff fffffff fffffff ffffffe 0 4002 a000000 0000000 0000000 0000001 0 4004 c7fffff fffffff fffffff ffffffc 0 4004 77fffff fffffff fffffff ffffffc 0 3fff fffffff fffffff fffffff ffffffc

#### Quadruple precision SVD

I have just nonchalantly computed `cond(A)`. Here is the code for the overloaded `cond`.

```
type @fp128/cond.m
```

function kappa = cond(A) sigma = svd(A); kappa = sigma(1)/sigma(end); end

So it is correctly using the singular value decomposition. I also have `svd` overloaded. The SVD computation is handled by a 433 line M-file, `svdtx`, that, like `lutx`, was written before `fp128` existed. It was necessary to modify five lines in `svdtx`. The line

u = zeros(n,ncu);

had to be changed to

u = fp128(zeros(n,ncu));

Similarly for `v`, `s`, `e` and `work`. I should point out that the preallocation of the arrays is inherited from the LINPACK Fortran subroutine DSVDC. Without it, `svdtx` would not have required any modification to work correctly in quadruple precision.

Let's compute the full SVD.

[U,S,V] = svd(A)

U = 1 3ffe 57d9492 76f3ea4 dc14bb3 15d42c1 1 3ffe 75a77c4 8c7b469 2cac695 59be7fe 1 3ffc 0621737 9b04c78 1c2109d 8736b46 1 3ffb 38214c0 d75c84c 4bcf5ff f3cffd7 1 3ffb a9281e3 e12dd3a d632d61 c8f6e60 0 3ffe fbbccdc a571fa1 f5a588b fb0d806 1 3ffe 79587db 4889548 f09ae4b cd0150c 0 3ffe 59fae16 17bcabb 6408ba4 7b2a573 0 3ff8 cde38fc e952ad5 8b526c2 780c2e5 S = 0 4006 1f3ad79 d0b9b08 18b1444 030e4ef 0 0000 0000000 0000000 0000000 0000000 0 0000 0000000 0000000 0000000 0000000 0 0000 0000000 0000000 0000000 0000000 0 4004 a720ef6 28c6ec0 87f4c54 82dda2a 0 0000 0000000 0000000 0000000 0000000 0 0000 0000000 0000000 0000000 0000000 0 0000 0000000 0000000 0000000 0000000 0 4002 8061b9a 0e96d8c c2ef745 9ea4c9a V = 1 3ffb db3df03 9b5e1b3 5bf4478 0e42b0d 1 3ffe b540007 4d4bc9e dc9461a 0de0481 1 3ffe 03aaff4 d9cea2c e8ee2bc 2eba908 0 3ffe fa73e09 9ef8810 a03d2eb 46ade00 1 3ffa b316e2f fe9d3ae dfa9988 fbca927 1 3ffc 184af51 f25fece 97bc0da 5ff13a2 1 3ffb 706955f a877cbb b63f6dd 4e2150e 0 3ffe 08fc1eb 7b86ef7 4af3c6c 732aae9 1 3ffe b3aaead ef356e2 7cd2937 94b22a7

Reconstruct `A` from its quadruple precision SVD. It's not too shabby.

USVT = U*S*V'

USVT = 0 4001 fffffff fffffff fffffff fffffce 0 4001 7ffffff fffffff fffffff fffffc7 0 4004 b000000 0000000 0000000 000000a 0 4005 37fffff fffffff fffffff ffffff1 0 4005 77fffff fffffff fffffff ffffff6 0 4002 9ffffff fffffff fffffff fffffd2 0 4004 c7fffff fffffff fffffff ffffff1 0 4004 77fffff fffffff fffffff ffffff4 0 3fff fffffff fffffff fffffff ffffe83

#### Rosser matrix

An interesting example is provided by a classic test matrix, the 8-by-8 Rosser matrix. Let's compare quadruple precision computation with the exact rational computation provided by the Symbolic Math Toolbox.

First, generate quad and sym versions of `rosser`.

R = fp128(rosser); S = sym(rosser)

S = [ 611, 196, -192, 407, -8, -52, -49, 29] [ 196, 899, 113, -192, -71, -43, -8, -44] [ -192, 113, 899, 196, 61, 49, 8, 52] [ 407, -192, 196, 611, 8, 44, 59, -23] [ -8, -71, 61, 8, 411, -599, 208, 208] [ -52, -43, 49, 44, -599, 411, 208, 208] [ -49, -8, 8, 59, 208, 208, 99, -911] [ 29, -44, 52, -23, 208, 208, -911, 99]

`R` is symmetric, but not positive definite, so its LU factorization requires pivoting.

```
[L,U,p] = lutx(R);
format short
p
```

p = 1 2 3 7 6 8 4 5

`R` is singular, so with exact computation `U(n,n)` would be zero. With quadruple precision, the diagonal of `U` is

format long e diag(U)

ans = 611.0 836.126022913256955810147299509001582 802.209942588471107300640276546225738 99.0115741407236314604636000423592687 -710.481057851148425133280246646085002 579.272484693223512196223933017062413 -1.2455924519190846395771824210276321 0.000000000000000000000000000000215716190833522835766351129431653015

The relative size of the last diagonal element is zero to almost 34 digits.

double(U(8,8)/U(1,1))

ans = 3.530543221497919e-34

Compare this with symbolic computation, which, in this case, can compute an LU decomposition with exact rational arithmetic and no pivoting.

[L,U] = lu(S); diag(U)

ans = 611 510873/611 409827400/510873 50479800/2049137 3120997/10302 -1702299620/3120997 255000/40901 0

As expected, with symbolic computation `U(8,8)` is exactly zero.

How about SVD?

r = svd(R)

r = 1020.04901842999682384631379130551006 1020.04901842999682384631379130550858 1019.99999999999999999999999999999941 1019.90195135927848300282241090227735 999.999999999999999999999999999999014 999.999999999999999999999999999998817 0.0980486407215169971775890977220345302 0.0000000000000000000000000000000832757192990287779822645036097560521

The Rosser matrix is atypical because its characteristic polynomial factors over the rationals. So, even though it is of degree 8, the singular values are the roots of quadratic factors.

s = svd(S)

s = 10*10405^(1/2) 10*10405^(1/2) 1020 10*(1020*26^(1/2) + 5201)^(1/2) 1000 1000 10*(5201 - 1020*26^(1/2))^(1/2) 0

The relative error of the quadruple precision calculation.

double(norm(r - s)/norm(s))

ans = 9.293610246879066e-34

About 33 digits.

#### Postscript

Finally, verify that we've been working all this time with `fp128` and `sym` objects.

whos

Name Size Bytes Class Attributes A 3x3 3531 fp128 AX 3x3 3531 fp128 I 3x3 3531 fp128 L 8x8 8 sym M 4x4 6128 fp128 Me 4x1 1676 fp128 R 8x8 23936 fp128 RD 3x3 72 double S 8x8 8 sym U 8x8 8 sym USVT 3x3 3531 fp128 V 3x3 3531 fp128 X 3x3 3531 fp128 Z 3x3 3531 fp128 ans 1x1 8 double e 4x1 1676 fp128 kappa 1x1 563 fp128 p 8x1 64 double r 8x1 3160 fp128 s 8x1 8 sym

## 댓글

댓글을 남기려면 링크 를 클릭하여 MathWorks 계정에 로그인하거나 계정을 새로 만드십시오.