# Machine epsilon

Machine epsilon or machine precision gives an upper bound on the relative approximation error due to rounding in floating point arithmetic. This value characterizes computer arithmetic in the field of numerical analysis, and by extension in the subject of computational science. The quantity is also called macheps or unit roundoff, and it has the symbols Greek epsilon $\varepsilon$ or bold Roman u, respectively.

## Values for standard hardware floating-point arithmetics

The following values of machine epsilon apply to standard floating-point formats:

IEEE 754 - 2008 Common name C++ data type Base $b$  Precision $p$  Machine epsilon[a] $b^{-(p-1)}/2$  Machine epsilon[b] $b^{-(p-1)}$
binary16 half precision N/A 2 11 (one bit is implicit) 2−11 ≈ 4.88e-04 2−10 ≈ 9.77e-04
binary32 single precision float 2 24 (one bit is implicit) 2−24 ≈ 5.96e-08 2−23 ≈ 1.19e-07
binary64 double precision double 2 53 (one bit is implicit) 2−53 ≈ 1.11e-16 2−52 ≈ 2.22e-16
extended precision, long double _float80 2 64 2−64 ≈ 5.42e-20 2−63 ≈ 1.08e-19
binary128 quad(ruple) precision _float128 2 113 (one bit is implicit) 2−113 ≈ 9.63e-35 2−112 ≈ 1.93e-34
decimal32 single precision decimal _Decimal32 10 7 5 × 10−7 10−6
decimal64 double precision decimal _Decimal64 10 16 5 × 10−16 10−15
decimal128 quad(ruple) precision decimal _Decimal128 10 34 5 × 10−34 10−33
1. ^ According to Prof. Demmel, LAPACK, Scilab
2. ^ According to Prof. Higham; ISO C standard; Ada, C, C++ and Python language constants; Mathematica, MATLAB and Octave; various textbooks - see below for the latter definition

## Formal definition

Rounding is a procedure for choosing the representation of a real number in a floating point number system. For a number system and a rounding procedure, machine epsilon is the maximum relative error of the chosen rounding procedure.

Some background is needed to determine a value from this definition. A floating point number system is characterized by a radix which is also called the base, $b$ , and by the precision $p$ , i.e. the number of radix $b$  digits of the significand (including any leading implicit bit). All the numbers with the same exponent, $e$ , have the spacing, $b^{e-(p-1)}$ . The spacing changes at the numbers that are perfect powers of $b$ ; the spacing on the side of larger magnitude is $b$  times larger than the spacing on the side of smaller magnitude.

Since machine epsilon is a bound for relative error, it suffices to consider numbers with exponent $e=0$ . It also suffices to consider positive numbers. For the usual round-to-nearest kind of rounding, the absolute rounding error is at most half the spacing, or $b^{-(p-1)}/2$ . This value is the biggest possible numerator for the relative error. The denominator in the relative error is the number being rounded, which should be as small as possible to make the relative error large. The worst relative error therefore happens when rounding is applied to numbers of the form $1+a$  where $a$  is between $0$  and $b^{-(p-1)}/2$ . All these numbers round to $1$  with relative error $a/(1+a)$ . The maximum occurs when $a$  is at the upper end of its range. The $1+a$  in the denominator is negligible compared to the numerator, so it is left off for expediency, and just $b^{-(p-1)}/2$  is taken as machine epsilon. As has been shown here, the relative error is worst for numbers that round to $1$ , so machine epsilon also is called unit roundoff meaning roughly "the maximum error that can occur when rounding to the unit value".

Thus, the maximum spacing between a normalised floating point number, $x$ , and an adjacent normalised number is $2\varepsilon |x|$ .

## Arithmetic model

Numerical analysis uses machine epsilon to study the effects of rounding error. The actual errors of machine arithmetic are far too complicated to be studied directly, so instead, the following simple model is used. The IEEE arithmetic standard says all floating-point operations are done as if it were possible to perform the infinite-precision operation, and then, the result is rounded to a floating-point number. Suppose (1) $x$ , $y$  are floating-point numbers, (2) $\bullet$  is an arithmetic operation on floating-point numbers such as addition or multiplication, and (3) $\circ$  is the infinite precision operation. According to the standard, the computer calculates:

$x\bullet y={\mbox{round}}(x\circ y)$

By the meaning of machine epsilon, the relative error of the rounding is at most machine epsilon in magnitude, so:

$x\bullet y=(x\circ y)(1+z)$

where $z$  in absolute magnitude is at most $\varepsilon$  or u. The books by Demmel and Higham in the references can be consulted to see how this model is used to analyze the errors of, say, Gaussian elimination.

## Variant definitions

The IEEE standard does not define the terms machine epsilon and unit roundoff, so differing definitions of these terms are in use, which can cause some confusion.

The definition given here for machine epsilon is the one used by Prof. James Demmel in lecture scripts, the LAPACK linear algebra package, numerics research papers and some scientific computing software. Most numerical analysts use the words machine epsilon and unit roundoff interchangeably with this meaning.

The following different definition is much more widespread outside academia: Machine epsilon is defined as the difference between 1 and the next larger floating point number. By this definition, $\varepsilon$  equals the value of the unit in the last place relative to 1, i.e. $b^{-(p-1)}$ , and the unit roundoff is u$\,=\varepsilon /2$ , assuming round-to-nearest mode. The prevalence of this definition is rooted in its use in the ISO C Standard for constants relating to floating-point types and corresponding constants in other programming languages. It is also widely used in scientific computing software, and in the numerics and computing literature.

## How to determine machine epsilon

Where standard libraries do not provide precomputed values (as <float.h> does with FLT_EPSILON, DBL_EPSILON and LDBL_EPSILON for C and <limits> does with std::numeric_limits<T>::epsilon() in C++), the best way to determine machine epsilon is to refer to the table, above, and use the appropriate power formula. Computing machine epsilon is often given as a textbook exercise. The following examples compute machine epsilon in the sense of the spacing of the floating point numbers at 1 rather than in the sense of the unit roundoff.

Note that results depend on the particular floating-point format used, such as float, double, long double, or similar as supported by the programming language, the compiler, and the runtime library for the actual platform.

Some formats supported by the processor might not be supported by the chosen compiler and operating system. Other formats might be emulated by the runtime library, including arbitrary-precision arithmetic available in some languages and libraries.

In a strict sense the term machine epsilon means the $1+\varepsilon$  accuracy directly supported by the processor (or coprocessor), not some $1+\varepsilon$  accuracy supported by a specific compiler for a specific operating system, unless it's known to use the best format.

IEEE 754 floating-point formats have the property that, when reinterpreted as a two's complement integer of the same width, they monotonically increase over positive values and monotonically decrease over negative values (see the binary representation of 32 bit floats). They also have the property that $0<|f(x)|<\infty$ , and $|f(x+1)-f(x)|\geq |f(x)-f(x-1)|$  (where $f(x)$  is the aforementioned integer reinterpretation of $x$ ). In languages that allow type punning and always use IEEE 754-1985, we can exploit this to compute a machine epsilon in constant time. For example, in C:

typedef union {
long long i64;
double d64;
} dbl_64;
double machine_eps (double value)
{
dbl_64 s;
s.d64 = value;
s.i64++;
return s.d64 - value;
}


Example on Python:

def machineEpsilon(func=float):
machine_epsilon = func(1)
while func(1)+machine_epsilon != func(1):
machine_epsilon_last = machine_epsilon
machine_epsilon = func(machine_epsilon) / func(2)
return machine_epsilon_last


This will give a result of the same sign as value. If a positive result is always desired, the return statement of machine_eps can be replaced with:

    return (s.i64 < 0 ? value - s.d64 : s.d64 - value);


64-bit doubles give 2.220446e-16, which is 2−52 as expected.

### Approximation

The following simple algorithm can be used to approximate[clarification needed] the machine epsilon, to within a factor of two (one order of magnitude) of its true value, using a linear search.

epsilon = 1.0;

while (1.0 + 0.5 * epsilon) ≠ 1.0:
epsilon = 0.5 * epsilon


The machine epsilon, ${\textstyle \epsilon _{{mac}h}}$  can also simply be calculated as two to the negative power of the number of bits used for the mantissa.

{\begin{aligned}\epsilon _{mach}\ &=\ 2^{-{bits\ used\ for\ magnitude\ of\ mantissa}}\end{aligned}}

## Relation of representing a number to machine epsilon

Prove that if ${\textstyle y}$  is the machine number representation of a number ${\textstyle \ x}$ , then show that the absolute relative true error in the representation is

${\textstyle \displaystyle \left|{\frac {x-y}{x}}\right|\leq \epsilon _{mach}}$

Proof: 

Let's limit the proof here to positive numbers and the use of chopping for machine representation.

If ${\textstyle x}$  is a positive number we want to represent, it will be between a machine number ${\textstyle x_{b}}$  below ${\textstyle x}$  and a machine number ${\textstyle x_{u}}$  above ${\textstyle \ x}$ .

If

$x_{b}=\left(1.b_{1}b_{2}\ldots b_{m}\right)_{2}\times 2^{k}$

where

$m={\text{number of bits used for the magnitude of the mantissa, then}}$

{\begin{aligned}x_{u}&=\left\lbrack (1.b_{1}b_{2}\ldots b_{m})_{2}+(0.00\ldots 1)_{2}\right\rbrack \times 2^{k}\\&=\left\lbrack (1.b_{1}b_{2}\ldots b_{m})_{2}+2^{-m}\right\rbrack \times 2^{k}\\&=(1.b_{1}b_{2}\ldots b_{m})_{2}\times 2^{k}+2^{-m}\times 2^{k}\\&=(1.b_{1}b_{2}\ldots b_{m})_{2}\times 2^{k}+2^{-m+k}\end{aligned}}

Since ${\textstyle x}$  is getting represented either as ${\textstyle x_{b}}$  or ${\textstyle {x}_{u}}$ ,

{\begin{aligned}\left|x-y\right|&\leq \left|x_{b}-x_{u}\right|\\&=2^{-m+k}\end{aligned}}

{\begin{aligned}\left|{\frac {x-y}{x}}\right|&\leq {\frac {2^{-m+k}}{x}}\\&\leq {\frac {2^{-m+k}}{x_{b}}}\\&={\frac {2^{-m+k}}{(1\cdot b_{1}b_{2}\ldots b_{m})_{2}2^{k}}}\\&={\frac {2^{-m}}{(1\cdot b_{1}b_{2}\ldots b_{m})_{2}}}\\&\leq 2^{-m}=\varepsilon _{mach}\end{aligned}}

The above proof is for positive numbers that use chopping for machine representation. You can prove the same for negative numbers or when rounding is used for machine representation.

## Notes and references

1. ^ a b Floating Types - Using the GNU Compiler Collection (GCC)
2. ^ a b c Decimal Float - Using the GNU Compiler Collection (GCC)
3. ^ Higham, Nicholas (2002). Accuracy and Stability of Numerical Algorithms (2 ed). SIAM. p. 37.
4. ^ "Basic Issues in Floating Point Arithmetic and Error Analysis". 21 Oct 1999. Retrieved 11 Apr 2013.
5. ^ "LAPACK Users' Guide Third Edition". 22 August 1999. Retrieved 9 March 2012.
6. ^
7. ^ "Scilab documentation - number_properties - determine floating-point parameters". Retrieved 11 Apr 2013.
8. ^ note that here p is defined as the precision, i.e. the total number of bits in the significand including implicit leading bit, as used in the table above
9. ^ Jones, Derek M. (2009). The New C Standard - An Economic and Cultural Commentary (PDF). p. 377.
10. ^ "float.h reference at cplusplus.com". Retrieved 11 Apr 2013.
11. ^ "std::numeric_limits reference at cplusplus.com". Retrieved 11 Apr 2013.
12. ^ "Python documentation - System-specific parameters and functions". Retrieved 11 Apr 2013.
13. ^ "Mathematica documentation: \$MachineEpsilon". Retrieved 11 Apr 2013.
14. ^ "Matlab documentation - eps - Floating-point relative accuracy". Archived from the original on 2013-08-07. Retrieved 11 Apr 2013.
15. ^ "Octave documentation - eps function". Retrieved 11 Apr 2013.
16. ^ Higham, Nicholas (2002). Accuracy and Stability of Numerical Algorithms (2 ed). SIAM. pp. 27–28.
17. ^ Quarteroni, Alfio; Sacco, Riccardo; Saleri, Fausto (2000). Numerical Mathematics (PDF). Springer. p. 49. ISBN 0-387-98959-5. Archived from the original (PDF) on 2017-11-14. Retrieved 2013-04-11.
18. ^ Press, William H.; Teukolsky, Saul A.; Vetterling, William T.; Flannery, Brian P. Numerical Recipes. p. 890.
19. ^ Engeln-Müllges, Gisela; Reutter, Fritz (1996). Numerik-Algorithmen. p. 6. ISBN 3-18-401539-4.
20. ^ "machine epsilon value for IEEE double precision standard alternative proof using relative error". 12 October 2020. Retrieved 5 May 2022.
• Anderson, E.; LAPACK Users' Guide, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, third edition, 1999.
• Cody, William J.; MACHAR: A Soubroutine to Dynamically Determine Machine Parameters, ACM Transactions on Mathematical Software, Vol. 14(4), 1988, 303-311.
• Besset, Didier H.; Object-Oriented Implementation of Numerical Methods, Morgan & Kaufmann, San Francisco, CA, 2000.
• Demmel, James W., Applied Numerical Linear Algebra, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 1997.
• Higham, Nicholas J.; Accuracy and Stability of Numerical Algorithms, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, second edition, 2002.
• Press, William H.; Teukolsky, Saul A.; Vetterling, William T.; and Flannery, Brian P.; Numerical Recipes in Fortran 77, 2nd ed., Chap. 20.2, pp. 881–886
• Forsythe, George E.; Malcolm, Michael A.; Moler, Cleve B.; "Computer Methods for Mathematical Computations", Prentice-Hall, ISBN 0-13-165332-6, 1977