#%config InlineBackend.figure_format = 'svg'
from pylab import *
We use the decimal system for counting and representing fractions, which uses the digits $\{0,1,2,\ldots,9\}$. Real numbers are often written in floating point notation, i.e., as a fractional part in $(0,1)$ multiplied by a power of 10. While a real number may have an infinite number of digits in its fractional part, on a computer we have finite memory and can only store a few digits.
Floating point numbers $\float$
In general, we can use any base $\beta$ for the number system. On a computer using base $\beta$, any non-zero number $x$ is of the form
$$
x = s \cdot (.a_1 a_2 \ldots a_t)_\beta \cdot \beta^e = \textrm{(sign)} \cdot
\textrm{(fractional part)} \cdot \textrm{(exponent)}
$$
where
\begin{gather}
s = -1 \textrm{ or } +1 \\
0 \le a_i \le \beta - 1 \\
L \le e \le U \textrm{ is a signed integer}
\end{gather}
and $L < 0$ and $U > 0$ are integers. In base 10, the fractional part has the value

$$
(.a_1 a_2 \ldots a_t)_{10} = \frac{a_1}{10} + \frac{a_2}{10^2} + \ldots + \frac{a_t}{10^t}
$$

The four parameters $(\beta, t, L, U)$ specify the arithmetic characteristics of the floating point system. In such a system, any number $x$ is determined by

$$
\{ s, a_1, a_2, \ldots, a_t, e \}
$$
On modern computers, the base $\beta = 2$, i.e., we use the binary number system. There are only two digits, $\{0,1\}$, which can be easily represented in computer hardware.
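To make the binary representation concrete, here is a small sketch (the function name is ours, not from the text) that extracts the leading base-2 digits $a_1 a_2 \ldots$ of a fraction in $(0,1)$ by repeated doubling:

```python
def binary_fraction_digits(x, t):
    """Return the first t base-2 digits a_1, ..., a_t of x in (0, 1)."""
    digits = []
    for _ in range(t):
        x *= 2
        d = int(x)        # the next digit is the integer part after doubling
        digits.append(d)
        x -= d
    return digits

print(binary_fraction_digits(0.625, 4))   # 0.625 = (.1010)_2 exactly
print(binary_fraction_digits(0.1, 8))     # 0.1 = (.00011001...)_2, never terminates
```

Note that $0.1$ has no finite binary expansion, which is why it cannot be stored exactly on a binary computer.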
Since we have finite memory for each number $x \in \float$:

- We cannot represent every real number.
- There are gaps in $\float$.
- Very small and very large numbers cannot be represented in $\float$.
In spite of these limitations, modern computers represent real numbers with enough precision that we can obtain useful numerical answers to mathematical equations modeling nature.
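The gaps in $\float$ can be observed directly; a small sketch using `math.nextafter` (available in Python 3.9+):

```python
import math

# The next double after 1.0 differs from it by 2**-52: a genuine gap.
gap_at_one = math.nextafter(1.0, math.inf) - 1.0
print(gap_at_one == 2.0**-52)     # True

# Near 10**16 the gap has grown to 2, so odd integers there are skipped.
gap_at_1e16 = math.nextafter(1.0e16, math.inf) - 1.0e16
print(gap_at_1e16)                # 2.0
```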
+++
Chopping and rounding
Since a computer uses finite amount of information to represent numbers, most real numbers cannot be exactly represented on a computer. Hence they must be represented by a nearby number in the floating point system. This requires us to truncate a real number to fit into the floating point system, which can be done in two ways: chopping and rounding.
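The two modes can be sketched in base $\beta = 10$ with $t$ significant digits (the helper names are ours; double precision floats emulate the exact decimal arithmetic, which is adequate for a few digits):

```python
import math

def fl_chop(x, t):
    """Chop x > 0 to t significant base-10 digits, discarding the rest."""
    e = math.floor(math.log10(x)) + 1      # exponent with mantissa in [0.1, 1)
    m = x / 10**e
    return math.floor(m * 10**t) / 10**t * 10**e

def fl_round(x, t):
    """Round x > 0 to the nearest number with t significant base-10 digits."""
    e = math.floor(math.log10(x)) + 1
    m = x / 10**e
    return round(m * 10**t) / 10**t * 10**e

x = math.pi                     # 3.14159265...
print(fl_chop(x, 4))            # chops to 3.141
print(fl_round(x, 4))           # rounds to 3.142
```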
Consider a real number
$$
x = s \cdot (.a_1 a_2 \ldots a_t a_{t+1} \ldots )_\beta \cdot \beta^e, \qquad a_1 \ne 0
$$
which possibly requires an infinite number of digits $a_i$. Define the map $\textrm{fl} : \re \to \float$ by
$$
\fl{x} = \textrm{approximation of $x \in \re$ in the floating point system}
$$
Thus the floating point approximation $\fl{x}$ of $x \in \re$ satisfies

$$
\fl{x} = x (1 + \epsilon), \qquad |\epsilon| \le \uround
$$

where $\uround$ is the unit round derived below.
This type of analysis was introduced by Wilkinson (1963). It allows us to deal precisely with the effects of rounding/chopping in computer arithmetic operations.
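The relation $\fl{x} = x(1+\epsilon)$ can be checked exactly with rational arithmetic; a sketch assuming IEEE double precision, whose unit round is $2^{-53}$:

```python
from fractions import Fraction

u = Fraction(1, 2**53)        # unit round of IEEE double precision (with rounding)

x = Fraction(1, 10)           # the exact real number 0.1
fl_x = Fraction(0.1)          # Fraction(float) recovers the stored double exactly
eps = fl_x / x - 1            # so that fl(x) = x * (1 + eps)

print(float(eps))             # about 5.55e-17
print(abs(eps) <= u)          # True: |eps| <= u, as Wilkinson's model asserts
```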
+++
Proof for chopping
Take sign $s=+1$. Then $\fl{x} \le x$ and
$$
x - \fl{x} = (.00\ldots 0a_{t+1}\ldots)_\beta \cdot \beta^e
$$
Let $\gamma=\beta-1$ (largest digit in base $\beta$). Then

$$
x - \fl{x} \le (.00\ldots 0\gamma\gamma\ldots)_\beta \cdot \beta^e = \beta^{-t} \cdot \beta^e
$$

Since $a_1 \ne 0$, we have $|x| \ge (.1)_\beta \cdot \beta^e = \beta^{e-1}$, so that

$$
\frac{|x - \fl{x}|}{|x|} \le \beta^{1-t} = \uround
$$

where $\uround$ is the unit round of $\float$.
:::
By default, Python computes in double precision, where the unit round is $\uround = 2^{-53}$.
one = 1.0
u = 2.0**(-53)
x1 = one + u
x2 = one + 1.000001*u
x3 = one + 2.0*u
print("u = %50.40e" % u)
print("one = %46.40f" % one)
print("one + u = %46.40f" % x1)
print("one + 1.000001*u = %46.40f" % x2)
print("one + 2*u = %46.40f" % x3)
We see that $1 + \uround$ is rounded down to 1, while $1 + 1.000001\uround$ is rounded up to the next floating point number which is $1 + 2\uround$.
+++
:::{prf:example} With rounding on a binary computer, show that $\uround = 2^{-t}$
With rounding, the committed error is at most half of the chopping error, so the relative error bound improves to $\half \beta^{1-t}$. On a binary computer $\beta = 2$, and hence $\uround = \half \cdot 2^{1-t} = 2^{-t}$.
The range of numbers that are representable in the floating-point system is
$$
x_L \le |x| \le x_U
$$
Obviously, we cannot represent a number which is outside this range.

Overflow error: $|x| > x_U$
- leads to a fatal error in computation
- the program may terminate

Underflow error: $|x| < x_L$
- may not be a fatal error
- $\fl{x}$ is set to zero and computations can continue
- however, in some cases, accuracy may be severely lost
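Both situations are easy to trigger in double precision; a small sketch:

```python
import math

# Overflow: the result exceeds the largest double (about 1.8e308) -> inf.
print(1.0e308 * 10.0)          # inf

# Underflow: going below the smallest subnormal (about 5e-324) flushes to zero.
print(5.0e-324 / 2.0)          # 0.0

# Some library functions raise an exception on overflow instead of returning inf:
try:
    math.exp(1000.0)
except OverflowError:
    print("math.exp(1000.0) overflows")
```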
+++
IEEE floating point system
The IEEE floating point system was developed under the leadership of Prof. William Kahan of the University of California, Berkeley. Project 754 was formed in 1977 to produce the best possible definition of floating-point arithmetic. The IEEE 754 Standard gives
- definition of floating-point numbers
- rounding operations
- format conversion
- exception rules (invalid operation, division by zero, overflow, underflow)
IEEE single precision
- Fortran77: `real*4`, Fortran90: `real`, C/C++: `float`
- 32 bit word
  - 1 bit for sign
  - 8 bits for the biased exponent $(L=-126, U=127)$; of the $2^8 = 256$ exponent patterns, $254$ encode the exponents $-126, \ldots, 127$ and the remaining two are reserved for subnormals/zero and for Inf and NaN
  - 23 bits for the fractional part (mantissa)
- Largest number = $(2 - 2^{-23}) \cdot 2^{127} \approx 3.4028 \times 10^{38}$
- Smallest positive normalized number = $2^{-126} \approx 1.1755 \times 10^{-38}$
- Unit roundoff $\uround = 2^{-24} \approx 5.96 \times 10^{-8}$, which corresponds to about seven decimal places
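These single precision characteristics can be verified with NumPy (assuming it is available, as the `pylab` import above implies):

```python
import numpy as np

info = np.finfo(np.float32)
print(info.max)                  # largest finite float32: about 3.4028e+38
print(info.tiny)                 # smallest positive normalized: about 1.1755e-38
print(info.eps == 2.0**-23)      # spacing near 1.0; the unit round is eps/2 = 2**-24
```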
In IEEE double precision, the interval $[1,2]$ is approximated by the numbers

$$
\{ 1, 1 + 2^{-52}, 1 + 2 \times 2^{-52}, 1 + 3 \times 2^{-52}, \ldots, 2 \}
$$

with spacing $2^{-52}$. The interval $[2,4]$ is approximated by the numbers

$$
2 \times \{ 1, 1 + 2^{-52}, 1 + 2 \times 2^{-52}, 1 + 3 \times 2^{-52}, \ldots, 2 \}
$$
with spacing $2 \times 2^{-52}$. Similarly, the interval $[2^j, 2^{j+1}]$ for $j \ge 0$ is approximated by the same numbers as in $[1,2]$ but scaled by $2^j$. Thus as the numbers become larger, the spacing between them increases. Large numbers cannot be approximated accurately by floating point numbers, but the relative error is always less than the unit round; we can only expect relative accuracy on a computer, not absolute accuracy.
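The growth of the spacing can be seen with `math.ulp` (Python 3.9+), which returns the gap from its argument to the next larger float:

```python
import math

print(math.ulp(1.0) == 2.0**-52)    # True: spacing in [1, 2]
print(math.ulp(2.0) == 2.0**-51)    # True: spacing in [2, 4]
print(math.ulp(4.0) == 2.0**-50)    # True: spacing doubles with each binade

# The relative spacing is constant: only relative accuracy is available.
print(math.ulp(2.0**40) / 2.0**40 == 2.0**-52)   # True
```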
In the interval $(0,1)$, we have many more numbers of the form
$$
(.1 a_2 \ldots a_t)_2 \cdot 2^e, \quad L \le e \le 0
$$
With IEEE double precision, the interval $[1,2]$ is approximated by about $10^{16}$ floating point numbers. In a handful of solid or liquid, or a balloon of gas, the number of atoms/molecules along a line from one point to another is of the order of $10^8$ (the cube root of the Avogadro number). Such a system behaves like a continuum, enabling us to define physical quantities like density, pressure, stress, strain and temperature. Computer arithmetic is more than a million times finer than this.
The fundamental constants of physics are known to only a few decimal places.
- gravitational constant $G$: 4 digits
- Planck's constant $h$: 7 digits
- elementary charge $e$: 7 digits
- ratio of magnetic moment of electron to the Bohr magneton $\mu_e/\mu_B$: 12 digits
Nothing in physics is known to more than 12 or 13 digits of accuracy. IEEE numbers are orders of magnitude more precise than any number in science. Mathematical quantities like $\pi$ are of course known to more accuracy.
In the early days of numerical analysis, computers were not very precise, and numerical algorithms, especially conditioning and stability, were not well understood. Hence there was too much emphasis on studying rounding errors. Today, computer arithmetic is more precise and better understood, and we can control rounding errors through the stability of numerical algorithms.
> The main business of numerical analysis is designing algorithms that converge quickly; rounding error analysis is rarely the central issue. If rounding errors vanished, 99% of numerical analysis would remain. (Lloyd N. Trefethen)
+++
Propagation of errors
Let us consider the basic arithmetic operations: $+, -, *, /$. The computer version of these operations is an approximation due to rounding.
For small $x$, the two mathematically equivalent forms $1 - \cos(x)$ and $2\sin^2(x/2)$ behave very differently: the first form suffers from roundoff errors, since the subtraction cancels almost all significant digits, while the second one is more accurate. This may or may not be a problem. A situation where this is problematic is if you want to compute
$$
f(x) = \frac{1 - \cos(x)}{x^2}
$$
x = 1.0e-8   # a small argument; any sufficiently small x shows the effect
y = (1.0 - cos(x))/x**2
print("%24.14e" % y)
We will get an answer of zero, but the correct value is closer to $\half$.
z = 2.0*sin(0.5*x)**2 / x**2
print("%24.14e" % z)
:::
+++
:::{prf:example} Round-off error
Consider computing
$$
x = 10^{-8}, \qquad y = \sqrt{1+x^2} - 1
$$
x = 1.0e-8
y = sqrt(1.0 + x**2) - 1.0
print("%20.10e" % y)
The result is zero due to round-off error. An equivalent expression is
$$
y = \frac{x^2}{\sqrt{1 + x^2} + 1}
$$
y = x**2/(sqrt(1.0+x**2) + 1.0)
print("%20.10e" % y)
Consider computing the sum $S = x_1 + x_2 + \ldots + x_n$. We have $(n-1)$ additions, and the above estimate says that the relative error is proportional to the number of additions. Usually this is a pessimistic estimate since we took absolute values; in actual computations, the successive errors can cancel, leading to a much smaller total error in the sum.
Note that the coefficients of $x_i$ decrease with increasing $i$. If we want to minimize the error, we can sort the numbers in increasing order $x_1 \le x_2 \le \ldots \le x_n$ and then sum them. In practice, the $\epsilon_i$ can be positive or negative, so the precise behaviour can be case dependent.
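The effect of summation order can be demonstrated directly; a sketch comparing descending order, ascending order, and compensated summation (`math.fsum`) on $\sum_{k=1}^n 1/k^2$:

```python
import math

n = 10**6
terms = [1.0 / k**2 for k in range(1, n + 1)]   # terms decrease with k

s_desc = sum(terms)                 # largest terms first
s_asc = sum(reversed(terms))        # smallest terms first, as suggested above
s_exact = math.fsum(terms)          # compensated summation, essentially exact

print(abs(s_desc - s_exact))        # larger error: small terms get swallowed
print(abs(s_asc - s_exact))         # smaller error
```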
+++
Function evaluation
Suppose we want to compute a function value
$$
y = f(x), \qquad x \in \re
$$
There are two types of errors we have to deal with on a computer. Firstly, the real number has to be approximated by a floating point number
$$
\hatx = \fl{x}
$$
Secondly, the function $f$ may be computed only approximately as $\hatf$. The only operations a computer can perform directly are additions and multiplications, and even these incur roundoff errors. Any other type of computation, even division, can only be performed approximately. E.g., if $f(x) = \sin(x)$, a computer will evaluate it approximately by some truncated Taylor approximation of $\sin(x)$. So what we actually evaluate on a computer is $\hatf(\hatx)$, and the total error splits as

$$
|f(x) - \hatf(\hatx)| \le |f(x) - \hatf(x)| + |\hatf(x) - \hatf(\hatx)|
$$

where
- $|f(x) - \hatf(x)|$: numerical discretization error; can be controlled by using an accurate algorithm.
- $|\hatf(x) - \hatf(\hatx)|$: transmission of round-off error by the numerical scheme; can be controlled by the stability of the numerical scheme and by using enough precision so that $x - \hatx$ is small.
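To illustrate the split, here is a sketch where $\hatf$ is a truncated Taylor series for $\sin$ and rational arithmetic isolates the two error terms (all names and the choice $x = 0.1$ are ours):

```python
import math
from fractions import Fraction

def sin_taylor(x, n_terms=4):
    """Truncated Taylor series for sin; works for float or Fraction x."""
    s, term = x * 0, x
    for k in range(n_terms):
        s += term
        term = -term * x * x / ((2 * k + 2) * (2 * k + 3))
    return s

x = Fraction(1, 10)            # the exact real number x = 0.1
x_hat = Fraction(0.1)          # fl(x): the stored double, recovered exactly

# |f(x) - fhat(x)|: discretization error from truncating the series
discr = abs(math.sin(0.1) - float(sin_taylor(x)))

# |fhat(x) - fhat(xhat)|: how the scheme transmits the rounding error x - fl(x)
trans = abs(float(sin_taylor(x) - sin_taylor(x_hat)))

print(discr)    # about the first neglected term x**9/9!, near 2.8e-15
print(trans)    # tiny: comparable to |x - fl(x)|, near 5.6e-18
```

The discretization error dominates here and shrinks if more Taylor terms are used, while the transmitted rounding error is fixed by the precision of $\fl{x}$.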