# Floating Point Arithmetic

Floating point system

Floating point arithmetic

IEEE floating point standard

## Floating point arithmetic

### Loss of associativity in floating point system

Let $x$ be a nonzero floating point number small enough that $1+x$ evaluates to $1$ in floating point arithmetic. A simple example of loss of associativity is given by the sum $x+1-1$: clearly $(x+1)+(-1) = 0$, which is not the same as $x+(1+(-1)) = x$.
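In IEEE double precision this is easy to demonstrate. A minimal sketch, taking $x = 10^{-16}$, which lies below double precision's machine epsilon so that $1+x$ rounds back to $1$:

```python
# x is nonzero but smaller than double precision's machine epsilon,
# so fl(1 + x) = 1 and the x is lost in the left-to-right grouping.
x = 1e-16

left_to_right = (x + 1) + (-1)   # fl(1 + x) = 1, then 1 - 1 = 0
regrouped = x + (1 + (-1))       # 1 + (-1) = 0 exactly, so x survives

print(left_to_right)  # 0.0
print(regrouped)      # 1e-16
```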

Another example:
Let $m$ be the smallest positive floating point number and $M$ the largest. Then $(10^{-2} \times m) \times M$ and $10^{-2} \times (m \times M)$ are not the same, since $10^{-2} \times m$ underflows to zero.
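The same effect can be seen with Python floats (IEEE doubles), taking $m$ as the smallest positive (subnormal) double and $M$ as the largest finite double:

```python
import sys

m = 5e-324               # smallest positive (subnormal) double
M = sys.float_info.max   # largest finite double

a = (1e-2 * m) * M       # 1e-2 * m underflows to 0.0, so a == 0.0
b = 1e-2 * (m * M)       # m * M is about 9e-16, so b is nonzero

print(a, b)
```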

In a computer system whose number representation is a normalized 4-digit
mantissa with a 2-digit exponent, and assuming rounding is used in the
4-digit arithmetic registers, consider the following calculations:

1. p+q+r+s = 20
2. p+q+s+r = 26
3. q+r+p+s = 30

where p = 75640, q = 4, r = 6, s = -75620. Explain the results.

The floating point representations of the numbers are:
p = 0.7564 x 10^5, q = 0.4000 x 10^1, r = 0.6000 x 10^1, s = -0.7562 x 10^5.

Under the usual rule of adding from left to right:

### p+q+r+s = 20

p+q:
\begin{verbatim}
   0.7564 x 10^5         0.7564 x 10^5
+) 0.4000 x 10^1  ==> +) 0.0000 x 10^5   (matching exponents)
                      -----------------
                         0.7564 x 10^5
\end{verbatim}

p+q+r:
\begin{verbatim}
   0.7564 x 10^5         0.7564 x 10^5
+) 0.6000 x 10^1  ==> +) 0.0000 x 10^5   (matching exponents)
                      -----------------
                         0.7564 x 10^5
\end{verbatim}

p+q+r+s:
\begin{verbatim}
    0.7564 x 10^5
+) -0.7562 x 10^5
-----------------
    0.0002 x 10^5  =  0.2000 x 10^2
\end{verbatim}

After normalizing and padding with spurious zeros, the final result has only 1 reliable digit.

### p+q+s+r = 26

p+q:
\begin{verbatim}
   0.7564 x 10^5         0.7564 x 10^5
+) 0.4000 x 10^1  ==> +) 0.0000 x 10^5   (matching exponents)
                      -----------------
                         0.7564 x 10^5
\end{verbatim}

p+q+s:
\begin{verbatim}
    0.7564 x 10^5
+) -0.7562 x 10^5
-----------------
    0.0002 x 10^5  =  0.2000 x 10^2
\end{verbatim}

p+q+s+r:
\begin{verbatim}
   0.2000 x 10^2         0.2000 x 10^2
+) 0.6000 x 10^1  ==> +) 0.0600 x 10^2   (matching exponents)
                      -----------------
                         0.2600 x 10^2
\end{verbatim}

Again the final result has only 1 (not 2) reliable digit.

### q+r+p+s = 30

q+r:
\begin{verbatim}
   0.4000 x 10^1
+) 0.6000 x 10^1
----------------
   1.000 x 10^1  =  0.1000 x 10^2
\end{verbatim}

q+r+p:
\begin{verbatim}
   0.1000 x 10^2         0.0001 x 10^5   (matching exponents)
+) 0.7564 x 10^5  ==> +) 0.7564 x 10^5
                      -----------------
                         0.7565 x 10^5
\end{verbatim}

q+r+p+s:
\begin{verbatim}
    0.7565 x 10^5
+) -0.7562 x 10^5
-----------------
    0.0003 x 10^5  =  0.3000 x 10^2
\end{verbatim}

After normalizing and padding with spurious zeros, again the final result has only 1 reliable digit.
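These three orderings can be checked mechanically. The sketch below simulates a normalized 4-digit decimal machine; one modelling assumption, consistent with the worked columns above (e.g. q = 4 becoming 0.0000 x 10^5), is that digits shifted off the end of the register during exponent matching are simply dropped, with no guard digits:

```python
from decimal import Decimal, ROUND_DOWN

def simulate_sum(nums, digits=4):
    """Sum nums left to right on a normalized `digits`-digit decimal
    machine.  Assumption (matching the worked columns): when exponents
    are matched, digits shifted off the end of the register are simply
    dropped -- there are no guard digits."""
    total = Decimal(nums[0])
    for x in nums[1:]:
        b = Decimal(x)
        E = max(total.adjusted(), b.adjusted()) + 1  # value = 0.dddd x 10^E
        unit = Decimal(1).scaleb(E - digits)         # weight of last kept digit
        total = (total.quantize(unit, rounding=ROUND_DOWN)
                 + b.quantize(unit, rounding=ROUND_DOWN))
    return total

p, q, r, s = 75640, 4, 6, -75620
for order in ([p, q, r, s], [p, q, s, r], [q, r, p, s]):
    print(int(simulate_sum(order)))   # 20, then 26, then 30
```

The same four numbers give three different answers purely because of the order of the additions.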

{\bf More on floating point systems}

5-bit representation
$x = \sigma \overline{x} 2^e$ with $\frac{1}{2} \leq \overline{x} < 1.$

i.e. $\overline{x}= 0.1**...*$ in binary

\begin{verbatim}
            sign   exponent   mantissa
position     1       2-3        4-5
# bits       1        2          3   (including hidden bit)
\end{verbatim}

Consider only positive numbers here : (sign= 1)
\begin{verbatim}
exponent : 00 01 10 11
value : * -1 0 1 ==> range of exponent [-1,1]
if exponent is 00, the number is 0

exponent mantissa number in binary number in decimal
00 any 0 0

01 00 0.100 x 2^-1= 0.01 0.25
01 01 0.101 x 2^-1= 0.0101 0.3125
01 10 0.110 x 2^-1= 0.0110 0.375
01 11 0.111 x 2^-1= 0.0111 0.4375

10 00 0.100 x 2^0= 0.1 0.5
10 01 0.101 x 2^0= 0.101 0.625
10 10 0.110 x 2^0= 0.110 0.75
10 11 0.111 x 2^0= 0.111 0.875

11 00 0.100 x 2^1= 1.0 1
11 01 0.101 x 2^1= 1.01 1.25
11 10 0.110 x 2^1= 1.10 1.5
11 11 0.111 x 2^1= 1.11 1.75
\end{verbatim}
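The table can be generated by brute force. A sketch enumerating the positive values of this toy 5-bit format:

```python
# Enumerate the positive numbers of the toy 5-bit format above:
# the exponent field 01/10/11 encodes e in {-1, 0, 1} (00 encodes zero),
# and the 2 stored mantissa bits mm sit behind a hidden leading 1: 0.1mm.
values = []
for e in (-1, 0, 1):
    for mm in range(4):
        mantissa = 0.5 + mm / 8      # 0.1mm in binary
        values.append(mantissa * 2.0 ** e)

print(sorted(values))  # 0.25, 0.3125, ..., 1.75 as in the table
```

Note how the 12 values are not equally spaced: the gaps double each time the exponent increases.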

Other possibilities

(i) Inclusion of other special numbers:
\begin{verbatim}
00 00 0 0
00 01 + infinity
00 10 - infinity
00 11 NaN
\end{verbatim}

(ii) More bits for exponent (e.g. 3-bit exponent, 2-bit mantissa)
\begin{verbatim}
exponent : 000 001 010 011 100 101 110 111
value : * -3 -2 -1 0 1 2 3
==> range of exponent [-3, 3]

exponent mantissa number in binary number in decimal
000 any 0 0
001 0 0.10 x 2^-3= 0.0001 0.0625
001 1 0.11 x 2^-3= 0.00011 0.09375
010 0 0.10 x 2^-2= 0.001 0.125
010 1 0.11 x 2^-2= 0.0011 0.1875
011 0 0.10 x 2^-1= 0.01 0.25
011 1 0.11 x 2^-1= 0.011 0.375
100 0 0.10 x 2^0= 0.1 0.5
100 1 0.11 x 2^0= 0.11 0.75
101 0 0.10 x 2^1= 1.0 1
101 1 0.11 x 2^1= 1.1 1.5
110 0 0.10 x 2^2= 10 2
110 1 0.11 x 2^2= 11 3
111 0 0.10 x 2^3= 100 4
111 1 0.11 x 2^3= 110 6
\end{verbatim}

Worst possible error in chopping ($x$ can lie almost a full gap above $\mbox{fl}(x)$):
\begin{verbatim}|------------x|\end{verbatim}

Worst possible error in rounding ($x$ is at most half a gap from $\mbox{fl}(x)$):
\begin{verbatim}|------x------|\end{verbatim}
% relative error: |x - fl(x)|/|x| <= 2^{1-n} for chopping, 2^{-n} for rounding
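A toy model of these two bounds in Python (an illustrative sketch, not IEEE behaviour, and for positive x only): chopping to an n-bit mantissa can be off by almost a full unit in the last place, rounding by at most half a unit.

```python
import math

def fl(x, n, mode):
    """Reduce positive x to an n-bit binary mantissa 0.1d...d * 2^e,
    either by chopping or by round-to-nearest (toy model)."""
    e = math.floor(math.log2(x)) + 1        # so that 0.5 <= x / 2^e < 1
    scaled = x * 2.0 ** (n - e)             # mantissa digits as integer part
    kept = math.floor(scaled) if mode == "chop" else round(scaled)
    return kept * 2.0 ** (e - n)

x, n = 0.1, 8
# |x - fl(x)| < 2^(e-n) for chopping, and at most half that for rounding
# (here e = -3, so the bounds are 2^-11 and 2^-12):
print(abs(x - fl(x, n, "chop")))
print(abs(x - fl(x, n, "round")))
```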

\begin{enumerate}
\item The machine epsilon $\neq$ smallest positive floating point number

smallest positive floating point number is given by

mantissa bits all zero

exponent bits give most negative exponent

e.g. vax: $0.100\ldots0 \times 2^{-127}$;
IEEE: $1.000\ldots0 \times 2^{-126}$

Machine epsilon: smallest number $\epsilon$ s.t. $\mbox{fl}(1+ \epsilon) \neq 1$

The distance between 1 and the next larger floating point number is
$2^{1-n}$, i.e. 2*(machine epsilon) under rounding.

\noindent [Remark: use of chopping/rounding affects the value of the machine epsilon
because of the addition operation in the definition]
\begin{itemize}
\item rounding: $2^{-n}$ (more generally $\frac{1}{2} \beta^{1-n}$)
\item chopping: $2^{1-n}$ (more generally $\beta^{1-n}$)
\end{itemize}
where $n$ is the no. of binary digits in the mantissa (including any
hidden bit)

e.g. (rounding) vax and IEEE single precision: $n=24$, so machine epsilon $= 2^{-24}$.
Note: this is much LARGER than the smallest positive floating point number.
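The machine epsilon can be found experimentally. For Python floats (IEEE double precision, $n = 53$, rounding), the loop below halves eps until $1 + \mbox{eps}/2$ collapses to 1; it ends at $2^{-52}$, the distance between 1 and the next larger double, which is exactly what Python reports as sys.float_info.epsilon:

```python
import sys

eps = 1.0
while 1.0 + eps / 2 != 1.0:   # stop once 1 + eps/2 rounds back to 1
    eps /= 2

print(eps)                     # 2**-52: gap between 1.0 and the next float
print(sys.float_info.epsilon)  # the same value, as reported by Python
```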

\item The ``largest integer'' $M$ exactly representable in the floating point system:

-- all integers in $[0,M]$ are exactly representable in the floating point
system.

Mantissa bits: $11\ldots1$ ($k-1$ binary digits) [excluding the hidden bit]

exponent bits give the value $k$ (i.e. the factor $2^k$ in the vax form below):
this is ok if $k$ lies within the range of the exponent

e.g. vax: exponent range $[-127,127]$, $k= 24$;
IEEE: exponent range $[-126,127]$, $k= 24$.

vax type: $0.11\ldots1 \times 2^k$

\{ $k$ binary digits (including the hidden bit) behind the binary point \}

IEEE type: $1.11\ldots1 \times 2^{k-1}$
\{ $k-1$ binary digits behind the binary point; the hidden bit sits in front of it \}
\{ if $1.11\ldots1 \times 2^k$ were used instead, its value $2^{k+1}-2$ is representable, but we cannot guarantee all integers below it are, since the spacing of floating point numbers there is 2 \}
$$\mbox{value}= \left( \frac{1}{2}+ \frac{1}{2^2}+ \cdots + \frac{1}{2^k} \right) 2^k = \frac{1}{2} \cdot \frac{1- (1/2)^k}{1- 1/2} \cdot 2^k = 2^k - 1$$

If the exponent value $k+1$ is also representable in the exponent field (usually
ok), we have $M=2^k$: set the mantissa bits to $00\ldots0$ (vax type: $0.100\ldots0 \times 2^{k+1}$).

Now all integers between 0 and $2^k$ are representable exactly; however,
$2^{k}+ 1$ cannot be exactly represented, since we run out of mantissa bits.
\end{enumerate}
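For IEEE double precision, $n = 53$ mantissa bits (hidden bit included) gives $M = 2^{53}$; a quick check in Python:

```python
# IEEE double has n = 53 mantissa bits (hidden bit included), so every
# integer in [0, 2^53] is exact, but 2^53 + 1 falls in a gap of width 2.
M = 2 ** 53

assert float(M) == M          # 2^53 itself is representable (mantissa 00...0)
assert float(M - 1) == M - 1  # and so is every smaller integer
assert float(M + 1) == M      # but 2^53 + 1 rounds back to 2^53
print("M =", M)
```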

Tools for handling floating point error:
\begin{itemize}
\item forward error analysis -- tedious
\item backward error analysis -- difficult but worthwhile
\item interval analysis
\item multiple precision calculation
\item error-free computation
\end{itemize}