CS 3843 Computer Organization
Notes on Chapter 2

Section 2.4 - Floating Point

Scientific notation: V = x * 10^y, with 1 <= x < 10.
In computer science we use V = x * 2^y, with 1 <= x < 2.
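For example, 5.5 = 101.1 in binary = 1.011 * 2^2, so x = 1.011 (binary) and y = 2.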
We will be looking at the IEEE standard for floating point numbers.

Section 2.4.1 - Fractional Binary Numbers


Today's News: January 31
No news yet.

Section 2.4.2 - IEEE floating point representation

This representation is similar to scientific notation. IEEE floating point can be 32 bits (single precision), 64 bits (double precision), or 80 bits (extended precision).
Each has the form:
    ------------------------
   |s|  exp   |     frac    |
    ------------------------
Number of bits:
precision   s   exp   frac   total
single      1    8     23      32
double      1   11     52      64
extended    1   15     64      80

For each of these there are four different cases: normalized, denormalized, infinity, and NaN.

Normalized:
Example A: a single precision normalized number: 0 10001100 11011011011011000000000
exp = 10001100 (binary) = 140
Bias = 2^7 - 1 = 127
E = 140 - 127 = 13.
frac = 11011011011011000...
M = 1 + .11011011011011 = 1.11011011011011
Result: 1.11011011011011 * 2^13 = 11101101101101.1 (binary) = 15213.5
Note: The smallest positive normalized number is obtained by taking exp = 1, frac = 0, giving 1.0 * 2^(1 - Bias).
For single precision this is 2^-126, or approximately 10^-38.
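A small C sketch (not from the book) that pulls these three fields out of a float's bit pattern; the field widths match the single precision layout above:

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void) {
        float f = 15213.5f;                   /* the value from Example A */
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);       /* copy the raw 32 bits */

        unsigned s    = bits >> 31;           /* 1 sign bit */
        unsigned exp  = (bits >> 23) & 0xff;  /* 8-bit biased exponent */
        unsigned frac = bits & 0x7fffff;      /* 23-bit fraction */

        /* Expect s=0, exp=140 (so E = 140 - 127 = 13) */
        printf("s=%u exp=%u E=%d frac=0x%06x\n", s, exp, (int)exp - 127, frac);
        return 0;
    }

memcpy is used instead of a pointer cast so the bit copy does not run into strict-aliasing problems.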
Floating Point Single Precision Example


Today's News: February 3
No news yet.

Example B: How is 25 represented?
Example C: What decimal number is represented by the single precision number:
0 10000000 00000000000000000000000
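One way to work Example B (the reverse of the steps in Example A); Example C can be decoded the same way as Example A:
25 = 11001 (binary) = 1.1001 * 2^4, so E = 4, exp = E + Bias = 4 + 127 = 131 = 10000011 (binary), frac = 1001 followed by 19 zeros, and s = 0:
0 10000011 10010000000000000000000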


Denormalized:
Note: The largest denormalized number is obtained by taking frac = 11111...111111, so M is almost 1, giving almost 1.0 * 2^(1 - Bias).
For single precision this is almost 2^-126, just smaller than the smallest positive normalized number.
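A quick C check of these two boundary values; FLT_MIN is the smallest positive normalized float, and nextafterf steps one representable value below it, into the denormalized range:

    #include <stdio.h>
    #include <float.h>
    #include <math.h>

    int main(void) {
        float smallest_norm  = FLT_MIN;                     /* 2^-126, about 1.18e-38 */
        float largest_denorm = nextafterf(FLT_MIN, 0.0f);   /* one step below FLT_MIN */

        printf("smallest normalized:  %g\n", smallest_norm);
        printf("largest denormalized: %g\n", largest_denorm);
        return 0;
    }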

Special Values
These have exp all 1's.
Floating Point Representation Introduction


Important property:
The non-negative floating point numbers can be ordered using their bit representations, treated as unsigned quantities.
Comparisons do not need floating point computations.
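As an illustration (not from the notes), comparing two non-negative floats through their unsigned bit patterns in C:

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    /* For non-negative floats, a < b exactly when bits(a) < bits(b). */
    static uint32_t float_bits(float f) {
        uint32_t u;
        memcpy(&u, &f, sizeof u);
        return u;
    }

    int main(void) {
        float a = 1.5f, b = 2.25f;
        printf("%d %d\n", a < b, float_bits(a) < float_bits(b));   /* both print 1 */
        return 0;
    }

The same trick does not work directly once negative numbers are involved, since the sign bit makes negative values look like large unsigned quantities.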

Section 2.4.3 - Examples

Example 1: A 6-bit format
Figure 2.33 from the book (below) shows a hypothetical 6-bit floating point representation:
    --------------------------
   |s|exp(3 bits)|frac(2 bits)|
    --------------------------
Floating Point Representation Example 1
  1. How many different values can be represented using 6 bits?
  2. What is the bias?
  3. How many of these are NaN?
  4. How many of these are infinity?
  5. How many of these are positive, normalized?
  6. How many of these are negative, normalized?
  7. How many of these values are zero (denormalized)?
  8. How many of these are denormalized > 0?
  9. How many of these are denormalized < 0?


What are the positive normalized values?
s = 0
exp = 1, 2, 3, 4, 5, or 6
frac = 00, 01, 10, or 11, corresponding to 1.00, 1.01, 1.10, and 1.11
Value = 1.frac * 2^(exp - 3)
frac      M      exp=1   exp=2   exp=3   exp=4   exp=5   exp=6
                 (2^-2)  (2^-1)  (2^0)   (2^1)   (2^2)   (2^3)
frac=00   1.00   0.25    0.5     1.0     2.0     4.0      8.0
frac=01   1.25   0.3125  0.625   1.25    2.5     5.0     10.0
frac=10   1.50   0.375   0.75    1.5     3.0     6.0     12.0
frac=11   1.75   0.4375  0.875   1.75    3.5     7.0     14.0
Smallest positive normalized number: 0.25
Denormalized values: M = frac * 2^-2 = frac/4: 0, .25, .5, .75
value = M * 2^-2 = M/4.
The values are 0, .0625, 0.125, and 0.1875
Denormalized spacing: 0.0625
Largest denormalized number: 0.1875
[Figure 2.33 from the book (fig2.33.jpg)]
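A short C sketch (mine, not the book's) that enumerates the nonnegative values of this 6-bit format, using Value = M * 2^E with Bias = 3, where E = exp - Bias for normalized values and E = 1 - Bias, M = frac/4 for denormalized ones:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        int bias = 3;                                /* 2^(3-1) - 1 for a 3-bit exp */
        for (int exp = 0; exp <= 6; exp++) {         /* exp = 7 is reserved for inf/NaN */
            for (int frac = 0; frac <= 3; frac++) {
                double M;
                int E;
                if (exp == 0) {                      /* denormalized */
                    M = frac / 4.0;
                    E = 1 - bias;
                } else {                             /* normalized */
                    M = 1.0 + frac / 4.0;
                    E = exp - bias;
                }
                printf("exp=%d frac=%d value=%g\n", exp, frac, ldexp(M, E));
            }
        }
        return 0;
    }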

Today's News: February 5
Assignment 1 is due today.
Did you do your memorization homework?


See this summary.

Example 2: an 8-bit format
    --------------------------
   |s|exp(4 bits)|frac(3 bits)|
    --------------------------
  1. How many different values can be represented with 8 bits?
  2. What is the bias?
  3. How many of these are NaN?
  4. How many of these are infinity?
  5. How many of these are positive, normalized?
  6. How many of these are negative, normalized?
  7. How many of these values are zero (denormalized)?
  8. How many of these are denormalized > 0?
  9. How many of these are denormalized < 0?
  10. Approximate number of decimal places of accuracy (significant figures)?

[Figure 2.34 from the book (fig2.34.jpg)]
Example 3: IEEE single precision floating point
    ---------------------------
   |s|exp(8 bits)|frac(23 bits)|
    ---------------------------

Example 4: Equally spaced values
Example 5: IEEE double precision floating point
    ----------------------------
   |s|exp(11 bits)|frac(52 bits)|
    ----------------------------

Section 2.4.4 - Rounding

Since we often cannot represent floating point values exactly, we may need to round.
Traditional rounding: round to nearest, halfway rounds up.
e.g., 1.4 rounds to 1, 1.5 rounds to 2, 1.6 rounds to 2, etc.
Four IEEE rounding methods: round-to-even (the default), round-toward-zero, round-down, and round-up. Note: the last three guarantee bounds on the actual value.

Example from book:
Mode                $1.40   $1.60   $1.50   $2.50   $-1.50
Round-to-even        $1      $2      $2      $2      $-2
Round-toward-zero    $1      $1      $1      $2      $-1
Round-down           $1      $1      $1      $2      $-2
Round-up             $2      $2      $2      $3      $-1
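The same table can be reproduced in C with the <fenv.h> rounding modes; this is a sketch (strictly conforming code also wants #pragma STDC FENV_ACCESS ON), with nearbyint rounding in the currently selected mode:

    #include <stdio.h>
    #include <math.h>
    #include <fenv.h>

    int main(void) {
        double vals[]  = {1.40, 1.60, 1.50, 2.50, -1.50};
        int modes[]    = {FE_TONEAREST, FE_TOWARDZERO, FE_DOWNWARD, FE_UPWARD};
        const char *names[] = {"to-even", "toward-zero", "down", "up"};

        for (int m = 0; m < 4; m++) {
            fesetround(modes[m]);                        /* pick the rounding mode */
            printf("%-12s", names[m]);
            for (int i = 0; i < 5; i++)
                printf(" %6.1f", nearbyint(vals[i]));    /* round in the current mode */
            printf("\n");
        }
        fesetround(FE_TONEAREST);                        /* restore the default */
        return 0;
    }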


Today's News: February 7


Today's News: February 10
Exam next week!


Section 2.4.5 - IEEE Floating-Point Operations

Operations are based on computing the exact result and then rounding using the current rounding method.
Special values such as ∞, -∞, and NaN behave in a reasonable way.
Properties of addition: commutative, but not associative.
Multiplication is similar: commutative, not associative:
(x*y)*z is not always x*(y*z).
Question:
Give an example to show that floating point multiplication is not associative.
Answer:
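One candidate answer, sketched in C with single precision floats, where overflow makes the two groupings differ:

    #include <stdio.h>

    int main(void) {
        float x = 1e20f, y = 1e20f, z = 1e-20f;

        float left  = (x * y) * z;   /* x*y overflows single precision to +inf */
        float right = x * (y * z);   /* y*z is about 1.0f, so this stays near 1e20 */

        printf("left = %g, right = %g\n", left, right);
        return 0;
    }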


Floating point addition does satisfy: if a >= b, then x + a >= x + b (as long as x is not NaN).
Note: Integer arithmetic does not satisfy this.
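A small C illustration of the integer failure: with 32-bit two's complement ints, adding 1 to INT_MAX wraps around (signed overflow is technically undefined behavior, but this is what typical machines do):

    #include <stdio.h>
    #include <limits.h>

    int main(void) {
        int x = INT_MAX, a = 1, b = 0;       /* a >= b */
        /* x + a wraps around to INT_MIN, so x + a < x + b. */
        printf("x + a = %d, x + b = %d\n", x + a, x + b);
        return 0;
    }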

Section 2.4.6 - Floating Point in C

Most machines support IEEE floating point with float being single precision and double being double precision.
Intel machines do all calculations in 80-bit extended format and then convert (round) the result to float or double.
On these machines it is not faster to do calculations with floats than doubles, but floats can be used to save space if you have large arrays.

Casting:
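The notes stop here; as a rough sketch of the usual behavior (not an exhaustive list): converting a large int to float may round, int to double is exact for 32-bit ints, and float or double to int truncates toward zero:

    #include <stdio.h>

    int main(void) {
        int i = 123456789;
        float  f = (float)i;       /* float carries ~7 decimal digits, so this rounds */
        double d = (double)i;      /* double represents every 32-bit int exactly */
        int back = (int)2.9;       /* double to int truncates toward zero: 2 */

        printf("%d %.1f %.1f %d\n", i, f, d, back);
        return 0;
    }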