Floating Point - summary
Basic representation:
V = (-1)^s * M * 2^E
Format:
------------------------
|s| exp | frac |
------------------------
k = number of exp bits
Bias = 2^(k-1) - 1
f = number of frac bits
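
As a quick check of the bias formula, here is a minimal Python sketch. The helper name `bias` is hypothetical, chosen only for illustration. The widths k = 8 and k = 11 are the exponent widths of IEEE 754 single and double precision.

```python
def bias(k):
    """Bias for an exponent field of k bits: 2^(k-1) - 1."""
    return 2 ** (k - 1) - 1

print(bias(8))   # single precision exponent width -> 127
print(bias(11))  # double precision exponent width -> 1023
```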
Normalized:
exp neither all 0's nor all 1's
M = 1 + .frac, which means 1 + frac × 2^(-f)
E = exp - Bias
Denormalized:
exp = all 0's
M = .frac, which means frac × 2^(-f)
E = 1 - Bias
Infinity:
exp = all 1's, frac == 0
NaN:
exp = all 1's, frac != 0
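
The four cases above can be combined into one small decoder. This is a sketch under stated assumptions, not a reference implementation. The function name `decode` and its parameters are hypothetical. It assumes the bit pattern is passed as a non-negative integer laid out as |s| exp | frac |, with s in the most significant bit.

```python
import math

def decode(bits, k, f):
    """Decode an integer bit pattern with 1 sign bit, k exp bits, f frac bits."""
    s = (bits >> (k + f)) & 1            # sign bit
    exp = (bits >> f) & ((1 << k) - 1)   # exponent field as unsigned integer
    frac = bits & ((1 << f) - 1)         # fraction field as unsigned integer
    bias = 2 ** (k - 1) - 1
    sign = -1.0 if s else 1.0            # (-1)^s

    if exp == (1 << k) - 1:              # exp = all 1's
        return sign * math.inf if frac == 0 else math.nan
    if exp == 0:                         # denormalized
        M = frac * 2.0 ** -f
        E = 1 - bias
    else:                                # normalized
        M = 1 + frac * 2.0 ** -f
        E = exp - bias
    return sign * M * 2.0 ** E
```

For example, `decode(0x3F800000, 8, 23)` returns 1.0 (exp = 127, frac = 0, so M = 1 and E = 0), and `decode(0x00000001, 8, 23)` returns 2^(-149), the smallest positive denormalized single-precision value.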