2. Basic Types: Numbers
Bits and Bytes: Floats
A number in floating point format is represented approximately by a fixed number of significant digits called the significand and scaled using an exponent in a fixed base. The general representation has the following form:
#\textbf{significand} \times \textbf{base}^\textbf{exponent}#
The base is fixed to whatever number works best for the underlying hardware; for your computer, that is base 2.
Base 10 example
However, the most intuitive and common base is base 10. Let's look at an example of the decimal number #1.2345# in floating point format using base 10:
#1.2345 = 12345 \times 10^{-4}#
Looks familiar, doesn't it? Floating point notation using base 10 is actually just scientific or exponential notation.
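This decomposition can be inspected directly. As a minimal sketch using Python's built-in `decimal` module (which stores numbers exactly in base 10, so its internal representation mirrors the notation above):

```python
from decimal import Decimal

# Decimal keeps the exact base-10 digits, so its tuple form spells out
# the sign, the significand digits, and the base-10 exponent.
d = Decimal("1.2345")
print(d.as_tuple())
# DecimalTuple(sign=0, digits=(1, 2, 3, 4, 5), exponent=-4)
```

Here `digits` spells out the significand #12345# and `exponent` is #-4#, exactly as in the notation above.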
In the example above, the significand is simply the number we want to represent without the decimal point, but this changes when using a base other than 10, such as base 2. In fact, a significand that represents a given decimal exactly may not even exist in another base. The solution is to settle for the significand that yields the closest possible number to the decimal we want to represent. This is why floating point numbers are generally described as approximations.
Base 2 example
Let's take a look at an example using the same decimal, but in floating point format using base 2:
#1.2345 \approx 5559693739988877 \times 2^{-52}#
That got complicated quickly. The number in floating point notation is the number that is actually stored in memory. It's the best approximation your computer can store.
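You can verify this significand and exponent yourself. A minimal sketch in Python, assuming a standard IEEE double: `float.as_integer_ratio` returns the stored value as an exact fraction, and since the significand here is odd, nothing cancels against the power of 2:

```python
# The stored double equals significand × 2^-52, i.e. significand / 2^52.
# as_integer_ratio returns that fraction in lowest terms.
numerator, denominator = (1.2345).as_integer_ratio()

print(numerator)             # 5559693739988877
print(denominator == 2**52)  # True
```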
Floating point numbers are stored in computer memory according to the IEEE 754 standard and generally occupy either 32 or 64 bits; these formats are referred to as single precision and double precision, respectively. Because computers work with bits, floats are stored slightly differently than the notations above might imply.
The significand and the exponent are each assigned a different number of bits, and a single bit is reserved as the sign bit, which determines whether the resulting number is positive or negative. Remember that the base is fixed to 2.
Precision | Sign | Exponent | Significand |
Single (32-bit) | 1 bit | 8 bits | 23 bits |
Double (64-bit) | 1 bit | 11 bits | 52 bits |
The conversion from the 32-bit representation to a decimal value is given by the formula:
#(-1)^\textbf{sign} \times 2^{\textbf{exponent}-127} \times \textbf{significand}#
Where:
#\textbf{significand} = 1 + \sum_{i=1}^{23} \textbf{bit}_{23-i} \times 2^{-i}#
The next example shows how the above formulas are applied when converting a binary sequence to a float.
Binary to float conversion
Let's take a look at #1.2345# stored in bits as a single precision #\mathtt{\text{float}}#. Note that the actual value of the decimal shown in the previous example is in double precision, which allows for a better approximation than the value we will find here.
Sign | Exponent | Significand |
0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
31 | 30 | ... | ... | 23 | 22 | ... | ... | 0 |
Step-by-step conversion
Step 1: Convert the 8-bit exponent to an integer.
#\textbf{exponent} = 2^0 + 2^1 + 2^2 +2^3 + 2^4 +2^5 + 2^6 = 127#
Step 2: Convert the 23-bit significand to its fractional component. The resulting value is truncated for readability.
#\textbf{significand} = 1 + 2^{-3} + 2^{-4} + 2^{-5} + 2^{-6} + 2^{-13} + 2^{-19} + 2^{-20} + 2^{-23} \approx 1.23450005054#
Step 3: Substitute the values in the formula:
#\begin{align*}
(-1)^\textbf{sign} \times 2^{\textbf{exponent}-127} \times \textbf{significand} &= (-1)^0 \times 2^{127-127} \times (1 + 0.23450005054)\\
&= 1 \times 2^0 \times 1.23450005054\\
&= 1.23450005054
\end{align*}#
As shown in the previous theory, undesired behavior may occur due to this imprecise representation of decimals.
>>> 0.1 + 0.2
0.30000000000000004
This is not a bug, but a direct result of the trade-off between range and precision mentioned in the definition of the floating point format. Recall that we call this small deviation from the exact result a floating point error. The error is usually insignificant, but it can accumulate over repeated calculations or cause unexpected results, so it is important to be aware of this behavior.
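In practice this means you should not compare floats with exact equality. A minimal sketch using Python's standard `math.isclose`:

```python
import math

total = 0.1 + 0.2
print(total == 0.3)              # False: the two stored values differ slightly
print(math.isclose(total, 0.3))  # True: equal within a small relative tolerance
```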
Or visit omptest.org if you are taking an OMPT exam.