2. Basic Types: Numbers
Bits and Bytes: Floats
A number in floating point format is represented approximately by a fixed number of significant digits called the significand and scaled using an exponent in a fixed base. The general representation has the following form:
#\textbf{significand} \times \textbf{base}^\textbf{exponent}#
The base is fixed to whatever number works best for the underlying hardware; for your computer, that is base 2.
Base 10 example
However, the most intuitive and common base is base 10. Let's look at an example of the decimal number #1.2345# in floating point format using base 10:
#1.2345 = 12345 \times 10^{-4}#
Looks familiar, doesn't it? Floating point notation using base 10 is actually just scientific or exponential notation.
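This decomposition can be inspected directly. As a minimal sketch using Python's built-in `decimal` module (which stores numbers exactly in base 10, so its internal representation mirrors the notation above):

```python
from decimal import Decimal

# Decimal keeps the exact base-10 digits, so its tuple form spells out
# the sign, the significand digits, and the base-10 exponent.
d = Decimal("1.2345")
print(d.as_tuple())
# DecimalTuple(sign=0, digits=(1, 2, 3, 4, 5), exponent=-4)
```

Here `digits` spells out the significand #12345# and `exponent` is #-4#, exactly as in the notation above.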
In the example above, the significand is simply the number we want to represent without the decimal point, but this changes when using a base other than 10, such as base 2. In fact, a significand that represents a given decimal exactly may not even exist in another base. The solution is to settle for the significand that yields the closest possible number to the decimal we want to represent. This is why floating point numbers are generally described as approximations.
Base 2 example
Let's take a look at an example using the same decimal, but in floating point format using base 2:
#1.2345 \approx 5559693739988877 \times 2^{-52}#
That got complicated quickly. The number in floating point notation is the number that is actually stored in memory. It's the best approximation your computer can store.
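You can verify this significand and exponent yourself. A minimal sketch in Python, assuming a standard IEEE double: `float.as_integer_ratio` returns the stored value as an exact fraction, and since the significand here is odd, nothing cancels against the power of 2:

```python
# The stored double equals significand × 2^-52, i.e. significand / 2^52.
# as_integer_ratio returns that fraction in lowest terms.
numerator, denominator = (1.2345).as_integer_ratio()

print(numerator)             # 5559693739988877
print(denominator == 2**52)  # True
```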
Floating point numbers are stored in computer memory according to the IEEE 754 standard and generally occupy either 32 or 64 bits; these formats are referred to as single precision and double precision, respectively. Because computers work with bits, floats are stored slightly differently than the notations above might imply.
The significand and the exponent are each assigned a different number of bits, and a single bit is reserved as the sign bit, which determines whether the resulting number is positive or negative. Remember that the base is fixed to 2.
Precision | Sign | Exponent | Significand |
Single (32-bit) | 1 bit | 8 bits | 23 bits |
Double (64-bit) | 1 bit | 11 bits | 52 bits |
The conversion from the 32-bit representation to a decimal value is given by the formula:
#(-1)^\textbf{sign} \times 2^{\textbf{exponent}-127} \times \textbf{significand}#
Where:
#\textbf{significand} = 1 + \sum_{i=1}^{23} \textbf{bit}_{23-i} \times 2^{-i}#
The next example shows how the above formulas are applied when converting a binary sequence to a float.
Binary to float conversion
Let's take a look at #1.2345# stored in bits as a single precision #\mathtt{\text{float}}#. Note that the actual value of the decimal shown in the previous example is in double precision, which allows for a better approximation than the value we will find here.
Sign | Exponent | Significand |
0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
31 | 30 | ... | ... | 23 | 22 | ... | ... | 0 |
Step-by-step conversion
Step 1: Convert the 8-bit exponent to an integer.
#\textbf{exponent} = 2^0 + 2^1 + 2^2 +2^3 + 2^4 +2^5 + 2^6 = 127#
Step 2: Convert the 23-bit significand to its fractional component. The resulting value is truncated for readability.
#\textbf{significand} = 1 + 2^{-3} + 2^{-4} + 2^{-5} + 2^{-6} + 2^{-13} + 2^{-19} + 2^{-20} + 2^{-23} \approx 1.23450005054#
Step 3: Substitute the values in the formula:
#\begin{align*}
(-1)^\textbf{sign} \times 2^{\textbf{exponent}-127} \times \textbf{significand} &= (-1)^0 \times 2^{127-127} \times (1 + 0.23450005054)\\
&= 1 \times 2^0 \times 1.23450005054\\
&= 1.23450005054
\end{align*}#
As shown in the previous theory, undesired behavior may occur due to this imprecise representation of decimals.
>>> 0.1 + 0.2
0.30000000000000004
This is not a bug, but a direct result of the trade-off between range and precision mentioned in the definition of the floating point format. Recall that we call this small deviation from the exact result a floating point error. The error is usually insignificant, but it can accumulate over repeated calculations or cause unexpected results, so it is important to be aware of this behavior.
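In practice this means you should not compare floats with exact equality. A minimal sketch using Python's standard `math.isclose`:

```python
import math

total = 0.1 + 0.2
print(total == 0.3)              # False: the two stored values differ slightly
print(math.isclose(total, 0.3))  # True: equal within a small relative tolerance
```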
Or visit omptest.org if you are taking an OMPT exam.