Why This Matters
A float is not a real number. It is a 32-bit integer pattern that a hardware unit interprets as $(-1)^s \times 1.f \times 2^{e}$. Roughly $2^{32}$ patterns must cover a range from about $10^{-38}$ to $10^{38}$, so most reals get rounded. The constant $0.1$ has no finite binary expansion; what you store is closer to $0.100000001490116$. When you sum such values in a training loop, the drift is visible, and naive accumulation in float32 can lose the bottom 4-5 decimal digits.
ML systems live inside these limits. Gradient underflow, NaN explosions, and the reason bfloat16 keeps 8 exponent bits but only 7 mantissa bits all trace back to IEEE 754. Knowing the bit layout tells you why mixed precision needs a loss scale and why Kahan summation exists.
Core Definitions
Floating-point number
A number of the form $(-1)^s \times m \times \beta^{e}$, where $s$ is the sign, $m$ is the significand (mantissa), $e$ is an integer exponent, and $\beta$ is the radix. IEEE 754 binary formats use $\beta = 2$ and a normalized significand in $[1, 2)$.
Binary32 (IEEE 754 single precision)
A 32-bit format: 1 sign bit, 8 exponent bits with bias 127, 23 fraction bits. For a normal value with stored exponent $E$ and stored fraction bits $f$, the value is $(-1)^s \times 1.f_2 \times 2^{E-127}$. The leading 1 is implicit and not stored.
Machine epsilon
The gap between $1$ and the next representable number. For binary32, $\varepsilon = 2^{-23} \approx 1.19 \times 10^{-7}$. For binary64, $\varepsilon = 2^{-52} \approx 2.22 \times 10^{-16}$.
ULP (Unit in the Last Place)
The distance between two consecutive representable floats near a given value $x$. For a normal $x \in [2^e, 2^{e+1})$, $\mathrm{ulp}(x) = 2^{e-p}$, where $p$ is the mantissa width (23 for binary32, 52 for binary64).
The Bit Layout
Binary32 packs three fields into one 32-bit word, MSB first:
 31 30       23 22                             0
+--+-----------+--------------------------------+
| s| exponent  |            fraction            |
+--+-----------+--------------------------------+
  1      8                    23 bits
Binary64 widens both fields: 1 sign, 11 exponent (bias 1023), 52 fraction. The exponent encoding uses biased (excess-127) representation: stored field $E$ means true exponent $e = E - 127$ (or $E - 1023$ for binary64). Bias makes float comparison work as unsigned integer comparison on the magnitude bits, which simplifies sort networks.
The stored exponent encodes three regimes:
- $E = 0$: zero (if $f = 0$) or subnormal (if $f \neq 0$), value $(-1)^s \times 0.f_2 \times 2^{-126}$.
- $1 \le E \le 254$: normal, value $(-1)^s \times 1.f_2 \times 2^{E-127}$.
- $E = 255$: infinity (if $f = 0$) or NaN (if $f \neq 0$).
The implicit leading 1 buys one extra bit of precision: a 23-bit fraction field encodes a 24-bit significand.
Worked Example: Encoding 1.5
Write $1.5$ in binary: $1.1_2$. Normalized form: $1.1_2 \times 2^0$. So $s = 0$, true exponent $e = 0$, stored exponent $E = 0 + 127 = 127 = 01111111_2$, fraction $f = 1000\ldots0_2$ (the bits after the implicit leading 1, zero-padded to 23 bits).
Concatenate:
0 01111111 10000000000000000000000
As hex: 0x3FC00000. You can verify in C:
#include <stdio.h>
#include <string.h>
int main(void) {
    float x = 1.5f;
    unsigned int bits;
    memcpy(&bits, &x, 4);
    printf("%08x\n", bits); // prints 3fc00000
    return 0;
}
Worked Example: Why 0.1 Is Not Exact
The decimal $0.1$ in binary is the repeating fraction $0.000110011001100\ldots_2 = 0.0\overline{0011}_2$.
To normalize, shift left 4 places: $1.10011001\ldots_2 \times 2^{-4}$. Stored exponent $E = -4 + 127 = 123 = 01111011_2$. The 23-bit fraction is the first 23 bits after the implicit leading 1; the 24th bit governs round-to-nearest-even:
fraction bits: 10011001100110011001101
                                     ^ rounded up
Final binary32 encoding of $0.1$: 0x3DCCCCCD. Decoded exactly, this equals $13421773 \times 2^{-27} = 0.100000001490116119384765625$.
The error is about $+1.49 \times 10^{-9}$, roughly $0.2$ ULP at that magnitude.
For $-0.1$, flip the sign bit: 0xBDCCCCCD. The magnitude error is identical.
Special Values
Zero. Stored exponent 0 and fraction 0. There are two zeros: $+0$ (0x00000000) and $-0$ (0x80000000). They compare equal but are bit-distinct; $1/(+0) = +\infty$, $1/(-0) = -\infty$.
Infinity. Stored exponent all ones, fraction zero. $+\infty$ is 0x7F800000. Produced by overflow (any result above about $3.4 \times 10^{38}$) and by $x/0$ for finite nonzero $x$.
NaN. Stored exponent all ones, fraction nonzero. Two flavors:
- quiet NaN (qNaN), MSB of fraction set, propagates silently
- signaling NaN (sNaN), MSB of fraction clear, raises an FP exception on use
NaN has the property $\mathrm{NaN} \neq \mathrm{NaN}$; it is the only value that compares unequal to itself. This is how x != x works as a NaN check.
Subnormals (denormals). When stored exponent is 0 and fraction nonzero, the leading bit is implicitly 0, not 1, and the exponent is fixed at $-126$. Subnormals fill the gap between 0 and the smallest normal, $2^{-126} \approx 1.18 \times 10^{-38}$. The smallest positive binary32 is $2^{-149} \approx 1.4 \times 10^{-45}$. Subnormals preserve "gradual underflow" but often run on a slow microcode path; many GPUs and SIMD modes set FTZ (flush-to-zero) and DAZ (denormals-are-zero) flags to skip them.
Precision and ULPs
Floats are densest near zero and sparser at large magnitudes. Between $2^e$ and $2^{e+1}$ there are exactly $2^{23}$ uniformly spaced binary32 values. Consequences:
- Near $1$, $\mathrm{ulp} = 2^{-23} \approx 1.19 \times 10^{-7}$.
- Near $2^{24} = 16{,}777{,}216$, $\mathrm{ulp} = 2$. You cannot represent $16{,}777{,}217$ in binary32.
- Near $2^{23} = 8{,}388{,}608$, consecutive floats are 1 apart, and $8{,}388{,}608.5$ rounds to $8{,}388{,}608$ (ties to even).
Round-to-nearest-even (the default) gives a worst-case relative error of $u = \varepsilon/2$ per operation. For binary32, that bounds one operation at $2^{-24} \approx 6 \times 10^{-8}$ relative. Errors compound: summing $n$ values naively has worst-case error growing like $O(nu)$ and expected error like $O(\sqrt{n}\,u)$.
Catastrophic Cancellation and Kahan Summation
When two close numbers are subtracted, leading bits cancel and the result is dominated by rounding noise.
float a = 1.0000001f;
float b = 1.0000000f;
float d = a - b; // d = 1.1920929e-07, not 1e-7
The relative error in d is on the order of 1, even though a and b were each accurate to a fraction of a ULP. This is why the variance formula
$\sigma^2 = \frac{1}{n}\sum_i x_i^2 - \left(\frac{1}{n}\sum_i x_i\right)^2$
is numerically dangerous when the mean is large relative to the standard deviation: both terms are huge and nearly equal. Use the two-pass formulation instead.
For long sums, Kahan summation tracks the low-order error in a separate compensator:
#include <stddef.h>

float kahan_sum(const float *x, size_t n) {
    float s = 0.0f;
    float c = 0.0f;              // running compensation
    for (size_t i = 0; i < n; i++) {
        float y = x[i] - c;      // y = next term, minus prior loss
        float t = s + y;         // s is large, y small; low bits drop
        c = (t - s) - y;         // recover the dropped low bits
        s = t;
    }
    return s;
}
This reduces the error bound from $O(nu)$ to $O(u) + O(nu^2)$, effectively independent of $n$, at the cost of 4x arithmetic. ML training loops summing gradient terms in float32 will see meaningful drift without compensation; this is one reason gradient accumulation buffers are often kept in float32 even when activations are float16.
Main Theorem
Standard Model of Floating-Point Arithmetic
Statement
For any two representable floats $a, b$ and any operation $\circ \in \{+, -, \times, /\}$, the computed result satisfies $\mathrm{fl}(a \circ b) = (a \circ b)(1 + \delta)$ with $|\delta| \le u$, where $u$ is the unit roundoff ($2^{-24}$ for binary32, $2^{-53}$ for binary64).
Intuition
Each operation rounds the exact mathematical result to the nearest representable float. The maximum relative error is half a ULP of the result, which is bounded by $u$.
Proof Sketch
IEEE 754 mandates correctly rounded results for the four basic operations (and square root). Round-to-nearest places the exact result within $\frac{1}{2}\,\mathrm{ulp}$ of a representable value. For a normalized result in $[2^e, 2^{e+1})$, $\frac{1}{2}\,\mathrm{ulp} = 2^{e-p-1}$ and the value is at least $2^e$, giving relative error at most $2^{-p-1} = u$.
Why It Matters
This is the foundation of every floating-point error analysis. Wilkinson-style backward error proofs string together $(1 + \delta_i)$ factors and bound their product. It tells you that one float32 multiply costs at most $6 \times 10^{-8}$ relative error, and lets you reason about whether 10000 of them will stay within tolerance.
Failure Mode
The bound fails on underflow to subnormals, where relative error can be 100 percent (the result rounds to zero). It also fails for non-elementary functions like exp or sin; libm implementations typically guarantee 1-2 ULPs, not correctly rounded.
Common Confusions
Float comparison with ==
0.1 + 0.2 == 0.3 is false in C when computed in double. (In binary32, 0.1f + 0.2f happens to round to exactly 0.3f, so the float version of the test is true; the point is that == outcomes are accidents of rounding.) Each literal rounds independently and the sum carries a different rounding error than the direct encoding of $0.3$. Use fabs(a - b) < tol with a tolerance tuned to the magnitudes involved, or compare ULP distances by reinterpreting the bits as a signed integer (int32_t for float, int64_t for double).
Mantissa width versus significand width
A binary32 has a 23-bit fraction field but a 24-bit significand because of the implicit leading 1. When converting between hand-derived and machine-printed bit patterns, this off-by-one is the most common source of confusion.
Exercises
Problem
Encode the decimal value 0.1 in IEEE 754 binary32. Show the sign, biased exponent, and 23-bit fraction field. Why is the encoded value not exactly equal to 0.1?
Problem
What are the smallest positive normal and smallest positive subnormal binary32 values? What is the largest finite binary32 value? Express each as a power-of-two expression and an approximate decimal magnitude.
Problem
Find three binary32 values $a$, $b$, $c$ such that $(a + b) + c \neq a + (b + c)$, demonstrating that floating-point addition is not associative. Compute both sides and show the bit difference.
References
Canonical:
- Petzold, Code: The Hidden Language of Computer Hardware and Software (2nd ed., 2022), ch. 7–9 — builds binary arithmetic and signed/unsigned representations from first principles before arriving at floating-point encoding
- Tanenbaum & Austin, Structured Computer Organization (6th ed., 2013), ch. 2 — covers IEEE 754 binary32/binary64 layouts, normalization, and arithmetic at the instruction-set level
- Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic," ACM Computing Surveys 23(1), 1991 — definitive reference for ULP analysis, rounding modes, and the fundamental theorem of floating-point arithmetic
- Overton, Numerical Computing with IEEE Floating Point Arithmetic (SIAM, 2001), ch. 2–4 — rigorous treatment of the IEEE 754 standard, rounding error models, and exception handling
- Muller et al., Handbook of Floating-Point Arithmetic (2nd ed., Birkhäuser, 2018), ch. 1–3 — detailed reference covering representation, rounding, and elementary function error bounds
Accessible:
- Harris & Harris, Digital Design and Computer Architecture (2nd ed., 2012), ch. 5 — accessible hardware-oriented walkthrough of IEEE 754 with datapath diagrams
- Bryant & O'Hallaron, Computer Systems: A Programmer's Perspective (3rd ed., 2016), ch. 2 — programmer-focused treatment of float bit patterns, special values, and rounding with C examples
Next Topics
- Floating-Point Arithmetic and Rounding Error Analysis — formalizes ULP-based error bounds, Wilkinson backward error, and condition numbers for sequences of operations
- Integer Representation and Two's Complement — the fixed-point sibling of IEEE 754; understanding overflow and wraparound at the bit level
- Numerical Stability and Catastrophic Cancellation — how subtraction of nearly equal floats amplifies relative error and techniques such as compensated summation (Kahan) to mitigate it
- Computer Arithmetic Circuits — hardware implementation of floating-point adders and multipliers, including guard/round/sticky bits and fused multiply-add (FMA)