Why This Matters
A float is not a real number. It is a 32-bit integer pattern that a hardware unit interprets as $(-1)^s \times 1.f \times 2^{e}$. Roughly $2^{32}$ patterns must cover a range from about $10^{-38}$ to $10^{38}$, so most reals get rounded. The constant $0.1$ has no finite binary expansion; what you store is closer to $0.100000001490116$. When you sum such values in a training loop, the drift is visible, and naive accumulation in float32 can lose the bottom 4-5 decimal digits.
ML systems live inside these limits. Gradient underflow, NaN explosions, and the reason bfloat16 keeps 8 exponent bits but only 7 mantissa bits all trace back to IEEE 754. Knowing the bit layout tells you why mixed precision needs a loss scale and why Kahan summation exists.
Core Definitions
Floating-point number
A number of the form $(-1)^s \times m \times \beta^{e}$, where $s$ is the sign, $m$ is the significand (mantissa), $e$ is an integer exponent, and $\beta$ is the radix. IEEE 754 binary formats use $\beta = 2$ and a normalized significand in $[1, 2)$.
Binary32 (IEEE 754 single precision)
A 32-bit format: 1 sign bit, 8 exponent bits with bias 127, 23 fraction bits. For a normal value with stored exponent $E$ and stored fraction bits $f$, the value is $(-1)^s \times 1.f_2 \times 2^{E-127}$. The leading 1 is implicit and not stored.
Machine epsilon
The gap between $1$ and the next representable number. For binary32, $\varepsilon = 2^{-23} \approx 1.19 \times 10^{-7}$. For binary64, $\varepsilon = 2^{-52} \approx 2.22 \times 10^{-16}$.
ULP (Unit in the Last Place)
The distance between two consecutive representable floats near a given value $x$. For a normal $x \in [2^e, 2^{e+1})$, $\mathrm{ulp}(x) = 2^{e-p}$, where $p$ is the mantissa width (23 for binary32, 52 for binary64).
The Bit Layout
Binary32 packs three fields into one 32-bit word, MSB first:
 31 30       23 22                             0
+--+-----------+--------------------------------+
| s| exponent  |            fraction            |
+--+-----------+--------------------------------+
  1      8                    23 bits
Binary64 widens both fields: 1 sign, 11 exponent (bias 1023), 52 fraction. The exponent encoding uses biased (excess-127) representation: stored field $E$ means true exponent $e = E - 127$ (or $E - 1023$ for binary64). Bias makes float comparison work as unsigned integer comparison on the magnitude bits, which simplifies sort networks.
The stored exponent encodes three regimes:
- $E = 0$: zero (if $f = 0$) or subnormal (if $f \neq 0$), value $(-1)^s \times 0.f_2 \times 2^{-126}$.
- $1 \le E \le 254$: normal, value $(-1)^s \times 1.f_2 \times 2^{E-127}$.
- $E = 255$: infinity (if $f = 0$) or NaN (if $f \neq 0$).
The implicit leading 1 buys one extra bit of precision: a 23-bit fraction field encodes a 24-bit significand.
Worked Example: Encoding 1.5
Write $1.5$ in binary: $1.1_2$. Normalized form: $1.1_2 \times 2^0$. So $s = 0$, true exponent $e = 0$, stored exponent $E = 0 + 127 = 127 = 01111111_2$, fraction $f = 1000\ldots0_2$ (the bits after the implicit leading 1, zero-padded to 23 bits).
Concatenate:
0 01111111 10000000000000000000000
As hex: 0x3FC00000. You can verify in C:
#include <stdio.h>
#include <string.h>
int main(void) {
    float x = 1.5f;
    unsigned int bits;
    memcpy(&bits, &x, 4);
    printf("%08x\n", bits); // prints 3fc00000
    return 0;
}
Worked Example: Why 0.1 Is Not Exact
The decimal $0.1$ in binary is the repeating fraction $0.000110011001100\ldots_2 = 0.0\overline{0011}_2$.
To normalize, shift left 4 places: $1.10011001\ldots_2 \times 2^{-4}$. Stored exponent $E = -4 + 127 = 123 = 01111011_2$. The 23-bit fraction is the first 23 bits after the implicit leading 1; the 24th bit governs round-to-nearest-even:
fraction bits: 10011001100110011001101
                                     ^ rounded up
Final binary32 encoding of $0.1$: 0x3DCCCCCD. Decoded exactly, this equals $13421773 \times 2^{-27} = 0.100000001490116119384765625$.
The error is about $+1.49 \times 10^{-9}$, roughly $0.2$ ULP at that magnitude.
For $-0.1$, flip the sign bit: 0xBDCCCCCD. The magnitude error is identical.
Special Values
Zero. Stored exponent 0 and fraction 0. There are two zeros: $+0$ (0x00000000) and $-0$ (0x80000000). They compare equal but are bit-distinct; $1/(+0) = +\infty$, $1/(-0) = -\infty$.
Infinity. Stored exponent all ones, fraction zero. $+\infty$ is 0x7F800000. Produced by overflow (any result above about $3.4 \times 10^{38}$) and by $x/0$ for finite nonzero $x$.
NaN. Stored exponent all ones, fraction nonzero. Two flavors:
- quiet NaN (qNaN), MSB of fraction set, propagates silently
- signaling NaN (sNaN), MSB of fraction clear, raises an FP exception on use
NaN has the property $\mathrm{NaN} \neq \mathrm{NaN}$; it is the only value that compares unequal to itself. This is how x != x works as a NaN check.
Subnormals (denormals). When stored exponent is 0 and fraction nonzero, the leading bit is implicitly 0, not 1, and the exponent is fixed at $-126$. Subnormals fill the gap between 0 and the smallest normal, $2^{-126} \approx 1.18 \times 10^{-38}$. The smallest positive binary32 is $2^{-149} \approx 1.4 \times 10^{-45}$. Subnormals preserve "gradual underflow" but often run on a slow microcode path; many GPUs and SIMD modes set FTZ (flush-to-zero) and DAZ (denormals-are-zero) flags to skip them.
Precision and ULPs
Floats are densest near zero and sparser at large magnitudes. Between $2^e$ and $2^{e+1}$ there are exactly $2^{23}$ uniformly spaced binary32 values. Consequences:
- Near $1$, $\mathrm{ulp} = 2^{-23} \approx 1.19 \times 10^{-7}$.
- Near $2^{24} = 16{,}777{,}216$, $\mathrm{ulp} = 2$. You cannot represent $16{,}777{,}217$ in binary32.
- Near $2^{23} = 8{,}388{,}608$, consecutive floats are 1 apart, and $8{,}388{,}608.5$ rounds to $8{,}388{,}608$ (ties to even).
Round-to-nearest-even (the default) gives a worst-case relative error of $u = \varepsilon/2$ per operation. For binary32, that bounds one operation at $2^{-24} \approx 6 \times 10^{-8}$ relative. Errors compound: summing $n$ values naively has worst-case error growing like $O(nu)$ and expected error like $O(\sqrt{n}\,u)$.
Catastrophic Cancellation and Kahan Summation
When two close numbers are subtracted, leading bits cancel and the result is dominated by rounding noise.
float a = 1.0000001f;
float b = 1.0000000f;
float d = a - b; // d = 1.1920929e-07, not 1e-7
The relative error in d is on the order of 1, even though a and b were each accurate to a fraction of a ULP. This is why the variance formula
$\sigma^2 = \frac{1}{n}\sum_i x_i^2 - \left(\frac{1}{n}\sum_i x_i\right)^2$
is numerically dangerous when the mean is large relative to the standard deviation: both terms are huge and nearly equal. Use the two-pass formulation instead.
For long sums, Kahan summation tracks the low-order error in a separate compensator:
#include <stddef.h>

float kahan_sum(const float *x, size_t n) {
    float s = 0.0f;
    float c = 0.0f;              // running compensation
    for (size_t i = 0; i < n; i++) {
        float y = x[i] - c;      // y = next term, minus prior loss
        float t = s + y;         // s is large, y small; low bits drop
        c = (t - s) - y;         // recover the dropped low bits
        s = t;
    }
    return s;
}
This reduces the error bound from $O(nu)$ to $O(u) + O(nu^2)$, effectively independent of $n$, at the cost of 4x arithmetic. ML training loops summing gradient terms in float32 will see meaningful drift without compensation; this is one reason gradient accumulation buffers are often kept in float32 even when activations are float16.
Main Theorem
Standard Model of Floating-Point Arithmetic
Statement
For any two representable floats $a, b$ and any operation $\circ \in \{+, -, \times, /\}$, the computed result satisfies $\mathrm{fl}(a \circ b) = (a \circ b)(1 + \delta)$ with $|\delta| \le u$, where $u$ is the unit roundoff ($2^{-24}$ for binary32, $2^{-53}$ for binary64).
Intuition
Each operation rounds the exact mathematical result to the nearest representable float. The maximum relative error is half a ULP of the result, which is bounded by $u$.
Proof Sketch
IEEE 754 mandates correctly rounded results for the four basic operations (and square root). Round-to-nearest places the exact result within $\frac{1}{2}\,\mathrm{ulp}$ of a representable value. For a normalized result in $[2^e, 2^{e+1})$, $\frac{1}{2}\,\mathrm{ulp} = 2^{e-p-1}$ and the value is at least $2^e$, giving relative error at most $2^{-p-1} = u$.
Why It Matters
This is the foundation of every floating-point error analysis. Wilkinson-style backward error proofs string together $(1 + \delta_i)$ factors and bound their product. It tells you that one float32 multiply costs at most $6 \times 10^{-8}$ relative error, and lets you reason about whether 10000 of them will stay within tolerance.
Failure Mode
The bound fails on underflow to subnormals, where relative error can be 100 percent (the result rounds to zero). It also fails for non-elementary functions like exp or sin; libm implementations typically guarantee 1-2 ULPs, not correctly rounded.
Common Confusions
Float comparison with ==
0.1 + 0.2 == 0.3 is false in C when computed in double. (In binary32, 0.1f + 0.2f happens to round to exactly 0.3f, so the float version of the test is true; the point is that == outcomes are accidents of rounding.) Each literal rounds independently and the sum carries a different rounding error than the direct encoding of $0.3$. Use fabs(a - b) < tol with a tolerance tuned to the magnitudes involved, or compare ULP distances by reinterpreting the bits as a signed integer (int32_t for float, int64_t for double).
Mantissa width versus significand width
A binary32 has a 23-bit fraction field but a 24-bit significand because of the implicit leading 1. When converting between hand-derived and machine-printed bit patterns, this off-by-one is the most common source of confusion.
Exercises
Problem
Encode the decimal value 0.1 in IEEE 754 binary32. Show the sign, biased exponent, and 23-bit fraction field. Why is the encoded value not exactly equal to 0.1?
Problem
What are the smallest positive normal and smallest positive subnormal binary32 values? What is the largest finite binary32 value? Express each as a power-of-two expression and an approximate decimal magnitude.
Problem
Find three binary32 values $a$, $b$, $c$ such that $(a + b) + c \neq a + (b + c)$, demonstrating that floating-point addition is not associative. Compute both sides and show the bit difference.
References
Canonical:
- Petzold, Code: The Hidden Language of Computer Hardware and Software (2nd ed., 2022), ch. 7–9 — builds binary arithmetic and signed/unsigned representations from first principles before arriving at floating-point encoding
- Tanenbaum & Austin, Structured Computer Organization (6th ed., 2013), ch. 2 — covers IEEE 754 binary32/binary64 layouts, normalization, and arithmetic at the instruction-set level
- Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic," ACM Computing Surveys 23(1), 1991 — definitive reference for ULP analysis, rounding modes, and the fundamental theorem of floating-point arithmetic
- Overton, Numerical Computing with IEEE Floating Point Arithmetic (SIAM, 2001), ch. 2–4 — rigorous treatment of the IEEE 754 standard, rounding error models, and exception handling
- Muller et al., Handbook of Floating-Point Arithmetic (2nd ed., Birkhäuser, 2018), ch. 1–3 — detailed reference covering representation, rounding, and elementary function error bounds
Accessible:
- Harris & Harris, Digital Design and Computer Architecture (2nd ed., 2012), ch. 5 — accessible hardware-oriented walkthrough of IEEE 754 with datapath diagrams
- Bryant & O'Hallaron, Computer Systems: A Programmer's Perspective (3rd ed., 2016), ch. 2 — programmer-focused treatment of float bit patterns, special values, and rounding with C examples
Next Topics
- Floating-Point Arithmetic and Rounding Error Analysis — formalizes ULP-based error bounds, Wilkinson backward error, and condition numbers for sequences of operations
- Integer Representation and Two's Complement — the fixed-point sibling of IEEE 754; understanding overflow and wraparound at the bit level
- Numerical Stability and Catastrophic Cancellation — how subtraction of nearly equal floats amplifies relative error and techniques such as compensated summation (Kahan) to mitigate it
- Computer Arithmetic Circuits — hardware implementation of floating-point adders and multipliers, including guard/round/sticky bits and fused multiply-add (FMA)