
Numerical Optimization

Conditioning and Condition Number

The condition number measures how sensitive a problem is to perturbations in its input. Ill-conditioned matrices turn small errors into catastrophic ones, and understanding conditioning is essential for any computation involving linear algebra.


Why This Matters

Every numerical computation in machine learning, from solving linear systems and computing eigendecompositions to inverting matrices and running gradient descent, is affected by the conditioning of the problem. A matrix can be perfectly invertible in exact arithmetic and completely useless in floating point. The condition number tells you when to trust your computation and when to panic.

When a matrix inversion produces garbage, or a linear system solver returns wildly wrong answers, or gradient descent oscillates instead of converging, the condition number is almost always the explanation. Understanding numerical stability requires understanding conditioning first.

[Interactive figure: Error Amplification Lens. The condition number tracks how a small relative perturbation in the data can expand into a much larger relative perturbation in the solution. A well-scaled input error can still explode after inversion when one singular direction is tiny: a large κ means the problem is intrinsically sensitive before any algorithmic instability shows up.]

Mental Model

Think of the condition number as an amplification factor for errors. If you perturb the input to a problem by a relative amount $\epsilon$, the output changes by at most $\kappa \cdot \epsilon$ (relative). A condition number of $10^{10}$ means that 16 digits of floating-point precision lose 10 digits of accuracy, leaving you with 6 meaningful digits at best, and often fewer.

Core Definitions

Definition

Condition Number of a Matrix

For a nonsingular matrix $A \in \mathbb{R}^{n \times n}$, the condition number with respect to the 2-norm is:

$$\kappa(A) = \|A\|_2 \, \|A^{-1}\|_2 = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$$

where $\sigma_{\max}$ and $\sigma_{\min}$ are the largest and smallest singular values of $A$, respectively. Note that $\kappa(A) \geq 1$ for any nonsingular matrix, since $\|A\| \, \|A^{-1}\| \geq \|A A^{-1}\| = \|I\| = 1$, with equality only for scalar multiples of orthogonal matrices.
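This definition is easy to verify numerically. A minimal NumPy sketch (the 2×2 matrix is an arbitrary example):

```python
import numpy as np

# An arbitrary nonsingular example matrix.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# kappa(A) = sigma_max / sigma_min, computed from the singular values.
sigma = np.linalg.svd(A, compute_uv=False)   # singular values, descending
kappa_svd = sigma[0] / sigma[-1]

# np.linalg.cond uses the same 2-norm definition by default.
kappa_np = np.linalg.cond(A)

print(kappa_svd, kappa_np)   # the two values agree
```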

Definition

Ill-Conditioned Matrix

A matrix is called ill-conditioned when $\kappa(A)$ is large (typically $\kappa(A) \gg 1$). There is no universal threshold, but as a rule of thumb: if $\kappa(A) \approx 10^k$, you lose roughly $k$ digits of accuracy when solving $Ax = b$ in floating point. In double precision (about 16 digits), $\kappa(A) = 10^{16}$ means you may get zero correct digits.

Definition

Well-Conditioned Matrix

A matrix is well-conditioned when $\kappa(A)$ is small: close to 1, or at least moderate (say, $\kappa(A) \leq 10^4$). Computations involving well-conditioned matrices are numerically reliable.

The Fundamental Perturbation Bound

Theorem

Relative Perturbation Bound for Ax = b

Statement

Let $Ax = b$ with $A$ nonsingular, and let $A(x + \delta x) = b + \delta b$. Then:

$$\frac{\|\delta x\|}{\|x\|} \leq \kappa(A) \, \frac{\|\delta b\|}{\|b\|}$$

More generally, if both $A$ and $b$ are perturbed:

$$\frac{\|\delta x\|}{\|x + \delta x\|} \leq \kappa(A) \left( \frac{\|\delta A\|}{\|A\|} + \frac{\|\delta b\|}{\|b\|} \right)$$

Intuition

The condition number is the worst-case amplification factor. A 1% error in $b$ can become a $\kappa(A)\%$ error in $x$. If $\kappa(A) = 10^8$, a relative error of $10^{-8}$ in $b$ can be amplified to a relative error of order 1 in $x$, leaving no accuracy at all. The bound is tight: there exist perturbations that achieve equality.

Proof Sketch

From $A \, \delta x = \delta b$, we get $\delta x = A^{-1} \delta b$, so $\|\delta x\| \leq \|A^{-1}\| \, \|\delta b\|$. Also, $\|b\| = \|Ax\| \leq \|A\| \, \|x\|$, so $1/\|x\| \leq \|A\|/\|b\|$. Combining:

$$\frac{\|\delta x\|}{\|x\|} \leq \|A^{-1}\| \, \|\delta b\| \cdot \frac{\|A\|}{\|b\|} = \kappa(A) \, \frac{\|\delta b\|}{\|b\|}$$
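The proof also shows when the bound is attained: take $b$ along the top output singular direction and the perturbation along the bottom one. A sketch of that worst case (the 2×2 matrix is an arbitrary ill-conditioned example):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1e-3]])   # arbitrary example; kappa is on the order of 10^3
U, s, Vt = np.linalg.svd(A)
kappa = s[0] / s[-1]

b = U[:, 0]                   # b along the largest-singular-value output direction
db = 1e-8 * U[:, -1]          # perturbation along the smallest one

x = np.linalg.solve(A, b)
dx = np.linalg.solve(A, b + db) - x

# Realized amplification: (||dx||/||x||) / (||db||/||b||)
amp = (np.linalg.norm(dx) / np.linalg.norm(x)) / (np.linalg.norm(db) / np.linalg.norm(b))
print(amp, kappa)             # amp essentially equals kappa: the bound is tight
```

For a generic right-hand side and perturbation the realized amplification is usually far below $\kappa$; this adversarial alignment is what makes the bound sharp.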

Why It Matters

This bound is why the condition number is the central quantity in numerical linear algebra. It tells you the intrinsic difficulty of the problem $Ax = b$, independent of the algorithm used to solve it. No algorithm can do better than this bound: an ill-conditioned problem is inherently sensitive to errors, no matter how clever your solver.

Failure Mode

The bound gives the worst case. For a specific perturbation $\delta b$, the actual error amplification may be much less than $\kappa(A)$. The condition number is pessimistic but never optimistic: it guarantees things cannot be worse, but the typical case may be better. Also, the condition number depends on the norm used; the 2-norm condition number is most common, but $\kappa_1$ and $\kappa_\infty$ are also used.

The Singular Value Connection

The condition number $\kappa(A) = \sigma_{\max}/\sigma_{\min}$ has a clean geometric interpretation. The matrix $A$ maps the unit sphere to an ellipsoid whose semi-axes have lengths $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n > 0$.

The condition number is the aspect ratio of this ellipsoid: the ratio of the longest axis to the shortest. When $\kappa$ is large, the ellipsoid is extremely elongated: $A$ amplifies some directions enormously while compressing others. Inverting $A$ then amplifies the compressed directions, blowing up any errors in those components.

For symmetric positive definite matrices, the singular values equal the eigenvalues, so:

$$\kappa(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}$$
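This identity can be checked directly; a sketch using a randomly constructed SPD matrix as a hypothetical example:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(5, 5))
A = B @ B.T + np.eye(5)        # symmetric positive definite by construction

eigs = np.linalg.eigvalsh(A)   # sorted ascending, all positive for SPD A
ratio = eigs[-1] / eigs[0]     # lambda_max / lambda_min
cond_2norm = np.linalg.cond(A)

print(ratio, cond_2norm)       # the two values agree
```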

Connection to Optimization

Definition

Condition Number of an Optimization Problem

For a function $f$ that is $\mu$-strongly convex and $L$-smooth (i.e., $\mu I \preceq \nabla^2 f(x) \preceq L I$ everywhere), the condition number of the optimization problem is:

$$\kappa = \frac{L}{\mu}$$

Gradient descent on such a function contracts the suboptimality by a factor $(1 - 1/\kappa)$ per iteration, i.e. like $(1 - 1/\kappa)^t$ after $t$ steps. Large $\kappa$ means slow convergence: the function has a narrow, elongated valley that gradient descent zigzags across.

For minimizing $f(x) = \frac{1}{2} x^T A x - b^T x$ (the quadratic case, with $A$ symmetric positive definite), the optimization condition number is exactly $\kappa(A)$. The Hessian is $A$, with $L = \lambda_{\max}(A)$ and $\mu = \lambda_{\min}(A)$.

Connection to Davis-Kahan

The condition number connects to spectral perturbation theory. For Hermitian (or symmetric) matrices, the Davis-Kahan $\sin\Theta$ theorem says that the perturbation of eigenvectors depends on the eigenvalue gap: if two eigenvalues are close, the corresponding eigenvectors are sensitive to perturbation, and the bound on the angle between perturbed and unperturbed eigenspaces is inversely proportional to the gap. The eigenvalue-problem condition number $1/\mathrm{gap}$ governs eigenvector sensitivity in this Hermitian setting.

For non-normal matrices (the typical case for ML Jacobians, transition matrices, and learned linear maps), Davis-Kahan does not apply directly. Eigenvectors can be extremely ill-conditioned even when eigenvalues are well-separated, because:

  • the left and right eigenvectors are different and may be nearly orthogonal,
  • the relevant sensitivity is governed by pseudospectra (Trefethen-Embree, Spectra and Pseudospectra, 2005) rather than the spectrum,
  • conditioning of the eigenvector basis $V$ in $A = V \Lambda V^{-1}$, namely $\kappa(V)$, enters every perturbation bound.

So: for symmetric / normal $A$, eigenvalue gap $\Rightarrow$ eigenvector stability via Davis-Kahan. For non-normal $A$, you also need pseudospectral / eigenvector-conditioning information; the spectral gap alone is not sufficient. This matters for analyzing optimization Jacobians (typically non-symmetric) and for transition matrices of non-reversible Markov chains.

Separately: large $\kappa(A) = \sigma_{\max}/\sigma_{\min}$ means $\sigma_{\min}$ is tiny relative to $\sigma_{\max}$, so $A$ is close to singular relative to its own scale. The right and left singular vectors corresponding to small singular values are highly sensitive to perturbation.

Preconditioning

If $\kappa(A)$ is large, you can often improve it by preconditioning: instead of solving $Ax = b$, solve $M^{-1} A x = M^{-1} b$ where $M$ is chosen so that $\kappa(M^{-1} A) \ll \kappa(A)$.

The ideal preconditioner is $M = A$ (giving $\kappa = 1$), but then applying $M^{-1}$ means inverting $A$, which is the original problem. Practical preconditioners approximate $A$ cheaply: diagonal (Jacobi), incomplete Cholesky, or multigrid methods.
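A minimal sketch of diagonal (Jacobi) preconditioning on a badly row-scaled system; the matrices here are hypothetical examples built so that the poor conditioning comes entirely from scaling:

```python
import numpy as np

rng = np.random.default_rng(2)
W = np.eye(4) + 0.1 * rng.normal(size=(4, 4))   # well-conditioned core
D = np.diag([1.0, 1e-3, 1e-6, 1e-9])
A = D @ W                                       # badly scaled: kappa(A) is huge

M_inv = np.diag(1.0 / np.diag(A))               # Jacobi preconditioner M = diag(A)
kappa_before = np.linalg.cond(A)
kappa_after = np.linalg.cond(M_inv @ A)

print(kappa_before, kappa_after)   # kappa drops by many orders of magnitude
```

When the ill-conditioning is caused by heterogeneous row or column scales, as here, a diagonal preconditioner already recovers most of the lost conditioning; structurally ill-conditioned matrices need stronger preconditioners.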

Canonical Examples

Example

The Hilbert matrix: notoriously ill-conditioned

The $n \times n$ Hilbert matrix $H_n$ has entries $H_{ij} = 1/(i+j-1)$. Its condition number grows exponentially with $n$:

n      κ(H_n)
5      ≈ 4.8 × 10^5
10     ≈ 1.6 × 10^13
15     ≈ 3.7 × 10^17

For $n = 15$, the condition number exceeds $10^{17}$, larger than $1/\epsilon_{\text{machine}}$ for double precision. Solving $H_{15} x = b$ in double precision gives nearly random answers, even though $H_{15}$ is mathematically invertible.
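The blow-up is easy to reproduce. A sketch that builds $H_n$ directly and solves a system whose exact answer is known:

```python
import numpy as np

def hilbert(n):
    """n x n Hilbert matrix: H[i, j] = 1/(i + j - 1) with 1-based indices."""
    i, j = np.indices((n, n))
    return 1.0 / (i + j + 1.0)   # zero-based: (i+1) + (j+1) - 1 = i + j + 1

for n in (5, 10):
    H = hilbert(n)
    x_true = np.ones(n)
    b = H @ x_true                       # right-hand side with known solution
    x = np.linalg.solve(H, b)
    rel_err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
    print(n, np.linalg.cond(H), rel_err)  # error grows with the condition number
```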

Example

Orthogonal matrices are perfectly conditioned

If $Q$ is an orthogonal matrix ($Q^T Q = I$), then all singular values are 1, so $\kappa(Q) = 1$. Orthogonal transformations preserve distances and do not amplify errors. This is why algorithms based on QR decomposition and orthogonal transformations are numerically stable: they avoid introducing ill-conditioning.

Example

Diagonal matrices make conditioning transparent

For a diagonal matrix $D = \mathrm{diag}(d_1, \ldots, d_n)$ with all $d_i > 0$:

$$\kappa(D) = \frac{\max_i d_i}{\min_i d_i}$$

If $D = \mathrm{diag}(1, 10^{-8})$, then $\kappa(D) = 10^8$. The second component of the solution to $Dx = b$ has errors amplified by $10^8$ relative to the first.
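The amplification in this $2 \times 2$ example, seen numerically (a minimal sketch):

```python
import numpy as np

D = np.diag([1.0, 1e-8])
print(np.linalg.cond(D))        # 1e8

b = np.array([1.0, 1.0])
x = np.linalg.solve(D, b)       # exact solution: (1, 1e8)
db = np.array([0.0, 1e-10])     # tiny absolute perturbation in b[1]
dx = np.linalg.solve(D, b + db) - x
print(dx[1])                    # close to 0.01: the perturbation grew by 1e8
```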

Common Confusions

Watch Out

Invertible does not mean well-conditioned

A matrix can be mathematically invertible ($\det(A) \neq 0$) but so ill-conditioned that numerical inversion is useless. The determinant is a terrible indicator of conditioning: a $100 \times 100$ identity matrix has $\det = 1$ and $\kappa = 1$, but the matrix $0.1 \cdot I_{100}$ has $\det = 10^{-100}$ (tiny!) and $\kappa = 1$ (perfect conditioning).
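The contrast above, checked in NumPy (using `slogdet` because $10^{-100}$ is awkward to handle directly):

```python
import numpy as np

A = 0.1 * np.eye(100)

sign, logdet = np.linalg.slogdet(A)   # log|det A|, robust against underflow
print(logdet / np.log(10.0))          # about -100: det(A) = 10^{-100}, "tiny"
print(np.linalg.cond(A))              # 1.0: perfectly conditioned
```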

Watch Out

Condition number is a property of the problem, not the algorithm

The condition number measures the intrinsic sensitivity of the mathematical problem $Ax = b$; it is not a property of your solver. A better algorithm can avoid adding unnecessary error (this is backward stability), but it cannot avoid the error amplification inherent to the problem. Gaussian elimination with partial pivoting is backward stable: it solves a nearby problem exactly. But the sensitivity to perturbations in that nearby problem is still governed by $\kappa(A)$.

Watch Out

Large condition number does not mean the matrix is nearly singular

It does, but only in a relative sense, and the distinction needs care. A matrix with $\kappa = 10^{10}$ has $\sigma_{\min}/\sigma_{\max} = 10^{-10}$, so it is "close to singular" relative to its own scale. But $\sigma_{\min}$ could still be $10^5$ in absolute terms if $\sigma_{\max} = 10^{15}$. The matrix is far from the zero matrix, yet close to the set of singular matrices relative to its own scale.

Condition Number and Gradient Descent Convergence

The condition number $\kappa = L/\mu$ of an optimization problem directly sets the convergence rate of gradient descent. For a function $f$ that is $\mu$-strongly convex and $L$-smooth, gradient descent with step size $\eta = 1/L$ satisfies:

$$f(x_t) - f^* \leq \left(1 - \frac{1}{\kappa}\right)^t \left[ f(x_0) - f^* \right]$$

To reach $\epsilon$-accuracy from initial suboptimality $f(x_0) - f^* = \Delta$, it suffices that the number of iterations satisfies:

$$t \geq \kappa \log\left(\frac{\Delta}{\epsilon}\right)$$

This is the $O(\kappa \log(1/\varepsilon))$ iteration bound. The full proof uses the descent lemma ($L$-smoothness gives a quadratic upper bound on $f$ at each step) and strong convexity (which prevents the iterates from stalling in flat regions). See gradient descent variants for the derivation.
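The roughly linear dependence on $\kappa$ shows up in a direct simulation. A sketch on the diagonal quadratic $f(x) = \frac{1}{2} x^T A x$ (the eigenvalue choices are arbitrary examples):

```python
import numpy as np

def gd_iterations(eigs, tol=1e-6):
    """Gradient descent with step 1/L on f(x) = 0.5 x^T A x, A = diag(eigs).
    Returns the number of steps until f drops below tol * f(x0)."""
    A = np.diag(np.asarray(eigs, dtype=float))
    L = max(eigs)
    x = np.ones(len(eigs))
    f = lambda v: 0.5 * v @ A @ v
    f0, t = f(x), 0
    while f(x) > tol * f0:
        x = x - (1.0 / L) * (A @ x)   # gradient of f is A x
        t += 1
    return t

t10 = gd_iterations([1.0, 10.0])      # kappa = 10
t100 = gd_iterations([1.0, 100.0])    # kappa = 100
print(t10, t100)                      # iteration count grows roughly like kappa
```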

For quadratic functions $f(x) = \frac{1}{2} x^T A x - b^T x$ with $A$ positive definite, the smoothness constant is $L = \lambda_{\max}(A)$ and the strong convexity constant is $\mu = \lambda_{\min}(A) > 0$, so $\kappa = \lambda_{\max}(A)/\lambda_{\min}(A)$: exactly the condition number of $A$. (For merely PSD $A$, e.g. a rank-deficient quadratic, $\lambda_{\min} = 0$: the function is convex but not strongly convex, and the optimization condition number is infinite. In that regime the classical $O(\kappa \log 1/\varepsilon)$ rates do not apply directly; analysis instead uses error-bound conditions or the Polyak-Łojasiewicz inequality.)

Why large $\kappa$ causes slow convergence, geometrically. The sublevel sets of $f$ are ellipsoids with axis lengths proportional to $1/\sqrt{\lambda_i}$. So the largest eigenvalue $\lambda_{\max}$ corresponds to the shortest, steepest axis, and $\lambda_{\min}$ to the longest, flattest axis: large eigenvalue = strong curvature = short axis. When $\kappa$ is large, the ellipsoid is elongated along the $\lambda_{\min}$ direction. The stable gradient-descent step size is constrained by the steep direction ($\eta \leq 1/L = 1/\lambda_{\max}$), and that same small step makes only $\eta \lambda_{\min} = \lambda_{\min}/\lambda_{\max} = 1/\kappa$ relative progress along the flat direction. The result is the characteristic zigzag pattern: oscillation across the steep narrow axis, glacial progress along the flat long axis, and an overall iteration count of $O(\kappa \log(1/\varepsilon))$.

Comparison with Newton's method. Newton's method applies the inverse Hessian at each step: $x_{t+1} = x_t - [\nabla^2 f(x_t)]^{-1} \nabla f(x_t)$. For strongly convex quadratics, this converges in one step regardless of $\kappa$ (affine invariance). For general smooth strongly convex functions, quadratic convergence kicks in once the iterates enter a neighborhood of $x^*$. The price: each step requires solving a $d \times d$ linear system, costing $O(d^3)$ in general or $O(d^2)$ with structure, compared to $O(d)$ per gradient-descent step.

Preconditioning as approximate Newton. Preconditioning multiplies the gradient by an approximation $M^{-1} \approx [\nabla^2 f]^{-1}$. The effective condition number becomes $\kappa(M^{-1/2} A M^{-1/2})$. If $M$ captures the dominant scale variation in $A$, the preconditioned condition number is much smaller, and convergence accelerates. Diagonal preconditioning (as in Adam's adaptive step sizes) approximates the Hessian diagonal and reduces the effective $\kappa$ per coordinate.
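The symmetric form $\kappa(M^{-1/2} A M^{-1/2})$ can be checked with a Jacobi (diagonal) preconditioner; the badly scaled SPD matrix below is a hypothetical example:

```python
import numpy as np

rng = np.random.default_rng(3)
B = np.eye(4) + 0.1 * rng.normal(size=(4, 4))
S = np.diag([1.0, 1e-2, 1e-4, 1e-6])
A = S @ (B @ B.T) @ S            # SPD, with wildly different scales per coordinate

M_half_inv = np.diag(1.0 / np.sqrt(np.diag(A)))   # M = diag(A), applied symmetrically
A_pre = M_half_inv @ A @ M_half_inv               # preserves symmetry, unlike M^{-1} A

kappa_raw = np.linalg.cond(A)
kappa_pre = np.linalg.cond(A_pre)
print(kappa_raw, kappa_pre)      # huge vs. modest effective condition number
```

Symmetric preconditioning $M^{-1/2} A M^{-1/2}$ keeps the matrix SPD, so the strong-convexity analysis above still applies to the preconditioned problem.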

Summary

  • $\kappa(A) = \sigma_{\max}/\sigma_{\min}$: ratio of largest to smallest singular value
  • The condition number bounds error amplification: a relative perturbation $\epsilon$ in the input becomes at most $\kappa \cdot \epsilon$ in the output
  • $\kappa \approx 10^k$ means you lose roughly $k$ digits of accuracy
  • Invertibility is binary; conditioning is a spectrum
  • In optimization, $\kappa = L/\mu$ controls the convergence rate of gradient descent
  • Preconditioning reduces $\kappa$ to make problems tractable
  • The Hilbert matrix is the canonical example of catastrophic ill-conditioning

Exercises

ExerciseCore

Problem

Compute the condition number (2-norm) of the matrix:

$$A = \begin{pmatrix} 1 & 1 \\ 0 & 10^{-6} \end{pmatrix}$$

What does this tell you about solving $Ax = b$ in floating point?

ExerciseAdvanced

Problem

Prove that for any nonsingular matrix $A$:

$$\kappa(A) = \max_{\delta A} \frac{\|\delta x\| / \|x\|}{\|\delta A\| / \|A\|}$$

where $Ax = b$ and $(A + \delta A)(x + \delta x) = b$ (i.e., the right-hand side is fixed and the matrix is perturbed). This shows the condition number is the exact worst-case amplification.

References

Canonical:

  • Trefethen & Bau, Numerical Linear Algebra (1997), Lectures 12-15 (conditioning, perturbation bounds, SVD connection)
  • Golub & Van Loan, Matrix Computations (2013), Chapter 2 (condition number, perturbation theory for linear systems)
  • Demmel, Applied Numerical Linear Algebra (1997), Chapters 1-2 (conditioning, backward error, perturbation theory)

Optimization:

  • Boyd & Vandenberghe, Convex Optimization (2004), Section 9.3.1 (condition number and gradient descent convergence)
  • Nesterov, Introductory Lectures on Convex Optimization (2004), Chapter 2 (smoothness and strong convexity constants)

Numerical:

  • Higham, Accuracy and Stability of Numerical Algorithms (2002), Chapters 1-2 and 7 (backward error, condition numbers, iterative refinement)
  • Strang, Linear Algebra and Its Applications (2006), Section 7.3 (condition numbers and numerical stability)

Next Topics

The natural next step from conditioning:

  • Numerical linear algebra basics: algorithms (LU, QR, SVD) and their stability properties, building on the conditioning framework established here

Last reviewed: April 26, 2026
