Numerical Optimization
Conditioning and Condition Number
The condition number measures how sensitive a problem is to perturbations in its input. Ill-conditioned matrices turn small errors into catastrophic ones, and understanding conditioning is essential for any computation involving linear algebra.
Prerequisites
Why This Matters
Every numerical computation in machine learning involves linear algebra: solving linear systems, computing eigendecompositions, inverting matrices, running gradient descent. All of these are affected by the conditioning of the problem. A matrix can be perfectly invertible in exact arithmetic and completely useless in floating point. The condition number tells you when to trust your computation and when to panic.
When a matrix inversion produces garbage, or a linear system solver returns wildly wrong answers, or gradient descent oscillates instead of converging, the condition number is almost always the explanation. Understanding numerical stability requires understanding conditioning first.
Mental Model
Think of the condition number $\kappa$ as an amplification factor for errors. If you perturb the input to a problem by a relative amount $\varepsilon$, the output changes by at most $\kappa \varepsilon$ (relative). A condition number of $10^{10}$ means that 16 digits of floating point precision lose 10 digits of accuracy, leaving you with 6 meaningful digits at best, and often fewer.
Core Definitions
Condition Number of a Matrix
For a nonsingular matrix $A$, the condition number with respect to the 2-norm is:

$$\kappa_2(A) = \|A\|_2 \, \|A^{-1}\|_2 = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$$

where $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ are the largest and smallest singular values of $A$, respectively. By convention, $\kappa_2(A) \ge 1$ for any nonsingular matrix, with $\kappa_2(A) = 1$ only for scalar multiples of orthogonal matrices.
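As a quick sanity check of the definition, the ratio of extreme singular values can be computed directly with NumPy. The matrix below is a hypothetical example, not one from the text:

```python
import numpy as np

# Hypothetical example matrix (not from the text).
A = np.array([[3.0, 0.0],
              [0.0, 0.001]])

# kappa_2(A) = sigma_max / sigma_min, computed from the singular values.
sigma = np.linalg.svd(A, compute_uv=False)  # returned in descending order
kappa = sigma[0] / sigma[-1]

print(kappa)                  # 3000.0 (approximately)
print(np.linalg.cond(A, 2))   # agrees with the SVD-based value
```

`np.linalg.cond(A, 2)` computes the same quantity internally via the SVD, so the two printed values coincide up to rounding.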
Ill-Conditioned Matrix
A matrix is called ill-conditioned when $\kappa(A)$ is large (typically $\kappa(A) \gtrsim 10^6$, though it depends on context). There is no universal threshold, but as a rule of thumb: if $\kappa(A) \approx 10^k$, you lose roughly $k$ digits of accuracy when solving $Ax = b$ in floating point. In double precision (about 16 digits), $\kappa(A) \gtrsim 10^{16}$ means you may get zero correct digits.
Well-Conditioned Matrix
A matrix is well-conditioned when $\kappa(A)$ is small: close to 1, or at least moderate (say, $\kappa(A) \lesssim 10^2$). Computations involving well-conditioned matrices are numerically reliable.
The Fundamental Perturbation Bound
Relative Perturbation Bound for Ax = b
Statement
Let $Ax = b$ with $A$ nonsingular, and let $A(x + \delta x) = b + \delta b$. Then:

$$\frac{\|\delta x\|}{\|x\|} \le \kappa(A)\,\frac{\|\delta b\|}{\|b\|}$$

More generally, if both $A$ and $b$ are perturbed, then to first order:

$$\frac{\|\delta x\|}{\|x\|} \le \kappa(A)\left(\frac{\|\delta A\|}{\|A\|} + \frac{\|\delta b\|}{\|b\|}\right)$$
Intuition
The condition number is the worst-case amplification factor. A 1% error in $b$ can become a $\kappa(A)$% error in $x$. If $\kappa(A) = 10^8$, then a relative error of $10^{-8}$ in $b$ can be amplified to a relative error of order 1 in $x$, leaving no accuracy at all. The bound is tight: there exist perturbations that achieve equality.
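A minimal numerical sketch of this amplification, using a hypothetical diagonal system with $\kappa = 10^6$ (the matrix, perturbation size, and seed are all arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ill-conditioned system with kappa = 1e6.
A = np.array([[1.0, 0.0],
              [0.0, 1e-6]])
x_true = np.array([1.0, 1.0])
b = A @ x_true
kappa = np.linalg.cond(A, 2)  # 1e6

# Perturb the right-hand side slightly and re-solve.
db = 1e-8 * rng.standard_normal(2)
x_pert = np.linalg.solve(A, b + db)

rel_in = np.linalg.norm(db) / np.linalg.norm(b)
rel_out = np.linalg.norm(x_pert - x_true) / np.linalg.norm(x_true)

# The bound: relative output error <= kappa * relative input error.
print(rel_out <= kappa * rel_in)  # True
print(rel_out / rel_in)           # observed amplification, up to ~1e6
```

The observed amplification depends on the direction of the perturbation; the bound $\kappa \cdot \text{rel\_in}$ is the worst case over all directions.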
Proof Sketch
From $A\,\delta x = \delta b$, we get $\delta x = A^{-1}\delta b$, so $\|\delta x\| \le \|A^{-1}\|\,\|\delta b\|$. Also, $b = Ax$, so $\|b\| \le \|A\|\,\|x\|$, i.e. $\|x\| \ge \|b\|/\|A\|$. Combining:

$$\frac{\|\delta x\|}{\|x\|} \le \frac{\|A^{-1}\|\,\|\delta b\|}{\|b\|/\|A\|} = \kappa(A)\,\frac{\|\delta b\|}{\|b\|}$$
Why It Matters
This bound is why the condition number is the central quantity in numerical linear algebra. It tells you the intrinsic difficulty of the problem $Ax = b$, independent of the algorithm used to solve it. No algorithm can do better than this bound: an ill-conditioned problem is inherently sensitive to errors, no matter how clever your solver.
Failure Mode
The bound gives the worst case. For a specific perturbation $\delta b$, the actual error amplification may be much less than $\kappa(A)$. The condition number is pessimistic but never optimistic: it guarantees things cannot be worse, but the typical case may be better. Also, the condition number depends on the norm used; the 2-norm condition number is most common, but $\kappa_1$ and $\kappa_\infty$ are also used.
The Singular Value Connection
The condition number has a clean geometric interpretation. The matrix $A$ maps the unit sphere to an ellipsoid whose semi-axes have lengths $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n$.
The condition number is the aspect ratio of this ellipsoid: the ratio of the longest axis to the shortest. When $\kappa(A)$ is large, the ellipsoid is extremely elongated: $A$ amplifies some directions enormously while compressing others. Inverting $A$ then amplifies the compressed directions, blowing up any errors in those components.
For symmetric positive definite matrices, the singular values equal the eigenvalues, so:

$$\kappa_2(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}$$
Connection to Optimization
Condition Number of an Optimization Problem
For a function $f$ that is $\mu$-strongly convex and $L$-smooth (i.e., $\mu I \preceq \nabla^2 f(x) \preceq L I$ everywhere), the condition number of the optimization problem is:

$$\kappa = \frac{L}{\mu}$$
Gradient descent on such a function converges at rate $(1 - 1/\kappa)$ per iteration. Large $\kappa$ means slow convergence: the function has a narrow, elongated valley that gradient descent zigzags across.
For minimizing $f(x) = \tfrac{1}{2}x^T A x - b^T x$ with $A$ symmetric positive definite (the quadratic case), the optimization condition number is exactly $\kappa_2(A)$. The Hessian is $A$, with $L = \lambda_{\max}(A)$ and $\mu = \lambda_{\min}(A)$.
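A small sketch verifying that the two notions of condition number coincide for a quadratic; the SPD matrix here is a made-up example:

```python
import numpy as np

# Hypothetical SPD Hessian for the quadratic f(x) = 0.5 * x^T A x - b^T x.
A = np.array([[10.0, 0.0],
              [0.0, 0.5]])

eigs = np.linalg.eigvalsh(A)       # eigenvalues of the Hessian
L, mu = eigs.max(), eigs.min()     # smoothness / strong convexity constants
kappa_opt = L / mu                 # optimization condition number L / mu
kappa_mat = np.linalg.cond(A, 2)   # matrix condition number sigma_max / sigma_min

print(kappa_opt, kappa_mat)  # both equal 20.0 for SPD A
```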
Connection to Davis-Kahan
The condition number connects to spectral perturbation theory. For Hermitian (or symmetric) matrices, the Davis-Kahan theorem says that the perturbation of eigenvectors depends on the eigenvalue gap: if two eigenvalues are close, the corresponding eigenvectors are sensitive to perturbation, and the bound on the angle between perturbed and unperturbed eigenspaces is inversely proportional to the gap. The eigenvalue-problem condition number governs eigenvector sensitivity in this Hermitian setting.
For non-normal matrices (the typical case for ML Jacobians, transition matrices, and learned linear maps), Davis-Kahan does not apply directly. Eigenvectors can be extremely ill-conditioned even when eigenvalues are well-separated, because:
- the left and right eigenvectors are different and may be nearly orthogonal,
- the relevant sensitivity is governed by pseudospectra (Trefethen-Embree, Spectra and Pseudospectra, 2005) rather than the spectrum,
- the conditioning $\kappa(V)$ of the eigenvector basis $V$ in the diagonalization $A = V \Lambda V^{-1}$ enters every perturbation bound.
So: for symmetric / normal $A$, an eigenvalue gap implies eigenvector stability via Davis-Kahan. For non-normal $A$, you also need pseudospectral / eigenvector-conditioning information; the spectral gap alone is not sufficient. This matters for analyzing optimization Jacobians (typically non-symmetric) and for transition matrices of non-reversible Markov chains.
Separately: large $\kappa_2(A)$ means $\sigma_{\min}(A)$ is close to zero relative to $\sigma_{\max}(A)$, so $A$ is close to singular at its own scale. The right and left singular vectors corresponding to small singular values are highly sensitive to perturbation.
Preconditioning
If $\kappa(A)$ is large, you can often improve it by preconditioning: instead of solving $Ax = b$, solve $M^{-1}Ax = M^{-1}b$, where $M$ is chosen so that $\kappa(M^{-1}A) \ll \kappa(A)$.
The ideal preconditioner is $M = A$ (giving $\kappa(M^{-1}A) = 1$), but then applying $M^{-1}$ means inverting $A$, which is the original problem. Practical preconditioners approximate $A$ cheaply: diagonal (Jacobi), incomplete Cholesky, or multigrid methods.
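A sketch of diagonal (Jacobi) preconditioning on a hypothetical badly scaled SPD matrix; the symmetric form $M^{-1/2} A M^{-1/2}$ is used so the preconditioned matrix stays symmetric:

```python
import numpy as np

# Hypothetical badly scaled SPD matrix.
A = np.array([[1e6, 2.0],
              [2.0, 1.0]])
print(np.linalg.cond(A))  # ~1e6

# Jacobi preconditioning with M = diag(A), applied symmetrically.
d = np.sqrt(np.diag(A))
A_prec = A / np.outer(d, d)  # M^{-1/2} A M^{-1/2}
print(np.linalg.cond(A_prec))  # ~1.004: near-perfect conditioning
```

Here the ill-conditioning came purely from row/column scaling, which is exactly the case diagonal preconditioning fixes; it helps much less when the ill-conditioning comes from near-linear-dependence between rows.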
Canonical Examples
The Hilbert matrix: notoriously ill-conditioned
The $n \times n$ Hilbert matrix $H_n$ has entries $(H_n)_{ij} = \frac{1}{i + j - 1}$. Its condition number grows exponentially in $n$:

| $n$ | $\kappa_2(H_n)$ |
| --- | --- |
| 5 | $\approx 4.8 \times 10^{5}$ |
| 10 | $\approx 1.6 \times 10^{13}$ |
| 15 | $\gtrsim 10^{17}$ |

For $n = 15$, the condition number exceeds $1/\epsilon_{\text{mach}} \approx 10^{16}$ for double precision. Solving $H_n x = b$ in double precision gives nearly random answers, even though $H_n$ is mathematically invertible.
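The effect is easy to reproduce. This sketch builds the Hilbert matrix directly and solves a system whose exact solution is the all-ones vector, so the error is directly observable:

```python
import numpy as np

def hilbert(n):
    """Hilbert matrix: H[i, j] = 1 / (i + j + 1) with 0-based i, j."""
    idx = np.arange(n)
    return 1.0 / (idx[:, None] + idx[None, :] + 1.0)

results = {}
for n in (5, 10, 13):
    H = hilbert(n)
    x_true = np.ones(n)
    b = H @ x_true            # right-hand side with known exact solution
    x = np.linalg.solve(H, b)
    err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
    results[n] = (np.linalg.cond(H), err)
    print(n, results[n])
# The relative error grows with kappa: many correct digits at n = 5,
# only a few at n = 10, and essentially none by n = 13.
```

Note that for large $n$ even the computed condition number is unreliable, since it is itself a double-precision computation on an ill-conditioned matrix.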
Orthogonal matrices are perfectly conditioned
If $Q$ is an orthogonal matrix ($Q^T Q = I$), then all singular values of $Q$ are 1, so $\kappa_2(Q) = 1$. Orthogonal transformations preserve distances and do not amplify errors. This is why algorithms based on QR decomposition and orthogonal transformations are numerically stable: they avoid introducing ill-conditioning.
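This is easy to confirm numerically with a random orthogonal matrix obtained from a QR factorization (the size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# Random orthogonal matrix: Q factor of a Gaussian random matrix.
Q, _ = np.linalg.qr(rng.standard_normal((50, 50)))
print(np.linalg.cond(Q, 2))  # 1.0 up to rounding error
```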
Diagonal matrices make conditioning transparent
For a diagonal matrix $D = \mathrm{diag}(d_1, \dots, d_n)$ with all $d_i \neq 0$:

$$\kappa_2(D) = \frac{\max_i |d_i|}{\min_i |d_i|}$$

If $D = \mathrm{diag}(1, 10^{-8})$, then $\kappa_2(D) = 10^8$. The second component of the solution to $Dx = b$ has errors amplified by $10^8$ relative to the first.
Common Confusions
Invertible does not mean well-conditioned
A matrix can be mathematically invertible ($\det A \neq 0$) but so ill-conditioned that numerical inversion is useless. The determinant is a terrible indicator of conditioning: the $100 \times 100$ identity matrix has $\det = 1$ and $\kappa = 1$, but the matrix $0.1\,I_{100}$ has $\det = 10^{-100}$ (tiny!) and $\kappa = 1$ (perfect conditioning).
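A sketch contrasting the two quantities on the scaled identity (using `slogdet` so the astronomically small determinant is handled in log space):

```python
import numpy as np

n = 100
A = 0.1 * np.eye(n)  # scaled identity: tiny determinant, perfect conditioning

sign, logdet = np.linalg.slogdet(A)  # det(A) = 10**-100, so work with log-det
print(logdet / np.log(10))           # -100.0: det is astronomically small
print(np.linalg.cond(A, 2))          # 1.0: perfectly conditioned
```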
Condition number is a property of the problem, not the algorithm
The condition number measures the intrinsic sensitivity of the mathematical problem $Ax = b$. It is not a property of your solver. A better algorithm can avoid adding unnecessary error (this is backward stability), but it cannot avoid the error amplification that is inherent to the problem. Gaussian elimination with partial pivoting is backward stable: it solves a nearby problem exactly. But the sensitivity to perturbations of that nearby problem is still governed by $\kappa(A)$.
Large condition number does not mean the matrix is nearly singular
Well, it does. But the converse needs care. A matrix with $\kappa_2(A) = 10^{12}$ has $\sigma_{\min} = 10^{-12}\,\sigma_{\max}$, so it is "close to singular" in a relative sense. But $\sigma_{\min}$ could still be $1$ if $\sigma_{\max} = 10^{12}$. The matrix is far from the zero matrix but close to the set of singular matrices relative to its own scale.
Condition Number and Gradient Descent Convergence
The condition number of an optimization problem directly sets the convergence rate of gradient descent. For a function $f$ that is $\mu$-strongly convex and $L$-smooth, gradient descent with step size $\eta = 1/L$ satisfies:

$$f(x_k) - f^* \le \left(1 - \frac{\mu}{L}\right)^k \left(f(x_0) - f^*\right)$$

To reach $\epsilon$-accuracy from initial suboptimality $f(x_0) - f^*$, the number of iterations needed is:

$$k = O\!\left(\kappa \log \frac{f(x_0) - f^*}{\epsilon}\right)$$
This is the $O(\kappa \log(1/\epsilon))$ iteration bound. The full proof uses the descent lemma ($L$-smoothness gives a quadratic upper bound on $f$ at each step) and strong convexity (which prevents the iterates from stalling in flat regions). See gradient descent variants for the derivation.
For quadratic functions $f(x) = \tfrac{1}{2}x^T A x - b^T x$ with $A$ positive definite, the smoothness constant is $L = \lambda_{\max}(A)$ and the strong convexity constant is $\mu = \lambda_{\min}(A)$, so $\kappa = \lambda_{\max}(A)/\lambda_{\min}(A) = \kappa_2(A)$: exactly the condition number of $A$. (For merely PSD $A$, e.g. a rank-deficient quadratic, $\mu = 0$: the function is convex but not strongly convex, and the optimization condition number is infinite. In that regime classical rates do not apply directly; analysis instead uses error-bound conditions or the Polyak-Łojasiewicz inequality.)
Why large $\kappa$ causes slow convergence, geometrically. The sublevel sets of $f$ are ellipsoids with axis lengths proportional to $1/\sqrt{\lambda_i}$. So the largest eigenvalue $\lambda_{\max}$ corresponds to the shortest, steepest axis, and $\lambda_{\min}$ to the longest, flattest axis: large eigenvalue = strong curvature = short axis. When $\kappa$ is large, the ellipsoid is elongated along the $\lambda_{\min}$ direction. The stable gradient-descent step size is constrained by the steep direction ($\eta \lesssim 1/\lambda_{\max}$), and that same small step makes only $O(1/\kappa)$ relative progress along the flat direction. The result is the characteristic zigzag pattern: oscillation across the steep narrow axis, glacial progress along the flat long axis, and an overall iteration count of $O(\kappa \log(1/\epsilon))$.
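The linear dependence of iteration count on $\kappa$ can be checked with a few lines of gradient descent on a diagonal quadratic. This is a toy setup: $\mu = 1$, $L = \kappa$, and the tolerance and starting point are arbitrary choices:

```python
import numpy as np

def gd_iterations(kappa, tol=1e-3):
    """Gradient descent with step 1/L on f(x) = 0.5*(x0**2 + kappa*x1**2),
    i.e. mu = 1, L = kappa, started at (1, 1), run until ||x|| <= tol."""
    hess_diag = np.array([1.0, kappa])  # diagonal Hessian
    eta = 1.0 / kappa                   # stable step size 1/L
    x = np.array([1.0, 1.0])
    k = 0
    while np.linalg.norm(x) > tol:
        x = x - eta * hess_diag * x     # gradient of f is diag(1, kappa) @ x
        k += 1
    return k

counts = [gd_iterations(k) for k in (10, 100, 1000)]
print(counts)  # iteration count grows roughly linearly with kappa
```

Each tenfold increase in $\kappa$ costs roughly ten times as many iterations: the steep coordinate is killed immediately by the $1/L$ step, while the flat coordinate contracts only by $(1 - 1/\kappa)$ per step.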
Comparison with Newton's method. Newton's method applies the inverse Hessian at each step: $x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$. For strongly convex quadratics, this converges in one step regardless of $\kappa$ (affine invariance). For general smooth strongly convex functions, quadratic convergence kicks in once iterates enter a neighborhood of $x^*$. The price: each step requires solving a linear system of size $n$, costing $O(n^3)$ in general (less when the Hessian has exploitable structure), compared to the cost of a single gradient evaluation per gradient descent step.
Preconditioning as approximate Newton. Preconditioning multiplies the gradient by an approximation $P \approx [\nabla^2 f]^{-1}$. The effective condition number becomes $\kappa(P\,\nabla^2 f)$. If $P$ captures the dominant scale variation in $f$, the preconditioned condition number is much smaller, and convergence accelerates. Diagonal preconditioning (as in Adam's adaptive step sizes) approximates the Hessian diagonal and reduces the effective $\kappa$ per coordinate.
Summary
- $\kappa_2(A) = \sigma_{\max}(A)/\sigma_{\min}(A)$: ratio of largest to smallest singular value
- The condition number bounds error amplification: a relative perturbation $\varepsilon$ in the input becomes at most $\kappa \varepsilon$ in the output
- $\kappa \approx 10^k$ means you lose roughly $k$ digits of accuracy
- Invertibility is binary; conditioning is a spectrum
- In optimization, $\kappa = L/\mu$ controls the convergence rate of gradient descent
- Preconditioning reduces $\kappa$ to make problems tractable
- The Hilbert matrix is the canonical example of catastrophic ill-conditioning
Exercises
Problem
Compute the condition number (2-norm) of the matrix:
What does this tell you about solving in floating point?
Problem
Prove that for any nonsingular matrix $A$:

$$\sup_{\delta A \neq 0} \frac{\|\delta x\| / \|x + \delta x\|}{\|\delta A\| / \|A\|} = \kappa(A)$$

where $Ax = b$, $(A + \delta A)(x + \delta x) = b$ (i.e., the right-hand side is fixed and the matrix is perturbed). This shows the condition number is the exact worst-case amplification.
References
Canonical:
- Trefethen & Bau, Numerical Linear Algebra (1997), Lectures 12-15 (conditioning, perturbation bounds, SVD connection)
- Golub & Van Loan, Matrix Computations (2013), Chapter 2 (condition number, perturbation theory for linear systems)
- Demmel, Applied Numerical Linear Algebra (1997), Chapters 1-2 (conditioning, backward error, perturbation theory)
Optimization:
- Boyd & Vandenberghe, Convex Optimization (2004), Section 9.3.1 (condition number and gradient descent convergence)
- Nesterov, Introductory Lectures on Convex Optimization (2004), Chapter 2 (smoothness and strong convexity constants)
Numerical:
- Higham, Accuracy and Stability of Numerical Algorithms (2002), Chapters 1-2 and 7 (backward error, condition numbers, iterative refinement)
- Strang, Linear Algebra and Its Applications (2006), Section 7.3 (condition numbers and numerical stability)
Next Topics
The natural next step from conditioning:
- Numerical linear algebra basics: algorithms (LU, QR, SVD) and their stability properties, building on the conditioning framework established here
Last reviewed: April 26, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Eigenvalues and Eigenvectors (layer 0A · tier 1)
- Matrix Norms (layer 0A · tier 1)
- Matrix Operations and Properties (layer 0A · tier 1)
- Singular Value Decomposition (layer 0A · tier 1)
- Numerical Stability and Conditioning (layer 1 · tier 1)
Derived topics
No published topic currently declares this as a prerequisite.