
Numerical Optimization

Conditioning and Condition Number

The condition number measures how sensitive a problem is to perturbations in its input. Ill-conditioned matrices turn small errors into catastrophic ones, and understanding conditioning is essential for any computation involving linear algebra.


Why This Matters

Every numerical computation in machine learning, from solving linear systems and computing eigendecompositions to inverting matrices and running gradient descent, is affected by the conditioning of the problem. A matrix can be perfectly invertible in exact arithmetic and completely useless in floating point. The condition number tells you when to trust your computation and when to panic.

When a matrix inversion produces garbage, or a linear system solver returns wildly wrong answers, or gradient descent oscillates instead of converging, the condition number is almost always the explanation. Understanding numerical stability requires understanding conditioning first.

[Interactive figure: Error Amplification Lens. The condition number tracks how a small relative perturbation in the data can expand into a much larger relative perturbation in the solution. A well-scaled input error can still explode after inversion when one singular direction is tiny: a large κ means the problem is intrinsically sensitive before any algorithmic instability shows up.]

Mental Model

Think of the condition number as an amplification factor for errors. If you perturb the input to a problem by a relative amount $\epsilon$, the output changes by at most $\kappa \cdot \epsilon$ (relative). A condition number of $10^{10}$ means that 16 digits of floating-point precision lose 10 digits of accuracy, leaving you with 6 meaningful digits at best, and often fewer.

Core Definitions

Definition

Condition Number of a Matrix

For a nonsingular matrix $A \in \mathbb{R}^{n \times n}$, the condition number with respect to the 2-norm is:

$$\kappa(A) = \|A\|_2 \, \|A^{-1}\|_2 = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$$

where $\sigma_{\max}$ and $\sigma_{\min}$ are the largest and smallest singular values of $A$, respectively. Note that $\kappa(A) \geq 1$ for any nonsingular matrix, since $\|A\| \, \|A^{-1}\| \geq \|A A^{-1}\| = \|I\| = 1$, with equality only for scalar multiples of orthogonal matrices.
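This definition is easy to verify numerically. A minimal NumPy sketch (the 2×2 matrix is an arbitrary example):

```python
import numpy as np

# An arbitrary nonsingular example matrix.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# kappa(A) = sigma_max / sigma_min, computed from the singular values.
sigma = np.linalg.svd(A, compute_uv=False)   # singular values, descending
kappa_svd = sigma[0] / sigma[-1]

# np.linalg.cond uses the same 2-norm definition by default.
kappa_np = np.linalg.cond(A)

print(kappa_svd, kappa_np)   # the two values agree
```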

Definition

Ill-Conditioned Matrix

A matrix is called ill-conditioned when $\kappa(A)$ is large (typically $\kappa(A) \gg 1$). There is no universal threshold, but as a rule of thumb: if $\kappa(A) \approx 10^k$, you lose roughly $k$ digits of accuracy when solving $Ax = b$ in floating point. In double precision (about 16 digits), $\kappa(A) = 10^{16}$ means you may get zero correct digits.

Definition

Well-Conditioned Matrix

A matrix is well-conditioned when $\kappa(A)$ is small: close to 1, or at least moderate (say, $\kappa(A) \leq 10^4$). Computations involving well-conditioned matrices are numerically reliable.

The Fundamental Perturbation Bound

Theorem

Relative Perturbation Bound for Ax = b

Statement

Let $Ax = b$ with $A$ nonsingular, and let $A(x + \delta x) = b + \delta b$. Then:

$$\frac{\|\delta x\|}{\|x\|} \leq \kappa(A) \, \frac{\|\delta b\|}{\|b\|}$$

More generally, if both $A$ and $b$ are perturbed:

$$\frac{\|\delta x\|}{\|x + \delta x\|} \leq \kappa(A) \left( \frac{\|\delta A\|}{\|A\|} + \frac{\|\delta b\|}{\|b\|} \right)$$

Intuition

The condition number is the worst-case amplification factor. A 1% error in $b$ can become a $\kappa(A)\%$ error in $x$. If $\kappa(A) = 10^8$, a relative error of $10^{-8}$ in $b$ can be amplified to a relative error of order 1 in $x$, leaving no accuracy at all. The bound is tight: there exist perturbations that achieve equality.

Proof Sketch

From $A \, \delta x = \delta b$, we get $\delta x = A^{-1} \delta b$, so $\|\delta x\| \leq \|A^{-1}\| \, \|\delta b\|$. Also, $\|b\| = \|Ax\| \leq \|A\| \, \|x\|$, so $1/\|x\| \leq \|A\|/\|b\|$. Combining:

$$\frac{\|\delta x\|}{\|x\|} \leq \|A^{-1}\| \, \|\delta b\| \cdot \frac{\|A\|}{\|b\|} = \kappa(A) \, \frac{\|\delta b\|}{\|b\|}$$
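The proof also shows when the bound is attained: take $b$ along the top output singular direction and the perturbation along the bottom one. A sketch of that worst case (the 2×2 matrix is an arbitrary ill-conditioned example):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1e-3]])   # arbitrary example; kappa is on the order of 10^3
U, s, Vt = np.linalg.svd(A)
kappa = s[0] / s[-1]

b = U[:, 0]                   # b along the largest-singular-value output direction
db = 1e-8 * U[:, -1]          # perturbation along the smallest one

x = np.linalg.solve(A, b)
dx = np.linalg.solve(A, b + db) - x

# Realized amplification: (||dx||/||x||) / (||db||/||b||)
amp = (np.linalg.norm(dx) / np.linalg.norm(x)) / (np.linalg.norm(db) / np.linalg.norm(b))
print(amp, kappa)             # amp essentially equals kappa: the bound is tight
```

For a generic right-hand side and perturbation the realized amplification is usually far below $\kappa$; this adversarial alignment is what makes the bound sharp.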

Why It Matters

This bound is why the condition number is the central quantity in numerical linear algebra. It tells you the intrinsic difficulty of the problem $Ax = b$, independent of the algorithm used to solve it. No algorithm can do better than this bound: an ill-conditioned problem is inherently sensitive to errors, no matter how clever your solver.

Failure Mode

The bound gives the worst case. For a specific perturbation $\delta b$, the actual error amplification may be much less than $\kappa(A)$. The condition number is pessimistic but never optimistic: it guarantees things cannot be worse, but the typical case may be better. Also, the condition number depends on the norm used; the 2-norm condition number is most common, but $\kappa_1$ and $\kappa_\infty$ are also used.

The Singular Value Connection

The condition number $\kappa(A) = \sigma_{\max}/\sigma_{\min}$ has a clean geometric interpretation. The matrix $A$ maps the unit sphere to an ellipsoid whose semi-axes have lengths $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n > 0$.

The condition number is the aspect ratio of this ellipsoid: the ratio of the longest axis to the shortest. When $\kappa$ is large, the ellipsoid is extremely elongated: $A$ amplifies some directions enormously while compressing others. Inverting $A$ then amplifies the compressed directions, blowing up any errors in those components.

For symmetric positive definite matrices, the singular values equal the eigenvalues, so:

$$\kappa(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}$$
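This identity can be checked directly; a sketch using a randomly constructed SPD matrix as a hypothetical example:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(5, 5))
A = B @ B.T + np.eye(5)        # symmetric positive definite by construction

eigs = np.linalg.eigvalsh(A)   # sorted ascending, all positive for SPD A
ratio = eigs[-1] / eigs[0]     # lambda_max / lambda_min
cond_2norm = np.linalg.cond(A)

print(ratio, cond_2norm)       # the two values agree
```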

Connection to Optimization

Definition

Condition Number of an Optimization Problem

For a function $f$ that is $\mu$-strongly convex and $L$-smooth (i.e., $\mu I \preceq \nabla^2 f(x) \preceq L I$ everywhere), the condition number of the optimization problem is:

$$\kappa = \frac{L}{\mu}$$

Gradient descent on such a function contracts the suboptimality by a factor $(1 - 1/\kappa)$ per iteration, i.e. like $(1 - 1/\kappa)^t$ after $t$ steps. Large $\kappa$ means slow convergence: the function has a narrow, elongated valley that gradient descent zigzags across.

For minimizing $f(x) = \frac{1}{2} x^T A x - b^T x$ (the quadratic case, with $A$ symmetric positive definite), the optimization condition number is exactly $\kappa(A)$. The Hessian is $A$, with $L = \lambda_{\max}(A)$ and $\mu = \lambda_{\min}(A)$.

Connection to Davis-Kahan

The condition number connects to spectral perturbation theory. For Hermitian (or symmetric) matrices, the Davis-Kahan $\sin\Theta$ theorem says that the perturbation of eigenvectors depends on the eigenvalue gap: if two eigenvalues are close, the corresponding eigenvectors are sensitive to perturbation, and the bound on the angle between perturbed and unperturbed eigenspaces is inversely proportional to the gap. The eigenvalue-problem condition number $1/\mathrm{gap}$ governs eigenvector sensitivity in this Hermitian setting.

For non-normal matrices (the typical case for ML Jacobians, transition matrices, and learned linear maps), Davis-Kahan does not apply directly. Eigenvectors can be extremely ill-conditioned even when eigenvalues are well-separated, because:

  • the left and right eigenvectors are different and may be nearly orthogonal,
  • the relevant sensitivity is governed by pseudospectra (Trefethen-Embree, Spectra and Pseudospectra, 2005) rather than the spectrum,
  • conditioning of the eigenvector basis $V$ in $A = V \Lambda V^{-1}$, namely $\kappa(V)$, enters every perturbation bound.

So: for symmetric / normal $A$, eigenvalue gap $\Rightarrow$ eigenvector stability via Davis-Kahan. For non-normal $A$, you also need pseudospectral / eigenvector-conditioning information; the spectral gap alone is not sufficient. This matters for analyzing optimization Jacobians (typically non-symmetric) and for transition matrices of non-reversible Markov chains.

Separately: large $\kappa(A) = \sigma_{\max}/\sigma_{\min}$ means $\sigma_{\min}$ is tiny relative to $\sigma_{\max}$, so $A$ is close to singular relative to its own scale. The right and left singular vectors corresponding to small singular values are highly sensitive to perturbation.

Preconditioning

If $\kappa(A)$ is large, you can often improve it by preconditioning: instead of solving $Ax = b$, solve $M^{-1} A x = M^{-1} b$ where $M$ is chosen so that $\kappa(M^{-1} A) \ll \kappa(A)$.

The ideal preconditioner is $M = A$ (giving $\kappa = 1$), but then applying $M^{-1}$ means inverting $A$, which is the original problem. Practical preconditioners approximate $A$ cheaply: diagonal (Jacobi), incomplete Cholesky, or multigrid methods.
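A minimal sketch of diagonal (Jacobi) preconditioning on a badly row-scaled system; the matrices here are hypothetical examples built so that the poor conditioning comes entirely from scaling:

```python
import numpy as np

rng = np.random.default_rng(2)
W = np.eye(4) + 0.1 * rng.normal(size=(4, 4))   # well-conditioned core
D = np.diag([1.0, 1e-3, 1e-6, 1e-9])
A = D @ W                                       # badly scaled: kappa(A) is huge

M_inv = np.diag(1.0 / np.diag(A))               # Jacobi preconditioner M = diag(A)
kappa_before = np.linalg.cond(A)
kappa_after = np.linalg.cond(M_inv @ A)

print(kappa_before, kappa_after)   # kappa drops by many orders of magnitude
```

When the ill-conditioning is caused by heterogeneous row or column scales, as here, a diagonal preconditioner already recovers most of the lost conditioning; structurally ill-conditioned matrices need stronger preconditioners.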

Canonical Examples

Example

The Hilbert matrix: notoriously ill-conditioned

The $n \times n$ Hilbert matrix $H_n$ has entries $H_{ij} = 1/(i+j-1)$. Its condition number grows exponentially with $n$:

n      κ(H_n)
5      ≈ 4.8 × 10^5
10     ≈ 1.6 × 10^13
15     ≈ 3.7 × 10^17

For $n = 15$, the condition number exceeds $10^{17}$, larger than $1/\epsilon_{\text{machine}}$ for double precision. Solving $H_{15} x = b$ in double precision gives nearly random answers, even though $H_{15}$ is mathematically invertible.
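The blow-up is easy to reproduce. A sketch that builds $H_n$ directly and solves a system whose exact answer is known:

```python
import numpy as np

def hilbert(n):
    """n x n Hilbert matrix: H[i, j] = 1/(i + j - 1) with 1-based indices."""
    i, j = np.indices((n, n))
    return 1.0 / (i + j + 1.0)   # zero-based: (i+1) + (j+1) - 1 = i + j + 1

for n in (5, 10):
    H = hilbert(n)
    x_true = np.ones(n)
    b = H @ x_true                       # right-hand side with known solution
    x = np.linalg.solve(H, b)
    rel_err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
    print(n, np.linalg.cond(H), rel_err)  # error grows with the condition number
```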

Example

Orthogonal matrices are perfectly conditioned

If $Q$ is an orthogonal matrix ($Q^T Q = I$), then all singular values are 1, so $\kappa(Q) = 1$. Orthogonal transformations preserve distances and do not amplify errors. This is why algorithms based on QR decomposition and orthogonal transformations are numerically stable: they avoid introducing ill-conditioning.

Example

Diagonal matrices make conditioning transparent

For a diagonal matrix $D = \mathrm{diag}(d_1, \ldots, d_n)$ with all $d_i > 0$:

$$\kappa(D) = \frac{\max_i d_i}{\min_i d_i}$$

If $D = \mathrm{diag}(1, 10^{-8})$, then $\kappa(D) = 10^8$. The second component of the solution to $Dx = b$ has errors amplified by $10^8$ relative to the first.
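The amplification in this $2 \times 2$ example, seen numerically (a minimal sketch):

```python
import numpy as np

D = np.diag([1.0, 1e-8])
print(np.linalg.cond(D))        # 1e8

b = np.array([1.0, 1.0])
x = np.linalg.solve(D, b)       # exact solution: (1, 1e8)
db = np.array([0.0, 1e-10])     # tiny absolute perturbation in b[1]
dx = np.linalg.solve(D, b + db) - x
print(dx[1])                    # close to 0.01: the perturbation grew by 1e8
```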

Common Confusions

Watch Out

Invertible does not mean well-conditioned

A matrix can be mathematically invertible ($\det(A) \neq 0$) but so ill-conditioned that numerical inversion is useless. The determinant is a terrible indicator of conditioning: a $100 \times 100$ identity matrix has $\det = 1$ and $\kappa = 1$, but the matrix $0.1 \cdot I_{100}$ has $\det = 10^{-100}$ (tiny!) and $\kappa = 1$ (perfect conditioning).
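The contrast above, checked in NumPy (using `slogdet` because $10^{-100}$ is awkward to handle directly):

```python
import numpy as np

A = 0.1 * np.eye(100)

sign, logdet = np.linalg.slogdet(A)   # log|det A|, robust against underflow
print(logdet / np.log(10.0))          # about -100: det(A) = 10^{-100}, "tiny"
print(np.linalg.cond(A))              # 1.0: perfectly conditioned
```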

Watch Out

Condition number is a property of the problem, not the algorithm

The condition number measures the intrinsic sensitivity of the mathematical problem $Ax = b$; it is not a property of your solver. A better algorithm can avoid adding unnecessary error (this is backward stability), but it cannot avoid the error amplification inherent to the problem. Gaussian elimination with partial pivoting is backward stable: it solves a nearby problem exactly. But the sensitivity to perturbations in that nearby problem is still governed by $\kappa(A)$.

Watch Out

Large condition number does not mean the matrix is nearly singular

It does, but only in a relative sense, and the distinction needs care. A matrix with $\kappa = 10^{10}$ has $\sigma_{\min}/\sigma_{\max} = 10^{-10}$, so it is "close to singular" relative to its own scale. But $\sigma_{\min}$ could still be $10^5$ in absolute terms if $\sigma_{\max} = 10^{15}$. The matrix is far from the zero matrix, yet close to the set of singular matrices relative to its own scale.

Condition Number and Gradient Descent Convergence

The condition number $\kappa = L/\mu$ of an optimization problem directly sets the convergence rate of gradient descent. For a function $f$ that is $\mu$-strongly convex and $L$-smooth, gradient descent with step size $\eta = 1/L$ satisfies:

$$f(x_t) - f^* \leq \left(1 - \frac{1}{\kappa}\right)^t \left[ f(x_0) - f^* \right]$$

To reach $\epsilon$-accuracy from initial suboptimality $f(x_0) - f^* = \Delta$, it suffices that the number of iterations satisfies:

$$t \geq \kappa \log\left(\frac{\Delta}{\epsilon}\right)$$

This is the $O(\kappa \log(1/\varepsilon))$ iteration bound. The full proof uses the descent lemma ($L$-smoothness gives a quadratic upper bound on $f$ at each step) and strong convexity (which prevents the iterates from stalling in flat regions). See gradient descent variants for the derivation.
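The roughly linear dependence on $\kappa$ shows up in a direct simulation. A sketch on the diagonal quadratic $f(x) = \frac{1}{2} x^T A x$ (the eigenvalue choices are arbitrary examples):

```python
import numpy as np

def gd_iterations(eigs, tol=1e-6):
    """Gradient descent with step 1/L on f(x) = 0.5 x^T A x, A = diag(eigs).
    Returns the number of steps until f drops below tol * f(x0)."""
    A = np.diag(np.asarray(eigs, dtype=float))
    L = max(eigs)
    x = np.ones(len(eigs))
    f = lambda v: 0.5 * v @ A @ v
    f0, t = f(x), 0
    while f(x) > tol * f0:
        x = x - (1.0 / L) * (A @ x)   # gradient of f is A x
        t += 1
    return t

t10 = gd_iterations([1.0, 10.0])      # kappa = 10
t100 = gd_iterations([1.0, 100.0])    # kappa = 100
print(t10, t100)                      # iteration count grows roughly like kappa
```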

For quadratic functions $f(x) = \frac{1}{2} x^T A x - b^T x$ with $A$ positive definite, the smoothness constant is $L = \lambda_{\max}(A)$ and the strong convexity constant is $\mu = \lambda_{\min}(A) > 0$, so $\kappa = \lambda_{\max}(A)/\lambda_{\min}(A)$: exactly the condition number of $A$. (For merely PSD $A$, e.g. a rank-deficient quadratic, $\lambda_{\min} = 0$: the function is convex but not strongly convex, and the optimization condition number is infinite. In that regime the classical $O(\kappa \log 1/\varepsilon)$ rates do not apply directly; analysis instead uses error-bound conditions or the Polyak-Łojasiewicz inequality.)

Why large $\kappa$ causes slow convergence, geometrically. The sublevel sets of $f$ are ellipsoids with axis lengths proportional to $1/\sqrt{\lambda_i}$. So the largest eigenvalue $\lambda_{\max}$ corresponds to the shortest, steepest axis, and $\lambda_{\min}$ to the longest, flattest axis: large eigenvalue = strong curvature = short axis. When $\kappa$ is large, the ellipsoid is elongated along the $\lambda_{\min}$ direction. The stable gradient-descent step size is constrained by the steep direction ($\eta \leq 1/L = 1/\lambda_{\max}$), and that same small step makes only $\eta \lambda_{\min} = \lambda_{\min}/\lambda_{\max} = 1/\kappa$ relative progress along the flat direction. The result is the characteristic zigzag pattern: oscillation across the steep narrow axis, glacial progress along the flat long axis, and an overall iteration count of $O(\kappa \log(1/\varepsilon))$.

Comparison with Newton's method. Newton's method applies the inverse Hessian at each step: $x_{t+1} = x_t - [\nabla^2 f(x_t)]^{-1} \nabla f(x_t)$. For strongly convex quadratics, this converges in one step regardless of $\kappa$ (affine invariance). For general smooth strongly convex functions, quadratic convergence kicks in once the iterates enter a neighborhood of $x^*$. The price: each step requires solving a $d \times d$ linear system, costing $O(d^3)$ in general or $O(d^2)$ with structure, compared to $O(d)$ per gradient-descent step.

Preconditioning as approximate Newton. Preconditioning multiplies the gradient by an approximation $M^{-1} \approx [\nabla^2 f]^{-1}$. The effective condition number becomes $\kappa(M^{-1/2} A M^{-1/2})$. If $M$ captures the dominant scale variation in $A$, the preconditioned condition number is much smaller, and convergence accelerates. Diagonal preconditioning (as in Adam's adaptive step sizes) approximates the Hessian diagonal and reduces the effective $\kappa$ per coordinate.
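The symmetric form $\kappa(M^{-1/2} A M^{-1/2})$ can be checked with a Jacobi (diagonal) preconditioner; the badly scaled SPD matrix below is a hypothetical example:

```python
import numpy as np

rng = np.random.default_rng(3)
B = np.eye(4) + 0.1 * rng.normal(size=(4, 4))
S = np.diag([1.0, 1e-2, 1e-4, 1e-6])
A = S @ (B @ B.T) @ S            # SPD, with wildly different scales per coordinate

M_half_inv = np.diag(1.0 / np.sqrt(np.diag(A)))   # M = diag(A), applied symmetrically
A_pre = M_half_inv @ A @ M_half_inv               # preserves symmetry, unlike M^{-1} A

kappa_raw = np.linalg.cond(A)
kappa_pre = np.linalg.cond(A_pre)
print(kappa_raw, kappa_pre)      # huge vs. modest effective condition number
```

Symmetric preconditioning $M^{-1/2} A M^{-1/2}$ keeps the matrix SPD, so the strong-convexity analysis above still applies to the preconditioned problem.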

Summary

  • $\kappa(A) = \sigma_{\max}/\sigma_{\min}$: ratio of largest to smallest singular value
  • The condition number bounds error amplification: a relative perturbation $\epsilon$ in the input becomes at most $\kappa \cdot \epsilon$ in the output
  • $\kappa \approx 10^k$ means you lose roughly $k$ digits of accuracy
  • Invertibility is binary; conditioning is a spectrum
  • In optimization, $\kappa = L/\mu$ controls the convergence rate of gradient descent
  • Preconditioning reduces $\kappa$ to make problems tractable
  • The Hilbert matrix is the canonical example of catastrophic ill-conditioning

Exercises

ExerciseCore

Problem

Compute the condition number (2-norm) of the matrix:

$$A = \begin{pmatrix} 1 & 1 \\ 0 & 10^{-6} \end{pmatrix}$$

What does this tell you about solving $Ax = b$ in floating point?

ExerciseAdvanced

Problem

Prove that for any nonsingular matrix $A$:

$$\kappa(A) = \max_{\delta A} \frac{\|\delta x\| / \|x\|}{\|\delta A\| / \|A\|}$$

where $Ax = b$ and $(A + \delta A)(x + \delta x) = b$ (i.e., the right-hand side is fixed and the matrix is perturbed). This shows the condition number is the exact worst-case amplification.

References

Canonical:

  • Trefethen & Bau, Numerical Linear Algebra (1997), Lectures 12-15 (conditioning, perturbation bounds, SVD connection)
  • Golub & Van Loan, Matrix Computations (2013), Chapter 2 (condition number, perturbation theory for linear systems)
  • Demmel, Applied Numerical Linear Algebra (1997), Chapters 1-2 (conditioning, backward error, perturbation theory)

Optimization:

  • Boyd & Vandenberghe, Convex Optimization (2004), Section 9.3.1 (condition number and gradient descent convergence)
  • Nesterov, Introductory Lectures on Convex Optimization (2004), Chapter 2 (smoothness and strong convexity constants)

Numerical:

  • Higham, Accuracy and Stability of Numerical Algorithms (2002), Chapters 1-2 and 7 (backward error, condition numbers, iterative refinement)
  • Strang, Linear Algebra and Its Applications (2006), Section 7.3 (condition numbers and numerical stability)

Next Topics

The natural next step from conditioning:

  • Numerical linear algebra basics: algorithms (LU, QR, SVD) and their stability properties, building on the conditioning framework established here

Last reviewed: April 26, 2026
