
Foundations

Matrix Operations and Properties

Essential matrix operations for ML: trace, determinant, inverse, pseudoinverse, Schur complement, and the Sherman-Morrison-Woodbury formula. When and why each matters.

Core · Tier 1 · Stable · Supporting · ~50 min

Why This Matters

Matrices are the language of ML. Weight matrices, covariance matrices, kernel matrices, Hessians. They are everywhere. You need to know what operations you can perform on them, what those operations mean, and when they are numerically safe.

Three operations recur throughout: the trace, the determinant, and the inverse.

This page covers the operations that appear repeatedly in ML theory and practice.

If you want to build the geometric intuition first, the Matrix Mechanics Lab is the hands-on version of this page. It shows what matrix multiplication does to shapes, vectors, and coordinate systems before the abstract identities take over.

Quick Version

  • Trace. \mathrm{tr}(A) adds the diagonal entries and, for square matrices, equals the sum of eigenvalues.
  • Determinant. \det(A) measures signed area or volume scaling. Zero means the map crushes some direction completely. Numerically, use singular values or a condition number rather than determinant magnitude to judge near singularity.
  • Inverse. A^{-1} exists only when the map is nonsingular, but the real numerical question is whether inversion is stable enough to trust.
  • Pseudoinverse. A^+ gives the minimum-norm least-squares solution when the exact inverse does not exist.
  • Schur complement / SMW. These are the “block algebra” and “cheap update” tools that keep Gaussian, kernel, and optimization computations tractable.

Mental Model

A matrix represents a linear map between vector spaces. Trace, determinant, inverse, and pseudoinverse each extract a different invariant of that map: trace sums the eigenvalues, determinant is the signed volume-scaling factor, inverse exists when no eigenvalue is zero, and pseudoinverse handles the rank-deficient case via the SVD.

Trace

Definition

Trace

The trace of a square matrix A \in \mathbb{R}^{n \times n} is the sum of its diagonal elements:

\text{tr}(A) = \sum_{i=1}^{n} A_{ii}

Equivalently, the trace equals the sum of eigenvalues: \text{tr}(A) = \sum_{i=1}^{n} \lambda_i.

Key properties of the trace:

  • Linearity: \text{tr}(A + B) = \text{tr}(A) + \text{tr}(B) and \text{tr}(cA) = c \cdot \text{tr}(A)
  • Cyclic property: \text{tr}(ABC) = \text{tr}(CAB) = \text{tr}(BCA)
  • Transpose invariance: \text{tr}(A) = \text{tr}(A^T)

The cyclic property is extremely useful. It lets you rearrange matrix products inside a trace, which simplifies many derivations in ML (e.g., computing gradients of matrix expressions).
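
A quick NumPy sketch of these identities (the matrices and seed here are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
C = rng.standard_normal((4, 4))

# Trace = sum of diagonal entries = sum of eigenvalues (the eigenvalue sum is
# real up to rounding even when individual eigenvalues are complex).
print(np.isclose(np.trace(A), A.diagonal().sum()))
print(np.isclose(np.trace(A), np.linalg.eigvals(A).sum().real))

# Cyclic property: tr(ABC) = tr(CAB) = tr(BCA).
print(np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))
print(np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A)))
```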

Determinant

Definition

Determinant

The determinant of a square matrix A \in \mathbb{R}^{n \times n} equals the product of its eigenvalues:

\det(A) = \prod_{i=1}^{n} \lambda_i

Geometrically, |\det(A)| measures the factor by which A scales volumes. The matrix is singular (not invertible) if and only if \det(A) = 0.

Key properties:

  • \det(AB) = \det(A)\det(B)
  • \det(A^{-1}) = 1/\det(A)
  • \det(A^T) = \det(A)
  • \det(cA) = c^n \det(A) for A \in \mathbb{R}^{n \times n}

In ML, determinants appear in Gaussian distributions (the normalization constant involves \det(\Sigma)), in volume arguments for information theory, and in Bayesian model selection.

In numerical code, determinant magnitude is a poor invertibility diagnostic. The product of many eigenvalues can underflow, overflow, or look tiny simply because the dimension is large. If you care whether solving with A is safe, look at singular values, rank, or the condition number. If you need \log\det(A) for a positive definite matrix, compute it through a Cholesky factorization: if A = LL^\top, then \log\det(A) = 2\sum_i \log L_{ii}.
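
A minimal sketch of the Cholesky log-determinant trick (the synthetic covariance matrix below is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 50))
Sigma = X.T @ X / 500 + 0.1 * np.eye(50)   # a synthetic SPD "covariance"

# If Sigma = L L^T (Cholesky), then log det(Sigma) = 2 * sum(log(diag(L))).
L = np.linalg.cholesky(Sigma)
logdet_chol = 2.0 * np.sum(np.log(np.diag(L)))

# Reference: slogdet also works in log space, avoiding under/overflow of det itself.
sign, logdet_ref = np.linalg.slogdet(Sigma)
print(sign, np.isclose(logdet_chol, logdet_ref))
```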

Matrix Inverse

Definition

Matrix Inverse

For a square matrix A \in \mathbb{R}^{n \times n}, the inverse A^{-1} satisfies AA^{-1} = A^{-1}A = I. It exists if and only if \det(A) \neq 0 (equivalently, all eigenvalues are nonzero).

Key identities:

  • (AB)^{-1} = B^{-1}A^{-1} (note the reversed order)
  • (A^T)^{-1} = (A^{-1})^T

When Inversion is Dangerous

Definition

Condition Number

The condition number of a matrix is:

\kappa(A) = \|A\| \cdot \|A^{-1}\| = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}

where \sigma_{\max} and \sigma_{\min} are the largest and smallest singular values. A large condition number means the matrix is nearly singular and inversion is numerically unstable.

Rule of thumb: if \kappa(A) \approx 10^k, you lose about k digits of accuracy when solving Ax = b by inversion. For double precision (about 16 digits), \kappa(A) > 10^{12} is dangerous.

In practice, avoid explicit matrix inversion. Use factorizations (Cholesky, LU, QR) to solve linear systems instead.
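
A small sketch contrasting the two habits (matrix size and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

print(np.linalg.cond(A))            # kappa(A) = sigma_max / sigma_min in the 2-norm

x_solve = np.linalg.solve(A, b)     # preferred: factor and solve
x_inv = np.linalg.inv(A) @ b        # discouraged: explicit inverse, then multiply

# Compare residuals; solve is typically at least as accurate and cheaper, and a
# stored factorization (e.g. scipy.linalg.lu_factor / lu_solve) can be reused.
print(np.linalg.norm(A @ x_solve - b), np.linalg.norm(A @ x_inv - b))
```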

Moore-Penrose Pseudoinverse

Definition

Moore-Penrose Pseudoinverse

The Moore-Penrose pseudoinverse A^+ of a matrix A \in \mathbb{R}^{m \times n} is the unique matrix satisfying four conditions: (1) AA^+A = A, (2) A^+AA^+ = A^+, (3) (AA^+)^T = AA^+, (4) (A^+A)^T = A^+A.

For full column rank (\text{rank}(A) = n \leq m):

A^+ = (A^T A)^{-1} A^T

For full row rank (\text{rank}(A) = m \leq n):

A^+ = A^T (A A^T)^{-1}

The pseudoinverse gives the least-squares solution to Ax = b when A is not square or not invertible: x^+ = A^+ b minimizes \|Ax - b\|_2. When the solution is non-unique, x^+ = A^+ b also has the smallest \|x\|_2 among the minimizers: the minimum-norm least-squares solution.

This is exactly what happens in linear regression: when X has full column rank, the OLS solution \hat{\beta} = (X^T X)^{-1} X^T y is X^+ y.

The cleanest geometric reading is the SVD form A^+ = V\Sigma^+ U^\top, where \Sigma^+ inverts the nonzero singular values and zeroes the rest. That form is numerically stable and works for every matrix shape. See SVD. For the orthogonal-projection picture (closest reachable output, simplest input that reaches it), the standalone Pseudoinverse Geometry Lab walks the three cases (overdetermined, underdetermined, rank-deficient) as draggable diagrams.
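
A short NumPy sketch of the three regimes (shapes and data are illustrative; np.linalg.pinv is SVD-based):

```python
import numpy as np

rng = np.random.default_rng(0)

# Overdetermined, full column rank: least-squares fit.
X = rng.standard_normal((100, 5))
y = rng.standard_normal(100)
beta_pinv = np.linalg.pinv(X) @ y                   # pseudoinverse solution
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # dedicated least-squares routine
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)     # normal equations (squares the conditioning)
print(np.allclose(beta_pinv, beta_lstsq), np.allclose(beta_pinv, beta_normal))

# Underdetermined: infinitely many exact solutions; pinv picks the minimum-norm one.
A = rng.standard_normal((3, 6))
b = rng.standard_normal(3)
x = np.linalg.pinv(A) @ b
print(np.allclose(A @ x, b))   # exact fit with the smallest ||x||_2
```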

Transpose and Adjoint

The transpose A^T swaps rows and columns: (A^T)_{ij} = A_{ji}.

The conjugate transpose (adjoint) A^* also conjugates complex entries: (A^*)_{ij} = \overline{A_{ji}}. For real matrices, A^* = A^T.

A matrix is symmetric if and only if A = A^T. A matrix is orthogonal if and only if A^T A = I. These properties simplify many computations. Symmetric matrices have real eigenvalues, and orthogonal matrices preserve lengths. A symmetric matrix with all nonnegative eigenvalues is positive semidefinite.
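
A quick property check in NumPy (illustrative random matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))

S = B + B.T                 # symmetric by construction
Q, _ = np.linalg.qr(B)      # Q is orthogonal: Q^T Q = I
P = B @ B.T                 # Gram matrix: symmetric positive semidefinite

print(np.allclose(S, S.T))                          # symmetric
print(np.allclose(Q.T @ Q, np.eye(4)))              # orthogonal
print(np.all(np.linalg.eigvalsh(P) >= -1e-10))      # PSD up to rounding
print(np.isclose(np.linalg.norm(Q @ B[:, 0]),       # orthogonal maps preserve length
                 np.linalg.norm(B[:, 0])))
```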

Schur Complement

Definition

Schur Complement

Given a block matrix:

M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}

If D is invertible, the Schur complement of D in M is:

M/D = A - B D^{-1} C

The determinant factors as \det(M) = \det(D) \cdot \det(M/D).

The Schur complement appears in:

  • Gaussian conditioning (deriving conditional distributions from joint)
  • Block matrix inversion
  • Optimization (eliminating variables in quadratic forms)
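
A quick numerical check of the determinant factorization and of block elimination with the Schur complement (block sizes and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
C = rng.standard_normal((m, n))
D = rng.standard_normal((m, m)) + 5 * np.eye(m)   # keep D comfortably invertible

M = np.block([[A, B], [C, D]])
S = A - B @ np.linalg.solve(D, C)                 # Schur complement M/D

# det(M) = det(D) * det(M/D)
print(np.isclose(np.linalg.det(M), np.linalg.det(D) * np.linalg.det(S)))

# Eliminating the second block variable: M [x; y] = [b1; b2] reduces to
# S x = b1 - B D^{-1} b2, then y = D^{-1} (b2 - C x).
b1, b2 = rng.standard_normal(n), rng.standard_normal(m)
x = np.linalg.solve(S, b1 - B @ np.linalg.solve(D, b2))
y = np.linalg.solve(D, b2 - C @ x)
print(np.allclose(np.concatenate([x, y]), np.linalg.solve(M, np.concatenate([b1, b2]))))
```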

Sherman-Morrison-Woodbury Formula

Theorem

Sherman-Morrison-Woodbury Formula

Statement

If A \in \mathbb{R}^{n \times n} is invertible, U \in \mathbb{R}^{n \times k}, C \in \mathbb{R}^{k \times k} is invertible, and V \in \mathbb{R}^{k \times n}, then:

(A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1}

Intuition

When you add a low-rank update UCV to a matrix A whose inverse you already know, you can compute the new inverse by solving a smaller k \times k system instead of a full n \times n inversion. This is a huge saving when k \ll n.

Proof Sketch

Multiply (A + UCV) by the proposed right-hand side and verify you get I. This is a direct algebraic verification. Expand the product and simplify using AA^{-1} = I and CC^{-1} = I.

Why It Matters

This formula appears throughout ML: online learning (rank-1 updates to covariance matrices), Kalman filters, Gaussian process inference with structured kernels, and Bayesian linear regression. Any time you have A^{-1} and need (A + \text{low-rank})^{-1}, use this formula.

Failure Mode

The formula requires both A and C^{-1} + VA^{-1}U to be invertible. If either is singular (or nearly so), the formula is inapplicable or numerically unstable.

The special case with k = 1 (rank-1 update) is the Sherman-Morrison formula:

(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1}uv^T A^{-1}}{1 + v^T A^{-1}u}
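
A minimal sketch of the rank-1 case (the well-conditioned A is an illustrative assumption; in practice A^{-1} would already be on hand):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)   # shift keeps A far from singular
A_inv = np.linalg.inv(A)                          # assume this cost was already paid
u = rng.standard_normal((n, 1))
v = rng.standard_normal((n, 1))

# Sherman-Morrison: O(n^2) given A_inv, versus O(n^3) to invert from scratch.
denom = 1.0 + (v.T @ A_inv @ u).item()
updated_inv = A_inv - (A_inv @ u) @ (v.T @ A_inv) / denom

print(np.allclose(updated_inv, np.linalg.inv(A + u @ v.T)))
```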

ML Implementation Patterns

| Task | Algebraic object | Safer implementation habit |
| --- | --- | --- |
| Solve Ax = b | A^{-1}b | Use solve, LU, QR, or Cholesky; do not form A^{-1} |
| Gaussian log likelihood | \log\det(\Sigma) | Use Cholesky and sum log diagonals |
| Least squares | X^+y | Use QR or SVD, especially when X^\top X is ill-conditioned |
| Rank or low-rank structure | \mathrm{rank}(A) | Inspect singular values and choose a tolerance |
| Online covariance / kernel update | (A + UCV)^{-1} | Use SMW only when the small inner system is well-conditioned |

Common Confusions

Watch Out

Trace and determinant serve different purposes

The trace sums eigenvalues; the determinant multiplies them. A matrix can have large trace but zero determinant (e.g., \text{diag}(100, 0)). The trace tells you about the "total magnitude" of eigenvalues; the determinant tells you whether the matrix is invertible and how it scales volume.

Watch Out

Never invert matrices explicitly

In nearly all practical ML code, you should solve Ax = b using a linear system solver (e.g., np.linalg.solve), not by computing A^{-1} and then multiplying. Direct inversion is slower, less numerically stable, and rarely necessary.

Watch Out

A tiny determinant is not the same as a bad solve

Determinants multiply all volume scalings together, so they depend strongly on dimension and units. A matrix can have a tiny determinant because it has many moderately small singular values, while the solve may still be acceptable. A matrix can also have determinant 1 and be badly conditioned if one singular value is huge and another is tiny. Use singular values or \kappa(A) when the question is numerical safety.
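
A two-line illustration of the gap (numbers chosen purely for illustration):

```python
import numpy as np

# Determinant ~1 but condition number ~1e12: the volume looks fine, the solve is not.
A = np.diag([1e6, 1e-6])
print(np.linalg.det(A), np.linalg.cond(A))

# Tiny determinant but condition number 1: a perfectly benign, well-scaled matrix.
B = 1e-3 * np.eye(2)
print(np.linalg.det(B), np.linalg.cond(B))
```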

Summary

  • \text{tr}(A) = sum of eigenvalues; use the cyclic property freely
  • \det(A) = product of eigenvalues; zero means singular
  • The pseudoinverse A^+ gives the least-squares solution to Ax = b
  • Condition number \kappa(A) measures how dangerous inversion is
  • Schur complements factor block matrices and appear in Gaussian conditioning
  • Sherman-Morrison-Woodbury turns rank-k updates into k \times k solves
  • Use Cholesky/QR/SVD/logdet tricks in code; algebraic identities are not always numerically safe as written
  • In practice, solve linear systems instead of inverting matrices

Exercises

ExerciseCore

Problem

Let A = \text{diag}(3, 5, 7). Compute \text{tr}(A), \det(A), A^{-1}, and \kappa(A) (using the 2-norm).

ExerciseCore

Problem

Show that \text{tr}(AB) = \text{tr}(BA) for any matrices A \in \mathbb{R}^{m \times n} and B \in \mathbb{R}^{n \times m}.

ExerciseAdvanced

Problem

You have computed A^{-1} for a 1000 \times 1000 matrix. A new data point arrives, requiring you to compute (A + uv^T)^{-1} where u, v \in \mathbb{R}^{1000}. What is the computational cost using Sherman-Morrison versus recomputing the inverse from scratch?

References

Canonical:

  • Strang, Introduction to Linear Algebra (2016), Chapters 2, 5, 6
  • Horn & Johnson, Matrix Analysis (2013), Chapters 0-1
  • Golub & Van Loan, Matrix Computations (2013), Chapters 2-3 (matrix operations and LU/QR factorization)

Current:

  • Petersen & Pedersen, The Matrix Cookbook (2012). Essential reference for matrix identities.
  • Axler, Linear Algebra Done Right (2024), Chapters 3-4 (linear maps, invertibility, determinants)
  • Deisenroth, Faisal, Ong, Mathematics for Machine Learning (2020), Chapter 4 (matrix decompositions)

Last reviewed: April 26, 2026
