
Foundations

Matrix Operations and Properties

Essential matrix operations for ML: trace, determinant, inverse, pseudoinverse, Schur complement, and the Sherman-Morrison-Woodbury formula. When and why each matters.

Core · Tier 1 · Stable · Supporting · ~50 min

Why This Matters

Matrices are the language of ML. Weight matrices, covariance matrices, kernel matrices, Hessians. They are everywhere. You need to know what operations you can perform on them, what those operations mean, and when they are numerically safe.

Three operations recur throughout: the trace, the determinant, and the inverse.

This page covers the operations that appear repeatedly in ML theory and practice.

If you want to build the geometric intuition first, the Matrix Mechanics Lab is the hands-on version of this page. It shows what matrix multiplication does to shapes, vectors, and coordinate systems before the abstract identities take over.

Quick Version

  • Trace. \mathrm{tr}(A) adds the diagonal entries and, for square matrices, equals the sum of eigenvalues.
  • Determinant. \det(A) measures signed area or volume scaling. Zero means the map crushes some direction completely. Numerically, use singular values or a condition number rather than determinant magnitude to judge near singularity.
  • Inverse. A^{-1} exists only when the map is nonsingular, but the real numerical question is whether inversion is stable enough to trust.
  • Pseudoinverse. A^+ gives the minimum-norm least-squares solution when the exact inverse does not exist.
  • Schur complement / SMW. These are the “block algebra” and “cheap update” tools that keep Gaussian, kernel, and optimization computations tractable.

Mental Model

A matrix represents a linear map between vector spaces. Trace, determinant, inverse, and pseudoinverse each extract a different invariant of that map: trace sums the eigenvalues, determinant is the signed volume-scaling factor, inverse exists when no eigenvalue is zero, and pseudoinverse handles the rank-deficient case via the SVD.

Trace

Definition

Trace

The trace of a square matrix A \in \mathbb{R}^{n \times n} is the sum of its diagonal elements:

\text{tr}(A) = \sum_{i=1}^{n} A_{ii}

Equivalently, the trace equals the sum of eigenvalues: \text{tr}(A) = \sum_{i=1}^{n} \lambda_i.

Key properties of the trace:

  • Linearity: \text{tr}(A + B) = \text{tr}(A) + \text{tr}(B) and \text{tr}(cA) = c \cdot \text{tr}(A)
  • Cyclic property: \text{tr}(ABC) = \text{tr}(CAB) = \text{tr}(BCA)
  • Transpose invariance: \text{tr}(A) = \text{tr}(A^T)

The cyclic property is extremely useful. It lets you rearrange matrix products inside a trace, which simplifies many derivations in ML (e.g., computing gradients of matrix expressions).
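
A quick NumPy sketch of these identities (the matrices and seed here are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
C = rng.standard_normal((4, 4))

# Trace = sum of diagonal entries = sum of eigenvalues (the eigenvalue sum is
# real up to rounding even when individual eigenvalues are complex).
print(np.isclose(np.trace(A), A.diagonal().sum()))
print(np.isclose(np.trace(A), np.linalg.eigvals(A).sum().real))

# Cyclic property: tr(ABC) = tr(CAB) = tr(BCA).
print(np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))
print(np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A)))
```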

Determinant

Definition

Determinant

The determinant of a square matrix A \in \mathbb{R}^{n \times n} equals the product of its eigenvalues:

\det(A) = \prod_{i=1}^{n} \lambda_i

Geometrically, |\det(A)| measures the factor by which A scales volumes. The matrix is singular (not invertible) if and only if \det(A) = 0.

Key properties:

  • \det(AB) = \det(A)\det(B)
  • \det(A^{-1}) = 1/\det(A)
  • \det(A^T) = \det(A)
  • \det(cA) = c^n \det(A) for A \in \mathbb{R}^{n \times n}

In ML, determinants appear in Gaussian distributions (the normalization constant involves \det(\Sigma)), in volume arguments for information theory, and in Bayesian model selection.

In numerical code, determinant magnitude is a poor invertibility diagnostic. The product of many eigenvalues can underflow, overflow, or look tiny simply because the dimension is large. If you care whether solving with A is safe, look at singular values, rank, or the condition number. If you need \log\det(A) for a positive definite matrix, compute it through a Cholesky factorization: if A = LL^\top, then \log\det(A) = 2\sum_i \log L_{ii}.
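
A minimal sketch of the Cholesky log-determinant trick (the synthetic covariance matrix below is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 50))
Sigma = X.T @ X / 500 + 0.1 * np.eye(50)   # a synthetic SPD "covariance"

# If Sigma = L L^T (Cholesky), then log det(Sigma) = 2 * sum(log(diag(L))).
L = np.linalg.cholesky(Sigma)
logdet_chol = 2.0 * np.sum(np.log(np.diag(L)))

# Reference: slogdet also works in log space, avoiding under/overflow of det itself.
sign, logdet_ref = np.linalg.slogdet(Sigma)
print(sign, np.isclose(logdet_chol, logdet_ref))
```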

Matrix Inverse

Definition

Matrix Inverse

For a square matrix A \in \mathbb{R}^{n \times n}, the inverse A^{-1} satisfies AA^{-1} = A^{-1}A = I. It exists if and only if \det(A) \neq 0 (equivalently, all eigenvalues are nonzero).

Key identities:

  • (AB)^{-1} = B^{-1}A^{-1} (note the reversed order)
  • (A^T)^{-1} = (A^{-1})^T

When Inversion is Dangerous

Definition

Condition Number

The condition number of a matrix is:

\kappa(A) = \|A\| \cdot \|A^{-1}\| = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}

where \sigma_{\max} and \sigma_{\min} are the largest and smallest singular values. A large condition number means the matrix is nearly singular and inversion is numerically unstable.

Rule of thumb: if \kappa(A) \approx 10^k, you lose about k digits of accuracy when solving Ax = b by inversion. For double precision (about 16 digits), \kappa(A) > 10^{12} is dangerous.

In practice, avoid explicit matrix inversion. Use factorizations (Cholesky, LU, QR) to solve linear systems instead.
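
A small sketch contrasting the two habits (matrix size and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

print(np.linalg.cond(A))            # kappa(A) = sigma_max / sigma_min in the 2-norm

x_solve = np.linalg.solve(A, b)     # preferred: factor and solve
x_inv = np.linalg.inv(A) @ b        # discouraged: explicit inverse, then multiply

# Compare residuals; solve is typically at least as accurate and cheaper, and a
# stored factorization (e.g. scipy.linalg.lu_factor / lu_solve) can be reused.
print(np.linalg.norm(A @ x_solve - b), np.linalg.norm(A @ x_inv - b))
```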

Moore-Penrose Pseudoinverse

Definition

Moore-Penrose Pseudoinverse

The Moore-Penrose pseudoinverse A^+ of a matrix A \in \mathbb{R}^{m \times n} is the unique matrix satisfying four conditions: (1) AA^+A = A, (2) A^+AA^+ = A^+, (3) (AA^+)^T = AA^+, (4) (A^+A)^T = A^+A.

For full column rank (\text{rank}(A) = n \leq m):

A^+ = (A^T A)^{-1} A^T

For full row rank (\text{rank}(A) = m \leq n):

A^+ = A^T (A A^T)^{-1}

The pseudoinverse gives the least-squares solution to Ax = b when A is not square or not invertible: x^+ = A^+ b minimizes \|Ax - b\|_2. When the solution is non-unique, x^+ = A^+ b also has the smallest \|x\|_2 among the minimizers: the minimum-norm least-squares solution.

This is exactly what happens in linear regression: when X has full column rank, the OLS solution \hat{\beta} = (X^T X)^{-1} X^T y is X^+ y.

The cleanest geometric reading is the SVD form A^+ = V\Sigma^+ U^\top, where \Sigma^+ inverts the nonzero singular values and zeroes the rest. That form is numerically stable and works for every matrix shape. See SVD. For the orthogonal-projection picture (closest reachable output, simplest input that reaches it), the standalone Pseudoinverse Geometry Lab walks the three cases (overdetermined, underdetermined, rank-deficient) as draggable diagrams.
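
A short NumPy sketch of the three regimes (shapes and data are illustrative; np.linalg.pinv is SVD-based):

```python
import numpy as np

rng = np.random.default_rng(0)

# Overdetermined, full column rank: least-squares fit.
X = rng.standard_normal((100, 5))
y = rng.standard_normal(100)
beta_pinv = np.linalg.pinv(X) @ y                   # pseudoinverse solution
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # dedicated least-squares routine
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)     # normal equations (squares the conditioning)
print(np.allclose(beta_pinv, beta_lstsq), np.allclose(beta_pinv, beta_normal))

# Underdetermined: infinitely many exact solutions; pinv picks the minimum-norm one.
A = rng.standard_normal((3, 6))
b = rng.standard_normal(3)
x = np.linalg.pinv(A) @ b
print(np.allclose(A @ x, b))   # exact fit with the smallest ||x||_2
```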

Transpose and Adjoint

The transpose A^T swaps rows and columns: (A^T)_{ij} = A_{ji}.

The conjugate transpose (adjoint) A^* also conjugates complex entries: (A^*)_{ij} = \overline{A_{ji}}. For real matrices, A^* = A^T.

A matrix is symmetric if and only if A = A^T. A matrix is orthogonal if and only if A^T A = I. These properties simplify many computations. Symmetric matrices have real eigenvalues, and orthogonal matrices preserve lengths. A symmetric matrix with all nonnegative eigenvalues is positive semidefinite.
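
A quick property check in NumPy (illustrative random matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))

S = B + B.T                 # symmetric by construction
Q, _ = np.linalg.qr(B)      # Q is orthogonal: Q^T Q = I
P = B @ B.T                 # Gram matrix: symmetric positive semidefinite

print(np.allclose(S, S.T))                          # symmetric
print(np.allclose(Q.T @ Q, np.eye(4)))              # orthogonal
print(np.all(np.linalg.eigvalsh(P) >= -1e-10))      # PSD up to rounding
print(np.isclose(np.linalg.norm(Q @ B[:, 0]),       # orthogonal maps preserve length
                 np.linalg.norm(B[:, 0])))
```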

Schur Complement

Definition

Schur Complement

Given a block matrix:

M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}

If D is invertible, the Schur complement of D in M is:

M/D = A - B D^{-1} C

The determinant factors as \det(M) = \det(D) \cdot \det(M/D).

The Schur complement appears in:

  • Gaussian conditioning (deriving conditional distributions from joint)
  • Block matrix inversion
  • Optimization (eliminating variables in quadratic forms)
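
A quick numerical check of the determinant factorization and of block elimination with the Schur complement (block sizes and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
C = rng.standard_normal((m, n))
D = rng.standard_normal((m, m)) + 5 * np.eye(m)   # keep D comfortably invertible

M = np.block([[A, B], [C, D]])
S = A - B @ np.linalg.solve(D, C)                 # Schur complement M/D

# det(M) = det(D) * det(M/D)
print(np.isclose(np.linalg.det(M), np.linalg.det(D) * np.linalg.det(S)))

# Eliminating the second block variable: M [x; y] = [b1; b2] reduces to
# S x = b1 - B D^{-1} b2, then y = D^{-1} (b2 - C x).
b1, b2 = rng.standard_normal(n), rng.standard_normal(m)
x = np.linalg.solve(S, b1 - B @ np.linalg.solve(D, b2))
y = np.linalg.solve(D, b2 - C @ x)
print(np.allclose(np.concatenate([x, y]), np.linalg.solve(M, np.concatenate([b1, b2]))))
```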

Sherman-Morrison-Woodbury Formula

Theorem

Sherman-Morrison-Woodbury Formula

Statement

If A \in \mathbb{R}^{n \times n} is invertible, U \in \mathbb{R}^{n \times k}, C \in \mathbb{R}^{k \times k} is invertible, and V \in \mathbb{R}^{k \times n}, then:

(A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1}

Intuition

When you add a low-rank update UCV to a matrix A whose inverse you already know, you can compute the new inverse by solving a smaller k \times k system instead of a full n \times n inversion. This is a huge saving when k \ll n.

Proof Sketch

Multiply (A + UCV) by the proposed right-hand side and verify you get I. This is a direct algebraic verification. Expand the product and simplify using AA^{-1} = I and CC^{-1} = I.

Why It Matters

This formula appears throughout ML: online learning (rank-1 updates to covariance matrices), Kalman filters, Gaussian process inference with structured kernels, and Bayesian linear regression. Any time you have A^{-1} and need (A + \text{low-rank})^{-1}, use this formula.

Failure Mode

The formula requires both A and C^{-1} + VA^{-1}U to be invertible. If either is singular (or nearly so), the formula is inapplicable or numerically unstable.

The special case with k = 1 (rank-1 update) is the Sherman-Morrison formula:

(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1}uv^T A^{-1}}{1 + v^T A^{-1}u}
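
A minimal sketch of the rank-1 case (the well-conditioned A is an illustrative assumption; in practice A^{-1} would already be on hand):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)   # shift keeps A far from singular
A_inv = np.linalg.inv(A)                          # assume this cost was already paid
u = rng.standard_normal((n, 1))
v = rng.standard_normal((n, 1))

# Sherman-Morrison: O(n^2) given A_inv, versus O(n^3) to invert from scratch.
denom = 1.0 + (v.T @ A_inv @ u).item()
updated_inv = A_inv - (A_inv @ u) @ (v.T @ A_inv) / denom

print(np.allclose(updated_inv, np.linalg.inv(A + u @ v.T)))
```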

ML Implementation Patterns

| Task | Algebraic object | Safer implementation habit |
| --- | --- | --- |
| Solve Ax = b | A^{-1}b | Use solve, LU, QR, or Cholesky; do not form A^{-1} |
| Gaussian log likelihood | \log\det(\Sigma) | Use Cholesky and sum log diagonals |
| Least squares | X^+y | Use QR or SVD, especially when X^\top X is ill-conditioned |
| Rank or low-rank structure | \mathrm{rank}(A) | Inspect singular values and choose a tolerance |
| Online covariance / kernel update | (A + UCV)^{-1} | Use SMW only when the small inner system is well-conditioned |

Common Confusions

Watch Out

Trace and determinant serve different purposes

The trace sums eigenvalues; the determinant multiplies them. A matrix can have large trace but zero determinant (e.g., \text{diag}(100, 0)). The trace tells you about the "total magnitude" of eigenvalues; the determinant tells you whether the matrix is invertible and how it scales volume.

Watch Out

Never invert matrices explicitly

In nearly all practical ML code, you should solve Ax = b using a linear system solver (e.g., np.linalg.solve), not by computing A^{-1} and then multiplying. Direct inversion is slower, less numerically stable, and rarely necessary.

Watch Out

A tiny determinant is not the same as a bad solve

Determinants multiply all volume scalings together, so they depend strongly on dimension and units. A matrix can have a tiny determinant because it has many moderately small singular values, while the solve may still be acceptable. A matrix can also have determinant 1 and be badly conditioned if one singular value is huge and another is tiny. Use singular values or \kappa(A) when the question is numerical safety.
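
A two-line illustration of the gap (numbers chosen purely for illustration):

```python
import numpy as np

# Determinant ~1 but condition number ~1e12: the volume looks fine, the solve is not.
A = np.diag([1e6, 1e-6])
print(np.linalg.det(A), np.linalg.cond(A))

# Tiny determinant but condition number 1: a perfectly benign, well-scaled matrix.
B = 1e-3 * np.eye(2)
print(np.linalg.det(B), np.linalg.cond(B))
```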

Summary

  • \text{tr}(A) = sum of eigenvalues; use the cyclic property freely
  • \det(A) = product of eigenvalues; zero means singular
  • The pseudoinverse A^+ gives the least-squares solution to Ax = b
  • Condition number \kappa(A) measures how dangerous inversion is
  • Schur complements factor block matrices and appear in Gaussian conditioning
  • Sherman-Morrison-Woodbury turns rank-k updates into k \times k solves
  • Use Cholesky/QR/SVD/logdet tricks in code; algebraic identities are not always numerically safe as written
  • In practice, solve linear systems instead of inverting matrices

Exercises

ExerciseCore

Problem

Let A = \text{diag}(3, 5, 7). Compute \text{tr}(A), \det(A), A^{-1}, and \kappa(A) (using the 2-norm).

ExerciseCore

Problem

Show that \text{tr}(AB) = \text{tr}(BA) for any matrices A \in \mathbb{R}^{m \times n} and B \in \mathbb{R}^{n \times m}.

ExerciseAdvanced

Problem

You have computed A^{-1} for a 1000 \times 1000 matrix. A new data point arrives, requiring you to compute (A + uv^T)^{-1} where u, v \in \mathbb{R}^{1000}. What is the computational cost using Sherman-Morrison versus recomputing the inverse from scratch?

References

Canonical:

  • Strang, Introduction to Linear Algebra (2016), Chapters 2, 5, 6
  • Horn & Johnson, Matrix Analysis (2013), Chapters 0-1
  • Golub & Van Loan, Matrix Computations (2013), Chapters 2-3 (matrix operations and LU/QR factorization)

Current:

  • Petersen & Pedersen, The Matrix Cookbook (2012). Essential reference for matrix identities.
  • Axler, Linear Algebra Done Right (2024), Chapters 3-4 (linear maps, invertibility, determinants)
  • Deisenroth, Faisal, Ong, Mathematics for Machine Learning (2020), Chapter 4 (matrix decompositions)

Last reviewed: April 26, 2026
