
Foundations

Inner Product Spaces and Orthogonality

Inner product axioms, Cauchy-Schwarz inequality, orthogonality, Gram-Schmidt, projections, and the bridge to Hilbert spaces.

Core · Tier 1 · Stable · Supporting · ~40 min

Why This Matters

Inner products define geometry in vector spaces: lengths, angles, and orthogonality. Kernel methods replace the standard dot product with arbitrary inner products to measure similarity in high-dimensional feature spaces. Projections onto subspaces underlie least squares linear regression, PCA, and conditional expectation.

The angle-bracket notation $\langle \cdot, \cdot \rangle$ denotes the inner product, and $\perp$ denotes orthogonality.

Core Definitions

Definition

Inner Product

An inner product on a real vector space $V$ is a function $\langle \cdot, \cdot \rangle: V \times V \to \mathbb{R}$ satisfying:

  1. Symmetry: $\langle u, v \rangle = \langle v, u \rangle$
  2. Linearity in the first argument: $\langle \alpha u + \beta v, w \rangle = \alpha \langle u, w \rangle + \beta \langle v, w \rangle$
  3. Positive definiteness: $\langle v, v \rangle \geq 0$, with equality iff $v = 0$

The induced norm is $\|v\| = \sqrt{\langle v, v \rangle}$. The standard inner product on $\mathbb{R}^n$ is $\langle x, y \rangle = x^T y$.
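A quick numerical check of the definitions above, using NumPy (the particular vectors are illustrative):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, -1.0, 2.0])

# Standard inner product on R^n: <u, v> = u^T v
ip = u @ v

# Induced norm: ||v|| = sqrt(<v, v>)
norm_v = np.sqrt(v @ v)

assert np.isclose(ip, np.dot(u, v))
assert np.isclose(norm_v, np.linalg.norm(v))
```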

Definition

Orthogonality

Two vectors $u, v$ are orthogonal, written $u \perp v$, if and only if $\langle u, v \rangle = 0$. A set $\{e_1, \ldots, e_k\}$ is orthonormal if and only if $\langle e_i, e_j \rangle = \delta_{ij}$ (Kronecker delta). The orthogonal complement of a subspace $W$ is $W^\perp = \{v \in V : \langle v, w \rangle = 0 \text{ for all } w \in W\}$.

Definition

Orthogonal Projection

The orthogonal projection of $v$ onto a subspace $W$ is the unique $\hat{v} \in W$ minimizing $\|v - w\|$ over $w \in W$, when this minimizer exists. If $\{e_1, \ldots, e_k\}$ is an orthonormal basis for $W$:

$$\text{proj}_W(v) = \sum_{i=1}^{k} \langle v, e_i \rangle \, e_i$$

The residual $v - \text{proj}_W(v)$ lies in $W^\perp$.

In finite dimensions every subspace admits a unique projection. In infinite-dimensional Hilbert spaces, existence and uniqueness of $\hat{v}$ require $W$ to be a closed subspace (Hilbert projection theorem). For a non-closed subspace, the infimum $\inf_{w \in W} \|v - w\|$ may not be attained in $W$; one then projects onto its closure $\overline{W}$.
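The projection formula and the orthogonality of the residual are easy to verify numerically. A minimal sketch (the subspace dimension and random seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=5)

# Orthonormal basis for a random 2-dimensional subspace W, via QR
Q, _ = np.linalg.qr(rng.normal(size=(5, 2)))

# proj_W(v) = sum_i <v, e_i> e_i
proj = sum((v @ Q[:, i]) * Q[:, i] for i in range(Q.shape[1]))

# The residual lies in W^perp: orthogonal to every basis vector of W
residual = v - proj
assert np.allclose(Q.T @ residual, 0.0)
```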

Orthogonal projection is the geometry behind least squares. When the linear system $Ax = b$ has no exact solution, the closest reachable point is $\text{proj}_{\text{Col}(A)}(b) = Ax^+$, where $x^+ = A^+ b$ is given by the Moore-Penrose pseudoinverse. The residual $b - Ax^+$ lies in $\text{Col}(A)^\perp$, which is the optimality condition. The Pseudoinverse Geometry Lab makes this draggable: move $b$ and watch the residual stay perpendicular to $\text{Col}(A)$.
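The pseudoinverse picture can be checked directly with `np.linalg.pinv`; a sketch with an illustrative overdetermined system:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 3))   # tall matrix: Ax = b generically has no solution
b = rng.normal(size=6)

x_plus = np.linalg.pinv(A) @ b    # Moore-Penrose solution x+ = A+ b
proj_b = A @ x_plus               # projection of b onto Col(A)
residual = b - proj_b

# Optimality condition: the residual lies in Col(A)^perp, i.e. A^T r = 0
assert np.allclose(A.T @ residual, 0.0)
```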

Definition

Hilbert Space (Preview)

A Hilbert space is a complete inner product space: every Cauchy sequence converges. Finite-dimensional inner product spaces are automatically Hilbert spaces. Infinite-dimensional examples include $L^2$ (the space of square-integrable functions) and the reproducing kernel Hilbert spaces (RKHS) used in kernel methods.

Main Theorems

Theorem

Cauchy-Schwarz Inequality

Statement

For all $u, v \in V$:

$$|\langle u, v \rangle| \leq \|u\| \, \|v\|$$

Equality holds if and only if $u$ and $v$ are linearly dependent.

Intuition

The cosine of the angle between two vectors has magnitude at most 1. In any inner product space we can define $\cos \theta = \langle u, v \rangle / (\|u\| \|v\|)$, and Cauchy-Schwarz guarantees this quantity lies in $[-1, 1]$, so the angle is well-defined.

Proof Sketch

For $v \neq 0$, define $w = u - \frac{\langle u, v \rangle}{\langle v, v \rangle} v$. Then $\langle w, v \rangle = 0$ and $0 \leq \|w\|^2 = \|u\|^2 - \frac{\langle u, v \rangle^2}{\|v\|^2}$. Rearranging gives $\langle u, v \rangle^2 \leq \|u\|^2 \|v\|^2$.

Why It Matters

Cauchy-Schwarz is the single most used inequality in analysis. It proves the triangle inequality for the induced norm. It bounds inner products (and hence correlations, cosine similarities, kernel evaluations) in terms of norms.

Failure Mode

The inequality is tight only when vectors are linearly dependent. For nearly orthogonal vectors, the bound is very loose. In high dimensions, random vectors tend to be nearly orthogonal, so the bound often has significant slack.
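The high-dimensional slack is easy to see numerically; a sketch with illustrative dimensions and seed:

```python
import numpy as np

rng = np.random.default_rng(2)
u = rng.normal(size=1000)
v = rng.normal(size=1000)

lhs = abs(u @ v)
rhs = np.linalg.norm(u) * np.linalg.norm(v)
assert lhs <= rhs   # Cauchy-Schwarz always holds

# Random high-dimensional vectors are nearly orthogonal, so the
# ratio |<u,v>| / (||u|| ||v||) -- the |cosine| -- is typically
# around 1/sqrt(n), far from the bound of 1.
ratio = lhs / rhs
assert ratio < 0.5
```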

Gram-Schmidt Orthogonalization

Given linearly independent vectors $\{v_1, \ldots, v_k\}$, the Gram-Schmidt process produces an orthonormal set $\{e_1, \ldots, e_k\}$ spanning the same subspace:

$$u_j = v_j - \sum_{i=1}^{j-1} \langle v_j, e_i \rangle \, e_i, \qquad e_j = \frac{u_j}{\|u_j\|}$$

This is the constructive proof that every finite-dimensional inner product space has an orthonormal basis. In matrix form, Gram-Schmidt produces the QR decomposition $A = QR$, which is central to numerical linear algebra.
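The recursion above translates directly into code. A minimal NumPy sketch (classical Gram-Schmidt; production code uses the modified variant or Householder QR for numerical stability):

```python
import numpy as np

def gram_schmidt(V):
    """Orthonormalize the columns of V (assumed linearly independent)."""
    E = []
    for v in V.T:
        # u_j = v_j - sum_{i<j} <v_j, e_i> e_i
        u = v - sum((v @ e) * e for e in E)
        E.append(u / np.linalg.norm(u))   # e_j = u_j / ||u_j||
    return np.column_stack(E)

rng = np.random.default_rng(3)
A = rng.normal(size=(5, 3))
Q = gram_schmidt(A)

# Columns are orthonormal, and A = QR with R = Q^T A upper triangular
assert np.allclose(Q.T @ Q, np.eye(3))
R = Q.T @ A
assert np.allclose(Q @ R, A)
assert np.allclose(np.tril(R, -1), 0.0)
```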

Bessel's Inequality and the Polarization Identity

Two identities recur throughout analysis and ML.

Definition

Bessel's Inequality

Let $\{e_1, e_2, \ldots\}$ be an orthonormal set in an inner product space $V$ (finite or countable). For every $v \in V$:

$$\sum_i |\langle v, e_i \rangle|^2 \leq \|v\|^2$$

with equality if and only if $v$ lies in the closure of $\text{span}\{e_i\}$. Each term $|\langle v, e_i \rangle|^2$ is the energy of $v$ along the direction $e_i$; the inequality says total projected energy cannot exceed total energy. Equality for every $v$ is Parseval's identity and characterizes complete orthonormal systems.
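Both Bessel's inequality and the Parseval equality case can be demonstrated in a few lines (dimensions and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
v = rng.normal(size=10)

# Incomplete orthonormal set: 4 orthonormal vectors in R^10
Q, _ = np.linalg.qr(rng.normal(size=(10, 4)))
coeffs = Q.T @ v
# Bessel: projected energy <= total energy (strict here, generically)
assert np.sum(coeffs**2) <= v @ v + 1e-12

# Complete orthonormal basis of R^10: equality (Parseval's identity)
Q_full, _ = np.linalg.qr(rng.normal(size=(10, 10)))
assert np.isclose(np.sum((Q_full.T @ v)**2), v @ v)
```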

Definition

Polarization Identity

For real inner product spaces, the inner product is recoverable from the norm:

$$\langle u, v \rangle = \tfrac{1}{4}\bigl(\|u + v\|^2 - \|u - v\|^2\bigr)$$

Combined with the parallelogram law $\|u + v\|^2 + \|u - v\|^2 = 2\|u\|^2 + 2\|v\|^2$, this characterizes inner-product norms among all norms (Jordan-von Neumann theorem). Practically, it is how kernel methods compute inner products from distances: $\langle u, v \rangle = \tfrac{1}{2}(\|u\|^2 + \|v\|^2 - \|u - v\|^2)$, the trick that turns a distance kernel into an inner product kernel.
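All three identities are exact and can be verified for arbitrary vectors; a quick sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
u, v = rng.normal(size=4), rng.normal(size=4)
n = np.linalg.norm

# Polarization: <u,v> = (||u+v||^2 - ||u-v||^2) / 4
assert np.isclose(u @ v, (n(u + v)**2 - n(u - v)**2) / 4)

# Distance form used in kernel methods:
# <u,v> = (||u||^2 + ||v||^2 - ||u-v||^2) / 2
assert np.isclose(u @ v, (n(u)**2 + n(v)**2 - n(u - v)**2) / 2)

# Parallelogram law
assert np.isclose(n(u + v)**2 + n(u - v)**2, 2*n(u)**2 + 2*n(v)**2)
```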

Beyond the Standard Dot Product

Two inner products that appear constantly in ML.

Definition

Frobenius Inner Product on Matrices

For $A, B \in \mathbb{R}^{m \times n}$:

$$\langle A, B \rangle_F = \text{tr}(A^\top B) = \sum_{i,j} A_{ij} B_{ij}$$

with induced norm $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$. Vectorizing $A$ and $B$ shows this is just the standard dot product on $\mathbb{R}^{mn}$. The Frobenius inner product gives the geometry behind low-rank matrix approximation: the Eckart-Young theorem says the best rank-$k$ approximation of $M$ in Frobenius norm is its truncated SVD, and the optimal error is $\sqrt{\sum_{i > k} \sigma_i^2}$, the energy in the discarded singular values.
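The Eckart-Young identity is a one-liner to check with NumPy's SVD (matrix shape, rank, and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.normal(size=(8, 6))
U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 2
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # truncated SVD, rank k

# Frobenius error equals the energy in the discarded singular values
err = np.linalg.norm(M - M_k, 'fro')
assert np.isclose(err, np.sqrt(np.sum(s[k:]**2)))
```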

Definition

Mahalanobis Inner Product

For a symmetric positive-definite matrix $M \in \mathbb{R}^{d \times d}$:

$$\langle x, y \rangle_M = x^\top M y$$

with induced Mahalanobis distance $d_M(x, y) = \sqrt{(x - y)^\top M (x - y)}$. Choosing $M = \Sigma^{-1}$ (the inverse covariance) makes the geometry whitening-invariant: directions of high variance are down-weighted, so Mahalanobis distance reflects standardized deviation rather than raw Euclidean displacement. Quadratic discriminant analysis, Gaussian likelihoods, and metric learning (Xing et al. 2002; Weinberger and Saul 2009) all use Mahalanobis inner products. The factorization $M = L^\top L$ shows that $\langle x, y \rangle_M = \langle Lx, Ly \rangle$: a Mahalanobis inner product is just a standard inner product after a linear transformation.
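The factorization $M = L^\top L$ is available via Cholesky (NumPy returns the lower factor $C$ with $M = CC^\top$, so $L = C^\top$); a sketch with an illustrative positive-definite $M$:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(4, 4))
M = A.T @ A + 4 * np.eye(4)     # symmetric positive definite by construction

x, y = rng.normal(size=4), rng.normal(size=4)

C = np.linalg.cholesky(M)        # M = C C^T with C lower triangular
L = C.T                          # so M = L^T L
assert np.allclose(L.T @ L, M)

# <x, y>_M = x^T M y = <Lx, Ly>: standard inner product after mapping by L
assert np.isclose(x @ M @ y, (L @ x) @ (L @ y))
```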

Common Confusions

Watch Out

Orthogonality depends on the inner product

Vectors that are orthogonal under the standard dot product may not be orthogonal under a different inner product. In ML, when using a Mahalanobis inner product $\langle x, y \rangle_M = x^T M y$, the notion of orthogonality changes with $M$.

Watch Out

Projection minimizes distance, not angle

The orthogonal projection $\text{proj}_W(v)$ is the closest point in $W$ to $v$ in the norm sense. It does not necessarily minimize the angle between $v$ and elements of $W$.

Canonical Examples

Example

Least squares as projection

Given $Ax = b$ with no exact solution, the least squares solution minimizes $\|Ax - b\|^2$. This is equivalent to projecting $b$ onto the column space of $A$. The projection satisfies the normal equations $A^T A x = A^T b$, which express the condition that the residual $b - Ax$ is orthogonal to the column space.
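The normal equations can be solved directly and checked against NumPy's least squares routine; a sketch with an illustrative full-rank system:

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.normal(size=(7, 3))     # full column rank, so A^T A is invertible
b = rng.normal(size=7)

# Solve the normal equations A^T A x = A^T b
x = np.linalg.solve(A.T @ A, A.T @ b)

# Same minimizer as np.linalg.lstsq, and residual lies in Col(A)^perp
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x, x_ls)
assert np.allclose(A.T @ (b - A @ x), 0.0)
```

(In practice one solves via QR or SVD rather than forming $A^T A$, which squares the condition number; the normal equations are shown here for the geometry.)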

Exercises

ExerciseCore

Problem

Let $u = (1, 2, 3)^T$ and $v = (1, -1, 0)^T$ in $\mathbb{R}^3$ with the standard inner product. Compute $\text{proj}_v(u)$ and verify that $u - \text{proj}_v(u)$ is orthogonal to $v$.

ExerciseAdvanced

Problem

Prove that in any inner product space, the induced norm satisfies the parallelogram law: $\|u + v\|^2 + \|u - v\|^2 = 2\|u\|^2 + 2\|v\|^2$. Show conversely that any norm satisfying the parallelogram law comes from an inner product.

References

Canonical:

  • Axler, Linear Algebra Done Right (2024), Chapters 6-7
  • Halmos, Finite-Dimensional Vector Spaces (1958), Chapters 1-2
  • Strang, Linear Algebra and Its Applications (2006), Section 4.4 (orthogonality and projections)

For ML context:

  • Deisenroth, Faisal, Ong, Mathematics for Machine Learning (2020), Chapter 3
  • Horn & Johnson, Matrix Analysis (2013), Chapter 5 (norms and inner products)
  • Trefethen & Bau, Numerical Linear Algebra (1997), Lectures 7-8 (Gram-Schmidt and QR decomposition)

Last reviewed: April 26, 2026
