
Foundations

Inner Product Spaces and Orthogonality

Inner product axioms, Cauchy-Schwarz inequality, orthogonality, Gram-Schmidt, projections, and the bridge to Hilbert spaces.

Core · Tier 1 · Stable · Supporting · ~40 min

Why This Matters

Inner products define geometry in vector spaces: lengths, angles, and orthogonality. Kernel methods replace the standard dot product with arbitrary inner products to measure similarity in high-dimensional feature spaces. Projections onto subspaces underlie least squares linear regression, PCA, and conditional expectation.

The angle-bracket notation $\langle \cdot, \cdot \rangle$ denotes the inner product, and $\perp$ denotes orthogonality.

Core Definitions

Definition

Inner Product

An inner product on a real vector space $V$ is a function $\langle \cdot, \cdot \rangle: V \times V \to \mathbb{R}$ satisfying:

  1. Symmetry: $\langle u, v \rangle = \langle v, u \rangle$
  2. Linearity in the first argument: $\langle \alpha u + \beta v, w \rangle = \alpha \langle u, w \rangle + \beta \langle v, w \rangle$
  3. Positive definiteness: $\langle v, v \rangle \geq 0$, with equality iff $v = 0$

The induced norm is $\|v\| = \sqrt{\langle v, v \rangle}$. The standard inner product on $\mathbb{R}^n$ is $\langle x, y \rangle = x^T y$.
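A quick numerical check of the definitions above, using NumPy (the particular vectors are illustrative):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, -1.0, 2.0])

# Standard inner product on R^n: <u, v> = u^T v
ip = u @ v

# Induced norm: ||v|| = sqrt(<v, v>)
norm_v = np.sqrt(v @ v)

assert np.isclose(ip, np.dot(u, v))
assert np.isclose(norm_v, np.linalg.norm(v))
```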

Definition

Orthogonality

Two vectors $u, v$ are orthogonal, written $u \perp v$, if and only if $\langle u, v \rangle = 0$. A set $\{e_1, \ldots, e_k\}$ is orthonormal if and only if $\langle e_i, e_j \rangle = \delta_{ij}$ (Kronecker delta). The orthogonal complement of a subspace $W$ is $W^\perp = \{v \in V : \langle v, w \rangle = 0 \text{ for all } w \in W\}$.

Definition

Orthogonal Projection

The orthogonal projection of $v$ onto a subspace $W$ is the unique $\hat{v} \in W$ minimizing $\|v - w\|$ over $w \in W$, when this minimizer exists. If $\{e_1, \ldots, e_k\}$ is an orthonormal basis for $W$:

$$\text{proj}_W(v) = \sum_{i=1}^{k} \langle v, e_i \rangle \, e_i$$

The residual $v - \text{proj}_W(v)$ lies in $W^\perp$.

In finite dimensions every subspace admits a unique projection. In infinite-dimensional Hilbert spaces, existence and uniqueness of $\hat{v}$ require $W$ to be a closed subspace (Hilbert projection theorem). For a non-closed subspace, the infimum $\inf_{w \in W} \|v - w\|$ may not be attained in $W$; one then projects onto its closure $\overline{W}$.
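The projection formula and the orthogonality of the residual are easy to verify numerically. A minimal sketch (the subspace dimension and random seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=5)

# Orthonormal basis for a random 2-dimensional subspace W, via QR
Q, _ = np.linalg.qr(rng.normal(size=(5, 2)))

# proj_W(v) = sum_i <v, e_i> e_i
proj = sum((v @ Q[:, i]) * Q[:, i] for i in range(Q.shape[1]))

# The residual lies in W^perp: orthogonal to every basis vector of W
residual = v - proj
assert np.allclose(Q.T @ residual, 0.0)
```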

Orthogonal projection is the geometry behind least squares. When the linear system $Ax = b$ has no exact solution, the closest reachable point is $\text{proj}_{\text{Col}(A)}(b) = Ax^+$, where $x^+ = A^+ b$ is given by the Moore-Penrose pseudoinverse. The residual $b - Ax^+$ lies in $\text{Col}(A)^\perp$, which is the optimality condition. The Pseudoinverse Geometry Lab makes this draggable: move $b$ and watch the residual stay perpendicular to $\text{Col}(A)$.
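The pseudoinverse picture can be checked directly with `np.linalg.pinv`; a sketch with an illustrative overdetermined system:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 3))   # tall matrix: Ax = b generically has no solution
b = rng.normal(size=6)

x_plus = np.linalg.pinv(A) @ b    # Moore-Penrose solution x+ = A+ b
proj_b = A @ x_plus               # projection of b onto Col(A)
residual = b - proj_b

# Optimality condition: the residual lies in Col(A)^perp, i.e. A^T r = 0
assert np.allclose(A.T @ residual, 0.0)
```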

Definition

Hilbert Space (Preview)

A Hilbert space is a complete inner product space: every Cauchy sequence converges. Finite-dimensional inner product spaces are automatically Hilbert spaces. Infinite-dimensional examples include $L^2$ (the space of square-integrable functions) and the reproducing kernel Hilbert spaces (RKHS) used in kernel methods.

Main Theorems

Theorem

Cauchy-Schwarz Inequality

Statement

For all $u, v \in V$:

$$|\langle u, v \rangle| \leq \|u\| \, \|v\|$$

Equality holds if and only if $u$ and $v$ are linearly dependent.

Intuition

The cosine of the angle between two vectors has magnitude at most 1. In any inner product space we can define $\cos \theta = \langle u, v \rangle / (\|u\| \|v\|)$, and Cauchy-Schwarz guarantees this quantity lies in $[-1, 1]$, so the angle is well-defined.

Proof Sketch

For $v \neq 0$, define $w = u - \frac{\langle u, v \rangle}{\langle v, v \rangle} v$. Then $\langle w, v \rangle = 0$ and $0 \leq \|w\|^2 = \|u\|^2 - \frac{\langle u, v \rangle^2}{\|v\|^2}$. Rearranging gives $\langle u, v \rangle^2 \leq \|u\|^2 \|v\|^2$.

Why It Matters

Cauchy-Schwarz is the single most used inequality in analysis. It proves the triangle inequality for the induced norm. It bounds inner products (and hence correlations, cosine similarities, kernel evaluations) in terms of norms.

Failure Mode

The inequality is tight only when vectors are linearly dependent. For nearly orthogonal vectors, the bound is very loose. In high dimensions, random vectors tend to be nearly orthogonal, so the bound often has significant slack.
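The high-dimensional slack is easy to see numerically; a sketch with illustrative dimensions and seed:

```python
import numpy as np

rng = np.random.default_rng(2)
u = rng.normal(size=1000)
v = rng.normal(size=1000)

lhs = abs(u @ v)
rhs = np.linalg.norm(u) * np.linalg.norm(v)
assert lhs <= rhs   # Cauchy-Schwarz always holds

# Random high-dimensional vectors are nearly orthogonal, so the
# ratio |<u,v>| / (||u|| ||v||) -- the |cosine| -- is typically
# around 1/sqrt(n), far from the bound of 1.
ratio = lhs / rhs
assert ratio < 0.5
```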

Gram-Schmidt Orthogonalization

Given linearly independent vectors $\{v_1, \ldots, v_k\}$, the Gram-Schmidt process produces an orthonormal set $\{e_1, \ldots, e_k\}$ spanning the same subspace:

$$u_j = v_j - \sum_{i=1}^{j-1} \langle v_j, e_i \rangle \, e_i, \qquad e_j = \frac{u_j}{\|u_j\|}$$

This is the constructive proof that every finite-dimensional inner product space has an orthonormal basis. In matrix form, Gram-Schmidt produces the QR decomposition $A = QR$, which is central to numerical linear algebra.
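The recursion above translates directly into code. A minimal NumPy sketch (classical Gram-Schmidt; production code uses the modified variant or Householder QR for numerical stability):

```python
import numpy as np

def gram_schmidt(V):
    """Orthonormalize the columns of V (assumed linearly independent)."""
    E = []
    for v in V.T:
        # u_j = v_j - sum_{i<j} <v_j, e_i> e_i
        u = v - sum((v @ e) * e for e in E)
        E.append(u / np.linalg.norm(u))   # e_j = u_j / ||u_j||
    return np.column_stack(E)

rng = np.random.default_rng(3)
A = rng.normal(size=(5, 3))
Q = gram_schmidt(A)

# Columns are orthonormal, and A = QR with R = Q^T A upper triangular
assert np.allclose(Q.T @ Q, np.eye(3))
R = Q.T @ A
assert np.allclose(Q @ R, A)
assert np.allclose(np.tril(R, -1), 0.0)
```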

Bessel's Inequality and the Polarization Identity

Two identities recur throughout analysis and ML.

Definition

Bessel's Inequality

Let $\{e_1, e_2, \ldots\}$ be an orthonormal set in an inner product space $V$ (finite or countable). For every $v \in V$:

$$\sum_i |\langle v, e_i \rangle|^2 \leq \|v\|^2$$

with equality if and only if $v$ lies in the closure of $\text{span}\{e_i\}$. Each term $|\langle v, e_i \rangle|^2$ is the energy of $v$ along the direction $e_i$; the inequality says total projected energy cannot exceed total energy. Equality for every $v$ is Parseval's identity and characterizes complete orthonormal systems.
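Both Bessel's inequality and the Parseval equality case can be demonstrated in a few lines (dimensions and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
v = rng.normal(size=10)

# Incomplete orthonormal set: 4 orthonormal vectors in R^10
Q, _ = np.linalg.qr(rng.normal(size=(10, 4)))
coeffs = Q.T @ v
# Bessel: projected energy <= total energy (strict here, generically)
assert np.sum(coeffs**2) <= v @ v + 1e-12

# Complete orthonormal basis of R^10: equality (Parseval's identity)
Q_full, _ = np.linalg.qr(rng.normal(size=(10, 10)))
assert np.isclose(np.sum((Q_full.T @ v)**2), v @ v)
```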

Definition

Polarization Identity

For real inner product spaces, the inner product is recoverable from the norm:

$$\langle u, v \rangle = \tfrac{1}{4}\bigl(\|u + v\|^2 - \|u - v\|^2\bigr)$$

Combined with the parallelogram law $\|u + v\|^2 + \|u - v\|^2 = 2\|u\|^2 + 2\|v\|^2$, this characterizes inner-product norms among all norms (Jordan-von Neumann theorem). Practically, it is how kernel methods compute inner products from distances: $\langle u, v \rangle = \tfrac{1}{2}(\|u\|^2 + \|v\|^2 - \|u - v\|^2)$, the trick that turns a distance kernel into an inner product kernel.
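All three identities are exact and can be verified for arbitrary vectors; a quick sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
u, v = rng.normal(size=4), rng.normal(size=4)
n = np.linalg.norm

# Polarization: <u,v> = (||u+v||^2 - ||u-v||^2) / 4
assert np.isclose(u @ v, (n(u + v)**2 - n(u - v)**2) / 4)

# Distance form used in kernel methods:
# <u,v> = (||u||^2 + ||v||^2 - ||u-v||^2) / 2
assert np.isclose(u @ v, (n(u)**2 + n(v)**2 - n(u - v)**2) / 2)

# Parallelogram law
assert np.isclose(n(u + v)**2 + n(u - v)**2, 2*n(u)**2 + 2*n(v)**2)
```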

Beyond the Standard Dot Product

Two inner products that appear constantly in ML.

Definition

Frobenius Inner Product on Matrices

For $A, B \in \mathbb{R}^{m \times n}$:

$$\langle A, B \rangle_F = \text{tr}(A^\top B) = \sum_{i,j} A_{ij} B_{ij}$$

with induced norm $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$. Vectorizing $A$ and $B$ shows this is just the standard dot product on $\mathbb{R}^{mn}$. The Frobenius inner product gives the geometry behind low-rank matrix approximation: the Eckart-Young theorem says the best rank-$k$ approximation of $M$ in Frobenius norm is its truncated SVD, and the optimal error is $\sqrt{\sum_{i > k} \sigma_i^2}$, the energy in the discarded singular values.
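The Eckart-Young identity is a one-liner to check with NumPy's SVD (matrix shape, rank, and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.normal(size=(8, 6))
U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 2
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # truncated SVD, rank k

# Frobenius error equals the energy in the discarded singular values
err = np.linalg.norm(M - M_k, 'fro')
assert np.isclose(err, np.sqrt(np.sum(s[k:]**2)))
```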

Definition

Mahalanobis Inner Product

For a symmetric positive-definite matrix $M \in \mathbb{R}^{d \times d}$:

$$\langle x, y \rangle_M = x^\top M y$$

with induced Mahalanobis distance $d_M(x, y) = \sqrt{(x - y)^\top M (x - y)}$. Choosing $M = \Sigma^{-1}$ (the inverse covariance) makes the geometry whitening-invariant: directions of high variance are down-weighted, so Mahalanobis distance reflects standardized deviation rather than raw Euclidean displacement. Quadratic discriminant analysis, Gaussian likelihoods, and metric learning (Xing et al. 2002; Weinberger and Saul 2009) all use Mahalanobis inner products. The factorization $M = L^\top L$ shows that $\langle x, y \rangle_M = \langle Lx, Ly \rangle$: a Mahalanobis inner product is just a standard inner product after a linear transformation.
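The factorization $M = L^\top L$ is available via Cholesky (NumPy returns the lower factor $C$ with $M = CC^\top$, so $L = C^\top$); a sketch with an illustrative positive-definite $M$:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(4, 4))
M = A.T @ A + 4 * np.eye(4)     # symmetric positive definite by construction

x, y = rng.normal(size=4), rng.normal(size=4)

C = np.linalg.cholesky(M)        # M = C C^T with C lower triangular
L = C.T                          # so M = L^T L
assert np.allclose(L.T @ L, M)

# <x, y>_M = x^T M y = <Lx, Ly>: standard inner product after mapping by L
assert np.isclose(x @ M @ y, (L @ x) @ (L @ y))
```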

Common Confusions

Watch Out

Orthogonality depends on the inner product

Vectors that are orthogonal under the standard dot product may not be orthogonal under a different inner product. In ML, when using a Mahalanobis inner product $\langle x, y \rangle_M = x^T M y$, the notion of orthogonality changes with $M$.

Watch Out

Projection minimizes distance, not angle

The orthogonal projection $\text{proj}_W(v)$ is the closest point in $W$ to $v$ in the norm sense. It does not necessarily minimize the angle between $v$ and elements of $W$.

Canonical Examples

Example

Least squares as projection

Given $Ax = b$ with no exact solution, the least squares solution minimizes $\|Ax - b\|^2$. This is equivalent to projecting $b$ onto the column space of $A$. The projection satisfies the normal equations $A^T A x = A^T b$, which express the condition that the residual $b - Ax$ is orthogonal to the column space.
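The normal equations can be solved directly and checked against NumPy's least squares routine; a sketch with an illustrative full-rank system:

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.normal(size=(7, 3))     # full column rank, so A^T A is invertible
b = rng.normal(size=7)

# Solve the normal equations A^T A x = A^T b
x = np.linalg.solve(A.T @ A, A.T @ b)

# Same minimizer as np.linalg.lstsq, and residual lies in Col(A)^perp
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x, x_ls)
assert np.allclose(A.T @ (b - A @ x), 0.0)
```

(In practice one solves via QR or SVD rather than forming $A^T A$, which squares the condition number; the normal equations are shown here for the geometry.)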

Exercises

ExerciseCore

Problem

Let $u = (1, 2, 3)^T$ and $v = (1, -1, 0)^T$ in $\mathbb{R}^3$ with the standard inner product. Compute $\text{proj}_v(u)$ and verify that $u - \text{proj}_v(u)$ is orthogonal to $v$.

ExerciseAdvanced

Problem

Prove that in any inner product space, the induced norm satisfies the parallelogram law: $\|u + v\|^2 + \|u - v\|^2 = 2\|u\|^2 + 2\|v\|^2$. Show conversely that any norm satisfying the parallelogram law comes from an inner product.

References

Canonical:

  • Axler, Linear Algebra Done Right (2024), Chapters 6-7
  • Halmos, Finite-Dimensional Vector Spaces (1958), Chapters 1-2
  • Strang, Linear Algebra and Its Applications (2006), Section 4.4 (orthogonality and projections)

For ML context:

  • Deisenroth, Faisal, Ong, Mathematics for Machine Learning (2020), Chapter 3
  • Horn & Johnson, Matrix Analysis (2013), Chapter 5 (norms and inner products)
  • Trefethen & Bau, Numerical Linear Algebra (1997), Lectures 7-8 (Gram-Schmidt and QR decomposition)

Last reviewed: April 26, 2026
