Foundations
Gram Matrices and Kernel Matrices
The Gram matrix encodes pairwise inner products of a dataset. Always PSD. Appears in kernel methods, PCA, SVD, and attention. Connects linear algebra to ML.
Why This Matters
Figure: Gram matrix of pentagon vertices: rank-2 data, so the spectrum has only 2 nonzero eigenvalues.
The Gram matrix appears everywhere in ML, often without being named. When you do kernel PCA, you operate on a kernel matrix (which is a Gram matrix in feature space). When you compute $X^\top X$ for PCA or linear regression, that is a Gram matrix of the columns of $X$. In transformer attention, $QK^\top$ is a closely related cross inner-product matrix, generalizing the Gram structure to two different projections of the inputs (see the Attention section below for the precise relationship).
Understanding the Gram matrix and its properties (especially positive semidefiniteness) is the key link between linear algebra and kernel methods, and between kernel methods and attention.
Definitions
Gram Matrix
Given vectors $x_1, \dots, x_n$ in an inner product space, the Gram matrix $G \in \mathbb{R}^{n \times n}$ has entries: $G_{ij} = \langle x_i, x_j \rangle$
If $X \in \mathbb{R}^{n \times d}$ is the matrix with $x_i$ as its $i$-th row, then $G = XX^\top$.
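As a quick sanity check, the definition can be computed directly in NumPy; the data points here are illustrative, not taken from the text:

```python
import numpy as np

# Three illustrative points in R^2, stacked as rows of X.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Gram matrix via the matrix product: G = X X^T.
G = X @ X.T

# Entry-by-entry check against the definition G[i, j] = <x_i, x_j>.
n = X.shape[0]
G_manual = np.array([[X[i] @ X[j] for j in range(n)] for i in range(n)])
assert np.allclose(G, G_manual)
print(G)  # [[1. 0. 1.], [0. 1. 1.], [1. 1. 2.]]
```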
Kernel Matrix
Given a kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and data points $x_1, \dots, x_n \in \mathcal{X}$, the kernel matrix (or Gram matrix of the kernel) $K \in \mathbb{R}^{n \times n}$ has entries: $K_{ij} = k(x_i, x_j)$
If $k$ is a valid (positive definite) kernel, then by the Moore-Aronszajn theorem there exists a (unique) reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ and a feature map $\phi : \mathcal{X} \to \mathcal{H}$ such that $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$, so the kernel matrix is the Gram matrix of the mapped data $\phi(x_1), \dots, \phi(x_n)$. Mercer's theorem is a stronger statement: when $\mathcal{X}$ is compact and $k$ is continuous, it gives an explicit spectral expansion of $k$ in terms of eigenfunctions of the associated integral operator. For general PD kernels (e.g. on infinite or non-compact domains, or without continuity), Moore-Aronszajn is the right tool; Mercer is a special case.
The standard inner product $\langle x, y \rangle = x^\top y$ gives the simplest case: the Gram matrix is just $G = XX^\top$. A kernel matrix with kernel $k$ generalizes this to $K_{ij} = k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$, which computes inner products in a (possibly infinite-dimensional) feature space.
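The two cases can be compared numerically. A minimal sketch with randomly generated points (the RBF bandwidth $\sigma$ is an arbitrary choice here):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))   # 5 illustrative points in R^3

# Linear kernel k(x, y) = x^T y: the kernel matrix is exactly the Gram matrix.
K_lin = X @ X.T

# RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)): inner products in an
# infinite-dimensional feature space that is never computed explicitly.
sigma = 1.0
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_rbf = np.exp(-sq_dists / (2 * sigma**2))

# Both are symmetric and PSD: all eigenvalues >= 0 (up to floating-point error).
for K in (K_lin, K_rbf):
    assert np.allclose(K, K.T)
    assert np.linalg.eigvalsh(K).min() >= -1e-10
```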
Positive Semidefiniteness
Gram Matrices are Positive Semidefinite
Statement
For any vectors $x_1, \dots, x_n$ in a real inner product space, the Gram matrix $G$ with $G_{ij} = \langle x_i, x_j \rangle$ is positive semidefinite. That is, for all $c \in \mathbb{R}^n$: $c^\top G c \ge 0$
Intuition
The quadratic form $c^\top G c$ computes the squared norm of a linear combination of the data vectors. Squared norms are always non-negative. This is why Gram matrices are always PSD: they represent inner product structure, and inner products induce non-negative norms.
Proof Sketch
$c^\top G c = \sum_{i,j} c_i c_j \langle x_i, x_j \rangle = \Big\langle \sum_i c_i x_i, \sum_j c_j x_j \Big\rangle = \Big\| \sum_i c_i x_i \Big\|^2 \ge 0$. The second equality uses bilinearity of the inner product. The inequality follows from the definition of a norm.
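The identity at the heart of the proof, $c^\top G c = \|\sum_i c_i x_i\|^2$, can be checked numerically on random data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 2))   # 4 illustrative points in R^2
G = X @ X.T
c = rng.standard_normal(4)        # arbitrary coefficient vector

quad = c @ G @ c                       # the quadratic form c^T G c
norm_sq = np.linalg.norm(c @ X) ** 2   # c @ X computes sum_i c_i x_i

assert np.isclose(quad, norm_sq)   # the two sides of the identity agree
assert quad >= 0                   # hence the quadratic form is non-negative
```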
Why It Matters
PSD is the central property of Gram matrices. It guarantees that eigenvalues are non-negative, that the matrix defines a valid (semi-)metric on the data, and that kernel methods are well-posed. Any matrix that arises as pairwise inner products must be PSD.
Failure Mode
The Gram matrix is PSD but may not be positive definite. It is singular (has zero eigenvalues) exactly when the vectors are linearly dependent. For $n > d$ (more points than dimensions), the Gram matrix has rank at most $d$, so at least $n - d$ eigenvalues are zero.
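The rank deficiency is easy to observe: with more points than dimensions, the extra eigenvalues are numerically zero. A sketch with arbitrary sizes $n = 5$, $d = 2$:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 2))   # n = 5 points in d = 2 dimensions
G = X @ X.T                        # 5 x 5 Gram matrix, rank at most 2

eigvals = np.linalg.eigvalsh(G)
n_zero = np.sum(np.abs(eigvals) < 1e-10)
assert n_zero == 5 - 2   # at least n - d zero eigenvalues (exactly n - d here)
```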
Eigenvalues and Data Geometry
Gram Matrix Eigenvalues Reflect Data Geometry
Statement
The non-zero eigenvalues of $G = XX^\top$ are the same as the non-zero eigenvalues of $X^\top X$. If the SVD of $X$ is $X = U \Sigma V^\top$, then:
$G = XX^\top = U \Sigma \Sigma^\top U^\top$
The eigenvalues of $G$ are $\lambda_i = \sigma_i^2$, where $\sigma_1 \ge \sigma_2 \ge \dots$ are the singular values of $X$.
Intuition
The eigenvalues of the Gram matrix tell you how the data is spread in different directions. Large eigenvalues correspond to directions of high variance in the data. This is exactly the information PCA extracts. Computing PCA from $XX^\top$ (dual PCA) is equivalent to computing it from $X^\top X$ (primal PCA), but the matrix sizes differ: $XX^\top$ is $n \times n$ while $X^\top X$ is $d \times d$.
Proof Sketch
If $X = U \Sigma V^\top$, then $XX^\top = U \Sigma \Sigma^\top U^\top$. Similarly, $X^\top X = V \Sigma^\top \Sigma V^\top$. Both have eigenvalues $\sigma_i^2$.
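The statement is directly verifiable: the eigenvalues of $XX^\top$ match the squared singular values of $X$ on random data (illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((4, 6))   # n = 4 points, d = 6 features
G = X @ X.T

sing_vals = np.linalg.svd(X, compute_uv=False)  # sigma_i, in descending order
eigvals = np.linalg.eigvalsh(G)[::-1]           # reorder to descending

# eigenvalues of the Gram matrix are exactly the squared singular values
assert np.allclose(eigvals, sing_vals**2)
```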
Why It Matters
This duality between $XX^\top$ and $X^\top X$ is the foundation of kernel PCA. When $d > n$ (more features than samples), working with the $n \times n$ Gram matrix is cheaper. When you use a kernel, you only have access to $K$ (the Gram matrix in feature space), and the eigendecomposition of $K$ gives you PCA in feature space.
Failure Mode
Computing the full eigendecomposition of $K$ costs $O(n^3)$. For large $n$, this is prohibitive, and you need approximate methods (randomized SVD, Nyström approximation). Also, the Gram matrix requires $O(n^2)$ storage, which limits applicability to datasets with many observations.
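A minimal sketch of the Nyström idea, with arbitrary sizes and an illustrative RBF bandwidth: a rank-$m$ approximation of $K$ is built from $m \ll n$ landmark columns, so the full $n \times n$ matrix is never needed (it is formed below only to measure the approximation error):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((200, 5))

def rbf(A, B, sigma=3.0):
    """RBF kernel matrix between the rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

# Nystroem approximation from m << n landmark points.
m = 50
idx = rng.choice(len(X), size=m, replace=False)
C = rbf(X, X[idx])                       # n x m block of kernel columns
W = C[idx]                               # m x m landmark-landmark block
K_approx = C @ np.linalg.pinv(W) @ C.T   # rank-m approximation of K

K_full = rbf(X, X)                       # built here only to check the error
err = np.linalg.norm(K_full - K_approx) / np.linalg.norm(K_full)
print(err)
```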
Connections to ML
Kernel Methods
In kernel methods, you work with $k(x_i, x_j)$ and never compute the feature map $\phi$ explicitly. The kernel matrix $K$ is all you need for:
- Kernel SVM: the dual formulation involves $K$ directly
- Kernel PCA: eigendecompose (centered) $K$ to get principal components in feature space
- Gaussian processes: the kernel matrix (plus noise), $K + \sigma^2 I$, is the covariance matrix of the function values
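The kernel PCA bullet can be sketched end to end using only $K$; the dataset and kernel bandwidth below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((30, 4))

# RBF kernel matrix: the only object the algorithm touches.
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = np.exp(-sq / 2)

# Center K in feature space: K_c = (I - 11^T/n) K (I - 11^T/n).
n = len(X)
J = np.eye(n) - np.ones((n, n)) / n
K_c = J @ K @ J

# Top-2 principal component scores in feature space.
eigvals, eigvecs = np.linalg.eigh(K_c)
top = eigvals.argsort()[::-1][:2]
scores = eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0))
print(scores.shape)  # (30, 2)
```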
Attention
In a transformer, the attention matrix is:
$A = \operatorname{softmax}\!\big(QK^\top / \sqrt{d_k}\big)$
Here $Q = X W_Q$ and $K = X W_K$, with generally $W_Q \ne W_K$, so $QK^\top$ is a cross inner-product matrix: it records inner products between two different linear projections of the input tokens. This is not a Gram matrix in the strict sense. It is generally neither symmetric nor positive semidefinite (Exercise 2 works this out). Before the softmax, the score $x_i^\top W_Q W_K^\top x_j$ can be read as a bilinear similarity under the learned metric $W_Q W_K^\top$. Only in the special case $W_Q = W_K$ (tied weights) does $QK^\top$ reduce to a true Gram matrix, of the projected tokens $X W_Q$. The Gram-matrix intuition (pairwise similarities) transfers; the PSD property does not.
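The contrast between untied and tied projections can be demonstrated with random weights (all matrices below are illustrative, not from a trained model):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 5, 4
X = rng.standard_normal((n, d))       # token representations as rows
W_Q = rng.standard_normal((d, d))
W_K = rng.standard_normal((d, d))

# Untied weights: pre-softmax scores Q K^T are generally not symmetric,
# hence not a Gram matrix.
S = (X @ W_Q) @ (X @ W_K).T
print(np.allclose(S, S.T))  # False

# Tied weights W_Q = W_K: the scores become a true Gram matrix of X W_Q,
# so they are symmetric and PSD.
S_tied = (X @ W_Q) @ (X @ W_Q).T
assert np.allclose(S_tied, S_tied.T)
assert np.linalg.eigvalsh(S_tied).min() >= -1e-8
```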
PCA and SVD
PCA on $n$ data points in $\mathbb{R}^d$ can be computed from either $X^\top X$ ($d \times d$, primal) or $XX^\top$ ($n \times n$, dual). The dual formulation uses the Gram matrix and is the starting point for kernel PCA.
Common Confusions
Gram matrix vs covariance matrix
The Gram matrix is $XX^\top$ (size $n \times n$, pairwise similarities between data points). The covariance matrix is $\frac{1}{n} X^\top X$ (size $d \times d$, pairwise relationships between features), assuming centered data. They encode complementary views of the same dataset: point similarities vs feature correlations.
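The two views share spectra up to the $1/n$ factor, which a short check makes concrete (sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((6, 3))
X = X - X.mean(axis=0)        # center the data

G = X @ X.T                    # 6 x 6 Gram matrix (point similarities)
C = X.T @ X / len(X)           # 3 x 3 covariance matrix (feature relations)

# The nonzero eigenvalues of G are n times the eigenvalues of C.
g = np.sort(np.linalg.eigvalsh(G))[::-1][:3]
c = np.sort(np.linalg.eigvalsh(C))[::-1]
assert np.allclose(g, len(X) * c)
```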
PSD does not mean all entries are non-negative
A Gram matrix can have negative entries. For example, if $x_1 = (1, 0)$ and $x_2 = (-1, 0)$, then $G_{12} = \langle x_1, x_2 \rangle = -1$, yet $G$ is still PSD. PSD refers to the quadratic form $c^\top G c \ge 0$ for all $c$, not to individual entries.
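One concrete instance, using a pair of antipodal illustrative vectors:

```python
import numpy as np

x1 = np.array([1.0, 0.0])
x2 = np.array([-1.0, 0.0])

# Gram matrix of two antipodal points: negative off-diagonal entries.
G = np.array([[x1 @ x1, x1 @ x2],
              [x2 @ x1, x2 @ x2]])
print(G)  # [[ 1. -1.], [-1.  1.]]

# Still PSD: eigenvalues are 0 and 2, both non-negative.
assert np.linalg.eigvalsh(G).min() >= -1e-12
```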
Summary
- Gram matrix: $G_{ij} = \langle x_i, x_j \rangle$. For data matrix $X$: $G = XX^\top$
- Always PSD. Eigenvalues are non-negative.
- Kernel matrix: same structure with $k(x_i, x_j)$ replacing $\langle x_i, x_j \rangle$
- Eigenvalues of $XX^\top$ equal the squared singular values of $X$
- Appears in kernel methods, PCA (dual formulation), and SVD. Attention's $QK^\top$ is a related cross inner-product matrix (symmetric/PSD only with tied $W_Q = W_K$)
Exercises
Problem
Given three vectors $x_1, x_2, x_3$, compute the Gram matrix $G = XX^\top$, where the rows of $X$ are $x_1, x_2, x_3$. Verify that $G$ is PSD by checking that all eigenvalues are non-negative.
Problem
Show that the attention score matrix $S = QK^\top$ in a transformer is not a Gram matrix in general. Under what tying conditions on the projections (or on the inputs) does $S$ reduce to a Gram matrix, and what does that imply about its eigenvalues? Construct an explicit example where $S$ has a negative eigenvalue.
References
Canonical:
- Horn & Johnson, Matrix Analysis (2013), Chapter 7 (Positive Definite Matrices)
- Axler, Linear Algebra Done Right (2024), Chapter 6
- Schölkopf & Smola, Learning with Kernels (2002), Chapter 2
- Shawe-Taylor & Cristianini, Kernel Methods for Pattern Analysis (2004), Chapters 3-4
- Aronszajn, "Theory of Reproducing Kernels", Transactions of the American Mathematical Society (1950)
Current:
- Vaswani et al., "Attention Is All You Need" (2017), for the attention connection
- Murphy, Probabilistic Machine Learning: An Introduction (2022), Chapter 16
- Wainwright, High-Dimensional Statistics (2019), Chapter 12
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapter 16
Next Topics
- Kernels and RKHS: the theoretical foundation for kernel matrices
- Principal component analysis: PCA from the dual (Gram matrix) perspective
- Attention mechanism theory: where Gram matrices meet modern deep learning
Last reviewed: April 26, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Eigenvalues and Eigenvectors (layer 0A · tier 1)
- Inner Product Spaces and Orthogonality (layer 0A · tier 1)
- The Kernel Trick (layer 2 · tier 1)
- Distance Metrics Compared (layer 1 · tier 2)
- Matrix Multiplication Algorithms (layer 1 · tier 2)
Derived topics
- Principal Component Analysis (layer 1 · tier 1)
- Gaussian Process Regression (layer 3 · tier 2)
- Kernels and Reproducing Kernel Hilbert Spaces (layer 3 · tier 2)
- Attention Mechanism Theory (layer 4 · tier 2)
- Gaussian Processes for Machine Learning (layer 4 · tier 3)