Foundations
Tensors and Tensor Operations
What a tensor actually is: a multilinear map with specific transformation rules, how tensor contraction generalizes matrix multiplication, Einstein summation, tensor decompositions, and how ML frameworks use the word tensor to mean multidimensional array.
Why This Matters
Every neural network operates on tensors. A batch of images is a rank-4 tensor (batch, channels, height, width). Attention scores are rank-3 tensors. Weight matrices are rank-2 tensors. Understanding what tensors are, both the mathematical object and the computational object, is prerequisite to reading any ML paper or writing any training loop. The prerequisite linear algebra is covered in eigenvalues and eigenvectors.
The tensor-product symbol $\otimes$ appears in formal definitions; in code, tensors are most often treated as multidimensional arrays.
Mental Model
A scalar is a single number. A vector is a list of numbers. A matrix is a grid of numbers. A tensor extends this pattern to arbitrary dimensions. But that description misses the point. A tensor is not just an array with more indices. It is a multilinear map that transforms in specific ways under changes of basis.
In ML practice, the distinction rarely matters because we work in fixed coordinate systems. But when you encounter differential geometry, general relativity, or continuum mechanics, the transformation rules are the entire point.
Formal Setup and Notation
Tensor (algebraic)
A tensor of order $(p, q)$ on a vector space $V$ is a multilinear map:
$$T : \underbrace{V^* \times \cdots \times V^*}_{p \text{ copies}} \times \underbrace{V \times \cdots \times V}_{q \text{ copies}} \to \mathbb{R},$$
where $V^*$ denotes the dual space of $V$. The tensor has $p$ contravariant indices and $q$ covariant indices. In coordinates, $T$ is represented by a $(p+q)$-dimensional array of components $T^{i_1 \cdots i_p}{}_{j_1 \cdots j_q}$.
Tensor order (rank)
The order of a tensor is the number of indices needed to specify a component. A scalar has order 0, a vector has order 1, a matrix has order 2, and so on. Some authors call this "rank," but that conflicts with the linear algebra notion of rank (dimension of the column space). To avoid confusion: "order" means number of indices, "rank" means something different (see tensor rank below).
Tensor contraction
Contraction sums over one upper and one lower index, reducing an order-$(p, q)$ tensor to an order-$(p-1, q-1)$ tensor:
$$S^{i_1 \cdots i_{p-1}}{}_{j_1 \cdots j_{q-1}} = \sum_{k} T^{i_1 \cdots i_{p-1}\, k}{}_{j_1 \cdots j_{q-1}\, k}.$$
Matrix multiplication is contraction: $(AB)^i{}_j = A^i{}_k B^k{}_j$ (summed over $k$). The trace is contraction of both indices: $\operatorname{tr}(A) = A^i{}_i$.
Einstein Summation Convention
Einsum: kept indices appear in the output; the repeated index is summed
Repeated indices (one upper, one lower) are implicitly summed: $A^i{}_k B^k{}_j$ means $\sum_k A^i{}_k B^k{}_j$.
NumPy and PyTorch implement this with np.einsum and torch.einsum. The
string notation maps directly: torch.einsum('ik,kj->ij', A, B) performs
matrix multiplication.
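A minimal sanity check of this correspondence; the shapes below are illustrative, not taken from the text:

```python
import torch

# The einsum string 'ik,kj->ij' contracts over the shared index k,
# which is exactly matrix multiplication.
A = torch.randn(3, 4)
B = torch.randn(4, 5)

C_einsum = torch.einsum('ik,kj->ij', A, B)
C_matmul = A @ B

print(torch.allclose(C_einsum, C_matmul))  # True
```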
Common einsum patterns:
- Matrix multiply: ik,kj->ij
- Batch matrix multiply: bik,bkj->bij
- Trace: ii->
- Outer product: i,j->ij
- Attention scores: bqd,bkd->bqk
Reading einsum strings systematically
Each letter is an index. Letters appearing on both the left and the right of -> are free indices (they appear in the output). Letters appearing only on the left are contracted (summed over). The order of free indices on the right determines the output shape.
Examples with explicit shapes:
Examples with explicit shapes (two of these are verified in the sketch after this list):
- ij,jk->ik: inputs of shape (m, n) and (n, p), output of shape (m, p). Standard matrix multiplication.
- ij,ij->: inputs of the same shape, output a scalar. Frobenius inner product $\sum_{i,j} A_{ij} B_{ij}$.
- ij,kl->ijkl: outer product; the output carries all four indices.
- bij,bjk->bik: batched matrix multiply; the batch index b is kept and the contraction runs over j.
- ijk,ijk->: full contraction, output a scalar. Useful for computing the squared Frobenius norm of an order-3 tensor.
- ij->ji: transpose, swaps the two indices.
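A quick check of two of these patterns on small random inputs (shapes are illustrative):

```python
import torch

A = torch.randn(4, 5)
B = torch.randn(4, 5)

# ij,ij-> : Frobenius inner product of two matrices of the same shape.
frob_inner = torch.einsum('ij,ij->', A, B)
print(torch.allclose(frob_inner, (A * B).sum()))  # True

# ijk,ijk-> : full contraction of an order-3 tensor with itself,
# i.e. its squared Frobenius norm.
T = torch.randn(3, 4, 5)
sq_norm = torch.einsum('ijk,ijk->', T, T)
print(torch.allclose(sq_norm, (T ** 2).sum()))  # True
```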
The notation also makes eigenvalue computations explicit. The characteristic polynomial involves $\det(A - \lambda I)$; the trace in einsum notation is ii->, which contracts the two indices of a square matrix. The sum of eigenvalues equals the trace, which is why torch.einsum('ii->', A) computes the trace of A.
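A small sketch of this identity on a random matrix (the matrix and tolerance are assumptions for the demo):

```python
import torch

A = torch.randn(5, 5)

trace_einsum = torch.einsum('ii->', A)           # sum of diagonal entries
eig_sum = torch.linalg.eigvals(A).sum().real     # eigenvalues may be complex;
                                                 # imaginary parts cancel in pairs

print(torch.allclose(trace_einsum, eig_sum, atol=1e-4))  # True
```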
Why einsum notation matters for ML code
Many numerical errors in deep learning come from incorrect axis ordering in tensor operations. Einsum makes the contraction structure explicit, eliminating transposition bugs. Consider computing the quadratic form $x^\top A x$ for a batch of vectors $x$ and a shared matrix $A$: torch.einsum('bi,ij,bj->b', x, A, x) contracts both the $i$ and $j$ indices, producing a scalar per batch element. The naive approach using torch.bmm would require broadcasting $A$ explicitly, introducing a potential shape mismatch.
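A sketch of the batched quadratic form; batch size and dimension are illustrative:

```python
import torch

batch, d = 8, 16
x = torch.randn(batch, d)
A = torch.randn(d, d)

# One scalar per batch element, contracting over both i and j.
quad_einsum = torch.einsum('bi,ij,bj->b', x, A, x)

# Equivalent explicit form for comparison.
quad_manual = ((x @ A) * x).sum(dim=-1)

print(torch.allclose(quad_einsum, quad_manual, atol=1e-4))  # True
```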
This is particularly relevant for attention: the $QK^\top$ operation in a multi-head attention block involves query and key tensors of shape (batch, heads, query_len, head_dim) and (batch, heads, key_len, head_dim), and einsum('bhqd,bhkd->bhqk', Q, K) captures the contraction precisely.
Tensor Decompositions
CP decomposition
The Canonical Polyadic (CP) decomposition writes an order-3 tensor $\mathcal{T}$ as a sum of $R$ rank-1 tensors (outer products of vectors):
$$\mathcal{T} \approx \sum_{r=1}^{R} a_r \otimes b_r \otimes c_r, \qquad \mathcal{T}_{ijk} \approx \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr}.$$
The minimal $R$ for exact decomposition is the tensor rank. Unlike matrix rank, tensor rank is NP-hard to compute, and the best rank-$R$ approximation may not exist (the set of tensors of rank at most $R$ is not closed).
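A minimal sketch of reconstructing a tensor from its CP factors via einsum; the dimensions and rank are assumptions chosen for illustration:

```python
import torch

I, J, K, R = 10, 12, 14, 5
A = torch.randn(I, R)  # columns a_r
B = torch.randn(J, R)  # columns b_r
C = torch.randn(K, R)  # columns c_r

# T_{ijk} = sum_r A_{ir} B_{jr} C_{kr}
T = torch.einsum('ir,jr,kr->ijk', A, B, C)
print(T.shape)  # torch.Size([10, 12, 14])
```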
Tucker decomposition
The Tucker decomposition writes an order-3 tensor as:
$$\mathcal{T}_{ijk} = \sum_{p,q,r} \mathcal{G}_{pqr}\, A_{ip}\, B_{jq}\, C_{kr},$$
where $\mathcal{G}$ is a smaller core tensor and $A$, $B$, $C$ are factor matrices. This generalizes the SVD to higher orders. Unlike CP, Tucker allows interactions between components via the core tensor $\mathcal{G}$.
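The same reconstruction written as a single einsum; core and factor sizes are illustrative assumptions:

```python
import torch

I, J, K = 10, 12, 14   # original dimensions
P, Q, R = 3, 4, 5      # core dimensions

G = torch.randn(P, Q, R)   # core tensor
A = torch.randn(I, P)      # factor matrices
B = torch.randn(J, Q)
C = torch.randn(K, R)

# T_{ijk} = sum_{pqr} G_{pqr} A_{ip} B_{jq} C_{kr}
T = torch.einsum('pqr,ip,jq,kr->ijk', G, A, B, C)
print(T.shape)  # torch.Size([10, 12, 14])
```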
The Computational Tensor: ML Frameworks
In PyTorch and NumPy, a "tensor" is a multidimensional array with the following attributes (inspected in the sketch after this list):
- Shape: tuple of dimensions, e.g., (32, 3, 224, 224) for a batch of images
- dtype: data type (float32, float16, bfloat16, etc.)
- device: where the data lives (cpu, cuda:0)
- Strides: number of elements to skip in memory per index increment
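A short inspection of these attributes, reusing the image-batch shape from above:

```python
import torch

x = torch.zeros(32, 3, 224, 224, dtype=torch.float32)

print(x.shape)     # torch.Size([32, 3, 224, 224])
print(x.dtype)     # torch.float32
print(x.device)    # cpu (or cuda:0 if allocated on a GPU)
print(x.stride())  # elements skipped per index increment: (150528, 50176, 224, 1)
```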
This is the physicist/mathematician definition stripped of transformation rules. The ML tensor does not transform covariantly or contravariantly. It is an array with named dimensions and a compute device.
Broadcasting Rules
When operating on tensors of different shapes, broadcasting aligns dimensions from the right and expands size-1 dimensions:
- If tensors differ in number of dimensions, prepend size-1 dimensions to the smaller tensor.
- Dimensions of size 1 are stretched to match the other tensor.
- Dimensions must be equal or one of them must be 1.
Adding a vector of shape (3,) to a matrix of shape (4, 3) broadcasts the
vector across all 4 rows. Broadcasting avoids explicit copies, saving memory.
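A sketch of that row-wise broadcast; the concrete values are illustrative:

```python
import torch

# A (3,) vector added to a (4, 3) matrix is applied to every row
# without materializing four copies of the vector.
M = torch.arange(12, dtype=torch.float32).reshape(4, 3)
v = torch.tensor([10.0, 20.0, 30.0])

out = M + v
print(out.shape)  # torch.Size([4, 3]); v is broadcast across the 4 rows
```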
Main Theorems
Universality of Tensor Product
Statement
For any bilinear map $f : V \times W \to Z$, there exists a unique linear map $\tilde{f} : V \otimes W \to Z$ such that $\tilde{f}(v \otimes w) = f(v, w)$ for all $v \in V$, $w \in W$.
Intuition
The tensor product is the "freest" bilinear construction. Every bilinear map factors through it. This is why tensors appear everywhere: they are the canonical way to combine vector spaces while preserving linearity in each argument separately.
Proof Sketch
Define $\tilde{f}$ on elementary tensors by $\tilde{f}(v \otimes w) = f(v, w)$ and extend by linearity. Bilinearity of $f$ ensures this is well-defined (independent of how a tensor is written as a sum of elementary tensors). Uniqueness follows because elementary tensors span $V \otimes W$.
Why It Matters
This universal property is why tensors appear in physics (stress tensors, electromagnetic field tensor) and ML (multilinear maps between feature spaces). The tensor product is the right algebraic structure for anything multilinear.
Failure Mode
The algebraic universal property holds for arbitrary vector spaces, including infinite-dimensional ones. What changes in the infinite-dimensional analytic setting is that the algebraic tensor product is typically not complete with respect to natural norms. In Hilbert and Banach spaces one introduces completed tensor products (Hilbert-Schmidt, projective, injective, nuclear) when topology, convergence, or operator-norm structure is required. This matters for kernel methods in RKHS and operator-valued kernels but not for standard neural networks operating on finite-dimensional activations.
Canonical Examples
Attention as tensor contraction
In self-attention, queries $Q$, keys $K$, and values $V$ are rank-3 tensors of shape (batch, seq_len, d_model). The attention score computation is:
$$S_{bqk} = \sum_{d} Q_{bqd}\, K_{bkd},$$
i.e., $QK^\top$ within each batch element. In einsum: bqd,bkd->bqk. This is a batched contraction over the embedding dimension. The output is a rank-3 tensor of attention logits.
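A sketch of this contraction, with the head dimension omitted to match the shapes above (sizes are illustrative):

```python
import torch

batch, seq_len, d_model = 2, 6, 8
Q = torch.randn(batch, seq_len, d_model)
K = torch.randn(batch, seq_len, d_model)

# Batched contraction over the embedding dimension d.
scores = torch.einsum('bqd,bkd->bqk', Q, K)
print(scores.shape)  # torch.Size([2, 6, 6])

# Equivalent to a batched matmul against K with its last two dims swapped.
print(torch.allclose(scores, Q @ K.transpose(-1, -2)))  # True
```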
CP decomposition for weight compression
A weight tensor of shape $d_1 \times d_2 \times d_3$ with $d_1 d_2 d_3$ parameters can be approximated by a rank-$R$ CP decomposition with $R(d_1 + d_2 + d_3)$ parameters. When $R$ is small relative to the tensor dimensions, this yields a large compression.
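A back-of-the-envelope sketch of the parameter counts; the dimensions and rank below are assumptions chosen only to illustrate the formula:

```python
# Dense storage vs. rank-R CP factors for an order-3 weight tensor.
d1, d2, d3, R = 256, 256, 256, 32

dense_params = d1 * d2 * d3        # store the full tensor
cp_params = R * (d1 + d2 + d3)     # store the three factor matrices

print(dense_params, cp_params, round(dense_params / cp_params))
# 16777216 24576 683  -> roughly a 680x reduction at this rank
```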
Common Confusions
Tensor order vs tensor rank
The order (number of indices) of a tensor is not the same as its rank (minimum number of rank-1 terms in a CP decomposition). A $3 \times 3$ matrix has order 2 but could have rank anywhere from 0 to 3. The ML community often says "rank" when they mean "order." Watch for this.
ML tensors vs math tensors
A PyTorch tensor does not transform under change of basis. It is a multidimensional array, full stop. The mathematical tensor carries transformation rules (covariant or contravariant). In ML, this distinction almost never matters because we work in a fixed computational basis. If you move to differential geometry or physics, it matters a great deal.
Broadcasting is not free
Broadcasting creates no physical copies, but it does expand the computation.
Adding a shape (1, 1000) tensor to a shape (1000, 1000) tensor performs 1,000,000 additions, not 1,000. Memory is saved, compute is not.
Summary
- A tensor is a multilinear map. In coordinates, it is a multidimensional array that transforms according to specific rules under change of basis.
- Order = number of indices. Rank = minimum CP decomposition terms.
- Contraction generalizes matrix multiplication and trace.
- Einstein summation (einsum) is the lingua franca for tensor operations.
- CP and Tucker decompositions compress tensors, with applications in model compression and data analysis.
- In ML frameworks, "tensor" means multidimensional array with shape, dtype, and device. No transformation rules are tracked.
Exercises
Problem
Write the einsum string for computing the squared Frobenius norm of a matrix $A$, i.e., $\|A\|_F^2 = \sum_{i,j} A_{ij}^2$.
Problem
An order-3 tensor of shape $n_1 \times n_2 \times n_3$ has $n_1 n_2 n_3$ entries. Suppose its CP rank is $R$. How many parameters does the CP decomposition use? By what factor is this compressed compared to storing the tensor directly?
References
Canonical:
- Kolda & Bader, "Tensor Decompositions and Applications," SIAM Review 51(3), 2009. Sections 2-3 for CP and Tucker decompositions
- Lim, "Tensors and Hypermatrices," Chapter 15 in Handbook of Linear Algebra (2013)
- Einstein, "Die Grundlage der allgemeinen Relativitätstheorie," Annalen der Physik 49, 1916. Original summation convention
- Bourbaki, Algebra I, Chapter 3. Universal property of tensor products in full generality
Current:
- Novikov et al., "Tensorizing Neural Networks," NeurIPS 2015
- PyTorch documentation on torch.einsum
- Goodfellow, Bengio, Courville, Deep Learning (2016), Chapter 2. Computational tensor operations for ML
- Vaswani et al., "Attention Is All You Need," NeurIPS 2017. Einsum patterns in multi-head attention
Next Topics
From tensors, the natural continuations are:
- Feedforward networks and backpropagation: neural networks as compositions of tensor operations
- Principal component analysis: the SVD is the order-2 analog of tensor decomposition
- Matrix operations and properties: matrices as order-2 tensors; determinants, traces, and inverses as contraction operations
- Eigenvalues and eigenvectors: spectral structure that generalizes to tensor decompositions via the Tucker core
Last reviewed: April 26, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Eigenvalues and Eigenvectors (layer 0A · tier 1)
- Pandas and NumPy Fundamentals (layer 4 · tier 3)
Derived topics
- Principal Component Analysis (layer 1 · tier 1)
- Feedforward Networks and Backpropagation (layer 2 · tier 1)