Tensors and Tensor Operations

What a tensor actually is: a multilinear map with specific transformation rules, how tensor contraction generalizes matrix multiplication, Einstein summation, tensor decompositions, and how ML frameworks use the word tensor to mean multidimensional array.

Core · Tier 1 · Stable · Supporting · ~55 min

Why This Matters

Every neural network operates on tensors. A batch of images is an order-4 tensor (batch, channels, height, width). Attention scores are order-3 tensors. Weight matrices are order-2 tensors. Understanding what tensors are, both the mathematical object and the computational object, is a prerequisite to reading any ML paper or writing any training loop. The prerequisite linear algebra is covered in eigenvalues and eigenvectors.

The tensor-product symbol $\otimes$ appears in formal definitions; in code, tensors are most often treated as multidimensional arrays.

Mental Model

A scalar is a single number. A vector is a list of numbers. A matrix is a grid of numbers. A tensor extends this pattern to arbitrary dimensions. But that description misses the point. A tensor is not just an array with more indices. It is a multilinear map that transforms in specific ways under changes of basis.

In ML practice, the distinction rarely matters because we work in fixed coordinate systems. But when you encounter differential geometry, general relativity, or continuum mechanics, the transformation rules are the entire point.

Formal Setup and Notation

Definition

Tensor (algebraic)

A tensor of order $p$ over vector spaces $V_1, \ldots, V_p$ is a multilinear map:

$$T: V_1^* \times \cdots \times V_k^* \times V_{k+1} \times \cdots \times V_p \to \mathbb{R}$$

where $V_i^*$ denotes the dual space of $V_i$. The tensor has $k$ contravariant indices and $p - k$ covariant indices. In coordinates, $T$ is represented by a $p$-dimensional array of components $T^{i_1 \ldots i_k}_{j_1 \ldots j_{p-k}}$.

Definition

Tensor order (rank)

The order of a tensor is the number of indices needed to specify a component. A scalar has order 0, a vector has order 1, a matrix has order 2, and so on. Some authors call this "rank," but that conflicts with the linear algebra notion of rank (dimension of the column space). To avoid confusion: "order" means number of indices, "rank" means something different (see tensor rank below).

Definition

Tensor contraction

Contraction sums over one upper and one lower index:

$$C(T)^{i_1 \ldots \hat{i}_k \ldots}_{j_1 \ldots \hat{j}_m \ldots} = \sum_{s} T^{i_1 \ldots s \ldots}_{j_1 \ldots s \ldots}$$

Here the hat marks an omitted index: the summation index $s$ takes the place of $i_k$ upstairs and $j_m$ downstairs, and both disappear from the output. Matrix multiplication is contraction: $(AB)^i_j = \sum_k A^i_k B^k_j$. The trace is contraction of a matrix's two indices: $\operatorname{tr}(A) = \sum_i A^i_i$.
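Both identities can be checked directly with NumPy's einsum; a minimal sketch, with arbitrary matrices and seed:

```python
import numpy as np

# Minimal check that contraction reproduces matrix multiplication
# and the trace; the matrices and seed are arbitrary choices.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

# (AB)^i_j = sum_k A^i_k B^k_j
C = np.einsum('ik,kj->ij', A, B)
assert np.allclose(C, A @ B)

# tr(A) = sum_i A^i_i: contracting both indices of a single tensor
t = np.einsum('ii->', A)
assert np.isclose(t, np.trace(A))
```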

Einstein Summation Convention

Einsum: kept indices appear in the output; the repeated index is summed

Repeated indices (one upper, one lower) are implicitly summed:

$$C^i_j = A^i_k B^k_j \quad \text{means} \quad C^i_j = \sum_k A^i_k B^k_j$$

NumPy and PyTorch implement this with np.einsum and torch.einsum. The string notation maps directly: torch.einsum('ik,kj->ij', A, B) performs matrix multiplication.

Common einsum patterns:

  • Matrix multiply: ik,kj->ij
  • Batch matrix multiply: bik,bkj->bij
  • Trace: ii->
  • Outer product: i,j->ij
  • Attention scores: bqd,bkd->bqk
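Each of these patterns can be verified against the equivalent explicit NumPy operation; the shapes below are arbitrary illustrative choices:

```python
import numpy as np

# Check each common einsum pattern against an explicit equivalent.
# All shapes and the seed are arbitrary illustrative choices.
rng = np.random.default_rng(1)
A, B = rng.standard_normal((4, 5)), rng.standard_normal((5, 6))
Ab, Bb = rng.standard_normal((2, 4, 5)), rng.standard_normal((2, 5, 6))
S = rng.standard_normal((5, 5))
u, v = rng.standard_normal(3), rng.standard_normal(4)
Q, K = rng.standard_normal((2, 7, 8)), rng.standard_normal((2, 9, 8))

assert np.allclose(np.einsum('ik,kj->ij', A, B), A @ B)         # matmul
assert np.allclose(np.einsum('bik,bkj->bij', Ab, Bb), Ab @ Bb)  # batched
assert np.isclose(np.einsum('ii->', S), np.trace(S))            # trace
assert np.allclose(np.einsum('i,j->ij', u, v), np.outer(u, v))  # outer
assert np.allclose(np.einsum('bqd,bkd->bqk', Q, K),             # attention
                   Q @ K.transpose(0, 2, 1))
```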

Reading einsum strings systematically

Each letter is an index. Letters appearing on both the left and the right of -> are free indices (they appear in the output). Letters appearing only on the left are contracted (summed over). The order of free indices on the right determines the output shape.

Examples with explicit shapes:

  • ij,jk->ik: inputs $(m \times n)$ and $(n \times p)$, output $(m \times p)$. Standard matrix multiplication.
  • ij,ij->: inputs $(m \times n)$ and $(m \times n)$, output scalar. Frobenius inner product $\langle A, B \rangle_F = \sum_{ij} A_{ij} B_{ij}$.
  • ij,kl->ijkl: outer product of $(m \times n)$ and $(p \times q)$ inputs, output $(m \times n \times p \times q)$.
  • bij,bjk->bik: batched matrix multiply, output $(b \times m \times p)$.
  • ijk,ijk->: full contraction of two order-3 tensors, output scalar. Applied to a tensor and itself, this gives the squared Frobenius norm.
  • ij->ji: transpose, swaps the two indices.

The notation also connects to eigenvalue computations. The eigenvalues of $A$ are the roots of the characteristic polynomial $\det(A - \lambda I)$, and their sum equals the trace $\operatorname{tr}(A) = \sum_i A_{ii}$, written ii-> in einsum notation (a contraction of the two indices of a square matrix). This is why torch.einsum('ii->', A) computes both the trace of A and the sum of its eigenvalues.
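A quick numerical check of the trace–eigenvalue identity; the matrix is an arbitrary example:

```python
import numpy as np

# The trace computed via einsum equals the sum of the eigenvalues.
# The 4x4 matrix is an arbitrary example; complex eigenvalues of a
# real matrix come in conjugate pairs, so their imaginary parts cancel.
rng = np.random.default_rng(6)
A = rng.standard_normal((4, 4))

tr = np.einsum('ii->', A)
eig_sum = np.linalg.eigvals(A).sum()  # may carry a tiny imaginary part
assert np.isclose(tr, eig_sum.real)
assert abs(eig_sum.imag) < 1e-10
```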

Why einsum notation matters for ML code

Many numerical errors in deep learning come from incorrect axis ordering in tensor operations. Einsum makes the contraction structure explicit, eliminating transposition bugs. Consider computing the quadratic form $x^\top A x$ for a batch of vectors $x \in \mathbb{R}^{b \times d}$ and a shared matrix $A \in \mathbb{R}^{d \times d}$: torch.einsum('bi,ij,bj->b', x, A, x) contracts both the $i$ and $j$ indices, producing a scalar per batch element. The naive approach using torch.bmm would require broadcasting $A$ explicitly, introducing a potential shape mismatch.
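A NumPy sketch of the batched quadratic form, checked against a per-sample loop; the batch size and dimension are illustrative:

```python
import numpy as np

# Batched quadratic form x^T A x via einsum; b and d are arbitrary.
rng = np.random.default_rng(2)
b, d = 4, 3
x = rng.standard_normal((b, d))
A = rng.standard_normal((d, d))

q = np.einsum('bi,ij,bj->b', x, A, x)         # one scalar per batch element
q_loop = np.array([xi @ A @ xi for xi in x])  # explicit per-sample check
assert q.shape == (b,)
assert np.allclose(q, q_loop)
```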

This is particularly relevant for attention: the operation $QK^\top / \sqrt{d_k}$ in a multi-head attention block involves tensors of shape $(b, h, q, d_k)$ and $(b, h, k, d_k)$, and einsum('bhqd,bhkd->bhqk', Q, K) captures the contraction precisely.
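A sketch of that contraction in NumPy; all dimension sizes are arbitrary placeholders:

```python
import numpy as np

# Multi-head attention scores: contract over the head dimension d_k.
# (batch, heads, query_len, key_len, d_k) sizes are arbitrary here.
rng = np.random.default_rng(3)
b, h, q, k, dk = 2, 4, 5, 6, 8
Q = rng.standard_normal((b, h, q, dk))
K = rng.standard_normal((b, h, k, dk))

scores = np.einsum('bhqd,bhkd->bhqk', Q, K) / np.sqrt(dk)
assert scores.shape == (b, h, q, k)
# Same result as an explicit transpose-and-matmul:
assert np.allclose(scores, Q @ K.transpose(0, 1, 3, 2) / np.sqrt(dk))
```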

Tensor Decompositions

Definition

CP decomposition

The Canonical Polyadic (CP) decomposition writes an order-3 tensor as a sum of $R$ rank-1 tensors (outer products of vectors):

$$T_{ijk} \approx \sum_{r=1}^{R} a_{ir} \, b_{jr} \, c_{kr}$$

The minimal $R$ for exact decomposition is the tensor rank. Unlike matrix rank, tensor rank is NP-hard to compute, and the best rank-$R$ approximation may not exist (the set of tensors of rank at most $R$ is not closed).
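The component formula can be written as a single einsum over the factor matrices; a sketch with arbitrary small dimensions:

```python
import numpy as np

# Reconstruct an order-3 tensor from CP factors A, B, C of shapes
# (I, R), (J, R), (K, R); all dimensions here are arbitrary.
rng = np.random.default_rng(4)
I, J, K, R = 5, 6, 7, 3
A = rng.standard_normal((I, R))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))

# T_ijk = sum_r A_ir B_jr C_kr
T = np.einsum('ir,jr,kr->ijk', A, B, C)
# Same thing as an explicit sum of R rank-1 (outer-product) tensors:
T_sum = sum(np.einsum('i,j,k->ijk', A[:, r], B[:, r], C[:, r])
            for r in range(R))
assert T.shape == (I, J, K)
assert np.allclose(T, T_sum)
```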

Definition

Tucker decomposition

The Tucker decomposition writes an order-3 tensor as:

$$T \approx G \times_1 U_1 \times_2 U_2 \times_3 U_3$$

where $G$ is a smaller core tensor and the $U_i$ are factor matrices. This generalizes the SVD to higher orders. Unlike CP, Tucker allows interactions between components via the core tensor $G$.
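The three mode products collapse into one einsum over the core and the factor matrices; a sketch with arbitrary dimensions:

```python
import numpy as np

# Tucker reconstruction T = G x_1 U1 x_2 U2 x_3 U3 as one einsum.
# Core and factor sizes are arbitrary illustrative choices.
rng = np.random.default_rng(5)
G = rng.standard_normal((2, 3, 4))   # core tensor
U1 = rng.standard_normal((5, 2))     # mode-1 factor
U2 = rng.standard_normal((6, 3))     # mode-2 factor
U3 = rng.standard_normal((7, 4))     # mode-3 factor

# T_ijk = sum_{abc} G_abc (U1)_ia (U2)_jb (U3)_kc
T = np.einsum('abc,ia,jb,kc->ijk', G, U1, U2, U3)
assert T.shape == (5, 6, 7)
```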

The Computational Tensor: ML Frameworks

In PyTorch and NumPy, a "tensor" is a multidimensional array with:

  • Shape: tuple of dimensions, e.g., (32, 3, 224, 224) for a batch of images
  • dtype: data type (float32, float16, bfloat16, etc.)
  • device: where the data lives (cpu, cuda:0)
  • Strides: number of elements to skip in memory per index increment

This is the physicist/mathematician definition stripped of transformation rules. The ML tensor does not transform covariantly or contravariantly. It is an array with named dimensions and a compute device.

Broadcasting Rules

When operating on tensors of different shapes, broadcasting aligns dimensions from the right and expands size-1 dimensions:

  1. If tensors differ in number of dimensions, prepend size-1 dimensions to the smaller tensor.
  2. Dimensions of size 1 are stretched to match the other tensor.
  3. Dimensions must be equal or one of them must be 1.

Adding a vector of shape (3,) to a matrix of shape (4, 3) broadcasts the vector across all 4 rows. Broadcasting avoids explicit copies, saving memory.
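The rules can be seen directly in NumPy; a minimal sketch:

```python
import numpy as np

# Right-aligned broadcasting: (4, 3) + (3,) stretches the vector
# across all four rows without copying it.
M = np.arange(12).reshape(4, 3)
v = np.array([10, 20, 30])

out = M + v                          # v behaves as if it had shape (4, 3)
assert out.shape == (4, 3)
assert np.array_equal(out[0], M[0] + v)

# Size-1 dimensions stretch too: (4, 1) + (1, 3) -> (4, 3)
col = np.arange(4).reshape(4, 1)
row = np.arange(3).reshape(1, 3)
assert (col + row).shape == (4, 3)
```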

Main Theorems

Theorem

Universality of Tensor Product

Statement

For any bilinear map $f: V \times W \to Z$, there exists a unique linear map $\tilde{f}: V \otimes W \to Z$ such that $f(v, w) = \tilde{f}(v \otimes w)$ for all $v \in V, w \in W$.

Intuition

The tensor product is the "freest" bilinear construction. Every bilinear map factors through it. This is why tensors appear everywhere: they are the canonical way to combine vector spaces while preserving linearity in each argument separately.

Proof Sketch

Define $\tilde{f}$ on elementary tensors by $\tilde{f}(v \otimes w) = f(v, w)$ and extend by linearity. Bilinearity of $f$ ensures this is well-defined (independent of how a tensor is written as a sum of elementary tensors). Uniqueness follows because elementary tensors span $V \otimes W$.

Why It Matters

This universal property is why tensors appear in physics (stress tensors, electromagnetic field tensor) and ML (multilinear maps between feature spaces). The tensor product is the right algebraic structure for anything multilinear.

Failure Mode

The algebraic universal property holds for arbitrary vector spaces, including infinite-dimensional ones. What changes in the infinite-dimensional analytic setting is that the algebraic tensor product is typically not complete with respect to natural norms. In Hilbert and Banach spaces one introduces completed tensor products (Hilbert-Schmidt, projective $\hat\otimes_\pi$, injective $\hat\otimes_\varepsilon$, nuclear) when topology, convergence, or operator-norm structure is required. This matters for kernel methods in RKHS and operator-valued kernels but not for standard neural networks operating on finite-dimensional activations.

Canonical Examples

Example

Attention as tensor contraction

In self-attention, queries $Q$, keys $K$, and values $V$ are order-3 tensors of shape (batch, seq_len, d_model). The attention score computation is:

$$\text{scores}_{bqk} = \sum_d Q_{bqd} \, K_{bkd} / \sqrt{d_{\text{model}}}$$

In einsum: bqd,bkd->bqk. This is a batched contraction over the embedding dimension. The output is an order-3 tensor of attention logits.

Example

CP decomposition for weight compression

A weight tensor $W$ of shape $(d_1, d_2, d_3)$ with $d_1 d_2 d_3$ parameters can be approximated by a rank-$R$ CP decomposition with $R(d_1 + d_2 + d_3)$ parameters. For $d_i = 100$ and $R = 10$, this reduces the parameter count from $10^6$ to $3{,}000$, a $333\times$ compression.
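The arithmetic from the example, spelled out:

```python
# Parameter counts for the CP compression example above.
d1 = d2 = d3 = 100
R = 10

dense_params = d1 * d2 * d3      # full tensor: 100^3 = 1,000,000
cp_params = R * (d1 + d2 + d3)   # CP factors: 10 * 300 = 3,000
assert dense_params == 1_000_000
assert cp_params == 3_000
assert round(dense_params / cp_params) == 333
```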

Common Confusions

Watch Out

Tensor order vs tensor rank

The order (number of indices) of a tensor is not the same as its rank (minimum number of rank-1 terms in a CP decomposition). A $3 \times 3$ matrix has order 2 but can have rank anywhere from 0 to 3. The ML community often says "rank" when it means "order." Watch for this.

Watch Out

ML tensors vs math tensors

A PyTorch tensor does not transform under change of basis. It is a multidimensional array, full stop. The mathematical tensor carries transformation rules (covariant or contravariant). In ML, this distinction almost never matters because we work in a fixed computational basis. If you move to differential geometry or physics, it matters a great deal.

Watch Out

Broadcasting is not free

Broadcasting creates no physical copies, but it does expand the computation. Adding a shape (1, 1000) tensor to a shape (1000, 1000) tensor performs $10^6$ additions, not $10^3$. Memory is saved, compute is not.

Summary

  • A tensor is a multilinear map. In coordinates, it is a multidimensional array that transforms according to specific rules under change of basis.
  • Order = number of indices. Rank = minimum CP decomposition terms.
  • Contraction generalizes matrix multiplication and trace.
  • Einstein summation (einsum) is the lingua franca for tensor operations.
  • CP and Tucker decompositions compress tensors, with applications in model compression and data analysis.
  • In ML frameworks, "tensor" means multidimensional array with shape, dtype, and device. No transformation rules are tracked.

Exercises

ExerciseCore

Problem

Write the einsum string for computing the squared Frobenius norm of a matrix $A$, i.e., $\|A\|_F^2 = \sum_{ij} A_{ij}^2$.

ExerciseAdvanced

Problem

An order-3 tensor $T \in \mathbb{R}^{100 \times 100 \times 100}$ has $10^6$ entries. Suppose its CP rank is $R = 5$. How many parameters does the CP decomposition use? By what factor is this compressed compared to storing $T$ directly?

References

Canonical:

  • Kolda & Bader, "Tensor Decompositions and Applications," SIAM Review 51(3), 2009. Sections 2-3 for CP and Tucker decompositions
  • Lim, "Tensors and Hypermatrices," Chapter 15 in Handbook of Linear Algebra (2013)
  • Einstein, "Die Grundlage der allgemeinen Relativitätstheorie," Annalen der Physik 49, 1916. Original summation convention
  • Bourbaki, Algebra I, Chapter 3. Universal property of tensor products in full generality

Current:

  • Novikov et al., "Tensorizing Neural Networks," NeurIPS 2015
  • PyTorch documentation on torch.einsum
  • Goodfellow, Bengio, Courville, Deep Learning (2016), Chapter 2. Computational tensor operations for ML
  • Vaswani et al., "Attention Is All You Need," NeurIPS 2017. Einsum patterns in multi-head attention

Last reviewed: April 26, 2026

2