LLM Construction
Linear Layer: Shapes, Bias, and Memory
A systems-first note on the linear layer: tensor shapes, the bias term, forward pass, backward pass, parameter memory, FLOPs, and finite-difference gradient tests.
Prerequisites
Matrix operations and properties, matrix calculus, and feedforward networks and backpropagation.
Why This Matters
The linear layer is the first place where deep learning becomes a systems object instead of a drawing of neurons. It has tensor shapes, parameter memory, activation memory, FLOPs, and a backward pass that can be tested. If you cannot predict these quantities for a single layer, attention, transformer blocks, KV cache, and accelerator performance will feel like vocabulary rather than engineering.
It is also the first useful bridge from matrix calculus to attention, transformer architecture, and KV-cache inference. The shape and gradient analysis below uses the same conventions as matrix operations and properties and the chain rule from feedforward networks and backpropagation.
This note is Week 1 of the Deep Learning Systems From Scratch track. The rule is simple: every concept becomes code, a shape ledger, a memory estimate, a test, and review questions.
Core Object
Linear layer
For a mini-batch $X \in \mathbb{R}^{B \times d_{\text{in}}}$, weights $W \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$, and bias $b \in \mathbb{R}^{d_{\text{out}}}$, a linear layer computes
$$Y = XW + b,$$
where $Y \in \mathbb{R}^{B \times d_{\text{out}}}$.
The bias $b$ broadcasts across the batch dimension and gives each output coordinate an intercept. Without $b$, the map is purely linear and forced through the origin of feature space, which is unnecessarily restrictive.
Shape Ledger
| Quantity | Shape | Meaning |
|---|---|---|
| $X$ | $B \times d_{\text{in}}$ | mini-batch input |
| $W$ | $d_{\text{in}} \times d_{\text{out}}$ | trainable weights |
| $b$ | $d_{\text{out}}$ | trainable bias, broadcast over $B$ |
| $Y = XW + b$ | $B \times d_{\text{out}}$ | layer output |
| $dY = \partial L / \partial Y$ | $B \times d_{\text{out}}$ | upstream gradient |
| $dW$ | $d_{\text{in}} \times d_{\text{out}}$ | weight gradient |
| $db$ | $d_{\text{out}}$ | bias gradient |
| $dX$ | $B \times d_{\text{in}}$ | input gradient |
The safest shape habit is to write the ledger before writing code. If the ledger is wrong, the implementation will only be accidentally right.
Forward Pass
import numpy as np
def linear_forward(X: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    # X: (B, d_in), W: (d_in, d_out), b: (d_out,)
    assert X.ndim == 2
    assert W.ndim == 2
    assert b.ndim == 1
    assert X.shape[1] == W.shape[0]  # shared dimension d_in
    assert W.shape[1] == b.shape[0]  # output dimension d_out
    return X @ W + b                 # (B, d_out); b broadcasts across rows
The matrix product contracts the shared dimension: $(B \times d_{\text{in}}) \cdot (d_{\text{in}} \times d_{\text{out}}) \to (B \times d_{\text{out}})$, or element-wise $Y_{ij} = \sum_{k=1}^{d_{\text{in}}} X_{ik} W_{kj} + b_j$.
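A minimal smoke test of linear_forward, with sizes chosen purely for illustration:
B, d_in, d_out = 4, 3, 5
X = np.random.randn(B, d_in)
W = np.random.randn(d_in, d_out)
b = np.random.randn(d_out)
Y = linear_forward(X, W, b)
assert Y.shape == (B, d_out)  # (4, 3) @ (3, 5) + (5,) -> (4, 5)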
Backward Pass
Linear Layer Backward Pass
Statement
For the linear layer $Y = XW + b$ with upstream gradient $dY = \partial L / \partial Y$, the gradients are
$$dW = X^\top\, dY, \qquad db = \sum_{i=1}^{B} dY_{i,:},$$
and
$$dX = dY\, W^\top.$$
Intuition
The weight gradient pairs each input coordinate with each output-gradient coordinate. The bias gradient sums over the batch because the same bias is used for every row. The input gradient sends the upstream signal backward through the transpose of the weights.
Why It Matters
These three equations are the smallest useful backpropagation testbed. If they are wrong, deeper networks silently learn the wrong update.
def linear_backward(X: np.ndarray, W: np.ndarray, dY: np.ndarray):
    # X: (B, d_in), W: (d_in, d_out), dY = dL/dY: (B, d_out)
    assert X.ndim == 2 and W.ndim == 2 and dY.ndim == 2
    assert X.shape[0] == dY.shape[0]  # same batch size B
    assert W.shape[1] == dY.shape[1]  # same output dimension d_out
    dW = X.T @ dY        # (d_in, d_out): pairs each input coordinate with each output gradient
    db = dY.sum(axis=0)  # (d_out,): sum over the batch dimension
    dX = dY @ W.T        # (B, d_in): send the upstream signal back through W^T
    return dX, dW, db
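And the matching shape check against the ledger, again with arbitrary small sizes:
B, d_in, d_out = 4, 3, 5
X = np.random.randn(B, d_in)
W = np.random.randn(d_in, d_out)
dY = np.ones((B, d_out))  # stand-in for the upstream gradient dL/dY
dX, dW, db = linear_backward(X, W, dY)
assert dX.shape == (B, d_in)
assert dW.shape == (d_in, d_out)
assert db.shape == (d_out,)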
Parameter Memory and FLOPs
The parameter count is
$$d_{\text{in}}\, d_{\text{out}} + d_{\text{out}}.$$
For fp32 parameters, parameter memory is roughly
$$4\,(d_{\text{in}}\, d_{\text{out}} + d_{\text{out}}) \text{ bytes}.$$
The forward matmul costs approximately
$$2\, B\, d_{\text{in}}\, d_{\text{out}} \text{ FLOPs},$$
counting one multiply and one add for each input-output pair. A rough lower bound on forward memory traffic is
$$B\, d_{\text{in}} + d_{\text{in}}\, d_{\text{out}} + d_{\text{out}} + B\, d_{\text{out}}$$
elements read or written, times the bytes per element.
This estimate is intentionally rough. Real hardware uses caches, tiling, kernel fusion, vectorization, and mixed precision. The point of the ledger is to know the first-order cost before profiling.
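One way to make this ledger executable is a small helper; the name linear_cost and the default fp32 element size below are choices made for this note, not anything from a library:
def linear_cost(B, d_in, d_out, bytes_per_element=4):
    # First-order cost model for Y = X @ W + b; ignores caches, fusion, and mixed precision.
    params = d_in * d_out + d_out
    param_bytes = params * bytes_per_element
    forward_flops = 2 * B * d_in * d_out
    # Lower bound on traffic: read X, W, b; write Y.
    traffic_elements = B * d_in + d_in * d_out + d_out + B * d_out
    return {
        "params": params,
        "param_bytes": param_bytes,
        "forward_flops": forward_flops,
        "traffic_bytes": traffic_elements * bytes_per_element,
    }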
Finite-Difference Test
A finite-difference test checks whether the analytical gradient predicts a small perturbation in the loss:
def finite_difference_param(loss_fn, param, index, eps=1e-5):
    # Central difference: perturb one entry of param in place and re-run the loss.
    old = param[index]
    param[index] = old + eps
    plus = loss_fn()
    param[index] = old - eps
    minus = loss_fn()
    param[index] = old  # restore the original value
    return (plus - minus) / (2 * eps)
Use this against selected entries of $W$, $b$, and $X$. It is slower than backprop, but it catches sign errors, missing batch sums, and transpose bugs.
The relative error
$$\frac{\lvert g_{\text{analytic}} - g_{\text{numeric}} \rvert}{\max\!\left(\lvert g_{\text{analytic}} \rvert,\ \lvert g_{\text{numeric}} \rvert\right)}$$
is the right diagnostic. With $\varepsilon \approx 10^{-5}$ and double-precision floats, a relative error below roughly $10^{-7}$ usually means the analytical gradient is correct; $10^{-2}$ or worse signals a bug.
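Putting the pieces together, a minimal gradient check on one entry of $W$, reusing the functions defined above; the sum-of-outputs loss is an arbitrary choice made here, and any scalar loss works:
rng = np.random.default_rng(0)
B, d_in, d_out = 4, 3, 5
X = rng.standard_normal((B, d_in))
W = rng.standard_normal((d_in, d_out))
b = rng.standard_normal(d_out)

def loss_fn():
    # Scalar loss L = sum(Y); its gradient with respect to Y is all ones.
    return linear_forward(X, W, b).sum()

dY = np.ones((B, d_out))               # dL/dY for this loss
_, dW, _ = linear_backward(X, W, dY)

idx = (1, 2)                           # an arbitrary entry of W
numeric = finite_difference_param(loss_fn, W, idx)
analytic = dW[idx]
rel_err = abs(analytic - numeric) / max(abs(analytic), abs(numeric), 1e-12)
assert rel_err < 1e-6, rel_err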
A Concrete Example
Linear layer with batch 32, in 128, out 64
Pick $B = 32$, $d_{\text{in}} = 128$, $d_{\text{out}} = 64$. Then:
- Parameter count: $128 \times 64 + 64 = 8{,}256$.
- Parameter memory (fp32): $4 \times 8{,}256 = 33{,}024$ bytes ≈ 32 KiB.
- Forward FLOPs: $2 \times 32 \times 128 \times 64 = 524{,}288$.
- Output activation $Y$: $32 \times 64 = 2{,}048$ floats $= 8{,}192$ bytes.
- Cached input $X$ for backward: $32 \times 128 = 4{,}096$ floats $= 16{,}384$ bytes.
Backward arithmetic dominates the forward pass: computing $dW = X^\top dY$ costs another $524{,}288$ FLOPs, and $dX = dY\, W^\top$ adds a third matmul of the same size. A forward-backward pass on this single layer is therefore roughly $3 \times 524{,}288 \approx 1.6$ million FLOPs.
Scale this layer to, say, $d_{\text{in}} = 4096$ and $d_{\text{out}} = 11{,}008$ (one projection of a small Llama-style MLP) and the parameter memory jumps to roughly 180 MB; the output activation at sequence length 2048 jumps to roughly 90 MB per layer. The arithmetic intensity (FLOPs per byte read) shifts from memory-bound toward compute-bound on most accelerators; this is the tile-and-fuse regime where naïve matmul kernels lose to tuned BLAS.
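Plugging the small example into the linear_cost sketch defined earlier (it is this note's helper, not a library function) reproduces the ledger:
stats = linear_cost(B=32, d_in=128, d_out=64)
assert stats["params"] == 8256           # 128 * 64 + 64
assert stats["param_bytes"] == 33024     # ~32 KiB in fp32
assert stats["forward_flops"] == 524288  # 2 * 32 * 128 * 64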
Connections
The linear layer is the building block for almost every parametric model in deep learning: an MLP is a stack of linear layers separated by activations; attention's $Q$, $K$, $V$ projections are three linear layers applied to the same input; the LM head that maps hidden states to logits is one final linear layer of shape $d_{\text{model}} \times V$, where $V$ is the vocabulary size. Understanding the shape ledger and FLOP count of one linear layer transfers directly to estimating the cost of a transformer block. The dominant cost in a transformer is not attention but the MLP linears in each block; this is why mixture-of-experts replaces them with sparser conditional variants.
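A rough sketch of that reuse with linear_forward from above; the sizes and the bias-free convention here are illustrative choices for this note, not taken from any specific model:
d_model, vocab = 64, 1000              # illustrative sizes only
T = 16                                 # tokens in one sequence
H = np.random.randn(T, d_model)        # hidden states, one row per token
zero = np.zeros(d_model)               # bias-free projections

# Q, K, V projections: three linear layers applied to the same input H.
Wq = np.random.randn(d_model, d_model)
Wk = np.random.randn(d_model, d_model)
Wv = np.random.randn(d_model, d_model)
Q = linear_forward(H, Wq, zero)        # (T, d_model)
K = linear_forward(H, Wk, zero)
V = linear_forward(H, Wv, zero)

# LM head: one final linear layer from d_model to the vocabulary size.
W_head = np.random.randn(d_model, vocab)
logits = linear_forward(H, W_head, np.zeros(vocab))
assert logits.shape == (T, vocab)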
The bias term plays a quieter role than the weights but matters in two ways. First, it lets the affine map shift the output away from the origin without distorting the linear part. Second, in normalized-input regimes (see batch normalization) the bias of the next linear layer absorbs the shift that the normalization's affine parameters produce. Some recent architectures (e.g., LLaMA) drop the bias on linear layers entirely, treating it as redundant in practice once each linear is preceded by a normalization layer with its own learned scale.
Common Bugs
Inconsistent weight orientation
A common bug is treating $W$ as $d_{\text{out}} \times d_{\text{in}}$ in one place (matching the math-textbook convention $y = Wx + b$) and as $d_{\text{in}} \times d_{\text{out}}$ in another (matching $Y = XW + b$ with row-major batches). Pick one, write it in the shape ledger, and never silently transpose. PyTorch's torch.nn.Linear stores its weight internally as $(d_{\text{out}}, d_{\text{in}})$ but applies $y = x W^\top + b$, so the effective math matches $Y = XW + b$ with $W \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$.
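A small NumPy check of the two orientations, using this note's $(d_{\text{in}}, d_{\text{out}})$ convention; W_stored below only mimics how a $(d_{\text{out}}, d_{\text{in}})$ weight would be applied:
B, d_in, d_out = 4, 3, 5
X = np.random.randn(B, d_in)
W = np.random.randn(d_in, d_out)  # this note's convention: (d_in, d_out)
W_stored = W.T                    # the same weight stored as (d_out, d_in)

Y1 = X @ W                        # Y = X W
Y2 = X @ W_stored.T               # y = x A^T, with A of shape (d_out, d_in)
assert np.allclose(Y1, Y2)        # same math, different storage orientation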
Bias gradient summed over the wrong axis
The bias has shape $(d_{\text{out}},)$ and is broadcast across the batch. The gradient must sum over axis $0$ (batch), not axis $1$ (features). Summing over axis $1$ gives a vector of shape $(B,)$ and silently breaks shape compatibility with $b$. The same trap appears in any broadcast-then-reduce pattern (e.g., layer norm's affine parameters).
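The trap in a few lines, with an arbitrary upstream gradient:
B, d_out = 4, 5
dY = np.random.randn(B, d_out)
db_right = dY.sum(axis=0)  # shape (d_out,), matches b
db_wrong = dY.sum(axis=1)  # shape (B,), does not match b
assert db_right.shape == (d_out,)
assert db_wrong.shape == (B,)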
Counting parameters and forgetting activation memory
A linear layer with $d_{\text{in}} = d_{\text{out}} = 4096$ has about $16.8$ million parameters (about 67 MB in fp32). For a batch of, say, 16 sequences of length 2048, the activation $Y$ alone is $16 \times 2048 \times 4096 \approx 134$ million floats (≈ 537 MB). Activation memory grows with batch and sequence length; parameter memory does not. Optimization-time memory budgets must include both, plus the cached $X$ for the backward pass.
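The same arithmetic in code; the 16 × 2048 batch/sequence split is one illustrative choice among many that give the same token count:
d = 4096
batch, seq = 16, 2048
param_bytes = d * d * 4                 # ~67 MB of fp32 weights
activation_bytes = batch * seq * d * 4  # ~537 MB for Y alone
print(param_bytes / 1e6, activation_bytes / 1e6)  # ≈ 67.1 and ≈ 536.9 (MB)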
Review Questions
- Why does the bias $b$ have shape $(d_{\text{out}},)$?
- Why does $db$ sum over the batch dimension?
- What are the parameter counts for $W$ and $b$?
- What is the output shape of $XW + b$?
- How many approximate FLOPs does the matmul cost?
- What memory must be read and written?
- Why is this operation usually better suited to batching?
- How would a finite-difference test catch a wrong gradient?
References
Canonical:
- Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press, Chapters 6-8.
- Petersen, K. B. and Pedersen, M. S. (2012). "The Matrix Cookbook."
- Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). "Learning representations by back-propagating errors." Nature.
Current:
- PyTorch documentation: torch.nn.Linear.
- NumPy documentation: numpy.matmul.
- Stanford CS231n notes. "Backpropagation, Intuitions."
Exercises
Problem
Let $X$ have shape $(B, d_{\text{in}})$, $W$ have shape $(d_{\text{in}}, d_{\text{out}})$, and $b$ have shape $(d_{\text{out}},)$. What is the output shape of $XW + b$, and how many trainable parameters does the layer have?
Problem
Explain why $db = \sum_{i=1}^{B} dY_{i,:}$ instead of $db = dY$.
Last reviewed: May 1, 2026