LLM Construction
Linear Layer: Shapes, Bias, and Memory
A systems-first note on the linear layer: tensor shapes, the bias term, forward pass, backward pass, parameter memory, FLOPs, and finite-difference gradient tests.
Prerequisites
Matrix operations and properties, matrix calculus, and feedforward networks and backpropagation.
Why This Matters
The linear layer is the first place where deep learning becomes a systems object instead of a drawing of neurons. It has tensor shapes, parameter memory, activation memory, FLOPs, and a backward pass that can be tested. If you cannot predict these quantities for a single layer, attention, transformer blocks, KV cache, and accelerator performance will feel like vocabulary rather than engineering.
It is also the first useful bridge from matrix calculus to attention, transformer architecture, and KV-cache inference. The shape and gradient analysis below uses the same conventions as matrix operations and properties and the chain rule from feedforward networks and backpropagation.
This note is Week 1 of the Deep Learning Systems From Scratch track. The rule is simple: every concept becomes code, a shape ledger, a memory estimate, a test, and review questions.
Core Object
Linear layer
For a mini-batch $X \in \mathbb{R}^{B \times d_{\text{in}}}$, weights $W \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$, and bias $b \in \mathbb{R}^{d_{\text{out}}}$, a linear layer computes
$$Y = XW + b,$$
where $Y \in \mathbb{R}^{B \times d_{\text{out}}}$.
The bias $b$ broadcasts across the batch dimension and gives each output coordinate an intercept. Without $b$, the map is purely linear and forced through the origin of feature space, which is unnecessarily restrictive.
Shape Ledger
| Quantity | Shape | Meaning |
|---|---|---|
| $X$ | $B \times d_{\text{in}}$ | mini-batch input |
| $W$ | $d_{\text{in}} \times d_{\text{out}}$ | trainable weights |
| $b$ | $d_{\text{out}}$ | trainable bias, broadcast over $B$ |
| $Y = XW + b$ | $B \times d_{\text{out}}$ | layer output |
| $dY = \partial L / \partial Y$ | $B \times d_{\text{out}}$ | upstream gradient |
| $dW$ | $d_{\text{in}} \times d_{\text{out}}$ | weight gradient |
| $db$ | $d_{\text{out}}$ | bias gradient |
| $dX$ | $B \times d_{\text{in}}$ | input gradient |
The safest shape habit is to write the ledger before writing code. If the ledger is wrong, the implementation will only be accidentally right.
Forward Pass
import numpy as np
def linear_forward(X: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    # X: (B, d_in), W: (d_in, d_out), b: (d_out,)
    assert X.ndim == 2
    assert W.ndim == 2
    assert b.ndim == 1
    assert X.shape[1] == W.shape[0]  # shared dimension d_in
    assert W.shape[1] == b.shape[0]  # output dimension d_out
    return X @ W + b                 # (B, d_out); b broadcasts across rows
The matrix product contracts the shared dimension: $(B \times d_{\text{in}}) \cdot (d_{\text{in}} \times d_{\text{out}}) \to (B \times d_{\text{out}})$, or element-wise $Y_{ij} = \sum_{k=1}^{d_{\text{in}}} X_{ik} W_{kj} + b_j$.
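A minimal smoke test of linear_forward, with sizes chosen purely for illustration:
B, d_in, d_out = 4, 3, 5
X = np.random.randn(B, d_in)
W = np.random.randn(d_in, d_out)
b = np.random.randn(d_out)
Y = linear_forward(X, W, b)
assert Y.shape == (B, d_out)  # (4, 3) @ (3, 5) + (5,) -> (4, 5)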
Backward Pass
Linear Layer Backward Pass
Statement
For the linear layer $Y = XW + b$ with upstream gradient $dY = \partial L / \partial Y$, the gradients are
$$dW = X^\top\, dY, \qquad db = \sum_{i=1}^{B} dY_{i,:},$$
and
$$dX = dY\, W^\top.$$
Intuition
The weight gradient pairs each input coordinate with each output-gradient coordinate. The bias gradient sums over the batch because the same bias is used for every row. The input gradient sends the upstream signal backward through the transpose of the weights.
Why It Matters
These three equations are the smallest useful backpropagation testbed. If they are wrong, deeper networks silently learn the wrong update.
def linear_backward(X: np.ndarray, W: np.ndarray, dY: np.ndarray):
    # X: (B, d_in), W: (d_in, d_out), dY = dL/dY: (B, d_out)
    assert X.ndim == 2 and W.ndim == 2 and dY.ndim == 2
    assert X.shape[0] == dY.shape[0]  # same batch size B
    assert W.shape[1] == dY.shape[1]  # same output dimension d_out
    dW = X.T @ dY        # (d_in, d_out): pairs each input coordinate with each output gradient
    db = dY.sum(axis=0)  # (d_out,): sum over the batch dimension
    dX = dY @ W.T        # (B, d_in): send the upstream signal back through W^T
    return dX, dW, db
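And the matching shape check against the ledger, again with arbitrary small sizes:
B, d_in, d_out = 4, 3, 5
X = np.random.randn(B, d_in)
W = np.random.randn(d_in, d_out)
dY = np.ones((B, d_out))  # stand-in for the upstream gradient dL/dY
dX, dW, db = linear_backward(X, W, dY)
assert dX.shape == (B, d_in)
assert dW.shape == (d_in, d_out)
assert db.shape == (d_out,)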
Parameter Memory and FLOPs
The parameter count is
$$d_{\text{in}}\, d_{\text{out}} + d_{\text{out}}.$$
For fp32 parameters, parameter memory is roughly
$$4\,(d_{\text{in}}\, d_{\text{out}} + d_{\text{out}}) \text{ bytes}.$$
The forward matmul costs approximately
$$2\, B\, d_{\text{in}}\, d_{\text{out}} \text{ FLOPs},$$
counting one multiply and one add for each input-output pair. A rough lower bound on forward memory traffic is
$$B\, d_{\text{in}} + d_{\text{in}}\, d_{\text{out}} + d_{\text{out}} + B\, d_{\text{out}}$$
elements read or written, times the bytes per element.
This estimate is intentionally rough. Real hardware uses caches, tiling, kernel fusion, vectorization, and mixed precision. The point of the ledger is to know the first-order cost before profiling.
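One way to make this ledger executable is a small helper; the name linear_cost and the default fp32 element size below are choices made for this note, not anything from a library:
def linear_cost(B, d_in, d_out, bytes_per_element=4):
    # First-order cost model for Y = X @ W + b; ignores caches, fusion, and mixed precision.
    params = d_in * d_out + d_out
    param_bytes = params * bytes_per_element
    forward_flops = 2 * B * d_in * d_out
    # Lower bound on traffic: read X, W, b; write Y.
    traffic_elements = B * d_in + d_in * d_out + d_out + B * d_out
    return {
        "params": params,
        "param_bytes": param_bytes,
        "forward_flops": forward_flops,
        "traffic_bytes": traffic_elements * bytes_per_element,
    }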
Finite-Difference Test
A finite-difference test checks whether the analytical gradient predicts a small perturbation in the loss:
def finite_difference_param(loss_fn, param, index, eps=1e-5):
    # Central difference: perturb one entry of param in place and re-run the loss.
    old = param[index]
    param[index] = old + eps
    plus = loss_fn()
    param[index] = old - eps
    minus = loss_fn()
    param[index] = old  # restore the original value
    return (plus - minus) / (2 * eps)
Use this against selected entries of $W$, $b$, and $X$. It is slower than backprop, but it catches sign errors, missing batch sums, and transpose bugs.
The relative error
$$\frac{\lvert g_{\text{analytic}} - g_{\text{numeric}} \rvert}{\max\!\left(\lvert g_{\text{analytic}} \rvert,\ \lvert g_{\text{numeric}} \rvert\right)}$$
is the right diagnostic. With $\varepsilon \approx 10^{-5}$ and double-precision floats, a relative error below roughly $10^{-7}$ usually means the analytical gradient is correct; $10^{-2}$ or worse signals a bug.
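Putting the pieces together, a minimal gradient check on one entry of $W$, reusing the functions defined above; the sum-of-outputs loss is an arbitrary choice made here, and any scalar loss works:
rng = np.random.default_rng(0)
B, d_in, d_out = 4, 3, 5
X = rng.standard_normal((B, d_in))
W = rng.standard_normal((d_in, d_out))
b = rng.standard_normal(d_out)

def loss_fn():
    # Scalar loss L = sum(Y); its gradient with respect to Y is all ones.
    return linear_forward(X, W, b).sum()

dY = np.ones((B, d_out))               # dL/dY for this loss
_, dW, _ = linear_backward(X, W, dY)

idx = (1, 2)                           # an arbitrary entry of W
numeric = finite_difference_param(loss_fn, W, idx)
analytic = dW[idx]
rel_err = abs(analytic - numeric) / max(abs(analytic), abs(numeric), 1e-12)
assert rel_err < 1e-6, rel_err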
A Concrete Example
Linear layer with batch 32, in 128, out 64
Pick $B = 32$, $d_{\text{in}} = 128$, $d_{\text{out}} = 64$. Then:
- Parameter count: $128 \times 64 + 64 = 8{,}256$.
- Parameter memory (fp32): $4 \times 8{,}256 = 33{,}024$ bytes ≈ 32 KiB.
- Forward FLOPs: $2 \times 32 \times 128 \times 64 = 524{,}288$.
- Output activation $Y$: $32 \times 64 = 2{,}048$ floats $= 8{,}192$ bytes.
- Cached input $X$ for backward: $32 \times 128 = 4{,}096$ floats $= 16{,}384$ bytes.
Backward arithmetic dominates the forward pass: computing $dW = X^\top dY$ costs another $524{,}288$ FLOPs, and $dX = dY\, W^\top$ adds a third matmul of the same size. A forward-backward pass on this single layer is therefore roughly $3 \times 524{,}288 \approx 1.6$ million FLOPs.
Scale this layer to, say, $d_{\text{in}} = 4096$ and $d_{\text{out}} = 11{,}008$ (one projection of a small Llama-style MLP) and the parameter memory jumps to roughly 180 MB; the output activation at sequence length 2048 jumps to roughly 90 MB per layer. The arithmetic intensity (FLOPs per byte read) shifts from memory-bound toward compute-bound on most accelerators; this is the tile-and-fuse regime where naïve matmul kernels lose to tuned BLAS.
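Plugging the small example into the linear_cost sketch defined earlier (it is this note's helper, not a library function) reproduces the ledger:
stats = linear_cost(B=32, d_in=128, d_out=64)
assert stats["params"] == 8256           # 128 * 64 + 64
assert stats["param_bytes"] == 33024     # ~32 KiB in fp32
assert stats["forward_flops"] == 524288  # 2 * 32 * 128 * 64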
Connections
The linear layer is the building block for almost every parametric model in deep learning: an MLP is a stack of linear layers separated by activations; attention's $Q$, $K$, $V$ projections are three linear layers applied to the same input; the LM head that maps hidden states to logits is one final linear layer of shape $d_{\text{model}} \times V$, where $V$ is the vocabulary size. Understanding the shape ledger and FLOP count of one linear layer transfers directly to estimating the cost of a transformer block. The dominant cost in a transformer is not attention but the MLP linears in each block; this is why mixture-of-experts replaces them with sparser conditional variants.
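A rough sketch of that reuse with linear_forward from above; the sizes and the bias-free convention here are illustrative choices for this note, not taken from any specific model:
d_model, vocab = 64, 1000              # illustrative sizes only
T = 16                                 # tokens in one sequence
H = np.random.randn(T, d_model)        # hidden states, one row per token
zero = np.zeros(d_model)               # bias-free projections

# Q, K, V projections: three linear layers applied to the same input H.
Wq = np.random.randn(d_model, d_model)
Wk = np.random.randn(d_model, d_model)
Wv = np.random.randn(d_model, d_model)
Q = linear_forward(H, Wq, zero)        # (T, d_model)
K = linear_forward(H, Wk, zero)
V = linear_forward(H, Wv, zero)

# LM head: one final linear layer from d_model to the vocabulary size.
W_head = np.random.randn(d_model, vocab)
logits = linear_forward(H, W_head, np.zeros(vocab))
assert logits.shape == (T, vocab)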
The bias term plays a quieter role than the weights but matters in two ways. First, it lets the affine map shift the output away from the origin without distorting the linear part. Second, in normalized-input regimes (see batch normalization) the bias of the next linear layer absorbs the shift that the normalization's affine parameters produce. Some recent architectures (e.g., LLaMA) drop the bias on linear layers entirely, treating it as redundant in practice once each linear is preceded by a normalization layer with its own learned scale.
Common Bugs
Inconsistent weight orientation
A common bug is treating $W$ as $d_{\text{out}} \times d_{\text{in}}$ in one place (matching the math-textbook convention $y = Wx + b$) and as $d_{\text{in}} \times d_{\text{out}}$ in another (matching $Y = XW + b$ with row-major batches). Pick one, write it in the shape ledger, and never silently transpose. PyTorch's torch.nn.Linear stores its weight internally as $(d_{\text{out}}, d_{\text{in}})$ but applies $y = x W^\top + b$, so the effective math matches $Y = XW + b$ with $W \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$.
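A small NumPy check of the two orientations, using this note's $(d_{\text{in}}, d_{\text{out}})$ convention; W_stored below only mimics how a $(d_{\text{out}}, d_{\text{in}})$ weight would be applied:
B, d_in, d_out = 4, 3, 5
X = np.random.randn(B, d_in)
W = np.random.randn(d_in, d_out)  # this note's convention: (d_in, d_out)
W_stored = W.T                    # the same weight stored as (d_out, d_in)

Y1 = X @ W                        # Y = X W
Y2 = X @ W_stored.T               # y = x A^T, with A of shape (d_out, d_in)
assert np.allclose(Y1, Y2)        # same math, different storage orientation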
Bias gradient summed over the wrong axis
The bias has shape $(d_{\text{out}},)$ and is broadcast across the batch. The gradient must sum over axis $0$ (batch), not axis $1$ (features). Summing over axis $1$ gives a vector of shape $(B,)$ and silently breaks shape compatibility with $b$. The same trap appears in any broadcast-then-reduce pattern (e.g., layer norm's affine parameters).
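The trap in a few lines, with an arbitrary upstream gradient:
B, d_out = 4, 5
dY = np.random.randn(B, d_out)
db_right = dY.sum(axis=0)  # shape (d_out,), matches b
db_wrong = dY.sum(axis=1)  # shape (B,), does not match b
assert db_right.shape == (d_out,)
assert db_wrong.shape == (B,)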
Counting parameters and forgetting activation memory
A linear layer with $d_{\text{in}} = d_{\text{out}} = 4096$ has about $16.8$ million parameters (about 67 MB in fp32). For a batch of, say, 16 sequences of length 2048, the activation $Y$ alone is $16 \times 2048 \times 4096 \approx 134$ million floats (≈ 537 MB). Activation memory grows with batch and sequence length; parameter memory does not. Optimization-time memory budgets must include both, plus the cached $X$ for the backward pass.
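The same arithmetic in code; the 16 × 2048 batch/sequence split is one illustrative choice among many that give the same token count:
d = 4096
batch, seq = 16, 2048
param_bytes = d * d * 4                 # ~67 MB of fp32 weights
activation_bytes = batch * seq * d * 4  # ~537 MB for Y alone
print(param_bytes / 1e6, activation_bytes / 1e6)  # ≈ 67.1 and ≈ 536.9 (MB)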
Review Questions
- Why does the bias $b$ have shape $(d_{\text{out}},)$?
- Why does $db$ sum over the batch dimension?
- What are the parameter counts for $W$ and $b$?
- What is the output shape of $XW + b$?
- How many approximate FLOPs does the matmul cost?
- What memory must be read and written?
- Why is this operation usually better suited to batching?
- How would a finite-difference test catch a wrong gradient?
References
Canonical:
- Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press, Chapters 6-8.
- Petersen, K. B. and Pedersen, M. S. (2012). "The Matrix Cookbook."
- Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). "Learning representations by back-propagating errors." Nature.
Current:
- PyTorch documentation: torch.nn.Linear.
- NumPy documentation: numpy.matmul.
- Stanford CS231n notes. "Backpropagation, Intuitions."
Exercises
Problem
Let $X$ have shape $(B, d_{\text{in}})$, $W$ have shape $(d_{\text{in}}, d_{\text{out}})$, and $b$ have shape $(d_{\text{out}},)$. What is the output shape of $XW + b$, and how many trainable parameters does the layer have?
Problem
Explain why $db = \sum_{i=1}^{B} dY_{i,:}$ instead of $db = dY$.
Last reviewed: May 1, 2026