Skip to main content

CPU and Machine Model · 40 min

Pipelining Basics

Why overlap fetch, decode, execute and writeback across consecutive instructions. The 5-stage pipeline, hazards, and bypassing.

Why This Matters

A non-pipelined implementation that spends 5 cycles per instruction has CPI =5=5. A 5-stage pipeline can complete about one instruction per cycle after fill, so the ideal steady-state CPI is 11. The single instruction latency is still 5 stages; the gain is throughput.

The price is hazard handling. A RAW dependence between two adjacent ALU instructions needs a bypass path, a load followed by its consumer still needs one bubble, and a taken branch under predict-not-taken squashes fetched wrong-path instructions.

Core Definitions

Definition

Pipeline stage

A pipeline stage is a slice of instruction processing separated from the next slice by pipeline registers. In the canonical 5-stage MIPS pipeline the stages are IF, ID, EX, MEM, and WB.

Definition

Throughput and latency

Latency is the time for one instruction to travel from fetch to retirement. Throughput is the rate at which completed instructions leave the pipeline. Pipelining mainly raises throughput, not single-instruction latency.

Definition

CPI

Cycles per instruction, written CPI, is total cycles divided by completed instructions. Ideal scalar pipelining gives CPI near 11 after fill and drain. Stalls and flushes add penalty cycles.

Definition

Hazard

A hazard is a condition that prevents the next instruction from executing in its nominal pipeline stage. Structural hazards come from resource conflicts, data hazards from dependences, and control hazards from changes to the program counter.

Throughput Versus Latency

Consider five work units for every instruction, each taking one cycle. Without pipelining, instruction 1 uses all hardware for IF, then ID, then EX, then MEM, then WB. Instruction 2 starts only after instruction 1 writes back.

For four instructions, the non-pipelined schedule is 20 cycles.

CycleInst 1Inst 2Inst 3Inst 4
1IF
2ID
3EX
4MEM
5WB
6IF
10WB
15WB
20WB

With pipelining, each instruction still takes 5 stages, but different instructions occupy different stages in the same cycle.

CycleInst 1Inst 2Inst 3Inst 4
1IF
2IDIF
3EXIDIF
4MEMEXIDIF
5WBMEMEXID
6WBMEMEX
7WBMEM
8WB

For nn instructions and k=5k=5 stages, the ideal cycle count is k+n1k+n-1. For n=100n=100, non-pipelined execution takes 500500 cycles, while ideal pipelined execution takes 104104 cycles. The observed speedup is 500/1044.81500/104 \approx 4.81, below 5 because of fill and drain.

For large nn, the ratio tends to 5. That does not mean every program runs 5 times faster. Stage imbalance, hazards, cache misses, and branch penalties lower realized throughput.

The Canonical 5-Stage MIPS Pipeline

The teaching pipeline from MIPS has five named stages.

StageMain workState read or written
IFFetch instruction at PC, compute PC + 4Instruction memory, PC
IDDecode opcode, read register fileRegister file
EXALU operation or address calculationALU
MEMData memory access for load or storeData memory
WBWrite register resultRegister file

A simple sequence shows how instructions overlap.

# MIPS-like assembly
add  $t0, $t1, $t2     # t0 = t1 + t2
sub  $t3, $t4, $t5     # t3 = t4 - t5
and  $t6, $t0, $t3     # t6 = t0 & t3
lw   $s0, 0($sp)       # s0 = memory[sp]

The static instruction encoding is still a stream of 32-bit words. Pipelining changes when fields are consumed, not what the instruction means.

InstructionExample word bytes in little-endian memory
add $t0,$t1,$t220 40 2a 01
sub $t3,$t4,$t522 58 8d 01
and $t6,$t0,$t324 70 0b 01
lw $s0,0($sp)00 00 b0 8f

IF reads 4 bytes and forms the 32-bit instruction word. ID reads register specifiers. EX uses the ALU. MEM touches data memory only for lw and sw. WB writes a register for ALU instructions and loads.

A common register convention is that the register file writes in the first half of a cycle and reads in the second half. That small timing detail removes some WB to ID hazards without a separate stall.

Structural Hazards

A structural hazard occurs when two stages need the same hardware in the same cycle. The classic example is a single memory port used for both instruction fetch and data access.

lw   $t0, 0($s0)
add  $t1, $t2, $t3
sub  $t4, $t5, $t6
or   $t7, $s1, $s2

In the ideal 5-stage schedule, the load is in MEM during cycle 4. At the same time, the fourth instruction wants IF.

Cyclelwaddsubor
1IF
2IDIF
3EXIDIF
4MEMEXIDIF
5WBMEMEXID

If there is only one memory port, cycle 4 contains a conflict. The machine must either stall IF or provide separate instruction and data memories or caches. The usual textbook MIPS pipeline assumes separate instruction and data memories, which avoids this structural hazard.

This is why the block diagram matters. Pipelining is more than just drawing diagonal stage names; the resources must support simultaneous stage activity.

Data Hazards and Forwarding

A RAW dependence occurs when an instruction reads a register that an earlier instruction writes. The most common case is an ALU result consumed by the next instruction.

add  $t0, $t1, $t2     # produces t0 at end of EX
sub  $t3, $t0, $t4     # needs t0 as ALU input in EX

Without forwarding, sub reads $t0 in ID before add writes it in WB. The pipeline would stall until writeback. That would waste cycles even though the ALU result exists earlier.

Forwarding, also called bypassing, routes a result from a later pipeline register directly to an ALU input. For the pair above, add computes $t0 in EX during cycle 3. The value sits in the EX/MEM pipeline register during cycle 4. The sub instruction reaches EX in cycle 4. A mux selects the forwarded EX/MEM value instead of the stale register-file value.

CycleaddsubForwarding action
1IF
2IDIF
3EXIDALU computes $t0
4MEMEXEX/MEM $t0 feeds sub ALU input
5WBMEM
6WB

The forwarding control is a small equality test over register numbers. In pseudocode:

// Inputs are pipeline-register fields, not architectural variables.
if (EX_MEM.RegWrite &&
    EX_MEM.Rd != 0 &&
    EX_MEM.Rd == ID_EX.Rs) {
    ForwardA = FROM_EX_MEM;
} else if (MEM_WB.RegWrite &&
           MEM_WB.Rd != 0 &&
           MEM_WB.Rd == ID_EX.Rs) {
    ForwardA = FROM_MEM_WB;
} else {
    ForwardA = FROM_ID_EX;
}

The same logic is applied to the second ALU operand register. Register zero is excluded because MIPS $zero is not changed by writes.

The Load-Use Stall

Forwarding does not remove every RAW stall. A load produces its data after the MEM stage, not after EX. The immediate consumer needs the value at the start of its EX stage one cycle too early.

lw   $t0, 0($s0)       # data returns at end of MEM
add  $t1, $t0, $t2     # needs t0 in EX

Nominal timing would put lw in MEM and add in EX during cycle 4. The loaded data is not available until the end of cycle 4, but the add operand is needed during cycle 4. One bubble is inserted.

CyclelwaddBubble
1IF
2IDIF
3EXID
4MEMIDEX
5WBEX
6MEM
7WB

The hazard detection rule is compact:

if (ID_EX.MemRead &&
    (ID_EX.Rt == IF_ID.Rs || ID_EX.Rt == IF_ID.Rt)) {
    PCWrite = 0;        // hold fetch
    IF_ID_Write = 0;    // hold decoded consumer
    ID_EX_Control = 0;  // inject bubble into EX
}

After the one-cycle bubble, the loaded value can be forwarded from MEM/WB to the add EX input.

Control Hazards

A branch changes the PC, but the pipeline fetches the next sequential instruction before the branch decision is known. If branch comparison and target computation finish in EX, then two younger instructions may already be in IF and ID.

beq  $t0, $t1, target
add  $s0, $s0, $s1     # wrong path if branch taken
sub  $s2, $s2, $s3     # wrong path if branch taken
target:
and  $s4, $s4, $s5

Predict-not-taken fetches the next sequential PC. If the branch is not taken, there is no penalty. If it is taken, the wrong-path instructions are flushed and the target fetch starts.

With a branch resolved in EX, a taken branch can cost two bubbles in this simple design. Moving branch compare to ID can reduce the penalty, but it adds forwarding and timing pressure to ID.

MIPS also exposed a delayed branch in older ISA designs. The instruction after the branch, called the delay slot, executes whether the branch is taken or not.

beq  $t0, $t1, target
add  $s0, $s0, $s1     # delay slot, always executes

A compiler tries to place useful work in the delay slot. If it cannot, it emits a nop. Modern code generation avoids depending on this model unless targeting an ISA mode that specifies it.

Key Result

Proposition

Ideal Five-Stage Pipeline Speedup Ceiling

Statement

For nn instructions in a 5-stage scalar pipeline, the ideal cycle count is n+4n+4. Compared with a non-overlapped 5-cycle implementation taking 5n5n cycles, the speedup is 5n/(n+4)5n/(n+4) and tends to 55 as nn \to \infty.

Intuition

The first instruction fills the pipeline for 5 cycles. After that, one instruction completes per cycle until the stream ends. The final 4 cycles drain partially occupied stages.

Proof Sketch

The first completion occurs in cycle 5. Each later instruction completes one cycle after the previous one, so instruction nn completes in cycle 5+(n1)=n+45+(n-1)=n+4. The non-overlapped implementation spends 5 cycles per instruction, giving 5n5n cycles. Dividing gives 5n/(n+4)5n/(n+4). The limit is 55 because 4/n04/n \to 0.

Why It Matters

The proposition gives a ceiling, not a guarantee. A loop with one load-use stall and one taken branch penalty per iteration can have CPI well above 11 even on this idealized pipeline.

Failure Mode

Do not multiply clock frequency by 5 and also set CPI to 1 without checking stage delay. Pipeline registers and the slowest stage set the clock period.

For a quick numeric model, if 100 instructions include 10 load-use bubbles and 5 taken branches with a 2-cycle penalty, cycles are 100+4+10+10=124100+4+10+10=124. CPI is 124/100=1.24124/100=1.24, and speedup over 500 cycles is 500/1244.03500/124 \approx 4.03.

Common Confusions

Watch Out

Pipelining does not make one instruction take one cycle

A single instruction still passes through IF, ID, EX, MEM, and WB. In the simple model that is 5 cycles of latency. One completed instruction per cycle is a steady-state throughput statement.

Watch Out

Forwarding is not the same as writing the register file early

Forwarding sends a value from a pipeline register to an ALU input. It bypasses the architectural register file. Early write and late read in the same cycle handles some WB to ID cases, but it does not fix an adjacent ALU dependence by itself.

Watch Out

A load-use hazard remains even with bypassing

A load value arrives after memory access. The consumer's EX stage needs the operand during the same cycle if scheduled immediately after the load. That timing mismatch forces one bubble in the canonical 5-stage pipeline.

Watch Out

Delayed branch is an ISA promise, not only a hardware trick

If the ISA defines a delay slot, software observes it. The instruction in the slot executes regardless of branch direction. Changing that behavior breaks binaries unless the ISA mode changes.

Exercises

ExerciseCore

Problem

A 5-stage pipeline runs 40 instructions with no stalls. The non-pipelined machine takes 5 cycles per instruction. Compute the pipelined cycle count, CPI, and speedup.

ExerciseCore

Problem

For the sequence below, identify whether forwarding removes the stall. If not, give the number of bubbles in the canonical 5-stage pipeline.

add  $t0, $t1, $t2
sub  $t3, $t0, $t4
lw   $t5, 0($s0)
and  $t6, $t5, $t7
ExerciseAdvanced

Problem

Assume a loop body has 8 instructions. Each iteration has one load-use stall and one branch. Predict-not-taken is used, and the branch is taken for 99 of 100 iterations. Branch resolution is in EX with a 2-cycle taken penalty. Ignore cache misses. Compute total cycles and CPI for 100 iterations, including pipeline fill and drain.

References

Canonical:

  • John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, 6th ed. (2017), §1.4 and §1.6, performance metrics and CPI
  • John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, 6th ed. (2017), §3.3, pipelining and hazards
  • David A. Patterson and John L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 5th ed. (2013), ch. 4, the 5-stage MIPS datapath and control
  • Randal E. Bryant and David R. O'Hallaron, Computer Systems: A Programmer's Perspective, 3rd ed. (2016), ch. 4, processor architecture and pipelining
  • Randal E. Bryant and David R. O'Hallaron, Computer Systems: A Programmer's Perspective, 3rd ed. (2016), §5.7, program performance and pipeline-oriented thinking

Accessible:

  • Onur Mutlu, Digital Design and Computer Architecture lecture notes, pipeline hazards and data forwarding
  • Daniel J. Sorin, Mark D. Hill, and David A. Wood, A Primer on Memory Consistency and Cache Coherence, synthesis lectures, early chapters for the processor-memory context
  • Berkeley CS 61C lecture notes, pipelining, hazards, forwarding, and branch delay slots

Next Topics

  • /computationpath/branch-prediction-basics
  • /computationpath/cache-hierarchy-basics
  • /computationpath/superscalar-execution
  • /computationpath/simultaneous-multithreading
  • /topics/partial-orders