Why This Matters
A non-pipelined implementation that spends 5 cycles per instruction has CPI . A 5-stage pipeline can complete about one instruction per cycle after fill, so the ideal steady-state CPI is . The single instruction latency is still 5 stages; the gain is throughput.
The price is hazard handling. A RAW dependence between two adjacent ALU instructions needs a bypass path, a load followed by its consumer still needs one bubble, and a taken branch under predict-not-taken squashes fetched wrong-path instructions.
Core Definitions
Pipeline stage
A pipeline stage is a slice of instruction processing separated from the next slice by pipeline registers. In the canonical 5-stage MIPS pipeline the stages are IF, ID, EX, MEM, and WB.
Throughput and latency
Latency is the time for one instruction to travel from fetch to retirement. Throughput is the rate at which completed instructions leave the pipeline. Pipelining mainly raises throughput, not single-instruction latency.
CPI
Cycles per instruction, written CPI, is total cycles divided by completed instructions. Ideal scalar pipelining gives CPI near after fill and drain. Stalls and flushes add penalty cycles.
Hazard
A hazard is a condition that prevents the next instruction from executing in its nominal pipeline stage. Structural hazards come from resource conflicts, data hazards from dependences, and control hazards from changes to the program counter.
Throughput Versus Latency
Consider five work units for every instruction, each taking one cycle. Without pipelining, instruction 1 uses all hardware for IF, then ID, then EX, then MEM, then WB. Instruction 2 starts only after instruction 1 writes back.
For four instructions, the non-pipelined schedule is 20 cycles.
| Cycle | Inst 1 | Inst 2 | Inst 3 | Inst 4 |
|---|---|---|---|---|
| 1 | IF | |||
| 2 | ID | |||
| 3 | EX | |||
| 4 | MEM | |||
| 5 | WB | |||
| 6 | IF | |||
| 10 | WB | |||
| 15 | WB | |||
| 20 | WB |
With pipelining, each instruction still takes 5 stages, but different instructions occupy different stages in the same cycle.
| Cycle | Inst 1 | Inst 2 | Inst 3 | Inst 4 |
|---|---|---|---|---|
| 1 | IF | |||
| 2 | ID | IF | ||
| 3 | EX | ID | IF | |
| 4 | MEM | EX | ID | IF |
| 5 | WB | MEM | EX | ID |
| 6 | WB | MEM | EX | |
| 7 | WB | MEM | ||
| 8 | WB |
For instructions and stages, the ideal cycle count is . For , non-pipelined execution takes cycles, while ideal pipelined execution takes cycles. The observed speedup is , below 5 because of fill and drain.
For large , the ratio tends to 5. That does not mean every program runs 5 times faster. Stage imbalance, hazards, cache misses, and branch penalties lower realized throughput.
The Canonical 5-Stage MIPS Pipeline
The teaching pipeline from MIPS has five named stages.
| Stage | Main work | State read or written |
|---|---|---|
| IF | Fetch instruction at PC, compute PC + 4 | Instruction memory, PC |
| ID | Decode opcode, read register file | Register file |
| EX | ALU operation or address calculation | ALU |
| MEM | Data memory access for load or store | Data memory |
| WB | Write register result | Register file |
A simple sequence shows how instructions overlap.
# MIPS-like assembly
add $t0, $t1, $t2 # t0 = t1 + t2
sub $t3, $t4, $t5 # t3 = t4 - t5
and $t6, $t0, $t3 # t6 = t0 & t3
lw $s0, 0($sp) # s0 = memory[sp]
The static instruction encoding is still a stream of 32-bit words. Pipelining changes when fields are consumed, not what the instruction means.
| Instruction | Example word bytes in little-endian memory |
|---|---|
add $t0,$t1,$t2 | 20 40 2a 01 |
sub $t3,$t4,$t5 | 22 58 8d 01 |
and $t6,$t0,$t3 | 24 70 0b 01 |
lw $s0,0($sp) | 00 00 b0 8f |
IF reads 4 bytes and forms the 32-bit instruction word. ID reads register specifiers. EX uses the ALU. MEM touches data memory only for lw and sw. WB writes a register for ALU instructions and loads.
A common register convention is that the register file writes in the first half of a cycle and reads in the second half. That small timing detail removes some WB to ID hazards without a separate stall.
Structural Hazards
A structural hazard occurs when two stages need the same hardware in the same cycle. The classic example is a single memory port used for both instruction fetch and data access.
lw $t0, 0($s0)
add $t1, $t2, $t3
sub $t4, $t5, $t6
or $t7, $s1, $s2
In the ideal 5-stage schedule, the load is in MEM during cycle 4. At the same time, the fourth instruction wants IF.
| Cycle | lw | add | sub | or |
|---|---|---|---|---|
| 1 | IF | |||
| 2 | ID | IF | ||
| 3 | EX | ID | IF | |
| 4 | MEM | EX | ID | IF |
| 5 | WB | MEM | EX | ID |
If there is only one memory port, cycle 4 contains a conflict. The machine must either stall IF or provide separate instruction and data memories or caches. The usual textbook MIPS pipeline assumes separate instruction and data memories, which avoids this structural hazard.
This is why the block diagram matters. Pipelining is more than just drawing diagonal stage names; the resources must support simultaneous stage activity.
Data Hazards and Forwarding
A RAW dependence occurs when an instruction reads a register that an earlier instruction writes. The most common case is an ALU result consumed by the next instruction.
add $t0, $t1, $t2 # produces t0 at end of EX
sub $t3, $t0, $t4 # needs t0 as ALU input in EX
Without forwarding, sub reads $t0 in ID before add writes it in WB. The pipeline would stall until writeback. That would waste cycles even though the ALU result exists earlier.
Forwarding, also called bypassing, routes a result from a later pipeline register directly to an ALU input. For the pair above, add computes $t0 in EX during cycle 3. The value sits in the EX/MEM pipeline register during cycle 4. The sub instruction reaches EX in cycle 4. A mux selects the forwarded EX/MEM value instead of the stale register-file value.
| Cycle | add | sub | Forwarding action |
|---|---|---|---|
| 1 | IF | ||
| 2 | ID | IF | |
| 3 | EX | ID | ALU computes $t0 |
| 4 | MEM | EX | EX/MEM $t0 feeds sub ALU input |
| 5 | WB | MEM | |
| 6 | WB |
The forwarding control is a small equality test over register numbers. In pseudocode:
// Inputs are pipeline-register fields, not architectural variables.
if (EX_MEM.RegWrite &&
EX_MEM.Rd != 0 &&
EX_MEM.Rd == ID_EX.Rs) {
ForwardA = FROM_EX_MEM;
} else if (MEM_WB.RegWrite &&
MEM_WB.Rd != 0 &&
MEM_WB.Rd == ID_EX.Rs) {
ForwardA = FROM_MEM_WB;
} else {
ForwardA = FROM_ID_EX;
}
The same logic is applied to the second ALU operand register. Register zero is excluded because MIPS $zero is not changed by writes.
The Load-Use Stall
Forwarding does not remove every RAW stall. A load produces its data after the MEM stage, not after EX. The immediate consumer needs the value at the start of its EX stage one cycle too early.
lw $t0, 0($s0) # data returns at end of MEM
add $t1, $t0, $t2 # needs t0 in EX
Nominal timing would put lw in MEM and add in EX during cycle 4. The loaded data is not available until the end of cycle 4, but the add operand is needed during cycle 4. One bubble is inserted.
| Cycle | lw | add | Bubble |
|---|---|---|---|
| 1 | IF | ||
| 2 | ID | IF | |
| 3 | EX | ID | |
| 4 | MEM | ID | EX |
| 5 | WB | EX | |
| 6 | MEM | ||
| 7 | WB |
The hazard detection rule is compact:
if (ID_EX.MemRead &&
(ID_EX.Rt == IF_ID.Rs || ID_EX.Rt == IF_ID.Rt)) {
PCWrite = 0; // hold fetch
IF_ID_Write = 0; // hold decoded consumer
ID_EX_Control = 0; // inject bubble into EX
}
After the one-cycle bubble, the loaded value can be forwarded from MEM/WB to the add EX input.
Control Hazards
A branch changes the PC, but the pipeline fetches the next sequential instruction before the branch decision is known. If branch comparison and target computation finish in EX, then two younger instructions may already be in IF and ID.
beq $t0, $t1, target
add $s0, $s0, $s1 # wrong path if branch taken
sub $s2, $s2, $s3 # wrong path if branch taken
target:
and $s4, $s4, $s5
Predict-not-taken fetches the next sequential PC. If the branch is not taken, there is no penalty. If it is taken, the wrong-path instructions are flushed and the target fetch starts.
With a branch resolved in EX, a taken branch can cost two bubbles in this simple design. Moving branch compare to ID can reduce the penalty, but it adds forwarding and timing pressure to ID.
MIPS also exposed a delayed branch in older ISA designs. The instruction after the branch, called the delay slot, executes whether the branch is taken or not.
beq $t0, $t1, target
add $s0, $s0, $s1 # delay slot, always executes
A compiler tries to place useful work in the delay slot. If it cannot, it emits a nop. Modern code generation avoids depending on this model unless targeting an ISA mode that specifies it.
Key Result
Ideal Five-Stage Pipeline Speedup Ceiling
Statement
For instructions in a 5-stage scalar pipeline, the ideal cycle count is . Compared with a non-overlapped 5-cycle implementation taking cycles, the speedup is and tends to as .
Intuition
The first instruction fills the pipeline for 5 cycles. After that, one instruction completes per cycle until the stream ends. The final 4 cycles drain partially occupied stages.
Proof Sketch
The first completion occurs in cycle 5. Each later instruction completes one cycle after the previous one, so instruction completes in cycle . The non-overlapped implementation spends 5 cycles per instruction, giving cycles. Dividing gives . The limit is because .
Why It Matters
The proposition gives a ceiling, not a guarantee. A loop with one load-use stall and one taken branch penalty per iteration can have CPI well above even on this idealized pipeline.
Failure Mode
Do not multiply clock frequency by 5 and also set CPI to 1 without checking stage delay. Pipeline registers and the slowest stage set the clock period.
For a quick numeric model, if 100 instructions include 10 load-use bubbles and 5 taken branches with a 2-cycle penalty, cycles are . CPI is , and speedup over 500 cycles is .
Common Confusions
Pipelining does not make one instruction take one cycle
A single instruction still passes through IF, ID, EX, MEM, and WB. In the simple model that is 5 cycles of latency. One completed instruction per cycle is a steady-state throughput statement.
Forwarding is not the same as writing the register file early
Forwarding sends a value from a pipeline register to an ALU input. It bypasses the architectural register file. Early write and late read in the same cycle handles some WB to ID cases, but it does not fix an adjacent ALU dependence by itself.
A load-use hazard remains even with bypassing
A load value arrives after memory access. The consumer's EX stage needs the operand during the same cycle if scheduled immediately after the load. That timing mismatch forces one bubble in the canonical 5-stage pipeline.
Delayed branch is an ISA promise, not only a hardware trick
If the ISA defines a delay slot, software observes it. The instruction in the slot executes regardless of branch direction. Changing that behavior breaks binaries unless the ISA mode changes.
Exercises
Problem
A 5-stage pipeline runs 40 instructions with no stalls. The non-pipelined machine takes 5 cycles per instruction. Compute the pipelined cycle count, CPI, and speedup.
Problem
For the sequence below, identify whether forwarding removes the stall. If not, give the number of bubbles in the canonical 5-stage pipeline.
add $t0, $t1, $t2
sub $t3, $t0, $t4
lw $t5, 0($s0)
and $t6, $t5, $t7
Problem
Assume a loop body has 8 instructions. Each iteration has one load-use stall and one branch. Predict-not-taken is used, and the branch is taken for 99 of 100 iterations. Branch resolution is in EX with a 2-cycle taken penalty. Ignore cache misses. Compute total cycles and CPI for 100 iterations, including pipeline fill and drain.
References
Canonical:
- John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, 6th ed. (2017), §1.4 and §1.6, performance metrics and CPI
- John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, 6th ed. (2017), §3.3, pipelining and hazards
- David A. Patterson and John L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 5th ed. (2013), ch. 4, the 5-stage MIPS datapath and control
- Randal E. Bryant and David R. O'Hallaron, Computer Systems: A Programmer's Perspective, 3rd ed. (2016), ch. 4, processor architecture and pipelining
- Randal E. Bryant and David R. O'Hallaron, Computer Systems: A Programmer's Perspective, 3rd ed. (2016), §5.7, program performance and pipeline-oriented thinking
Accessible:
- Onur Mutlu, Digital Design and Computer Architecture lecture notes, pipeline hazards and data forwarding
- Daniel J. Sorin, Mark D. Hill, and David A. Wood, A Primer on Memory Consistency and Cache Coherence, synthesis lectures, early chapters for the processor-memory context
- Berkeley CS 61C lecture notes, pipelining, hazards, forwarding, and branch delay slots
Next Topics
/computationpath/branch-prediction-basics/computationpath/cache-hierarchy-basics/computationpath/superscalar-execution/computationpath/simultaneous-multithreading/topics/partial-orders