CPU and Machine Model · 35 min

Registers And The Instruction Cycle

The CPU as a fetch-decode-execute-writeback loop over a small register file, with PC, IR, ALU, and one ADD instruction through the datapath.

Why This Matters

A 64-bit integer add that reads its operands from the register file can complete in one processor cycle, while a load that misses in the caches and goes to DRAM often costs tens to hundreds of cycles. The ISA therefore treats registers as the primary operands of computation: fixed in count, fixed in width, and named by small integer fields in the instruction word.

Every compiled loop becomes a stream of register transfers. A matrix multiply inner loop, a pointer chase, and a neural-network kernel all reduce to repeated fetch, decode, execute, and writeback steps. The program counter selects the next instruction; the instruction register holds its bits; the ALU computes; the register file records the result.

Core Definitions

Definition

Register

A register is a small, fixed-width storage cell inside the processor datapath. An architectural integer register in a 64-bit ISA usually stores 64 data bits and is named by an instruction field such as r1 or x5. A register read has lower latency than a cache access because it uses local wires and small multiplexers rather than an address translation path and cache tag comparison.

Definition

Register file

A register file is an array of registers plus read and write ports. A simple integer datapath for ADD rd, rs1, rs2 needs two read ports and one write port: read rs1, read rs2, and write rd after the ALU result is ready.

Definition

Program counter

The program counter, often PC, is a special-purpose register that holds the address of the instruction being fetched or the address of the next instruction, depending on the pipeline convention. For a fixed 32-bit instruction ISA without a taken branch, the common update is $PC \leftarrow PC + 4$.

Definition

Instruction register

The instruction register, often IR, holds the fetched instruction bits while decode logic extracts the opcode, source register numbers, destination register number, immediate fields, and control signals.

Definition

Architectural versus micro-architectural registers

Architectural registers are named by the ISA and visible to compiled code. Micro-architectural registers are internal storage locations used by a concrete implementation, such as pipeline registers, physical registers for renaming, reorder-buffer entries, and bypass latches.

Registers In The Programmer's Machine

General-purpose registers hold integers, addresses, and bit patterns. The ISA manual defines their number and width. If an ISA has 32 general-purpose registers, each source or destination register number needs 5 instruction bits because $2^5 = 32$. If it has 16 registers, a register field needs 4 bits.

A small abstract ISA with 32-bit instructions might encode an integer register-register add like this.

Field            Bits  Value for ADD r3, r1, r2
opcode           6     000000
rd               5     00011
rs1              5     00001
rs2              5     00010
unused/function  11    00000000000

One compact representation:

bits 31..26  opcode  000000
bits 25..21  rd      00011
bits 20..16  rs1     00001
bits 15..11  rs2     00010
bits 10..0   rest    00000000000

full word    00000000011000010001000000000000
hex          0x00611000

On a little-endian machine storing this 32-bit instruction at byte address 0x1000, memory holds the least significant byte first.

Address  Byte
0x1000   0x00
0x1001   0x10
0x1002   0x61
0x1003   0x00

The instruction word is reconstructed by the fetch unit as 0x00611000 before decode. Endianness changes the byte order in memory, not the register numbers after decode.

Special-purpose registers carry control state. The stack pointer SP points to the active stack frame boundary used by calls, returns, spills, and local storage. The status or flags register holds condition codes such as zero, negative, carry, and overflow on ISAs that expose them. The PC drives instruction fetch. The IR is usually not visible to software, but it is part of the textbook datapath.

The Fetch-Decode-Execute-Writeback Cycle

The sequential description of instruction execution is a state transition over registers and memory. At time $t$, the machine state contains PC, architectural registers, memory, and status bits. One instruction computes the state at $t+1$.

For a fixed 32-bit instruction stream with no branch, the high-level loop is:

IR  <- Memory[PC .. PC+3]
PC  <- PC + 4
ctl <- Decode(IR)
A   <- Reg[rs1]
B   <- Reg[rs2]
Y   <- ALU(A, B, ctl.alu_op)
Reg[rd] <- Y

A C-like interpreter for the abstract ADD instruction looks like this:

#include <stdint.h>

struct CPU {
    uint32_t pc;
    uint32_t regs[32];
    uint8_t *mem;
};

static uint32_t load32_le(uint8_t *p) {
    return ((uint32_t)p[0]) |
           ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) |
           ((uint32_t)p[3] << 24);
}

void step_one_add(struct CPU *cpu) {
    uint32_t ir = load32_le(&cpu->mem[cpu->pc]);
    cpu->pc += 4;

    uint32_t rd  = (ir >> 21) & 31;
    uint32_t rs1 = (ir >> 16) & 31;
    uint32_t rs2 = (ir >> 11) & 31;

    cpu->regs[rd] = cpu->regs[rs1] + cpu->regs[rs2];
}

This is not how a high-performance core is implemented, but it matches the architected transition for this instruction. Hardware performs these actions with combinational decode logic, register-file ports, an ALU, multiplexers, latches, and clock edges.

Worked ADD Instruction Walkthrough

Assume this initial state.

Register or memory     Value
PC                     0x00001000
Reg[1]                 0x00000007
Reg[2]                 0x0000000b
Reg[3]                 0xdeadbeef
Mem[0x1000..0x1003]    00 10 61 00

Fetch reads four bytes starting at PC.

IR <- 0x00611000
PC <- 0x00001004

Decode extracts the fields.

opcode = 0
rd     = 3
rs1    = 1
rs2    = 2

The register file presents the two operands on its read-data wires.

A <- Reg[1] = 0x00000007
B <- Reg[2] = 0x0000000b

Execute runs the ALU addition.

Y <- A + B = 0x00000012

Writeback stores the result into the destination architectural register.

Reg[3] <- 0x00000012

The final visible state is PC = 0x1004, Reg[3] = 18, and all other registers unchanged, except for status flags on an ISA where integer add writes them. If a status register is present and the add updates a zero flag, that flag is clear because the result is nonzero. Carry and overflow depend on the ISA's exact flag rules and operand width.

A small assembly fragment using this abstract syntax is:

add r3, r1, r2
add r4, r3, r3

The second instruction has a data dependence on r3. In a single sequential model, that means the second decode reads the new Reg[3]. In a pipeline, bypassing or stalling must preserve the same architectural result.

Single-Cycle, Multi-Cycle, And Five Pipeline Stages

A single-cycle datapath completes every instruction in one long clock period. The clock must be long enough for the slowest instruction path, often instruction fetch, register read, ALU, data memory, and register write. A register-register add then pays for memory stages it does not need.

A multi-cycle datapath reuses hardware over several shorter cycles. Instruction fetch can occupy one cycle, decode another, execute another, and writeback another. A load instruction can use an extra data-memory cycle. The control unit tracks which substep is active.

The canonical five-stage RISC pipeline splits the work into stages.

Stage  Name                 Main work
IF     instruction fetch    read instruction memory, compute PC + 4
ID     instruction decode   decode fields, read register operands
EX     execute              ALU operation or effective address
MEM    memory               data cache access for load/store
WB     writeback            write register result

For three independent adds, the pipeline occupancy is:

cycle      1   2   3   4   5   6   7
add r3     IF  ID  EX  MEM WB
add r4         IF  ID  EX  MEM WB
add r5             IF  ID  EX  MEM WB

The latency of one add is still five cycles from IF to WB in this simple picture. The throughput after fill is one completed instruction per cycle if there are no stalls. Hazards break that ideal. A load-use dependence, a taken branch, or a structural conflict can insert bubbles.

Why The Register Count Is Small

A 32-entry register file with 64-bit registers stores only 2048 data bits. The storage bits are not the hard part. The ports, decoders, wordlines, bitlines, bypass wires, and multiplexers dominate delay and energy.

For an instruction like ADD rd, rs1, rs2, an in-order scalar core commonly needs 2 read ports and 1 write port on the integer register file. A dual-issue design that can start two integer adds in the same cycle can need 4 read ports and 2 write ports, unless it restricts issue combinations or banks the file. More ports mean more wiring and larger cells.

The read path also grows with register count. A 32-register file needs a 32-to-1 selection per output bit, while a 64-register file needs a 64-to-1 selection per output bit and 6-bit register specifiers in instruction fields. Wider register specifiers leave fewer instruction bits for opcodes and immediates, and the larger array adds wire delay from sheer physical distance.

This is why ISAs expose a modest architectural register set: enough to keep common temporaries near the ALU, not so many that every instruction burns many bits naming them. Micro-architectures may contain many more physical registers for out-of-order execution, but those are not directly named by machine code.

Key Result

Two timing equations matter:

$T_{\text{single}} \geq T_{\text{IF}} + T_{\text{ID}} + T_{\text{EX}} + T_{\text{MEM}} + T_{\text{WB}}$

$T_{\text{pipe}} \geq \max(T_{\text{IF}}, T_{\text{ID}}, T_{\text{EX}}, T_{\text{MEM}}, T_{\text{WB}}) + T_{\text{latch}}$

Suppose the stage delays are 250 ps for IF, 120 ps for ID, 180 ps for EX, 300 ps for MEM, and 100 ps for WB. A single-cycle datapath needs at least 950 ps per instruction, so its best CPI is 1 but its maximum instruction rate is about 1.05 billion instructions per second.

With a five-stage pipeline and 20 ps latch overhead, the clock period is at least max(250,120,180,300,100)+20 = 320 ps. After fill, the best throughput is one instruction per 320 ps, about 3.125 billion instructions per second. The latency of one isolated instruction is 5 * 320 ps = 1600 ps, longer than the single-cycle latency. Pipelining improves throughput, not the latency of a lone instruction.

Architectural invariants constrain both designs.

For an ADD with no trap:
Reg[rd] after writeback equals old Reg[rs1] + old Reg[rs2] modulo 2^w.
The next sequential PC equals old PC + instruction_length.
Instructions after the ADD must observe the written architectural value.

These invariants are why forwarding networks, stalls, scoreboards, and reorder buffers exist. They let a faster implementation preserve the same architectural transition.

Common Confusions

Watch Out

PC and IR are not ordinary temporaries

The PC is part of the control path, more than just another integer variable. Changing it changes the next fetch address. The IR holds bits being decoded and is often a pipeline register field rather than a programmer-visible register. Treating both as general-purpose registers hides control-flow behavior.

Watch Out

More registers are not automatically faster

More architectural registers reduce spills in some programs, but they widen register fields and enlarge register-file selection logic. A 64-register ISA needs 6 bits per register operand. A three-register instruction spends 18 bits on register names before opcode and immediate fields.

Watch Out

A pipeline is not parallel execution of one instruction

The five stages overlap different instructions. One instruction is not doing IF, ID, EX, MEM, and WB at the same time. The pipeline is an assembly line for an instruction stream; hazards force it to pause or forward values.

Exercises

Exercise · Core

Problem

A 32-bit fixed-length ISA has 32 general-purpose registers. An instruction format contains three register fields and a 7-bit opcode. How many bits remain for an immediate or function field? What changes if the ISA has 64 registers?

Exercise · Core

Problem

Use the abstract encoding from the page. Memory at 0x2000 contains 00 18 82 00. The initial state is PC = 0x2000, Reg[2] = 9, Reg[3] = 14, and Reg[4] = 99. Decode and execute the instruction.

Exercise · Advanced

Problem

A single-cycle datapath has stage delays IF 220 ps, ID 160 ps, EX 210 ps, MEM 280 ps, and WB 120 ps. A five-stage pipeline adds 25 ps of latch overhead per stage. Compute the best single-cycle instruction throughput and the best pipelined steady-state throughput, ignoring stalls.

References

Canonical:

  • Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 6th ed. (2017), ch. 1, quantitative design principles and performance equations
  • Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 6th ed. (2017), §3.3, instruction-level parallelism and pipeline hazards
  • Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective, 3rd ed. (2016), ch. 4, processor architecture and sequential implementations
  • Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective, 3rd ed. (2016), ch. 5, program performance and machine-level effects
  • Patterson and Hennessy, Computer Organization and Design: The Hardware/Software Interface, 5th ed. (2014), ch. 4, datapath, control, and pipelining

Accessible:

  • Berkeley CS 61C, Machine Structures, lecture notes on CPU datapath and pipelining
  • Carnegie Mellon 15-213, Introduction to Computer Systems, processor architecture notes
  • RISC-V International, The RISC-V Instruction Set Manual, Volume I, unprivileged ISA overview and integer register model

Next Topics

  • /computationpath/machine-code-and-assembly
  • /computationpath/memory-hierarchy-and-cache-basics
  • /computationpath/pipelining-hazards-and-forwarding
  • /computationpath/out-of-order-execution-and-register-renaming