Instruction Pipelining - assembly-line idea used to speed instruction completion rate
Assume that an automobile assembly process takes 4 hours.
If you divide the process into four equal stages, then ideally
time between completions =
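A quick way to check the assembly-line arithmetic is sketched below. It assumes perfectly balanced stages and no hand-off overhead between them; the function name is made up for illustration.

```python
def ideal_completion_interval(total_time, n_stages):
    """Time between finished items once the line is full (equal stages assumed)."""
    if n_stages <= 0:
        raise ValueError("n_stages must be positive")
    return total_time / n_stages

# 4-hour process split into 4 equal stages (the automobile example)
interval = ideal_completion_interval(4.0, 4)
print(f"{interval} hour(s) between completions; each car still takes 4.0 hours to build")
```

Note that pipelining improves the completion rate, not the time any single car spends on the line.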
Problems:
Serial Execution
Pipelined Execution - goal is to complete one instruction per clock cycle
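The difference between the two execution styles can be put into numbers. This is a minimal sketch, assuming every stage takes exactly one clock cycle and nothing ever stalls; the 5-stage count and the function names are illustrative assumptions.

```python
def serial_cycles(n_instructions, n_stages):
    """Each instruction runs all of its stages before the next one starts."""
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages):
    """First instruction fills the pipe; each later one finishes one cycle after."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)

for n in (1, 5, 1000):
    s, p = serial_cycles(n, 5), pipelined_cycles(n, 5)
    print(f"{n} instructions: serial={s}, pipelined={p}, speedup={s / p:.2f}")
```

As the instruction count grows, the speedup approaches the number of stages, which is why the goal is stated as one completion per clock cycle.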
Instruction-set Design Issues: what should the machine-language (ML) instruction format(s) look like?
1) Which instructions to include:
complex e.g., VAX
"MATCHC substrLength, substr, strLength, str"
2) Which built-in data types
3) Instruction format:
4) # registers
5) Addressing modes supported
Reduced Instruction Set Computers (RISC)
Two approaches to instruction set design:
1) CISC (Complex Instruction Set Computer) e.g., VAX
1960's: Make assembly language (AL) as much like high-level language (HLL) as possible to reduce the "semantic gap" between AL and HLL
Alleged Reasons:
Characteristics of CISC:
Problems with CISC:
2) RISC (1980's) Addresses these problems to improve speed.
General Characteristics of RISC:
a) one instruction completion per cycle
b) register-to-register operations
c) simple addressing modes
d) simple, fixed-length instruction formats
RISC Instruction-Set Architecture (ISA) can be effectively pipelined
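One reason fixed-length formats pipeline well is that the register fields always sit in the same bit positions, so the hardware can start reading registers without first working out the instruction's length or layout. The sketch below uses the MIPS R-type field layout purely as an illustration; these notes do not specify a particular encoding, and `decode_r_type` is a hypothetical helper.

```python
def decode_r_type(word):
    """Split a 32-bit word into MIPS-style R-type fields (illustrative layout)."""
    return {
        "opcode": (word >> 26) & 0x3F,  # bits 31..26
        "rs":     (word >> 21) & 0x1F,  # first source register
        "rt":     (word >> 16) & 0x1F,  # second source register
        "rd":     (word >> 11) & 0x1F,  # destination register
        "shamt":  (word >> 6) & 0x1F,   # shift amount
        "funct":  word & 0x3F,          # ALU function code
    }

# Build the encoding of a register-to-register add: rd=3, rs=2, rt=1, funct=0x20
word = (0 << 26) | (2 << 21) | (1 << 16) | (3 << 11) | (0 << 6) | 0x20
fields = decode_r_type(word)
print(fields)
```

Every field is extracted with a shift and a mask that do not depend on the opcode, which is the property a variable-length CISC format lacks.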
RISC Instruction Pipelining Example: One possible breakdown of instruction execution.
Stage | Abbreviation | Actions |
Instruction Fetch | IF | Read next instruction into CPU and increment PC by 4 |
Instruction Decode | ID | Determine opcode, read registers, compare registers (if branch), sign-extend immediate if needed, compute target address of branch, update PC if branch |
Execution / Effective addr | EX | Calculate using operands prepared in ID |
Memory access | MEM | |
Write-back | WB | |
Pipeline latches/registers sit between each pair of adjacent stages. They hold temporary results and act like an IR (instruction register) for the instruction currently in that stage. Some of the hardware components used (e.g., Memory and Register File) are shown as if they are duplicated, but they are not.
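The ideal flow through the five stages can be generated mechanically. The sketch below assumes one new instruction enters IF every cycle and nothing stalls; `ideal_timeline` and `render` are hypothetical helper names, and the output mimics the pipe-separated timing tables used in these notes.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def ideal_timeline(instructions):
    """Map each instruction to {cycle: stage}, starting one instruction per cycle."""
    timeline = []
    for i, text in enumerate(instructions):
        # Instruction i (0-based) is fetched in cycle i + 1
        row = {i + 1 + s: name for s, name in enumerate(STAGES)}
        timeline.append((text, row))
    return timeline

def render(timeline):
    """Format a non-empty timeline as a pipe-separated timing table."""
    last = max(max(row) for _, row in timeline)
    header = "Instructions | " + " | ".join(str(c) for c in range(1, last + 1)) + " |"
    lines = [header]
    for text, row in timeline:
        cells = [row.get(c, " ") for c in range(1, last + 1)]
        lines.append(text + " | " + " | ".join(cells) + " |")
    return "\n".join(lines)

tl = ideal_timeline(["I1", "I2", "I3"])
print(render(tl))
```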
Problems that delay/stall the pipeline:
In what stage does the ADD instruction update R3?
In what stage does the SUB instruction read R3?
The timing below gives the wrong result, since SUB reads the "old" value of R3 in ID before ADD updates R3 in the WB stage.
  | Time | |||||||||||
Instructions | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
ADD R3, R2, R1 | IF | ID | EX | MEM | WB |   |   |   |   |   |   |   |
SUB R4, R3, R5 |   | IF | ID | EX | MEM | WB |   |   |   |   |   |   |
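This kind of read-after-write dependency can be found by scanning the program for a source register that an earlier instruction writes. The sketch below assumes a hypothetical `(text, dest, sources)` tuple for each instruction and the unstalled 5-stage timing from the table above.

```python
def raw_dependencies(program):
    """Return (producer_index, consumer_index, register) for each read-after-write pair."""
    deps = []
    last_writer = {}  # register -> index of the most recent instruction that writes it
    for i, (_, dest, sources) in enumerate(program):
        for reg in sources:
            if reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        if dest is not None:
            last_writer[dest] = i
    return deps

program = [
    ("ADD R3, R2, R1", "R3", ["R2", "R1"]),
    ("SUB R4, R3, R5", "R4", ["R3", "R5"]),
]
deps = raw_dependencies(program)
for p, c, reg in deps:
    # Unstalled timing: instruction i (0-based) is in ID at cycle i + 2 and WB at cycle i + 5
    print(f"{program[c][0]} reads {reg} in ID (cycle {c + 2}); "
          f"{program[p][0]} writes it in WB (cycle {p + 5})")
```

Whenever the read cycle is earlier than the write cycle, the consumer sees the stale register value.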
Solution Alternatives:
1) Introduce stalls - stall reading of R3 in last half of ID until ADD writes R3 in first half of WB
  | Time | |||||||||||
Instructions | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
ADD R3, R2, R1 | IF | ID | EX | MEM | WB |   |   |   |   |   |   |   |
SUB R4, R3, R5 |   | IF | stall | stall | ID | EX | MEM | WB |   |   |   |   |
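The number of stall cycles follows from how far apart the producer and consumer are. This sketch assumes the 5-stage pipeline above, a register file written in the first half of WB and read in the last half of ID, and no other stalls between the two instructions; the function name is illustrative.

```python
def stalls_without_forwarding(distance):
    """Stall cycles for a consumer issued `distance` instructions after its producer.

    Producer fetched in cycle p reaches WB in cycle p + 4.
    Consumer fetched in cycle p + distance would reach ID in cycle p + distance + 1.
    ID must not come before the producer's WB, so stalls = (p + 4) - (p + distance + 1).
    """
    if distance < 1:
        raise ValueError("consumer must come after producer")
    return max(0, 3 - distance)

for d in (1, 2, 3, 4):
    print(f"distance {d}: {stalls_without_forwarding(d)} stall(s)")
```

For the adjacent ADD/SUB pair the distance is 1, which gives the two stall cycles shown in the table.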
2) Add additional hardware (bypass-signal paths) to "forward" R3's new value to the SUB instruction:
No stalls needed in this case.
  | Time | |||||||||||
Instructions | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
ADD R3, R2, R1 | IF | ID | EX | MEM | WB |   |   |   |   |   |   |   |
SUB R4, R3, R5 |   | IF | ID | EX | MEM | WB |   |   |   |   |   |   |
What would control the MUX?
MUX Operation:
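One common textbook arrangement is a forwarding unit that compares register numbers held in the pipeline latches and uses the result as the MUX select. The sketch below is one possibility, not necessarily the design pictured in these notes: it assumes operands are forwarded to the ALU input in EX, and the latch names and dictionary fields (`reg_write`, `rd`) are hypothetical.

```python
def forward_select(src_reg, ex_mem, mem_wb):
    """Pick where an ALU input should come from for source register `src_reg`.

    ex_mem / mem_wb describe the instructions one and two ahead in the pipeline:
    'reg_write' is True if that instruction writes a register, 'rd' is its destination.
    """
    # The most recent producer wins, so check the EX/MEM latch first
    if ex_mem["reg_write"] and ex_mem["rd"] == src_reg:
        return "EX/MEM"
    if mem_wb["reg_write"] and mem_wb["rd"] == src_reg:
        return "MEM/WB"
    return "REGFILE"

# Cycle 4 of the ADD/SUB example: SUB is in EX, ADD's result sits in the EX/MEM latch
add_in_mem = {"reg_write": True, "rd": "R3"}
nothing = {"reg_write": False, "rd": None}
print("R3 from", forward_select("R3", add_in_mem, nothing))
print("R5 from", forward_select("R5", add_in_mem, nothing))
```

This check is not enough when the producer is a LOAD, because the loaded value is not in the EX/MEM latch yet; that case still needs a stall.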
Consider the following code:
ADD R3, R2, R1
LOAD R4, 4(R3)
What would the timing be without bypass-signal paths/forwarding?
  | Time | |||||||||||
Instructions | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
ADD R3, R2, R1 | IF | ID | EX | MEM | WB |   |   |   |   |   |   |   |
LOAD R4, 4(R3) |   | IF | stall | stall | ID | EX | MEM | WB |   |   |   |   |
This assumes that R3 can be written in the first half of the WB stage and its new value read in the last half of the ID stage.
What would the timing be with bypass-signal paths?
  | Time | |||||||||||
Instructions | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
ADD R3, R2, R1 | IF | ID | EX | MEM | WB |   |   |   |   |   |   |   |
LOAD R4, 4(R3) |   | IF | ID | EX | MEM | WB |   |   |   |   |   |   |
Draw the bypass-signal paths needed for the above example.
How many cycles are needed to perform the following AL program without forwarding?
  | Time | |||||||||||||||
Instructions | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
ADD R3, R2, R1 | IF | ID | EX | MEM | WB |   |   |   |   |   |   |   |   |   |   |   |
LOAD R4, 4(R3) |   | IF |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
SUB R5, R4, R3 |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
STORE R5, 8(R6) |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
ADD R6, R5, R4 |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
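A hand-filled table can be checked with a small scheduler. The sketch below is a simplified model, not a full simulator: it assumes in-order issue, all source registers read in ID, a register file written in the first half of WB and read in the last half of ID, and a hypothetical `(text, dest, sources)` encoding of each instruction.

```python
def schedule_no_forwarding(program):
    """Return (text, id_cycle, wb_cycle) per instruction for a stall-only pipeline."""
    wb_of = {}    # register -> WB cycle of its most recent earlier writer
    prev_id = 1   # lets the first instruction decode in cycle 2
    out = []
    for i, (text, dest, sources) in enumerate(program):
        # Earliest ID: one cycle after own IF, and after the previous instruction's ID
        id_cycle = max(i + 2, prev_id + 1)
        for reg in sources:
            # Cannot read a register before the cycle in which its producer writes it
            id_cycle = max(id_cycle, wb_of.get(reg, 0))
        wb_cycle = id_cycle + 3  # ID -> EX -> MEM -> WB
        if dest is not None:
            wb_of[dest] = wb_cycle
        out.append((text, id_cycle, wb_cycle))
        prev_id = id_cycle
    return out

program = [
    ("ADD R3, R2, R1",  "R3", ["R2", "R1"]),
    ("LOAD R4, 4(R3)",  "R4", ["R3"]),
    ("SUB R5, R4, R3",  "R5", ["R4", "R3"]),
    ("STORE R5, 8(R6)", None, ["R5", "R6"]),
    ("ADD R6, R5, R4",  "R6", ["R5", "R4"]),
]
sched = schedule_no_forwarding(program)
for text, id_cycle, wb_cycle in sched:
    print(f"{text:<16} ID in cycle {id_cycle:>2}, WB in cycle {wb_cycle:>2}")
print("total cycles:", sched[-1][2])
```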
How many cycles are needed to perform the following AL program with forwarding?
  | Time | |||||||||||||||
Instructions | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
ADD R3, R2, R1 | IF | ID | EX | MEM | WB |   |   |   |   |   |   |   |   |   |   |   |
LOAD R4, 4(R3) |   | IF |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
SUB R5, R4, R3 |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
STORE R5, 8(R6) |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
ADD R6, R5, R4 |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
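The forwarding case can be checked the same way. This sketch is a simplified model under stated assumptions: in-order issue, every source operand needed at the start of EX, an ALU result forwardable at the end of its EX cycle, and a loaded value forwardable at the end of its MEM cycle. The `(text, kind, dest, sources)` encoding is hypothetical.

```python
def schedule_with_forwarding(program):
    """Return (text, ex_cycle, wb_cycle) per instruction with full bypass paths."""
    ready = {}    # register -> cycle at whose end its newest value can be forwarded
    prev_ex = 2   # lets the first instruction execute in cycle 3
    out = []
    for i, (text, kind, dest, sources) in enumerate(program):
        # Earliest EX: two cycles after own IF, and after the previous instruction's EX
        ex = max(i + 3, prev_ex + 1)
        for reg in sources:
            # A forwarded value is usable in the cycle after it becomes ready
            ex = max(ex, ready.get(reg, 0) + 1)
        if dest is not None:
            # Loads produce their value in MEM, one cycle later than ALU results
            ready[dest] = ex + 1 if kind == "load" else ex
        out.append((text, ex, ex + 2))  # EX -> MEM -> WB
        prev_ex = ex
    return out

program = [
    ("ADD R3, R2, R1",  "alu",   "R3", ["R2", "R1"]),
    ("LOAD R4, 4(R3)",  "load",  "R4", ["R3"]),
    ("SUB R5, R4, R3",  "alu",   "R5", ["R4", "R3"]),
    ("STORE R5, 8(R6)", "store", None, ["R5", "R6"]),
    ("ADD R6, R5, R4",  "alu",   "R6", ["R5", "R4"]),
]
sched = schedule_with_forwarding(program)
for text, ex, wb in sched:
    print(f"{text:<16} EX in cycle {ex:>2}, WB in cycle {wb:>2}")
print("total cycles:", sched[-1][2])
```

The only stall that remains in this model is the load-use case, where an instruction needs a loaded value in the cycle immediately after the LOAD's EX stage.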
Draw ALL the bypass-signal paths needed for the above example.