Instruction Pipelining - an assembly-line technique used to increase the rate at which instructions complete
Assume that an automobile assembly process takes 4 hours.
If you divide the process into four equal stages, then ideally
time between completions = 4 hours / 4 stages = 1 hour
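The assembly-line arithmetic can be checked with a short calculation. This is a sketch under stated assumptions, not part of the original notes: the function names and the batch of 10 cars are hypothetical, every stage takes exactly 1 hour, and no stage ever waits on another.

```python
def serial_time(n_items, n_stages, stage_time):
    """Total time when each item must finish all stages before the next starts."""
    return n_items * n_stages * stage_time


def pipelined_time(n_items, n_stages, stage_time):
    """Total time when a new item enters the line every stage_time.

    The first item needs n_stages stage times to fill the line;
    each later item completes one stage time after the previous one.
    """
    return (n_stages + n_items - 1) * stage_time


cars = 10  # hypothetical batch size, chosen only for illustration
print(serial_time(cars, 4, 1.0))     # 40.0 hours
print(pipelined_time(cars, 4, 1.0))  # 13.0 hours
```

Each car still spends 4 hours on the line; only the time between completions drops, which is why the gain shows up as throughput rather than as a shorter time per car.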
Problems:
Serial Execution
Pipelined Execution
Instruction Pipelining Example: One possible breakdown of instruction execution. (Different from the textbook's.)
Stage | Abbreviation | Actions |
Fetch Instruction | FI | Read next instruction into CPU |
Decode Instruction | DI | Determine opcode and operand specifiers |
Calculate Operands | CO | Calculate the effective addresses of all operands |
Fetch Operands | FO | Fetch operands from memory or register file |
Execute Instruction | EI | Perform the indicated operation |
Write Operand | WO | Write operand to memory or register file |
Pipeline latches/registers sit between each pair of adjacent stages. They hold temporary results and act like an instruction register (IR). Some of the hardware components used (e.g., Memory and Register File) are shown as if they are duplicated, but they are not.
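The six-stage breakdown can be turned into a timing diagram mechanically. The sketch below assumes an ideal pipeline (one instruction issued per cycle, no stalls); the helper names `ideal_schedule` and `render` are hypothetical and exist only for this illustration.

```python
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]


def ideal_schedule(instructions):
    """Return a list of (instruction, {cycle: stage}) pairs for a stall-free pipeline.

    Instruction i (0-based) enters FI in cycle i + 1 and advances one stage per cycle.
    """
    rows = []
    for i, instr in enumerate(instructions):
        row = {i + 1 + s: name for s, name in enumerate(STAGES)}
        rows.append((instr, row))
    return rows


def render(rows, n_cycles):
    """Format each row as pipe-separated cells, one cell per clock cycle."""
    lines = []
    for instr, row in rows:
        cells = [row.get(cycle, "") for cycle in range(1, n_cycles + 1)]
        lines.append(" | ".join([instr] + cells))
    return lines


rows = ideal_schedule(["ADD R3, R2, R1", "SUB R4, R3, R5"])
for line in render(rows, 7):
    print(line)
```

With two instructions, the second finishes WO in cycle 7, one cycle after the first. This ideal diagram ignores data dependences between the instructions.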
Problems that delay/stall the pipeline:
In what stage does the ADD instruction update R3?
In what stage does the SUB instruction read R3?
  | Time |
Instructions | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
ADD R3, R2, R1 | FI | DI | CO | FO | EI | WO |   |   |   |   |   |   |   |   |   |
SUB R4, R3, R5 |   | FI | DI | CO | FO | EI | WO |   |   |   |   |   |   |   |   |
Alternatives:
1) Introduce stalls
  | Time |
Instructions | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
ADD R3, R2, R1 | FI | DI | CO | FO | EI | WO |   |   |   |   |   |   |   |   |   |
SUB R4, R3, R5 |   | FI | DI | CO | stall | stall | FO | EI | WO |   |   |   |   |   |   |
2) Add additional hardware (bypass-signal paths) to "forward" R3's new value to the SUB instruction:
No stalls needed in this case.
  | Time |
Instructions | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
ADD R3, R2, R1 | FI | DI | CO | FO | EI | WO |   |   |   |   |   |   |   |   |   |
SUB R4, R3, R5 |   | FI | DI | CO | FO | EI | WO |   |   |   |   |   |   |   |   |
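The stall counts for the two alternatives can be derived from when the producer's value becomes available and when the consumer needs it. The sketch below is one way to model this, not the notes' own method. It assumes the producer issues in cycle 1, that without forwarding the value is readable only after the producer's WO stage, and that with forwarding the result is available at the end of EI and is routed straight to the consumer's EI input. The function name and parameters are hypothetical.

```python
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]


def stalls_needed(read_stage, ready_stage, distance=1, same_cycle_ok=False):
    """Stall cycles for a consumer issued `distance` cycles after the producer.

    read_stage:    consumer stage that needs the value
    ready_stage:   producer stage at whose end the value becomes available
    same_cycle_ok: True if the value can be written and read in the same cycle
    """
    ready_cycle = STAGES.index(ready_stage) + 1               # producer issued in cycle 1
    natural_read = distance + STAGES.index(read_stage) + 1    # cycle with no stalls
    earliest_read = ready_cycle if same_cycle_ok else ready_cycle + 1
    return max(0, earliest_read - natural_read)


# Alternative 1: no forwarding. SUB reads R3 in FO; ADD writes R3 in WO.
print(stalls_needed(read_stage="FO", ready_stage="WO"))  # 2

# Alternative 2: forwarding. ADD's EI result feeds SUB's EI directly.
print(stalls_needed(read_stage="EI", ready_stage="EI"))  # 0
```

The two results match the timing tables for the stall and forwarding alternatives: two stall cycles before FO in the first case, none in the second.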
MUX Operation:
Consider the following code:
ADD R3, R2, R1
LOAD R4, 4(R3)
What would the timing be without bypass-signal paths/forwarding?
  | Time |
Instructions | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
ADD R3, R2, R1 | FI | DI | CO | FO | EI | WO |   |   |   |   |   |   |   |   |   |
LOAD R4, 4(R3) |   | FI | DI | DI | DI | DI | DI | CO | FO | EI | WO |   |   |   |   |
This timing assumes that R3 cannot be written by ADD's WO stage and have its new value read by LOAD's DI stage in the same clock cycle.
If we assume that R3 can be written in the first half of the WO stage and its new value read in the last half of the DI stage, then we get:
  | Time |
Instructions | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
ADD R3, R2, R1 | FI | DI | CO | FO | EI | WO |   |   |   |   |   |   |   |   |   |
LOAD R4, 4(R3) |   | FI | DI | DI | DI | DI | CO | FO | EI | WO |   |   |   |   |   |
What would the timing be with bypass-signal paths?
  | Time |
Instructions | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
ADD R3, R2, R1 | FI | DI | CO | FO | EI | WO |   |   |   |   |   |   |   |   |   |
LOAD R4, 4(R3) |   | FI | DI | stall | stall | CO | FO | EI | WO |   |   |   |   |   |   |
Draw the bypass-signal paths needed for the above example.
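The three LOAD timings can be reproduced with a small row builder. This is a hedged sketch with hypothetical names (`load_row`, `completion_cycle`). It assumes ADD issues in cycle 1 and LOAD in cycle 2, that without forwarding LOAD reads R3 in DI after ADD's WO in cycle 6, and that with forwarding ADD's EI result from cycle 5 feeds LOAD's CO stage. Waiting cycles are labelled "stall" here, where the tables without forwarding draw them as repeated DI cycles.

```python
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]


def load_row(ready_cycle, need_stage, same_cycle_ok=False, issue_cycle=2):
    """Cycle-by-cycle stage list for the dependent instruction.

    ready_cycle:   cycle in which the producer makes the value available
    need_stage:    stage of the dependent instruction that needs the value
    same_cycle_ok: True if the value may be written and read in one cycle
    """
    idx = STAGES.index(need_stage)
    natural = issue_cycle + idx                 # cycle of need_stage with no stalls
    earliest = ready_cycle if same_cycle_ok else ready_cycle + 1
    stalls = max(0, earliest - natural)
    return STAGES[:idx] + ["stall"] * stalls + STAGES[idx:]


def completion_cycle(row, issue_cycle=2):
    """Cycle in which the last stage of the row executes."""
    return issue_cycle + len(row) - 1


no_fwd = load_row(ready_cycle=6, need_stage="DI")
split = load_row(ready_cycle=6, need_stage="DI", same_cycle_ok=True)
fwd = load_row(ready_cycle=5, need_stage="CO")

print(completion_cycle(no_fwd))  # 11
print(completion_cycle(split))   # 10
print(completion_cycle(fwd))     # 9
print(fwd)
```

Under these assumptions LOAD completes in cycle 11 without forwarding, cycle 10 with the split-cycle register file, and cycle 9 with bypass paths.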