Instruction Pipelining - assembly-line idea used to increase the rate at which instructions complete

Assume that an automobile assembly process takes 4 hours.

If you divide the process into four equal stages, then ideally

time between completions = 4 hours / 4 stages = 1 hour

Problems:

stages might not be balanced
overhead of moving cars between stages
two stages need same specialized tool (structural hazard)

Serial Execution

Pipelined Execution - goal is to complete one instruction per clock cycle
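
The serial and pipelined cases can be compared with a little arithmetic. The sketch below assumes perfectly balanced stages, no hazards, and no overhead for moving work between stages; the function names are illustrative and not part of the notes.

```python
def serial_time(n_items, n_stages, stage_time=1):
    """Time to finish n_items when each item runs every stage before the next starts."""
    return n_items * n_stages * stage_time


def pipelined_time(n_items, n_stages, stage_time=1):
    """Ideal pipelined time: fill the pipeline once, then one completion per stage time."""
    if n_items == 0:
        return 0
    return (n_stages + n_items - 1) * stage_time


# Automobile example: 4 one-hour stages, 10 cars.
print(serial_time(10, 4))     # 40 hours
print(pipelined_time(10, 4))  # 13 hours

# Five-stage instruction pipeline, 1000 instructions, time in clock cycles.
speedup = serial_time(1000, 5) / pipelined_time(1000, 5)
print(round(speedup, 2))      # 4.98, approaching 5 as the instruction count grows
```

In the ideal case the speedup approaches the number of stages; the problems listed above (unbalanced stages, transfer overhead, hazards) keep real pipelines below that bound.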

Instruction-set Design Issues: what should the machine-language (ML) instruction format(s) look like?

1) Which instructions to include:

How many?
Complexity - simple, e.g., "ADD R1, R2, R3"
complex, e.g., VAX "MATCHC substrLength, substr, strLength, str", which looks for a substring within a string

2) Which built-in data types

3) Instruction format:

Length (fixed, variable)
number of addresses (2, 3, etc.)
field sizes

4) # registers

5) Addressing modes supported

Reduced Instruction Set Computers (RISC)

Two approaches to instruction set design:

1) CISC (Complex Instruction Set Computer) e.g., VAX

1960's: Make assembly language (AL) as much like high-level language (HLL) as possible to reduce the "semantic gap" between AL and HLL

Alleged Reasons:

reduce compiler complexity and aid assembly language programming - compilers not too good at the time (e.g., they did not allocate registers very efficiently)
reduce the code size - (memory limited at this time)
improve code efficiency - complex sequence of instructions implemented in microcode (e.g., VAX "MATCHC substrLength, substr, strLength, str" that looks for a substring within a string)
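
To make the MATCHC example concrete, the sketch below spells out a substring search as the kind of simple load, compare, and branch steps that a single complex instruction would bundle into microcode. It is only an illustration: the function name and the return convention (an index, or -1 if there is no match) are assumptions and do not describe how the VAX instruction reports its result.

```python
def match_substring(substr, string):
    """Return the index of the first occurrence of substr in string, or -1."""
    sub_len, str_len = len(substr), len(string)
    # Each loop iteration stands for a few simple load, compare, and branch instructions.
    for start in range(str_len - sub_len + 1):
        offset = 0
        while offset < sub_len and string[start + offset] == substr[offset]:
            offset += 1
        if offset == sub_len:
            return start
    return -1


print(match_substring("line", "pipeline"))  # 4
print(match_substring("vax", "pipeline"))   # -1
```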

Characteristics of CISC:

high-level like AL instructions
variable format and number of cycles
many addressing modes (VAX 10 addressing modes)

Problems with CISC:

complex hardware is needed to implement more, and more complex, instructions, which slows the execution of simpler instructions
compiler can rarely figure out when to use complex instructions (verified by studies of programs)
variability in instruction format and instruction execution time made CISC hard to pipeline

2) RISC (1980's) Addresses these problems to improve speed.

General Characteristics of RISC:

emphasis on optimizing instruction pipeline
a) one instruction completion per cycle
b) register-to-register operations
c) simple addressing modes
d) simple, fixed-length instruction formats
limited and simple instruction set and addressing modes
large number of registers or use of compiler technology to optimize register usage
hardwired control unit
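
One reason fixed-length formats help the pipeline is that every field sits at a known bit position, so decoding takes only shifts and masks. The sketch below assumes a 32-bit, MIPS-style R-type layout (6-bit opcode, three 5-bit register fields, 5-bit shift amount, 6-bit function code); the notes do not specify an encoding, so the layout and the function name are assumptions for illustration.

```python
def decode_fixed_format(word):
    """Split a 32-bit instruction word into fields of an assumed MIPS-like R-type layout."""
    return {
        "opcode": (word >> 26) & 0x3F,  # bits 31-26
        "rs": (word >> 21) & 0x1F,      # bits 25-21, first source register
        "rt": (word >> 16) & 0x1F,      # bits 20-16, second source register
        "rd": (word >> 11) & 0x1F,      # bits 15-11, destination register
        "shamt": (word >> 6) & 0x1F,    # bits 10-6, shift amount
        "funct": word & 0x3F,           # bits 5-0, function code
    }


# 0x00411820 encodes "add $3, $2, $1" in the assumed layout.
fields = decode_fixed_format(0x00411820)
print(fields["rs"], fields["rt"], fields["rd"])  # 2 1 3
```

With a variable-length format, the position of a field can depend on earlier bytes, so the decoder cannot extract all fields in parallel.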

RISC Instruction-Set Architecture (ISA) can be effectively pipelined

RISC Instruction Pipelining Example: One possible breakdown of instruction execution.

Stage                       Abbreviation  Actions
Instruction Fetch           IF            Read next instruction into CPU and increment PC by 4
Instruction Decode          ID            Determine opcode, read registers, compare registers (if branch), sign-extend immediate if needed, compute target address of branch, update PC if branch
Execution / Effective addr  EX            Calculate using operands prepared in ID
                                            memory ref: add base reg to offset to form effective address
                                            reg-reg ALU: ALU performs specified calculation
                                            reg-immediate ALU: ALU performs specified calculation
Memory access               MEM           load: read memory from effective address into pipeline register
                                          store: write reg value from ID stage to memory at effective address
Write-back                  WB            ALU or load instruction: write result into register file

Pipeline latches/registers sit between each pair of adjacent stages. They hold temporary results and act like an instruction register (IR). Some of the hardware components used (e.g., Memory and Register File) are shown as if they are duplicated, but they are not.
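
The per-stage actions can be sketched as a small functional model that walks one instruction at a time through IF, ID, EX, MEM, and WB. This is a sequential sketch with no overlap between instructions, so it shows what each stage does rather than pipeline timing. The tuple form of an instruction and the dictionary-based register file and memory are assumptions made for illustration.

```python
def run_instruction(instr, regs, memory, pc):
    """Walk one instruction through IF, ID, EX, MEM, WB; return the updated PC.

    instr is a hypothetical tuple: (op, dest, src1, src2_or_offset).
    """
    # IF: the instruction is already in hand here; just increment the PC by 4.
    pc += 4

    # ID: determine the opcode and read the source register(s).
    op, dest, src1, src2 = instr
    a = regs[src1]
    b = regs[src2] if op in ("ADD", "SUB") else src2  # register value or immediate offset

    # EX: the ALU computes a result or an effective address.
    if op == "SUB":
        alu_out = a - b
    else:  # ADD, LOAD, and STORE all add
        alu_out = a + b

    # MEM: only loads and stores touch data memory.
    if op == "LOAD":
        result = memory[alu_out]
    elif op == "STORE":
        memory[alu_out] = regs[dest]  # for STORE, dest names the register being stored
        result = None
    else:
        result = alu_out

    # WB: ALU and load instructions write the register file.
    if op in ("ADD", "SUB", "LOAD"):
        regs[dest] = result
    return pc


regs = {"R1": 8, "R2": 4, "R3": 0, "R4": 0}
memory = {16: 99}
pc = 0
pc = run_instruction(("ADD", "R3", "R2", "R1"), regs, memory, pc)  # R3 = 4 + 8
pc = run_instruction(("LOAD", "R4", "R3", 4), regs, memory, pc)    # R4 = memory[12 + 4]
print(regs["R3"], regs["R4"], pc)  # 12 99 8
```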

Problems that delay/stall the pipeline:

structural hazard - a piece of hardware is needed by several stages at the same time, e.g., Memory in IF and MEM. This might require the stages to access the hardware sequentially.
data hazard - an instruction depends on the result of a previous instruction that has not been calculated yet. Read-after-write (RAW) example:

ADD R3, R2, R1 ; R3 ← R2 + R1
SUB R4, R3, R5 ; R4 ← R3 - R5

In what stage does the ADD instruction update R3?

In what stage does the SUB instruction read R3?

control hazard - branching makes it difficult to fetch the "correct" instructions to be executed
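
A read-after-write dependence can be found mechanically by comparing each instruction's destination register with the source registers of the instructions that follow it. The sketch below is a minimal illustration; the (text, dest, sources) tuple form and the default window of two following instructions are assumptions, chosen because in a five-stage pipeline without forwarding only the next two instructions can read a register before it is written back.

```python
def find_raw_hazards(program, window=2):
    """List (producer_index, consumer_index, register) RAW dependences.

    program is a list of (text, dest, sources) tuples; dest is None for stores.
    window is how many following instructions are checked.
    """
    hazards = []
    for i, (_, dest, _) in enumerate(program):
        if dest is None:
            continue
        for j in range(i + 1, min(i + 1 + window, len(program))):
            if dest in program[j][2]:
                hazards.append((i, j, dest))
            if program[j][1] == dest:
                break  # a later write to dest hides the earlier producer
    return hazards


program = [
    ("ADD R3, R2, R1", "R3", ("R2", "R1")),
    ("SUB R4, R3, R5", "R4", ("R3", "R5")),
]
print(find_raw_hazards(program))  # [(0, 1, 'R3')]
```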

Data Hazards:

The timing below produces a wrong result because SUB reads the "old" value of R3 in ID, before ADD updates R3 in the WB stage.

                  Time
Instructions      1     2     3     4     5     6     7     8     9     10    11    12
ADD R3, R2, R1    IF    ID    EX    MEM   WB
SUB R4, R3, R5          IF    ID    EX    MEM   WB

Solution Alternatives:

1) Introduce stalls - stall reading of R3 in last half of ID until ADD writes R3 in first half of WB

                  Time
Instructions      1     2     3     4     5     6     7     8     9     10    11    12
ADD R3, R2, R1    IF    ID    EX    MEM   WB
SUB R4, R3, R5          IF    stall stall ID    EX    MEM   WB

2) Add additional hardware (bypass-signal paths) to "forward" R3's new value to the SUB instruction:

No stalls needed in this case.

                  Time
Instructions      1     2     3     4     5     6     7     8     9     10    11    12
ADD R3, R2, R1    IF    ID    EX    MEM   WB
SUB R4, R3, R5          IF    ID    EX    MEM   WB

What would control the MUX?

MUX Operation:
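
One common way to think about the MUX select logic is as a comparison between the source register of the instruction in EX and the destination registers held in the later pipeline latches. The sketch below is a simplified illustration, not the circuit from the lecture figure: the latch records, field names, and function name are hypothetical, and it assumes the EX/MEM latch holds a finished ALU result rather than a load still waiting on memory.

```python
def select_alu_operand(src_reg, regfile_value, ex_mem, mem_wb):
    """Pick the value a forwarding MUX would feed to one ALU input.

    ex_mem and mem_wb are hypothetical pipeline-latch records:
    dicts with "dest" (register name or None) and "value".
    """
    if ex_mem["dest"] == src_reg:  # newest result: the instruction one ahead
        return ex_mem["value"]
    if mem_wb["dest"] == src_reg:  # result from the instruction two ahead
        return mem_wb["value"]
    return regfile_value           # no match: use the value read in ID


# SUB R4, R3, R5 is in EX while ADD R3, R2, R1 is in MEM,
# so the ADD result sits in the EX/MEM latch.
ex_mem = {"dest": "R3", "value": 12}
mem_wb = {"dest": None, "value": None}
print(select_alu_operand("R3", 7, ex_mem, mem_wb))  # 12 (forwarded, not the stale 7)
print(select_alu_operand("R5", 3, ex_mem, mem_wb))  # 3 (no match, register-file value)
```

The EX/MEM latch is checked first because it holds the more recent write when both latches name the same register.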

Consider the following code:

ADD R3, R2, R1
LOAD R4, 4(R3)

What would the timing be without bypass-signal paths/forwarding?

                  Time
Instructions      1     2     3     4     5     6     7     8     9     10    11    12
ADD R3, R2, R1    IF    ID    EX    MEM   WB
LOAD R4, 4(R3)          IF    stall stall ID    EX    MEM   WB

This assumes that R3 can be written in the first half of the WB stage and its new value read in the last half of the ID stage.
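
Under that split-cycle assumption, the stall count without forwarding follows from lining up the consumer's ID cycle with the producer's WB cycle. The sketch below is a small worked calculation; the function name is illustrative, and it assumes the instructions between producer and consumer do not stall themselves.

```python
def stalls_needed(distance):
    """Stall cycles for a consumer `distance` instructions after its producer (no forwarding)."""
    producer_wb = 5                 # producer fetched in cycle 1: IF ID EX MEM WB
    consumer_id = 1 + distance + 1  # fetched in cycle 1 + distance, decoded one cycle later
    return max(0, producer_wb - consumer_id)


for d in (1, 2, 3):
    print(d, stalls_needed(d))  # prints "1 2", then "2 1", then "3 0"
```

A distance of 1 reproduces the two stall cycles in the diagram above; a consumer three or more instructions later needs none.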

What would the timing be with bypass-signal paths?

                  Time
Instructions      1     2     3     4     5     6     7     8     9     10    11    12
ADD R3, R2, R1    IF    ID    EX    MEM   WB
LOAD R4, 4(R3)          IF    ID    EX    MEM   WB

Draw the bypass-signal paths needed for the above example.

How many cycles are needed to perform the following AL program without forwarding?

                  Time
Instructions      1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16
ADD R3, R2, R1    IF    ID    EX    MEM   WB
LOAD R4, 4(R3)          IF
SUB R5, R4, R3
STORE R5, 8(R6)
ADD R6, R5, R4

How many cycles are needed to perform the following AL program with forwarding?

                  Time
Instructions      1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16
ADD R3, R2, R1    IF    ID    EX    MEM   WB
LOAD R4, 4(R3)          IF
SUB R5, R4, R3
STORE R5, 8(R6)
ADD R6, R5, R4

Draw ALL the bypass-signal paths needed for the above example.
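
A hand-drawn timing diagram for the program above can be cross-checked with a small scheduling model. The sketch below is one possible model, not the official solution. It assumes in-order issue, a register file written in the first half of WB and read in the last half of ID, ALU results forwarded from the end of EX, and loaded values forwarded from the end of MEM. It also treats every source register, including the store's data register, as needed at the start of EX. The tuple encoding and the function name are illustrative.

```python
STAGES_AFTER_EX = 2  # MEM and WB follow EX


def schedule(program, forwarding):
    """Return the EX cycle of each instruction and the total cycle count.

    program: list of (op, dest, sources); dest is None for STORE.
    """
    available = {}  # register -> earliest cycle a consumer may be in EX
    ex_cycles = []
    prev_ex = 2     # so the first instruction executes in cycle 3 (IF=1, ID=2)
    for op, dest, sources in program:
        ex = prev_ex + 1  # in-order: at best one instruction enters EX per cycle
        for reg in sources:
            ex = max(ex, available.get(reg, 0))
        if dest is not None:
            if not forwarding:
                available[dest] = ex + 3  # consumer's ID must coincide with producer's WB
            elif op == "LOAD":
                available[dest] = ex + 2  # loaded value exists only after MEM
            else:
                available[dest] = ex + 1  # ALU result forwarded straight to the next EX
        ex_cycles.append(ex)
        prev_ex = ex
    total = ex_cycles[-1] + STAGES_AFTER_EX if ex_cycles else 0
    return ex_cycles, total


program = [
    ("ADD", "R3", ("R2", "R1")),
    ("LOAD", "R4", ("R3",)),
    ("SUB", "R5", ("R4", "R3")),
    ("STORE", None, ("R5", "R6")),
    ("ADD", "R6", ("R5", "R4")),
]
print(schedule(program, forwarding=False))  # ([3, 6, 9, 12, 13], 15)
print(schedule(program, forwarding=True))   # ([3, 4, 6, 7, 8], 10)
```

Under these assumptions the only stall that remains with forwarding is the single cycle between LOAD and the SUB that uses the loaded value.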
