Serial Execution
Pipelined Execution - goal is to complete one instruction per clock cycle
Advanced Architectures - multiple instructions completed per clock cycle
(Original RISC - one instruction completed per clock cycle)
- superpipelined (e.g., MIPS R4000) - split each stage into substages to create finer-grain stages
- superscalar (e.g., Intel Pentium, AMD Athlon) - multiple instructions in the same stage of execution in duplicate pipeline hardware
Figure 8.15 on page 288 - several instructions in the "execute" stage on different functional units
- very-long-instruction-word, VLIW (e.g., Intel Itanium) - compiler encodes multiple operations into a long instruction word at compile time, so the hardware can dispatch these operations at run-time to multiple functional units without performing dependency analysis
machine parallelism - the ability of the processor to take advantage of instruction-level parallelism. This is limited by:
- number of instructions that can be fetched and executed at the same time (# of parallel pipelines)
- ability of the processor to find independent instructions (the processor needs to look ahead of the current point of execution to locate independent instructions that can be brought into the pipeline and executed)
Limitations of superscalar - how much "instruction-level parallelism" (ILP) exists in the program. Independent instructions in the program can be executed in parallel, but not all can be.
1) true data dependency:
SUB R1, R2, R3 ; R1 ← R2 - R3
ADD R4, R1, R1 ; R4 ← R1 + R1
Cannot be avoided by rearranging code
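A true (RAW) dependency can be found mechanically: the later instruction reads the register the earlier one writes. A minimal sketch in Python, assuming each instruction is represented as a hypothetical `(dest, src1, src2)` tuple:

```python
# Hypothetical sketch: detecting a true (RAW) data dependency between two
# instructions, each given as a (dest, src1, src2) register tuple.
def raw_dependent(first, second):
    """True if `second` reads the register that `first` writes."""
    dest, *_ = first
    return dest in second[1:]

# SUB R1, R2, R3  followed by  ADD R4, R1, R1
sub = ("R1", "R2", "R3")
add = ("R4", "R1", "R1")
print(raw_dependent(sub, add))  # True: ADD must wait for SUB's result
```

Because the ADD genuinely consumes SUB's result, no reordering by compiler or hardware can remove the dependency; the ADD can only wait (or receive the value by forwarding).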
2) procedural dependency - cannot execute instructions after branch until branch executes
3) resource conflict / structural hazard - several instructions need same piece of hardware at the same time (e.g., memory, caches, buses, register file, functional units)
Three types of orderings:
1) order in which instructions are fetched
2) order in which instructions are executed (called instruction issuing)
3) order in which instructions update registers and memory
The more sophisticated the processor, the less it is bound by the strict relationship between these orderings. The only real constraint is that the results match that of sequential execution.
Some Categories:
a) In-order issue with In-order completion.
b) In-order issue with out-of-order completion
Problem: Output dependency / WAW dependency (Write-After-Write)
I1: R3 ← R3 op R5
I2: R4 ← R3 + 1
I3: R3 ← R5 + 1
I4: R7 ← R3 op R4 ; R3 value generated from I3 must be used
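To see why out-of-order completion breaks the output dependency, a toy simulation helps: I1 and I3 both write R3, and whichever completes last determines the final value. A sketch under assumed placeholder values:

```python
# Hypothetical sketch: why out-of-order completion violates an output
# (WAW) dependency. I1 and I3 both write R3; if slow I1 completes after
# I3, its stale result overwrites the newer value from I3.
def run(writes, completion_order):
    regs = {}
    for i in completion_order:
        dest, value = writes[i]
        regs[dest] = value
    return regs

writes = {"I1": ("R3", 10), "I3": ("R3", 99)}  # I3 is later in program order
in_order = run(writes, ["I1", "I3"])
out_of_order = run(writes, ["I3", "I1"])
print(in_order["R3"])      # 99 — correct: I3's value survives
print(out_of_order["R3"])  # 10 — wrong: I1 finished last and clobbered R3
```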
c) Out-of-Order Issue (decouple decode and execution) with Out-of-Order Completion
Instruction window provides a pool of possible instructions to be executed:
- filled after decode
- removed when issued if (1) fn. unit is available and (2) no conflicts or dependencies
Antidependency / WAR (Write-After-Read)
I1: R3 ← R3 op R5
I2: R4 ← R3 + 1
I3: R3 ← R5 + 1 ; If executed out-of-order, then I2 could get wrong value for R3
I4: R7 ← R3 op R4
Notice that I3 is just reusing R3 and does not need its value, so this is only a conflict over the use of a register name. Register Renaming is a solution to this problem: we allocate a different register dynamically at run-time.
I1: R3b ← R3a op R5a ; R3b and R3a are different registers
I2: R4b ← R3b + 1
I3: R3c ← R5a + 1
I4: R7b ← R3c op R4b
Example using Tomasulo's Algorithm
Tomasulo's Algorithm is an example of dynamic scheduling. In dynamic scheduling, the ID and WB stages of the five-stage pipeline are split into three stages to allow for out-of-order execution:
- Issue - decodes instructions and checks for structural hazards. Instructions are issued in-order through a FIFO queue to maintain correct data flow. If there is not a free reservation station of the appropriate type, the instruction queue stalls.
- Read operands - waits until no data hazards remain, then reads the operands
- Write result - send the result to the CDB to be grabbed by any waiting register or reservation stations
All instructions pass through the issue stage in order, but instructions stalling on operands can be bypassed by later instructions whose operands are available. RAW hazards are handled by delaying instructions in reservation stations until all their operands are available. WAR and WAW hazards are handled by renaming the registers in instructions with reservation-station numbers. Load and Store instructions to different memory addresses can be done in any order, but the relative order of a Store and any accesses to the same memory location must be maintained. One way to perform dynamic disambiguation of memory references is to perform the effective-address calculations of Loads and Stores in program order in the issue stage.
- Before issuing a Load from the instruction queue, make sure that its effective address does not match the address of any Store instruction in the Store buffers. If there is a match, stall the instruction queue until the corresponding Store completes. (Alternatively, the Store could forward the value to the corresponding Load)
- Before issuing a Store from the instruction queue, make sure that its effective address does not match the address of any Store or Load instructions in the Store or Load buffers.
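The Load-side address check above can be sketched as a simple scan of the pending-Store buffer. This is an illustrative model, not Tomasulo hardware; the buffer-entry fields are assumptions:

```python
# Hypothetical sketch of dynamic memory disambiguation for Loads: before a
# Load issues, its effective address is compared against every pending
# Store. A match means the Load must stall (or, alternatively, the Store
# could forward its value to the Load).
def can_issue_load(load_addr, store_buffer):
    """True only if no pending Store targets the same address."""
    return all(store["addr"] != load_addr for store in store_buffer)

store_buffer = [{"addr": 0x100, "value": 7}]  # one Store awaiting completion
print(can_issue_load(0x100, store_buffer))  # False: must wait for the Store
print(can_issue_load(0x200, store_buffer))  # True: different address, safe
```

The Store-side check is symmetric, except that it must scan both the Load and Store buffers, since a Store must not slip ahead of any earlier access to the same location.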
Studies have shown that superscalar machines:
- need register renaming to significantly benefit from duplicate functional units
- with renaming a larger window size is important
Branch prediction - usually used instead of delayed branching, since a superscalar machine would need to execute multiple instructions in the delay slot, causing problems related to instruction dependencies
Committing / Retiring Step - needed since instructions may complete out-of-order
Using branch prediction and speculative execution means some instructions' results need to be thrown out
Results are held in some temporary storage, and stores are performed in the order of sequential execution.
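In-order retirement from that temporary storage can be sketched with a simple reorder-buffer model: results complete in any order, but only the in-order prefix of finished, non-squashed entries updates architectural state. The entry fields here are assumptions for illustration:

```python
# Hypothetical sketch of in-order commit from a reorder buffer. Entries
# are held in program order as (dest, value, done, squashed). Only the
# leading run of finished, correctly-speculated entries may update the
# architectural registers; everything behind a stall or a squashed
# (mispredicted) entry is held back or thrown out.
def retire(rob):
    regs = {}
    for dest, value, done, squashed in rob:
        if not done:
            break             # head not finished: all younger entries wait
        if squashed:
            break             # wrong-path result: discard it and the rest
        regs[dest] = value    # safe to update architectural state
    return regs

rob = [("R1", 5, True, False),
       ("R2", 8, True, False),
       ("R3", 1, False, False),  # still executing
       ("R4", 2, True, False)]   # finished, but cannot retire past R3
print(retire(rob))  # {'R1': 5, 'R2': 8}
```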
Itanium Processor
Interesting Features:
- Uses explicitly parallel instruction computing (EPIC), derived from the very-long-instruction-word (VLIW) approach. In EPIC the compiler encodes multiple operations into a long instruction word, so the hardware can dispatch these operations at run-time to multiple functional units without performing dependency analysis. On the Itanium, instructions are read as three-instruction bundles
- Features to Enhance ILP: (1) Hiding memory latency by speculative loads, and (2) Improving branch handling by using predication
- Provides hardware support for efficient procedure calls and returns -- large number of registers with overlapping register windows