Advanced Architectures - multiple instructions completed per clock cycle

(Original RISC - one instruction completed per clock cycle)

  1. superpipelined (Figure 8.16)- split each stage into substages to create finer-grain stages
  2. superscalar (Figures 8.14 and 8.15)- multiple instructions in the same stage of execution in duplicate pipeline hardware
  3. very-long-instruction-word (VLIW) - compiler encodes multiple independent operations into a long instruction word, so the hardware can issue these operations to multiple functional units at run-time without doing its own dependency analysis

We looked at the Pentium, MIPS R4000, PowerPC, SPARC, and Vector processor examples from sections 8.6-8.7.

Limitations of superscalar - performance depends on how much "instruction-level parallelism" (ILP) exists in the program. Independent instructions can be executed in parallel, but not all instructions are independent.

1) true data dependency:

     SUB R1, R2, R3 ; R1 ← R2 - R3
     ADD R4, R1, R1 ; R4 ← R1 + R1

   ADD needs the R1 value produced by SUB. Cannot be avoided by rearranging code.

2) procedural dependency - cannot execute instructions after branch until branch executes

3) resource conflict / structural hazard - several instructions need same piece of hardware at the same time (e.g., memory, caches, buses, register file, functional units)
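
As a rough illustration of the true data dependency in item 1, the check below tests whether a later instruction reads a register that an earlier instruction writes. The (destination, sources) tuple representation and the name has_true_dependency are assumptions made for this sketch; they are not from the textbook.

```python
def has_true_dependency(earlier, later):
    """Return True if `later` reads a register that `earlier` writes (RAW).

    Each instruction is a hypothetical (dest, sources) tuple.
    """
    dest, _ = earlier
    _, sources = later
    return dest in sources


sub = ("R1", ["R2", "R3"])  # SUB R1, R2, R3 ; R1 <- R2 - R3
add = ("R4", ["R1", "R1"])  # ADD R4, R1, R1 ; R4 <- R1 + R1

print(has_true_dependency(sub, add))  # ADD reads R1, which SUB writes
print(has_true_dependency(add, sub))  # SUB does not read R4
```

A superscalar processor has to make this kind of comparison in hardware for every pair of instructions it wants to issue in the same cycle.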

machine parallelism - the ability of the processor to take advantage of instruction-level parallelism. This is limited by:

Three types of orderings:

  1) order in which instructions are fetched
  2) order in which instructions are executed (called instruction issuing)
  3) order in which instructions update registers and memory

The more sophisticated the processor, the less it is bound by a strict relationship between these orderings. The only real constraint is that the results match those of sequential execution.

Some Categories:

a) In-order issue with in-order completion

b) In-order issue with out-of-order completion

Problem: Output dependency / WAW dependency (Write-After-Write)

     I1: R3 ← R3 op R5
     I2: R4 ← R3 + 1
     I3: R3 ← R5 + 1
     I4: R7 ← R3 op R4 ; R3 value generated from I3 must be used

c) Out-of-order issue (decouple decode and execution) with out-of-order completion

Instruction window provides a pool of possible instructions to be executed.

Problem: Antidependency / WAR dependency (Write-After-Read)

     I1: R3 ← R3 op R5
     I2: R4 ← R3 + 1
     I3: R3 ← R5 + 1 ; if I3 completes before I2 reads R3, then I2 could get the wrong value for R3
     I4: R7 ← R3 op R4

Notice that I3 is just reusing R3 and does not need its old value, so this is only a conflict over the use of a register.

Register Renaming is a solution to this problem: we allocate a different register dynamically at run-time.

     I1: R3b ← R3a op R5a ; R3b and R3a are different registers
     I2: R4b ← R3b + 1
     I3: R3c ← R5a + 1
     I4: R7b ← R3c op R4b

Example using Tomasulo's Algorithm

Studies have shown that superscalar machines:

Branch prediction - usually used instead of delayed branching, since multiple instructions would need to execute in the delay slot, causing problems related to instruction dependencies.

Committing / Retiring Step - needed since instructions may complete out-of-order. Using branch prediction and speculative execution means some instructions' results need to be thrown out. Results are held in some temporary storage, and stores are performed in the order of sequential execution.
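
The register renaming example (I1-I4) from earlier in this section can be reproduced with a short sketch. It is a simplified software model, not how any particular processor implements renaming: the function name rename, the (destination, sources) tuple format, and the letter suffixes standing in for physical registers are all assumptions for illustration. Immediate constants such as "+ 1" are left out of the source lists.

```python
from collections import defaultdict
import string


def rename(instructions):
    """Give every register write a fresh name; reads use the latest name.

    `instructions` is a list of hypothetical (dest, sources) tuples.
    Suffix 'a' is the initial value of a register, 'b' the first new
    write, and so on (so at most 25 writes per register in this sketch).
    """
    version = defaultdict(int)  # number of writes seen so far, per register

    def current(reg):
        return reg + string.ascii_lowercase[version[reg]]

    renamed = []
    for dest, sources in instructions:
        new_sources = [current(s) for s in sources]  # read latest versions first
        version[dest] += 1                           # then allocate a fresh register
        renamed.append((current(dest), new_sources))
    return renamed


program = [
    ("R3", ["R3", "R5"]),  # I1: R3 <- R3 op R5
    ("R4", ["R3"]),        # I2: R4 <- R3 + 1
    ("R3", ["R5"]),        # I3: R3 <- R5 + 1
    ("R7", ["R3", "R4"]),  # I4: R7 <- R3 op R4
]

for dest, sources in rename(program):
    print(dest, "<-", ", ".join(sources))
```

After renaming, I3 writes R3c while I2 reads R3b, so the WAR conflict between I2 and I3 and the WAW conflict between I1 and I3 are gone. Only the true data dependencies (I1 to I2, and I2 and I3 to I4) remain.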
Itanium Processor
Interesting Features: