Computer Systems Test 1

Question 1. (40 points)

a) For the six-stage pipeline of the text (see above), complete the following timing diagram assuming NO by-pass signal paths.

Instructions           Time
                        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
I1: ADD R1, R3, R4     FI DI CO FO EI WO
I2: STORE R4, 16(R1)      FI             DI CO FO EI WO
I3: SUB R1, R3, R4                       FI DI CO FO EI WO
I4: MUL R4, R1, R3                          FI DI CO       FO EI WO
I5: ADD R1, R5, R6                             FI DI       CO FO EI WO
I6: LOAD R4, 12(R1)                               FI                   DI CO FO EI WO

b) Complete the following timing diagram assuming by-pass signal paths.

Instructions           Time
                        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
I1: ADD R1, R3, R4     FI DI CO FO EI WO
I2: STORE R4, 16(R1)      FI DI       CO FO EI WO
I3: SUB R1, R3, R4           FI       DI CO FO EI WO
I4: MUL R4, R1, R3                    FI DI CO FO EI WO
I5: ADD R1, R5, R6                       FI DI CO FO EI WO
I6: LOAD R4, 12(R1)                         FI DI       CO FO EI WO

c) In the diagram at the top of the page add all by-pass signal paths used in part (b). (in red above)

d) For the above program, indicate all pairs of instructions that have

i) write-read/read-after-write (RAW)/"true" data dependencies -

I1 and I2 on R1; I3 and I4 on R1; I5 and I6 on R1

ii) output/write-write/write-after-write dependencies -

I1 and I3 on R1; I3 and I5 on R1; I4 and I6 on R4

iii) antidependencies/read-write/write-after-read dependencies -

I2 and I3 on R1; I2 and I4 on R4; I3 and I4 on R4; I4 and I5 on R1
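
The RAW pairs in part (i) can be checked mechanically. The following is a minimal Python sketch, not part of the original test: it assumes that each arithmetic or LOAD instruction writes its first operand and reads the rest, and that STORE only reads its registers. The names PROGRAM and raw_dependencies are hypothetical.

```python
# Each entry: (label, destination register or None, source registers).
# Assumption: STORE writes memory only, so it has no destination register.
PROGRAM = [
    ("I1", "R1", ["R3", "R4"]),   # ADD R1, R3, R4
    ("I2", None, ["R4", "R1"]),   # STORE R4, 16(R1)
    ("I3", "R1", ["R3", "R4"]),   # SUB R1, R3, R4
    ("I4", "R4", ["R1", "R3"]),   # MUL R4, R1, R3
    ("I5", "R1", ["R5", "R6"]),   # ADD R1, R5, R6
    ("I6", "R4", ["R1"]),         # LOAD R4, 12(R1)
]


def raw_dependencies(program):
    """Return (producer, consumer, register) for each read of the latest write."""
    last_writer = {}              # register -> label of its most recent writer
    found = []
    for label, dest, sources in program:
        for reg in sources:
            if reg in last_writer:
                found.append((last_writer[reg], label, reg))
        if dest is not None:
            last_writer[dest] = label
    return found


for producer, consumer, reg in raw_dependencies(PROGRAM):
    print(f"{producer} and {consumer} on {reg}")
# I1 and I2 on R1
# I3 and I4 on R1
# I5 and I6 on R1
```

Tracking only the most recent writer of each register is what restricts the output to true producer/consumer pairs.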

Question 2. (35 points) Consider the following sequential search algorithm that searches an array for a specified "target" value. The index at which the "target" value is found is returned. If the "target" value is not in the array, then -1 is returned.

SequentialSearch (integer numberOfElements, integer target, integer array numbers[]) returns an integer

integer test;

for test = 1 to numberOfElements do (conditional branch) (PREDICT_NOT_TAKEN)

if numbers[test] = target then (conditional branch) (PREDICT_TAKEN)

return test;

end if

end for (unconditional branch)

return -1;

end SequentialSearch
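
For reference, the pseudocode above can be written as a short runnable Python sketch. It is an illustration only: the function name sequential_search is hypothetical, and the returned position is kept 1-based to match the "for test = 1 to numberOfElements" loop, even though Python lists are 0-based.

```python
def sequential_search(number_of_elements, target, numbers):
    """Return the 1-based position of target in numbers, or -1 if absent."""
    for test in range(1, number_of_elements + 1):   # for test = 1 to numberOfElements
        if numbers[test - 1] == target:             # shift to Python's 0-based index
            return test
    return -1


print(sequential_search(4, 7, [3, 9, 7, 1]))   # 3
print(sequential_search(4, 5, [3, 9, 7, 1]))   # -1
```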

a) Where in the code would unconditional branches be used and where would conditional branches be used? (see above code)

b) If the compiler could predict by opcode for the conditional branches (i.e., select whether to use machine language statements like: "BRANCH_LE_PREDICT_NOT_TAKEN" or "BRANCH_LE_PREDICT_TAKEN"), then which conditional branches would be "PREDICT_NOT_TAKEN" and which would be "PREDICT_TAKEN"? (see above code)

c) Assumptions:

Under the above assumptions, answer the following questions:

i) If static predict-never-taken is used by the hardware, then what will be the total branch penalty (# cycles wasted) for the algorithm? (Here assume NO branch-history table) For partial credit, explain your answer.

ii) If a branch-history table with one history bit per entry is used, then what will be the total branch penalty (# cycles wasted) for the algorithm? (Assume predict-not taken is used if there is no match in the branch-history table) For partial credit, explain your answer.

iii) Explain how a branch-history table with two history bits per entry (i.e., two wrong predictions needed before changing the prediction) would decrease the total branch penalty (# cycles wasted) for the algorithm? (Assume predict-not taken is used if there is no match in the branch-history table)

In this algorithm the two-bit BHT would not help much over the one-bit BHT since there are no nested loops.

The only possible improvement would occur if the SequentialSearch algorithm is called multiple times while the algorithm's branches are still in the BHT. In this case, during the 1st iteration of the loop, the two-bit BHT will predict TAKEN for the "if" while the one-bit BHT will predict NOT_TAKEN. Since TAKEN is more likely to be correct, the two-bit BHT will save 4 cycles.
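
The repeated-call case can be illustrated with a small Python simulation of the "if" branch. This is a sketch under stated assumptions, not the test's official solution: the branch is assumed to be taken on every iteration except the one where the target is found, the target is assumed to sit in the last array position, the array length of 5 is hypothetical, and the 4-cycle misprediction penalty is inferred from the "save 4 cycles" figure above. The two-bit scheme is modelled exactly as described in the question: the prediction flips only after two wrong predictions in a row.

```python
def one_bit_misses(outcomes):
    """Return a list of booleans: True where a one-bit entry mispredicts."""
    prediction = False                # no BHT entry yet: predict not taken
    misses = []
    for taken in outcomes:
        misses.append(prediction != taken)
        prediction = taken            # one bit: remember only the last outcome
    return misses


def two_bit_misses(outcomes):
    """Same, but the prediction changes only after two consecutive misses."""
    prediction = False                # no BHT entry yet: predict not taken
    wrong_in_a_row = 0
    misses = []
    for taken in outcomes:
        miss = prediction != taken
        misses.append(miss)
        if miss:
            wrong_in_a_row += 1
            if wrong_in_a_row == 2:   # second wrong prediction: flip
                prediction = taken
                wrong_in_a_row = 0
        else:
            wrong_in_a_row = 0
    return misses


ELEMENTS = 5                          # hypothetical array length
PENALTY = 4                           # cycles per misprediction (assumed)

one_call = [True] * (ELEMENTS - 1) + [False]   # taken until the target is found
outcomes = one_call * 2                        # two back-to-back calls

second_call = slice(ELEMENTS, 2 * ELEMENTS)
one_bit = sum(one_bit_misses(outcomes)[second_call])
two_bit = sum(two_bit_misses(outcomes)[second_call])
print(one_bit, two_bit, (one_bit - two_bit) * PENALTY)   # 2 1 4
```

In the second call the one-bit entry mispredicts both the first and the last iteration, while the two-bit entry mispredicts only the last one, which is the single saved misprediction described above.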

Question 3. (10 points) The Intel x86 instruction set is a CISC. In general, describe how Intel has been able to make use of RISC ideas (e.g., pipelining and superscalar) in the Pentium architecture for the CISC x86 instruction set.

Each CISC instruction of the program is fetched and then dynamically translated to one or more micro-operations which are RISC instructions. The Pentium processor then uses pipelining and superscalar techniques to execute the RISC micro-operations.

Question 4. (15 points) Smith '95 studied the relationship between out-of-order issue, duplication of resources, and register renaming on a MIPS R2000-like architecture. The results are shown in the figure:

Notes:

* the speedup is relative to the scalar machine

* "window size" is the instruction window size, which dictates the amount of lookahead

The different machines considered were:

1) base machine - no duplicate functional units, but can issue out-of-order

2) + ld/st: duplicates load / store functional unit that access data cache

3) + alu: duplicates ALU

4) + both: duplicates both load/store and ALU

(finally the questions)

Why is register renaming with a large window size needed to significantly benefit from duplicate functional units?

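Register renaming removes the antidependencies and output dependencies that would otherwise serialize instructions, and a large window gives the issue logic enough lookahead to find independent instructions, so the additional functional units can be utilized.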
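
As an illustration, the effect of renaming can be sketched in Python on the six-instruction program from Question 1. This is a simplified model, not the mechanism of any particular processor: every destination gets a fresh name (P1, P2, ...), STORE is assumed to have no destination register, and the function names rename and count_name_dependencies are hypothetical.

```python
# (destination register or None, source registers), in program order.
PROGRAM = [
    ("R1", ["R3", "R4"]),   # I1: ADD R1, R3, R4
    (None, ["R4", "R1"]),   # I2: STORE R4, 16(R1)  (assumed: no register written)
    ("R1", ["R3", "R4"]),   # I3: SUB R1, R3, R4
    ("R4", ["R1", "R3"]),   # I4: MUL R4, R1, R3
    ("R1", ["R5", "R6"]),   # I5: ADD R1, R5, R6
    ("R4", ["R1"]),         # I6: LOAD R4, 12(R1)
]


def rename(program):
    """Give every destination a fresh name; sources follow the latest mapping."""
    mapping = {}            # architectural register -> current renamed register
    renamed = []
    next_id = 1
    for dest, sources in program:
        new_sources = [mapping.get(s, s) for s in sources]   # keeps RAW links
        new_dest = None
        if dest is not None:
            new_dest = f"P{next_id}"
            next_id += 1
            mapping[dest] = new_dest
        renamed.append((new_dest, new_sources))
    return renamed


def count_name_dependencies(program):
    """Count (WAW, WAR) over every earlier/later pair, not only the nearest."""
    waw = war = 0
    for j, (dest_j, _) in enumerate(program):
        if dest_j is None:
            continue
        for dest_i, sources_i in program[:j]:
            if dest_i == dest_j:
                waw += 1    # both instructions write the same name
            if dest_j in sources_i:
                war += 1    # earlier instruction reads what the later one writes
    return waw, war


renamed = rename(PROGRAM)
print(count_name_dependencies(PROGRAM))   # name dependencies before renaming
print(count_name_dependencies(renamed))   # (0, 0)
```

After renaming, only the true dependencies remain (for example, I4 now reads P2, the value produced by I3), so instructions such as I3 and I5 no longer have to wait on an earlier use of R1.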