You are to assume the same 6-stage pipeline as discussed in class (see http://www.cs.uni.edu/~fienup/cs142f03/lectures/lec6.htm) when answering these questions.

Assume that the first register in an arithmetic operation is the destination register, e.g., in "ADD R3, R2, R1" register R3 receives the result of adding registers R2 and R1.

1. What would the timing be **without** bypass-signal paths/forwarding (use "stalls" to solve the data hazard)?

(This code might require more than 15 cycles)

Time (clock cycle) |

Instructions | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |

ADD R3, R2, R1 | FI | DI | CO | FO | EI | WO |

SUB R4, R3, R6 | FI | DI | CO | FO | EI | WO |

LOAD R8, 4(R4) | FI | DI | CO | FO | EI | WO |

STORE R4, 16(R8) | FI | DI | CO | FO | EI | WO |

SUB R6, R4, R8 | FI | DI | CO | FO | EI | WO |

ADD R5, R3, R6 | FI | DI | CO | FO | EI | WO |

2. What would the timing be **with** bypass-signal paths? (This code might require more than 15 cycles)

Time (clock cycle) |

Instructions | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 |

ADD R3, R2, R1 | FI | DI | CO | FO | EI | WO |

SUB R4, R3, R6 | FI | DI | CO | FO | EI | WO |

LOAD R8, 4(R4) | FI | DI | CO | FO | EI | WO |

STORE R4, 16(R8) | FI | DI | CO | FO | EI | WO |

SUB R6, R4, R8 | FI | DI | CO | FO | EI | WO |

ADD R5, R3, R6 | FI | DI | CO | FO | EI | WO |

3. Draw ALL the bypass-signal paths needed for the above example.
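
One way to start on questions 1 to 3 is to list the read-after-write dependences in the six-instruction sequence, since every stall or bypass path serves one of them. The Python sketch below is only an illustration: the tuple encoding, the names `PROGRAM` and `raw_dependencies`, and the treatment of LOAD as writing its first register and STORE as only reading its registers are assumptions, not part of the assignment.

```python
# Each entry: (instruction text, destination register or None, source registers).
# Assumption: LOAD writes its first register; STORE only reads its registers.
PROGRAM = [
    ("ADD R3, R2, R1",   "R3", ["R2", "R1"]),
    ("SUB R4, R3, R6",   "R4", ["R3", "R6"]),
    ("LOAD R8, 4(R4)",   "R8", ["R4"]),
    ("STORE R4, 16(R8)", None, ["R4", "R8"]),
    ("SUB R6, R4, R8",   "R6", ["R4", "R8"]),
    ("ADD R5, R3, R6",   "R5", ["R3", "R6"]),
]


def raw_dependencies(program):
    """Return (producer_index, consumer_index, register) for each read-after-write pair."""
    last_writer = {}  # register -> index of the most recent instruction that wrote it
    deps = []
    for i, (_, dest, sources) in enumerate(program):
        # Check the sources first so an instruction never depends on itself.
        for reg in sources:
            if reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        if dest is not None:
            last_writer[dest] = i
    return deps


for producer, consumer, reg in raw_dependencies(PROGRAM):
    print(f"{PROGRAM[consumer][0]:<17} reads {reg}, written by {PROGRAM[producer][0]}")
```

Not every pair listed needs a stall or a bypass path. Pairs that are far enough apart resolve on their own, and the exact distance depends on the stage conventions from the lecture.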

4. Consider the following insertion sort algorithm, which sorts the array numbers:

InsertionSort(numbers - address to integer array, length - integer)
    integer firstUnsortedIndex, testIndex, elementToInsert;
    for firstUnsortedIndex = 1 to (length-1) do    (cond. branch: predict-not-taken)
        testIndex = firstUnsortedIndex - 1;
        elementToInsert = numbers[firstUnsortedIndex];
        while (testIndex >= 0) AND (numbers[testIndex] > elementToInsert) do    (two cond. branches: 1st predict-not-taken, 2nd hard to predict)
            numbers[testIndex + 1] = numbers[testIndex];
            testIndex = testIndex - 1;
        end while    (unconditional branch)
        numbers[testIndex + 1] = elementToInsert;
    end for    (unconditional branch)
end InsertionSort
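
For experimenting with parts (a) to (c), the pseudocode can be turned into runnable Python with one counter per branch site. This is a sketch under an assumed compilation: each loop test becomes a conditional branch that is taken when the loop exits, and each loop end becomes an unconditional jump back to the test. The function name `insertion_sort` and the counter labels are made up for this illustration.

```python
from collections import Counter


def insertion_sort(numbers):
    """Sort `numbers` in place and return a tally of branch outcomes per branch site."""
    counts = Counter()
    length = len(numbers)
    first_unsorted = 1
    while True:
        # "for" conditional branch: taken only when the loop is finished.
        if first_unsorted > length - 1:
            counts["for cond taken"] += 1
            break
        counts["for cond not taken"] += 1
        test_index = first_unsorted - 1
        element_to_insert = numbers[first_unsorted]
        while True:
            # 1st "while" conditional branch: taken when testIndex < 0.
            if test_index < 0:
                counts["while cond 1 taken"] += 1
                break
            counts["while cond 1 not taken"] += 1
            # 2nd "while" conditional branch: taken when the insertion point is found.
            if numbers[test_index] <= element_to_insert:
                counts["while cond 2 taken"] += 1
                break
            counts["while cond 2 not taken"] += 1
            numbers[test_index + 1] = numbers[test_index]
            test_index -= 1
            counts["end while jump"] += 1  # unconditional branch back to the test
        numbers[test_index + 1] = element_to_insert
        first_unsorted += 1
        counts["end for jump"] += 1  # unconditional branch back to the test
    return counts


# Descending input of 100 numbers, as assumed in part (c).
counts = insertion_sort(list(range(100, 0, -1)))
for label in sorted(counts):
    print(label, counts[label])
```

The tallies show how often each branch site executes and how often it is taken, which is the raw material for the penalty counts in part (c).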

a) Where in the code would unconditional branches be used and where would conditional branches be used?

b) If the compiler could predict by opcode for the conditional branches (i.e., select whether to use machine language statements like: "BRANCH_LE_PREDICT_NOT_TAKEN" or "BRANCH_LE_PREDICT_TAKEN"), then which conditional branches would be "PREDICT_NOT_TAKEN" and which would be "PREDICT_TAKEN"?

c) Assumptions:

- length = 100 (n = 100), and the numbers are in descending order before the insertion sort algorithm is called
- the six-stage pipeline of the text
- the outcome of conditional branches is known at the end of the EI stage
- target addresses of all branches are known at the end of the CO stage
- ignore any data hazards
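
The per-branch penalties used below follow from these assumptions: when a branch is resolved at the end of some stage, every instruction fetched behind it in the meantime is wasted if the fetch guess was wrong. A minimal Python sketch of that count, where the helper name `branch_penalty` is hypothetical and one instruction is assumed to enter the pipeline per cycle:

```python
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]  # stage order of the six-stage pipeline


def branch_penalty(resolved_after):
    """Cycles wasted when the needed branch information is known at the end of `resolved_after`.

    While the branch moves through stages 1..k, k-1 younger instructions are
    fetched behind it; all of them are discarded on a wrong guess.
    """
    return STAGES.index(resolved_after)


print(branch_penalty("EI"))  # mispredicted conditional branch: 4 cycles
print(branch_penalty("CO"))  # mispredicted unconditional branch: 2 cycles
```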

i) If fixed predict-never-taken is used by the hardware, then what will be the total branch penalty (# cycles wasted) for the algorithm? (Here assume NO branch-history table) For partial credit, explain your answer.

Branch | Penalty | Comment |

for conditional | 4 | penalty only when dropping out of "for" |

while conditional (testIndex >= 0) | 4 * 99 | penalty each time we drop out of the "while" |

while conditional (numbers[testIndex] > elementToInsert) | 0 | with descending data the 2nd condition is always true |

end while unconditional | 2 * (1+2+3+...+99) | penalty each time we jump back to the start of the loop; on the i-th pass of the "for" the loop body runs i times |

end for unconditional | 2 * 99 | penalty each time we jump back to the start of the loop |

Total = 10,498 cycles |

ii) If a branch-history table with one history bit per entry is used, then what will be the total branch penalty (# cycles wasted) for the algorithm? (Assume predict-not-taken is used if there is no match in the branch-history table.) For partial credit, explain your answer.

Branch | Penalty | Comment |

for conditional | 4 | penalty only when dropping out of "for" |

while conditional (testIndex >= 0) | 4 + (4+4) * 98 | one penalty for dropping out of the first "while"; in every later invocation of the "while" the first iteration is wrongly predicted TAKEN and the exit is wrongly predicted NOT TAKEN |

while conditional (numbers[testIndex] > elementToInsert) | 0 | with descending data the 2nd condition is always predicted correctly since we never drop out of the "while" because of this condition |

end while unconditional | 2 | penalty only occurs on very first execution when it is not in the BHT |

end for unconditional | 2 | penalty only occurs on very first execution when it is not in the BHT |

Total = 796 cycles |

iii) If a branch-history table with two history bits per entry is used as in Figure 8.12, then what will be the total branch penalty (# cycles wasted) for the algorithm? (Assume predict-not-taken is used if there is no match in the branch-history table.) For partial credit, explain your answer.

Branch | Penalty | Comment |

for conditional | 4 | penalty only when dropping out of "for" |

while conditional (testIndex >= 0) | 4 * 99 | one penalty for dropping out of each "while". Two consecutive mispredictions are needed to change the prediction, so dropping out of the ith invocation does not change the prediction for the first iteration of the (i+1)st invocation |

while conditional (numbers[testIndex] > elementToInsert) | 0 | with descending data the 2nd condition is always predicted correctly since we never drop out of the "while" because of this condition |

end while unconditional | 2 | penalty only occurs on very first execution when it is not in the BHT |

end for unconditional | 2 | penalty only occurs on very first execution when it is not in the BHT |

Total = 404 cycles |
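
The three totals in part (c) can be cross-checked with a small model that replays the branch outcomes for descending input through each prediction scheme. This is a sketch under stated assumptions, not the method from the text: the trace shape assumes the "while" body runs i times on the i-th pass of the "for"; a new table entry is assumed to start saturated in the direction of the branch's first outcome; and a plain two-bit saturating counter stands in for the state diagram of Figure 8.12. The names `branch_trace` and `total_penalty` are hypothetical.

```python
COND_PENALTY = 4    # conditional outcome known at the end of EI
UNCOND_PENALTY = 2  # branch target known at the end of CO


def branch_trace(n=100):
    """Yield (site, is_conditional, taken) for insertion sort on descending input of length n."""
    for i in range(1, n):
        yield ("for", True, False)
        for _ in range(i):  # assumed: the body runs i times on pass i
            yield ("while1", True, False)
            yield ("while2", True, False)
            yield ("endwhile", False, True)
        yield ("while1", True, True)  # testIndex < 0: leave the "while"
        yield ("endfor", False, True)
    yield ("for", True, True)  # leave the "for"


def total_penalty(history_bits, n=100):
    """Cycles wasted; history_bits = 0 means fixed predict-never-taken (no table)."""
    top = (1 << history_bits) - 1  # highest counter value (0 when there is no table)
    table = {}                     # branch site -> saturating counter
    wasted = 0
    for site, conditional, taken in branch_trace(n):
        if site in table:
            predicted_taken = table[site] > top // 2
        else:
            predicted_taken = False  # no match in the table: predict not taken
        if predicted_taken != taken:
            wasted += COND_PENALTY if conditional else UNCOND_PENALTY
        if history_bits:
            if site not in table:
                table[site] = top if taken else 0  # assumed initial state
            elif taken:
                table[site] = min(top, table[site] + 1)
            else:
                table[site] = max(0, table[site] - 1)
    return wasted


for bits in (0, 1, 2):
    print(f"{bits} history bits: {total_penalty(bits)} cycles wasted")
```

Comparing the printed values with the totals in the tables for i), ii) and iii) is a quick way to catch a counting slip in any single row.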