
Computer Systems Test 1

Question 1. (20 points)

a) Why are the complex instructions of CISC (Complex Instruction Set Computer) machines difficult to pipeline?

b) Why are RISC machines usually Load & Store machines (i.e., only Load and Store instructions access memory)?

c) One characteristic of RISC machines is to have either large register files or large on-chip caches. Answer the following questions related to this characteristic.

i) One problem with using large register files is the increased number of bits needed to specify a register operand in the machine language instructions. How does SPARC avoid this problem?

ii) How does the overlap of SPARC register windows improve program performance?

Question 2. (35 points)

The whole question refers to a pipelined RISC machine with five stages:

F, fetch - fetch the instruction from memory
D, decode - determine the type of instruction and read any necessary register values
E, execute - perform ALU operation or memory address calculation for LOAD or STORE instructions
M, memory - access memory on LOAD or STORE instruction
W, write - write register values
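As a reference for the diagram format used in the parts below, the following is a minimal sketch of how instructions move through the five stages when nothing stalls. It assumes one instruction enters F per cycle and that the instructions are mutually independent; the three-instruction program and the function names `ideal_schedule` and `render` are hypothetical and are not part of the test.

```python
STAGES = ["F", "D", "E", "M", "W"]


def ideal_schedule(instructions):
    """Map each instruction to {cycle: stage}, assuming no stalls of any kind."""
    schedule = []
    for i, text in enumerate(instructions):
        # Instruction i enters F in cycle i + 1 and advances one stage per cycle.
        row = {i + 1 + s: stage for s, stage in enumerate(STAGES)}
        schedule.append((text, row))
    return schedule


def render(schedule):
    """Format the schedule as a text timing diagram, one row per instruction."""
    last_cycle = max(max(row) for _, row in schedule)
    width = max(len(text) for text, _ in schedule)
    cycles = range(1, last_cycle + 1)
    lines = [" " * width + "".join(f"{c:>3}" for c in cycles)]
    for text, row in schedule:
        cells = "".join(f"{row.get(c, ''):>3}" for c in cycles)
        lines.append(text.ljust(width) + cells)
    return "\n".join(lines)


# Hypothetical, mutually independent instructions (not the exam's program).
program = ["ADD R1, R2, R3", "ADD R4, R5, R6", "ADD R7, R8, R9"]
schedule = ideal_schedule(program)
print(render(schedule))
```

Data dependencies between instructions, as in the programs of parts a) and b), delay later rows relative to this ideal staircase.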

a) Complete the following timing diagram assuming NO by-pass signal paths.

Without by-pass signal paths

Time	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20
ADD R1, R3, R4	F	D	E	M	W
ADD R2, R4, R5
ADD R3, R2, R1
LOAD R2, 12(R3)
STORE R2, 16(R2)

b) Complete the following timing diagram assuming by-pass signal paths as shown above.

With by-pass signal paths

Time	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20
ADD R1, R3, R4	F	D	E	M	W
ADD R2, R4, R5
ADD R3, R2, R1
LOAD R2, 12(R3)
STORE R2, 16(R2)

c) If the outcome of a conditional branch instruction is known at the end of the Execute stage, then what would be the branch penalty for a taken conditional branch? (Assume no delayed branching)

d) If the outcome of a conditional branch instruction is known at the end of the Execute stage, then what would be the branch penalty for a taken conditional branch assuming a one-slot delayed branch?

Question 3. (20 points)

a) If the above "for-loop" is recompiled on a machine that allows the compiler to select different opcodes to statically predict the branch outcome, then should the compiler select BGT_LIKELY or BGT_UNLIKELY for the conditional branch? (Justify your answer)

b) If the above "for-loop" is recompiled on a machine with a branch-history table to dynamically predict the branch outcome, then how does the branch-history table reduce the branch penalty?

c) Would a machine benefit from both (1) allowing the compiler to select different opcodes to statically predict the branch outcome, AND (2) having a branch-history table? (Justify your answer)
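For readers unfamiliar with the branch-history table mentioned in part b), the following is a minimal sketch of one common entry design, a 2-bit saturating counter. The test does not say which scheme the machine uses, so the counter encoding, the function name `count_mispredictions`, and the outcome sequence (a loop-closing branch taken 9 times and then not taken) are all assumptions made for illustration; the "for-loop" the question refers to is not reproduced in this text.

```python
def count_mispredictions(outcomes, counter=0):
    """Run one 2-bit saturating counter over a single branch's outcomes.

    Counter values 0-1 predict "not taken"; values 2-3 predict "taken".
    Returns the number of wrong predictions.
    """
    wrong = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            wrong += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return wrong


# Hypothetical loop-closing branch: taken 9 times, then falls through once.
one_pass = [True] * 9 + [False]
first_run = count_mispredictions(one_pass)
three_runs = count_mispredictions(one_pass * 3)
print(first_run, three_runs)
```

Under these assumptions the counter mispredicts only while warming up and at each loop exit. The branch penalty is paid on those iterations only, not on every taken branch.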

Question 4. (25 points)

Assume the above superscalar processor organization:

can issue two instructions per cycle if no resource conflicts or data dependencies
has essentially two pipelines each with their own fetch and decode units
has a lookahead window for out-of-order instruction issue that's used if an instruction is stalled in either f1 or f2
has four functional units that are shared dynamically between the pipelines
has two store units that are shared dynamically between the pipelines

a) For the program below, indicate all pairs of instructions that have

i) write-read/read-after-write (RAW)/"true" data dependencies -

ii) output/write-write/write-after-write dependencies -

iii) antidependencies/read-write/write-after-read dependencies -

b) For out-of-order issue with out-of-order execution, show the stages that each instruction is in for the following program. (Note: "Add R1, R2" performs R1 := R1 + R2) (Assume NO by-pass signal paths, i.e., the decode stage for a dependent instruction cannot occur until after the write-back stage!)

	Instructions	Cycle
		0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19
1	Add R1, R2	f1	d1	a1	a2	s1
2	Mul R2, R4
3	Or R3, R2
4	Add R4, R5
5	Or R7, R9
6	Or R3, R6
7	Add R7, R1
8	Load R3, (R1)
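The three dependency classes named in part a) can be made concrete with a small sketch that compares the registers each instruction reads and writes. It follows the note in part b) that "Op Rd, Rs" computes Rd := Rd op Rs, so Rd is both read and written. The three-instruction program, the labels I1 to I3, and the function name `dependencies` are hypothetical and deliberately differ from the program above.

```python
from itertools import combinations


def dependencies(program):
    """Classify register dependencies between every earlier/later pair.

    program: list of (label, writes, reads), where writes and reads are sets
    of register names. This simple version does not check whether an
    instruction in between overwrites the register first.
    """
    found = []
    for (a, a_writes, a_reads), (b, b_writes, b_reads) in combinations(program, 2):
        if a_writes & b_reads:
            found.append(("RAW", a, b))  # later reads what earlier wrote
        if a_writes & b_writes:
            found.append(("WAW", a, b))  # both write the same register
        if a_reads & b_writes:
            found.append(("WAR", a, b))  # later overwrites what earlier read
    return found


# Hypothetical program, not the exam's:
#   I1: Add R1, R2    I2: Sub R3, R1    I3: Or R2, R4
program = [
    ("I1", {"R1"}, {"R1", "R2"}),
    ("I2", {"R3"}, {"R3", "R1"}),
    ("I3", {"R2"}, {"R2", "R4"}),
]
deps = dependencies(program)
print(deps)
```

Here I2 reads the R1 that I1 writes (RAW), and I3 writes the R2 that I1 reads (WAR); no pair in this small example writes the same register, so there is no WAW entry.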
