CS 3853 Architecture Notes on Chapter 3

CS 3853 Computer Architecture Notes on Chapter 3 Section 2

Read Section 3.2

3.2: Compiler Techniques for Exposing ILP

Keep pipeline full: need sequences of unrelated instructions.
Dependent instructions must be separated by a number of cycles equal to the pipeline latency of the instruction that produces the result.
The section concentrates on using loop unrolling.

Consider the following code, where x is an array of double-precision floating-point values, s is a floating-point scalar, and i is an int:

for (i=999; i>=0; i--)
   x[i] = x[i] + s;

The following is a MIPS implementation assuming that s is in F2, R1 has the address of the last element of the array, and 8(R2) is the address of the first element of the array, so the loop exits when R1 equals R2.

loop:  L.D    F0, 0(R1)
       ADD.D  F4, F0, F2
       S.D    F4, 0(R1)
       DADDUI R1, R1, #-8
       BNE    R1, R2, loop

What happens when you execute this on a simple pipeline?
We have not discussed how floating point operations work, but we will assume the following latencies:

Instruction Producing Result	Instruction Using Result	Latency in cycles
FP ALU Op	FP ALU Op	3
FP ALU Op	Store double	2
Load double	FP ALU Op	1
Load double	Store double	0

Here is the timing of these instructions.

			clock cycle issued
`loop:`	`L.D`	`F0, 0(R1)`	1
	stall		2
	`ADD.D`	`F4, F0, F2`	3
	stall		4
	stall		5
	`S.D`	`F4, 0(R1)`	6
	`DADDUI`	`R1, R1, #-8`	7
	stall		8
	`BNE`	`R1, R2, loop`	9

We assume the latencies from the table above.
We assume a latency of 1 cycle from an integer ALU operation to a branch, since the branch is evaluated in ID, which occurs in the same cycle as the EX of the previous instruction.
We ignore other delays due to branches.
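
The issue cycles in the table can be reproduced mechanically. The following is a minimal sketch in C, not a model of any real pipeline: it assumes in-order issue of one instruction per cycle, tracks only true (read-after-write) dependences, and uses the latencies stated above. The names kind, instr, latency, ready_cycle, schedule, and original_loop, and the register numbering macros F and R, are made up for this illustration.

```c
#include <assert.h>

/* Instruction classes that matter for the latency table. */
enum kind { LOAD, FPALU, STORE, INTALU, BRANCH };

/* Class, destination register (-1 if none), and up to two source
   registers (-1 if unused). */
struct instr { enum kind k; int dst, src1, src2; };

/* Register numbering used only by this sketch. */
#define F(n) (n)        /* F0..F31 -> 0..31  */
#define R(n) (32 + (n)) /* R0..R31 -> 32..63 */

/* Stall cycles needed between a producer and a dependent consumer. */
int latency(enum kind prod, enum kind cons) {
    if (prod == FPALU && cons == FPALU) return 3;
    if (prod == FPALU && cons == STORE) return 2;
    if (prod == LOAD && cons == FPALU) return 1;
    if (prod == INTALU && cons == BRANCH) return 1;
    return 0; /* load -> store and every other pair */
}

/* Earliest cycle at which instruction i may issue with respect to reg. */
int ready_cycle(const struct instr *code, const int *issue, int i, int reg) {
    if (reg < 0) return 1;
    for (int j = i - 1; j >= 0; j--)   /* most recent producer wins */
        if (code[j].dst == reg)
            return issue[j] + 1 + latency(code[j].k, code[i].k);
    return 1; /* value was produced before this sequence started */
}

/* Fills issue[] with the issue cycle of each instruction and returns
   the cycle in which the last one issues. */
int schedule(const struct instr *code, int n, int *issue) {
    assert(n > 0);
    for (int i = 0; i < n; i++) {
        int c = (i == 0) ? 1 : issue[i - 1] + 1; /* in order, one per cycle */
        int a = ready_cycle(code, issue, i, code[i].src1);
        int b = ready_cycle(code, issue, i, code[i].src2);
        if (a > c) c = a;
        if (b > c) c = b;
        issue[i] = c;
    }
    return issue[n - 1];
}

/* The original loop body from these notes. */
const struct instr original_loop[] = {
    { LOAD,   F(0), R(1), -1   },  /* L.D    F0, 0(R1)    */
    { FPALU,  F(4), F(0), F(2) },  /* ADD.D  F4, F0, F2   */
    { STORE,  -1,   F(4), R(1) },  /* S.D    F4, 0(R1)    */
    { INTALU, R(1), R(1), -1   },  /* DADDUI R1, R1, #-8  */
    { BRANCH, -1,   R(1), R(2) },  /* BNE    R1, R2, loop */
};
```

Calling schedule(original_loop, 5, issue) with an int issue[5] leaves the values 1, 3, 6, 7, 9 in issue and returns 9, which matches the timing table.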

We can remove half of the stalls by moving the DADDUI up after the L.D.

			clock cycle issued
`loop:`	`L.D`	`F0, 0(R1)`	1
	`DADDUI`	`R1, R1, #-8`	2
	`ADD.D`	`F4, F0, F2`	3
	stall		4
	stall		5
	`S.D`	`F4, 8(R1)`	6
	`BNE`	`R1, R2, loop`	7

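The body of the loop takes 7 cycles.
Because the DADDUI now executes before the S.D, the store offset changes from 0 to 8 to compensate for the decrement of R1.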

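Now we unroll the loop so that each pass contains 4 copies of the body, assuming the number of iterations is divisible by 4. Each copy uses its own registers, and the intermediate DADDUI and BNE instructions are eliminated: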

			clock cycle issued
`loop:`	`L.D`	`F0, 0(R1)`	1
	`ADD.D`	`F4, F0, F2`	3
	`S.D`	`F4, 0(R1)`	6
	`L.D`	`F6, -8(R1)`	7
	`ADD.D`	`F8, F6, F2`	9
	`S.D`	`F8, -8(R1)`	12
	`L.D`	`F10, -16(R1)`	13
	`ADD.D`	`F12, F10, F2`	15
	`S.D`	`F12, -16(R1)`	18
	`L.D`	`F14, -24(R1)`	19
	`ADD.D`	`F16, F14, F2`	21
	`S.D`	`F16, -24(R1)`	24
	`DADDUI`	`R1, R1, #-32`	25
	`BNE`	`R1, R2, loop`	27

Each L.D is followed by 1 stall, each ADD.D by 2, and the DADDUI by 1, for a total of 4 + 8 + 1 = 13 stall cycles. With 14 instructions, one pass through the unrolled loop takes 27 clock cycles.
Without unrolling, the original would take 36 cycles for 4 iterations and the rescheduled code would take 28 cycles.
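
The same transformation can be written at the source level. This is only a sketch of what the unrolled MIPS code computes, not compiler output; the function names add_scalar and add_scalar_unrolled4 are made up here, and the unrolled version assumes, as above, that the element count is divisible by 4.

```c
#include <assert.h>

/* Original loop: one element, one decrement, and one branch per iteration. */
void add_scalar(double *x, int n, double s) {
    for (int i = n - 1; i >= 0; i--)
        x[i] = x[i] + s;
}

/* Unrolled by 4: four elements per pass, but still only one decrement
   and one branch. Assumes n is a multiple of 4. */
void add_scalar_unrolled4(double *x, int n, double s) {
    assert(n % 4 == 0);
    for (int i = n - 1; i >= 0; i -= 4) {
        x[i]     = x[i]     + s;  /* offset   0 from R1 */
        x[i - 1] = x[i - 1] + s;  /* offset  -8 from R1 */
        x[i - 2] = x[i - 2] + s;  /* offset -16 from R1 */
        x[i - 3] = x[i - 3] + s;  /* offset -24 from R1 */
    }
}
```

Both functions leave the array in the same state; the unrolled one simply pays the loop overhead once per four elements.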

We can do better by changing the order of the instructions:

			clock cycle issued
`loop:`	`L.D`	`F0, 0(R1)`	1
	`L.D`	`F6, -8(R1)`	2
	`L.D`	`F10, -16(R1)`	3
	`L.D`	`F14, -24(R1)`	4
	`ADD.D`	`F4, F0, F2`	5
	`ADD.D`	`F8, F6, F2`	6
	`ADD.D`	`F12, F10, F2`	7
	`ADD.D`	`F16, F14, F2`	8
	`S.D`	`F4, 0(R1)`	9
	`S.D`	`F8, -8(R1)`	10
	`DADDUI`	`R1, R1, #-32`	11
	`S.D`	`F12, 16(R1)`	12
	`S.D`	`F16, 8(R1)`	13
	`BNE`	`R1, R2, loop`	14

There are now no stalls at all, so the 14 instructions issue in 14 cycles for 4 iterations.
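
The last two stores use offsets 16 and 8 instead of -16 and -24 because they now follow the DADDUI, which has already subtracted 32 from R1. A small sketch of that adjustment, with a made-up function name offset_after_update:

```c
#include <assert.h>

/* Offset to use for a memory access that has been moved below
   "DADDUI R1, R1, #step". */
int offset_after_update(int old_offset, int step) {
    int new_offset = old_offset - step;
    /* Same effective address:
       old_R1 + old_offset == (old_R1 + step) + new_offset */
    assert(old_offset == step + new_offset);
    return new_offset;
}
```

For example, offset_after_update(-16, -32) is 16 and offset_after_update(-24, -32) is 8.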

Summary of the 4 examples:

Description	Cycles per iteration
ideal	5
original	9
scheduled	7
unrolled	6.75
unrolled and scheduled	3.5
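
Each entry in the table is (instructions + stalls) for one pass through the loop, divided by the number of original iterations that pass covers. A minimal sketch of that arithmetic, with a made-up function name:

```c
#include <assert.h>

/* Cycles per original iteration for a loop body that issues
   `instructions` instructions with `stalls` stall cycles and covers
   `iterations` iterations of the original loop per pass. */
double cycles_per_iteration(int instructions, int stalls, int iterations) {
    assert(iterations > 0);
    return (double)(instructions + stalls) / iterations;
}
```

The five rows correspond to the arguments (5, 0, 1), (5, 4, 1), (5, 2, 1), (14, 13, 4), and (14, 0, 4).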

Limitations of loop unrolling:

diminishing savings as we unroll more
- When we unroll 4, 2 cycles out of 14 or 14.3% are loop overhead
- If we unroll 8, 2 cycles out of 26 or 7.7% are loop overhead
- If we unroll 16, 2 cycles out of 50 or 4.0% are loop overhead
increase in code size
limited number of registers
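
The overhead figures follow from a simple count: each unrolled copy contributes an L.D, an ADD.D, and an S.D, and every pass adds one DADDUI and one BNE. The sketch below assumes the schedule stays stall-free, as in the unroll-by-4 example, and ignores the register limit; the function names are made up for this illustration.

```c
#include <assert.h>

/* Cycles for one pass of a stall-free loop unrolled k times:
   3 instructions per copy plus DADDUI and BNE. */
int unrolled_cycles(int k) {
    assert(k > 0);
    return 3 * k + 2;
}

/* Fraction of those cycles spent on loop overhead (DADDUI and BNE). */
double overhead_fraction(int k) {
    return 2.0 / unrolled_cycles(k);
}
```

With k = 4 this gives 14 cycles and about 14.3% overhead; with k = 8 it gives 26 cycles and about 7.7%.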
