CS 3853 Computer Architecture Notes on Chapter 3 Section 2

Read Section 3.2

3.2: Compiler Techniques for Exposing ILP


Today's News: April 6
No news

Consider the following code, where x is an array of doubles, s is a double, and i is an int:
for (i=999; i>=0; i--)
   x[i] = x[i] + s;
The following is a MIPS implementation assuming that s is in F2, R1 has the address of the last element of the array, and 8(R2) is the address of the first element of the array.
loop:  L.D    F0,0(R1)
       ADD.D  F4, F0, F2
       S.D    F4, 0(R1)
       DADDUI R1, R1, #-8
       BNE    R1, R2, loop
What happens when you execute this on a simple pipeline?
We have not discussed how floating point operations work, but we will assume the following latencies:
Instruction producing result   Instruction using result   Latency in cycles
FP ALU Op                      FP ALU Op                  3
FP ALU Op                      Store double               2
Load double                    FP ALU Op                  1
Load double                    Store double               0
Here is the timing of these instructions.
                                clock cycle issued
loop:  L.D     F0, 0(R1)        1
       stall                    2
       ADD.D   F4, F0, F2       3
       stall                    4
       stall                    5
       S.D     F4, 0(R1)        6
       DADDUI  R1, R1, #-8      7
       stall                    8
       BNE     R1, R2, loop     9
We assume the latencies from the table above.
We assume a latency of 1 cycle from an integer ALU operation to a branch, since the branch is evaluated in ID, which occurs in the same cycle as the EX of the previous instruction.
We ignore other delays due to branches.
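The issue cycles in the timing table can be reproduced with a small model. The following is a minimal sketch in C, assuming a single-issue, in-order pipeline in which an instruction waits only on the latencies from the table above plus the 1-cycle integer-ALU-to-branch latency. The names Instr, latency, last_issue_cycle, and original_loop, and the encoding of registers as small integers, are illustrative assumptions and do not come from the notes or the textbook.

```c
#include <assert.h>
#include <stddef.h>

/* Instruction classes that matter for the latency table. */
typedef enum { LOAD, FP_ALU, STORE, INT_ALU, BRANCH } Kind;

/* Hypothetical register encoding: F0..F31 -> 0..31, R0..R31 -> 32..63. */
enum { NONE = -1, NUM_REGS = 64 };
#define F(n) (n)
#define R(n) (32 + (n))

/* One instruction: its class, destination register, and two sources. */
typedef struct { Kind kind; int dest, src1, src2; } Instr;

/* Stall cycles needed between a producer and a dependent consumer. */
static int latency(Kind producer, Kind consumer) {
    if (producer == FP_ALU  && consumer == FP_ALU) return 3;
    if (producer == FP_ALU  && consumer == STORE)  return 2;
    if (producer == LOAD    && consumer == FP_ALU) return 1;
    if (producer == INT_ALU && consumer == BRANCH) return 1;
    return 0; /* load -> store, and every pair not listed in the notes */
}

/* Returns the clock cycle in which the last instruction issues. */
int last_issue_cycle(const Instr *code, size_t n) {
    int  issued_at[NUM_REGS] = {0};  /* 0 = value was ready before the loop */
    Kind made_by[NUM_REGS]   = {LOAD};
    int cycle = 0;
    for (size_t i = 0; i < n; i++) {
        int earliest = cycle + 1;    /* in order, at most one issue per cycle */
        const int srcs[2] = { code[i].src1, code[i].src2 };
        for (int s = 0; s < 2; s++) {
            int r = srcs[s];
            if (r == NONE) continue;
            assert(r < NUM_REGS);
            if (issued_at[r] == 0) continue;
            int ready = issued_at[r] + 1 + latency(made_by[r], code[i].kind);
            if (ready > earliest) earliest = ready;
        }
        cycle = earliest;
        if (code[i].dest != NONE) {
            assert(code[i].dest < NUM_REGS);
            issued_at[code[i].dest] = cycle;
            made_by[code[i].dest]   = code[i].kind;
        }
    }
    return cycle;
}

/* The five-instruction loop body from the notes. */
const Instr original_loop[] = {
    { LOAD,    F(0), R(1), NONE },  /* L.D    F0, 0(R1)    */
    { FP_ALU,  F(4), F(0), F(2) },  /* ADD.D  F4, F0, F2   */
    { STORE,   NONE, F(4), R(1) },  /* S.D    F4, 0(R1)    */
    { INT_ALU, R(1), R(1), NONE },  /* DADDUI R1, R1, #-8  */
    { BRANCH,  NONE, R(1), R(2) },  /* BNE    R1, R2, loop */
};
```

Calling last_issue_cycle(original_loop, 5) gives the cycle in which the BNE issues. The model ignores structural hazards and branch delays, as the notes do.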

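We can remove half of the stalls by moving the DADDUI up to just after the L.D. Because R1 is then decremented before the store executes, the store offset changes from 0 to 8.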
                                clock cycle issued
loop:  L.D     F0, 0(R1)        1
       DADDUI  R1, R1, #-8      2
       ADD.D   F4, F0, F2       3
       stall                    4
       stall                    5
       S.D     F4, 8(R1)        6
       BNE     R1, R2, loop     7
The body of the loop takes 7 cycles.

Now we unroll 4 iterations of the loop, assuming the number of iterations is divisible by 4:
                                clock cycle issued
loop:  L.D     F0, 0(R1)        1
       ADD.D   F4, F0, F2       3
       S.D     F4, 0(R1)        6
       L.D     F6, -8(R1)       7
       ADD.D   F8, F6, F2       9
       S.D     F8, -8(R1)       12
       L.D     F10, -16(R1)     13
       ADD.D   F12, F10, F2     15
       S.D     F12, -16(R1)     18
       L.D     F14, -24(R1)     19
       ADD.D   F16, F14, F2     21
       S.D     F16, -24(R1)     24
       DADDUI  R1, R1, #-32     25
       BNE     R1, R2, loop     27

Each L.D has 1 stall, each ADD.D has 2, and the DADDUI has 1, for a total of 13 stall cycles and 27 clock cycles for the loop.
Without unrolling, the original would take 36 cycles for 4 iterations and the rescheduled code would take 28 cycles.
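At the source level, the unrolling above corresponds roughly to the following sketch. It is an illustration, not the compiler's actual output. The function names add_scalar and add_scalar_unrolled4 are hypothetical, and, as in the notes, the unrolled version assumes the element count is divisible by 4.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: one element per iteration, last element first. */
void add_scalar(double *x, size_t n, double s) {
    for (size_t i = n; i-- > 0; )
        x[i] = x[i] + s;
}

/* Unrolled by 4: four elements per pass, one index update and one
   loop test per pass. Assumes n is a multiple of 4. */
void add_scalar_unrolled4(double *x, size_t n, double s) {
    assert(n % 4 == 0);
    for (size_t i = n; i >= 4; i -= 4) {
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
        x[i - 4] = x[i - 4] + s;
    }
}
```

Both functions leave the array in the same state. The unrolled one executes the index update and the loop test once per four elements, which is where the three removed DADDUI and BNE pairs in the MIPS code come from.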

We can do better by changing the order of the instructions:
                                clock cycle issued
loop:  L.D     F0, 0(R1)        1
       L.D     F6, -8(R1)       2
       L.D     F10, -16(R1)     3
       L.D     F14, -24(R1)     4
       ADD.D   F4, F0, F2       5
       ADD.D   F8, F6, F2       6
       ADD.D   F12, F10, F2     7
       ADD.D   F16, F14, F2     8
       S.D     F4, 0(R1)        9
       S.D     F8, -8(R1)       10
       DADDUI  R1, R1, #-32     11
       S.D     F12, 16(R1)      12
       S.D     F16, 8(R1)       13
       BNE     R1, R2, loop     14

There are now no stalls at all.

Summary of the 4 examples:
Description               Cycles per iteration
ideal                     5
original                  9
scheduled                 7
unrolled                  6.75
unrolled and scheduled    3.5
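Each figure in the summary is the number of instructions issued plus the number of stall cycles, divided by the number of original iterations covered by one pass through the loop. A minimal sketch of that arithmetic follows. The function name cycles_per_iteration is a hypothetical helper, and the instruction and stall counts in the usage note are read off the timing tables above.

```c
#include <assert.h>

/* Cycles per original loop iteration: instructions issued plus stall
   cycles, divided by how many original iterations one pass covers. */
double cycles_per_iteration(int instructions, int stalls, int unroll_factor) {
    assert(unroll_factor > 0);
    return (double)(instructions + stalls) / unroll_factor;
}
```

For example, the original loop is cycles_per_iteration(5, 4, 1), the scheduled loop is cycles_per_iteration(5, 2, 1), the unrolled loop is cycles_per_iteration(14, 13, 4), and the unrolled and scheduled loop is cycles_per_iteration(14, 0, 4).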

Limitations of loop unrolling:
  • diminishing savings as we unroll more
    • When we unroll 4, 2 cycles out of 14 or 14.3% are loop overhead
    • If we unroll 8, 2 cycles out of 26 or 7.7% are loop overhead
    • If we unroll 16, 2 cycles out of 50 or 4.0% are loop overhead
  • increase in code size
  • limited number of registers
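The diminishing return can be checked with a short calculation. This sketch assumes the fully scheduled loop from the notes with no remaining stalls, so that each unrolled copy contributes three instructions (L.D, ADD.D, S.D) and the DADDUI and BNE contribute two cycles of overhead per pass. It also assumes enough registers are available, which is itself one of the limitations listed above. The function names are hypothetical.

```c
#include <assert.h>

/* Cycles for one pass through the stall-free scheduled loop:
   3 per unrolled copy (L.D, ADD.D, S.D) plus 2 for DADDUI and BNE. */
int scheduled_loop_cycles(int unroll_factor) {
    assert(unroll_factor >= 1);
    return 3 * unroll_factor + 2;
}

/* Fraction of those cycles spent on loop overhead (DADDUI and BNE). */
double loop_overhead_fraction(int unroll_factor) {
    return 2.0 / scheduled_loop_cycles(unroll_factor);
}
```

Doubling the unroll factor from 4 to 8 roughly halves the overhead fraction, but each further doubling removes a smaller absolute share of cycles while doubling the code size and the number of registers needed.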
