CS 3853 Computer Architecture Notes on Appendix C Section 2
Read Appendix C
C2: Pipeline Hazards
A hazard prevents the next instruction from executing during its designated clock cycle.
Hazard Classifications
structural hazard: the hardware cannot support all combinations of instructions in overlapped execution (a resource conflict).
data hazard: instruction needs data from a previous instruction before it is available.
control hazard: branch changes the PC after a later instruction has been fetched.
A hazard may require that the pipeline stalls until the hazard can be cleared.
For now, when a stall occurs:
all instructions issued later will also stall
all instructions issued earlier will continue so that the hazard can be cleared
Performance with stalls
We will compare an unpipelined machine in which instructions take several cycles to a pipelined machine
with the same clock rate.
speedup = (Average instruction time unpipelined) / (Average instruction time pipelined) = (CPI unpipelined) / (CPI pipelined)
with no hazards, CPI pipelined = 1.
with hazards, CPI pipelined = 1 + stall cycles per instruction
speedup = (CPI unpipelined) / (1 + stall cycles per instruction)
in the case in which all instructions on the unpipelined machine take the same time and the pipeline is completely
balanced with no overhead:
CPI unpipelined = pipeline depth and
speedup = (pipeline depth) / (1 + stall cycles per instruction)
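As a quick check of the last formula, here is a minimal Python sketch. The function name and the sample stall values are illustrative assumptions, not from the text; it only covers the balanced, no-overhead case described above.

```python
def pipeline_speedup(pipeline_depth, stall_cycles_per_instruction):
    """Speedup over a balanced unpipelined machine with the same clock rate."""
    cpi_unpipelined = pipeline_depth                  # balanced, no overhead
    cpi_pipelined = 1 + stall_cycles_per_instruction  # ideal CPI of 1 plus stalls
    return cpi_unpipelined / cpi_pipelined


# Hypothetical stall rates for a 5-stage pipeline.
for stalls in (0.0, 0.25, 0.5, 1.0):
    print(f"stalls/instr = {stalls:4.2f}  speedup = {pipeline_speedup(5, stalls):.2f}")
```

With no stalls the speedup equals the pipeline depth; every added stall cycle per instruction eats directly into it.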
Structural Hazards
At some stage of the pipeline, two instructions require the same resource.
Example: a single shared memory port for data and instructions
instruction memory is always used in the first stage of the pipeline for the instruction fetch
a load (or store) instruction will access the data memory in the 4th stage (MEM)
with a single shared memory port for data and instructions, we cannot access the instruction memory and data memory in the same clock cycle.
Figure C.4 shows a load instruction followed by 4 non-memory instructions.
Here is a timing diagram showing the stall (like figure C.5):
This assumes none of the other instructions are loads or stores so they do not need to access memory in the MEM stage.
                    clock number
Instruction         1     2     3     4     5     6     7     8     9     10    11    12
Load instruction    IF    ID    EX    MEM   WB
Instruction i+1           IF    ID    EX    MEM   WB
Instruction i+2                 IF    ID    EX    MEM   WB
Instruction i+3                       stall IF    ID    EX    MEM   WB
Instruction i+4                                   IF    ID    EX    MEM   WB
Instruction i+5                                         IF    ID    EX    MEM   WB
Instruction i+6                                               IF    ID    EX    MEM   WB
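The fetch cycles in the timing diagram can be reproduced with a small sketch. It assumes the 5-stage pipeline, one shared memory port, and no other hazards; the function name and the list encoding of the program are hypothetical.

```python
def fetch_cycles(is_memory_op):
    """Return the clock cycle in which each instruction is fetched.

    Assumes stages IF ID EX MEM WB, one shared memory port, and no other
    hazards, so a load/store fetched in cycle c is in MEM in cycle c + 3.
    """
    port_busy = set()   # cycles in which a MEM stage owns the memory port
    fetched = []
    cycle = 1
    for mem_op in is_memory_op:
        while cycle in port_busy:   # structural hazard: the fetch must wait
            cycle += 1
        fetched.append(cycle)
        if mem_op:
            port_busy.add(cycle + 3)
        cycle += 1
    return fetched


# One load followed by six non-memory instructions, as in the diagram.
print(fetch_cycles([True, False, False, False, False, False, False]))
# [1, 2, 3, 5, 6, 7, 8]
```

Instruction i+3 is fetched in cycle 5 instead of cycle 4, which is the single stall shown above.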
Today's News: September 9, 2015
No news
Examples:
Memory structural hazard
Compare the corresponding balanced unpipelined machine to a 5-stage pipelined machine
with a single shared memory port in which loads and stores together make up 30% of the instructions.
Since most computers use the same memory for data and instructions, why is the above not a problem for modern machines?
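One way to work the memory structural hazard example is sketched below. It assumes each load or store costs exactly one stall cycle (one lost fetch), as in the timing diagram; that assumption and the variable names are not from the text.

```python
pipeline_depth = 5           # CPI of the balanced unpipelined machine
memory_op_fraction = 0.30    # loads and stores together
stall_per_memory_op = 1      # assumption: one lost fetch cycle per load/store

stall_cycles_per_instruction = memory_op_fraction * stall_per_memory_op
cpi_pipelined = 1 + stall_cycles_per_instruction
speedup = pipeline_depth / cpi_pipelined

print(f"CPI pipelined = {cpi_pipelined:.2f}, speedup = {speedup:.2f}")
# CPI pipelined = 1.30, speedup = 3.85
```

Under these assumptions the shared port reduces the speedup from the ideal 5 to about 3.85.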
Data Hazards
These occur when the pipeline would change the order of read/write accesses so that they differ from the order of unpipelined execution.
Consider:
DADD R1, R2, R3
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
Recall that in each of these instructions, the first register is the destination of the operation.
We assume that in ID, register reads occur at the end of the cycle, and in WB, the register writes occur at the start of the cycle.
We will see how this eliminates one of the hazards.
Each instruction after the first uses R1.
Figure C.6 shows the execution of these instructions in the standard pipeline.
The DSUB instruction needs the new value of R1 at the end of CC 3, but it is not written to the register file until the start of CC 5.
Similarly, the AND needs it at the end of CC 4.
The OR needs it at the end of CC 5, which is OK because the register write occurs at the start of CC 5.
The XOR doesn't need it until CC 6 so it is fine.
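The cycle counting above can be checked with a short sketch. It assumes no forwarding, DADD fetched in CC 1, and one instruction fetched per cycle; the helper names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def cycle_of(fetch_cycle, stage):
    """Clock cycle in which an instruction fetched in fetch_cycle is in stage."""
    return fetch_cycle + STAGES.index(stage)


def reads_stale_r1(consumer_fetch_cycle, producer_fetch_cycle=1):
    """True if the consumer reads R1 before the producer has written it.

    Registers are written at the start of WB and read at the end of ID,
    so a read in the same cycle as the write sees the new value.
    """
    write_cycle = cycle_of(producer_fetch_cycle, "WB")
    read_cycle = cycle_of(consumer_fetch_cycle, "ID")
    return read_cycle < write_cycle


# DADD is fetched in CC 1; the four consumers follow in CC 2..5.
for fetch_cycle, name in enumerate(["DSUB", "AND", "OR", "XOR"], start=2):
    status = "hazard" if reads_stale_r1(fetch_cycle) else "ok"
    print(f"{name:4s} reads R1 in CC {cycle_of(fetch_cycle, 'ID')}: {status}")
```

Only DSUB and AND are flagged, matching the discussion above.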
Forwarding
The idea of forwarding is that even though a result is not stored in the register file until WB,
it is often available several cycles earlier. For an ALU instruction it is available in the EX/MEM
pipeline register. Figure C.7
shows how the two stalls from the previous example can be eliminated by forwarding (part of the) contents of
the pipeline registers to the next stage.
Problem
forwarding hardware
To implement forwarding for the first two instructions in the example above, one of the ALU inputs
must be selectable from two different places depending on the previous instruction.
From which pipeline register(s) does the ALU get its inputs?
What type of circuit is required to implement this?
Examples:
Consider:
DADD R1, R2, R3
LD R4, 0(R1)
SD R4, 12(R1)
The LD and SD use the ALU to calculate the effective address in EX. Figure C.8
shows how forwarding can be used to get R1 before it is stored back in the register file.
Also, the value of R4 from the LD is given to the SD before it goes into the register file.
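The choice of source for an ALU input can be sketched as a simple selection. This is only an illustration of the idea: the function name and the (register, value) encoding of the pipeline registers are assumptions, and it ignores the case where the EX/MEM entry belongs to a load whose data has not arrived yet.

```python
def select_alu_operand(src_reg, regfile_value, ex_mem, mem_wb):
    """Pick the value for one ALU input.

    ex_mem and mem_wb are (dest_reg, value) pairs for the results held in
    the EX/MEM and MEM/WB pipeline registers, or None if the instruction
    in that stage writes no register.
    """
    if ex_mem is not None and ex_mem[0] == src_reg:
        return ex_mem[1]      # result of the instruction one ahead (newest)
    if mem_wb is not None and mem_wb[0] == src_reg:
        return mem_wb[1]      # result of the instruction two ahead
    return regfile_value      # no hazard: use the value read during ID


# DSUB reading R1 right after DADD R1, R2, R3 (hypothetical values).
print(select_alu_operand("R1", regfile_value=0, ex_mem=("R1", 42), mem_wb=None))
# 42
```

EX/MEM is checked first because it holds the more recent write to the register.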
Today's News: September 11, 2015
Late Assignment 1 due today
Sometimes stalls are necessary
Consider:
LD R1, 0(R2)
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
Figure C.9 shows the required forwarding paths.
The DSUB needs the result of the LD before it is available anywhere, so a stall is required.
The EX cycle of the DSUB requires the value generated in the MEM cycle of the LD
which occurs at the same time.
It is fixed by introducing a stall before the EX cycle of the DSUB.
All subsequent instructions are also stalled.
                clock number
Instruction     1     2     3     4     5     6     7     8     9
LD R1,0(R2)     IF    ID    EX    MEM   WB
DSUB R4,R1,R5         IF    ID    stall EX    MEM   WB
AND R6,R1,R7                IF    stall ID    EX    MEM   WB
OR R8,R1,R9                       stall IF    ID    EX    MEM   WB
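A minimal sketch of the check that decides when this stall is needed, assuming full forwarding. The tuple encoding of instructions and the function name are hypothetical, and the source list is taken to mean registers needed in EX.

```python
def needs_load_stall(prev, cur):
    """True if cur must stall one cycle behind the load just before it.

    Instructions are (opcode, dest_reg, source_regs) tuples. With forwarding,
    only the instruction immediately after a load can need the loaded value
    before it exists (its EX overlaps the load's MEM).
    """
    opcode, dest, _ = prev
    return opcode == "LD" and dest in cur[2]


program = [
    ("LD",   "R1", ["R2"]),
    ("DSUB", "R4", ["R1", "R5"]),
    ("AND",  "R6", ["R1", "R7"]),
    ("OR",   "R8", ["R1", "R9"]),
]
stalls = [needs_load_stall(program[i - 1], program[i]) for i in range(1, len(program))]
print(stalls)
# [True, False, False]
```

Only the DSUB triggers a stall; the AND and OR are delayed in the diagram because they are behind the DSUB, not because of their own dependences.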
Question:
stalls after memory access
Suppose the sequence of instructions is:
LD R1, 0(R2)
DSUB R4, R1, R5
AND R6, R7, R7
OR R8, R7, R7
Would we still have to delay the AND and OR instructions, even though they use different registers? Why?
How could you prevent the stalls in this code sequence?
Answer:
?
Branch Hazards
Control hazards can cause a significant performance loss.
If a branch changes the PC to its target address, we say the branch is taken.
Otherwise, it is not taken or untaken.
If a branch is taken, the PC is modified at the end of ID.
At this point the next instruction has already been fetched and needs to be discarded.
One way to do this is to always redo the fetch for a branch instruction as shown below:
Branch instruction    IF    ID    EX    MEM   WB
Branch successor            IF    IF    ID    EX    MEM   WB
Branch successor + 1                    IF    ID    EX    MEM   WB
Branch successor + 2                          IF    ID    EX    MEM   WB
Four static methods of dealing with branch stalls
Method 1: freeze or flush the pipeline
This is the method that was shown above.
The penalty is always one cycle and cannot be fixed by software.
Method 2: treat every branch as not taken
In general we would need to back out of any action that occurred when we find out that the branch was taken.
In our simple 5-stage pipeline, turn the next instruction into a no-op.
This works because we can tell if the branch is taken during ID.
This is illustrated below:
Untaken branch instruction  IF    ID    EX    MEM   WB
Instruction i + 1                 IF    ID    EX    MEM   WB
Instruction i + 2                       IF    ID    EX    MEM   WB
Instruction i + 3                             IF    ID    EX    MEM   WB

Taken branch instruction    IF    ID    EX    MEM   WB
Instruction i + 1                 IF    idle  idle  idle  idle
Branch target                           IF    ID    EX    MEM   WB
Branch target + 1                             IF    ID    EX    MEM   WB
Branch target + 2                                   IF    ID    EX    MEM   WB
While Method 1 always causes a stall for each branch, this only causes a stall if the branch is taken.
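The difference between Methods 1 and 2 can be put in numbers with a short sketch. It assumes a one-cycle branch penalty and no other stalls; the branch frequency and taken fraction used below are made-up values for illustration, not measurements.

```python
def cpi_with_branches(branch_fraction, taken_fraction, scheme, penalty=1):
    """CPI of the 5-stage pipeline when branches are the only source of stalls."""
    if scheme == "flush":                 # Method 1: every branch stalls
        stalls = branch_fraction * penalty
    elif scheme == "predict_not_taken":   # Method 2: only taken branches stall
        stalls = branch_fraction * taken_fraction * penalty
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return 1 + stalls


# Hypothetical workload: 20% branches, 60% of them taken.
for scheme in ("flush", "predict_not_taken"):
    print(f"{scheme:18s} CPI = {cpi_with_branches(0.20, 0.60, scheme):.2f}")
```

The fewer branches are taken, the more Method 2 gains over Method 1.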
Method 3: treat every branch as taken
Not useful in the 5-stage pipeline since we do not know the branch target until after ID, which is too late.
For longer pipelines, this method gives the smallest penalty when the branch is taken.
Method 4: The delayed branch
Must be a feature of the ISA, and therefore the programmer (or compiler writer) must take this into account.
The instruction after a branch is always executed, whether the branch is taken or not.
This allows us to know the branch target and whether the branch is taken before we fetch the next instruction.
There are no branch stalls with this method as long as a useful instruction can be put in the delay slot.
It is the job of the compiler to schedule a useful instruction into the delay slot.
If none can be found, the delay slot can be filled with a no-op.
Figure C.14 shows several methods of scheduling the delay slot.
The best method of scheduling the delay slot might depend on whether the branch is taken or not.
Reducing the branch cost through prediction
There are 2 classes of branch prediction:
static prediction: low cost - can be used by compilers
dynamic prediction: based on program behavior
Static Branch Prediction
Use profiling to predict which branches are usually taken and which ones are usually not taken.
Figure C.17 shows the success of this strategy for some SPEC benchmarks.
In SPEC, branches make up between 3% and 24% of all instructions executed.
Dynamic Branch Prediction
The simplest technique uses a branch prediction buffer or branch history table.
branch prediction buffer: small memory indexed by the low bits of the address of the branch instruction.
simple: each entry has a bit indicating whether the last branch at that address was taken or untaken.
better: each entry has 2 bits so that a prediction must miss twice before it is changed.
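A minimal sketch of such a buffer, using saturating counters as one common way to get the "miss twice" behavior. The class name, table size, initial counter value, and the assumption of 4-byte instructions (so the two lowest address bits are dropped before indexing) are all illustrative choices, not from the text.

```python
class BranchPredictionBuffer:
    """Branch prediction buffer built from n-bit saturating counters."""

    def __init__(self, index_bits=4, counter_bits=2):
        self.mask = (1 << index_bits) - 1
        self.max_count = (1 << counter_bits) - 1
        self.threshold = (self.max_count + 1) // 2   # upper half predicts taken
        self.table = [0] * (1 << index_bits)         # assumption: start at "not taken"

    def _index(self, branch_address):
        # Assumes 4-byte instructions: skip the two always-zero low bits.
        return (branch_address >> 2) & self.mask

    def predict(self, branch_address):
        return self.table[self._index(branch_address)] >= self.threshold

    def update(self, branch_address, taken):
        i = self._index(branch_address)
        if taken:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = max(self.table[i] - 1, 0)


def count_mispredictions(counter_bits, outcomes, branch_address=0x400):
    buffer = BranchPredictionBuffer(counter_bits=counter_bits)
    misses = 0
    for taken in outcomes:
        if buffer.predict(branch_address) != taken:
            misses += 1
        buffer.update(branch_address, taken)
    return misses


# A loop branch: taken 9 times, then not taken on exit; the loop runs 10 times.
loop_branch = ([True] * 9 + [False]) * 10
print(count_mispredictions(1, loop_branch))  # 20
print(count_mispredictions(2, loop_branch))  # 12
```

With 1 bit, the predictor misses both on loop exit and on the next loop entry; with 2 bits, after warming up it misses only on loop exit.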