CS 3853 Architecture Notes on Appendix C Section 3

Read Appendix C.3

C3: Pipeline Implementation

We start with a simple unpipelined implementation of a subset of the MIPS instructions.

Unpipelined Implementation

We consider the following 5 types of instructions:

register-register ALU (result in another register)
register-immediate ALU (result in another register)
load (register with displacement addressing)
store (register with displacement addressing)
conditional branch instruction that branches if a register is 0

The following information is from Figure A.22.
All instructions are 32 bits and these instructions have one of 2 formats:
I-type:
Figure A.22-I

Used for:

load: Regs[rt] ← Mem[Regs[rs] + Imm]
store: Mem[Regs[rs] + Imm] ← Regs[rt]
branch: if (Regs[rs] == 0) PC ← PC + (Imm << 2)
Immediate ALU: Regs[rt] ← Regs[rs] op Imm

R-type:
Figure A.22-R

Used for:

RR ALU: Regs[rd] ← Regs[rs] funct Regs[rt]
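
The two formats can be made concrete with a small field extractor. This is a minimal sketch, assuming the standard MIPS bit positions (opcode in bits 31-26, rs in 25-21, rt in 20-16, rd in 15-11, funct in 5-0, and a 16-bit immediate in 15-0); the function name `decode` and the opcode value in the usage lines are illustrative, not taken from the figure.

```python
def decode(word):
    """Split a 32-bit instruction word into its I-type and R-type fields."""
    imm = word & 0xFFFF
    if imm & 0x8000:              # sign-extend the 16-bit immediate
        imm -= 0x10000
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,    # meaningful for R-type only
        "funct": word & 0x3F,         # meaningful for R-type only
        "imm": imm,                   # meaningful for I-type only
    }

# Hypothetical I-type word: opcode 35 (arbitrary here), rs = 2, rt = 1, Imm = -4
word = (35 << 26) | (2 << 21) | (1 << 16) | 0xFFFC
fields = decode(word)
```

Every field is extracted regardless of format; which ones are actually used depends on the instruction type, just as the hardware reads rs and rt before it knows whether they are needed.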

Figure C.21 shows the hardware needed to implement these instructions in 5 or fewer cycles.
Here is what happens at each cycle:

IF
- IR ← Mem[PC]
- NPC ← PC + 4
ID
- A ← Regs[rs]
- B ← Regs[rt]
- Imm ← sign-extended field of IR
EX
- Load or Store:
  ALUOutput ← A + Imm
- RR ALU:
  ALUOutput ← A funct B
- R-Imm ALU:
  ALUOutput ← A op Imm
- Branch:
  ALUOutput ← NPC + (Imm << 2)
  Cond ← (A == 0)
MEM
- if Branch and Cond: PC ← ALUOutput
  otherwise: PC ← PC + 4
- Load:
  LMD ← Mem[ALUOutput]
- Store:
  Mem[ALUOutput] ← B
WB
- Load:
  Regs[rt] ← LMD
- RR ALU:
  Regs[rd] ← ALUOutput
- R-Imm ALU:
  Regs[rt] ← ALUOutput
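
The five steps above can be sketched as a tiny interpreter. This is an illustration under stated assumptions, not the hardware of Figure C.21: instructions are already-decoded Python dicts rather than 32-bit words, and the names `step`, `kind`, and `op` are hypothetical. The temporaries mirror the registers IR, NPC, A, B, Imm, ALUOutput, Cond and LMD.

```python
import operator

def step(state, imem):
    """Run one instruction through IF, ID, EX, MEM and WB, updating state in place."""
    regs, mem = state["regs"], state["mem"]
    # IF: fetch the instruction and compute the sequential next PC
    ir = imem[state["pc"]]
    npc = state["pc"] + 4
    # ID: read both register operands and the immediate, needed or not
    a = regs[ir.get("rs", 0)]
    b = regs[ir.get("rt", 0)]
    imm = ir.get("imm", 0)
    # EX: effective address, ALU result, or branch target and condition
    kind, cond = ir["kind"], False
    if kind in ("load", "store"):
        alu_output = a + imm
    elif kind == "rr":
        alu_output = ir["op"](a, b)
    elif kind == "ri":
        alu_output = ir["op"](a, imm)
    elif kind == "branch":
        alu_output = npc + (imm << 2)
        cond = (a == 0)
    else:
        raise ValueError(f"unknown instruction kind: {kind}")
    # MEM: update the PC, then access data memory for loads and stores
    state["pc"] = alu_output if (kind == "branch" and cond) else npc
    if kind == "load":
        lmd = mem[alu_output]
    elif kind == "store":
        mem[alu_output] = b
    # WB: write the result register (rt for load and R-Imm, rd for RR)
    if kind == "load":
        regs[ir["rt"]] = lmd
    elif kind == "rr":
        regs[ir["rd"]] = alu_output
    elif kind == "ri":
        regs[ir["rt"]] = alu_output

program = {
    0:  {"kind": "ri", "rs": 0, "rt": 1, "imm": 5, "op": operator.add},
    4:  {"kind": "rr", "rs": 1, "rt": 1, "rd": 3, "op": operator.add},
    8:  {"kind": "store", "rs": 0, "rt": 3, "imm": 100},
    12: {"kind": "load", "rs": 0, "rt": 4, "imm": 100},
    16: {"kind": "branch", "rs": 0, "imm": 2},
}
state = {"pc": 0, "regs": [0] * 32, "mem": {}}
for _ in range(len(program)):
    step(state, program)
```

The final branch tests register 0, which holds 0 here, so it is taken and the PC becomes NPC + (2 << 2) = 28.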

Question:

The RR instruction is described as:

RR ALU: Regs[rd] ← Regs[rs] funct Regs[rt]

What would have to change if instead it were:

RR ALU: Regs[rs] ← Regs[rt] funct Regs[rd]

Answer:

Pipelined Implementation

Figure C.22 shows a corresponding pipeline implementation.
The registers NPC, IR, A, B, Imm, Cond, ALUOutput and LMD are now contained in the pipeline registers.
Examples:

NPC is contained in which pipeline register?
Answer:
NPC is created in IF, so it is stored in the IF/ID register.
It is needed in EX and MEM, so it must be in all pipeline registers up to MEM, so it is also stored in ID/EX and EX/MEM.
IR is stored in which pipeline registers?
Answer:
Parts of the IR register are needed in each cycle, so for simplicity, the entire IR is propagated to each pipeline register. This is somewhat inefficient.

Figure C.23 shows the details of the pipelined execution for each type of instruction.

Below is a comparison for the RR ALU instruction. See Figures C.21 and C.22
Operations that are performed, but not needed for this instruction are shown this way: operation.

Stage	Unpipelined	Pipelined
IF	IR ← Mem[PC]; NPC ← PC + 4	IF/ID.IR ← Mem[PC]; PC ← PC + 4; IF/ID.NPC ← PC + 4
ID	A ← Regs[IR.rs]; B ← Regs[IR.rt]; Imm ← sign-extended(IR.Immediate)	ID/EX.A ← Regs[IF/ID.IR.rs]; ID/EX.B ← Regs[IF/ID.IR.rt]; ID/EX.NPC ← IF/ID.NPC; ID/EX.IR ← IF/ID.IR; ID/EX.Imm ← sign-extended(IF/ID.IR.Immediate)
EX	ALUOutput ← A funct B	EX/MEM.IR ← ID/EX.IR; EX/MEM.ALUOutput ← ID/EX.A funct ID/EX.B
MEM	PC ← PC + 4	MEM/WB.IR ← EX/MEM.IR; MEM/WB.ALUOutput ← EX/MEM.ALUOutput
WB	Regs[IR.rd] ← ALUOutput	Regs[MEM/WB.IR.rd] ← MEM/WB.ALUOutput

Note: My notation is slightly different from that of the book.

For the unpipelined case I use IR.rs instead of just rs, etc.
For the pipelined case I use XX/XX.IR.rs instead of XX/XX.IR[rs]
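
To make the pipelined column concrete, here is a minimal sketch that carries a single RR ALU instruction through the four pipeline registers, each modeled as a Python dict. The function name `run_rr_alu` and the dict-based instruction are assumptions for illustration; a real pipeline overlaps these stages across several instructions, which this sketch does not attempt.

```python
import operator

def run_rr_alu(regs, imem, pc, funct=operator.add):
    """Trace one RR ALU instruction through IF/ID, ID/EX, EX/MEM and MEM/WB."""
    # IF: fetch and latch the instruction and NPC, and advance the PC
    if_id = {"IR": imem[pc], "NPC": pc + 4}
    pc = pc + 4
    # ID: read the operands; NPC and Imm are latched even though RR ALU ignores them
    ir = if_id["IR"]
    id_ex = {"IR": ir, "NPC": if_id["NPC"],
             "A": regs[ir["rs"]], "B": regs[ir["rt"]],
             "Imm": ir.get("imm", 0)}
    # EX: the ALU combines the latched operands
    ex_mem = {"IR": id_ex["IR"],
              "ALUOutput": funct(id_ex["A"], id_ex["B"])}
    # MEM: nothing to do for RR ALU except pass the values along
    mem_wb = {"IR": ex_mem["IR"], "ALUOutput": ex_mem["ALUOutput"]}
    # WB: the destination comes from the copy of IR that travelled with the result
    regs[mem_wb["IR"]["rd"]] = mem_wb["ALUOutput"]
    return pc

regs = [0] * 32
regs[1], regs[7] = 3, 4
imem = {0: {"rs": 1, "rt": 7, "rd": 5}}   # R5 <- R1 funct R7
next_pc = run_rr_alu(regs, imem, 0)
```

The point of the sketch is the last step: by the time WB runs, IF/ID already holds a later instruction, so rd must be read from MEM/WB.IR rather than from a single shared IR.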

How Branches Work

Branches are hard.

We already know that branches can cause stalls.
The problem is that we might not know the branch address or if the branch is taken until one or more additional instructions have been fetched, and possibly executed.
We are saved by the fact that the external state (what programs see) is not changed until MEM or WB.

In the unpipelined architecture shown in Figure C.21:

the NPC stores the potential new PC during IF
the branch address and whether the branch is taken is computed in EX
the PC is updated in MEM
for a branch, the instruction is complete after the MEM cycle.

The pipelined architecture shown in Figure C.22 has a 3-cycle stall when a branch is taken.
Suppose the instruction stream looks like:

instruction (not branch)
instruction (not branch)
instruction (not branch)
instruction A: taken branch
instruction B
instruction C
instruction D
...
instruction X: branch target

At the end of each IF, the PC is normally set to PC+4; if the Zero? field of EX/MEM is not 0, it is set to the ALU result held in EX/MEM instead.
The Zero? field of EX/MEM stays 0 until the branch instruction is executed.
If the branch instruction is fetched in cycle i:

cycle i:
- IF: taken branch is fetched
- IF: branch instruction stored in IF/ID
- IF: PC + 4 stored in PC (address of instruction i+1)
cycle i+1:
- IF: instruction B at i+1 is fetched
- IF: PC + 4 is stored in PC (address of instruction i+2)
- ID: branch base register stored in ID/EX
- ID: branch destination offset is stored in ID/EX
- ID: branch instruction is stored in ID/EX (from IF/ID)
cycle i+2:
- IF: instruction C at i+2 is fetched
- IF: PC + 4 is stored in PC (address of instruction i+3)
- ID: instruction B at i+1 is decoded
- EX: branch instruction Zero? stored in EX/MEM (this is 1 since the branch is taken)
- EX: branch destination stored in EX/MEM
cycle i+3:
- IF: instruction D at i+3 is fetched
- IF: branch destination is stored in PC (since Zero? field of EX/MEM is now set)
- ID: instruction C at i+2 is decoded
- EX: instruction B at i+1 is executed
  Note that even if this is a branch, we do not want to set Zero?
- MEM: nothing (for branch instruction)
cycle i+4:
- IF: branch destination is fetched

The timing diagram looks like this:

instruction	cycle i	cycle i+1	cycle i+2	cycle i+3	cycle i+4	cycle i+5	cycle i+6	cycle i+7	cycle i+8
instruction A (taken branch)	IF	ID	EX	MEM	WB
instruction B		IF	ID	EX	MEM	WB
instruction C			IF	ID	EX	MEM	WB
instruction D				IF	ID	EX	MEM	WB
instruction X (branch destination)					IF	ID	EX	MEM	WB

The PC is changed at the end of each cycle to either PC+4 or the ALU output, depending on what is in the EX/MEM register, which was set on the previous cycle in EX.
The branch instruction sets the PC in cycle i+3, so it affects the fetch in cycle i+4.
We inhibit the MEM and WB actions in the next 3 instructions so the effect is that these are not executed.
This produces 3 stalls for each taken branch.
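
The count of 3 follows directly from when the PC is overwritten. The sketch below is a small arithmetic model, not a simulator: it assumes the PC is written at the end of the cycle in which the branch occupies a given stage, and the function name `taken_branch_timing` and its `resolve_stage` parameter are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def taken_branch_timing(resolve_stage="MEM"):
    """Return (wasted fetches, cycle offset from i at which the target is fetched)."""
    # The branch is fetched in cycle i, so it is in STAGES[k] during cycle i+k.
    pc_written_in = STAGES.index(resolve_stage)
    # The new PC is visible to IF one cycle after it is written.
    target_fetch = pc_written_in + 1
    # Every instruction fetched between the branch and its target is wasted.
    wasted = target_fetch - 1
    return wasted, target_fetch

wasted, target_fetch = taken_branch_timing("MEM")
```

With the PC written while the branch is in MEM (cycle i+3), the target is fetched in cycle i+4 and instructions B, C and D are wasted. Passing an earlier stage shows how the penalty shrinks if the PC can be written sooner.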

Reducing the branch penalty

Figure C.28 shows how to reduce the branch taken penalty from 3 to 1. Compare Figures C.22 and C.28.

Must know if branch is taken in ID, rather than EX
- Zero? is done in ID rather than EX
- This is easy if we only have branch on zero or nonzero
- Requires more hardware if the branch compares 2 registers
Must compute branch address in ID
- requires an adder in ID after the register file read
- might increase the clock cycle time, but result is not fed into ID/EX
Must feed results of Add and Zero? directly into PC mux rather than into ID/EX to save one cycle

The timing diagram now looks like this:

instruction	cycle i	cycle i+1	cycle i+2	cycle i+3	cycle i+4	cycle i+5	cycle i+6
instruction A (taken branch)	IF	ID	EX	MEM	WB
instruction B		IF	ID	EX	MEM	WB
instruction X (branch destination)			IF	ID	EX	MEM	WB

Questions:

Why do we not strike out the ID and EX of instruction B?
Answer:
We do not have to since they do not change the external state.
Why don't we strike out the MEM and WB for instruction A?
Answer:
A branch instruction does not do anything in these stages.

Dealing with data hazards

Recall that there are 3 types of hazards: structural, data, and control.
Structural hazards will not occur because we included enough hardware.
The above discussion showed how to handle control hazards.
When a data hazard occurs, we need to either stall the pipeline or eliminate the hazard by using forwarding.

Examples:

The following requires a stall of the DADD instruction:

     LD    R1, 45(R2)
     DADD  R5, R1, R7

This can be detected in the ID stage of the DADD instruction by comparing rt of the LD instruction to rs and rt of the DADD instruction.
During the ID stage of DADD, rs is in IF/ID.IR.rs and rt is in IF/ID.IR.rt
During the ID stage of DADD, rt of LD is in ID/EX.IR.rt
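
The comparison described above can be written as a short predicate. This is a sketch under assumptions: instructions are decoded dicts with a hypothetical `kind` field, only RR ALU instructions are treated as reading rt as a source, and register 0 is not given special treatment.

```python
def load_use_stall(id_ex_ir, if_id_ir):
    """True if the instruction in ID must stall behind a load that is now in EX."""
    if id_ex_ir is None or id_ex_ir["kind"] != "load":
        return False
    load_dest = id_ex_ir["rt"]          # a load writes Regs[rt]
    sources = [if_id_ir["rs"]]          # rs is read by every instruction type here
    if if_id_ir["kind"] == "rr":        # RR ALU also reads rt as an operand
        sources.append(if_id_ir["rt"])
    return load_dest in sources

ld   = {"kind": "load", "rs": 2, "rt": 1, "imm": 45}   # LD   R1, 45(R2)
dadd = {"kind": "rr", "rs": 1, "rt": 7, "rd": 5}       # DADD R5, R1, R7
must_stall = load_use_stall(ld, dadd)
```

Here `must_stall` is true because the destination of LD (R1) matches rs of DADD. The check uses only fields that are available in ID/EX.IR and IF/ID.IR during the ID stage of DADD.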

The following data hazard in the DSUB instruction can be removed by forwarding:
```
     LD    R1, 45(R2)
     DADD  R5, R6, R7
     DSUB  R8, R1, R7
```
- Figure C.27 shows the new data paths needed and the new muxes for the ALU.
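
The mux selection for one ALU input can be sketched as follows. This is a simplified illustration, not the exact control logic of Figure C.27: instructions are decoded dicts with a hypothetical `kind` field, register 0 is not treated specially, and a load in EX/MEM is assumed to have already been separated from its consumer by the stall logic.

```python
def dest_reg(ir):
    """Register that an instruction writes, or None if it writes none."""
    if ir is None:
        return None
    if ir["kind"] == "rr":
        return ir["rd"]
    if ir["kind"] in ("ri", "load"):
        return ir["rt"]
    return None                          # stores and branches write no register

def forward_source(src_reg, ex_mem_ir, mem_wb_ir):
    """Pick where an ALU operand for the instruction in EX should come from."""
    # The newest producer wins, so check EX/MEM before MEM/WB.
    if dest_reg(ex_mem_ir) == src_reg and ex_mem_ir["kind"] != "load":
        return "EX/MEM.ALUOutput"
    if dest_reg(mem_wb_ir) == src_reg:
        return "MEM/WB.LMD" if mem_wb_ir["kind"] == "load" else "MEM/WB.ALUOutput"
    return "register file"

ld   = {"kind": "load", "rs": 2, "rt": 1, "imm": 45}   # LD   R1, 45(R2)
dadd = {"kind": "rr", "rs": 6, "rt": 7, "rd": 5}       # DADD R5, R6, R7
# When DSUB R8, R1, R7 is in EX, DADD is in MEM and LD is in WB.
a_source = forward_source(1, dadd, ld)
b_source = forward_source(7, dadd, ld)
```

For the sequence above, the first DSUB operand (R1) is taken from MEM/WB.LMD, while the second (R7) has no pending producer and comes from the register file.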
