CS 3853 Architecture Notes on Appendix C

C1: Introduction

Consider a traditional processor in which instructions are executed as follows:

Instruction fetch (IF): read the instruction from memory and update the PC
Instruction decode (ID): decode the instruction and read the source registers
Execution (EX): execute, e.g. perform ALU operation (may be effective address calculation)
Memory access (MEM): if the instruction is a load, read from memory; if it is a store, write to memory
Write-back (WB): Write the result to the destination register

If we execute one instruction per cycle, the cycle time needs to be long enough to perform all of these steps on the longest instruction.
Alternatively, we can execute each part in a cycle, which makes the cycle time shorter, but some instructions will require as many as 5 cycles.

The cycle time needs to be long enough for the slowest of these steps.
An instruction that needs all 5 cycles will take longer than it would in the one-long-cycle design, but some instructions will take fewer cycles.
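
The trade-off between the two approaches can be sketched numerically. The stage delays below are hypothetical values chosen only for illustration (they are not from the text), and the function names are made up for this sketch.

```python
# Hypothetical stage delays in picoseconds (assumed for illustration only).
STAGE_DELAY_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}


def single_cycle_time(delays):
    """One long cycle that must fit every step of the longest instruction."""
    return sum(delays.values())


def multi_cycle_time(delays):
    """One short cycle per step; it must fit the slowest single step."""
    return max(delays.values())


def multi_cycle_latency(delays, cycles):
    """Time for an instruction that needs the given number of short cycles."""
    return cycles * multi_cycle_time(delays)


print(single_cycle_time(STAGE_DELAY_PS))        # 800
print(multi_cycle_time(STAGE_DELAY_PS))         # 200
print(multi_cycle_latency(STAGE_DELAY_PS, 5))   # 1000
print(multi_cycle_latency(STAGE_DELAY_PS, 2))   # 400
```

With these assumed delays, a 5-cycle instruction takes 1000 ps instead of 800 ps, while a 2-cycle instruction takes only 400 ps.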

Today's News: August 31, 2015

Recitation and quiz today.

Pipeline Timing 1

We will consider this second approach for now.
What fraction of the time is the ALU being used?
How can we improve the performance?
The idea of pipelining: Fetch the next instruction while decoding the previous instruction.
Instead of a throughput of 1 instruction every 5 cycles, we could get one per cycle after an initial delay.
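
As a rough sketch of that throughput claim, the cycle counts for n instructions can be compared directly. This assumes every instruction uses all 5 stages and that the pipeline never stalls; the function names are hypothetical.

```python
def unpipelined_cycles(n, cycles_per_instruction=5):
    """Each instruction finishes before the next one starts."""
    return n * cycles_per_instruction


def pipelined_cycles(n, stages=5):
    """The first instruction needs `stages` cycles; each later one adds 1 cycle."""
    if n == 0:
        return 0
    return stages + (n - 1)


for n in (1, 5, 1000):
    print(n, unpipelined_cycles(n), pipelined_cycles(n))
# 1 5 5
# 5 25 9
# 1000 5000 1004
```

For large n the ratio approaches 5, the number of stages.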

Important:

You must know the above 5 steps
You must be able to give them in order using the 2- or 3-letter abbreviations:
IF, ID, EX, MEM, WB
You must know the names of each step:
Instruction fetch, Instruction decode, Execution, Memory access, Write-back
You must be able to describe in general what each step does
For each of the following major types of 4-byte RISC instructions,
you must be able to describe in detail what happens at each step:
- register-register ALU (result in another register)
- register-immediate ALU (result in another register)
- load (register with displacement addressing)
- store (register with displacement addressing)
- conditional branch instruction that compares 2 registers
See pages C5 and C6 of the text.

The MIPS instruction set

We will base our pipeline discussion on the MIPS 64-bit instruction set.
This is a RISC instruction set.
All operations on data apply to data in registers and typically change the entire register.
Only load and store instructions access memory.
Memory instructions can operate on 8, 16, 32, or 64 bits.
Almost all instructions are 32 bits in length.
32 general purpose registers with R0 always 0.
ALU instructions have 3 operands, either all registers, or 2 registers and an immediate value:
- DADD R1, R2, R3: R1 = R2 + R3
- DADDIU R1, R2, #3: R1 = R2 + 3 (unsigned)
Load and store instructions use base register with displacement addressing.
- LD R1, 30(R2): 64-bits of memory loaded into R1
- SD R1, 30(R2): 64-bits of memory stored from R1
- Note that source or destination register is first and memory address second.
Branches and Jumps: branches are conditional and the instruction stores the offset from the current PC
MIPS can use either condition codes or direct register comparison. We only consider the latter.
- BEQZ R1, name: branch if R1 is 0
- BNE R1, R2, name: branch if R1 is not equal to R2
For now we do not need to know the details which are contained in Appendix A.
All of these instructions can be executed in 5 cycles or fewer using IF, ID, EX, MEM, WB
Branch instructions require only 2 cycles.
Store instructions require only 4 cycles.
All other instructions take 5 cycles.
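
The register and addressing rules above can be sketched as a toy model. This is only an illustration: the class and method names are hypothetical, memory is modeled as a dictionary from byte address to a 64-bit value, and DADD simply wraps at 64 bits (overflow handling is ignored).

```python
MASK64 = (1 << 64) - 1

# Cycle counts by instruction class, as listed above.
CYCLES = {"branch": 2, "store": 4, "other": 5}


class ToyMachine:
    """32 general purpose registers with R0 always 0, plus a simple memory."""

    def __init__(self):
        self.regs = [0] * 32
        self.mem = {}  # byte address -> 64-bit value (simplification)

    def write_reg(self, addr, value):
        if addr != 0:  # writes to R0 are discarded, so R0 stays 0
            self.regs[addr] = value & MASK64

    def dadd(self, rd, rs, rt):
        """DADD rd, rs, rt: rd = rs + rt (all operands are registers)."""
        self.write_reg(rd, self.regs[rs] + self.regs[rt])

    def ld(self, rt, disp, base):
        """LD rt, disp(base): load 64 bits from address base + disp."""
        self.write_reg(rt, self.mem.get(self.regs[base] + disp, 0))

    def sd(self, rt, disp, base):
        """SD rt, disp(base): store 64 bits to address base + disp."""
        self.mem[self.regs[base] + disp] = self.regs[rt]


m = ToyMachine()
m.regs[2] = 100
m.regs[1] = 7
m.sd(1, 30, 2)        # SD R1, 30(R2): memory[130] = 7
m.ld(3, 30, 2)        # LD R3, 30(R2): R3 = memory[130]
m.dadd(4, 1, 3)       # DADD R4, R1, R3: R4 = 7 + 7
print(m.mem[130], m.regs[3], m.regs[4])   # 7 7 14
```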

Example

Describe the execution of

DADDIU R1, R2, #3

at each of the 5 execution stages.
Solution:

IF:
- Send the PC to the instruction memory and fetch the next instruction.
- Add 4 to the PC (length of the instruction)
ID:
- decode the instruction
- get R2 from the register file
- sign-extend the immediate value from the instruction
EX:
- send the value of R2 and the sign-extended immediate value to the ALU to perform the add
MEM:
- nothing to do here
WB:
- write the result to R1 in the register file
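
The five steps of this solution can be sketched in code. This is a simplified model, not the actual hardware: the function names are hypothetical, the register file is a plain list, and the immediate is assumed to be a 16-bit field.

```python
MASK64 = (1 << 64) - 1


def sign_extend(value, bits=16):
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def run_daddiu(pc, regs, dest, src, imm16):
    """Trace DADDIU dest, src, #imm16 through IF, ID, EX, MEM, WB."""
    # IF: the instruction is fetched at pc; add 4 to the PC.
    next_pc = pc + 4
    # ID: decode, read the source register, sign-extend the immediate.
    src_value = regs[src]
    imm = sign_extend(imm16)
    # EX: the ALU adds the register value and the immediate.
    alu_result = (src_value + imm) & MASK64
    # MEM: nothing to do for an ALU instruction.
    # WB: write the result to the destination register.
    regs[dest] = alu_result
    return next_pc


regs = [0] * 32
regs[2] = 10
next_pc = run_daddiu(0x100, regs, dest=1, src=2, imm16=3)
print(regs[1], hex(next_pc))   # 13 0x104
```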

Problems

5-stage pipeline 1

Describe the execution of
LD R1, 30(R2)
in the EX stage.
In the ID step, we decode the instruction and read the source registers.
Describe in words what is meant by decode the instruction.

The classic 5-stage pipeline

The simple 5-stage pipeline looks like this:

	clock number
Instruction number	1	2	3	4	5	6	7	8	9
Instruction i	IF	ID	EX	MEM	WB
Instruction i+1		IF	ID	EX	MEM	WB
Instruction i+2			IF	ID	EX	MEM	WB
Instruction i+3				IF	ID	EX	MEM	WB
Instruction i+4					IF	ID	EX	MEM	WB
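
The staggered layout in the table above can be generated with a short sketch. It assumes an ideal pipeline with no stalls, and the function name is hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(num_instructions):
    """Return one row per instruction; row k holds its stage in each clock."""
    total_clocks = num_instructions + len(STAGES) - 1
    rows = []
    for k in range(num_instructions):
        row = [""] * total_clocks
        for s, name in enumerate(STAGES):
            row[k + s] = name  # instruction k enters stage s in clock k + s + 1
        rows.append(row)
    return rows


for k, row in enumerate(pipeline_diagram(5)):
    print(f"i+{k}", *[f"{cell:>4}" for cell in row])
```

Five instructions need 9 clocks in total, matching the 9 columns of the table.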

Today's News: September 2, 2015

No news

Starting with clock number 5, one instruction completes per clock cycle.
Figure C.2 shows the hardware needed to support each stage of the pipeline.
We need to make sure that a piece of hardware does not need to do 2 things at once.
For example, an ADD will need to use the ALU in stage 3 (the EX stage) and a LD will need to compute an effective address (by adding a displacement to a register) in the same stage.
However, a branch will need to compute the branch address in stage 2 (ID), which requires an adder.
We also need an adder in stage 1 (IF) to increment the PC.

IF accesses the Instruction Memory (IM), but also needs to increment the program counter (needs an adder)
ID needs to read from the register file (but not write to it) and possibly compute a branch address.
EX does an ALU operation or calculates a memory address (not both since this is RISC)
MEM accesses memory if this is a load or store
WB: if ALU or load, write to the register file.

Note that the register file is accessed in both ID and WB.
We assume that we can read and write in the same cycle.
In fact, we write in the first half of the cycle and read in the second half, so a value written in WB can be read by an instruction in ID during the same cycle.
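
A minimal sketch of this write-then-read ordering, assuming a register file modeled as a list; the class and method names are hypothetical.

```python
class RegisterFile:
    """Toy register file: the write happens before the reads in each cycle."""

    def __init__(self):
        self.regs = [0] * 32

    def cycle(self, write=None, reads=()):
        # First part of the cycle: WB writes (address, value), if any.
        if write is not None:
            addr, value = write
            if addr != 0:  # R0 is always 0
                self.regs[addr] = value
        # Later part of the cycle: ID reads its source registers.
        return [self.regs[addr] for addr in reads]


rf = RegisterFile()
# WB writes 13 to R1 while a later instruction in ID reads R1 and R2.
print(rf.cycle(write=(1, 13), reads=(1, 2)))   # [13, 0]
```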

Implementation requirement: pipeline registers

At each stage, certain values need to be saved so they do not change during the next stage.
Example: ALU is combinational logic

This means that the outputs can change soon after the inputs change
In Figure C.1, the same ALU is used every clock cycle
A value computed in stage 3 (EX) might be used in stage 4 (MEM) or stage 5 (WB).
A register (sequential logic) can hold the results after each clock cycle, until the next one.

Figure C.3 shows the pipeline registers required.
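
The difference between combinational logic and a pipeline register can be sketched as follows. This is a software analogy only, and the names are hypothetical.

```python
def alu_add(a, b):
    """Combinational logic: the output depends only on the current inputs."""
    return a + b


class PipelineRegister:
    """Sequential logic: the visible output changes only on a clock edge."""

    def __init__(self):
        self.output = None   # value seen by the next stage
        self._input = None   # value driven by the previous stage

    def drive(self, value):
        self._input = value

    def clock_edge(self):
        self.output = self._input


ex_mem = PipelineRegister()
ex_mem.drive(alu_add(10, 3))   # EX computes a result for instruction i
ex_mem.clock_edge()            # end of the cycle: the result is saved
ex_mem.drive(alu_add(7, 7))    # EX is already working on instruction i+1
print(ex_mem.output)           # 13: MEM still sees the result for instruction i
```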

Questions:

Figure C.3 ALU and Pipeline

In Figure C.3, how many different ALUs are shown?
Answer:
In Figure C.3, how many different pipeline registers are shown?
Answer:

Today's News: September 4, 2015

No recitation next week since Monday is Labor Day

Example

Describe what needs to be stored in each of the pipeline registers during the execution of

DADDIU R1, R2, #3

Solution:
It is easier to do this backwards, starting with the MEM/WB to make sure everything that is needed propagates.
Only values from the previous pipeline register and those computed at the current stage are available to be saved.
Look at the previous example describing what is needed at each stage.

IF/ID: The fetched instruction
ID/EX:
- the value of R2 (from the register file)
- the sign-extended immediate value (from the IF/ID register, with the help of some hardware)
- the address of the R1 register (from the IF/ID register)
- the following decoded control lines: ALU, MEM, WB (generated by hardware from the IF/ID register)
EX/MEM:
- the result of the ALU operation
- the address of the R1 register (from ID/EX)
- the following control lines (from ID/EX): MEM, WB
MEM/WB:
- the result of the ALU operation (from EX/MEM)
- the address of the R1 register (from EX/MEM)
- the control lines for WB (from EX/MEM)
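
The contents listed above can be sketched with one small record per pipeline register. The class and field names are hypothetical, and the control lines are modeled as a tuple of names rather than real signals.

```python
from dataclasses import dataclass


@dataclass
class IfId:
    instruction: str


@dataclass
class IdEx:
    src_value: int      # the value of R2
    imm: int            # the sign-extended immediate value
    dest_addr: int      # the address of R1
    control: tuple      # ALU, MEM, WB


@dataclass
class ExMem:
    alu_result: int
    dest_addr: int
    control: tuple      # MEM, WB


@dataclass
class MemWb:
    alu_result: int
    dest_addr: int
    control: tuple      # WB


def daddiu_pipeline_registers(regs, dest, src, imm):
    """Each register is filled only from the previous one or the current stage."""
    if_id = IfId(f"DADDIU R{dest}, R{src}, #{imm}")
    id_ex = IdEx(regs[src], imm, dest, ("ALU", "MEM", "WB"))
    ex_mem = ExMem(id_ex.src_value + id_ex.imm, id_ex.dest_addr, id_ex.control[1:])
    mem_wb = MemWb(ex_mem.alu_result, ex_mem.dest_addr, ex_mem.control[1:])
    return if_id, id_ex, ex_mem, mem_wb


regs = [0] * 32
regs[2] = 10
for reg in daddiu_pipeline_registers(regs, dest=1, src=2, imm=3):
    print(reg)
```

Note that the address of R1 has to travel through ID/EX, EX/MEM, and MEM/WB, because WB needs it and the instruction itself is no longer available by then.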

A Note on terminology

When discussing registers, you should be careful to distinguish between

the contents of the register
the address of the register

If the register R1 contains the value 234 then

The value of R1 is 234
The address of R1 is 1

If you just say R1, it may be ambiguous what you mean.
Always either say "the value of" or "the address of" when referring to a register.
In the example above, the ID/EX pipeline register stores

the value of R2
the address of R1 (the number 1)

Pipeline performance

Pipelining increases throughput
Pipelining does not reduce latency
Usually pipelining increases latency (slightly)
Clock runs at a rate determined by the slowest stage in the pipeline.

Examples

An unpipelined machine has a 1 ns clock.
All instructions take 5 cycles, except for branches which take 2 cycles. Branches are 30% of all instructions.
What is the speedup obtained by using a pipelined design if the pipelining increases the clock cycle time to 1.5 ns?
Solution:
Unpipelined CPI = .7 × 5 + .3 × 2 = 4.1
Unpipelined average instruction execution time: 1 ns × 4.1 = 4.1 ns.
Pipelined average instruction execution time: 1.5 ns.
Speedup = (average instruction execution time unpipelined) / (average instruction execution time pipelined) = 4.1 ns / 1.5 ns ≈ 2.733.
Why did we ignore the latency of the pipelined machine in the above solution?
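
The arithmetic in the solution above can be checked with a short sketch. It assumes, as the solution does, that the pipelined machine completes one instruction per cycle with no stalls; the function name is hypothetical.

```python
def average_cpi(mix):
    """mix is a list of (fraction of instructions, cycles) pairs."""
    return sum(fraction * cycles for fraction, cycles in mix)


unpipelined_cpi = average_cpi([(0.7, 5), (0.3, 2)])
unpipelined_time_ns = 1.0 * unpipelined_cpi   # 1 ns clock
pipelined_time_ns = 1.5 * 1                   # 1.5 ns clock, one instruction per cycle
speedup = unpipelined_time_ns / pipelined_time_ns

print(round(unpipelined_cpi, 3), round(speedup, 3))   # 4.1 2.733
```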
