CS 3853 Computer Architecture Notes on Appendix B Section 2

Read Appendix B.2

B.2: Cache Performance


ClassQue: Cache Performance 1

Example 1
Compare the miss ratios and access times of:
  1. 16KB instruction cache and 64KB data cache
  2. 256KB unified cache
Make reasonable assumptions to solve the problem.
Solution:
Assumptions:
  • Misses per 1000 instructions are given in Figure B.6 (on page B-15) as follows:
    16KB instruction: 3.82
    64KB data: 36.9
    256KB unified: 32.9
    These assume 36% of instructions are loads and stores, as in some SPEC benchmarks.
    Assume a 2-way set associative cache with 64-byte blocks.
  • A hit takes 1 cycle
  • Miss penalty is 50 cycles
  • A load or store takes an extra cycle in the unified cache because of the structural hazard.
  • Ignore stalls due to write-through.
miss ratio_split = (3.82 + 36.9)/(1.36 × 1000) = .02994
miss ratio_unified = 32.9/(1.36 × 1000) = .02419
The unified miss ratio is better!
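As a quick check, here is a small Python sketch of the same arithmetic; the variable names (accesses_per_instruction, misses_per_1000_split, and so on) are just labels chosen for this illustration.

  # Miss ratio = misses per instruction / memory accesses per instruction.
  # One instruction fetch plus 0.36 data accesses gives 1.36 accesses per instruction.
  accesses_per_instruction = 1.36
  misses_per_1000_split = 3.82 + 36.9     # 16KB instruction + 64KB data
  misses_per_1000_unified = 32.9          # 256KB unified

  miss_ratio_split = misses_per_1000_split / (accesses_per_instruction * 1000)
  miss_ratio_unified = misses_per_1000_unified / (accesses_per_instruction * 1000)

  print(miss_ratio_split)    # about .02994
  print(miss_ratio_unified)  # about .02419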



This does not take into account the extra stall due to the structural hazard in the unified cache.
To calculate the average memory access time:
average memory access time = hit time + miss ratio × miss penalty
access time_split = 1 + .02994 × 50 = 2.497 cycles.
access time_unified = 1 + .36 + .02419 × 50 = 2.57 cycles.
The split access time is better!
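Continuing in the same style, a self-contained sketch of the access-time arithmetic; adding .36 cycles for the unified cache is this example's simplified way of charging the one-cycle structural-hazard stall for loads and stores.

  hit_time = 1                # cycles
  miss_penalty = 50           # cycles
  load_store_fraction = 0.36  # extra cycle per load/store in the unified cache
  miss_ratio_split = 0.02994
  miss_ratio_unified = 0.02419

  access_time_split = hit_time + miss_ratio_split * miss_penalty
  access_time_unified = hit_time + load_store_fraction + miss_ratio_unified * miss_penalty

  print(access_time_split)    # about 2.497 cycles
  print(access_time_unified)  # about 2.57 cycles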

The next example explores the performance of direct mapped and set associative caches.
For a cache of a given size, higher associativity generally gives a higher hit ratio.
However, higher associativity requires additional hardware (and time) to check the tags, even on a hit.
This might require increasing the clock cycle time.
Example 2
Which is faster, a direct mapped cache with a cycle time of .4 ns, or
a 2-way set associative cache with a cycle time of .45 ns?
We need some additional assumptions to do this problem:
  1. 1.3 memory accesses per instruction
  2. CPI of 1 with no cache misses
  3. miss penalty of 21 ns
  4. miss rate of direct mapped cache: 2.3%
  5. miss rate of 2-way set associative cache: 2.1%
  6. these are unified caches, but with no structural hazard
Solution
First, we need to know the miss penalty in cycles for each:
miss penalty_direct = 21 ns/.4 ns = 52.5 cycles
miss penalty_2-way = 21 ns/.45 ns = 46.67 cycles
We round up the number of cycles for the miss penalty.
Second we calculate CPI for each:
CPI_direct = 1 + 1.3 × .023 × 53 = 2.5847
CPI_2-way = 1 + 1.3 × .021 × 47 = 2.2831
What we really want is the time per instruction:
Time per instruction_direct = 2.5847 × .4 ns = 1.0339 ns.
Time per instruction_2-way = 2.2831 × .45 ns = 1.0274 ns.
In this case the 2-way cache is better by .6%.
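The same steps can be written as a small Python sketch; the helper time_per_instruction and its parameter names are just chosen for this illustration.

  import math

  def time_per_instruction(cycle_time_ns, miss_rate, miss_penalty_ns,
                           accesses_per_instruction=1.3, base_cpi=1.0):
      # Convert the miss penalty to cycles and round up, as in the example.
      miss_penalty_cycles = math.ceil(miss_penalty_ns / cycle_time_ns)
      cpi = base_cpi + accesses_per_instruction * miss_rate * miss_penalty_cycles
      return cpi * cycle_time_ns

  print(time_per_instruction(0.40, 0.023, 21))  # direct mapped: about 1.0339 ns
  print(time_per_instruction(0.45, 0.021, 21))  # 2-way:         about 1.0274 ns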

With out-of-order execution, part of the miss penalty can be overlapped with the execution of other instructions.
Example 3
Redo the above problem if 30% of the miss penalty can be overlapped.
Solution:
We just have to reduce the miss penalty by 30% in each case.
CPI_direct = 1 + 1.3 × .023 × 53 × .7 = 2.1093
CPI_2-way = 1 + 1.3 × .021 × 47 × .7 = 1.8982
What we really want is the time per instruction:
Time per instruction_direct = 2.1093 × .4 ns = .8437 ns.
Time per instruction_2-way = 1.8982 × .45 ns = .8542 ns.
In this case the direct mapped cache is faster by 1.25%.
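A variant of the previous sketch with an overlap_fraction parameter (again, an illustrative name, not the textbook's notation) reproduces these numbers.

  import math

  def time_with_overlap(cycle_time_ns, miss_rate, miss_penalty_ns, overlap_fraction,
                        accesses_per_instruction=1.3, base_cpi=1.0):
      miss_penalty_cycles = math.ceil(miss_penalty_ns / cycle_time_ns)
      # Only the non-overlapped fraction of the miss penalty stalls the processor.
      effective_penalty = miss_penalty_cycles * (1 - overlap_fraction)
      cpi = base_cpi + accesses_per_instruction * miss_rate * effective_penalty
      return cpi * cycle_time_ns

  print(time_with_overlap(0.40, 0.023, 21, 0.30))  # direct mapped: about .8437 ns
  print(time_with_overlap(0.45, 0.021, 21, 0.30))  # 2-way:         about .8542 ns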
