CS 3853 Computer Architecture Notes on Appendix B Section 2

Read Appendix B.2

B.2: Cache Performance

Neither hit ratio nor miss ratio alone is a good measure of cache performance.
We will use Average memory access time = Hit time + Miss rate × Miss penalty.
Note that this is only a measure of the memory system and not the performance of an entire computer system.
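The formula above can be written as a small function. This is only a sketch: the function name and the sample numbers are illustrative assumptions, not values from the text.

```python
def average_memory_access_time(hit_time, miss_rate, miss_penalty):
    """Return hit time + miss rate * miss penalty.

    hit_time and miss_penalty must be in the same unit (cycles or ns);
    miss_rate is a fraction between 0 and 1.
    """
    return hit_time + miss_rate * miss_penalty


# Illustrative values only: 1-cycle hit, 2% miss rate, 100-cycle miss penalty.
example_amat = average_memory_access_time(hit_time=1, miss_rate=0.02, miss_penalty=100)
print(f"average memory access time: {example_amat:.2f} cycles")
```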
This section has little new information, but gives a number of examples of calculating cache performance.
Here are some of the issues presented:
1. Advantages of separate instruction and data caches
  - With a unified single-port cache, there is a structural hazard on each load or store
  - This requires a stall cycle on each load or store.
2. How much associativity is best?
  - the higher the associativity, the lower the miss rate (up to a point).
  - the higher the associativity, the more hardware is needed (increased cost).
  - with higher associativity, may have to reduce the clock rate.
3. The higher the clock rate and the lower the CPI, the greater the effect of cache misses.
  (With a higher clock rate, the miss penalty is a larger number of cycles.)
4. Does average memory access time predict processor performance?
  - Cache performance has no effect on I/O.
5. Miss penalty has a smaller effect for processors that support out-of-order execution.
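Issue 3 can be illustrated with a short calculation. The sketch below assumes a hypothetical fixed memory latency of 50 ns and three hypothetical clock rates. None of these numbers come from the text. It shows that the same latency costs more cycles as the clock gets faster.

```python
import math


def miss_penalty_cycles(memory_latency_ns, clock_ghz):
    """Convert a fixed memory latency into whole clock cycles (rounded up)."""
    cycle_time_ns = 1.0 / clock_ghz
    return math.ceil(memory_latency_ns / cycle_time_ns)


# Hypothetical numbers: a 50 ns miss latency at three clock rates.
for clock_ghz in (1.0, 2.0, 4.0):
    cycles = miss_penalty_cycles(50.0, clock_ghz)
    print(f"{clock_ghz} GHz -> miss penalty of {cycles} cycles")
```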

Today's News: March 7, 2013

Example 1

Compare the miss rates of:

16KB instruction cache and 64KB data cache
256KB unified cache

Make reasonable assumptions to solve the problem.
Solution:

Miss rates per 1000 instructions are given in Figure B.6 as follows:
16KB instruction: 3.82
64KB data: 36.9
256KB unified: 32.9
These assume 36% of instructions are loads and stores, as with some SPEC benchmarks.
Assume a 2-way set associative cache with 64-byte blocks.
A hit takes 1 cycle.
Miss penalty is 50 cycles.
A load or store takes an extra cycle because of the structural hazard in the case of the unified cache.
Ignore stalls due to write-through.

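What fraction of memory references are instruction fetches, and what fraction are data accesses?
Each instruction makes 1 instruction fetch and, on average, .36 data references, for 1.36 memory references per instruction.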

fraction instruction references = 1/1.36 = .7353.
fraction data references = .36/1.36 = .2647 (= 1 - .7353)

miss rate_instruction = 3.82/1000 = .00382
miss rate_data = 36.9/(.36 × 1000) = .1025
miss rate_split = .7353 × .00382 + .2647 × .1025 = .0299
miss rate_unified = 32.9/(1.36 × 1000) = .0242
The unified miss rate is better!
This does not take into account the extra stall due to the structural hazard in the unified cache.
access time = hit time + miss rate × miss penalty
access time_split = 1 + .0299 × 50 = 2.495 cycles.
access time_unified = 1 + .36 + .0242 × 50 = 2.57 cycles.
The split access time is better!
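The miss-rate arithmetic in this solution can be checked with a few lines of Python. This is a sketch that covers only the miss rates, not the access times. The variable names are illustrative, and the input numbers are the ones quoted above from Figure B.6.

```python
# Misses per 1000 instructions, as quoted above from Figure B.6.
MISSES_PER_1000_INSTR = {
    "instruction_16KB": 3.82,
    "data_64KB": 36.9,
    "unified_256KB": 32.9,
}
DATA_REFS_PER_INSTR = 0.36  # loads and stores per instruction

# Each instruction makes one fetch plus 0.36 data references.
refs_per_instr = 1 + DATA_REFS_PER_INSTR
frac_instr = 1 / refs_per_instr
frac_data = DATA_REFS_PER_INSTR / refs_per_instr

# Convert misses per 1000 instructions into misses per memory reference.
miss_rate_instr = MISSES_PER_1000_INSTR["instruction_16KB"] / 1000
miss_rate_data = MISSES_PER_1000_INSTR["data_64KB"] / (DATA_REFS_PER_INSTR * 1000)
miss_rate_split = frac_instr * miss_rate_instr + frac_data * miss_rate_data
miss_rate_unified = MISSES_PER_1000_INSTR["unified_256KB"] / (refs_per_instr * 1000)

print(f"fraction instruction references: {frac_instr:.4f}")
print(f"fraction data references:        {frac_data:.4f}")
print(f"miss rate, split caches:         {miss_rate_split:.4f}")
print(f"miss rate, unified cache:        {miss_rate_unified:.4f}")
```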

The next example explores the performance of direct mapped and set associative caches.
For a cache of a given size, higher associativity gives a higher hit ratio.
More associativity requires additional hardware (and time) to check a tag (even on a hit).
This might require increasing the clock cycle time.
Example 2

Which is faster, a direct mapped cache with a cycle time of .4 ns, or
a 2-way set associative cache with a cycle time of .45 ns?
We need some additional assumptions to do this problem:

1.3 memory accesses per instruction
CPI of 1 with no cache misses
miss penalty of 21 ns
miss rate of direct mapped cache: 2.3%
miss rate of 2-way set associative cache: 2.1%

First, we need to know the miss penalty in cycles for each:

miss penalty_direct = 21ns/.4ns = 52.5 cycles
miss penalty_2-way = 21ns/.45ns = 46.67 cycles

We round the miss penalty up to a whole number of cycles: 53 cycles for direct mapped and 47 cycles for 2-way.
Second, we calculate CPI for each:

CPI_direct = 1 + 1.3 × .023 × 53 = 2.5847
CPI_2-way = 1 + 1.3 × .021 × 47 = 2.2831

What we really want is time:

Time per instruction_direct = 2.5847 × .4 ns = 1.0339 ns.
Time per instruction_2-way = 2.2831 × .45 ns = 1.0274 ns.

In this case the 2-way cache is better by .6%.
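The steps of Example 2 can be sketched in Python as follows. The function and constant names are illustrative. The numbers are the assumptions listed in the example, and the miss penalty is rounded up to whole cycles as above.

```python
import math

MISS_PENALTY_NS = 21.0
ACCESSES_PER_INSTR = 1.3
BASE_CPI = 1.0  # CPI with no cache misses


def time_per_instruction_ns(cycle_ns, miss_rate):
    """Return the time per instruction in ns, including miss stalls."""
    # Round the miss penalty up to a whole number of cycles.
    penalty_cycles = math.ceil(MISS_PENALTY_NS / cycle_ns)
    cpi = BASE_CPI + ACCESSES_PER_INSTR * miss_rate * penalty_cycles
    return cpi * cycle_ns


direct = time_per_instruction_ns(cycle_ns=0.40, miss_rate=0.023)
two_way = time_per_instruction_ns(cycle_ns=0.45, miss_rate=0.021)
print(f"direct mapped: {direct:.4f} ns per instruction")
print(f"2-way:         {two_way:.4f} ns per instruction")
```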

With out-of-order execution, part of the miss penalty can be overlapped with the execution of other instructions.
Example 3

Redo the above problem if 30% of the miss penalty can be overlapped.
Solution:

We just have to reduce the miss penalty by 30% in each case.

CPI_direct = 1 + 1.3 × .023 × 53 × .7 = 2.1093
CPI_2-way = 1 + 1.3 × .021 × 47 × .7 = 1.8982

What we really want is time:

Time per instruction_direct = 2.1093 × .4 ns = .8437 ns.
Time per instruction_2-way = 1.8982 × .45 ns = .8542 ns.

In this case the direct mapped cache is faster by about 1.2%.
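Example 3 can be sketched the same way, with the overlapped part of the miss penalty as a parameter. The names are illustrative. As in the solution above, the 30% reduction is applied to the rounded-up miss penalty in cycles.

```python
import math

MISS_PENALTY_NS = 21.0
ACCESSES_PER_INSTR = 1.3
BASE_CPI = 1.0  # CPI with no cache misses


def time_per_instruction_ns(cycle_ns, miss_rate, overlap_fraction=0.0):
    """Return the time per instruction in ns when part of each miss is hidden."""
    penalty_cycles = math.ceil(MISS_PENALTY_NS / cycle_ns)
    # Only the part of the penalty that is not overlapped stalls the processor.
    exposed_cycles = penalty_cycles * (1.0 - overlap_fraction)
    cpi = BASE_CPI + ACCESSES_PER_INSTR * miss_rate * exposed_cycles
    return cpi * cycle_ns


direct = time_per_instruction_ns(0.40, 0.023, overlap_fraction=0.30)
two_way = time_per_instruction_ns(0.45, 0.021, overlap_fraction=0.30)
print(f"direct mapped: {direct:.4f} ns per instruction")
print(f"2-way:         {two_way:.4f} ns per instruction")
```

Setting overlap_fraction to 0 gives the Example 2 results again.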
