Main Idea of a Cache - keep a copy of frequently used information as "close" (w.r.t. access time) to the processor as possible.

Steps when the CPU generates a memory request:

1) check the (faster) cache first

2) If the addressed memory value is in the cache (called a hit), then no need to access memory

3) If the addressed memory value is NOT in the cache (called a miss), then transfer the block of memory containing the referenced address into the cache. (The CPU stalls while this transfer occurs.)

4) The cache then supplies the requested value to the CPU.

Effective Memory Access Time

Suppose that the hit time is 10 ns, the cache miss penalty is 1600 ns, and the hit rate is 99%.

Effective Access Time = (hit time * hit probability) + (miss penalty * miss probability)

Effective Access Time = 10 * 0.99 + 1600 * (1 - 0.99) = 9.9 + 16.0 = 25.9 ns
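
As a sanity check, here is a minimal C sketch of the same calculation (the numbers are the ones from the example above):

    #include <stdio.h>

    /* Effective access time = hit_time * hit_rate + miss_penalty * (1 - hit_rate).
       Values taken from the example above. */
    int main(void) {
        double hit_time     = 10.0;    /* ns */
        double miss_penalty = 1600.0;  /* ns */
        double hit_rate     = 0.99;

        double eat = hit_time * hit_rate + miss_penalty * (1.0 - hit_rate);
        printf("Effective access time = %.1f ns\n", eat);  /* prints 25.9 */
        return 0;
    }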

(One way to reduce the miss penalty is to not have the cache wait for the whole block to be read from memory before supplying the accessed memory word - sometimes called "early restart" or "critical word first".)

Fortunately, programs exhibit locality of reference that helps achieve high hit-rates:

1) spatial locality - if a (logical) memory address is referenced, nearby memory addresses will tend to be referenced soon.

2) temporal locality - if a memory address is referenced, it will tend to be referenced again soon.
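
For illustration, here is a hypothetical C loop that exhibits both kinds of locality:

    #include <stdio.h>

    #define N 1024

    /* Summing an array:
       - spatial locality: a[i] is accessed sequentially, so one cache line
         fetched on a miss supplies the next several elements as hits.
       - temporal locality: i and sum are touched on every iteration, so
         they stay in the cache (or in registers). */
    int main(void) {
        static int a[N];   /* zero-initialized */
        long sum = 0;
        for (int i = 0; i < N; i++)
            sum += a[i];
        printf("sum = %ld\n", sum);
        return 0;
    }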

Cache - Small fast memory between the CPU and RAM/Main memory.

Example: Number of Cache Lines = (cache size) / (line size). For instance, an 8 KB cache with 64-byte lines has 8192/64 = 128 lines.

Three Types of Cache:

1) Direct-mapped - a memory block maps to a single cache line
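
A sketch of the direct-mapped address split, assuming a hypothetical cache of 256 lines with 64-byte lines (the sizes are illustrative):

    #include <stdio.h>
    #include <stdint.h>

    #define LINE_SIZE 64    /* bytes per line */
    #define NUM_LINES 256   /* lines in the cache */

    /* An address determines exactly one line: offset within the line,
       line index, and a tag that identifies which block occupies it. */
    int main(void) {
        uint32_t addr   = 0x12345678;
        uint32_t offset = addr % LINE_SIZE;
        uint32_t block  = addr / LINE_SIZE;   /* memory block number */
        uint32_t line   = block % NUM_LINES;  /* the single line it maps to */
        uint32_t tag    = block / NUM_LINES;  /* stored tag for hit checks */

        printf("addr 0x%08x -> tag 0x%x, line %u, offset %u\n",
               addr, tag, line, offset);
        return 0;
    }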


2) Fully-Associative Cache - a memory block can map to any cache line

Advantage: flexibility in what the cache can hold

Disadvantage: Complex circuit to compare all tags of the cache with the tag in the target address

Therefore, fully-associative caches are expensive and slower, so they are used only for small caches (say, 8-64 lines)

Replacement algorithms - on a miss in a full cache, we must select a block in the cache to replace; common choices are LRU (least recently used), FIFO, and random.
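
As an illustration, a minimal LRU sketch for a small fully-associative cache (the arrays and timestamps are assumed bookkeeping, not a particular hardware design):

    #include <stdio.h>
    #include <stdint.h>

    #define LINES 8  /* small fully-associative cache */

    static uint32_t tag_of[LINES];
    static int      valid[LINES];
    static unsigned last_used[LINES];  /* timestamp for LRU */
    static unsigned now;

    /* Returns 1 on a hit, 0 on a miss (after filling a victim line). */
    int access_block(uint32_t tag) {
        now++;
        for (int i = 0; i < LINES; i++)           /* compare ALL tags */
            if (valid[i] && tag_of[i] == tag) { last_used[i] = now; return 1; }

        int victim = 0;                           /* miss: pick an invalid or the LRU line */
        for (int i = 0; i < LINES; i++) {
            if (!valid[i]) { victim = i; break; }
            if (last_used[i] < last_used[victim]) victim = i;
        }
        valid[victim] = 1; tag_of[victim] = tag; last_used[victim] = now;
        return 0;
    }

    int main(void) {
        int r1 = access_block(1);  /* cold miss */
        int r2 = access_block(2);  /* cold miss */
        int r3 = access_block(1);  /* hit */
        printf("%d %d %d\n", r1, r2, r3);  /* prints "0 0 1" */
        return 0;
    }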

3) Set-Associative Cache - a memory block can map to a small (2, 4, or 8) set of cache lines
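
A sketch of the set-index calculation, assuming a hypothetical 4-way set-associative cache with 256 total lines of 64 bytes each:

    #include <stdio.h>
    #include <stdint.h>

    #define LINE_SIZE 64
    #define NUM_LINES 256
    #define WAYS      4                    /* lines per set */
    #define NUM_SETS  (NUM_LINES / WAYS)   /* 64 sets */

    /* The block chooses exactly one set; within that set it may occupy
       any of the WAYS lines, so only WAYS tags must be compared. */
    int main(void) {
        uint32_t addr  = 0x12345678;
        uint32_t block = addr / LINE_SIZE;
        uint32_t set   = block % NUM_SETS;
        uint32_t tag   = block / NUM_SETS;

        printf("addr 0x%08x -> set %u, tag 0x%x (one of %d ways)\n",
               addr, set, tag, WAYS);
        return 0;
    }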

Common Possibilities: 2-way, 4-way, or 8-way set-associative

Number of Sets = (total number of cache lines) / (associativity)

Block/Line Size

The larger the line size: the more spatial locality is exploited on each miss, but the higher the miss penalty (more bytes to transfer) and, for a fixed cache size, the fewer lines the cache holds.

Number of Caches:

Issues:

split caches - separate smaller caches for data and instructions

unified cache - data and instructions in the same cache

Advantages of each:

split caches - reduces contention for "memory" between instruction and data accesses

unified cache - balances the load between instructions and data automatically (e.g., a tight loop might need more data blocks than instruction blocks)

Write Policy - do we keep the cache and memory copies of a block identical?

Just reading a shared variable causes no problems - all caches have the same value

Writing can cause a "cache-coherency problem"

Write Policies

write back - the CPU changes only its local cache copy until that block is replaced, at which point it is written back to memory (an UPDATE/DIRTY bit is associated with each cache line to indicate whether it has been modified). For example, if two CPUs cache the same block containing X and both update their copies, and CPU 0 writes the block back to memory before CPU 1, then X's resulting value will be 6, thereby discarding the effect of CPU 0's X := X + 2.

Disadvantage(s) of write back? Main memory can be stale, which creates the cache-coherency problem above, and evicting a dirty block costs a whole-block write.

Advantage(s) of write back? Repeated writes to the same block generate only one memory write (at replacement), reducing memory traffic.
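
A minimal C sketch of the dirty-bit mechanism (the line layout and the printf standing in for a memory write are illustrative assumptions):

    #include <stdio.h>
    #include <stdint.h>

    struct line { uint32_t tag; int valid; int dirty; uint8_t data[64]; };

    /* A write updates only the cached copy and marks the line dirty. */
    void write_byte(struct line *ln, int offset, uint8_t value) {
        ln->data[offset] = value;
        ln->dirty = 1;            /* memory copy is now stale */
    }

    /* Only at replacement is a modified block written back to memory. */
    void evict(struct line *ln) {
        if (ln->valid && ln->dirty)
            printf("write back block with tag 0x%x\n", ln->tag);
        ln->valid = ln->dirty = 0;
    }

    int main(void) {
        struct line ln = { .tag = 0x48D1, .valid = 1, .dirty = 0 };
        write_byte(&ln, 0, 42);
        write_byte(&ln, 1, 43);   /* repeated writes: still only one write back */
        evict(&ln);               /* prints the single write-back */
        return 0;
    }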

write through - on a write to a cache block, write to the main memory copy to keep it up to date

(If the write occurs to a block not in the cache, then typically the write only occurs to memory and a block is not allocated in the cache.)

Unfortunately, write-through does not completely solve the cache-coherency problem, but it helps.

To prevent the CPU from stalling on a write, a write buffer can be used to buffer the write request so the CPU can go back to work without waiting on the slow memory. If the write buffer fills up, the CPU must stall.
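
A sketch of such a write buffer as a small FIFO queue (the 4-entry depth and function names are assumptions):

    #include <stdio.h>
    #include <stdint.h>

    #define DEPTH 4

    struct wb_entry { uint32_t addr, value; };
    static struct wb_entry buf[DEPTH];
    static int count;

    /* CPU side: returns 0 if the write was buffered, 1 if the CPU must stall. */
    int buffer_write(uint32_t addr, uint32_t value) {
        if (count == DEPTH) return 1;  /* buffer full: stall */
        buf[count].addr = addr;
        buf[count].value = value;
        count++;
        return 0;
    }

    /* Memory side: retire the oldest buffered write. */
    void drain_one(void) {
        if (count == 0) return;
        printf("memory[0x%x] <- %u\n", buf[0].addr, buf[0].value);
        for (int i = 1; i < count; i++) buf[i - 1] = buf[i];
        count--;
    }

    int main(void) {
        for (uint32_t i = 0; i < 5; i++)
            if (buffer_write(0x1000 + 4 * i, i))
                printf("CPU stalls on write %u\n", i);  /* the 5th write stalls */
        drain_one();  /* frees one slot for future writes */
        return 0;
    }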

Cache Coherency Solutions

a) bus watching with write through / Snoopy caches - caches eavesdrop on the bus for other caches' write requests. If a cache contains a block written by another cache, it takes some action, such as invalidating its copy of that block.

The MESI protocol (Modified, Exclusive, Shared, Invalid) is a common cache-coherency protocol.
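
A toy sketch of snooping invalidation (two caches of a few lines each; this shows only the invalidate action, not the full MESI state machine):

    #include <stdio.h>
    #include <stdint.h>

    #define CACHES 2
    #define LINES  4

    struct line { uint32_t tag; int valid; };
    static struct line cache[CACHES][LINES];

    /* Every cache except the writer snoops the bus write and
       invalidates its own copy of the written block, if present. */
    void bus_write(int writer, uint32_t tag) {
        for (int c = 0; c < CACHES; c++) {
            if (c == writer) continue;
            for (int i = 0; i < LINES; i++)
                if (cache[c][i].valid && cache[c][i].tag == tag) {
                    cache[c][i].valid = 0;
                    printf("cache %d invalidates its copy of block 0x%x\n", c, tag);
                }
        }
    }

    int main(void) {
        cache[0][0] = (struct line){ 0x10, 1 };  /* both caches hold block 0x10 */
        cache[1][0] = (struct line){ 0x10, 1 };
        bus_write(0, 0x10);  /* CPU 0 writes: cache 1 drops its stale copy */
        return 0;
    }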

b) noncacheable memory - part of memory is designated as noncacheable, so the memory copy is the only copy. Accesses to this part of memory always generate a "miss".

c) Hardware transparency - additional hardware is added to update all caches and memory at once on a write.