### Fundamentals of Computer Systems Caches

# **Timothy K Paine**

Columbia University

# Summer 2024

Illustrations Copyright (c) 2007 Elsevier

### **Computer Systems**

Performance depends on whichever is slower: the processor or the memory system







Our single-cycle memory assumption has been wrong since 1980.

Hennessy and Patterson. *Computer Architecture: A Quantitative Approach.* 3rd ed., Morgan Kaufmann, 2003.

# Your Choice of Memories

|                | Fast | Cheap | Large |
|----------------|------|-------|-------|
| On-Chip SRAM   | •    | •     |       |
| Commodity DRAM |      | •     | •     |
| Supercomputer  | •    |       | ~     |

## **Memory Hierarchy**

An essential trick that makes big memory appear fast

| Technology | Cost (\$/Gb) | Access Time (ns) | Density (Gb/cm²) |
|------------|--------------|------------------|------------------|
| SRAM       | 30000        | 0.5              | 0.00025          |
| DRAM       | 10           | 100              | 1 – 16           |
| Flash      | 2            | 300*             | 8 – 32           |
| Hard Disk  | 0.2          | 10000000         | 500 – 2000       |

\*Read speed; writing much, much slower

# A Modern Memory Hierarchy



A desktop machine: AMD Phenom 9600, quad-core, 2.3 GHz, 1.1–1.25 V, 95 W, 65 nm

| Level           | Size   | Tech.    |
|-----------------|--------|----------|
| L1 Instruction* | 64 K   | SRAM     |
| L1 Data*        | 64 K   | SRAM     |
| L2 *            | 512 K  | SRAM     |
| L3              | 2 MB   | SRAM     |
| Memory          | 4 GB   | DRAM     |
| Disk            | 500 GB | Magnetic |

\* per core

# A Simple Memory Hierarchy



First level: small, fast storage (typically SRAM)

*Figure: memory holding the strings `"Hip hip hooray!\0"` and `"hip\0"`*

Last level: large, slow storage (typically DRAM)

*Can fit a subset of lower level in upper level, but which subset?* 

# Locality Example: Substring Matching





Temporal locality: if you needed X recently, you are likely to need X again soon

Spatial locality: if you need X, you are likely to also need something near X
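A minimal sketch of the kind of code this slide has in mind, assuming a naive search for the pattern `hip` in the text `Hip hip hooray!`. The function name and the choice of a naive algorithm are illustrative assumptions, not taken from the slides.

```python
def find_substring(text: str, pattern: str) -> int:
    """Naive substring search: return the first index of pattern in text, or -1."""
    for start in range(len(text) - len(pattern) + 1):
        # Spatial locality: text[start], text[start + 1], ... are adjacent in memory.
        for offset in range(len(pattern)):
            # Temporal locality: the same few pattern characters are re-read
            # for every candidate starting position.
            if text[start + offset] != pattern[offset]:
                break
        else:
            return start  # every pattern character matched
    return -1


print(find_substring("Hip hip hooray!", "hip"))  # 4
```

The pattern is tiny and touched over and over, so it is an obvious candidate to keep in fast storage; the text is walked front to back, so fetching neighbors of each character pays off.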

### Cache

- Highest levels of memory hierarchy
- Fast: level 1 typically 1 cycle access time
- With luck, supplies most data

Cache design questions:

- What data does it hold? Recently accessed
- How is data found? Simple address hash
- What data is replaced? Often the oldest

# What Data is Held in the Cache?

Ideal cache: always correctly guesses what you want before you want it.

Real cache: never that smart

#### **Caches Exploit**

#### **Temporal Locality**

Copy newly accessed data into cache, replacing oldest if necessary

#### **Spatial Locality**

Copy nearby data into the cache at the same time

Specifically, always read and write a **block**, also called a **line**, at a time (e.g., 64 bytes), never a single byte.
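To make block granularity concrete, here is a small sketch assuming the 64-byte block size given as the example above. The helper names and the sample address `0x1234` are hypothetical.

```python
BLOCK_SIZE = 64  # bytes per block (line); the example size from the text


def block_number(address: int) -> int:
    """Index of the block containing this byte address."""
    return address // BLOCK_SIZE


def block_offset(address: int) -> int:
    """Position of the byte within its block."""
    return address % BLOCK_SIZE


def block_range(address: int) -> tuple:
    """First and last byte addresses brought in when this address misses."""
    first = block_number(address) * BLOCK_SIZE
    return first, first + BLOCK_SIZE - 1


print(block_number(0x1234), block_offset(0x1234))  # 72 52
print([hex(a) for a in block_range(0x1234)])       # ['0x1200', '0x123f']
```

A miss on any byte in `0x1200`..`0x123f` brings in the same block, so later accesses to its neighbors hit.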

## **Memory Performance**

Hit: Data is found in this level of the memory hierarchy

Miss: Data not found; will look in next level

 $\text{Hit Rate} = \frac{\text{Number of hits}}{\text{Number of accesses}}$ 

 $\text{Miss Rate} = \frac{\text{Number of misses}}{\text{Number of accesses}}$ 



Hit Rate + Miss Rate = 1

The expected access time  $E_L$  for a memory level L with latency  $t_L$  and miss rate  $M_L$ :

 $E_L = t_L + M_L \cdot E_{L+1}$ 
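As an illustration, the recurrence can be unrolled for a hypothetical three-level hierarchy (levels 0, 1, 2), assuming the last level always hits so that $E_2 = t_2$:

 $E_0 = t_0 + M_0 \cdot E_1 = t_0 + M_0 \cdot (t_1 + M_1 \cdot E_2) = t_0 + M_0 \cdot t_1 + M_0 \cdot M_1 \cdot t_2$ 

Each deeper level's latency is weighted by the product of the miss rates of all the levels above it.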

### Memory Performance Example

Two-level hierarchy: cache and main memory

A program executes 1000 loads and stores; 750 of these are found in the cache.

What are the cache hit and miss rates?

> Hit Rate  $= \frac{750}{1000} = 75\%$ 
>
> Miss Rate  $= 1 - 0.75 = 25\%$ 

If the cache takes 1 cycle and main memory takes 100 cycles, what is the expected access time?

- Expected access time of main memory:  $E_1 = 100$  cycles
- Access time for the cache:  $t_0 = 1$  cycle
- Cache miss rate:  $M_0 = 0.25$ 

 $E_0 = t_0 + M_0 \cdot E_1 = 1 + 0.25 \cdot 100 = 26$  cycles
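The same arithmetic can be sketched in code. The function name and its list-based interface are assumptions made for illustration; it evaluates the recurrence from the last level upward and treats the last level as always hitting.

```python
def expected_access_time(latencies, miss_rates):
    """Evaluate E_L = t_L + M_L * E_(L+1), starting from the last level.

    latencies:  t_L for each level, fastest level first.
    miss_rates: M_L for every level except the last, which is assumed
                to always hit (so E_last = t_last).
    """
    expected = latencies[-1]
    for t, m in zip(reversed(latencies[:-1]), reversed(miss_rates)):
        expected = t + m * expected
    return expected


accesses, hits = 1000, 750
hit_rate = hits / accesses
miss_rate = 1 - hit_rate
print(hit_rate, miss_rate)                          # 0.75 0.25
print(expected_access_time([1, 100], [miss_rate]))  # 26.0
```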






# **Direct-Mapped Cache Behavior**

A dumb loop: repeat 5 times: load from 0x4; load from 0xC; load from 0x8.

|       | li    | \$t0, | 5         |
|-------|-------|-------|-----------|
| l1:   | beq   | \$t0, | \$0, done |
|       | lw    | \$t1, | 0x4(\$0)  |
|       | lw    | \$t2, | 0xC(\$0)  |
|       | lw    | \$t3, | 0x8(\$0)  |
|       | addiu | \$t0, | \$t0, -1  |
|       | j     | l1    |           |
| done: |       |       |           |

Memory address when reading 0x4:

| Tag  | Set (3 bits) | Byte Offset |
|------|--------------|-------------|
| 0000 | 001          | 00          |

Cache when reading 0x4 last time:

| V | Tag  | Data        | Set         |
|---|------|-------------|-------------|
| 0 |      |             | Set 7 (111) |
| 0 |      |             | Set 6 (110) |
| 0 |      |             | Set 5 (101) |
| 0 |      |             | Set 4 (100) |
| 1 | 0000 | mem[0x000C] | Set 3 (011) |
| 1 | 0000 | mem[0x0008] | Set 2 (010) |
| 1 | 0000 | mem[0x0004] | Set 1 (001) |
| 0 |      |             | Set 0 (000) |

Assuming the cache starts empty, what's the miss rate?

| Address | 4 | C | 8 | 4 | C | 8 | 4 | C | 8 | 4 | C | 8 | 4 | C | 8 |
|---------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Result  | M | M | M | H | H | H | H | H | H | H | H | H | H | H | H |

Miss rate = 3/15 = 0.2 = 20%
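The hit/miss sequence can be checked with a short simulation. This is a sketch assuming an 8-set direct-mapped cache with one 4-byte word per block, matching the three set bits and two byte-offset bits shown above. The function name is hypothetical.

```python
def simulate_direct_mapped(addresses, num_sets=8, block_bytes=4):
    """Return a list of 'M'/'H' results for a direct-mapped cache that starts empty."""
    tags = [None] * num_sets              # None plays the role of V = 0
    results = []
    for address in addresses:
        block = address // block_bytes    # drop the byte offset
        set_index = block % num_sets      # low bits of the block number pick the set
        tag = block // num_sets           # remaining high bits
        if tags[set_index] == tag:
            results.append("H")
        else:
            results.append("M")
            tags[set_index] = tag         # fill (or replace) the block in that set
    return results


trace = [0x4, 0xC, 0x8] * 5               # the three loads, repeated 5 times
results = simulate_direct_mapped(trace)
print("".join(results))                    # MMMHHHHHHHHHHHH
print(results.count("M") / len(results))   # 0.2
```

The three misses are compulsory: each address is loaded for the first time, and because the three addresses map to different sets nothing is ever evicted.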

## **Direct-Mapped Cache: Conflict**



These are *conflict misses* 

#### No Way! Yes Way! 2-Way Set Associative Cache



#### 2-Way Set Associative Behavior

|      | li    | \$t0, | 5         |
|------|-------|-------|-----------|
| l1:  | beq   | \$t0, | \$0, done |
|      | lw    | \$t1, | 0x4(\$0)  |
|      | lw    | \$t2, | 0x24(\$0) |
|      | addiu | \$t0, | \$t0, -1  |
|      | j     | l1    |           |
| done: |       |       |           |

Associativity reduces conflict misses

|   | Way 1 |             |   | Way 0 |             |       |
|---|-------|-------------|---|-------|-------------|-------|
| V | Tag   | Data        | V | Tag   | Data        |       |
| 0 |       |             | 0 |       |             | Set 3 |
| 0 |       |             | 0 |       |             | Set 2 |
| 1 | 0010  | mem[0x0024] | 1 | 0000  | mem[0x0004] | Set 1 |
| 0 |       |             | 0 |       |             | Set 0 |
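A rough simulation comparing the two organizations on the loop above, which alternates loads from 0x4 and 0x24. The assumptions here: both caches hold eight one-word blocks (8 sets × 1 way versus 4 sets × 2 ways), and replacement is least-recently-used. The function name is hypothetical.

```python
def simulate_set_associative(addresses, num_sets, ways, block_bytes=4):
    """Count misses in a set-associative cache with LRU replacement, starting empty."""
    # Each set is a list of tags ordered from least to most recently used.
    sets = [[] for _ in range(num_sets)]
    misses = 0
    for address in addresses:
        block = address // block_bytes
        tags = sets[block % num_sets]
        tag = block // num_sets
        if tag in tags:
            tags.remove(tag)              # hit: re-appended below as most recent
        else:
            misses += 1
            if len(tags) == ways:
                tags.pop(0)               # set is full: evict the least recently used
        tags.append(tag)
    return misses


trace = [0x4, 0x24] * 5
print(simulate_set_associative(trace, num_sets=8, ways=1))  # 10
print(simulate_set_associative(trace, num_sets=4, ways=2))  # 2
```

In the direct-mapped case both addresses fall in set 1 and keep evicting each other, so all 10 accesses miss. With two ways they coexist in set 1, leaving only the two compulsory misses.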

# An Eight-way Fully Associative Cache

|   | Way 7 |      |   | Way 6 |      |   | Way 5 |      |   | Way 4 |      |   | Way 3 |      |   | Way 2 |      |   | Way 1 |      |   | Way 0 |      |
|---|-------|------|---|-------|------|---|-------|------|---|-------|------|---|-------|------|---|-------|------|---|-------|------|---|-------|------|
| V | Tag   | Data | V | Tag   | Data | V | Tag   | Data | V | Tag   | Data | V | Tag   | Data | V | Tag   | Data | V | Tag   | Data | V | Tag   | Data |
|   |       |      |   |       |      |   |       |      |   |       |      |   |       |      |   |       |      |   |       |      |   |       |      |

No conflict misses: only compulsory or capacity misses

Either very expensive or slow because of all the associativity



- 2 sets
- 1 block per set (Direct Mapped)
- 4 words per block
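With these parameters an address splits into four fields. The sketch below assumes 4-byte words (as in the 32-bit MIPS examples), and the function name is hypothetical.

```python
def split_address(address, num_sets=2, words_per_block=4, bytes_per_word=4):
    """Split a byte address into (tag, set, block offset, byte offset) fields."""
    byte_offset = address % bytes_per_word
    word = address // bytes_per_word
    block_offset = word % words_per_block      # which word inside the block
    block = word // words_per_block
    set_index = block % num_sets
    tag = block // num_sets
    return tag, set_index, block_offset, byte_offset


for address in (0x4, 0x8, 0xC, 0x10):
    print(hex(address), split_address(address))
# 0x4 (0, 0, 1, 0)
# 0x8 (0, 0, 2, 0)
# 0xc (0, 0, 3, 0)
# 0x10 (0, 1, 0, 0)
```

Addresses 0x4, 0x8, and 0xC differ only in the block offset, so they live in the same block of set 0; 0x10 starts the next block, which maps to set 1.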


#### Direct-Mapped Cache Behavior w/ 4-word block




Larger blocks reduce compulsory misses by exploiting spatial locality
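A sketch of the effect, assuming the access pattern is the earlier loop that loads 0x4, 0xC, and 0x8 five times (the figure for this slide is not reproduced here, so that trace is an assumption). It compares an 8-set cache with one-word blocks against a 2-set cache with four-word (16-byte) blocks. The function name is hypothetical.

```python
def count_misses(addresses, num_sets, block_bytes):
    """Misses in a direct-mapped cache that starts empty."""
    tags = [None] * num_sets
    misses = 0
    for address in addresses:
        block = address // block_bytes
        set_index, tag = block % num_sets, block // num_sets
        if tags[set_index] != tag:
            misses += 1
            tags[set_index] = tag
    return misses


trace = [0x4, 0xC, 0x8] * 5
print(count_misses(trace, num_sets=8, block_bytes=4))    # 3  (one-word blocks)
print(count_misses(trace, num_sets=2, block_bytes=16))   # 1  (four-word blocks)
```

With four-word blocks the first miss on 0x4 also brings in 0x8 and 0xC, so only 1 of the 15 accesses misses.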

# Stephen's Desktop Machine Revisited



AMD Phenom 9600: quad-core, 2.3 GHz, 1.1–1.25 V, 95 W, 65 nm

On-chip caches:

| Cache | Size  | Sets | Ways   | Block   |
|------|--------|------|--------|---------|
| L1I* | 64 K   | 512  | 2-way  | 64-byte |
| L1D* | 64 K   | 512  | 2-way  | 64-byte |
| L2*  | 512 K  | 512  | 16-way | 64-byte |
| L3   | 2 MB   | 1024 | 32-way | 64-byte |

\* per core
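The table is internally consistent: each cache's size equals sets × ways × block size. A quick check, assuming K means 1024 bytes and MB means 1024 × 1024 bytes:

```python
# (name, capacity in bytes, sets, ways, block size in bytes), copied from the table
caches = [
    ("L1I", 64 * 1024, 512, 2, 64),
    ("L1D", 64 * 1024, 512, 2, 64),
    ("L2", 512 * 1024, 512, 16, 64),
    ("L3", 2 * 1024 * 1024, 1024, 32, 64),
]

for name, capacity, sets, ways, block in caches:
    computed = sets * ways * block
    print(name, computed, computed == capacity)
# L1I 65536 True
# L1D 65536 True
# L2 524288 True
# L3 2097152 True
```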

# Intel On-Chip Caches

| Chip        | Year | Freq. (MHz) | L1 Data      | L1 Instr           | L2                    |
|-------------|------|-------------|--------------|--------------------|-----------------------|
| 80386       | 1985 | 16–25       | off-chip     | off-chip           | none                  |
| 80486       | 1989 | 25–100      | 8K unified   | (unified)          | off-chip              |
| Pentium     | 1993 | 60–300      | 8K           | 8K                 | off-chip              |
| Pentium Pro | 1995 | 150–200     | 8K           | 8K                 | 256K–1M (MCM)         |
| Pentium II  | 1997 | 233–450     | 16K          | 16K                | 256K–512K (Cartridge) |
| Pentium III | 1999 | 450–1400    | 16K          | 16K                | 256K–512K             |
| Pentium 4   | 2001 | 1400–3730   | 8–16K        | 12k op trace cache | 256K–2M               |
| Pentium M   | 2003 | 900–2130    | 32K          | 32K                | 1M–2M                 |
| Core 2 Duo  | 2005 | 1500–3000   | 32K per core | 32K per core       | 2M–6M                 |