# Fundamentals of Computer Systems Caches

Stephen A. Edwards

**Columbia University** 

Summer 2016

Illustrations Copyright © 2007 Elsevier

### **Computer Systems**

Performance depends on which is slowest: the processor or the memory system



### Memory Speeds Haven't Kept Up



Our single-cycle memory assumption has been wrong since 1980.

Hennessy and Patterson. Computer Architecture: A Quantitative Approach. 3rd ed., Morgan Kaufmann, 2003.

### **Your Choice of Memories**

|                | Fast | Cheap | Large |
|----------------|------|-------|-------|
| On-Chip SRAM   | V    | V     |       |
| Commodity DRAM |      | •     | •     |
| Supercomputer  | ~    |       | ~     |

# Memory Hierarchy

Fundmental trick to making a big memory appear fast

| Technology | Cost<br>(\$/Gb) | Access Time (ns) | Density<br>(Gb/cm2) |
|------------|-----------------|------------------|---------------------|
| SRAM       | 30 000          | 0.5              | 0.00025             |
| DRAM       | 10              | 100              | 1 – 16              |
| Flash      | 2               | 300*             | 8 – 32              |
| Hard Disk  | 0.1             | 10 000 000       | 500 – 2000          |

<sup>\*</sup>Read speed; writing much, much slower

# A Modern Memory Hierarchy



AMD Phenom 9600 Quad-core 2.3 GHz 1.1–1.25 V 95 W 65 nm

#### My desktop machine:

| Level           | Size   | Tech.    |
|-----------------|--------|----------|
| L1 Instruction* | 64 K   | SRAM     |
| L1 Data*        | 64 K   | SRAM     |
| L2*             | 512 K  | SRAM     |
| L3              | 2 MB   | SRAM     |
| Memory          | 4 GB   | DRAM     |
| Disk            | 500 GB | Magnetic |
|                 |        |          |

<sup>\*</sup>per core

### **Temporal Locality**

#### FIRST BOOK

#### DEFINITIONS.

- 1 A point is that which is without parts.
- A line is length without breadth.
   I he extremities of a line are points.
- 4. A right line, is that which lies evenly between its extremities.
- 5. A superficies, is that which has only length and breadth.
- 6. The boundings of a superficies are lines.
- 7. A plane superficies, is that which lies evenly between its extreme right lines.
- 8. A rectilineal angle, is the inclination of two right lines to each other, which touch, but do not form one straight line.
- An angle is designated either by one letter at the vertex; or three, of which the middle one is at the vertex, the remaining two any place on the legs.
- 9. The legs of an angle, are the lines which make the angle.
- 10. The vertex of an angle is the point in which the legs mutually touch each other.

What path do your eyes take when you read this?

Did you look at the drawings more than once?

Euclid's Elements

### **Spatial Locality**



If you need something, you may also need something nearby

# **Memory Performance**

Hit: Data is found in the level of memory hierarchy

Miss: Data not found; will look in next level

$$Hit Rate = \frac{Number of hits}{Number of accesses}$$

Miss Rate = 
$$\frac{\text{Number of misses}}{\text{Number of accesses}}$$



$$E_L = t_L + M_L \cdot E_{L+1}$$



# **Memory Performance Example**

Two-level hierarchy: Cache and main memory Program executes 1000 loads & stores 750 of these are found in the cache What's the cache hit and miss rate?

### **Memory Performance Example**

Two-level hierarchy: Cache and main memory Program executes 1000 loads & stores 750 of these are found in the cache What's the cache hit and miss rate?

Hit Rate = 
$$\frac{750}{1000}$$
 = 75%  
Miss Rate = 1 – 0.75 = 25%

If the cache takes 1 cycle and the main memory 100, What's the expected access time?

### **Memory Performance Example**

Two-level hierarchy: Cache and main memory
Program executes 1000 loads & stores
750 of these are found in the cache
What's the cache hit and miss rate?

Hit Rate = 
$$\frac{750}{1000}$$
 = 75%  
Miss Rate = 1 – 0.75 = 25%

 $E_0 = t_0 + M_0 \cdot E_1 = 1 + 0.25 \cdot 100 = 26$  cycles

If the cache takes 1 cycle and the main memory 100, What's the expected access time? Expected access time of main memory:  $E_1 = 100$  cycles Access time for the cache:  $t_0 = 1$  cycle Cache miss rate:  $M_0 = 0.25$ 

### Cache

Highest levels of memory hierarchy

Fast: level 1 typically 1 cycle access time

With luck, supplies most data

Cache design questions:

What data does it hold? Recently accessed

How is data found? Simple address hash

What data is replaced? Often the oldest

### What Data is Held in the Cache?

Ideal cache: always correctly guesses what you want before you want it.

Real cache: never that smart

#### Caches Exploit

#### Temporal Locality

Copy newly accessed data into cache, replacing oldest if necessary

#### **Spatial Locality**

Copy nearby data into the cache at the same time

Specifically, always read and write a block at a time (e.g., 64 bytes), never a single byte.

#### A Direct-Mapped Cache This simple cache has Address Data 8 sets 11...11111100 mem[0xFFFFFFC] 11...11111000 mem[0xFFFFFF8] 1 block per set 11 11110100 mem[0xFFFFFFF4] 4 bytes per block 11 11110000 mem[0xFFFFFF0] 11...11101100 mem[0xFFFFFEC] To simplify answering 11...11101000 mem[0xFFFFFE8] "is this memory in the 11 11100100 mem[0xFFFFFE4] cache?," each byte is 11 11100000 mem[0xFFFFFE0] mapped to exactly one set. 00 00100100 mem[0x00000024] 00 00100000 mem[0x00000020] 00...00011100 mem[0x0000001C] Set 7 (111) mem[0x00000018] 00...00011000 Set 6 (110) 00 00010100 mem[0x00000014] Set 5 (101) 00 00010000 mem[0x00000010] Set 4 (100) mem[0x0000000C] Set 3 (011) 00...00001100 00...00001000 mem[0x00000008] Set 2 (010) 00 00000100 mem[0x00000004] Set 1 (001) 00 0000000 mem[0x00000000] Set 0 (000) 23-Word Cache 2<sup>30</sup>-Word Main Memory



### **Direct-Mapped Cache Behavior**



11: beq \$t0, \$0, done
 lw \$t1, 0x4(\$0)
 lw \$t2, 0xC(\$0)
 lw \$t3, 0x8(\$0)
 addiu \$t0, \$t0, -1
 j 11
done:

\$t0, 5

A dumb loop:

repeat 5 times

load from 0x4;

load from 0xC:

load from 0x8.

li

Assuming the cache starts empty, what's the miss rate?

Cache when reading 0x4 last time

### **Direct-Mapped Cache Behavior**



A dumb loop:

repeat 5 times

load from 0x4;

load from 0xC:

load from 0x8.

done:

Assuming the cache starts empty, what's the miss rate?

Cache when reading 0x4 last time

4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8

3/15 = 0.2 = 20%

### **Direct-Mapped Cache: Conflict**



li \$t0, 5
l1: beq \$t0, \$0, done
lw \$t1, 0x4(\$0)
lw \$t2, 0x24(\$0)
addiu \$t0, \$t0, -1
j l1
done:

A dumber loop:

repeat 5 times

load from 0x4:

load from 0x24

Assuming the cache starts empty, what's the miss rate?

These are conflict misses

### Direct-Mapped Cache: Conflict

A dumber loop:

repeat 5 times

load from 0x4:

load from 0x24

\$t0, 5

addiu \$t0, \$t0, -1

11

lί

٦w

i

done:

lw

l1: bea



These are conflict misses

### No Way! Yes Way! 2-Way Set Associative Cache



### 2-Way Set Associative Behavior

```
li $t0, 5

l1: beq $t0, $0, done
 lw $t1, 0x4($0)
 lw $t2, 0x24($0)
 addiu $t0, $t0, -1
 j l1

done:
```

```
Assuming the cache starts empty, what's the miss rate?

4 24 4 24 4 24 4 24 4 24

M M H H H H H H H H

2/10 = 0.2 = 20%
```

#### Associativity reduces conflict misses

| Way 1 |      |             |   | V    |             |       |
|-------|------|-------------|---|------|-------------|-------|
| ٧     | Tag  | Data        | ٧ | Tag  | Data        | _     |
| 0     |      |             | 0 |      |             | Set 3 |
| 0     |      |             | 0 |      |             | Set 2 |
| 1     | 0000 | mem[0x0024] | 1 | 0010 | mem[0x0004] | Set 1 |
| 0     |      |             | 0 |      |             | Set 0 |

### An Eight-way Fully Associative Cache

| Way 7      | Way 6     | Way 5        | Way 4      | Way 3      | Way 2      | Way 1      | Way 0      |
|------------|-----------|--------------|------------|------------|------------|------------|------------|
| V Tag Data | V Tag Dat | a V Tag Data | V Tag Data |
|            |           |              |            |            |            |            |            |

No conflict misses: only compulsory or capacity misses

Either very expensive or slow because of all the associativity

### **Exploiting Spatial Locality: Larger Blocks**

0x8000 0009C: Memory Address

|          |     | Dioon  | 2,10   |
|----------|-----|--------|--------|
| Tag      | Set | Offset | Offset |
| 100100   | 1   | 11     | 00     |
|          |     |        |        |
| 800000 9 | (   |        |        |

Block Byte



2 sets1 block per set (Direct Mapped)4 words per block

### Direct-Mapped Cache Behavior w/ 4-word block



load from 0x4; load from 0xC; load from 0x8.

Assuming the cache starts empty, what's the miss rate?

### Direct-Mapped Cache Behavior w/ 4-word block



Larger blocks reduce compulsory misses by exploting spatial locality

## Stephen's Desktop Machine Revisited



AMD Phenom 9600 Quad-core 2.3 GHz 1.1–1.25 V 95 W 65 nm

#### On-chip caches:

| Cach | e Size | Sets | Ways   | Block   |
|------|--------|------|--------|---------|
| L1I* | 64 K   | 512  | 2-way  | 64-byte |
| L1D* | 64 K   | 512  | 2-way  | 64-byte |
| L2*  | 512 K  | 512  | 16-way | 64-byte |
| L3   | 2 MB   | 1024 | 32-way | 64-byte |

<sup>\*</sup>per core

Intel On-Chip Caches

| <br>Chip    | hip Year Freq. L1 |           | L2              |                     |                          |
|-------------|-------------------|-----------|-----------------|---------------------|--------------------------|
| <br>-       |                   | (MHz)     | Data            | Instr               |                          |
| 80386       | 1985              | 16–25     | off-cl          | nip                 | none                     |
| 80486       | 1989              | 25–100    | 8K uni          | fied                | off-chip                 |
| Pentium     | 1993              | 60–300    | 8K              | 8K                  | off-chip                 |
| Pentium Pro | 1995              | 150–200   | 8K              | 8K                  | 256K–1M<br>(MCM)         |
| Pentium II  | 1997              | 233–450   | 16K             | 16K                 | 256K–512K<br>(Cartridge) |
| Pentium III | 1999              | 450–1400  | 16K             | 16K                 | 256K-512K                |
| Pentium 4   | 2001              | 1400–3730 |                 | 12k op<br>ace cache | 256K–2M                  |
| Pentium M   | 2003              | 900–2130  | 32K             | 32K                 | 1M-2M                    |
| Core 2 Duo  | 2005              | 1500–3000 | 32K<br>per core | 32K<br>per core     | 2M-6M                    |