# Fundamentals of Computer Systems: Caches

Martha A. Kim

Columbia University

Fall 2015

Illustrations Copyright © 2007 Elsevier

#### **Computer Systems**

Performance depends on which is slower: the processor or the memory system



## Memory Speeds Haven't Kept Up



Our single-cycle memory assumption has been wrong since 1980.

Hennessy and Patterson. *Computer Architecture: A Quantitative Approach.* 3rd ed., Morgan Kaufmann, 2003.

#### Your Choice of Memories

|                | Fast | Cheap | Large |
|----------------|------|-------|-------|
| On-Chip SRAM   | ✓    | ✓     |       |
| Commodity DRAM |      | ✓     | ✓     |
| Supercomputer  | ✓    |       | ✓     |

#### **Memory Hierarchy**

An essential trick that makes big memory appear fast

| Technology | Cost (\$/Gb) | Access Time (ns) | Density (Gb/cm²) |
|------------|--------------|------------------|------------------|
| SRAM       | 30 000       | 0.5              | 0.00025          |
| DRAM       | 10           | 100              | 1 – 16           |
| Flash      | 2            | 300*             | 8 – 32           |
| Hard Disk  | 0.3          | 10 000 000       | 500 – 2000       |

<sup>\*</sup>Read speed; writing much, much slower

## A Modern Memory Hierarchy



AMD Phenom 9600 Quad-core 2.3 GHz 1.1–1.25 V 95 W 65 nm

#### A desktop machine:

| Level           | Size   | Tech.    |
|-----------------|--------|----------|
| L1 Instruction* | 64 K   | SRAM     |
| L1 Data*        | 64 K   | SRAM     |
| L2*             | 512 K  | SRAM     |
| L3              | 2 MB   | SRAM     |
| Memory          | 4 GB   | DRAM     |
| Disk            | 500 GB | Magnetic |

<sup>\*</sup>per core

## A Simple Memory Hierarchy





Last level: large, slow storage (typically DRAM)

Can fit a subset of lower level in upper level, but which subset?

#### Locality Example: Substring Matching



spatial locality: if need X, likely also need something near X

#### Cache

Highest levels of memory hierarchy

Fast: level 1 typically 1 cycle access time

With luck, supplies most data

Cache design questions:

What data does it hold? Recently accessed

How is data found? Simple address hash

What data is replaced? Often the oldest

#### What Data is Held in the Cache?

Ideal cache: always correctly guesses what you want before you want it.

Real cache: never that smart

#### **Caches Exploit**

#### **Temporal Locality**

Copy newly accessed data into cache, replacing oldest if necessary

#### **Spatial Locality**

Copy nearby data into the cache at the same time

Specifically, always read and write a **block**, also called a **line**, at a time (e.g., 64 bytes), never a single byte.

## **Memory Performance**

Hit: Data is found in this level of the memory hierarchy

Miss: Data is not found; look in the next level

$$\text{Hit Rate} = \frac{\text{Number of hits}}{\text{Number of accesses}}$$

$$\text{Miss Rate} = \frac{\text{Number of misses}}{\text{Number of accesses}}$$

$$\text{Hit Rate} + \text{Miss Rate} = 1$$



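Expected access time of level $L$, where $t_L$ is the access time of level $L$ and $M_L$ is its miss rate:

$$E_L = t_L + M_L \cdot E_{L+1}$$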
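Unrolling the recurrence can make its structure easier to see. The sketch below assumes a hypothetical three-level hierarchy (levels 0, 1, 2) in which the last level always supplies the data, so $E_2 = t_2$; the number of levels is chosen only for illustration:

$$E_0 = t_0 + M_0 \cdot E_1 = t_0 + M_0 \cdot \left(t_1 + M_1 \cdot t_2\right)$$

Each level's access time is paid only by the fraction of accesses that missed in every level above it.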



#### Memory Performance Example

Two-level hierarchy: Cache and main memory

Program executes 1000 loads & stores

750 of these are found in the cache

What's the cache hit and miss rate?

$$\text{Hit Rate} = \frac{750}{1000} = 75\%$$

$$\text{Miss Rate} = 1 - 0.75 = 25\%$$

If the cache takes 1 cycle and the main memory 100, what's the expected access time?

Expected access time of main memory:  $E_1 = 100$  cycles

Access time for the cache:  $t_0 = 1$  cycle

Cache miss rate:  $M_0 = 0.25$ 

$$E_0 = t_0 + M_0 \cdot E_1 = 1 + 0.25 \cdot 100 = 26 \text{ cycles}$$
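The arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names `hit_and_miss_rate` and `expected_access_time` are made up for illustration, and the last level is assumed to always supply the data (miss rate 0).

```python
def hit_and_miss_rate(hits, accesses):
    """Return (hit_rate, miss_rate) for one level of the hierarchy."""
    hit_rate = hits / accesses
    return hit_rate, 1.0 - hit_rate


def expected_access_time(levels):
    """Evaluate E_L = t_L + M_L * E_(L+1), working up from the last level.

    `levels` is a list of (access_time, miss_rate) pairs ordered from the
    level closest to the processor to the last level. The last level is
    assumed to always hit, so its miss rate should be 0.
    """
    expected = 0.0
    for access_time, miss_rate in reversed(levels):
        expected = access_time + miss_rate * expected
    return expected


# Numbers from the example: 750 hits out of 1000 accesses,
# 1-cycle cache, 100-cycle main memory.
hit_rate, miss_rate = hit_and_miss_rate(hits=750, accesses=1000)
e0 = expected_access_time([(1, miss_rate), (100, 0.0)])
print(hit_rate, miss_rate, e0)
```

Adding a level to the list models a deeper hierarchy without changing the function.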



## Direct-Mapped Cache Hardware



Address bits:

- 0–1: byte within block
- 2–4: set number
- 5–31: block "tag"

8-entry × (1+27+32)-bit **SRAM**, one entry per set (Set 0 through Set 7)

Cache hit if, in the set selected by the address,

- the block is valid (V=1)
- the tag (address bits 5–31) matches
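The address split and the hit condition can be sketched in Python. This is an illustrative software model of the 8-set, one-word-per-block cache described above, not a description of the hardware itself; the names `split_address` and `is_hit` and the list-of-tuples representation are assumptions made for the example.

```python
NUM_SETS = 8  # 3 set-number bits (address bits 2-4)


def split_address(address):
    """Split a 32-bit address into (tag, set_index, byte_offset)."""
    byte_offset = address & 0x3        # bits 0-1: byte within block
    set_index = (address >> 2) & 0x7   # bits 2-4: set number
    tag = (address >> 5) & 0x7FFFFFF   # bits 5-31: 27-bit tag
    return tag, set_index, byte_offset


def is_hit(cache, address):
    """Hit if the selected set is valid and its stored tag matches."""
    tag, set_index, _ = split_address(address)
    valid, stored_tag = cache[set_index]
    return valid and stored_tag == tag


# Each entry is (valid, tag); the 32-bit data word is omitted here.
cache = [(False, 0)] * NUM_SETS
cache[1] = (True, 0)  # pretend the block at address 0x4 was loaded earlier

print(split_address(0x4))
print(is_hit(cache, 0x4), is_hit(cache, 0x24), is_hit(cache, 0x8))
```

Address 0x24 selects the same set as 0x4 but carries a different tag, so it misses even though the set is valid.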

#### **Direct-Mapped Cache Behavior**

A dumb loop: repeat 5 times

load from 0x4; load from 0xC; load from 0x8.

```
li $t0, 5
l1: beq $t0, $0, done
lw $t1, 0x4($0)
lw $t2, 0xC($0)
lw $t3, 0x8($0)
addiu $t0, $t0, -1
j l1
done:
```

Memory address 0x4:

| Tag     | Set | Byte Offset |
|---------|-----|-------------|
| 00...00 | 001 | 00          |

Cache when reading 0x4 last time:

| Set         | V | Tag     | Data           |
|-------------|---|---------|----------------|
| Set 7 (111) | 0 |         |                |
| Set 6 (110) | 0 |         |                |
| Set 5 (101) | 0 |         |                |
| Set 4 (100) | 0 |         |                |
| Set 3 (011) | 1 | 00...00 | mem[0x00...0C] |
| Set 2 (010) | 1 | 00...00 | mem[0x00...08] |
| Set 1 (001) | 1 | 00...00 | mem[0x00...04] |
| Set 0 (000) | 0 |         |                |

Assuming the cache starts empty, what's the miss rate?

4 C 8 4 C 8 4 C 8 4 C 8 4 C 8

M M M H H H H H H H H H H H H

3/15 = 0.2 = 20%
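The hit/miss sequence can be reproduced with a small simulation. The sketch below is a hypothetical Python model of an 8-set direct-mapped cache with one-word (4-byte) blocks; it tracks only the tag stored in each set, and the function name `simulate_direct_mapped` is made up for this example.

```python
def simulate_direct_mapped(addresses, num_sets=8, block_bytes=4):
    """Return a list of 'M'/'H' outcomes for a direct-mapped cache.

    Each set holds one block. A set is modeled by the tag it currently
    stores, or None while it is still invalid (V=0).
    """
    tags = [None] * num_sets
    outcomes = []
    for address in addresses:
        block_number = address // block_bytes
        set_index = block_number % num_sets
        tag = block_number // num_sets
        if tags[set_index] == tag:
            outcomes.append("H")
        else:
            outcomes.append("M")
            tags[set_index] = tag  # fetch the block, replacing the old one
    return outcomes


trace = [0x4, 0xC, 0x8] * 5  # the three loads, repeated 5 times
outcomes = simulate_direct_mapped(trace)
miss_rate = outcomes.count("M") / len(outcomes)
print(" ".join(outcomes), miss_rate)
```

The three addresses fall in three different sets, so only the first pass through the loop misses.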

#### Direct-Mapped Cache: Conflict



These are conflict misses

#### No Way! Yes Way! 2-Way Set Associative Cache



#### 2-Way Set Associative Behavior

```
li $t0, 5
l1: beq $t0, $0, done
lw $t1, 0x4($0)
lw $t2, 0x24($0)
addiu $t0, $t0, -1
j l1
done:
```

Assuming the cache starts empty, what's the miss rate?

4 24 4 24 4 24 4 24 4 24

M M H H H H H H H H

2/10 = 0.2 = 20%

Associativity reduces conflict misses

| Set   | V (Way 1) | Tag (Way 1) | Data (Way 1)   | V (Way 0) | Tag (Way 0) | Data (Way 0)   |
|-------|-----------|-------------|----------------|-----------|-------------|----------------|
| Set 3 | 0         |             |                | 0         |             |                |
| Set 2 | 0         |             |                | 0         |             |                |
| Set 1 | 1         | 00...10     | mem[0x00...24] | 1         | 00...00     | mem[0x00...04] |
| Set 0 | 0         |             |                | 0         |             |                |
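To see how associativity removes the conflict, the same two loads (0x4 and 0x24, repeated 5 times) can be run through a small model with a configurable number of ways. This is a sketch under stated assumptions: replacement is least-recently-used (the slides only say "often the oldest"), blocks are one word, and the direct-mapped comparison assumes the 8-set cache from the earlier slides. The name `count_misses` is hypothetical.

```python
from collections import OrderedDict


def count_misses(addresses, num_sets, ways, block_bytes=4):
    """Count misses in a set-associative cache with LRU replacement.

    Each set is an OrderedDict of tags ordered from least to most
    recently used. ways=1 models a direct-mapped cache.
    """
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for address in addresses:
        block_number = address // block_bytes
        lines = sets[block_number % num_sets]
        tag = block_number // num_sets
        if tag in lines:
            lines.move_to_end(tag)         # now the most recently used
        else:
            misses += 1
            if len(lines) == ways:
                lines.popitem(last=False)  # evict the least recently used
            lines[tag] = True
    return misses


trace = [0x4, 0x24] * 5
direct_mapped = count_misses(trace, num_sets=8, ways=1)
two_way = count_misses(trace, num_sets=4, ways=2)
print(direct_mapped, two_way)
```

With one way, the two addresses keep evicting each other from the same set; with two ways, both blocks fit in the set and only the first two accesses miss.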

#### An Eight-way Fully Associative Cache

| Way 7      | Way 6      | Way 5      | Way 4      | Way 3      | Way 2      | Way 1      | Way 0      |
|------------|------------|------------|------------|------------|------------|------------|------------|
| V Tag Data | V Tag Data | V Tag Data | V Tag Data | V Tag Data | V Tag Data | V Tag Data | V Tag Data |
|            |            |            |            |            |            |            |            |

No conflict misses: only compulsory or capacity misses

Either very expensive or slow because of all the associativity

## Exploiting Spatial Locality: Larger Blocks

Memory address 0x8000009C:

| Tag       | Set | Block Offset | Byte Offset |
|-----------|-----|--------------|-------------|
| 100...100 | 1   | 11           | 00          |


2 sets

1 block per set (Direct Mapped)

4 words per block
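The field breakdown for this geometry (2 sets, 4 words per block, 4 bytes per word) can be checked in Python. This is a minimal sketch; the function name `decode` and its parameter names are assumptions made for the example.

```python
def decode(address, num_sets=2, words_per_block=4, bytes_per_word=4):
    """Split an address into (tag, set_index, block_offset, byte_offset)."""
    byte_offset = address % bytes_per_word       # byte within the word
    word_number = address // bytes_per_word
    block_offset = word_number % words_per_block  # word within the block
    block_number = word_number // words_per_block
    set_index = block_number % num_sets
    tag = block_number // num_sets
    return tag, set_index, block_offset, byte_offset


tag, set_index, block_offset, byte_offset = decode(0x8000009C)
# Print the fields in binary: 27-bit tag, 1-bit set,
# 2-bit block offset, 2-bit byte offset.
print(f"{tag:027b} {set_index:b} {block_offset:02b} {byte_offset:02b}")
```

Using division and remainder instead of shifts and masks keeps the sketch valid for any power-of-two geometry passed in.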

## Direct-Mapped Cache Behavior w/ 4-word block

load from 0x4; load from 0xC; load from 0x8.

```
li $t0, 5
l1: beq $t0, $0, done
lw $t1, 0x4($0)
lw $t2, 0xC($0)
lw $t3, 0x8($0)
addiu $t0, $t0, -1
j l1
done:
```

Assuming the cache starts empty, what's the miss rate?

4 C 8 4 C 8 4 C 8 4 C 8 4 C 8

M H H H H H H H H H H H H H H

1/15 = 0.0667 = 6.7%

Larger blocks reduce compulsory misses by exploiting spatial locality
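The effect of block size on this loop can be compared directly. The sketch below is a hypothetical Python model of a direct-mapped cache that counts misses for a given geometry; it assumes the 8-set, one-word-block cache and the 2-set, four-word-block cache from the slides, and the name `miss_count` is made up for the example.

```python
def miss_count(addresses, num_sets, block_bytes):
    """Count misses in a direct-mapped cache, tracking one tag per set."""
    tags = [None] * num_sets  # None means the set is still invalid
    misses = 0
    for address in addresses:
        block_number = address // block_bytes
        set_index = block_number % num_sets
        tag = block_number // num_sets
        if tags[set_index] != tag:
            misses += 1
            tags[set_index] = tag  # the whole block is fetched on a miss
    return misses


trace = [0x4, 0xC, 0x8] * 5
one_word_blocks = miss_count(trace, num_sets=8, block_bytes=4)
four_word_blocks = miss_count(trace, num_sets=2, block_bytes=16)
print(one_word_blocks, four_word_blocks)
```

With 16-byte blocks, the first load of 0x4 brings in bytes 0x0 through 0xF, so the loads from 0xC and 0x8 hit even on the first iteration.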

## Stephen's Desktop Machine Revisited



AMD Phenom 9600 Quad-core 2.3 GHz 1.1–1.25 V 95 W 65 nm

#### On-chip caches:

| Cache | Size   | Sets | Ways   | Block   |
|-------|--------|------|--------|---------|
| L1I*  | 64 K   | 512  | 2-way  | 64-byte |
| L1D*  | 64 K   | 512  | 2-way  | 64-byte |
| L2*   | 512 K  | 512  | 16-way | 64-byte |
| L3    | 2 MB   | 1024 | 32-way | 64-byte |

<sup>\*</sup>per core
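The columns of this table are related: capacity equals sets × ways × block size. A short Python check, offered as an illustrative sketch with the table's values copied in by hand (the dictionary layout and the name `capacity_bytes` are assumptions):

```python
def capacity_bytes(sets, ways, block_bytes):
    """Total data capacity of a set-associative cache."""
    return sets * ways * block_bytes


# (sets, ways, block size in bytes) as listed in the table above
caches = {
    "L1I": (512, 2, 64),
    "L1D": (512, 2, 64),
    "L2": (512, 16, 64),
    "L3": (1024, 32, 64),
}

sizes = {name: capacity_bytes(*geometry) for name, geometry in caches.items()}
for name, size in sizes.items():
    print(name, size // 1024, "KB")
```

The results (64 KB, 64 KB, 512 KB, and 2048 KB = 2 MB) agree with the Size column.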

#### Intel On-Chip Caches

| Chip        | Year | Freq. (MHz) | L1 (Data / Instr)          | L2                    |
|-------------|------|-------------|----------------------------|-----------------------|
| 80386       | 1985 | 16–25       | off-chip                   | none                  |
| 80486       | 1989 | 25–100      | 8K unified                 | off-chip              |
| Pentium     | 1993 | 60–300      | 8K / 8K                    | off-chip              |
| Pentium Pro | 1995 | 150–200     | 8K / 8K                    | 256K–1M (MCM)         |
| Pentium II  | 1997 | 233–450     | 16K / 16K                  | 256K–512K (Cartridge) |
| Pentium III | 1999 | 450–1400    | 16K / 16K                  | 256K–512K             |
| Pentium 4   | 2001 | 1400–3730   | 8–16K / 12K µop trace cache | 256K–2M               |
| Pentium M   | 2003 | 900–2130    | 32K / 32K                  | 1M–2M                 |
| Core 2 Duo  | 2005 | 1500–3000   | 32K / 32K                  | 2M–6M                 |