Blade Computing with the AMD Opteron™ Processor ("Magny-Cours")

Pat Conway (Presenter)
Nathan Kalyanasundharam
Gregg Donley
Kevin Lepak
Bill Hughes
## Agenda

Processor Architecture

- AMD driving the x86 64-bit processor evolution
- Driving forces behind the Twelve-Core AMD Opteron™ processor codenamed “Magny-Cours”
- CPU silicon
- MCM 2.0 package, speeds and feeds

Performance and scalability

- 2P/4P blade and rack topologies
- HyperTransport™ technology HT Assist design
  - Cache coherence protocol
  - Transaction scenarios and frequencies
  - Coverage ratio
  - Memory latency and bandwidth

A look ahead
## x86 64-bit Architecture Evolution

<table>
<thead>
<tr>
<th>Year</th>
<th>Model</th>
<th>Mfg. Process</th>
<th>CPU Core</th>
<th>L2/L3</th>
<th>HyperTransport™ Technology</th>
<th>Memory</th>
</tr>
</thead>
<tbody>
<tr>
<td>2003</td>
<td>AMD Opteron™</td>
<td>90nm SOI</td>
<td>K8</td>
<td>1MB/0</td>
<td>3x 1.6GT/s</td>
<td>2x DDR1 300</td>
</tr>
<tr>
<td>2005</td>
<td>AMD Opteron™</td>
<td>90nm SOI</td>
<td>K8</td>
<td>1MB/0</td>
<td>3x 1.6GT/s</td>
<td>2x DDR1 400</td>
</tr>
<tr>
<td>2007</td>
<td>“Barcelona”</td>
<td>65nm SOI</td>
<td>Greyhound</td>
<td>512kB/2MB</td>
<td>3x 2GT/s</td>
<td>2x DDR2 667</td>
</tr>
<tr>
<td>2008</td>
<td>“Shanghai”</td>
<td>45nm SOI</td>
<td>Greyhound+</td>
<td>512kB/6MB</td>
<td>3x 4.0GT/s</td>
<td>2x DDR2 800</td>
</tr>
<tr>
<td>2009</td>
<td>“Istanbul”</td>
<td>45nm SOI</td>
<td>Greyhound+</td>
<td>512kB/6MB</td>
<td>3x 4.8GT/s</td>
<td>2x DDR2 1066</td>
</tr>
<tr>
<td>2010</td>
<td>“Magny-Cours”</td>
<td>45nm SOI</td>
<td>Greyhound+</td>
<td>512kB/12MB</td>
<td>4x 6.4GT/s</td>
<td>4x DDR3 1333</td>
</tr>
</tbody>
</table>
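
The Memory column can be turned into a peak per-socket DRAM bandwidth with simple arithmetic. The sketch below is illustrative only: it reads "Nx DDRk R" as N channels at R MT/s and assumes each channel is 64 bits (8 bytes) wide, which the slides do not state. The function name `peak_dram_gb_per_s` is made up for this example.

```python
# Assumption: each DDR channel transfers 8 bytes (64 bits) per transfer.
BYTES_PER_TRANSFER = 8

def peak_dram_gb_per_s(channels, mega_transfers_per_s):
    """Peak DRAM bandwidth per socket in GB/s (1 GB = 1000 MB)."""
    return channels * mega_transfers_per_s * BYTES_PER_TRANSFER / 1000

# (year, model, channels, MT/s) copied from the Memory column of the table.
generations = [
    (2003, "AMD Opteron", 2, 300),
    (2005, "AMD Opteron", 2, 400),
    (2007, "Barcelona", 2, 667),
    (2008, "Shanghai", 2, 800),
    (2009, "Istanbul", 2, 1066),
    (2010, "Magny-Cours", 4, 1333),
]

for year, model, channels, rate in generations:
    print(f"{year} {model}: {peak_dram_gb_per_s(channels, rate):.1f} GB/s")
```

Under these assumptions, doubling the channel count and moving to DDR3 1333 gives "Magny-Cours" roughly 2.5 times the per-socket peak of "Istanbul".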

### Max Power Budget Remains Consistent
Dramatic Back-to-back Gains

"Shanghai" to "Istanbul" delivers 34% more performance in the same power envelope

(*) “Magny-Cours” and future silicon data is based on AMD projections
## Driving Forces Behind “Magny-Cours”

<table>
<thead>
<tr>
<th>Server Throughput</th>
<th>Exploit thread level parallelism</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Leverage Directly Connected MCM 2.0</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Virtualization</th>
<th>Maximize compute density in 2P/4P blades and racks</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Run more VMs per server</td>
</tr>
<tr>
<td></td>
<td>Provide hardware context (thread) based QOS</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Energy Proportional Computing</th>
<th>More performance, same power envelope</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Power conservation when idle</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Economics</th>
<th>Design efficiency – “Magny-Cours” silicon same as “Istanbul”</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Can help speed qualification times and customers’ time to market</td>
</tr>
<tr>
<td></td>
<td>Reasonable die size permits 2 die per reticle (yield ↑, manufacturing cost ↓)</td>
</tr>
<tr>
<td></td>
<td>Yield improvements can help ensure supply chain stability</td>
</tr>
<tr>
<td></td>
<td>Manufacturing cost savings ultimately benefit customers</td>
</tr>
</tbody>
</table>
## “Magny-Cours” Silicon
**MCM 2.0 Logical View**

**G34 Socket**

“Magny-Cours” utilizes a **Directly Connected** MCM

**DDR3 Memory Channel**

**Package** has 12 cores, 4 HT ports, & 4 memory channels

**Die (Node)** has 6 cores, 4 HT ports & 2 memory channels

- **P0**
  - x16 cHT
  - x16 (NC)
  - x8 cHT

- **P1**
  - x16 cHT
## Topologies

2P

- Diameter: 1
- Avg Diam: 0.75
- DRAM BW: 85.6 GB/s
- XFIRE BW: 71.7 GB/s (*)

4P

- Diameter: 2
- Avg Diam: 1.25
- DRAM BW: 170.4 GB/s
- XFIRE BW: 143.4 GB/s

(*) XFIRE BW is the maximum available coherent memory bandwidth if the HT links were the only limiting factor. Each node accesses its own memory and that of every other node in an interleaved fashion.
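
The diameter figures above can be reproduced with a breadth-first search over a node graph, counting a node's distance to itself as zero hops. This is a minimal sketch under stated assumptions: the 2P system is modeled as four fully connected nodes (two dies per package), and the 4P system uses a hypothetical 8-node circulant graph in which every node has four one-hop neighbors. That stand-in matches the quoted numbers but is not claimed to be the actual link wiring.

```python
from collections import deque

def hop_counts(adj, src):
    """Breadth-first search: hop count from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter_stats(adj):
    """Return (diameter, average diameter) over all ordered node pairs,
    including each node's zero-hop distance to itself."""
    all_d = [d for s in adj for d in hop_counts(adj, s).values()]
    return max(all_d), sum(all_d) / len(all_d)

def fully_connected(n):
    return {i: [j for j in range(n) if j != i] for i in range(n)}

def circulant(n, offsets):
    """Hypothetical stand-in topology: node i links to i +/- each offset."""
    return {
        i: sorted({(i + o) % n for o in offsets} | {(i - o) % n for o in offsets})
        for i in range(n)
    }

two_p = fully_connected(4)      # 2 packages x 2 dies
four_p = circulant(8, (1, 2))   # 4 packages x 2 dies, illustrative wiring

print(diameter_stats(two_p))    # (1, 0.75)
print(diameter_stats(four_p))   # (2, 1.25)
```

An average of 1.25 over eight nodes implies that each node sees four nodes at one hop and three at two hops, which is the property the stand-in graph was chosen to have.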
## HyperTransport™ Technology HT Assist (Probe Filter)

Key enabling technology on “Istanbul” and “Magny-Cours”

HT Assist is a sparse directory cache
- Associated with the memory controller on the home node
- Tracks all lines from the home node's memory that are cached anywhere in the system

Eliminates most probe broadcasts (see diagram)
- Lowers latency
  - local accesses get local DRAM latency, no need to wait for probe responses
  - less queuing delay due to lower HT traffic overhead
- Increases system bandwidth by reducing probe traffic
## Where Do We Put the HT Assist Probe Filter?

**Q:** Where do we store probe filter entries without adding a large on-chip probe filter RAM that would go unused in a 1P desktop system?

**A:** Steal 1MB of 6MB L3 cache per die in “Magny-Cours” systems.

[Figure: L3 cache set with ways 0 to 15, a portion of which holds the probe filter directory (Dir)]

Implementation in fast SRAM (L3) minimizes
- Access latency
- Port occupancy of read-modify-write operations
- Indirection latency for cache-to-cache transfers
## Format of a Probe Filter Entry

- 16 probe filter entries per L3 cache line (64B), 4B per entry, 4-way set associative
- 1MB of a 6MB L3 cache per die holds 256k probe filter entries and covers 16MB of cache
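
The sizing figures in the two bullets above follow from a few divisions. The sketch below only restates numbers given on the slide; the constant and variable names are illustrative.

```python
CACHE_LINE_BYTES = 64                 # L3 cache line size
PF_ENTRY_BYTES = 4                    # one probe filter entry
PF_RESERVED_BYTES = 1 * 1024 * 1024   # 1MB of L3 set aside per die
WAYS = 4                              # probe filter associativity

# One 64B L3 line holds 16 entries, arranged as 4 sets of 4 ways.
entries_per_line = CACHE_LINE_BYTES // PF_ENTRY_BYTES
sets_per_line = entries_per_line // WAYS

# 1MB of 4B entries gives 256k entries, each tracking one 64B cache line.
total_entries = PF_RESERVED_BYTES // PF_ENTRY_BYTES
covered_bytes = total_entries * CACHE_LINE_BYTES

print(entries_per_line, sets_per_line)   # 16 4
print(total_entries // 1024, "k entries")
print(covered_bytes // (1024 * 1024), "MB of cache covered")
```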

[Figure: Probe filter entry diagram]

- **L3 Cache Line (64B)**
  - 4 sets
  - 4 ways
  - 16 probe filter entries

- **Probe Filter Entry (4B)**
  - Tag
  - State (EM, O, S, S1)
  - Owner
## Cache Coherence Protocol

- Track lines in M, E, O or S state in probe filter
- PF is fully inclusive of all cached data in the system: if a line is cached, then a PF entry must exist.
- Presence of a probe filter entry means the line is in M, E, O or S state
- Absence of a probe filter entry means the line is uncached
- New messages
  - Directed probe on probe filter hit
  - Replacement notification E -> I (clean VicBlk)
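
The difference between a directory lookup and a broadcast can be sketched with a toy model. This is a deliberately simplified illustration, not the hardware protocol: it tracks a single owner per line, ignores the separate sharing states, and the class and method names (`ToyProbeFilter`, `request`, `evict`, `broadcast_probes`) are invented for this example.

```python
class ToyProbeFilter:
    """Toy directory for one home node: line address -> owning node.

    Assumption: one tracked owner per line; shared states are not modeled.
    """

    def __init__(self):
        self.owner_of = {}

    def request(self, line, requester):
        """Return the nodes that must be probed to serve this request."""
        owner = self.owner_of.get(line)
        self.owner_of[line] = requester   # requester is now the tracked owner
        if owner is None or owner == requester:
            return []                     # no entry (uncached) or self-owned: no probe
        return [owner]                    # entry present: one directed probe

    def evict(self, line):
        """Replacement notification: the line left the cache, drop the entry."""
        self.owner_of.pop(line, None)


def broadcast_probes(requester, num_nodes):
    """Baseline without a probe filter: probe every other node."""
    return [n for n in range(num_nodes) if n != requester]


pf = ToyProbeFilter()
first = pf.request(0x1000, requester=0)    # PF miss
second = pf.request(0x1000, requester=2)   # PF hit, owner is node 0
pf.evict(0x1000)
third = pf.request(0x1000, requester=1)    # PF miss again after eviction

print(first, second, third)
print(broadcast_probes(2, num_nodes=4))
```

The eviction step shows why the clean-victim notification matters in this model: without it, the directory would keep a stale entry and send a directed probe for a line that is no longer cached.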
## Probe Filter Transaction Scenarios

<table>
<thead>
<tr>
<th></th>
<th>PF Hit</th>
<th></th>
<th>PF Miss (*)</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>I</td>
<td>O</td>
<td>S</td>
<td>S1</td>
</tr>
<tr>
<td>FETCH</td>
<td></td>
<td>D</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>LOAD</td>
<td>-</td>
<td>D</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>STORE</td>
<td>-</td>
<td>B</td>
<td>B</td>
<td>B</td>
</tr>
</tbody>
</table>

**Legend**
- **-** Filtered
- **D** Directed
- **DI** Directed Invalidate
- **B** Broadcast Invalidate

(*) PF miss implies line is Uncached (no broadcast necessary). State refers to the state of the line to be replaced upon allocation of new PF entry.

"**Effective**"

"**Ineffective**"

Traditional “Cache Hit Ratio” does not measure effectiveness of probe filter.
## Probe Filter Coverage Ratio

Typical
Uniformly distributed data
Coverage ratio = 256k / 128k = 2.0x

With sharing, a PF entry may track multiple cached copies and the coverage ratio increases

Worst case (Hotspotting)
Home node of each cached line is P0
Coverage ratio = 256k / (128k × 4) = 0.5x

2 Socket “Magny-Cours”
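
The two coverage ratios above can be checked with a short calculation. In this sketch the 128k figure is read as the number of cache lines one node can hold, and the factor of 4 as the node count of a 2-socket system (two dies per package); both readings are interpretations of the slide rather than statements from it.

```python
PF_ENTRIES = 256 * 1024       # probe filter entries per home node
LINES_PER_NODE = 128 * 1024   # assumed: cache lines one node can hold
NODES = 4                     # assumed: 2 sockets x 2 dies

def coverage_ratio(pf_entries, lines_to_track):
    """Probe filter entries available per cached line that must be tracked."""
    return pf_entries / lines_to_track

# Uniform data: each home node tracks about one node's worth of cached lines.
typical = coverage_ratio(PF_ENTRIES, LINES_PER_NODE)

# Hotspotting: every cached line in the system is homed on P0.
worst = coverage_ratio(PF_ENTRIES, LINES_PER_NODE * NODES)

print(typical, worst)   # 2.0 0.5
```

A ratio below 1.0 means the home node cannot hold an entry for every cached line, so some entries must be replaced while their lines are still cached.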
## HT Assist and Memory Latency

With the “old” broadcast coherence protocol, the latency of a memory access is the longer of 2 paths:

- the time it takes to return data from DRAM, and
- the time it takes to probe all caches

With HT Assist, local memory latency is significantly reduced as it is not necessary to probe caches on other nodes.
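
The latency argument can be written as a pair of one-line functions. This is a sketch of the reasoning only: the nanosecond values are hypothetical placeholders, not measurements from the presentation, and the function names are illustrative.

```python
def broadcast_latency_ns(dram_ns, probe_all_ns):
    """Broadcast protocol: wait for the slower of DRAM and probe responses."""
    return max(dram_ns, probe_all_ns)

def ht_assist_local_latency_ns(dram_ns):
    """HT Assist, local access to an uncached line: only DRAM is on the path."""
    return dram_ns

# Hypothetical numbers chosen only to show the shape of the comparison.
dram_ns = 60
probe_all_ns = 100

before = broadcast_latency_ns(dram_ns, probe_all_ns)
after = ht_assist_local_latency_ns(dram_ns)
print(before, after)   # 100 60
```

The benefit appears whenever probing remote caches takes longer than the local DRAM access, which is the case the slide describes.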

Several server workloads naturally have ~100% local accesses

- SPECint®, SPECfp®
- VMMARK™ typically run with 1 VM per core
- SPECpower_ssj® with 1 JVM per core
- STREAM

Probe Filter amplifies benefit of any NUMA optimizations in OS/application which make memory accesses local
## A Look Ahead

Socket-compatible upgrade to “Magny-Cours” is planned with
- More cores for additional thread-level parallelism
- More cache to maintain cache-per-core balance
- Same power envelope
- Finer-grained power management

New processor core ("Bulldozer")
- Planned brand new x86 64-bit microarchitecture
- 32nm design
- Instruction set extensions
- Higher memory level parallelism
Thank you!
## Disclaimer & Attribution

DISCLAIMER
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2009 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, AMD Opteron, and combinations thereof are trademarks of Advanced Micro Devices, Inc. HyperTransport is a licensed trademark of the HyperTransport Technology Consortium. Other names are for informational purposes only and may be trademarks of their respective owners.

SPEC, SPECint, SPECfp, and SPECpower_ssj are trademarks or registered trademarks of the Standard Performance Evaluation Corporation.