PhD Dissertation Defense
March 14, 2019
but might be slowing down
Data: CPUDB, Intel ARK, Wikipedia: https://en.wikipedia.org/wiki/Transistor_count
Plot from 2018 Turing Lecture by Hennessy & Patterson
Data based on models in Esmaeilzadeh et al., "Dark Silicon and the End of Multicore Scaling", ISCA'11
End of Dennard scaling
~50% improv./year
<=10%/year
DVFS
...but multicores can only take us so far, due to Amdahl's law:
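For reference, Amdahl's law bounds the speedup of a program whose parallel fraction is p when run on N cores:

\[ S(N) = \frac{1}{(1-p) + p/N}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1-p} \]

e.g., with p = 0.95 the speedup can never exceed 20x, no matter how many cores are added.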
and even if p == 1, multicore scaling will stop due to growing power density:
growing portions of chips will have to remain powered off (a.k.a. "dark silicon")
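For reference, the dynamic-power identity behind this:

\[ P_{\text{dyn}} \approx \alpha \, C \, V^{2} f \]

Under Dennard scaling, V and C shrank with feature size, keeping power per unit area roughly constant; once V can no longer be lowered, each generation packs more transistors into the same area at roughly the same per-transistor power, so power density rises and portions of the chip must stay dark.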
"The Multicore Scaling Era"
Give up generality for greater efficiency
embrace dark silicon: add many accelerators; not all will be on at the same time
Accelerators are expensive to develop and deploy, particularly ASICs
Investment can only be amortized for high-demand application domains
Taylor, "The Evolution of Bitcoin Hardware", IEEE Computer 2017
Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit", ISCA'17
With further transistor scaling, increasing portions of the chip will be devoted to accelerators
Example: Apple A12 7nm SoC (iPhone XS/XR)
Sources: techinsights.com "Apple iPhone XS Max Teardown", anandtech.com
High-level synthesis tools enable productive design space exploration
Simulators for heterogeneous systems are limited by a lack of fast, scalable emulators
Unused accelerators incur a large opportunity cost
From RTL and/or high-level synthesis descriptions
Accelerators might require ISA innovations
Accelerators might affect the hardware-software interface, e.g. virtual memory or I/O
Leverage multi-core hosts while maintaining correctness
Architecture
OS
Compiler
"Fast, scalable machine emulation is feasible and useful for evaluating the design and integration of heterogeneous systems"
Quantitative comparison of accelerator couplings
Technique to lower the opportunity cost of accelerator integration by reusing acc. memories to extend the LLC
Pico
[CGO'17]
Qelt
[VEE'19]
[DAC'15]
ROCA
[CAL'14, ICS'16]
Cota, Bonzini, Bennée, Carloni. "Cross-ISA Machine Emulation for Multicores", CGO, 2017
goal: efficient, correct, multicore-on-multicore cross-ISA emulation
Main task:
Fetch -> Decode -> Execute
How? Two options:
uint8_t *ip;
uint8_t opcode;

while (true) {
    // Read the next token from the instruction stream
    opcode = *ip;
    // Advance to the next byte in the stream
    ip++;
    // Decide what to do
    switch (opcode) {
    case PZT_ADD_32:
        ...
        break;
    case PZT_SUB_32:
        ...
        break;
    ...
    }
}
1. Interpretation
2. Dynamic Binary Translation (DBT)
Faster than interpretation
More complex
e.g., external "helpers" are needed to emulate complex instructions
"DBT dispatch loop"
Open source: https://www.qemu.org
Widely used in both industry and academia
Supports many ISAs through DBT via TCG, its Intermediate Representation (IR):
Our contributions are not QEMU-specific
They are applicable to dynamic binary translators at large
[*] Bellard. "QEMU, a fast and portable dynamic translator", ATC, 2005
(1) Scalability of the DBT engine
(2) ISA disparities between guest & host:
[A] J. H. Ding et al. PQEMU: A parallel system emulator based on QEMU. ICPADS, 2011
[B] Z. Wang et al. COREMU: A scalable and portable parallel full-system emulator. PPoPP, 2011
[C] D. Lustig et al. ArMOR: defending against memory consistency model mismatches in heterogeneous architectures. ISCA, 2015
(2.1) Memory consistency mismatches
(2.2) Atomic instruction semantics
i.e., compare-and-swap (CAS) vs. load-locked/store-conditional (LL/SC)
Related Work:
PQEMU [A] and COREMU [B] do not address (2)
ArMOR [C] solves (2.1)
[*] Bruening, Kiriansky, Garnett, Banerji. "Thread-shared software code caches", CGO, 2006
Long hash chains: slow lookups
Fixed number of buckets
hash=h(phys_addr) leads to uneven chain lengths
No support for concurrent lookups
Problems in QEMU's TB hash table
hash=h(phys_addr, phys_PC, cpu_flags): uniform chain distribution
e.g. longest chain down from 550 to 40 TBs when booting ARM Linux
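A sketch of the improved hash (the mixing function here is illustrative; QEMU internally uses an xxhash-based function):

#include <stdint.h>

static inline uint32_t tb_hash(uint64_t phys_addr, uint64_t phys_pc,
                               uint32_t cpu_flags)
{
    /* Fold all three identifiers together so chains stay uniformly short.
     * (XOR-folding is a simplification; a real hash mixes each input.) */
    uint64_t x = phys_addr ^ phys_pc ^ cpu_flags;

    /* splitmix64-style finalizer: cheap and well-distributed. */
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return (uint32_t)x;
}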
QHT: A resizable, scalable hash table
scales for both reads & writes
Keeps QEMU's global lock for code translation
Translation is rare, but more on this later!
[*] Southern, Renau. "Deconstructing PARSEC scalability", WDDD, 2015
Two families:
/* runs as a single atomic instruction */
bool CAS(type *ptr, type old, type new) {
    if (*ptr != old) {
        return false;
    }
    *ptr = new;
    return true;
}
Compare-and-Swap (CAS)
Load Locked-Store Conditional (LL/SC)
/*
 * store_exclusive() returns 1 if addr has
 * been written to since load_exclusive()
 */
do {
    val = load_exclusive(addr);
    val += 1; /* do something */
} while (store_exclusive(addr, val));
Challenge: How to correctly emulate atomics in a parallel environment, without hurting scalability?

| ISA | Atomics |
|---|---|
| Alpha | ldl_l/stl_c (LL/SC) |
| POWER | lwarx/stwcx (LL/SC) |
| ARM | ldrex/strex (LL/SC) |
| aarch64 | ldaxr/stlxr (LL/SC) |
| MIPS | ll/sc (LL/SC) |
| RISC-V | lr/sc (LL/SC) |
| x86/IA-64 | cmpxchg (CAS) |
Only a few simple instructions are allowed between LL and SC
Solving this solves LL/SC on LL/SC, because LL/SC is stronger than CAS
However, there's the ABA problem
Challenge: How to correctly emulate atomics in a parallel environment, without hurting scalability?
Init: *addr = A;

| cpu0 | cpu1 |
|---|---|
| do { val = atomic_read(addr); /* reads A */ | |
| | atomic_set(addr, B); |
| | atomic_set(addr, A); |
| } while (!CAS(addr, val, newval)); | |

(time flows downward)
Init: *addr = A;

| cpu0 | cpu1 |
|---|---|
| do { val = load_exclusive(addr); /* reads A */ | |
| | atomic_set(addr, B); |
| | atomic_set(addr, A); |
| } while (store_exclusive(addr, newval)); | |

(time flows downward)

SC fails, regardless of the contents of *addr.
CAS succeeds where SC failed!
3 proposed options that scale:
No need to instrument regular stores
But requires HTM support on the host
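A sketch of the HTM-based option, using Intel TSX (RTM) intrinsics as a stand-in for the host's HTM; the fallback path and retry policy are elided (compile with -mrtm):

#include <immintrin.h>
#include <stdint.h>

/* Run a guest LL/SC sequence as one host transaction: if any other CPU
 * touches addr, the transaction aborts and the guest SC reports failure. */
int emulate_llsc_htm(uint64_t *addr, uint64_t newval, uint64_t *old)
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        *old = *addr;   /* guest load-exclusive */
        /* ... the few guest instructions between LL and SC ... */
        *addr = newval; /* guest store-conditional */
        _xend();
        return 1;       /* SC succeeded */
    }
    return 0;           /* aborted: SC fails, guest loops and retries */
}

Because conflict detection is on the address rather than the value, this avoids the ABA problem entirely.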
Pico-user atomic_add, multi-threaded, aarch64-on-POWER
Cota, Carloni. "Cross-ISA Machine Instrumentation using Fast and Scalable Dynamic Binary Translation", VEE, 2019
goal: fast, scalable instrumentation of a machine emulator
Correct cross-ISA FP emulation using the host FPU
Integration of two state-of-the-art optimizations:
indirect branch handling
dynamic sizing of the software TLB
Make the DBT engine scale under heavy code translation
Not just during execution, like Pico
4. Fast, ISA-agnostic instrumentation layer for QEMU
baseline (incorrect): always uses the host FPU and never reads exception flags
How common? [plot: breakdown of FP instructions in SPECfp06]
float64 float64_mul(float64 a, float64 b, fp_status *st)
{
    float64_input_flush2(&a, &b, st);
    if (likely(float64_is_zero_or_normal(a) &&
               float64_is_zero_or_normal(b) &&
               (st->exception_flags & FP_INEXACT) &&
               st->round_mode == FP_ROUND_NEAREST_EVEN)) {
        if (float64_is_zero(a) || float64_is_zero(b)) {
            bool neg = float64_is_neg(a) ^ float64_is_neg(b);
            return float64_set_sign(float64_zero, neg);
        } else {
            double ha = float64_to_double(a);
            double hb = float64_to_double(b);
            double hr = ha * hb;

            if (unlikely(isinf(hr))) {
                st->exception_flags |= FP_OVERFLOW;
            } else if (unlikely(fabs(hr) <= DBL_MIN)) {
                goto soft_fp;
            }
            return double_to_float64(hr);
        }
    }
soft_fp:
    return soft_float64_mul(a, b, st);
}
...and similarly for 32/64-bit +, −, ×, ÷, √, and ==
derived from state-of-the-art DBT engines
[A] Hong, Hsu, Chou, Hsu, Liu, Wu. "Optimizing Control Transfer and Memory Virtualization in Full System Emulators", ACM TACO, 2015
[B] Tong, Koju, Kawahito, Moshovos. "Optimizing memory translation emulation in full system emulators", ACM TACO, 2015
user-mode x86_64-on-x86_64. Baseline: Pico (i.e. QEMU v3.1.0)
full-system x86_64-on-x86_64. Baseline: Pico (i.e. QEMU v3.1.0)
with a shared translation block (TB) cache
Parallel TB execution (green blocks)
Serialized TB generation (red blocks) with a global lock
Parallel TB execution
Parallel TB generation (one region per vCPU)
Guest VM performing parallel compilation of Linux kernel modules, x86_64-on-x86_64
x86_64-on-x86_64 (lower is better). Baseline: KVM
Qelt faster than the state-of-the-art, even for heavy instrumentation (cachesim)
x86_64-on-x86_64 (lower is better). Baseline: native
"Fast, scalable machine emulation is feasible and useful for evaluating the design and integration of heterogeneous systems"
Quantitative comparison of accelerator couplings
Technique to lower the opportunity cost of accelerator integration by reusing acc. memories to extend the LLC
Pico
[CGO'17]
Qelt
[VEE'19]
[DAC'15]
ROCA
[CAL'14, ICS'16]
Cota, Mantovani, Di Guglielmo, Carloni. "An Analysis of Accelerator Coupling in Heterogeneous Architectures", DAC, 2015
goal: to draw observations about
performance, efficiency and programmability
of accelerators with different couplings
✔ Nil invocation overhead (via ISA extensions)
✔ No internal storage: direct access to L1 cache
✗ Limited portability: design heavily tied to CPU
✔ Good design reuse: no CPU-specific knowledge
✗ High set-up costs: driver invocation and DMA
✔ Freedom to tailor private memories (PLMs), e.g. providing different banks, ports, and bit widths
✗ PLMs require large area expenses
TCA
LCA, two flavors: DRAM-DMA, LLC-DMA
[*] http://hpc.pnl.gov/PERFECT/
Why LCAs > TCAs:
Tailored, many-ported PLMs are key to performance
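As an illustration of why tailored PLMs matter, here is a Vivado-HLS-style sketch (pragma spellings as in the Xilinx docs; the kernel itself is illustrative) that banks a PLM so the datapath can issue several accesses per cycle:

/* Banking the PLM into 4 cyclic banks gives the pipelined loop the
 * memory ports it needs to sustain one iteration per cycle. */
void acc_kernel(const int in[1024], int out[1024])
{
    int plm[1024];
#pragma HLS ARRAY_PARTITION variable=plm cyclic factor=4

    for (int i = 0; i < 1024; i++) {
#pragma HLS PIPELINE II=1
        plm[i] = in[i] * 3;
    }
    for (int i = 0; i < 1024; i++) {
#pragma HLS PIPELINE II=1
        out[i] = plm[i];
    }
}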
Cota, Mantovani, Petracca, Casu, Carloni. "Accelerator Memory Reuse in the Dark Silicon Era", Computer Architecture Letters (CAL), 2014
Cota, Mantovani, Carloni. "Exploiting Private Local Memories to Reduce the Opportunity Cost of Accelerator Integration", Intl. Conf. on Supercomputing (ICS), 2016
goal: to expose on-chip accelerator PLMs to the LLC, thereby extracting utility from accelerators when otherwise unused
An accelerator is only of utility if it applies to the system's workload
If it doesn't, more generally applicable alternatives are more productive
An average of 69% of accelerator area is consumed by memory
Lyons, Hempstead, Wei, Brooks. "The Accelerator Store", TACO, 2012
Not all accelerators on a chip are likely to run at the same time
Accelerator examples: AES, JPEG encoder, FFT, USB, CAN, TFT controller, UMTS decoder..
No changes to coherence protocol
Minimal modifications to accelerators
Configurations:
| Component | Configuration |
|---|---|
| Cores | 16 cores, i386 ISA, in-order, IPC=1 except on memory accesses, 1 GHz |
| L1 caches | Split I/D 32 KB, 4-way set-associative, 1-cycle latency, LRU |
| L2 caches | 8-cycle latency, LRU; S-NUCA: 16 ways, 8 banks; ROCA: 12 ways |
| Coherence | MESI protocol, 64-byte blocks, standalone directory cache |
| DRAM | 1 controller, 200-cycle latency, 3.5 GB physical |
| NoC | 5x5 or 7x7 mesh, 128-bit flits, 2-cycle router traversal, 1-cycle links, XY routing |
| OS | Linux v2.6.34 |
Workloads:
Assuming no accelerator activity,
[DAC'15] Quantitative comparison of accelerator couplings
[CAL'14, ICS'16] ROCA: Lower the opportunity cost of accelerator integration by reusing acc. memories to extend the LLC
[CAL'14] Cota, Mantovani, Petracca, Casu, Carloni. "Accelerator Memory Reuse in the Dark Silicon Era", Computer Architecture Letters (CAL), 2014
[DAC'15] Cota, Mantovani, Di Guglielmo, Carloni. "An Analysis of Accelerator Coupling in Heterogeneous Architectures", DAC, 2015
[ICS'16] Cota, Mantovani, Carloni. "Exploiting Private Local Memories to Reduce the Opportunity Cost of Accelerator Integration", Intl. Conf. on Supercomputing (ICS), 2016
[CGO'17] Cota, Bonzini, Bennée, Carloni. "Cross-ISA Machine Emulation for Multicores", CGO, 2017
[VEE'19] Cota, Carloni. "Cross-ISA Machine Instrumentation using Fast and Scalable Dynamic Binary Translation", VEE, 2019
We hope other researchers and educators will adopt QEMU to drive simulators of heterogeneous systems
Orders of magnitude faster than existing tools such as gem5-aladdin (gem5-aladdin: ~200 KIPS; QEMU: ~tens of MIPS)
Has been used for both research and teaching at Columbia
Instrumentation layer: under review by the QEMU community
Everything else: merged upstream, QEMU v2.7 (Sept'16) through QEMU v4.0 (April'19)
Code contributions well-received (and improved!) by the QEMU community
302 commits to date
[1] http://concurrencykit.org
[12] D. Bruening, V. Kiriansky, T. Garnett, and S. Banerji. Thread-shared software code caches. CGO, pages 28–38, 2006
[13] T. David, R. Guerraoui, and V. Trigonakis. Asynchronized concurrency: The secret to scaling concurrent search data structures. ASPLOS, p. 631–644, 2015
Fast, concurrent lookups
Low update rate: max 6% booting Linux
Candidate #1: ck_hs [1] (similar to [12])
Candidate #2: CLHT [13]
Candidate #3: our proposal, QHT
[*] Blundell, Lewis, Martin. "Subtleties of transactional memory atomicity semantics", Computer Architecture Letters, 2006
Pico-user, single thread, aarch64-on-x86
Pico-user atomic_add, multi-threaded, aarch64-on-POWER
struct count {
    u64 val;
} __aligned(64); /* avoid false sharing */

struct count *counts;

while (!test_stop) {
    int index = rand() % n_elements;
    atomic_increment(&counts[index].val);
}
single thread
x86-on-POWER
We applied ArMOR's [24] FSMs, and also leveraged ...
[24] D. Lustig et al. ArMOR: defending against memory consistency model mismatches in heterogeneous architectures. ISCA, pages 388–400, 2015
RCU is a way of waiting for things to finish,
without tracking every one of them
Credit: Paul McKenney
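The canonical pattern, in C with liburcu/Linux-kernel API names (a conceptual sketch to make the idea concrete; real code must also register reader threads):

#include <urcu.h>    /* liburcu: rcu_read_lock(), synchronize_rcu(), ... */
#include <stdlib.h>

struct foo { int data; };
struct foo *head;

/* Reader: lock-free, just delimits the read-side critical section. */
int reader(void)
{
    int v = -1;
    rcu_read_lock();
    struct foo *p = rcu_dereference(head);
    if (p)
        v = p->data;  /* safe: p cannot be freed while we are inside */
    rcu_read_unlock();
    return v;
}

/* Writer: publish the new version, wait for pre-existing readers, free. */
void writer(struct foo *new_version)
{
    struct foo *old = head;
    rcu_assign_pointer(head, new_version);
    synchronize_rcu();  /* "waits for things to finish" without tracking readers */
    free(old);
}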
void *qht_lookup__slowpath(struct qht_bucket *b, qht_lookup_func_t func,
                           const void *userp, uint32_t hash)
{
    unsigned int version;
    void *ret;

    do {
        version = seqlock_read_begin(&b->sequence);
        ret = qht_do_lookup(b, func, userp, hash);
    } while (seqlock_read_retry(&b->sequence, version));
    return ret;
}
[Diagram: seqlock reader/writer timeline — the writer increments the sequence number on entry (making it odd) and again on exit (making it even); a reader that observes an odd or changed sequence number retries.]
Reader: the sequence number must be even, and must remain unaltered across the read. Otherwise, retry.
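For completeness, a sketch of the writer side (Linux-style smp_wmb() approximated by a C11 release fence; writers are assumed to be serialized externally, e.g. by the per-bucket lock):

#include <stdatomic.h>

#define smp_wmb() atomic_thread_fence(memory_order_release) /* approximation */

typedef struct { unsigned sequence; } seqlock_t;

static void seqlock_write_begin(seqlock_t *sl)
{
    sl->sequence++;  /* now odd: concurrent readers will retry */
    smp_wmb();       /* order the bump before the protected writes */
}

static void seqlock_write_end(seqlock_t *sl)
{
    smp_wmb();       /* order the protected writes before the bump */
    sl->sequence++;  /* even again: readers may proceed */
}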
val_t val = atomic_read(&bucket->val[i]);
smp_rmb();
if (atomic_read(&bucket->key[i]) == key &&
    atomic_read(&bucket->val[i]) == val) {
    /* found */
}
The memory allocator of the values must guarantee that the same address cannot appear twice during the lifespan of an operation.
[13] T. David, R. Guerraoui, and V. Trigonakis. Asynchronized concurrency: The secret to scaling concurrent search data structures. ASPLOS, pages 631–644, 2015
| cpu0 | cpu1 | cpu2 | cpu3 |
|---|---|---|---|
| x=1 | y=1 | r1=x | r3=y |
| | | r2=y | r4=x |

(IRIW litmus test: can cpu2 and cpu3 observe the two independent writes in opposite orders, i.e. r1=1, r2=0, r3=1, r4=0? x86-TSO forbids this outcome; POWER allows it.)
[10] H.-J. Boehm and S. V. Adve. Foundations of the C++ concurrency memory model. ACM SIGPLAN Notices, volume 43, pages 68–78, 2008.
user-mode x86_64-on-x86_64. Baseline: Pico (i.e. QEMU v3.1.0)
user-mode x86-on-x86
VOID Instruction(INS ins)
{
    if (INS_IsMemoryRead(ins))
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)MemCB, ...);
}

VOID Trace(TRACE trace, VOID *v)
{
    for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl))
        for (INS ins = BBL_InsHead(bbl); INS_Valid(ins); ins = INS_Next(ins))
            Instruction(ins);
}
static void vcpu_tb_trans(qemu_plugin_id_t id, unsigned int cpu_index,
                          struct qemu_plugin_tb *tb)
{
    size_t n = qemu_plugin_tb_n_insns(tb);
    size_t i;

    for (i = 0; i < n; i++) {
        struct qemu_plugin_insn *insn = qemu_plugin_tb_get_insn(tb, i);
        qemu_plugin_register_vcpu_mem_cb(insn, vcpu_mem,
                                         QEMU_PLUGIN_CB_NO_REGS,
                                         QEMU_PLUGIN_MEM_R);
    }
}
user-mode, x86_64-on-x86_64
CactusADM is an anomaly: TLB resizing doesn't kick in often enough (we only resize on TLB flushes)
lower is better
CactusADM is an anomaly: TLB resizing doesn't kick in often enough (we only resize on TLB flushes)
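A sketch of flush-time TLB resizing (types and thresholds illustrative; the idea is to grow the TLB when it is heavily used and shrink it when it is not, at the only point where the table is quiescent):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct { uint64_t vaddr, paddr; } TLBEntry;
typedef struct SoftTLB {
    TLBEntry *table;
    size_t n_entries;     /* power of two */
    size_t used_entries;  /* filled since last flush */
} SoftTLB;

enum { TLB_MIN_ENTRIES = 256, TLB_MAX_ENTRIES = 1 << 16 };

/* Called on every TLB flush. */
void tlb_flush_and_resize(SoftTLB *tlb)
{
    double use_rate = (double)tlb->used_entries / tlb->n_entries;

    if (use_rate > 0.70 && tlb->n_entries < TLB_MAX_ENTRIES) {
        tlb->n_entries *= 2;   /* hot TLB: grow to cut the miss rate */
    } else if (use_rate < 0.30 && tlb->n_entries > TLB_MIN_ENTRIES) {
        tlb->n_entries /= 2;   /* cold TLB: shrink to make flushes cheap */
    }
    tlb->table = realloc(tlb->table, tlb->n_entries * sizeof(*tlb->table));
    memset(tlb->table, 0, tlb->n_entries * sizeof(*tlb->table));
    tlb->used_entries = 0;
}

Since resizing only runs at flush time, workloads that rarely flush (such as CactusADM above) see little benefit.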
Faravelon, Gruber, Pétrot. "Optimizing memory access performance using hardware assisted virtualization in retargetable dynamic binary translation", Euromicro Conference on Digital System Design (DSD), 2017
[*] Belay, Bittau, Mashtizadeh, Terei, Mazieres, Kozyrakis. "Dune: Safe user-level access to privileged cpu features." OSDI, 2012
Before: softMMU requires many instructions
After: only 2 instructions, thanks to shadow page tables
Advantages:
Disadvantages:
x86-on-ppc64, make -j N inside a VM
aarch64-on-aarch64, Nbench FP
aarch64-on-x86, SPEC06fp
ind. branches, aarch64-on-x86
ind. branches, x86-on-aarch64
bench before after1 after2 after3 final speedup
---------------------------------------------------------
aes 1.12s 1.12s 1.10s 1.00s 1.12
bigint 0.78s 0.78s 0.78s 0.78s 1
dhrystone 0.96s 0.97s 0.49s 0.49s 1.9591837
miniz 1.94s 1.94s 1.88s 1.86s 1.0430108
norx 0.51s 0.51s 0.49s 0.48s 1.0625
primes 0.85s 0.85s 0.84s 0.84s 1.0119048
qsort 4.87s 4.88s 1.86s 1.86s 2.6182796
sha512 0.76s 0.77s 0.64s 0.64s 1.1875
bench before after1 after2 after3 final speedup
---------------------------------------------------------
aes 2.68s 2.54s 2.60s 2.34s 1.1452991
bigint 1.61s 1.56s 1.55s 1.64s 0.98170732
dhrystone 1.78s 1.67s 1.25s 1.24s 1.4354839
miniz 3.53s 3.35s 3.28s 3.35s 1.0537313
norx 1.13s 1.09s 1.07s 1.06s 1.0660377
primes 15.37s 15.41s 15.20s 15.37s 1
qsort 7.20s 6.71s 3.85s 3.96s 1.8181818
sha512 1.07s 1.04s 0.90s 0.90s 1.1888889
Ind. branches, RISC-V on x86, user-mode
Ind. branches, RISC-V on x86, full-system
[*] http://hpc.pnl.gov/PERFECT/
Additional latency for hits to blocks stored in accelerators
Return via the host bank guarantees the host bank is the only coherence synchronization point
[*] David H. Albonesi, "Selective Cache Ways: On-Demand Cache Resource Allocation", ISCA'99
4-way example: 2 local, 2 remote ways
requires full-length tags: modulo is not bit selection anymore
[*] André Seznec, "Bank-interleaved cache or memory indexing does not require Euclidean division", WDDD'15
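One standard way to index a non-power-of-two number of ways without a divide, in the spirit of [*] (a sketch, not the paper's exact scheme: a fixed-point multiply maps a hash uniformly onto [0, n)):

#include <stdint.h>

/* Map a 32-bit hash to a way in [0, n_ways) with one multiply and a
 * shift, instead of hash % n_ways. */
static inline uint32_t way_index(uint32_t hash, uint32_t n_ways)
{
    return (uint32_t)(((uint64_t)hash * n_ways) >> 32);
}

Because the index is no longer a simple bit slice of the address, the tag must store the full address bits, as noted above.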