The end of Dennard scaling explains the surge of interest in specialization.
[Figure: generality vs. energy efficiency across the specialization spectrum (CPUs; multi-cores/asymmetric many-cores; GPUs/DSPs; FPGAs; ASICs), with energy-efficiency gains ranging from 1x through 10x and 100x to 1000x; a marker locates this work on the spectrum.]
An accelerator is useful only if it applies to the system's workload.
If it doesn't, more generally applicable alternatives are more productive.
Example: a sort accelerator for floating-point vectors.
[Diagram: Input (2-port PLM) -> Stage 1: Parallel Bubble Sort (64-port PLM) -> Stage 2: Merge Sort (64-port PLM) -> Output (2-port PLM).]
Tailored, many-ported Private Local Memories (PLMs) are key to exploiting all of the parallelism in the algorithm, as the sketch below illustrates.
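A minimal software sketch of the two-stage structure, assuming 64-element chunks to match the 64-port PLMs in the diagram; the function names are illustrative, and odd-even transposition is used as the parallel bubble sort variant. In hardware, each odd-even phase's comparator pairs are independent, which is exactly what the many-ported PLM lets the accelerator exploit in a single cycle.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Stage 1: odd-even transposition ("parallel bubble") sort of one chunk.
// All compare-and-swap pairs within a phase are independent, so a
// 64-port PLM allows a hardware implementation to run them concurrently.
static void parallelBubbleSort(std::vector<float>& chunk) {
    const std::size_t n = chunk.size();
    for (std::size_t phase = 0; phase < n; ++phase) {
        for (std::size_t i = phase % 2; i + 1 < n; i += 2) {
            if (chunk[i] > chunk[i + 1]) std::swap(chunk[i], chunk[i + 1]);
        }
    }
}

// Stage 2: bottom-up merge of the sorted chunks (assumed chunk size 64).
std::vector<float> sortVector(std::vector<float> data) {
    const std::size_t kChunk = 64;
    for (std::size_t base = 0; base < data.size(); base += kChunk) {
        std::size_t end = std::min(base + kChunk, data.size());
        std::vector<float> chunk(data.begin() + base, data.begin() + end);
        parallelBubbleSort(chunk);
        std::copy(chunk.begin(), chunk.end(), data.begin() + base);
    }
    // Merge runs of kChunk, 2*kChunk, ... until the whole vector is sorted.
    for (std::size_t run = kChunk; run < data.size(); run *= 2) {
        for (std::size_t base = 0; base + run < data.size(); base += 2 * run) {
            auto mid  = data.begin() + base + run;
            auto last = data.begin() + std::min(base + 2 * run, data.size());
            std::inplace_merge(data.begin() + base, mid, last);
        }
    }
    return data;
}
```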
An average of 69% of accelerator area is consumed by memory
Lyons et al., "The Accelerator Store", TACO'12
Not all accelerators on a chip are likely to run at the same time
Accelerator examples: AES, JPEG encoder, FFT, USB, CAN, TFT controller, UMTS decoder, …
High-bandwidth PLMs cannot tolerate additional latency
[1] Lyons et al., "The Accelerator Store: A Shared Memory Framework for Accelerator-Based Systems", TACO'12
[2] Cong et al., "BiN: A Buffer-in-NUCA Scheme for Accelerator-Rich CMPs", ISLPED'12
[3] Fajardo et al., "Buffer-Integrated-Cache: a Cost-Effective SRAM Architecture for Handheld and Embedded Platforms", DAC'11
Prior approaches:
- Accelerator Store [1]: a shared memory pool that accelerators allocate from.
- BiN [2] / Buffer-Integrated-Cache [3]: a substrate that hosts either cache blocks or accelerator buffers.
- These approaches complicate accelerator designs.
This work:
- Applies to all accelerator PLMs, not only low-bandwidth ones.
- Requires only minimal modifications to accelerators.
Hits to blocks stored in accelerators incur additional latency.
Returning data via the host bank guarantees that the host bank is the only coherence synchronization point (see the sketch below).
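A hedged sketch of this lookup flow, assuming hypothetical names (`HostBank`, `HitResult`) and illustrative latency values; it models only where a hit's data lives, not the actual NoC protocol. The key point is that remote (accelerator-hosted) ways add latency, yet the response is still assembled at the host bank, which therefore remains the single synchronization point.

```cpp
#include <array>
#include <cstdint>
#include <optional>

// What a lookup returns: the hit way and the latency it paid.
struct HitResult {
    int way;
    int latencyCycles;  // higher for ways hosted inside an accelerator
};

struct HostBank {
    static constexpr int kWays        = 4;  // e.g. 2 local + 2 remote ways
    static constexpr int kLocalCycles = 8;  // nominal bank hit latency
    static constexpr int kExtraRemote = 6;  // assumed NoC + PLM penalty

    struct WayInfo {
        bool     valid  = false;
        uint64_t tag    = 0;
        bool     remote = false;  // true if the data lives in an accelerator PLM
    };
    std::array<WayInfo, kWays> ways{};

    // The host bank owns the tags for both local and remote ways, so
    // every hit and every coherence action is serialized here, even
    // when the data itself is fetched from an accelerator PLM.
    std::optional<HitResult> lookup(uint64_t tag) const {
        for (int w = 0; w < kWays; ++w) {
            if (!ways[w].valid || ways[w].tag != tag) continue;
            int lat = kLocalCycles + (ways[w].remote ? kExtraRemote : 0);
            return HitResult{w, lat};
        }
        return std::nullopt;  // miss: escalate to directory / DRAM
    }
};

int main() {
    HostBank bank;
    bank.ways[2] = {true, 0xABC, true};  // block hosted in an accelerator
    auto hit = bank.lookup(0xABC);       // hits way 2, pays the remote penalty
    return hit ? 0 : 1;
}
```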
[*] David H. Albonesi, "Selective Cache Ways: On-Demand Cache Resource Allocation", ISCA'99
4-way example: 2 local, 2 remote ways
Requires full-length tags: with a non-power-of-two number of banks, indexing is a modulo operation rather than simple bit selection, so the index bits can no longer be omitted from the tag (see the worked example below).
[*] André Seznec, "Bank-interleaved cache or memory indexing does not require Euclidean division", WDDD'15
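A small worked example of why this is the case, assuming 64-byte blocks (6 offset bits) and a hypothetical 6-bank array. With a power-of-two set count the index is a bit slice, so two blocks in different sets may share a truncated tag without ambiguity; with a modulo index, the set number no longer determines the index bits, so the stored tag must keep every bit above the block offset. (Seznec's result is that the modulo itself can still be computed without a Euclidean division.)

```cpp
#include <cassert>
#include <cstdint>

constexpr int kOffsetBits = 6;  // 64-byte blocks

// Power-of-two case: the index is a bit slice, and those bits are
// implied by the set number, so the stored tag can omit them.
uint64_t indexPow2(uint64_t addr, uint64_t sets) {  // sets = 2^setBits
    return (addr >> kOffsetBits) & (sets - 1);
}
uint64_t tagPow2(uint64_t addr, uint64_t setBits) {
    return addr >> (kOffsetBits + setBits);  // index bits dropped
}

// Non-power-of-two case (e.g. 6 banks): the index is a modulo, not a
// bit selection, so the tag must keep all bits above the block offset.
uint64_t indexMod(uint64_t addr, uint64_t sets) {
    return (addr >> kOffsetBits) % sets;
}
uint64_t tagFull(uint64_t addr) {
    return addr >> kOffsetBits;  // full-length tag
}

int main() {
    uint64_t a = 0x1000, b = 0x1000 + 6 * 64;
    // With 8 sets the blocks land in different sets, so their truncated
    // tags may collide without ambiguity:
    assert(indexPow2(a, 8) != indexPow2(b, 8));
    assert(tagPow2(a, 3) == tagPow2(b, 3));
    // With 6 banks they land in the SAME set, so only the full-length
    // tag can tell them apart:
    assert(indexMod(a, 6) == indexMod(b, 6));
    assert(tagFull(a) != tagFull(b));
}
```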
Dark silicon, a.k.a. "the end of the multi-core era" (Esmaeilzadeh et al., "Dark Silicon and the End of Multicore Scaling", ISCA'11; data source: cpudb + Intel ARK).
Assuming no accelerator activity, the evaluated configurations are:
cores     | 16 cores, i386 ISA, in-order, IPC=1 except on memory accesses, 1 GHz
L1 caches | split I/D 32 KB, 4-way set-associative, 1-cycle latency, LRU
L2 caches | 8-cycle latency, LRU; S-NUCA: 16 ways, 8 banks; ROCA: 12 ways
Coherence | MESI protocol, 64-byte blocks, standalone directory cache
DRAM      | 1 controller, 200-cycle latency, 3.5 GB physical memory
NoC       | 5x5 or 7x7 mesh, 128-bit flits, 2-cycle router traversal, 1-cycle links, XY routing
OS        | Linux v2.6.34
~10,560 cycles, i.e. ~10.56 µs at 1 GHz