# Compiling Parallel Algorithms to Memory Systems 

Stephen A. Edwards

Columbia University

ENS DI Group, June 26th, 2012

$$
(\lambda x . ?) f=\text { FPGA }
$$

## Functional Programs to FPGAs

$\boldsymbol{\lambda} f .(\boldsymbol{\lambda} x .(f(x x)) \boldsymbol{\lambda} x .(f(x x)))$

## Functional Programs to FPGAs



## Functional Programs to FPGAs



## Functional Programs to FPGAs



## Functional Programs to FPGAs



## Functional Programs to FPGAs



## Functional Programs to FPGAs



## Moore's Law: Lots of Cheap Transistors...


"The complexity for minimum component costs has increased at a rate of roughly a factor of two per year."

Closer to every 24 months

Gordon Moore, Cramming More Components onto Integrated Circuits, Electronics, 38(8) April 19, 1965.

## Pollack's Rule: ...Give Diminishing Returns for Processors



Single-core processor performance follows the square root of area.
It takes $4 \times$ the transistors to give $2 \times$ the performance.

Fred J. Pollack, MICRO 1999 keynote. Graph from Borkar, DAC 2007

## Dally: Calculation is Cheap; Communication is Costly


"Chips are power limited and most power is spent moving data

Performance $=$ Parallelism
Efficiency = Locality

Bill Dally's 2009 DAC Keynote, The End of Denial Architecture

## Parallelism for Performance and Locality for Efficiency



Dally: "Single-thread processors are in denial about these two facts"

We need
different programming paradigms and
different architectures
on which to run them.

## Bacon et al.'s Liquid Metal



Fig. 2. Block level diagram of DES and Lime code snippet
JITting Lime (Java-like, side-effect-free, streaming) to FPGAs Huang, Hormati, Bacon, and Rabbah, Liquid Metal, ECOOP 2008.

## Goldstein et al.'s Phoenix



Figure 3: C program and its representation comprising three hyperblocks; each hyperblock is shown as a numbered rectangle. The dotted lines represent predicate values. (This figure omits the token edges used for memory synchronization.)


Figure 8: Memory access network and implementation of the value and token forwarding network. The LOAD produces a data value consumed by the oval node. The store node may depend on the load (i.e., we have a token edge between the LOAD and the STORE, shown as a dashed line). The token travels to the root of the tree, which is a load-store queue (LSQ).

## C to asynchronous logic, monolithic memory

Budiu, Venkataramani, Chelcea and Goldstein, Spatial Computation, ASPLOS 2004.

## Ghica et al.'s Geometry of Synthesis



Figure 1. In-place map schematic and implementation
Algol-like imperative language to handshake circuits
Ghica, Smith, and Singh. Geometry of Synthesis IV, ICFP 2011

## Greaves and Singh’s Kiwi

```
public static void SendDeviceID()
{ int devicelD = 0x76;
    for (int i = 7; i > 0; i--)
    { scl = false;
        sda_out = (deviceID & 64) != 0;
        Kiwi.Pause(); // Set it i-th bit of the device ID
        scl = true; Kiwi.Pause(); // Pulse SCL
        scl = false; deviceID = deviceID << 1;
        Kiwi.Pause();
    }
}
```

C\# with a concurrency library to FPGAs

Greaves and Singh. Kiwi, FCCM 2008

## Arvind, Hoe et al.'s Bluespec



Figure 1.3 Circuit for computing $\operatorname{Gcd}(a, b)$ from Example 1.
Guarded commands and functions to synchronous logic

Hoe and Arvind, Term Rewriting, VLSI 1999

## Sheeran et al.'s Lava

```
bfly :: CmplxArithmetic m
    => [CmplxSig] -> m [CmplxSig]
bfly [i1, i2] =
    do o1 <- csubtract (i1, i2)
        o2 <- cplus (i1, i2)
        return [o1, o2]
```



Figure 9: A butterfly
bflys :: CmplxArithmetic m
=> Int -> [CmplxSig] -> m [CmplxSig]
bflys $\mathrm{n}=$

riffle >-> raised n two bfly >-> unriffle

Figure 10: A butterfly stage of size 8 expressed with riffling
Functional specifications of regular structures

Bjesse, Claessen, Sheeran, and Singh. Lava, ICFP 1998

## Kuper et al.'s C $\lambda$ aSH



Fig. 6. 4-taps FIR Filter
More operational Haskell specifications of regular structures

Baaij, Kooijman, Kuper, Boeijink, and Gerards. C ash, DSD 2010

## AutoESL (Xilinx, was Cong's xPilot)

-SSDM (System-level Synthesis Data Model)

- Hierarchical netlist of concurrent processes and communication channels

- Each leaf process contains a sequential program which is represented by an extended LLVM IR with hardware-specific semantics
- Port / IO interfaces, bit-vector manipulations, cycle-level notations

SystemC input; classical high-level synthesis for processes
Jason Cong, presentation at ISARS 2005

## Optimization of Parallel "Programs" Enables Chip Design



Sun's UltraSPARC T2
The "Niagara 2"
8 cores; 64 threads
Built 2007, 1.6 GHz, 65 nm
Released open-source as the OpenSPARC T2
www.opensparc.net

454000 lines of synthesizable Verilog $\rightarrow 503000000$ transistors A mix of Boolean logic and structure

## The Lesson of Logic Synthesis: the Enabling Technology

How do you compile and optimize a digital logic circuit?

$$
\begin{aligned}
& f_{1}=a b c d+a b c e+a \bar{b} c \bar{d}+a \bar{b} \bar{b} \bar{d}+\bar{a} c+c d f+a b \bar{c} \bar{d} \bar{e}+a \bar{b} \bar{c} d \bar{f} \\
& f_{2}=b d g+\bar{b} d f g+\overline{b d} g+b \bar{d} e g
\end{aligned}
$$

$$
\begin{aligned}
f_{1} & =c(x+\bar{a})+a \overline{c x} \\
f_{2} & =g x \\
x & =d(b+f)+\bar{d}(\bar{b}+e)
\end{aligned}
$$

After Brayton et al.'s class on Multi-Level Logic Synthesis

## The Lesson of Logic Synthesis: the Enabling Technology

How do you compile and optimize a digital logic circuit?
Use a simple, formal model and automate it.

$$
\begin{aligned}
& f_{1}=a b c d+a b c e+a \bar{b} c \bar{d}+a \bar{b} \bar{c} \bar{d}+\bar{a} c+c d f+a b \bar{c} \bar{d} \bar{e}+a \bar{b} \bar{c} d \bar{f} \\
& f_{2}=b d g+\bar{b} d f g+\overline{b d} g+b \bar{d} e g \\
& \text { Minimize } \\
& f_{1}=b c d+b c e+\overline{b d}+\bar{a} c+c d f+a b \bar{c} \bar{d} \bar{e}+a \bar{b} \bar{c} d \bar{f} \\
& f_{2}=b d g+d f g+\overline{b d} g+\bar{d} e g
\end{aligned}
$$

Factor

$$
\begin{aligned}
& f_{1}=c(b(d+e)+\bar{b}(\bar{d}+f)+\bar{a})+a \bar{c}(b \bar{d} \bar{e}+\bar{b} d \bar{f}) \\
& f_{2}=g(d(b+f)+\bar{d}(\bar{b}+e))
\end{aligned}
$$

Decompose

$$
\begin{aligned}
f_{1} & =c(x+\bar{a})+a \overline{c x} \\
f_{2} & =g x \\
x & =d(b+f)+\bar{d}(\bar{b}+e)
\end{aligned}
$$

After Brayton et al.'s class on Multi-Level Logic Synthesis

## High-Level Synthesis: Adding Time Meant Scheduling

(a)

(b) CFG


Figure 3: (a) FSM for scheduled CFG in Figure 2(b), (b) Hardware implementation of FSM using one-hot encoding



Figure 2: (a) VHDL description; (b) Separate control and data-flow graphs

Bergamaschi, Behavioral Network Graph, DAC 1999.

## The High-Level Synthesis Lessons

## Don't Start From C

"The so-called high-level specifications in reality grew out of the need for simulation and were often little more than an input language to make a discrete event simulator reproduce a specific behavior."

Gupta and Brewer, High-Level Synthesis: A Retrospective, 2008.
Don't Forget Memory
Goldstein et al.'s Phoenix synthesized asychronous hardware from ANSI C. Required heroic work [CGO 2003] to recover any parallelism.

## Our Approach



## Our Approach



## Our Approach



## Why Functional Specifications?

- Referential transparency/side-effect freedom make formal reasoning about programs vastly easier
- Inherently concurrent and race-free (Thank Church and Rosser). If you want races and deadlocks, you need to add constructs.
- Immutable data structures makes it vastly easier to reason about memory in the presence of concurrency



## Why FPGAs?

- We do not know the structure of future memory systems Homogeneous/Heterogeneous? Levels of Hierarchy?
 Communication Mechanisms?

- We do not know the architecture of future multi-cores Programmable in Assembly/C? Single- or multi-threaded?


Use FPGAs as a surrogate. Ultimately too flexible, but representative of the long-term solution.

## A Modern High-End FPGA: Altera's Stratix V

2500 dual-ported 2.5 KB 600 MHz memory blocks; 6 Mb total 350 36-bit 500 MHz DSP blocks (MAC-oriented datapaths) 300000 6-input LUTs; 28 nm feature size


## Let's Talk Details



## Let's Talk Details



## Let's Talk Details



## Our Starting Point: A Functional IR

Inspired by the Glasgow Haskell Compiler's "Core" representation

$$
\text { expr }::=\text { name } v a r^{*} \quad \text { Function call }
$$

Includes primitive arithmetic operators and type constructors
Non-tail-recursive calls generally inlined to improve parallelism; Mycroft and Sharp's [IWLS 2000] propose sharing policies

True recursion transformed to tail recursion with a stack

## Our Starting Point: A Functional IR

Inspired by the Glasgow Haskell Compiler's "Core" representation

$$
\begin{aligned}
\text { expr }::= & \text { name var* } \\
& \mid \text { let }(\text { var }=\text { expr })^{+} \text {in expr }
\end{aligned}
$$

Function call
Parallel evaluation

Parallelism and sequencing:

$$
\text { let } \begin{aligned}
v_{1} & =e_{1} \\
v_{2} & =e_{2} \\
v_{3} & =e_{3} \text { in } e
\end{aligned}
$$



## Our Starting Point: A Functional IR

Inspired by the Glasgow Haskell Compiler's "Core" representation


Evaluate and return one of the expressions based on the pattern

## Our Starting Point: A Functional IR

Inspired by the Glasgow Haskell Compiler's "Core" representation

```
expr ::= name var*
    | let (var = expr)+ in expr
    | case var of (pat -> expr)+
    | var
    | literal
pat ::= literal
    | _
    | Constr. (var | literal | _)*
```

    Function call
    Parallel evaluation
    Multiway conditional
    Variable reference
    Literal value
    Exact match
Default
Match a tagged union

## The Type System: Tagged Unions

Types are primitive (Boolean, Integer, etc.) or tagged unions:

```
type ::= Type
    | Constr Type* | . | | Constr Type*
```

Named type/primitive
Tagged union

Subsume C structs, unions, and enums
Comparable power to C++ objects with virtual methods
Sometimes called "algebraic data types": sums of products

## The Type System: Tagged Unions

Types are primitive (Boolean, Integer, etc.) or tagged unions:

$$
\begin{array}{rll}
\text { type }::=\text { Type } & \text { Named type/primitive } \\
& \mid \text { Constr Type }|\cdots| \text { Constr Type* } & \text { Tagged union }
\end{array}
$$

Examples:

$$
\begin{aligned}
\text { data } \text { Intlist } & =\text { Nil } \quad-- \text { Linked list of integers } \\
& \mid \text { Cons } \text { Int } \text { Intlist }
\end{aligned}
$$

$$
\begin{gathered}
\text { data } \text { Bintree }=\text { Leaf } \text { Int }- \text { Binary tree w/ integer leaves } \\
\mid \text { Branch BinTree Bintree }
\end{gathered}
$$

$$
\text { data } \text { Expr }=\text { Literal } \text { Int } \quad-- \text { Arithmetic expression }
$$

| Var String
Binop Expr Op Expr

$$
\text { data } O p=A d d|S u b| M u l t \mid \text { Div }
$$

## Syntax-Directed Translation of Expressions to Hardware



Combinational functions:


Sequential functions:


## Translating Let and Case



Let makes all new variables available to its body.


Case invokes one of its sub-expressions, then synchronizes.

## Representing Recursive Algebraic Data Types

Consider a list of integers:
data Intlist $=$ Nil
| Cons Int Intlist
An obvious representation:

| 1 | Integer | Pointer |
| :--- | :--- | :--- |

## Nil

Cons Int Intlist

- Usual byte-alignment unnecessary \& wasteful in hardware
- Naturally stored \& managed in a custom integer-list memory
- Width of pointer can depend on integer-list memory size


## Removing Recursion: Recursive Fibonacci Example

Starting point: a dumb way to compute Fibonacci numbers

$$
\begin{aligned}
& \text { fib } 1=1 \\
& \text { fib } 2=1 \\
& \text { fib } n=\text { fib }(n-1)+f i b(n-2)
\end{aligned}
$$

## Removing Recursion: Recursive Fibonacci

Reformatting

$$
\begin{array}{lllc}
f i b & 1 & = & 1 \\
\text { fib } 2 & & & 1 \\
\text { fib } & n & & \\
& & f i b & (n-1)+ \\
& & & \text { fib }
\end{array}
$$

## Removing Recursion: Continuation-Passing Style

In continuation-passing style (the "and then?" transformation):

$$
\begin{array}{lllll}
\text { fib1 1 } & c & = & c 1 & \\
\text { fib1 } 2 & c & = & c 1 & \\
\text { fib1 } n & c & = & \text { fib1 }(n-1) & \text {-- Calls made sequent } \\
& & (\backslash n 1-> & \text { fib1 }(n-2) & \text {-- Intermediates nam } \\
& & (\backslash n 2-> & c(n 1+n 2))) & \text { - - Add scheduled last } \\
\text { fib } n & & = & \text { fib1 } n(\backslash x->x) & \text {-- Wrapper }
\end{array}
$$

## Removing Recursion: Naming Functions

Naming functions; converting unbound variables to arguments:


## Removing Recursion: True Recursion to Tail Recursion

Introducing a stack; merging functions

| $f$ | (Fibl 1 c) | $=f($ Cont c 1) | -- Single function |
| :---: | :---: | :---: | :---: |
| $f$ | (Fib1 2 c) | $=f($ Cont c 1) | -- Continuation the stack |
| $f$ | (Fibl $n$ c) | $=f($ Fibl ( $n-1$ |  |
| $f($ Cont (Fib2 $n$ c) $n 1)=f($ Fib1 ( $n-2)($ Fib3 n1 c) $)$ |  |  |  |
| $f(\operatorname{Cont}($ Fib3 n1 c) n2 $)=f(\operatorname{Contc}(n 1+n 2))$ |  |  |  |
| $f$ | (Fib n) | $=f($ Fibl $n$ Fi |  |
|  | t Fibo n) | $=n$ |  |

> - - Continuations (references to the lambda expressions) data Stack $=$ Fib2 Int Stack -- fib2 n c

## Fibonacci Datapath



data Stack = Fib2 Int Stack
Fib3 Int Stack
Fib0
data Action = Fib Int
| Fibl Int Stack
Cont Stack Int

## Implementing the Stack in Hardware

This uses a list-like stack data type:
data Stack = Fib2 Int Stack
| Fib3 Int Stack
Fibo
A naïve, but correct, way to implement it in hardware:


## Specializing Data Types: Recovering a Classical Stack

Fib3 42 (Fib2 17 (Fib3 8 (Fib3 2 Fib0)))


## Specializing Data Types: Recovering a Classical Stack

Fib3 42 (Fib2 17 (Fib3 8 (Fib3 2 Fib0)))


The only "pop" operation discards the previous top-of-stack

$$
f(\text { Cont }(\text { Fib3 n1 c) n2) }=f(\text { Cont } c(n 1+n 2))
$$

so this code will never generate a tree.
Sequential memory allocation is safe.

## Specializing Data Types: Recovering a Classical Stack

Fib3 42 (Fib2 17 (Fib3 8 (Fib3 2 Fib0)))


Sequential memory allocation makes "next" pointers predictable...

## Specializing Data Types: Recovering a Classical Stack

Fib3 42 (Fib2 17 (Fib3 8 (Fib3 2 Fib0)))

...so there is no need to store them.
Constructor (Fib0) always returns 0.
Constructors (Fib2/3 $n s$ ) writes (Fib2/3 $n$ ) at $s+1$ and returns $s+1$. Reading 0 returns Fib0; reading $s$ returns (Fib2/3ns-1).

## Specializing Data Types



Stacks are the tip of the iceberg
Synthesizing custom memory systems for specific types is a key goal of this project

Shape Analysis relevant here
This is a simple case; a simple, mathematical IR enables such clever optimizations.

Imagine trying to do this in C.

## Unrolling Code for Better Parallelism

$$
\begin{aligned}
& \text { fib } 0=0 \\
& \text { fib } 1=1 \\
& \text { fib } n=\text { fib } \quad(n-1)+f i b \quad(n-2)
\end{aligned}
$$

$$
\begin{aligned}
& f i b(n-1) \text { and } f i b(n-2) \text { are } \\
& \text { functionally independent. }
\end{aligned}
$$

Yet because they share fib, they are performed sequentially.

## Unrolling Code for Better Parallelism

$$
\begin{aligned}
& \text { fib } 0=0 \\
& \text { fib } 1=1 \\
& \text { fib } n=\text { fib' }(n-1)+\text { fib'" }^{\prime}(n-2) \\
& \text { fib' } 0=0 \\
& \text { fib' } 1=1 \\
& \text { fib' } n=\text { fib' }(n-1)+\text { fib' }^{\prime}(n-2) \\
& \text { fib" } 0=0 \\
& \text { fib'" } 1=1 \\
& \text { fib' } n=\text { fib" }(n-1)+\text { fib' }^{\prime \prime}(n-2)
\end{aligned}
$$

By unrolling the recursion once, $f i b$ ' and $f i b$ " run in parallel.

A further improvement: balance the work done by fib' and fib"

## Unrolling Types for Better Locality

data | Stack | $=$ Fib2 Int Stack |
| ---: | :--- |
|  | $\left.\left\lvert\, \begin{array}{l}\text { Fib3 Int Stack } \\ \\ \end{array}\right.\right)$ Fib0 |

Each Stack object naturally represents a single activation record

## Unrolling Types for Better Locality



A similar unrolling amounts to packing records that can be processed in parallel

Abstract data types enables this
Imagine trying to do this safely in a C compiler

## Example: Huffman Decoder in Haskell

data HTree $=$ Branch HTree HTree

## | Leaf Char

decode :: HTree $->$ [Bool] $\rightarrow$ [Char $]$-- Huffman tree \& bitstream to symbols
decode table str $=$ decoder table str
where
decoder (Leaf s) $i=s:($ decoder table $i$ ) -- Identified symbol; start again
decoder_ [] = []
decoder (Branchf_) (False: $x s$ ) $=\operatorname{decoder} f x s--0$ : follow left branch
decoder (Branch_t) (True: $x s$ ) $=$ decoder $t x s--1$ : follow right branch
Three data types: Input bitstream, output character stream, and Huffman tree

## Optimizations



## Target Applications

- "Data-parallel irregular applications [that] manipulate large pointer-based data structures like graphs"
[Pingali et al.'s Galois project]
- Datatype accelerators

Hash tables, Balanced trees, Heaps

- Application-domain accelerators

Relational databases, Crypography, Data compression

- Non-scientific computing: the stuff that's hard for vector units and GPGPUs


## Acknowledgements

Project started while at MSR Cambridge


Satnam Singh (now at Google)

Simon Peyton Jones (MSR)

Martha Kim (Columbia)


