# Functioning Hardware from <br> Functional Specifications 

Stephen A. Edwards

Columbia University

ACSD Keynote, Tunis, June 26, 2014

Where's my 10 GHz processor?

## Moore's Law: Transistors Shrink and Get Cheaper


"The complexity for minimum component costs has increased at a rate of roughly a factor of two per year."

Closer to every 24 months

Gordon Moore, Cramming More Components onto Integrated Circuits, Electronics, 38(8) April 19, 1965.

## Intel CPU Trends



Sutter, The Free Lunch is Over, DDJ 2005.
Data: Intel, Wikipedia, K. Olukotun

## Intel CPU Trends



Sutter, The Free Lunch is Over, DDJ 2005.
Data: Intel, Wikipedia, K. Olukotun

## Pollack's Rule: Diminishing Returns for Processors



Single-threaded processor performance grows with the square root of area.

It takes
$4 \times$ the transistors to give $2 \times$ the performance.

## Intel CPU Trends



Sutter, The Free Lunch is Over, DDJ 2005.
Data: Intel, Wikipedia, K. Olukotun

## Intel Processors to Scale



## What Happened in 2005?



Pentium 4 2000 1 core
Transistors: 42 M


Core 2 Duo 2006

2 cores
291 M


Xeon E5
2012
8 cores
2.3 G

## Intel CPU Trends



Sutter, The Free Lunch is Over, DDJ 2005.
Data: Intel, Wikipedia, K. Olukotun

The Cray-2:Immersed in Fluorinert


## Heat Flux in IBM Mainframes: A Familiar Trend



Schmidt. Liquid Cooling is Back. Electronics Cooling. August 2005.

## Liquid Cooled Apple Power Mac G5



2004 CMOS 1.2 kW

## Where's all that power going? What can we do about it?

## Dally: Calculation Cheap; Communication Costly


"Chips are power limited and most power is spent moving data

Performance = Parallelism

Efficiency = Locality

Bill Dally's 2009 DAC Keynote, The End of Denial Architecture

## Parallelism for Performance; Locality for Efficiency



Dally: "Single-thread processors are in denial about these two facts"

We need
different programming paradigms and
different architectures on which to run them.

## Massive On-Chip Parallelism: The NVIDIA GTX Titan



## The NVIDIA GTX Titan/GK110 GPU



## The Future is Wires and Memory



## How best to use all those transistors?



## Xilinx's Vivado (Was xPilot, AutoESL)

-SSDM (System-level Synthesis Data Model)

- Hierarchical netlist of concurrent processes and communication channels

- Each leaf process contains a sequential program which is represented by an extended LLVM IR with hardware-specific semantics
- Port / IO interfaces, bit-vector manipulations, cycle-level notations

SystemC input; classical high-level synthesis for processes
Jason Cong et al. ISARS 2005

## Taylor and Swanson's Conservation Cores



Custom datapaths, controllers for loop kernels; uses existing memory hierarchy
Swanson, Taylor, et al. Conservation Cores. ASPLOS 2010.

## Bacon et al.'s Liquid Metal



Fig. 2. Block level diagram of DES and Lime code snippet
JITting Lime (Java-like, side-effect-free, streaming) to FPGAs
Huang, Hormati, Bacon, and Rabbah, Liquid Metal, ECOOP 2008.

## Goldstein et al.'s Phoenix



Figure 3: C program and its representation comprising three hyperblocks; each hyperblock is shown as a numbered rectangle. The dotted lines represent predicate values. (This figure omits the token edges used for memory synchronization.)


Figure 8: Memory access network and implementation of the value and token forwarding network. The LOAD produces a data value consumed by the oval node. The store node may depend on the load (i.e., we have a token edge between the LOAD and the STORE, shown as a dashed line). The token travels to the root of the tree, which is a load-store queue (LSQ).

## C to asynchronous logic, monolithic memory

Budiu, Venkataramani, Chelcea and Goldstein, Spatial Computation, ASPLOS 2004.

## Ghica et al.'s Geometry of Synthesis



Figure 1. In-place map schematic and implementation
Algol-like imperative language to handshake circuits
Ghica, Smith, and Singh. Geometry of Synthesis IV, ICFP 2011

## Greaves and Singh's Kiwi

```
public static void SendDeviceID()
{ int deviceID = 0x76;
    for (int i=7; i > 0; i--)
        { scl = false;
        sda_out = (deviceID & 64) != 0;
        Kiwi.Pause(); // Set it i-th bit of the device ID
        scl = true; Kiwi.Pause(); // Pulse SCL
        scl = false; deviceID = deviceID << 1;
        Kiwi.Pause();
        }
}
```

C\# with a concurrency library to FPGAs
Greaves and Singh. Kiwi, FCCM 2008

## Arvind, Hoe, et al.'s Bluespec

$$
\begin{aligned}
& \text { GCD Mod Rule } \\
& \quad \operatorname{Gcd}(a, b) \text { if }(a \geq b) \wedge(b \neq 0) \rightarrow \operatorname{Gcd}(a-b, b) \\
& \text { GCD Flip Rule } \\
& \quad \operatorname{Gcd}(a, b) \text { if } a<b \rightarrow \operatorname{Gcd}(b, a)
\end{aligned}
$$



Figure 1.3 Circuit for computing $\operatorname{Gcd}(a, b)$ from Example 1.
Guarded commands and functions to synchronous logic Hoe and Arvind, Term Rewriting, VLSI 1999

## Sheeran et al.'s Lava

```
bfly :: CmplxArithmetic m
    #> [CmplxSig] -> m [CmplxSig]
bfly [i1, i2] =
    do o1 <- csubtract (i1, i2)
        o2 <- cplus (i1, i2)
        return [o1, o2]
```

bflys :: CmplxArithmetic m
=> Int -> [CmplxSig] -> m [CmplxSig]
bflys $\mathrm{n}=$
riffle >-> raised $n$ two bfly >-> unriffle
Figure 10: A butterfly stage of size 8 expressed with riffling

Functional specifications of regular structures
Bjesse, Claessen, Sheeran, and Singh. Lava, ICFP 1998

## Kuper et al.'s $\mathrm{C} \lambda \mathrm{aSH}$



Fig. 6. 4-taps FIR Filter

More operational Haskell specifications of regular structures
Baaij, Kooijman, Kuper, Boeijink, and Gerards. C $\lambda$ ash, DSD 2010

## What am I doing about it?

## Functional Programs to FPGAs

$\boldsymbol{\lambda} f .(\boldsymbol{\lambda} x .(f(x x)) \boldsymbol{\lambda} x .(f(x x)))$

## Functional Programs to FPGAs



## Functional Programs to FPGAs



## Functional Programs to FPGAs



## Functional Programs to FPGAs



## Functional Programs to FPGAs



## Functional Programs to FPGAs



## More Motivation



## More Motivation



## More Motivation



## Why Functional Specifications?

- Referential transparency/side-effect freedom make formal reasoning about programs vastly easier
- Inherently concurrent and race-free (Thank Church and Rosser). If you want races and deadlocks, you need to add constructs.
- Immutable data structures makes it vastly easier to reason about memory in the presence of concurrency



## Why FPGAs?

- We do not know the structure of future memory systems Homogeneous/Heterogeneous? Levels of Hierarchy? Communication Mechanisms?

- We do not know the architecture of future multi-cores Programmable in Assembly/C? Single- or multi-threaded?

Use FPGAs as a surrogate. Ultimately too flexible, but representative of the long-term solution.

## A High-End FPGA: Altera's Stratix V

2500 dual-ported 2.5KB 600 MHz memory blocks; 6 Mb total 350 36-bit 500 MHz DSP blocks (MAC-oriented datapaths) 300000 6-input LUTs; 28 nm feature size


## The Practical Question

> How do we synthesize hardware from pure functional languages for FPGAs?

Control and datapath are easy; the memory system is interesting.

## To Implement Real Algorithms, We Need

Structured, recursive data types


Recursion to handle recursive data types

Memories


Memory Hierarchy


## The Type System: Algebraic Data Types

Types are primitive (Boolean, Integer, etc.) or other ADTs:

$$
\begin{array}{cl}
\text { type }::=\text { Type } & \text { Primitive type } \\
& \mid \text { Constr Type* }|\cdots| \text { Constr Type* } \\
\text { Tagged union }
\end{array}
$$

Subsume C structs, unions, and enums
Comparable power to C++ objects with virtual methods
"Algebraic" because they are sum-of-product types.

## The Type System: Algebraic Data Types

Types are primitive (Boolean, Integer, etc.) or other ADTs:

$$
\begin{array}{rll}
\text { type }::=\text { Type } & \text { Primitive type } \\
& \mid \text { Constr Type* }|\cdots| \text { Constr Type* } & \text { Tagged union }
\end{array}
$$

Examples:

$$
\begin{gathered}
\text { data Intlist }=\text { Nil -- Linked list of integers } \\
\text { | Cons Int Intlist }
\end{gathered}
$$

data Bintree = Leaf Int $\quad--$ Binary tree of integers Branch Bintree Bintree

$$
\begin{aligned}
\text { data Expr } & =\text { Literal Int } \quad-- \text { Arithmetic expression } \\
& \text { | Var String } \\
& \text { | Binop Expr Op Expr }
\end{aligned}
$$

data $\mathrm{Op}=$ Add $\mid$ Sub $\mid$ Mult $\mid$ Div

## Algebraic Datatypes in Hardware: Lists

| data IntList | $=$ Cons Int IntList |
| ---: | :--- |
|  | $\mid$ Nil |

48
3332
10

| pointer | int | 10 |
| :---: | :---: | :---: |
|  | Cons |  |
|  | Nil |  |

## Datatypes in Hardware: Binary Trees

data IntTree $=$ Branch IntTree IntTree
Leaf Int

| 32 | 1716 | 10 |
| :--- | :--- | :--- |
| pointer | pointer | 0 |
| Branch |  |  |
| int | 1 | Leaf |
| int |  |  |

## Example: Huffman Decoder in Haskell

| $\begin{gathered} \text { data HTree }=\text { Branch HTree HTree } \\ \text { \| Leaf Char } \end{gathered}$ |  |  |
| :---: | :---: | :---: |
| decode :: HTree $\rightarrow$ [Bool] $\rightarrow$ [Char] |  |  |
| decode table str $=$ bit str table where |  |  |
| bit (False:xs) | (Branch I _) = bit xs | -- 0: left |
| bit (True:xs) | (Branch _ r) = bit xs | -- 1: right |
| bit x | (Leaf c) = c : bit | - leaf |
| bit [] | = [] | --done |

Three data types:

Input bitstream
Output character stream Huffman tree
[Bool] (list of Booleans)
[Char] (list of Characters)
HTree

## Encoding the Types

Huffman tree nodes: (19 bits)

|  | 8 -bit char | 1 | Leaf |
| :---: | :---: | :---: | :--- |
| 9-bit pointer | 9-bit pointer | 0 | Branch |

Boolean input stream: (14 bits)

| 12-bit pointer | B 1 | Cons |
| ---: | :--- | :--- |
|  | 0 | Nil |

Character output stream: (19 bits)


Optimizations


Optimizations


## Use Streams




Use Streams

## Optimizations



Unroll for locality

Speculate


## Hardware Synthesis:

Semantics-preserving steps to
a low-level dialect

## High-Level Synthesis in a Functional Setting

 diffeq a dxxuy=if $x<a$ then
diffeq a $d x(x+d x)(u-5 * x * u * d x-3 * y * d x)(y+u * d x)$ else y

Direct translation: too many multipliers


## Scheduling: Time-multiplex Two Multiplers

 diffeq a dx x u y =if $x<a$ then
diffeq a $d x(x+d x)(u-5 * x * u * d x-3 * y * d x)(y+u * d x)$ else y


## Scheduling: Mapping to a Datapath

diffeq a dx x u y $=$
if $x<a$ then
diffeq a $d x(x+d x)(u-5 * x * u * d x-3 * y * d x)(y+u * d x)$ else y

Introduce a function representing the datapath dpath $m 1 m 2 m 3 m 4 a 1 a 2 s 1 s 2 c 1 c 2 k=$
$\quad k(m 1 * m 2)(m 3 * m 4)(a 1+a 2)(s 1-s 2)(c 1<c 2)$
and re-express the function with the datapath diffeq adx $x$ uy $=$
dpath $u d x 5 \times x d x 00 \times a\left(\lambda p a p b x \_c \rightarrow\right.$ if not $c$ then $y$ else
dpath papb 3 y $000000(\lambda$ pa pb _ _ $\rightarrow$
dpath udx dxpb00u pa 00 ( $\lambda$ papb_d $\rightarrow$
dpath 0000 y pad pb00( $\lambda_{-}$_ $s d_{-} \rightarrow$ diffeq a $\left.\left.d x \times d s\right)\right)$ ))

## Name Lambda Expressions (Continuations)

diffeq a dx x $u y=$
if $x<a$ then
diffeq a $d x(x+d x)(u-5 * x * u * d x-3 * y * d x)(y+u * d x)$ else y

k1 adx uy papbs_c=
if not $c$ then $y$ else
dpath pa pb 3 y 0000000 (k2 a dx s u y)
k2 adxxuy pa pb___=
dpath $u d x d x p b 00$ u pa 00 (k3adx $x$ y)
k3adxx y papb_d_= dpath 00000 y pad pb 00 (k0adx x )
diffeq a dx x u y $=\mathrm{k} 0 \mathrm{adx} \mathrm{x} 00 \mathrm{y} u$ False

## Encode Continuations as a Type

data Cont = KO Int Int Int K1 Int Int Int Int K2 Int Int Int Int Int K3 Int Int Int Int
dpath m1 m2 m3 m4 a1 a2 s1 s2 c1 c2 k= kk k (m1 * m2) (m3 * m4) (a1 + a2) ( $\mathrm{s} 1-\mathrm{s} 2$ ) $(\mathrm{c} 1<\mathrm{c} 2)$
kk k m1 m2 a s c = case ( $k, m 1, m 2, a, s, c$ ) of
(K0 a dxx ,_,_, s,d,_) $\rightarrow$
dpath d dx $5 \times x d x 00 \times a(K 1 a d x \quad d s)$
(K1 a dx $u y, p a, p b, s, \ldots, c$ ) $\rightarrow$ if not $c$ then $y$ else dpath pa pb 3 y 000000 (K2 a dx s u y)
$(\mathrm{K} 2 \mathrm{a} \mathrm{dx} x$ u y, pa, pb,_,_,_) $\rightarrow$ dpath u dx dx pb 00 u pa 00 (K3 a dx x y)
$\left(K 3\right.$ a dx x $\left.\quad \mathrm{y}, \mathrm{pa,pb}, \ldots, \mathrm{~d}, \_\right) \rightarrow$ dpath 0000 y padpb 00 (K0adx x )
diffeq a dx $x$ uy $=k k(K 0 a d x x) 00 y u$ False

## Syntax-Directed Translation to Hardware



## Removing Recursion: The Fib Example

fib n

$$
\begin{array}{ll}
= & \text { case } \mathrm{n} \text { of } \\
& 1 \\
& \rightarrow 1 \\
2 & \rightarrow 1 \\
\mathrm{n} & \rightarrow \text { fib }(\mathrm{n}-1)+\mathrm{fib}(\mathrm{n}-2)
\end{array}
$$

## Transform to Continuation-Passing Style

fibk $n \mathrm{k}=$ case n of

| 1 | $\rightarrow \mathrm{k} 1$ |
| :--- | :--- | :--- |
| 2 | $\rightarrow \mathrm{k} 1$ |

$n \quad \rightarrow$ fibk $(n-1)(\lambda n 1 \rightarrow$
fibk $(\mathrm{n}-2)(\lambda \mathrm{n} 2 \rightarrow$
$k(n 1+n 2))$ )
fib $n=$ fibk $n(\lambda x \rightarrow x)$

## Name Lambda Expressions (Lambda Lifting)

fibk $n$ k case $n$ of

$$
\begin{array}{ll}
1 & \rightarrow k 1 \\
2 & \rightarrow k \text { 1 } \\
\mathrm{n} & \rightarrow \text { fibk }(\mathrm{n}-1)(\mathrm{k} 1 \mathrm{nk})
\end{array}
$$

| k1 | $\mathrm{n} \mathrm{k} \mathrm{n} 1=$ | fibk (n-2) (k2 n1 k) |
| :---: | :---: | :---: |
| k2 | $\mathrm{n} 1 \mathrm{kn2}=$ | $k(n 1+n 2)$ |
| k0 | $\mathrm{x}=$ | x |
| fib | n | fibk n k0 |

## Represent Continuations with a Type

## data Cont $=$ K0 $\mid$ K1 Int Cont | K2 Int Cont

```
fibk n k = case ( n,k) of
    (1, k) }->\mathrm{ kk k 1
    (2, k) }->\mathrm{ kk k 1
    (n, k) -> fibk (n-1) (K1 n k)
kk k a = case (k, a) of
    ((K1 n k), n1) }->\mathrm{ fibk (n-2) (K2 n1 k)
    ((K2 n1 k), n2) -> kk k (n1 + n2)
    (K0, x ) }->\textrm{x
fib n
            = fibk nKO
```


## Merge Functions

```
data Cont = K0 | K1 Int Cont | K2 Int Cont
data Call = Fibk Int Cont | KK Cont Int
fibk z = case z of
    (Fibk 1 k) }->\mathrm{ fibk (KK k 1)
    (Fibk 2 k) }->\mathrm{ fibk (KK k 1)
    (Fibk n k) }->\mathrm{ fibk (Fibk (n-1) (K1 n k))
    (KK (K1 n k) n1) -> fibk (Fibk (n-2) (K2 n1 k))
    (KK (K2 n1 k) n2) -> fibk (KK k (n1 + n2))
    (KK K0 x ) }->\textrm{x
fib n
        = fibk(Fibk n K0)
```


## Add Explicit Memory Operations

```
load :: CRef }->\mathrm{ Cont
store :: Cont }->\mathrm{ CRef
data Cont = K0| K1 Int CRef| K2 Int CRef
data Call = Fibk Int CRef| KK Cont Int
fibk z = case z of
    (Fibk 1 k) }->\mathrm{ fibk (KK (load k) 1)
    (Fibk 2 k) }->\mathrm{ fibk (KK (load k) 1)
    (Fibk n k) }->\mathrm{ fibk (Fibk (n-1) (store (K1 n k)))
    (KK (K1 n k) n1) }->\mathrm{ fibk (Fibk (n-2) (store (K2 n1 k)))
    (KK (K2 n1 k) n2) -> fibk (KK (load k) (n1 + n2))
    (KK K0 x ) }->\textrm{x
fib n
        = fibk (Fibk n (store K0))
```


## Syntax-Directed Translation to Hardware



## Duplication Can Increase Parallelism

fib $0=0$
fib $1=1$
fib $n=f i b(n-1)+f i b(n-2)$

## Duplication Can Increase Parallelism

## After duplicating functions:

$$
\begin{aligned}
& \text { fib } 0=0 \\
& \text { fib } 1=1 \\
& \text { fib } n=\mathrm{fib}^{\prime} \quad(n-1)+\mathrm{fib}^{\prime \prime}(n-2) \\
& \mathrm{fib}^{\prime} 0=0 \\
& \mathrm{fib}^{\prime} 1=1 \\
& \mathrm{fib}^{\prime} n=\mathrm{fib}^{\prime} \quad(n-1)+\mathrm{fib}^{\prime}(n-2) \\
& \mathrm{fib}^{\prime \prime} 0=0 \\
& \mathrm{fib}^{\prime \prime} 1=1 \\
& \mathrm{fib}^{\prime \prime} \quad n=\mathrm{fib}^{\prime \prime} \quad(n-1)+\mathrm{fib}^{\prime \prime}(n-2)
\end{aligned}
$$

Here, fib' and fib" may run in parallel.

## Unrolling Recursive Data Structures

Original Huffman tree type:
data Htree $=$ Branch Htree HTree | Leaf Char

Unrolled Huffman tree type:
data Htree $=$ Branch Htree' HTree' | Leaf Char data Htree ${ }^{\prime}=$ Branch $^{\prime}$ Htree ${ }^{\prime \prime}$ HTree $^{\prime \prime}$ | Leaf' Char data Htree ${ }^{\prime \prime}=$ Branch $^{\prime \prime}$ Htree HTree | Leaf" Char

Increases locality: larger data blocks.
A type-aware cache line


- Dark Silicon is the future: faster transistors; most must remain off
- Custom accelerators are the future; many approaches
- My project: A Pure Functional Language to FPGAs



Boolean input stream: (14 bits)

| 12-bit pointer | B | 1 |
| :--- | :--- | :--- |
|  | Cons |  |
|  | 0 | Nil |

Character output stream: (19 bits)


Syntax-Directed Translation to Hardware


