\documentclass[11pt]{article}
\usepackage{latexsym}
\usepackage{amsmath,amssymb,amsthm}
\usepackage{epsfig}
\usepackage[right=0.8in, top=1in, bottom=1.2in, left=0.8in]{geometry}
\usepackage{setspace}
\spacing{1.06}
% Custom additions
\usepackage[shortlabels]{enumitem}
\usepackage{commath}
\usepackage{tikz}
\usetikzlibrary{circuits.logic.US}
\newcommand{\handout}[5]{
\noindent
\begin{center}
\framebox{
\vbox{\vspace{0.25cm}
\hbox to 5.78in { {COMS 4995-2:\hspace{0.12cm}Advanced
Algorithms (Spring'20)} \hfill #2 }
\vspace{0.48cm}
\hbox to 5.78in { {\Large \hfill #5 \hfill} }
\vspace{0.42cm}
\hbox to 5.78in { {#3 \hfill #4} }\vspace{0.25cm}
}
}
\end{center}
\vspace*{4mm}
}
\newcommand{\lecture}[4]{\handout{#1}{#2}{#3}{Scribes:\hspace{0.08cm}#4}{Lecture #1}}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{observation}[theorem]{Observation}
\newtheorem{example}[theorem]{Example}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{fact}[theorem]{Fact}
\newtheorem{assumption}[theorem]{Assumption}
\newcommand{\E}{{\mathbb E}}
\DeclareMathOperator{\var}{Var}
\begin{document}
\lecture{22: Cache Oblivious Models, Parallel Algorithms}{Apr 16, 2020}
{Instructor:\hspace{0.08cm}\emph{Alex Andoni}}{\emph{Kundan Guha}}
\section{Introduction}
Topics covered today:
\begin{itemize}
\item Cache Oblivious Model
\item Parallel Algorithms
\end{itemize}
\section{Cache Oblivious Model Continued}
%TODO: Make this into a theorem or lemma statement rather than subsection
\subsection{Problem: Search (in a search tree)}
We have a set of \(N\) numbers to search over; the goal is to search for a
given number with \(O(\log_B N + 1)\) cache-line misses, where \(B\) is the
cache-line size.
\subsection{Solution}
Balanced BST stored in the \textit{van Emde Boas layout}: cut the tree of
height \(\log N\) at half its height, giving a top subtree \(T_0\) of height
\(\frac{1}{2}\log N\) whose \(\sqrt{N}\) leaves root bottom subtrees
\(T_1, \dots, T_{\sqrt{N}}\), each of size roughly \(\sqrt{N}\); lay out each
piece recursively and concatenate.
%(Diagram: tree layout triangle with height \(\log N\), height from \(T_0\) to \(\frac{1}{2} \log N\) is marked which is \(\sqrt N\) with \(T_{\sqrt N}\) subtrees from that point)
Layout: \(\mathrm{vEB}(T) = \mathrm{vEB}(T_0), \mathrm{vEB}(T_1), \dots,
\mathrm{vEB}(T_{\sqrt{N}})\)

Search: as in a usual BST.
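The recursive layout can be sketched in Python (an illustrative aid, not from
lecture; \texttt{veb\_layout} is a made-up helper, and the tree is assumed
perfect, with nodes labeled by heap indices: root is 1, children of \(v\) are
\(2v\) and \(2v+1\)):

```python
# A minimal sketch of the van Emde Boas layout for a perfect BST.
# Nodes are heap indices (root = 1, children of v are 2v and 2v+1).

def veb_layout(root, height):
    """Return node labels in vEB order for the perfect subtree of the
    given height rooted at heap index `root`."""
    if height == 1:
        return [root]
    top_h = height // 2          # top recursion: upper half of the tree
    bot_h = height - top_h       # bottom recursion: lower half
    order = veb_layout(root, top_h)
    # Leaves of the top subtree, in left-to-right order:
    top_leaves = [root * (1 << (top_h - 1)) + i for i in range(1 << (top_h - 1))]
    for leaf in top_leaves:
        # Each leaf of the top subtree roots two bottom subtrees.
        order += veb_layout(2 * leaf, bot_h)
        order += veb_layout(2 * leaf + 1, bot_h)
    return order

print(veb_layout(1, 3))  # [1, 2, 4, 5, 3, 6, 7]
```

For a 7-node tree the order is root, then the left subtree's three nodes
contiguously, then the right subtree's: each recursive piece occupies a
contiguous range of memory, which is what the analysis below exploits.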
\subsection{Analysis of Query}
\begin{enumerate}[(1)]
\item
Consider the subtrees at recursion level \(e\); each has size
\(N^{(1/2)^e} \equiv k\) and height \(\log k\). Each such subtree is stored
contiguously, so traversing it incurs
\(O\del[2]{1 + \left\lceil \frac{k}{B} \right\rceil}\) cache-line misses
(CLM); in particular, if \(k \le B\), this is \(O(1)\) misses.
\item
%(Diagram: Tree, k pointing at top, many subtree triangles descending from root, \(\log k\) height of each subtree, size of \(k\), total width of \(\frac{N}{2}\) )
A root-to-leaf path crosses \(O\del{\frac{\log N}{\log k}}\) size-\(k\)
subtrees, since the path has length \(\log N\) and each subtree covers
\(\log k\) of it.
\end{enumerate}
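Combining (1) and (2) gives the target bound; a short derivation, filling in
the step left implicit above:

```latex
% Stop the recursion at the first level e with k = N^{(1/2)^e} \le B;
% the previous level had size k^2 > B, so k > \sqrt{B}. Hence
\#\text{CLM} \;=\; O\del{\frac{\log N}{\log k}} \cdot O(1)
\;\le\; O\del{\frac{\log N}{\frac{1}{2}\log B}} \;=\; O\del{\log_B N}.
```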
\section{Cache Line Transfers (in Cache-Oblivious Setting)}
% (Diagram: Cache rectangle diagram of width \(B\) with two disjoint areas shaded and having arrows pointing to disjoint shaded sections in a larger (height wise) memory rectangle diagram)
When the cache is full, what should be thrown out? Rely on an automatic
\textit{paging algorithm}:
\begin{itemize}[-]
\item FIFO (First In, First Out): throw out the cache line that was brought
in earliest. Works well when lines brought in long ago are unlikely to be
needed again; not great when old lines are constantly reused, since it
evicts them regardless.
\item LRU (Least Recently Used): throw out the cache line that has been
least recently used. Better in many cases, but still not optimal for all
tasks.
\end{itemize}
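The two rules differ only in whether a cache hit refreshes a line's position
in the eviction order; a minimal Python sketch (illustrative, names made up),
counting misses for a cache of \(M\) lines:

```python
# A small sketch of the two paging rules, counting cache-line misses on a
# request sequence. `fifo` evicts the line brought in earliest; `lru` the
# line used least recently.
from collections import OrderedDict

def count_misses(requests, M, policy):
    cache = OrderedDict()              # eviction order: front -> back
    misses = 0
    for line in requests:
        if line in cache:
            if policy == "lru":        # refresh recency; FIFO ignores hits
                cache.move_to_end(line)
            continue
        misses += 1
        if len(cache) == M:
            cache.popitem(last=False)  # evict front (oldest)
        cache[line] = True
    return misses

seq = [1, 2, 3, 1, 4, 1, 5, 1, 2]
print(count_misses(seq, 3, "fifo"), count_misses(seq, 3, "lru"))  # 7 6
```

On this sequence LRU wins because the repeated requests to line 1 keep it
cached, while FIFO evicts it despite its heavy reuse.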
Automatic paging algorithms could lead to performance \textit{worse} than
what the algorithm designer intends (who may know exactly which cache lines
are best to keep in cache).
\begin{theorem}
For any algorithm \(A\) that incurs \(T\) cache-line misses on a cache of
size \(M\) when using the cache optimally, running \(A\) with FIFO/LRU on a
cache of size \(2M\) incurs at most \(2T\) misses.
\end{theorem}
Alternatively: automatic paging on a cache of size \(M\) is ``as good as''
optimal paging on a cache of size \(\frac{M}{2}\), up to a factor of \(2\).
This is an example of \textit{online algorithms}:
\begin{itemize}[-]
\item
FIFO/LRU have to make decisions locally, without knowledge of future
requests.
\item
We compare them to algorithms that make decisions knowing the future.
\end{itemize}
Measure of performance: the \textit{competitive ratio}
\[
\frac{\text{cost of the online algorithm}}
     {\text{cost of the future-knowing algorithm}}.
\]
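The theorem can be sanity-checked numerically. A sketch (helper names made
up), using Belady's farthest-in-future eviction rule as the future-knowing
optimum, which is the known optimal offline paging policy:

```python
# LRU on a cache of size 2M vs. the future-knowing optimum (Belady: evict
# the cached line whose next use is farthest in the future) on size M.

def lru_misses(requests, M):
    cache, misses = [], 0
    for r in requests:
        if r in cache:
            cache.remove(r)             # refresh recency
        else:
            misses += 1
            if len(cache) == M:
                cache.pop(0)            # front = least recently used
        cache.append(r)
    return misses

def opt_misses(requests, M):
    cache, misses = set(), 0
    for i, r in enumerate(requests):
        if r in cache:
            continue
        misses += 1
        if len(cache) == M:
            def next_use(line):         # index of line's next request
                for j in range(i + 1, len(requests)):
                    if requests[j] == line:
                        return j
                return float("inf")     # never used again
            cache.remove(max(cache, key=next_use))
        cache.add(r)
    return misses

seq = [1, 2, 3, 4, 1, 2, 3, 4]          # cyclic scan: worst case for LRU
M = 2
print(opt_misses(seq, M), lru_misses(seq, M), lru_misses(seq, 2 * M))  # 6 8 4
```

On a cyclic scan over \(M+2\) pages, LRU with cache \(M\) misses on every
request, yet with cache \(2M\) it incurs 4 misses, well within twice the
future-knowing optimum's 6.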
\section{Parallel Algorithms}
Solve a problem of size \(n\) using a number of computing units. The goal is
a faster (wall-clock) runtime than with a single CPU\@. Any analysis of such
algorithms inherently relies on the underlying computational model (as any
analysis would), i.e.\ cores on a laptop vs.\ cores on a supercomputer vs.\
cores on a GPU, etc. In this class we will mainly go over three such models.
\subsection{First model: Circuits}
\begin{minipage}{0.65\textwidth}
Computes a function \(f\) using gates.

Q1: What is the set of basis gates?
\begin{itemize}[-]
\item \{ OR, AND, NOT \}, which is enough to compute any \(f\)
\item \{ NAND \}, which alone is also enough to compute any \(f\)
\end{itemize}
\end{minipage}
\begin{minipage}{0.35\textwidth}
\begin{tikzpicture}[x = 1cm, y = 1cm, text height = 1.5ex, text depth = 0.25ex]
\tikzstyle{twoOr} = [draw, or gate US, logic gate inputs=nn]
\tikzstyle{threeAnd} = [draw, and gate US, logic gate inputs = nnn, anchor = input 2]
\node (x1) at (0, 2) {\(x_1\)};
\node (x2) at (0, 1) {\(x_2\)};
\node (x3) at (0, 0) {\(x_3\)};
\node[twoOr] at ($(x1) + (1.5, -0.5)$) (Or1) {};
\node[threeAnd] at ($(x2) + (1.2, -0.5)$) (And) {};
\node[twoOr] at ($(And) + (1.5, 0.5)$) (Or2) {};
\draw (x1) -- ++(0.5, 0) |- (Or1.input 1);
\draw (x2) -- ++(0.5, 0) |- (Or1.input 2);
\draw (Or1.output) -- ++(0.2, 0) |- ++(0, -0.5) -- ++(-1.2, 0) |- (And.input 1);
\draw (x2) -- ++(0.5, 0) |- (And.input 2);
\draw (x3) -- ++(0.5, 0) |- (And.input 3);
\draw (And.output) -- ++(0.2, 0) |- (Or2.input 2);
\draw (x1) -- ++(2.25, 0) |- (Or2.input 1);
\draw (Or2.output) -- ++(2.5, 0) node[above left] {\(f(x_1, x_2, x_3)\)};
\end{tikzpicture}
\end{minipage}
\vspace*{0.8em}
\noindent Performance measures of circuits:
\begin{itemize}[-]
\item Number of gates \(\approx\) total time (in standard non-parallel model)
\item Depth of circuit: Largest computational path from input (\(x\)) to
output \(\approx\) \underline{parallel time} (wall-clock time)
\end{itemize}
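The two measures can be computed mechanically on the example circuit drawn
above; a small sketch (representation and names are made up) that stores the
circuit as a DAG of gates:

```python
# The two circuit measures on the example drawn above:
# number of gates ~ sequential time, depth ~ parallel (wall-clock) time.
# A gate is (op, list of input wires); x1..x3 are named input wires.

CIRCUIT = {
    "or1": ("OR",  ["x1", "x2"]),
    "and": ("AND", ["or1", "x2", "x3"]),
    "f":   ("OR",  ["x1", "and"]),
}

def evaluate(circuit, inputs, wire):
    if wire in inputs:                  # an input wire
        return inputs[wire]
    op, args = circuit[wire]
    vals = [evaluate(circuit, inputs, a) for a in args]
    return {"AND": all, "OR": any}[op](vals)

def depth(circuit, wire):
    if wire not in circuit:             # input wires have depth 0
        return 0
    _, args = circuit[wire]
    return 1 + max(depth(circuit, a) for a in args)

print(len(CIRCUIT), depth(CIRCUIT, "f"))   # 3 gates, depth 3
x = {"x1": False, "x2": True, "x3": True}
print(evaluate(CIRCUIT, x, "f"))           # True
```

The longest input-to-output path runs through both the OR and the AND gate,
so even with three gates total, only the depth (3 here) bounds the parallel
time.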
\noindent Goal:
Reduce the parallel time (depth) to ideally be much less than the time in the
standard model, while keeping the number of gates bounded (no blow-up in
size).
\begin{theorem}
For any function \(f\colon \cbr{0,1}^n \to \cbr{0,1}\), there exists
a circuit computing \(f\) with depth \(4\) and at most \(2^{n+1}\) gates.
\end{theorem}
\noindent Proof:
For each input \(y\) with \(f(y) = 1\), build an AND gate over the inputs
\(x_i\) (negated wherever \(y_i = 0\)) that outputs \(1\) iff \(x = y\), and
feed all of these AND gates into a single OR gate. This uses at most \(2^n\)
AND gates, \(n\) NOT gates, and one OR gate, i.e.\ at most \(2^{n+1}\) gates
in total, with depth \(4\).
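A toy check of this construction's semantics (illustrative only; XOR on 3
bits stands in for an arbitrary \(f\)): one AND "row" per accepted input
\(y\), OR-ed together.

```python
# Depth-4 DNF construction: inputs -> NOT layer -> one AND gate per
# accepted input y -> final OR. Here f is XOR on 3 bits.
from itertools import product

def f(x):                          # any Boolean function works here
    return x[0] ^ x[1] ^ x[2]

accepted = [y for y in product([0, 1], repeat=3) if f(y)]

def dnf_eval(x):
    # The AND row for y fires iff x == y (NOTs handle positions y_i = 0).
    return any(all(xi == yi for xi, yi in zip(x, y)) for y in accepted)

# The DNF circuit agrees with f on every input:
assert all(dnf_eval(x) == bool(f(x)) for x in product([0, 1], repeat=3))
print(len(accepted), "AND gates")  # 4 AND gates, within the 2^n budget
```

For 3-bit XOR exactly half of the \(2^3\) inputs are accepted, so the
circuit uses 4 AND rows; in the worst case all \(2^n\) rows are needed,
matching the theorem's gate budget.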
There are two versions of circuit models, one allowing unbounded fan-in
(i.e.\ multi-input gates are allowed) and the other with bounded fan-in.
Unbounded fan-in circuits may be simulated in the bounded fan-in model by
splitting each multi-input AND/OR gate into a balanced tree of 2-input
AND/OR gates, at the cost of extra depth.
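The cost of that simulation is logarithmic in the fan-in; a quick sketch
(the helper name is made up) of the depth of a balanced tree of 2-input
gates replacing one \(n\)-input gate:

```python
# Simulating one unbounded fan-in AND with 2-input gates: a balanced
# binary tree uses n - 1 gates and ceil(log2 n) depth.
import math

def and_tree_depth(n):
    """Depth of a balanced tree of 2-input ANDs over n wires."""
    depth = 0
    while n > 1:
        n = (n + 1) // 2      # each level halves the number of live wires
        depth += 1
    return depth

for n in [2, 3, 8, 1000]:
    assert and_tree_depth(n) == math.ceil(math.log2(n))
print(and_tree_depth(1000))   # 10
```

So a depth-\(d\) unbounded fan-in circuit whose gates have fan-in at most
\(n\) becomes a bounded fan-in circuit of depth \(O(d \log n)\), which is why
the AC and NC classes below differ by log factors in depth.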
\subsection{Complexity Classes}
\begin{itemize}[-]
\item
AC: Problems solvable with circuits (unbounded fan-in) with
\(\del{\log N}^{O(1)}\) depth and \(N^{O(1)}\) number of gates.
\item
NC: As above, but bounded fan-in
\item
\(\mathrm{AC}^k\): As AC, but depth is \(O\del{\del{\log N}^k}\)
\item
\(\mathrm{NC}^k\): As above, but with bounded fan-in
\end{itemize}
Examples:
\begin{enumerate}
\item AND of \(n\) bits: \(f(x_1, \dots, x_n) = \operatorname{AND}(x_1, \dots, x_n)\)
\begin{itemize}[-]
\item
AC: Depth of 2
\item
NC: Depth of \(O(\log n)\).
Proof: a bounded fan-in AND gate is a binary operation, so combine the
\(n\) bits with a balanced binary tree of 2-input AND gates, which has
depth \(\lceil \log_2 n \rceil\).
\end{itemize}
\item XOR on \(n\) bits
\begin{itemize}[-]
\item
AC: Best possible depth is \(O\del{\frac{\log N}{\log\log N}}\) with
\(\operatorname{poly}(N)\) gates (if the basis is \{AND, OR, NOT\})
\item
NC: \(O(\log N)\)
\end{itemize}
\end{enumerate}
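The NC bound for XOR uses the same balanced-tree idea as for AND; a sketch
(helper name made up) that both computes the parity and reports the tree's
depth:

```python
# NC bound for XOR: combine bits with 2-input XOR gates in a balanced
# tree, giving depth ceil(log2 n) with n - 1 gates.
from functools import reduce
import operator

def xor_tree(bits):
    """Return (value, depth) of a balanced tree of 2-input XOR gates."""
    level, depth = list(bits), 0
    while len(level) > 1:
        # Pair up adjacent wires; a leftover odd wire passes through.
        nxt = [level[i] ^ level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level, depth = nxt, depth + 1
    return level[0], depth

bits = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
value, depth = xor_tree(bits)
assert value == reduce(operator.xor, bits)
print(depth)  # 4, i.e. ceil(log2 10)
```

Note the contrast with AC: with unbounded fan-in and polynomially many
gates, XOR still needs depth \(\Omega\del{\frac{\log N}{\log\log N}}\),
whereas this bounded fan-in tree achieves \(O(\log N)\).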
\section{Next Class}
Go over the remaining two parallel models:
\begin{enumerate}
\item[(2)] PRAM: \(p\) processors, shared memory
\item[(3)] MPC (Map Reduce): \(p\) machines, local memory
\end{enumerate}
\end{document}