\documentclass[11pt]{article}
\usepackage{fullpage}
\usepackage{amsmath,amsthm,amsfonts,amssymb,mathtools}
\usepackage{hyperref}
\usepackage{cleveref}
% Allow use of "blah" instead of ``blah''
\usepackage{csquotes}
\MakeOuterQuote{"}
\newcommand\E{\mathbb{E}}
\newcommand\R{\mathbb{R}}
\newcommand\D{{\mathcal{D}}}
\newcommand\T{{\scriptscriptstyle{\mathsf{T}}}}
\newcommand\U{{\mathcal{U}}}
\newcommand\X{\mathcal{X}}
\newcommand\Y{\mathcal{Y}}
\newcommand\bW{{\mathbf{W}}}
\newcommand\bd{{\mathbf{d}}}
\newcommand\bg{{\mathbf{g}}}
\newcommand\bu{{\mathbf{u}}}
\newcommand\bx{{\mathbf{x}}}
\newcommand\by{{\mathbf{y}}}
\DeclareMathOperator{\poly}{poly}
\mathtoolsset{centercolon}
\DeclarePairedDelimiter\ip{\langle}{\rangle}
\DeclarePairedDelimiter\norm{\|}{\|}
\DeclarePairedDelimiter\del{\lparen}{\rparen}
\DeclarePairedDelimiter\sbr{\lbrack}{\rbrack}
\DeclarePairedDelimiter\cbr{\{}{\}}
\usepackage[affil-sl]{authblk}
\title{COMS 4774 Spring 2021 Project Report}
\author[1]{Daniel Hsu (djh2164)}
\affil[1]{Department of Computer Science, Columbia University}
\date{}
\newtheorem{theorem}{Theorem}
\begin{document}
\maketitle
\begin{abstract}
This is our project report for COMS 4774 Spring 2021.
\begin{enumerate}
\item \textbf{Title:} Learning Parities with Neural Networks
\item \textbf{Authors:} Amit Daniely, Eran Malach
\item \textbf{Bibliographic information:} NeurIPS 2020
\item \textbf{Link:} \url{https://arxiv.org/abs/2002.07400}
\end{enumerate}
\end{abstract}
\section{Paper review}
\subsection{Brief summary}
The goal of this paper is to establish a separation between what can be learned with linear models and what can be learned with depth-two neural networks.
By linear models, the authors mean linear classifiers on top of fixed feature representations of the inputs that do not depend on the data.
While depth-two neural networks can also be viewed as linear models---viewing the bottom layer of the network as defining a feature map---the feature representation is usually not independent of the training data.
The authors prove that for any linear model, there is a learning problem that cannot be efficiently solved by any classifier in the model, but that can be efficiently solved by training a depth-two neural network.
The authors also provide some empirical results that illustrate a similar separation between linear models and neural networks.
\subsection{Problem statement}
The authors prove two main results to establish this separation, both of which involve the problem of learning sparse parity functions over the $n$-dimensional discrete cube $\cbr{\pm1}^n$, i.e., functions of the form $f_A(x) := \prod_{i \in A} x_i$ for some $A \subseteq [n]$ with $|A|=k$.
Here, $k \in [n]$ is the "hardness" parameter used to establish a separation: linear models will require resources exponential in $k$, while depth-two neural networks will only require resources polynomial in $k$.
The objective of learning is to find a predictor $h \colon \cbr{\pm1}^n \to \R$ that achieves small expected hinge loss with respect to a data distribution $\D$ on $\cbr{\pm1}^n$ where labels are realized by a sparse parity function:
\begin{equation*}
L_A(h) := \E_{\bx \sim \D}[ \max\cbr{0, 1 - f_A(\bx) h(\bx)} ] .
\end{equation*}
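As a concrete sanity check, the target $f_A$ and the empirical analogue of $L_A$ are easy to write down (a minimal sketch in numpy; the support $A$ below is an arbitrary illustrative choice):

```python
import numpy as np

def parity(X, A):
    # f_A(x) = prod_{i in A} x_i, applied row-wise to X in {-1,+1}^{m x n}
    return np.prod(X[:, A], axis=1)

def hinge_loss(h_vals, y):
    # empirical analogue of L_A(h) = E[max(0, 1 - f_A(x) h(x))]
    return np.mean(np.maximum(0.0, 1.0 - y * h_vals))

rng = np.random.default_rng(0)
X = rng.choice([-1, 1], size=(1000, 10))   # uniform samples from {-1,+1}^10
A = [1, 3, 4]                              # illustrative support, k = 3
y = parity(X, A)
```

The exact parity itself achieves hinge loss $0$, while the constant-zero predictor has hinge loss $1$.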
\subsection{Main results}
The first result, a lower bound for linear models, is as follows.
Consider any feature map $\Psi \colon \cbr{\pm1}^n \to [-1,1]^N$, and bounded norm linear classifiers over this feature map: $H_\Psi^B := \cbr{ x \mapsto \ip{\Psi(x),w} : w \in \R^N, \norm{w}_2 \leq B}$.
Also consider any $k \leq n/16$.
Then there exist some $A \subseteq [n]$ with $|A|=k$ and a data distribution over $\cbr{\pm1}^n$ such that every $h \in H_\Psi^B$ has expected hinge loss
\begin{equation*}
L_A(h) \geq \frac12 - \frac{\sqrt{N}B}{2^k\sqrt{2}} .
\end{equation*}
In other words, unless the product $\sqrt{N}B$ exceeds a quantity exponential in $k$, the expected hinge loss is bounded below by a positive absolute constant.
This means that either the dimension $N$ of the feature map must be exponentially large in $k$, or the weight vector must have exponentially large norm.
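For instance, if both the feature dimension and the norm bound are at most polynomial in $n$, say $\sqrt{N}B \leq n^c$ for a constant $c$, then the bound reads
\begin{equation*}
L_A(h) \geq \frac12 - \frac{n^c}{2^k\sqrt{2}} \geq \frac14 \qquad \text{whenever } k \geq c\log_2 n + 2 ,
\end{equation*}
so any sparsity $k = \omega(\log n)$ defeats all such polynomially-bounded linear models.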
The distribution in the lower bound is always taken to be a certain uniform mixture of the uniform distribution on $\cbr{\pm1}^n$ and another distribution that is specific to $A$.
The mixture is chosen primarily to enable the upper bound, discussed next.
The second result, an upper bound for depth-two neural networks, shows that a certain way of training depth-two neural networks with gradient descent will yield a predictor $g \colon \cbr{\pm1}^n \to \R$ with small expected hinge loss $L_A(g)$, as long as (i) the width $q$ is larger than a certain polynomial in $k$, (ii) the dimension $n$ is larger than a certain polynomial in $q$ and $k$, and (iii) the number of iterations $T$ of gradient descent is larger than a certain polynomial in $k$ and $q$.
This is a somewhat intricate set of conditions on $n$, $k$, $q$, and $T$, but it seems there is an asymptotic regime for these parameters where the resulting hinge loss is indeed always smaller than the positive absolute constant from the above lower bound.
The upper bound is established for a data distribution that is specific to the target sparse parity function, a uniform mixture of the uniform distribution on $\cbr{\pm1}^n$ and another distribution that "reveals" the target sparse parity.
The paper also conducts an empirical comparison of linear models and depth-two neural networks for a synthetically-constructed problem that resembles learning sparse parity functions.
The results of the empirical comparison confirm the high-level message of the paper, that linear models are "weaker" than depth-two neural networks for learning these parity-type functions.
\subsection{Key ideas and insights}
The proof of the lower bound combines basic convex analysis with a standard averaging argument.
They crucially use the fact that parity functions form an orthonormal family in $L^2(U)$, where $U$ is the uniform distribution on $\cbr{\pm1}^n$.
Once they establish hardness of learning some $k$-sparse parity function with respect to $U$, they argue that learning with respect to any uniform mixture of $U$ and another distribution must also be at least somewhat hard.
(This is where they lose a factor of $1/2$ in the lower bound on the expected hinge loss.)
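Concretely, writing $L_A^{\D}$ for the expected hinge loss under a distribution $\D$, if $\D = \frac12 U + \frac12 \D'$ for any distribution $\D'$, then nonnegativity of the hinge loss gives
\begin{equation*}
L_A^{\D}(h) = \frac12 L_A^{U}(h) + \frac12 L_A^{\D'}(h) \geq \frac12 L_A^{U}(h) ,
\end{equation*}
so a lower bound against the uniform distribution transfers to the mixture at the cost of this factor of $\frac12$.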
The key idea behind the upper bound is that the first step of gradient descent is able to pick up on the signal hidden inside the data distribution that reveals which bits are in the parity function.
By carefully choosing the step sizes (and weight decay parameters), they are able to show that the signal is not lost in subsequent steps.
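This mechanism can be seen in a toy calculation: at the zero predictor every hinge margin is violated, so the gradient of the hinge loss with respect to a linear weight vector is $-\E[f_A(\bx)\,\bx]$, whose coordinates in $A$ stand out under a revealing distribution. The sketch below uses a simplified revealing component (all bits in $A$ tied to a common random sign, with $k$ odd), which is only an illustration and not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 20, 20000
A = [2, 5, 11]                        # hypothetical support, k = 3 (odd)

# Mixture: half uniform, half with all bits in A tied to a common sign.
X = rng.choice([-1.0, 1.0], size=(m, n))
s = rng.choice([-1.0, 1.0], size=m // 2)
X[: m // 2, A] = s[:, None]
y = np.prod(X[:, A], axis=1)

# Gradient of the empirical hinge loss at w = 0 (every margin is violated,
# so the subgradient of max(0, 1 - y <w, x>) is simply -y x):
g = -(X.T @ y) / m

# Coordinates in A carry signal (|g_i| near 1/2); the rest are near 0.
```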
\subsection{Strengths/weaknesses}
The notion of separation established in this paper is quite strong: exponential versus polynomial.
Moreover, the upper bound is established using training algorithms and neural network architectures that resemble those used in practice.
However, there is a disconnect between the empirical results and the type of distributions considered in the upper bound.
In the upper bound, the data distribution is one that very strongly reveals the identity of the correct parity function.
Indeed, just by checking second-order statistics of the unlabeled data distribution, one can easily discover which bits are involved in the target parity.
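To make this concrete, suppose (as a simplified illustration, not the paper's exact construction) that half of the samples are uniform and half set every bit in $A$ to a common random sign. Then the empirical second moments $\widehat{\E}[\bx_i \bx_j]$ are near $\frac12$ exactly for the pairs $i, j \in A$ and near $0$ otherwise:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 20000
A = [2, 5, 11]                        # hypothetical target support

# Mixture: half uniform, half with all bits in A equal to a common sign.
X = rng.choice([-1.0, 1.0], size=(m, n))
s = rng.choice([-1.0, 1.0], size=m // 2)
X[: m // 2, A] = s[:, None]

C = (X.T @ X) / m                     # empirical second moments E[x_i x_j]
np.fill_diagonal(C, 0.0)

# Large off-diagonal entries of C single out exactly the pairs in A.
```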
It seems rather unlikely that real data would have such strong "hints" encoded in the unlabeled data distribution.
Moreover, the upper bound is established by a peculiar way of executing gradient descent, where the first step has a very large step size, and the remaining steps have relatively small step sizes.
This seems rather unnatural and highly tailored to the specific types of data distributions considered in the paper.
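For reference, a schematic version of this two-phase procedure (one large step on the hidden layer, then many small steps on the output layer) is sketched below. All hyperparameters are illustrative guesses, weight decay is omitted for brevity, and the revealing component is a simplified common-sign construction rather than the paper's exact distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, m = 20, 32, 2000               # input dim, width, sample size (illustrative)
A = [1, 4, 7]                        # hypothetical target parity support

# Mixture data: half uniform, half with the bits in A set to a common sign.
X = rng.choice([-1.0, 1.0], size=(m, n))
s = rng.choice([-1.0, 1.0], size=m // 2)
X[: m // 2, A] = s[:, None]
y = np.prod(X[:, A], axis=1)

# Depth-two ReLU network: g(x) = sum_j v_j * relu(<w_j, x> + b_j).
W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(q, n))
b = rng.normal(0.0, 1.0 / np.sqrt(n), size=q)
v = rng.choice([-1.0, 1.0], size=q) / q

def forward(X):
    H = np.maximum(0.0, X @ W.T + b)          # hidden activations, shape (m, q)
    return H, H @ v

# Two-phase schedule: one large step on the hidden layer, then many small
# steps on the output layer only (the step sizes are assumptions).
schedule = [(1.0, 0.0)] + [(0.0, 0.05)] * 100
for eta_W, eta_v in schedule:
    H, out = forward(X)
    active = (y * out < 1.0).astype(float)     # examples where hinge is active
    g_out = -y * active                        # d(loss)/d(out), per example
    grad_v = H.T @ g_out / m
    grad_W = ((g_out[:, None] * (H > 0.0) * v).T @ X) / m
    W -= eta_W * grad_W
    v -= eta_v * grad_v

_, out = forward(X)
final_loss = np.mean(np.maximum(0.0, 1.0 - y * out))
```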
\section{New contribution}
\ldots
\end{document}