## 7 Papers Accepted to COLT 2021

Research from the department was accepted to the 34th Annual Conference on Learning Theory (COLT 2021). The conference highlights research on the theoretical aspects of machine learning.

Below are the abstracts and links to the accepted papers.

Size and Depth Separation in Approximating Natural Functions with Neural Networks

Gal Vardi *Weizmann Institute of Science*, Daniel Reichman *Worcester Polytechnic Institute*, Toniann Pitassi *Columbia University*, Ohad Shamir *Weizmann Institute of Science*

When studying the expressive power of neural networks, a main challenge is to understand how the size and depth of the network affect its ability to approximate real functions. However, not all functions are interesting from a practical viewpoint: functions of interest usually have a polynomially bounded Lipschitz constant, and can be computed efficiently. We call functions that satisfy these conditions “benign” and explore the benefits of size and depth for approximation of benign functions with ReLU networks. As we show, this problem is more challenging than the corresponding problem for non-benign functions. We give complexity-theoretic barriers to showing depth lower bounds: proving the existence of a benign function that cannot be approximated by polynomial-sized networks of depth 4 would settle longstanding open problems in computational complexity. It implies that beyond depth 4 there is a barrier to showing depth-separation for benign functions, even between networks of constant depth and networks of nonconstant depth. We also study size separation, namely, whether there are benign functions that can be approximated with networks of size O(s(d)), but not with networks of size O(s′(d)). We show a complexity-theoretic barrier to proving such results beyond size O(d log²(d)), but also show an explicit benign function that can be approximated with networks of size O(d) and not with networks of size o(d/log d). For approximation in the L∞ sense we achieve such separation already between size O(d) and size o(d). Moreover, we show superpolynomial size lower bounds and barriers to such lower bounds, depending on the assumptions on the function. Our size-separation results rely on an analysis of size lower bounds for Boolean functions, which is of independent interest: We show linear size lower bounds for computing explicit Boolean functions (such as set disjointness) with neural networks and threshold circuits.

Learning sparse mixtures of permutations from noisy information

Rocco Servedio *Columbia University*, Anindya De *University of Pennsylvania*, Ryan O’Donnell *Carnegie Mellon University*

We study the problem of learning an unknown mixture of k permutations over n elements, given access to noisy samples drawn from the unknown mixture. We consider a range of different noise models, including natural variants of the “heat kernel” noise framework and the Mallows model. We give an algorithm which, for each of these noise models, learns the unknown mixture to high accuracy under mild assumptions and runs in n^O(log k) time. Our approach is based on a new procedure that recovers an unknown mixture of permutations from noisy higher-order marginals.

Learning and testing junta distributions with subcube conditioning

Xi Chen *Columbia University*, Rajesh Jayaram *Carnegie Mellon University*, Amit Levi *University of Waterloo*, Erik Waingarten *Stanford University*

We study the problems of learning and testing junta distributions on {−1, 1}ⁿ with respect to the uniform distribution, where a distribution p is a k-junta if its probability mass function p(x) depends on a subset of at most k variables. The main contribution is an algorithm for finding relevant coordinates in a k-junta distribution with subcube conditioning (Bhattacharyya and Chakraborty, 2018; Canonne et al., 2019). We give two applications:

- An algorithm for learning k-junta distributions with Õ(k/ε²) log n + O(2ᵏ/ε²) subcube conditioning queries, and
- An algorithm for testing k-junta distributions with Õ((k + √n)/ε²) subcube conditioning queries.

All our algorithms are optimal up to poly-logarithmic factors. Our results show that subcube conditioning, as a natural model for accessing high-dimensional distributions, enables significant savings in learning and testing junta distributions compared to the standard sampling model. This addresses an open question posed by Aliakbarpour et al. (2016).
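To make the k-junta definition concrete, here is a minimal Python sketch that finds the relevant coordinates of a small, fully known probability mass function by brute force. It is only an illustration of the definition and not the paper's algorithm, which works through subcube conditioning queries rather than direct access to p(x). The names `relevant_coordinates` and `example_pmf` are hypothetical and introduced here for illustration.

```python
from itertools import product


def relevant_coordinates(p, n, tol=1e-12):
    """Return the coordinates i such that flipping x[i] can change p(x).

    p is a function on tuples in {-1, 1}^n. A distribution is a k-junta
    exactly when this list has at most k entries. Brute force: 2^n * n work.
    """
    relevant = []
    for i in range(n):
        for x in product((-1, 1), repeat=n):
            flipped = x[:i] + (-x[i],) + x[i + 1:]
            if abs(p(x) - p(flipped)) > tol:
                relevant.append(i)
                break
    return relevant


def example_pmf(x):
    """Hypothetical 2-junta on {-1, 1}^4: the mass depends only on x[0] and x[2]."""
    return (1.5 if x[0] == x[2] else 0.5) / 16


print(relevant_coordinates(example_pmf, 4))
```

On this toy distribution the relevant coordinates are 0 and 2, so it is a 2-junta over 4 variables.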

Survival of the strictest: Stable and unstable equilibria under regularized learning with partial information

Emmanouil Vasileios Vlatakis-Gkaragkounis *Columbia University*, Angeliki Giannou *National Technical University of Athens*, Panayotis Mertikopoulos *Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, 38000 Grenoble, France & Criteo AI Lab*

In this paper, we examine the Nash equilibrium convergence properties of no-regret learning in general N-player games. For concreteness, we focus on the archetypal “follow the regularized leader” (FTRL) family of algorithms, and we consider the full spectrum of uncertainty that the players may encounter – from noisy, oracle-based feedback, to bandit, payoff-based information. In this general context, we establish a comprehensive equivalence between the stability of a Nash equilibrium and its support: a Nash equilibrium is stable and attracting with arbitrarily high probability if and only if it is strict (i.e., each equilibrium strategy has a unique best response). This equivalence extends existing continuous-time versions of the “folk theorem” of evolutionary game theory to a bona fide algorithmic learning setting, and it provides a clear refinement criterion for the prediction of the day-to-day behavior of no-regret learning in games.
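As a rough illustration of the FTRL family, the sketch below runs FTRL with an entropic regularizer (which reduces to exponential weights) in a small coordination game. It makes simplifying assumptions that the paper does not: feedback is exact and noise-free rather than noisy or bandit, and the game, the step size `eta`, and the names `logit_choice`, `PAYOFF`, and `run_ftrl` are hypothetical choices made for this example.

```python
import math


def logit_choice(scores, eta):
    """Entropic FTRL: play each action with probability proportional to exp(eta * score)."""
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(eta * (s - top)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical symmetric 2x2 coordination game: PAYOFF[a][b] is a player's payoff
# for playing a against b. Both (0, 0) and (1, 1) are strict Nash equilibria.
PAYOFF = [[2.0, 0.0], [0.0, 1.0]]


def run_ftrl(steps=200, eta=0.5):
    """Both players run entropic FTRL on exact expected payoffs; return final strategies."""
    scores = [[0.0, 0.0], [0.0, 0.0]]  # cumulative payoff vector for each player
    for _ in range(steps):
        strategies = [logit_choice(s, eta) for s in scores]
        for player in (0, 1):
            opponent = strategies[1 - player]
            for a in (0, 1):
                scores[player][a] += sum(PAYOFF[a][b] * opponent[b] for b in (0, 1))
    return [logit_choice(s, eta) for s in scores]


print(run_ftrl())
```

Starting from uniform play, both players concentrate on action 0, so play settles at the strict equilibrium (0, 0) in this toy run.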

Reconstructing weighted voting schemes from partial information about their power indices

Emmanouil Vasileios Vlatakis-Gkaragkounis *Columbia University*, Huck Bennett *Columbia University*, Anindya De *University of Pennsylvania*, Rocco Servedio *Columbia University*

A number of recent works (Goldberg, 2006; O’Donnell and Servedio, 2011; De et al., 2017, 2014) have considered the problem of approximately reconstructing an unknown weighted voting scheme given information about various sorts of “power indices” that characterize the level of control that individual voters have over the final outcome. In the language of theoretical computer science, this is the problem of approximating an unknown linear threshold function (LTF) over {−1, 1}ⁿ given some numerical measure (such as the function’s n “Chow parameters,” a.k.a. its degree-1 Fourier coefficients, or the vector of its n Shapley indices) of how much each of the n individual input variables affects the outcome of the function. In this paper we consider the problem of reconstructing an LTF given only partial information about its Chow parameters or Shapley indices; i.e., we are given only the Chow parameters or the Shapley indices corresponding to a subset S ⊆ [n] of the n input variables. A natural goal in this partial information setting is to find an LTF whose Chow parameters or Shapley indices corresponding to indices in S accurately match the given Chow parameters or Shapley indices of the unknown LTF. We refer to this as the Partial Inverse Power Index Problem. Our main results are a polynomial time algorithm for the (ε-approximate) Chow Parameters Partial Inverse Power Index Problem and a quasi-polynomial time algorithm for the (ε-approximate) Shapley Indices Partial Inverse Power Index Problem.
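For readers unfamiliar with Chow parameters, the following Python sketch computes the n degree-1 Fourier coefficients E[f(x)·xᵢ] of a small LTF by enumerating all of {−1, 1}ⁿ. This is the forward direction only (from a known LTF to its power indices), not the inverse problem the paper solves. The helper names `ltf` and `chow_parameters` and the example weights are assumptions made for illustration.

```python
from itertools import product


def ltf(weights, threshold):
    """Return f(x) = sign(w . x - threshold) with values in {-1, +1}; sign(0) is taken as +1."""
    def f(x):
        return 1 if sum(w * xi for w, xi in zip(weights, x)) - threshold >= 0 else -1
    return f


def chow_parameters(f, n):
    """Degree-1 Fourier coefficients E[f(x) * x_i] under the uniform distribution on {-1, 1}^n."""
    points = list(product((-1, 1), repeat=n))
    return [sum(f(x) * x[i] for x in points) / len(points) for i in range(n)]


# Three equal voters under majority rule: every voter has the same influence.
print(chow_parameters(ltf([1, 1, 1], 0), 3))

# One voter with weight 3 outvotes the other two combined, so only that voter matters.
print(chow_parameters(ltf([3, 1, 1], 0), 3))
```

In the majority example all three parameters equal 0.5, while in the second example the heavy voter's parameter is 1 and the others are 0, matching the reading of these numbers as a measure of each voter's control over the outcome.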

On the Approximation Power of Two-Layer Networks of Random ReLUs

Daniel Hsu *Columbia University*, Clayton H Sanford *Columbia University*, Rocco Servedio *Columbia University*, Emmanouil Vasileios Vlatakis-Gkaragkounis *Columbia University*

This paper considers the following question: how well can depth-two ReLU networks with randomly initialized bottom-level weights represent smooth functions? We give near-matching upper and lower bounds for L₂-approximation in terms of the Lipschitz constant, the desired accuracy, and the dimension of the problem, as well as similar results in terms of Sobolev norms. Our positive results employ tools from harmonic analysis and ridgelet representation theory, while our lower bounds are based on (robust versions of) dimensionality arguments.

Weak learning convex sets under normal distributions

Anindya De *University of Pennsylvania*, Rocco Servedio *Columbia University*

This paper addresses the following natural question: can efficient algorithms weakly learn convex sets under normal distributions? Strong learnability of convex sets under normal distributions is well understood, with near-matching upper and lower bounds given in Klivans et al. (2008), but prior to the current work nothing seems to have been known about weak learning. We essentially answer this question, giving near-matching algorithms and lower bounds. For our positive result, we give a poly(n)-time algorithm that can weakly learn the class of convex sets to advantage Ω(1/√n) using only random examples drawn from the background Gaussian distribution. Our algorithm and analysis are based on a new “density increment” result for convex sets, which we prove using tools from isoperimetry. We also give an information-theoretic lower bound showing that O(log(n)/√n) advantage is best possible even for algorithms that are allowed to make poly(n) many membership queries.