# Frontiers in Distribution Testing

## FOCS 2017 Workshop: Saturday, October 14 (Berkeley)

The workshop will include an Open Problems session *(see the schedule below)*, in which everyone is encouraged to participate. Please send your open questions or favorite problems in distribution testing or surrounding areas to either organizer in advance, with *"FOCS distribution testing workshop: open question"* as the subject line.

| Time | Speaker |
| --- | --- |
| 9:00-9:35 | Clément Canonne |
| 9:40-10:15 | Ilias Diakonikolas |
| 10:20-10:55 | Jiantao Jiao |
| | *Coffee break* |
| 11:25-12:00 | Alon Orlitsky |
| 12:00-12:25 | Gautam Kamath |
| | *Lunch break* |
| 14:55-15:30 | Costis Daskalakis |
| 15:35-16:10 | Ryan O'Donnell |
| 16:15-16:50 | Ronitt Rubinfeld |
| | *Coffee break* |
| 17:20-17:55 | Tom Gur |
| 18:00-18:20 | Open Problems |

Given data from an experiment, study or population, inferring *information* from the underlying probability
distribution it defines is a fundamental problem in Statistics and data analysis, and has applications and ramifications in countless other fields — and in Theory as well. Tackling this problem from a computational viewpoint is the objective of *distribution testing*: to start off the day, we survey recent developments in this area, (a subset of) the new directions taken and connections made, and some of the exciting “applications and ramifications” these spawned.

The prototypical question in distribution property testing is the following: Given sample access to one or more discrete distributions, determine whether they have some global property or are far from having the property in \ell_1 distance. We will describe a simple unified framework to obtain sample-efficient testers in this setting, by reducing \ell_1-testing to \ell_2-testing. Using our framework, we obtain optimal testers for a wide variety of \ell_1 distribution testing problems, including the following: identity testing to a fixed distribution, closeness testing between two unknown distributions (with equal/unequal sample sizes), independence testing (in any number of dimensions), closeness testing for collections of distributions, and testing k-flatness. For most of these problems, our approach gives the first optimal tester in the literature. Moreover, our testers are significantly simpler to analyze compared to previous approaches. As an important application of our reduction-based technique, we obtain the first adaptive algorithm for testing equivalence between two unknown distributions. The sample complexity of our algorithm depends on the structure of the unknown distributions — as opposed to merely their domain size — and is significantly better compared to the worst-case optimal tester in many natural instances. Moreover, our technique naturally generalizes to other metrics beyond the \ell_1-distance. As an illustration of its flexibility, we use it to obtain the first near-optimal equivalence tester under the Hellinger distance.

Joint work with Daniel Kane.
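The \ell_2 statistic that such reductions bottom out in can be sketched in a few lines. Below is an illustrative Python sketch (the function name is ours) of the standard unbiased Poissonized \ell_2 closeness statistic: given bucket counts drawn as X_i ~ Poisson(m p_i) and Y_i ~ Poisson(m q_i), the statistic has expectation m^2 \|p - q\|_2^2.

```python
import numpy as np

def l2_closeness_stat(x_counts, y_counts):
    """Unbiased estimator of m^2 * ||p - q||_2^2 under Poissonized sampling,
    X_i ~ Poisson(m * p_i) and Y_i ~ Poisson(m * q_i): per bucket,
    E[(X_i - Y_i)^2 - X_i - Y_i] = m^2 * (p_i - q_i)^2."""
    x = np.asarray(x_counts, dtype=float)
    y = np.asarray(y_counts, dtype=float)
    return float(np.sum((x - y) ** 2 - x - y))
```

Comparing this statistic to a suitable threshold gives an \ell_2 closeness tester; the \ell_1 testers in the talk are, roughly, obtained by running such a statistic on an appropriately flattened domain.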

Up to now, there exist three distinct methodologies that provably achieve optimal estimation and testing performance for a wide range of statistical properties, including the Shannon entropy, mutual information, the Kullback-Leibler divergence, and the total variation distance, among others. These three approaches have intimate connections, reflect key milestone ideas in statistics and machine learning, and have far-reaching applications beyond distribution estimation and testing. We discuss the fundamental ideas behind these three approaches, their relative strengths and weaknesses, as well as advice for their usage in practice.

Based on joint work with Yanjun Han, Dmitri Pavlichin, Kartik Venkat, and Tsachy Weissman.
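For reference, the classical baseline that these optimal methodologies improve upon is the naive plug-in estimator. A minimal sketch of the plug-in entropy estimate, together with the standard Miller-Madow first-order bias correction (function names are ours, not from the talk):

```python
import numpy as np

def plugin_entropy(counts):
    """Naive plug-in (empirical) estimate of Shannon entropy, in nats."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p = counts[counts > 0] / n
    return float(-np.sum(p * np.log(p)))

def miller_madow_entropy(counts):
    """Plug-in estimate plus the Miller-Madow bias correction (S - 1) / (2n),
    where S is the number of distinct symbols observed in the sample."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    s = np.count_nonzero(counts)
    return plugin_entropy(counts) + (s - 1) / (2 * n)
```

The plug-in estimator needs a sample size superlinear in the support size to be accurate; the modern approaches discussed in the talk get away with sublinear sample sizes.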

Symmetric distribution properties such as support size, support coverage, entropy, and proximity to uniformity, arise in many applications. Specialized estimators and analysis tools were recently used to derive asymptotically sample-optimal approximations for each of these properties. We show that a single, simple, plug-in estimator—profile maximum likelihood (PML)—is sample competitive for all symmetric properties, and in particular is asymptotically sample-optimal for all the properties above.

Joint work with Jayadev Acharya, Hirakendu Das, and Ananda Theertha Suresh.
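To make the notion concrete: the profile of a sample records only how many symbols appear once, twice, and so on, discarding symbol identities, which is exactly the information a symmetric property depends on. An illustrative sketch of computing a profile (not the PML optimization itself, which maximizes the probability of the observed profile over all distributions):

```python
from collections import Counter

def profile(sample):
    """Profile of a sample: a map {multiplicity: number of distinct symbols
    appearing with that multiplicity}. Symbol identities are discarded."""
    multiplicities = Counter(sample)               # symbol -> count
    return dict(Counter(multiplicities.values()))  # count -> #symbols

# Example: in "abracadabra", 'a' appears 5 times, 'b' and 'r' twice,
# and 'c' and 'd' once, so the profile is {5: 1, 2: 2, 1: 2}.
```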

Traditionally, distribution testing has focused on testing with respect to the total variation distance. In this talk, I will discuss some results on distribution testing with other distances, including \chi^2, Kullback-Leibler, Hellinger, and \ell_2. I'll also go into motivation for testing with other distances, including applications to testing problems both new and old (e.g., testing independence and monotonicity), and allowing for tolerance to model misspecification.

Based on joint works with Jayadev Acharya, Constantinos Daskalakis, and John Wright.
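For concreteness, the distances mentioned all have short closed forms for discrete distributions. A minimal reference sketch (the conventions, e.g. natural log for KL, are ours):

```python
import numpy as np

def tv(p, q):
    """Total variation distance: half the l1 distance."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(0.5 * np.abs(p - q).sum())

def hellinger(p, q):
    """Hellinger distance: l2 distance between sqrt-vectors, over sqrt(2)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sqrt(0.5) * np.linalg.norm(np.sqrt(p) - np.sqrt(q)))

def kl(p, q):
    """Kullback-Leibler divergence in nats; assumes supp(p) is contained
    in supp(q)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def chi2(p, q):
    """Chi-squared divergence; assumes q has full support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum((p - q) ** 2 / q))
```

These satisfy tv^2/2 <= hellinger^2 <= kl/2 <= chi2/2, which is one reason testing in the stronger distances is harder and more useful.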

How many samples from a multi-dimensional distribution are necessary to distinguish whether it is a product measure or whether it is 10%-far in total variation distance from being product? As it turns out, answering this question rigorously requires exponentially many samples in the dimension. Similar lower bounds apply to a host of statistical testing problems in high dimensions. So what do we really know about high-dimensional distributions and the important phenomena that they model? I will propose a way out of the conundrum with an overview of recent work on testing structured high-dimensional distributions: Bayesian networks and Markov Random Fields. A combination of information-theoretic and statistical physics techniques will yield efficient testing from a number of samples that is a low polynomial in the dimension, and in some cases even just a single sample from the underlying distributions.

Based on joint work with Nishanth Dikkala, Gautam Kamath, Qinxuan Pan.
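The opening question can be made concrete on a tiny example: the following toy sketch (helper names ours) computes the total variation distance between a joint distribution and the product of its marginals. Two perfectly correlated fair bits, for instance, are at TV distance 1/2 from product.

```python
import numpy as np

def product_of_marginals(joint):
    """Given a joint pmf as a d-dimensional array, return the product
    of its d one-dimensional marginals, as an array of the same shape."""
    joint = np.asarray(joint, float)
    d = joint.ndim
    marginals = [joint.sum(axis=tuple(j for j in range(d) if j != i))
                 for i in range(d)]
    out = marginals[0]
    for m in marginals[1:]:
        out = np.multiply.outer(out, m)
    return out

def tv_from_product(joint):
    """TV distance between the joint and the product of its marginals."""
    joint = np.asarray(joint, float)
    return float(0.5 * np.abs(joint - product_of_marginals(joint)).sum())
```

Computing this distance is easy once the joint pmf is known; the hardness discussed in the talk is estimating it from samples when the dimension is large.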

Let \mathrm{p} be an unknown source of randomness with n basic outcomes. We're interested in the usual questions — e.g., the number of samples required to fully learn \mathrm{p}, to test whether \mathrm{p} is close to some fixed hypothesis \mathrm{q}, to estimate the entropy of \mathrm{p}, etc. Recently, sharp upper bounds have been found for these problems (e.g., O(n^2/\varepsilon^2), O(n/\varepsilon^2), O(n^2/\varepsilon + \log^2 n / \varepsilon^2) for the aforementioned problems). Did I mention that \mathrm{p} is a quantum state, the noncommutative cousin of an n-outcome probability distribution? We're learning and testing quantum states.

In this talk we'll survey techniques used in the area. Sometimes it's the “usual thing”: probabilistic analysis of random histograms; or the collision-tester/unbiased-estimator/variance-analysis approach for testing whether p is the maximum-entropy distribution. Other times we'll need to delve into diverse older topics — going back in time to *The Art of Computer Programming Vol. 3 (Sorting and Searching)*, or even further back in time to the representation theory of the symmetric group.
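The collision tester mentioned above admits a compact sketch: the empirical collision rate is an unbiased estimate of \|p\|_2^2, which equals 1/n exactly when p is uniform, the maximum-entropy distribution on [n]. The acceptance threshold below is illustrative; the actual constants come out of a variance analysis.

```python
import numpy as np

def collision_statistic(sample):
    """Fraction of colliding pairs among all pairs in the sample.
    Its expectation is ||p||_2^2, which is minimized (= 1/n) by the
    uniform distribution on a domain of size n."""
    sample = np.asarray(sample)
    _, counts = np.unique(sample, return_counts=True)
    colliding = int(np.sum(counts * (counts - 1) // 2))
    total = len(sample) * (len(sample) - 1) // 2
    return colliding / total

def is_close_to_uniform(sample, n, eps):
    """Accept iff the collision rate is below (1 + eps^2) / n.
    Illustrative threshold; tuning the constants and the sample size
    m = O(sqrt(n) / eps^2) is where the variance analysis comes in."""
    return collision_statistic(sample) <= (1 + eps ** 2) / n
```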

In many situations, sample data is obtained from a noisy or imperfect source. In order to address such corruptions, we propose the methodology of sampling correctors. Such algorithms use structure that the distribution is purported to have, in order to allow one to make “on-the-fly” corrections to samples drawn from probability distributions. These algorithms may then be used as filters between the noisy data and the end user. We show connections between sampling correctors, distribution learning algorithms, and distribution property testing algorithms. We show that these connections can be utilized to expand the applicability of known distribution learning and property testing algorithms as well as to achieve improved algorithms for those tasks.

Warning: This talk contains more questions than answers...

Joint work with Clément Canonne and Themis Gouleakis.
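A toy example in the spirit of a sampling corrector: von Neumann's classic trick filters samples from a coin of unknown bias into perfectly fair bits, using only the structural promise that the draws are i.i.d. (the code and names are illustrative, not from the paper).

```python
import random

def von_neumann_corrector(biased_coin):
    """Toy 'sampling corrector': given sample access to any i.i.d. coin
    of unknown bias, emit a perfectly fair bit. Draw pairs of samples;
    output the first coordinate on an unequal pair (heads-tails and
    tails-heads are equally likely), and discard equal pairs."""
    while True:
        a, b = biased_coin(), biased_coin()
        if a != b:
            return a

# Usage: correcting a coin that lands heads 90% of the time.
biased = lambda: random.random() < 0.9
fair_bits = [von_neumann_corrector(biased) for _ in range(1000)]
```

Real sampling correctors exploit richer structure (e.g., monotonicity) and aim to use few samples per corrected output; this filter only illustrates the input/output interface.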

We initiate a study of proofs of proximity for properties of distributions, which are proof systems within the framework of distribution testing. We investigate the power and limitations of several types of these proof systems, including the distribution testing analogues of NP, MA, and IP. In particular, we show that proof systems can significantly reduce the complexity of testing natural properties of distributions.

Joint work with Alessandro Chiesa.

- Clément Canonne will/may be graduating from Columbia University in September 2017, where his advisor is Rocco Servedio. His research focuses on the fields of property testing and sublinear algorithms; specifically, on understanding the strengths and limitations of the standard models in property and distribution testing, as well as in related areas. He also really likes elephants.
- Constantinos Daskalakis is an associate professor of computer science and electrical engineering at MIT. He holds a diploma in electrical and computer engineering from the National Technical University of Athens, and a Ph.D. in electrical engineering and computer sciences from UC-Berkeley. His research interests lie in theoretical computer science and its interface with economics, probability, learning and statistics. He has been honored with the 2007 Microsoft Graduate Research Fellowship, the 2008 ACM Doctoral Dissertation Award, the Game Theory and Computer Science Prize from the Game Theory Society, the 2010 Sloan Fellowship in Computer Science, the 2011 SIAM Outstanding Paper Prize, the 2011 Ruth and Joel Spira Award for Distinguished Teaching, the 2012 Microsoft Research Faculty Fellowship, and the 2015 Research and Development Award by the Vatican Giuseppe Sciacca Foundation. He is also a recipient of Best Paper awards at the ACM Conference on Economics and Computation in 2006 and in 2013.
- Ilias Diakonikolas is an Assistant Professor and Andrew and Erna Viterbi Early Career Chair in the Department of Computer Science at USC. He obtained a Diploma in electrical and computer engineering from the National Technical University of Athens and a Ph.D. in computer science from Columbia University where he was advised by Mihalis Yannakakis. Before moving to USC, he was a faculty member at the University of Edinburgh, and prior to that he was the Simons postdoctoral fellow in theoretical computer science at the University of California, Berkeley. His research is on the algorithmic foundations of massive data sets, in particular on designing efficient algorithms for fundamental problems in machine learning. He is a recipient of a Sloan Fellowship, an NSF Career Award, a Google Faculty Research Award, a Marie Curie Fellowship, the IBM Research Pat Goldberg Best Paper Award, and an honorable mention in the George Nicholson competition from the INFORMS society.