COMS 6998: Sublinear Time Algorithms in Learning and Property Testing
Spring 2014

SYLLABUS

## General Information

Instructor: Rocco Servedio. Office hours: Fri 9:30-11:30, CSB 517.
Teaching assistant: Clement Canonne (OH: Tues 2-4, CSB 516)
Classroom: 644 Mudd
Time: Wed 4:10-6pm.
Course email (use for all course-related issues): coms6998learntests14 at gmail dot com.

Administrative Note: This course can be used as a theory elective for the Ph.D. program in computer science, as a track elective course for MS students in the "Foundations of Computer Science" track or the "Machine Learning" track, or as a track elective for undergraduates in the "Foundations" track. Auditing the class requires the approval of the instructor.

## Introduction

Recently there has been a lot of glorious hullabaloo about Big Data and how it is going to revolutionize the way we work, play, eat and sleep. As part of the general excitement, it has become clear that for truly massive datasets and digital objects of various sorts, even algorithms which run in linear time (linear in the size of the relevant dataset or object) may be much too slow.

In this course we will study sub-linear time algorithms which are aimed at helping us understand massive objects. The algorithms we study inspect only a tiny portion of an unknown object and have the goal of coming up with some useful information about the object. Algorithms of this sort provide a foundation for principled analysis of truly massive data sets and data objects.

In particular we will consider

• learning algorithms: Here the goal is to come up with an representation (or hypothesis) that is a high-accuracy approximation of the unknown object.
• Algorithms for property testing. Here the (more modest) goal is to determine whether the unknown object has some particular property of interest, or is ``far'' (in a suitable sense) from every object having the property.

We will study these algorithms for three different types of objects: Boolean functions, graphs, and probability distributions.

## Prerequisites

There are no specific prerequisities for the course, but you should be mathematically mature and comfortable with the usual style of theoretical computer science arguments and analyses. Prior graduate level coursework in theoretical computer science will be helpful, but the course should be accessible to mathematically strong undergraduates as well. It may be helpful to have taken COMS 4252 (the introductory learning theory course), but this is not a requirement -- this course has been designed to be accessible without COMS 4252.

In a bit more detail, we will freely use standard tail bounds for sums of independent random variables (i.e. Chernoff bounds and Hoeffding bounds) so it may be helpful to come into the course familiar with these basic tools. Here is a quick summary of these results that has been helpfully prepared by Clement.

## Rough Syllabus

Here is a rough anticipated syllabus for the course. New topics may be added in and some of the topics listed below may not be covered depending on how the semester goes. The majority of the course will be lectures given by the instructor on the topics below. Pointers to relevant readings are given below. The last few lectures will be presentations given by the students as part of their final projects.

Introduction:

Boolean Functions:
• Warmup: Learning parities and juntas. Basics of Fourier analysis over the Boolean cube.
• Fourier analysis and learning. Finding large Fourier coefficients using membership queries.
• Learning decision trees, sparse polynomials, monotone functions, and constant-depth circuits.
• Relevant readings: Section 3.1 of the Kobler-Lindner survey for the low-degree algorithm and application to constant-depth circuits; Section 3.2 for the KM-algorithm for finding large Fourier coefficients and applications to learning decision trees and sparse polynomials; Section 4 for monotone functions. The KL survey also has pointers to the original papers on these topics.
• Learning intersections of halfspaces (and more) using ROBPs.
• Relevant readings: See here for the original Gopalan/Klivans/Meka paper.
• Testing Boolean functions: the basics. Testing linearity.
• Relevant readings: These two surveys of Dana Ron provide an extensive introduction to property testing of Boolean functions (and more). The Fourier-based linearity test we present is from this paper of Bellare, Coppersmith, Hastad, Kiwi and Sudan.
• Testing monotone functions: upper and lower bounds.
• Relevant readings: The monotonicity testing algorithm we presented is from this paper of Goldreich, Goldwasser, Lehman, Ron, and Samorodnitsky. The lower bound for one-sided testing algorithms is from this paper of Fischer, Lehman, Newman, Raskhodnikova, Rubinfeld, and Samorodnitsky. The lower bound for two-sided testing algorithms is from a forthcoming paper of Chen, Servedio and Tan.
• Testing juntas: upper and lower bounds. Other testing results.
• Relevant readings: The junta testing algorithm we presented is from this paper of Blais. The lower bound for testing juntas is from this paper of Chockler and Gutfreund.
Graphs:
• Testing dense graphs: the basics. Testing bipartiteness.
• Testing triangle-freeness. Regularity and Szemeredi's regularity lemma.
• Testing sparse graphs: the basics. Testing connectivity.
Distributions:
• Learning distributions: the basics. Learning general distributions.
• Hypothesis testing for distribution learning. Learning monotone distributions.
• Testing distributions: the basics. Testing uniformity (upper and lower bounds).
• Testing equivalence to known and unknown distributions. Testing monotone distributions.

The requirements of the course are as follows:

• Class Participation: Students are expected to come to lecture and are encouraged to participate.
• Scribing lectures: For each lecture there will be one or more assigned scribes. Your duties as a scribe are to take careful notes which will be made available on the class web page. Your notes should be a clear, detailed and readable exposition of the material covered in that lecture. You are encouraged to use outside sources (in particular, relevant papers) to make your notes as good as possible. A template will be provided.

• Homework problem: As the semester progresses I will designate (sometimes in class) certain questions/lemmas as "official problems." These will either be problems relating to our course material, or lemmas that I will state but not prove in class. You are to choose one of these "official problems" at some point in the semester and turn in a clear, detailed, self-contained solution to the problem by the last day of class, April 30. Click here for a list of the "official problems".

• Final project report: Each student will complete a research project on a topic related to the course material. You may choose your project topic, and you can do a solo project or work with one partner.

The project is your chance to become an expert on some particular learning / testing / sublinear-time algorithms problem that interests you (and is aligned with the flavor of the course), and hopefully to contribute to what is known about this topic. You will turn in a project proposal fairly early in the semester to identify the general area of your project, and ultimately a final report as described below.

There are two aspects to the final project. The first is a literature review: this will involve identifying the relevant literature on your topic (typically one or two research papers), closely reading and understanding this literature, and writing up a presentation of the results you have learned in your own words. You are encouraged to use the scribe notes template for this portion of your project.

The second aspect of the project is to identify a specific research question or direction on your chosen topic, think about it, and write up the results of your research. Ideally, your project will result in some research progress that could lead to a new publishable research result in the topic you pursue, but this is not a course requirement. All that you need to do for this portion of the project to be successful is come up with a research question/direction and give evidence of having made a real attempt to attack it.

So to summarize the project requirements, your final project submission should (1) give a clear, self-contained exposition of what you have learned about your chosen topic -- as mentioned above, the class scribe notes format is fine for this portion; and (2) it should also explain the specific research problem you investigated, what you did to try to solve this problem, and what progress you made.

Final projects will be due on Wed May 7; more details are available on the projects page, which lists some possible project topics.

• Final project presentation: You will give an in-class presentation on your final project paper (more details will be available later in the semester on the projects page).

There is no required textbook for the course. Pointers to required and recommended readings (papers and occasional book chapters) will be posted on the ``Syllabus'' portion of this webpage above as the semester progresses.

## Problems

Here is a list of the "official problems".