COMS 6998-12: Dealing with Massive Data

Administrivia

Instructor: Sergei Vassilvitskii, Yahoo! Research
Time: Mondays 6:10pm-8pm
Location: 329 Pupin
TAs: Mohan (mk3245), Kripa (kk2735)

Course Description

The size of modern datasets is staggering. With Yahoo! Mail moving over 3 billion messages per day, Twitter recording more than 100 million tweets per day, Facebook users spending over 20 billions minutes every day, and Google executing over a billion searches every day, how does one make sense of all of the data that is generated?
This course will provide an introduction to algorithm design for such large datasets. We will cover streaming algorithms, which never store the whole input in memory and parallel algorithms, which partition the computation across multiple machines. In particular we will look how to utilize the MapReduce framework for large scale data analysis.

The main goal of this course is to introduce algorithmic design techniques for dealing with large data sets. This will be primarily a theoretical analysis course, with a focus on practical algorithms and applications.

Prerequisites

Algorithms, Discrete Math. No prior knowledge of streaming or parallel algorithms is necessary.

Homework

Homework 1. Posted February 8. Due February 28 at end of class (8pm).
Homework 1.5. Posted March 2. No Due Date.
Homework 2 Posted March 24. Due April 14 at 11:59pm NY time.
Project Posted April 18. Due May 2 at end of class (8pm).

Approximate Schedule

January 24: Introduction. Notes

For more see:

Broder, Glassman, Manasse, Zweif. Syntactic clustering of the web
Henzinger.Finding near duplicate web pages: a large scale evaluation of algorithms

January 31: Distinct Value Estimation. Notes

For more see:

Charikar, Chaudhuri, Motwani, Narasayya. Towards Estimation Error Guarantees for Distinct Values
Gibbons. Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports
Flajolet, Martin. Probabilistic counting algorithms for data base applications

February 7: Finding Frequent Elements. Notes

Correction. For the second algorithm described in class, to estimate count of element j, use: Count[h(x_j)]*g(x_j). In class, we used Z*g(x_j). While both of them give unbiased results (Why?) the latter has a much lower variance.

For more see:

Cormode, Muthukrishnan. An improved data stream summary: the count-min sketch and its applications
Charikar, Chen, Farach-Colton. Finding Frequent Items in Data Streams

February 14: Clustering on streams

For more see:

Charikar, O'Callaghan, Panigrahy. Better Streaming Algorithms for Clustering Problems

February 21: Graph Algorithms on streams

For more see:

Feigenbaum, Kannan, McGregor, Suri, Zhang. On graph problems in a semi-streaming model

February 28: PRAMs and Matchings

For more see:

Israeli, Itai. A fast and Simple Randomized Parallel Algorithm for Maximal Matchings

March 7: Intro to MapReduce

For more see:

Cloudera Tutorial: Tutorial Slides
Intro slides from UMD: UMD Slides
Dean, Ghemawat. MapReduce: Simplified Data Processing on Large Clusters

March 14: No Class -- Spring Break

March 21: Social Network Analysis

For more see:

Suri, Vassilvitskii. Counting Triangles and the Curse of the Last Reducer

March 28: PageRank and Data Privacy

Slides from G. Cormode's talk: pdf

April 4: Recommendation Systems

Slides from J. Hofman's talk: pdf

April 11: Max Matchings in MapReduce

For more see:

Lattanzi, Moseley, Suri, Vassilvitskii. Filtering: A Method for Solving Graph Problems in MapReduce

April 18: Graph Spanners

For more see:

Thorup and Zwick. Approximate distance oracles

March 25: Large Scale Machine Learning

May 2:

Grading Policy

There will be two problem sets (30% each), one final project (30%), participation (10%).

Assignment Policy

The problem sets will require you to do proofs. You are encouraged to discuss the course material and the homework problems with each other in small groups (2-3 people), as long as you list all discussion partners on your problem set. Discussion of homework problems may include brainstorming and verbally walking through possible solutions, but should not include one person telling the others how to solve the problem. In addition, each person must write up their solutions entirely on their own; you may not look at another student's written solutions. List your collaborators on your solutions. Moreover, all materials you consult must be appropriately acknowledged.

Please consult me if you have any questions about this policy. When in doubt play it safe. If I suspect that you have turned in a homework assignment which you don't understand, you may be asked to orally defend your solutions. If you turn in a homework assignment in violation of the above policies, the highest grade you will receive on that assignment is 0, and you may receive a negative grade.

Students are expected to adhere to the Academic Honesty policy of the Computer Science Department; this policy can be found in full here.