COMS 6998-12: Dealing with Massive Data

## Course Description

The size of modern datasets is staggering. With Yahoo! Mail moving over 3 billion messages per day, Twitter recording more than 100 million tweets per day, Facebook users spending over 20 billions minutes every day, and Google executing over a billion searches every day, how does one make sense of all of the data that is generated?
This course will provide an introduction to algorithm design for such large datasets. We will cover streaming algorithms, which never store the whole input in memory and parallel algorithms, which partition the computation across multiple machines. In particular we will look how to utilize the MapReduce framework for large scale data analysis.

The main goal of this course is to introduce algorithmic design techniques for dealing with large data sets. This will be primarily a theoretical analysis course, with a focus on practical algorithms and applications.

## Prerequisites

Algorithms, Discrete Math. No prior knowledge of streaming or parallel algorithms is necessary.

## Homework

Homework 1. Posted February 8. Due February 28 at end of class (8pm).
Homework 1.5. Posted March 2. No Due Date.
Homework 2 Posted March 24. Due April 14 at 11:59pm NY time.
Project Posted April 18. Due May 2 at end of class (8pm).

## Approximate Schedule

January 24: Introduction. Notes

For more see:

January 31: Distinct Value Estimation. Notes

For more see:

February 7: Finding Frequent Elements. Notes

• Correction. For the second algorithm described in class, to estimate count of element j, use: Count[h(x_j)]*g(x_j). In class, we used Z*g(x_j). While both of them give unbiased results (Why?) the latter has a much lower variance.
For more see:
February 14: Clustering on streams

For more see:

February 21: Graph Algorithms on streams

For more see:

February 28: PRAMs and Matchings

For more see:

March 7: Intro to MapReduce

For more see:

March 14: No Class -- Spring Break

March 21: Social Network Analysis

For more see:

March 28: PageRank and Data Privacy

Slides from G. Cormode's talk: pdf

April 4: Recommendation Systems

Slides from J. Hofman's talk: pdf

April 11: Max Matchings in MapReduce

For more see:

April 18: Graph Spanners

For more see:

March 25: Large Scale Machine Learning

May 2: