COMS 6998-12: Dealing with Massive Data


Administrivia

Course Description

    The size of modern datasets is staggering. With Yahoo! Mail moving over 3 billion messages per day, Twitter recording more than 100 million tweets per day, Facebook users spending over 20 billions minutes every day, and Google executing over a billion searches every day, how does one make sense of all of the data that is generated?
    This course will provide an introduction to algorithm design for such large datasets. We will cover streaming algorithms, which never store the whole input in memory and parallel algorithms, which partition the computation across multiple machines. In particular we will look how to utilize the MapReduce framework for large scale data analysis.

The main goal of this course is to introduce algorithmic design techniques for dealing with large data sets. This will be primarily a theoretical analysis course, with a focus on practical algorithms and applications.

Prerequisites

Algorithms, Discrete Math. No prior knowledge of streaming or parallel algorithms is necessary.

Homework

Homework 1. Posted February 8. Due February 28 at end of class (8pm).
Homework 1.5. Posted March 2. No Due Date.
Homework 2 Posted March 24. Due April 14 at 11:59pm NY time.
Project Posted April 18. Due May 2 at end of class (8pm).

Approximate Schedule

January 24: Introduction. Notes

  For more see:

January 31: Distinct Value Estimation. Notes

   For more see:

February 7: Finding Frequent Elements. Notes

  • Correction. For the second algorithm described in class, to estimate count of element j, use: Count[h(x_j)]*g(x_j). In class, we used Z*g(x_j). While both of them give unbiased results (Why?) the latter has a much lower variance.
    For more see: February 14: Clustering on streams

    For more see:

February 21: Graph Algorithms on streams

    For more see:

February 28: PRAMs and Matchings

    For more see:

March 7: Intro to MapReduce

    For more see:

March 14: No Class -- Spring Break

March 21: Social Network Analysis

    For more see:

March 28: PageRank and Data Privacy

    Slides from G. Cormode's talk: pdf

April 4: Recommendation Systems

    Slides from J. Hofman's talk: pdf

April 11: Max Matchings in MapReduce

    For more see:

April 18: Graph Spanners

    For more see:

March 25: Large Scale Machine Learning

May 2:

Grading Policy

There will be two problem sets (30% each), one final project (30%), participation (10%).

Assignment Policy

The problem sets will require you to do proofs. You are encouraged to discuss the course material and the homework problems with each other in small groups (2-3 people), as long as you list all discussion partners on your problem set. Discussion of homework problems may include brainstorming and verbally walking through possible solutions, but should not include one person telling the others how to solve the problem. In addition, each person must write up their solutions entirely on their own; you may not look at another student's written solutions. List your collaborators on your solutions. Moreover, all materials you consult must be appropriately acknowledged.

Please consult me if you have any questions about this policy. When in doubt play it safe. If I suspect that you have turned in a homework assignment which you don't understand, you may be asked to orally defend your solutions. If you turn in a homework assignment in violation of the above policies, the highest grade you will receive on that assignment is 0, and you may receive a negative grade.

Students are expected to adhere to the Academic Honesty policy of the Computer Science Department; this policy can be found in full here.