COMS 6998-12: Dealing with Massive Data
Administrivia
Course Description
The size of modern datasets is staggering. With Yahoo! Mail moving over 3 billion messages per day, Twitter recording more than 100 million tweets per day, Facebook users spending over 20 billion minutes every day, and Google executing over a billion searches every day, how does one make sense of all of the data that is generated?
This course will provide an introduction to algorithm design for such large datasets. We will cover streaming algorithms, which never store the whole input in memory, and parallel algorithms, which partition the computation across multiple machines. In particular, we will look at how to use the MapReduce framework for large-scale data analysis. The main goal of this course is to introduce algorithmic design techniques for dealing with large datasets. This will be primarily a theoretical analysis course, with a focus on practical algorithms and applications.

Prerequisites
Algorithms, Discrete Math. No prior knowledge of streaming or parallel algorithms is necessary.

Homework
Homework 1. Posted February 8. Due February 28 at end of class (8pm).
Homework 1.5. Posted March 2. No due date.
Homework 2. Posted March 24. Due April 14 at 11:59pm NY time.
Project. Posted April 18. Due May 2 at end of class (8pm).

Approximate Schedule
January 24: Introduction. Notes
March 7: Intro to MapReduce
March 21: Social Network Analysis
Slides from G. Cormode's talk: pdf
April 4: Recommendation Systems
Slides from J. Hofman's talk: pdf
April 11: Max Matchings in MapReduce
May 2:

Grading Policy
There will be two problem sets (30% each), one final project (30%), and participation (10%).

Assignment Policy
The problem sets will require you to do proofs. You are encouraged to discuss the course material and the homework problems with each other in small groups (2-3 people), as long as you list all discussion partners on your problem set. Discussion of homework problems may include brainstorming and verbally walking through possible solutions, but should not include one person telling the others how to solve the problem. In addition, each person must write up their solutions entirely on their own; you may not look at another student's written solutions. List your collaborators on your solutions. Moreover, all materials you consult must be appropriately acknowledged.
Please consult me if you have any questions about this policy. When in doubt, play it safe. If I suspect that you have turned in a homework assignment that you don't understand, you may be asked to orally defend your solutions. If you turn in a homework assignment in violation of the above policies, the highest grade you will receive on that assignment is 0, and you may receive a negative grade. Students are expected to adhere to the Academic Honesty policy of the Computer Science Department; this policy can be found in full here.
