This class offers a hands-on approach to machine learning and data science. It covers the application of machine learning methods such as SVMs, random forests, gradient boosting and neural networks to real-world datasets, including data preparation, model selection and evaluation. The class complements COMS W4721 in that it relies entirely on available open source implementations in scikit-learn and TensorFlow. Apart from applying models, we will also discuss software development tools and practices relevant to productionizing machine learning models.
Prerequisites
Familiarity with Python programming and basic use of NumPy, pandas and matplotlib. A good reference is the Python Data Science Handbook by Jake VanderPlas. It is free to read online and is also available as notebooks. I highly recommend going through it before starting the class.
Grading / course grade
Six homework assignments (60%), a midterm exam (20%) and a final in-class exam (20%). All homework assignments are programming assignments and must be submitted via GitHub (as will be explained in class). The midterm will test material from the first half of the class, while the final exam will test material from the second half.
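As a worked example of the weighting above, here is a minimal sketch of how a course grade could be computed. It assumes all scores are on a 0-100 scale and that the six homework assignments count equally within their 60%; neither detail is stated in this syllabus, and the function name `course_grade` is purely illustrative.

```python
def course_grade(homework_scores, midterm, final):
    """Combine scores (each on a 0-100 scale) using the syllabus weights."""
    if len(homework_scores) != 6:
        raise ValueError("expected scores for 6 homework assignments")
    # Assumption: the six homeworks are weighted equally within the 60%.
    homework_avg = sum(homework_scores) / len(homework_scores)
    # Weights from the syllabus: homework 60%, midterm 20%, final 20%.
    return 0.6 * homework_avg + 0.2 * midterm + 0.2 * final


# Made-up scores, for illustration only.
example = course_grade([90, 80, 100, 70, 85, 95], midterm=75, final=88)
print(round(example, 1))
```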
All homework assignments are due at 1pm. No late submissions (or commits) will be accepted, and there are no deadline extensions. The last commit before the deadline will be counted as your submission. All code is expected to run on Python 3.4 and adhere to the PEP 8 style guide.
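The "last commit before the deadline" rule can be illustrated with a short sketch. This is not the actual grading script: the function name, the commit hashes and the dates are hypothetical, timestamps are assumed to be in the same timezone as the deadline, and a commit made exactly at the deadline is assumed not to count.

```python
from datetime import datetime


def select_submission(commits, deadline):
    """Return the hash of the latest commit made before the deadline.

    `commits` is a list of (hash, timestamp) pairs; returns None if no
    commit was made in time.
    """
    # Assumption: "before" is strict, so a commit at exactly 1pm is late.
    on_time = [(ts, sha) for sha, ts in commits if ts < deadline]
    if not on_time:
        return None
    # Tuples compare by timestamp first, so max() picks the latest commit.
    return max(on_time)[1]


# Hypothetical commit history and deadline, for illustration only.
deadline = datetime(2024, 2, 1, 13, 0)
commits = [
    ("a1b2c3", datetime(2024, 1, 31, 22, 15)),
    ("b2c3d4", datetime(2024, 2, 1, 12, 59)),
    ("c3d4e5", datetime(2024, 2, 1, 13, 5)),  # after the deadline, ignored
]
chosen = select_submission(commits, deadline)
print(chosen)
```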
The exams will be written; no computers or course materials are allowed. Everything on the slides or in the notes to the slides may be tested. There might be some minor coding, but the questions will be mostly conceptual and multiple choice.
The syntax of git and of the Python libraries covered in class (to the extent they were covered) is part of the exam material.
Academic rules of conduct
You are expected to adhere to the Academic Honesty policy of the Computer Science Department, as well as the following course-specific policies.
You are welcome and encouraged to discuss course materials and reading assignments with other students. Please limit discussion of homework to general approaches. You are not allowed to share code between submissions or submission groups. For homeworks submitted individually, each individual is required to write their own solution. For homeworks submitted in groups (if allowed), a single write-up should be submitted. Collaboration is not permitted for any of the exams.
Use of outside references
Students are welcome to use any outside materials or sources on general machine learning and data science topics. However, you are not permitted to use solutions to specific homework tasks or problems that you find online. Code that is provided during the lectures or as part of the GitHub repository can be reused for the homework, but should be marked as such.
Violation of any portion of these policies will result in a penalty to be assessed at the instructor’s discretion. This may include receiving a zero grade for the assignment in question AND a failing grade for the whole course, even for the first infraction. Such students are also reported to the relevant Deans’ offices that handle cases of academic dishonesty.
Lecture slides, notes, illustrations and notebooks are licensed under CC0 and can be used for any purpose without acknowledgement (though acknowledgement is appreciated). Homeworks, homework solutions, exams and exam solutions are copyrighted and may not be redistributed without explicit permission from the instructor.