COMS E6998 (Spring 2019)

Lecture Details

Instructor: Suman Jana
Office: Mudd 412
Office hours: by appointment
TA: Kyle Matoba (km3227@columbia.edu)
TA office hours: Mondays 2:30-4:00 pm over Skype
Classroom: 1127 Seeley W. Mudd Building
Class hours: Thursday (1:10-3:40 pm)

Description

This class focuses on improving program analysis using Machine Learning (ML). Traditionally, program analysis has relied on formal logic for its mathematical precision and expressiveness. However, such approaches often struggle to scale to large programs. In this class, I plan to explore the challenges and possibilities of combining ML with formal logic to make such analysis scalable without major sacrifices in precision.

Note: There is no assigned textbook for the class; you are expected to read the assigned articles, papers, and slides carefully.

The final project report is due by May 18th at 11:59 pm. It should use the two-column ACM format and be at least 5-6 pages long.

Prerequisite

There is no formal prerequisite for this class, but you should be generally comfortable with ML. Feel free to send me an email if you have any specific questions.

Grading

Project - 60%
Presentations - 30%
Class participation - 10%
Extra credit (scribing a lecture) - 5%

Schedule

Date Topics Lecture notes & Reading

Jan 24 Basics of program analysis Class notes

Jan 31 Class cancelled

Feb 7 Static Analysis & abstract interpretation Control Flow (Slides: Control Flow Analysis.pptx, Control Flow Analysis.pdf, Notes: Class notes) Control flow reading Data Flow (Data Flow Analysis.pptx, Data Flow Analysis.pdf) Data flow reading Abstract interpretation Reading

Feb 14 Symbolic analysis Symbolic Execution.pptx, Symbolic Execution.pdf, Notes: Class notes
Additional reading: Symbolic Execution for Software Testing: Three Decades Later (Cadar and Sen)
KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs (Cadar et al.)
CUTE: A Concolic Unit Testing Engine for C (Sen et al.)
DART: Directed Automated Random Testing (Godefroid et al.)
Symbolic execution and program testing (King)

Feb 21 Dynamic analysis & fuzzing fuzzing.pptx, fuzzing.pdf, Notes: Class notes

Feb 28 Improving fuzzing with ML (Neuzz, Learn&Fuzz)

Mar 7 Student Presenter: Gabriel Ryan (Slides) Neural Code Comprehension: A Learnable Representation of Code Semantics by Ben-Nun et al. NIPS 2018

Mar 14 Student Presenter: Harry Smith (Slides) DeepCoder: Learning to Write Programs by Balog et al. ICLR 2017
1 page preliminary project proposals due

Mar 21 No Class (Spring Break)

Mar 28 Student Presenters: Noah Gallant (Slides)
Justin Wong (Slides) AppFlow: Using Machine Learning to Synthesize Robust, Reusable UI Tests by Hu et al. FSE'18
code2vec: Learning Distributed Representations of Code by Alon et al. POPL 2019

Apr 4 Student Presenters: Joshua Learn (Slides)
Saikat Chakraborty/Yufan Zhuang (Slides) Neural-Augmented Static Analysis of Android Communication by Zhao et al. FSE'18
Improving Neural Program Synthesis with Inferred Execution Traces by Shin et al. NIPS'18
Leveraging Grammar And Reinforcement Learning For Neural Program Synthesis by Bunel et al. ICLR'18

Apr 11 Christian Doan (postponed)
Jonas Duan (Slides)
Avik Laha (Slides) Recognizing Functions in Binaries with Neural Networks by Shin et al. USENIX Sec'15
An introduction to Topological Data Analysis: fundamental and practical aspects for data scientists

Apr 18 Dennis Roellke (Slides)
Yoongbok Lee (Slides)
Jeevan Farias/Dmitiri Leggas (Slides) Quantifying Program Bias by Albarghouthi et al. OOPSLA 2017
Learn&Fuzz: Machine Learning for Input Fuzzing by Godefroid et al. ASE 2017
Neural Sketch Learning For Conditional Program Generation by Murali et al. ICLR 2018
Learning to Infer Program Sketches by Nye et al.

Apr 25 Shiqi Wang+Justin Whitehouse (Slides)
Kyra Busser (Slides) Safety Verification and Robustness Analysis of Neural Networks via Quadratic Constraints and Semidefinite Programming by Fazlyab et al.
Predicting Program Properties from “Big Code” by Raychev et al. POPL 2015

May 2 Abhishek Shah/Dongdong She/Kexin Pei (Slides) Neuro-symbolic Execution: Augmenting Symbolic Execution with Neural Constraints by Shen et al. NDSS'19
Debin: Recovering Stripped Info from Binaries by He et al. CCS'18
Learning To Represent Programs with Graphs by Allamanis et al. ICLR'18

Online presentations:

Andrew Calvano (Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection) Video, Slides
Nathan Reitinger (Towards Seamless Tracking-Free Web: Improved Detection of Trackers via One-Class Learning) Video, Slides
Ben Meerovitch (Deep Learning to Find Bugs) Video, Slides
