Instructor: John Hewitt — Columbia University — Department of Computer Science
The use of large language models pervades artificial intelligence research and, increasingly, our world. We should expect to understand these systems, as we expect to understand any other technology. Curiously, as of now, the best methods for aligning these systems make precious little use of our work toward understanding their internal workings. This PhD seminar in natural language processing brings students together with the shared goal of synthesizing, critiquing, and extending recent (and not-so-recent) research in the distinct subfields of interpretability and alignment.
The first three weeks will consist of an intensive review of the mathematics and technical aspects of large language models—their architecture, pretraining, and alignment—as well as attempts to understand them. Then, students will present research papers to the rest of the class, which will jointly assess, critique, and extend those papers.
There are no formal prerequisites, though having taken COMS 4705 (Natural Language Processing) would be useful. I am admitting PhD students primarily, though I will send an interest form to the waitlist to gauge research interest for the remaining slots in the class.
Lectures: Fridays, 1:10–2:00 PM
Location: TBD
We'll use Ed for discussion forums and Gradescope for assignment submission. You should have been added automatically to both. If you just enrolled, ping us to sync the Canvas roster.
| Date | Lecture | Notes |
|---|---|---|
| Jan 23 | Foundations I: Transformer LMs | |
| Jan 30 | Foundations II: Interpretability | Last day to add courses |
| Feb 6 | Foundations III: Alignment | |
| Feb 13 | Student Presentations | |
| Feb 20 | Student Presentations | |
| Feb 27 | Student Presentations | |
| Mar 6 | Student Presentations | |
| Mar 13 | Student Presentations | |
| Mar 20 | No class | Spring Recess |
| Mar 27 | Student Presentations | |
| Apr 3 | Student Presentations | |
| Apr 10 | Student Presentations | |
| Apr 17 | Student Presentations | |
| Apr 24 | Student Presentations | |
| May 1 | Wrap-up | |
Each session features two paper presentations. For each paper, three students take on distinct roles:
| Role | Responsibility |
|---|---|
| Background Investigator | Provides context on the paper's motivation, related work, and historical significance |
| Interpretability Researcher | Analyzes the paper through the lens of furthering our understanding of model internals |
| Alignment Researcher | Evaluates the paper's implications for AI safety and alignment |
With 20 papers across 10 sessions and 3 roles per paper, each student will present exactly twice during the semester.
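As a quick check on that arithmetic (the implied enrollment of 30 is an inference from these numbers, not stated elsewhere in the syllabus):

```latex
% 2 papers per session x 10 presentation sessions = 20 papers
% 20 papers x 3 roles per paper = 60 presentation slots
% 60 slots / 2 presentations per student = 30 students (inferred enrollment)
20 \times 3 = 60 \text{ slots}, \qquad \frac{60 \text{ slots}}{2 \text{ per student}} = 30 \text{ students}.
```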
This grading breakdown is provisional and subject to change.
Letter grades will be determined by the teaching staff as a function of the following breakdown; cutoffs for each letter grade will be decided at the end of the semester rather than set in advance. All written deliverables must be typeset in LaTeX and submitted as PDF.
| Component | Weight |
|---|---|
| Research Paper Presentation | 50% |
| Research Review | 50% |
Students are allowed to use AI tools in whatever capacity they desire. The content students submit is their responsibility alone.
This course has no required textbook; I will provide lecture notes for the fundamentals, and then we will be reading and presenting papers.
Attendance is expected: in general, you should be at effectively every lecture. However, I dislike grading on attendance, so there is no penalty for missing class, and I understand that everyone will need to miss a lecture or two.
Please see the grading section for our policies on AI tools in this class. Otherwise, please refer to the Faculty Statement on Academic Integrity and the Columbia University Undergraduate Guide to Academic Integrity.
The teaching team is committed to accommodating students with disabilities in line with the Faculty Statement on Disability Accommodations.