COMS 6998:
Advanced Topics in Spoken Language Processing
Instructor: Julia Hirschberg
Time: Tu 4:10-6:00 (Spring 2025)
Location: Schapiro CEPSR 750
 
 
Prerequisite: COMS 4705 or another speech or NLP class, and experience in Machine Learning
Description: This class will introduce students to spoken language processing: basic concepts, analysis approaches, and applications. Applications include Text-to-Speech Synthesis, dialogue systems, and analysis of entrainment, empathy, personality, emotion, humor and sarcasm, deception and trust, radicalization and charisma, all using text and speech information and some visual features as well.
 
Required readings:
Jurafsky & Martin, Speech and Language Processing (3rd edition draft, 2023), selected chapters. These and other readings are linked from this syllabus for each class.
Suggested:
Keith Johnson, Acoustic & Auditory Phonetics (3rd edition), Wiley, 2011.
 
Resources:
A list of resources can be found here.
 
Office Hours
Julia Hirschberg: TBD
Yu-Wen Chen: F 2-4pm
Kimiya Shahamat: M 4-6pm
Riya Raj: W 4:30-6:30pm
Priyanka Varghese: TBD
 
Grade Breakdown
5% attendance
20% weekly posts
20% HW1
25% HW2
30% HW3
 
Also please note our late policies:
For weekly posts: Monday deadline, 11:59pm; 1 late day allowed, but 1 point lost
For homework: 3 late days allowed, but 5 points lost for each late day
 
Academic Integrity
The SEAS academic integrity policy is found here.
The CS academic integrity policy is found here.
Syllabus
Note: Schedule and readings are subject to change. Readings labeled with * are optional.
 
 
Week 1 (1/21): Introduction to Speech Processing
 
Week 2 (1/28): From Sounds to Language
Readings: Jurafsky & Martin, Chapter 28 (sections 1-3)
 
Week 3 (2/4): Acoustics of Speech
Readings: Jurafsky & Martin, Chapter 28 (sections 4-6)
 
Week 4 (2/11): Tools for Speech Analysis
Readings:
*Praat Tutorial (just use for reference)
Watch all of the Praat video tutorials here (1-7)
*Also some video tutorials on acoustics of speech here
Download the latest version of Praat
Record your own voice saying these sentences
Bring your laptop and headphones to class
Assignments: HW1: Praat Recording and Analysis (assigned)
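Beyond the Praat GUI, the same kind of pitch analysis can be scripted. A minimal sketch, assuming the third-party parselmouth library (a Python interface to Praat; not part of the assigned course materials) and a placeholder filename for your own recording:

import parselmouth  # assumed installed via: pip install praat-parselmouth

snd = parselmouth.Sound("my_recording.wav")   # placeholder path to your own recording
pitch = snd.to_pitch()                        # Praat's default pitch analysis
f0 = pitch.selected_array['frequency']        # F0 per frame, in Hz; 0 means unvoiced
voiced = f0[f0 > 0]                           # keep only voiced frames

print(f"Duration: {snd.duration:.2f} s")
print(f"Mean F0 over voiced frames: {voiced.mean():.1f} Hz")
print(f"F0 range: {voiced.min():.1f} to {voiced.max():.1f} Hz")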
 
 
Week 5 (2/18): Analyzing Speech Prosody
Readings:
ToBI Conventions
AuToBI
Prosody and Meaning
*Guidelines for ToBI Labeling
 
Week 6 (2/25): Text-to-Speech Synthesis (Andrew Rosenberg)
Readings:
Jurafsky & Martin, Chapter 16 (Introduction, sections 6, 8)
*Prosody Prediction from Syntactic, Lexical, and Word Embedding Features
*Comparing acoustic and textual representations of previous linguistic context for improving Text-to-Speech
*Where do the improvements come from in sequence-to-sequence neural TTS?
Assignments: HW1 due
 
Week 7 (3/4): Speech Recognition (Bhuvana Ramabhadran, Google)
Readings:
Jurafsky & Martin, Chapter 16 (Introduction, sections 1-5, 7-8)
Twenty-Five Years of Evolution in Speech and Language Processing
*Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages
*Robust Speech Recognition via Large-Scale Weak Supervision
*SLM: Bridge the Thin Gap Between Speech and Text Foundation Models
 
Week 8 (3/11): Spoken Dialogue Systems (JH, Yu-Wen Chen, Siyan Li, Zack Rackauckas)
Readings:
Jurafsky & Martin, Chapters 14, 15, 27
RASwDA: Re-Aligned Switchboard Dialog Act Corpus for Dialog Act Prediction in Conversations
Nora the Empathetic Psychologist
*EDEN: Empathetic Dialogues for English Learning
Assignments: HW2 assigned
 
Week 9 (3/17-3/21): Spring Break, no classes
 
Week 10 (3/25): Speech Analysis: Emotion and Sentiment Detection (Zixiaofan Yang, Apple); EmoKnob (Tony Chen) (both remote via Zoom; the class recording will be saved to the Video Library afterward)
Readings:
Predicting Arousal and Valence from Waveforms and Spectrograms using Deep Neural Networks
Emotions and Types of Emotional Responses
EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control
 
Week 11 (4/1): Speech Analysis: Entrainment; Code-Switching (Debasmita Bhattacharya)
Readings:
Identifying Entrainment in Task-oriented Conversations
What Code-Switching Strategies are Effective in Dialog Systems?
Assignments: HW2 due
 
Week 12 (4/8): Speech Analysis: Personality (Michelle Levine) and Mental State
Readings:
Predicting the Big 5 personality traits from digital footprints on social media: A meta-analysis
Multimodal Deep Learning for Mental Disorders Prediction from Audio Speech Samples
Speech Processing Approach for Diagnosing Dementia in an Early Stage
 
Week 13 (4/15): Speech Analysis: WordsEye (Bob Coyne); Empathy (Run Chen)
Readings:
The background to the study of the language of space
Semantics and Pragmatics of Locative Expressions
Detecting Empathy in Speech
Assignments: HW3 assigned
 
Week 14 (4/22): Speech Analysis: Charisma; Humor; Sarcasm
Readings:
What Makes a Speaker Charismatic? Producing and Perceiving Charismatic Speech
*Extracting Social Meaning: Identifying Interactional Style in Spoken Conversation
Multimodal Indicators of Humor in Video
CHoRaL: Collecting Humor Reaction Labels from Millions of Social Media Users
"Laughing at you or with you": The Role of Sarcasm in Shaping the Disagreement Space
*"Sure, I did the right thing": A system for sarcasm detection in speech
 
Week 15 (4/29): Speech Analysis: Producing Trustworthy Voices; Intent Detection, Radicalization and De-Radicalization (Lin Ai)
Readings:
Acoustic-Prosodic and Lexical Cues to Deception and Trust: Deciphering How People Detect Lies
Multimodal Deception Detection using Automatically Extracted Acoustic, Visual and Lexical Features
The sound of trustworthiness: Acoustic-based modulation of perceived voice personality
Identifying the Popularity and Persuasiveness of Right- and Left-leaning Group Videos on Social Media
Unveiling the Influencers of Radical Content: A Multimodal Analysis of QAnon Videos
Assignments: HW3 due