COMS 6998: Advanced Topics in Spoken Language Processing

Instructor: Julia Hirschberg

Time:  Tu 4:10-6:00 (Spring 2025)

Location: Schapiro CEPSR 750

 

 

Prerequisites: COMS 4705 or another speech or NLP class, and experience in machine learning

Description:  This class introduces students to spoken language processing:  basic concepts, analysis approaches, and applications.  Applications include Text-to-Speech Synthesis, dialogue systems, and the analysis of entrainment, empathy, personality, emotion, humor and sarcasm, deception and trust, radicalization, and charisma, using text and speech information along with some visual features.

 

Required readings:

Jurafsky & Martin 2023 (3rd edition draft) chapters

These and other readings are linked from this syllabus for each class.

Suggested:

Keith Johnson. Acoustic & Auditory Phonetics (3rd edition). Wiley.  2011.

 

Resources:

A list of resources can be found here.

 

Office Hours

Julia Hirschberg: TBD

Yu-Wen Chen: F 2-4pm

Kimiya Shahamat: M 4-6pm

Riya Raj: W 4:30-6:30pm

Priyanka Varghese: TBD

 

Grade Breakdown

5% attendance

20% weekly posts

20% HW1

25% HW2

30% HW3

 

Also please note our late policies:

For weekly posts:  Monday deadline 11:59pm; 1 late day allowed but 1 point lost

For homework: 3 late days allowed but 5 points lost for each late day

 

Academic Integrity

The SEAS academic integrity policy is found here.

The CS academic integrity policy is found here.

Syllabus

Note: Schedule and readings are subject to change.  Readings labeled with * are optional.

 

Date

Topic

Readings

Assignments

Week 1: 1/21

Introduction to Speech Processing

 

Week 2: 1/28

From Sounds to Language

Jurafsky & Martin Chapter 28 (sections 1-3)

Week 3: 2/4

Acoustics of Speech

Jurafsky & Martin Chapter 28 (sections 4-6)

Week 4: 2/11

Tools for Speech Analysis

*Praat Tutorial (just use for reference)

Watch all these Praat video tutorials here (1-7)

*Also some video tutorials on acoustics of speech here

Download the latest version of Praat

Record your own voice saying these sentences

Bring your laptop and headphones to class

 

HW1: Praat Recording and Analysis (assigned)

Week 5: 2/18

Analyzing Speech Prosody

ToBI Conventions

AuToBI

Prosody and Meaning

*Guidelines for ToBI Labeling

Week 6: 2/25

Text-to-Speech Synthesis (Andrew Rosenberg)

Jurafsky & Martin Chapter 16 (Introduction, sections 6, 8)

*Prosody Prediction from Syntactic, Lexical, and Word Embedding Features

*Comparing acoustic and textual representations of previous linguistic context for improving Text-to-Speech

*Where do the improvements come from in sequence-to-sequence neural tts?

HW1 due

Week 7: 3/4

Speech Recognition (Bhuvana Ramabhadran, Google)

Jurafsky & Martin Chapter 16 (Introduction, sections 1-5, 7-8)

Twenty-Five Years of Evolution in Speech and Language Processing

*Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages

*Robust Speech Recognition via Large-Scale Weak Supervision

*SLM: Bridge the Thin Gap Between Speech and Text Foundation Models

 

Week 8: 3/11

Spoken Dialogue Systems (JH, Yu-Wen, Siyan Li, Zack Rackauckas)

Jurafsky & Martin Chapters 14, 15, 27

RASwDA: Re-Aligned Switchboard Dialog Act Corpus for Dialog Act Prediction in Conversations

Nora the Empathetic Psychologist

*EDEN: Empathetic Dialogues for English Learning

HW2 assigned

Week 9: 3/17-21

Spring Break: No classes

 

 

Week 10: 3/25

Speech Analysis: Emotion and Sentiment Detection (Zixiaofan Yang, Apple); EmoKnob (Tony Chen). Both talks are remote on Zoom; the class recording will be saved to the Video Library afterward.

Predicting Arousal and Valence from Waveforms and Spectrograms using Deep Neural Networks

Emotions and Types of Emotional Responses

EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control

 

Week 11: 4/1

Speech Analysis: Entrainment; Code-Switching (Debasmita Bhattacharya) 

Identifying Entrainment in Task-oriented Conversations

What Code-Switching Strategies are Effective in Dialog Systems?

HW2 due

Week 12: 4/8

Speech Analysis: Personality (Michelle Levine) and Mental State

Predicting the Big 5 personality traits from digital footprints on social media: A meta-analysis

Multimodal Deep Learning for Mental Disorders Prediction from Audio Speech Samples

Speech Processing Approach for Diagnosing Dementia in an Early Stage

Week 13: 4/15

Speech Analysis: WordsEye (Bob Coyne); Empathy (Run Chen)

The background to the study of the language of space

Semantics and Pragmatics of Locative Expressions

Detecting Empathy in Speech

HW3 assigned

Week 14: 4/22

Speech Analysis: Charisma; Humor; Sarcasm

What Makes a Speaker Charismatic?  Producing and Perceiving Charismatic Speech

*Extracting Social Meaning: Identifying Interactional Style in Spoken Conversation

Multimodal Indicators of Humor in Video

CHoRaL: Collecting Humor Reaction Labels from Millions of Social Media Users

“Laughing at you or with you”: The Role of Sarcasm in Shaping the Disagreement Space

*"Sure, I did the right thing": A system for sarcasm detection in speech

 

Week 15: 4/29

Speech Analysis: Producing Trustworthy Voices; Intent Detection, Radicalization and De-Radicalization (Lin Ai)

Acoustic-Prosodic and Lexical Cues to Deception and Trust:  Deciphering How People Detect Lies

Multimodal Deception Detection using Automatically Extracted Acoustic, Visual and Lexical Features

The sound of trustworthiness: Acoustic-based modulation of perceived voice personality

Identifying the Popularity and Persuasiveness of Right- and Left-leaning Group Videos on Social Media

Unveiling the Influencers of Radical Content: A Multimodal Analysis of QAnon Videos

HW3 due