Undergraduate and master's students interested in research projects, please check out our Ads for Student Research Projects in NLP.
Text-to-Speech Synthesis for Low-Resource Languages
The rapid improvement of speech technology over the past few years has resulted in its widespread adoption by consumers, especially in mobile spoken dialogue systems such as Apple Siri and Google Voice Search. This progress has led to very natural and intelligible text-to-speech (TTS) synthesis for a small number of languages, including English, French, and Mandarin. These high-resource languages (HRLs) have been studied extensively by speech researchers, who have built a wide range of language tools and collected and annotated massive amounts of speech data in these languages. However, of the roughly 6,500 languages in the world, many of which are spoken by millions of people, most have not received this attention from the speech and natural language processing community. Low-resource languages (LRLs), such as Telugu, Tok Pisin, and Vietnamese, do not enjoy rich computational resources or vast amounts of annotated data. Thus, speakers of these languages are deprived of the benefits of modern speech technology, which enables us to communicate across language barriers.
We are working towards developing methods of building intelligible, natural-sounding TTS voices out of limited data. While most commercial TTS voices are built from audio recorded by a professional speaker in a controlled acoustic environment, this data can be very time-consuming and expensive to collect. We are exploring the use of radio broadcast news, speech recorded with mobile phones, and other found data for building TTS voices, investigating data selection and model adaptation techniques for making the most out of noisy data.
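To illustrate the kind of data selection mentioned above, here is a minimal sketch (not our actual pipeline) of filtering found speech for TTS training by simple quality thresholds; the field names and threshold values are hypothetical.

```python
# Hypothetical found-data records: estimated signal-to-noise ratio and duration.
utterances = [
    {"id": "bn_001", "snr_db": 28.0, "dur_s": 4.2},
    {"id": "bn_002", "snr_db": 11.5, "dur_s": 3.0},   # too noisy
    {"id": "bn_003", "snr_db": 25.0, "dur_s": 0.4},   # too short
]

def select_for_tts(utts, min_snr=15.0, min_dur=1.0):
    """Keep only utterances that pass illustrative SNR and duration thresholds."""
    return [u["id"] for u in utts if u["snr_db"] >= min_snr and u["dur_s"] >= min_dur]

print(select_for_tts(utterances))  # → ['bn_001']
```

Real data selection for TTS would weigh many more factors (speaker consistency, transcription quality, phonetic coverage); this only shows the thresholding idea.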
Participants: Julia Hirschberg, Erica Cooper, Alison Chang, Yocheved Levitan (Brooklyn College)
Code Switching
Code switching (CS) is the practice of switching back and forth between the shared languages
of bilingual or multilingual speakers. CS is particularly prevalent in geographic regions with
linguistic boundaries or where there are large immigrant groups sharing a common first
language different from the mainstream language, as in the USA. Different levels of language
(phonological, morphological, syntactic, semantic and discourse-pragmatic) may be involved in
CS in different language pairs and/or genres. Computational tools trained for a single language
such as automatic speech recognition, information extraction or retrieval, or machine translation
systems quickly break down when the input includes CS. A major barrier to research on CS in
computational linguistics has been the lack of large, accurately annotated corpora of CS data.
We are part of a larger team which aims to collect a large repository of CS data, consistently
annotated across different language pairs at different levels of granularity, from phonology/
morphology to pragmatics and discourse, in Modern Standard Arabic with dialectal Arabic,
Arabic-English, Hindi-English, Spanish-English, and Mandarin-English. At Columbia we are
currently focusing on collecting Mandarin-English CS data in social media and in telephone conversations.
Participants: Julia Hirschberg, Alison Chang, Mona Diab, Thamar Solorio
DEFT: Anomaly Detection in Speech
This project investigates anomaly in speech by looking at behaviors that break the Gricean maxims of cooperative communication. Specifically, we are looking at hedging behaviors, wherein the speaker uses cue words (e.g., 'maybe', 'could', 'think') to signal reduced commitment to their utterance. Initial research included constructing an annotation manual to accurately identify and label such behavior in speech. Ongoing work addresses automatic labeling of hedges using lexical and acoustic features. The end goal is to use the presence of hedging and disfluencies as a metric for identifying anomalous regions in dialogue.
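As an illustrative sketch (not our annotation scheme or classifier), the lexical side of hedge detection can be pictured as matching a small hand-picked cue-word lexicon against a transcribed utterance; the lexicon here is invented for demonstration.

```python
# Toy hedge-cue lexicon; a real system would also use acoustic features
# and context to decide whether a cue word is actually a hedge.
HEDGE_CUES = {"maybe", "could", "think", "probably", "sort", "kind", "guess"}

def find_hedges(utterance):
    """Return (token index, token) pairs for candidate hedge cue words."""
    tokens = utterance.lower().strip(".?!").split()
    return [(i, t) for i, t in enumerate(tokens) if t.strip(",.") in HEDGE_CUES]

print(find_hedges("I think it could maybe work"))
# → [(1, 'think'), (3, 'could'), (4, 'maybe')]
```

Note that pure lexical matching over-generates ("could" in a plain request is not a hedge), which is exactly why the project combines lexical cues with acoustic features.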
Participants: Anna Prokofieva, Julia Hirschberg, Owen Rambow, Vinod Prabhakaran, Smaranda Muresan, Apoorv Agarwal, Anup Kotalwar, Kathy McKeown, Sara Rosenthal, Weiwei Guo
BOLT: Interactive Error Handling for Speech Translation
BOLT investigates interactive error handling for speech translation systems. BOLT is a DARPA-funded joint project with SRI International, the University of Marseille, and the University of Washington. In this project, we introduce an error-recovery dialogue manager component into a spoken translation system, which allows speakers of two different languages to communicate verbally through a translation application. The error-recovery dialogue manager detects errors in the recognition of utterances and asks the speaker a clarification question before translating a potentially erroneous utterance. Most modern dialogue systems employ generic clarification strategies for recovering from recognition errors: asking the user to repeat or rephrase the previous utterance, or asking a yes/no confirmation question. Such generic requests are unnatural and tend to frustrate the user. In BOLT, we evaluate the feasibility of using targeted clarification questions that focus specifically on the part of an utterance that contains a predicted recognition error. For example, if a speaker says "Pass me some XXX", where XXX is a misunderstood concept, the system may ask the targeted clarification question "What shall I pass?" instead of a generic request for a repetition. Our approach is based on human strategies for such clarifications: we have collected and analyzed a corpus of human responses to misunderstandings in dialogue (Stoyanchev et al., Interdisciplinary Workshop on Feedback Behaviors in Dialog 2012). To create targeted clarifications, it is important to detect the error location in the utterance; we used a combination of ASR confidence, lexical, and prosodic features to identify which words in a spoken sentence are misrecognized (Stoyanchev et al., SLT 2012).
Although BOLT evaluates a targeted clarification approach with a speech-to-speech translation application, this approach will also benefit spoken dialogue systems, especially AI systems that accept spoken input with a wide range of concepts and topics.
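The per-word error detection described above can be pictured with a toy sketch (not our published model): each recognized word carries ASR confidence plus simple lexical and prosodic evidence, combined by invented weights into a suspicion score.

```python
from dataclasses import dataclass

@dataclass
class WordHyp:
    word: str
    asr_confidence: float   # 0..1, from the recognizer
    duration_z: float       # word duration, z-scored against its expected value
    in_vocabulary: bool     # whether the token looks like a known word

def flag_errors(hyps, threshold=0.5):
    """Flag words whose combined evidence suggests misrecognition.
    Weights and threshold are illustrative, not learned."""
    flagged = []
    for h in hyps:
        score = 1.0 - h.asr_confidence            # low confidence -> suspect
        score += 0.2 * abs(h.duration_z)          # unusual duration -> suspect
        score += 0.3 * (not h.in_vocabulary)      # OOV-like token -> suspect
        if score > threshold:
            flagged.append(h.word)
    return flagged

hyps = [WordHyp("pass", 0.95, 0.1, True),
        WordHyp("me",   0.90, 0.0, True),
        WordHyp("some", 0.88, 0.2, True),
        WordHyp("xxx",  0.35, 1.8, False)]
print(flag_errors(hyps))  # → ['xxx']
```

In the actual work these features feed a trained classifier rather than a hand-weighted sum; the sketch only shows how multiple evidence sources localize an error so a targeted question like "What shall I pass?" can be generated.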
Participants: Svetlana Stoyanchev, Rose Sloan (Yale University), Mei-Vern Then, Alex Liu, Sunil Khanal, Eli Pincus, Ananta Padney (Barnard College), Jingbo Yaung, Philipp Salletmayer (Graz University)
Text-to-Scene for Field Linguistics
This research aims at creating a novel tool for fieldwork, which we call the WordsEye Linguistics Tool, or WELT. WELT is based on WordsEye, an existing text-to-scene tool developed in our lab that allows for the automatic generation of 3D scenes from written input. The WELT tool will have two modes of operation. In the first mode, English input will automatically generate a picture which can be used to elicit a targeted description in the language being studied. In the second mode, linguists will use an intuitive interface to develop a formal grammar of spatial expressions for the language they are researching. The tool will automatically incorporate this grammar into the existing WordsEye infrastructure to create a text-to-scene system for the new language. Linguists can use this system to verify their grammar with native speakers, easily making changes to it in real time.
While we intend that the tool will be generally useful, we are initially developing WELT based on scenarios involving Arrernte, an Australian aboriginal language.
Participants: Morgan Ulinski, Bob Coyne, Julia Hirschberg, Owen Rambow, Alexandra Orth, Inna Fetissova (Northeastern University), Myfany Turpin (University of Queensland), Daniel Kaufman (Endangered Language Alliance), Mark Dras (Macquarie University)
Identifying Deceptive Speech Across Cultures
The aim of this research is to increase our scientific understanding of deceptive behavior as it is practiced and perceived within and across cultures. A secondary goal is to develop state-of-the-art techniques to detect deceptive behaviors in spoken language.
We are building a new corpus of deceptive and non-deceptive speech, with adult native speakers of American English, Mandarin, and Arabic as subjects. We will then examine cues to deception, including acoustic, prosodic, and lexical features, subject-dependent features, and entrainment. We also plan to investigate personality influences on deceptive behavior.
Participants: Julia Hirschberg, Michelle Levine (Barnard), Andrew Rosenberg (CUNY Queens), Sarah Ita Levitan, Laura Willson, Nishmar Cestero, Elizabeth Petitti, Molly Scott
BABEL: Spoken Keyword Search for Low-Resource Languages
The BABEL program aims to develop spoken keyword search systems for diverse low-resource languages. Our group focuses on the use of prosodic features for improving recognition accuracy and keyword search performance, as well as experiments in cross-lingual adaptation of models for identifying prosodic events.
Participants: Victor Soto, Erica Cooper, Andrew Rosenberg, Julia Hirschberg
Extracting Paraphrase Rules from FrameNet and WordNet
FrameNet organizes lexical units into semantic frames with associated
frame elements which represent the core roles of that frame. Each
frame also contains annotated sentences mapping grammatical function
to frame element role for the sample sentences. In our research we've
extracted patterns from these annotated sentences to form paraphrase
rules that cover conversives (e.g., "buy" <-> "sell") as well as other
meaning-preserving verb transformations and alternations, such as
"The rats swarmed around the room" <-> "The room was teeming with rats."
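A toy illustration (not the extracted rule format) of applying a conversive paraphrase rule of the kind described, mapping a simple "A buys B from C" pattern to "C sells B to A"; the regex pattern is invented for demonstration.

```python
import re

def conversive_buy_sell(sentence):
    """Apply the buy<->sell conversive rule to a subject-verb-object-oblique
    sentence, or return None if the pattern does not match."""
    m = re.match(r"(\w+) buys (.+) from (\w+)\.?$", sentence)
    if m:
        buyer, goods, seller = m.groups()
        return f"{seller} sells {goods} to {buyer}."
    return None

print(conversive_buy_sell("Mary buys a book from John."))
# → John sells a book to Mary.
```

The actual rules are derived from FrameNet's frame-element annotations rather than surface regexes, which lets them generalize across grammatical realizations of the same frame.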
Participants: Bob Coyne, Owen Rambow
WordsEye: Automatic Text-to-Scene Conversion
We live in a vast sea of ever-changing text with few tools available
to help us visualize its meaning. The goal of this research is to
bridge the gap between graphics and language by developing new
theoretical models and supporting technology to create a system that
automatically converts descriptive text into rendered 3D scenes
representing the meaning of that text. This builds upon previous work
done with Richard Sproat in the WordsEye text-to-scene system
(available online at www.wordseye.com). New research directions
include the lexical semantics and knowledge acquisition needed to
semi-automatically construct a new scenario-based lexical
resource. This resource will be used in decoding and making explicit
the oblique contextual elements common in descriptive language for the
purposes of graphical depiction.
Participants: Bob Coyne, Owen Rambow, Julia Hirschberg, Gino Micelli, Cecilia Schudel, Daniel Bauer, Morgan Ulinski, Richard Sproat (OHSU), Masoud Rouhizadeh (OHSU), Yilei Yang, Sam Wiseman, Jack Crawford, Kenny Harvey, Mi Zhou, Yen-Han Lin, Margit Bowler (Reed College), Victor Soto.
Speaker Entrainment in Dialogue Systems
In conversation, people entrain to their partner by adopting that partner's
word choice, or by adapting aspects of their speaking style, such as speaking rate or pitch
range or intensity. Such synchronization is critical to the success of human-human conversation.
While lexical entrainment has been investigated experimentally in a number of studies, other
types of entrainment have received less attention. In this project, we are investigating
entrainment along dimensions such as intonational contour, pitch accent, phrasing rate, pitch
range, intensity, laughter, turn-taking and backchanneling behaviors.
An investigation of these behaviors will support the design of better Spoken Dialogue Systems.
While entrainment has been proposed as an important method for inducing users to adopt the
system's lexical items, to improve recognition accuracy, few studies have examined the
importance of systems entraining to their users, to promote more successful and human-like interactions.
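One common way acoustic entrainment is quantified is as session-level "proximity": partners count as more entrained on a feature when the gap between their mean values is small. The sketch below illustrates this on invented per-turn pitch values; it is not our measurement code.

```python
from statistics import mean

def proximity(speaker_a_vals, speaker_b_vals):
    """Smaller absolute difference in means = closer (more entrained)."""
    return abs(mean(speaker_a_vals) - mean(speaker_b_vals))

a_pitch = [190.0, 205.0, 198.0]   # mean f0 (Hz) per turn, speaker A (illustrative)
b_pitch = [185.0, 200.0, 192.0]   # speaker B

print(round(proximity(a_pitch, b_pitch), 1))  # → 5.3
```

The same scheme extends to intensity, speaking rate, or any other dimension listed above, and can be computed globally per session or locally at turn exchanges.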
Participants: Julia Hirschberg; Ani Nenkova (University of Pennsylvania); Agustín Gravano (University of Buenos Aires), Enrique Henestroza, Rivka Levitan, Adele Chase, Laura Willson, Stefan Benus (Constantine the Philosopher University), Jens Edlund (KTH), Mattias Heldner (KTH)
AuToBI: Automatic Prosodic Annotation
AuToBI is a tool for the automatic analysis of Standard American English
prosody. Open source and written in Java, AuToBI hypothesizes pitch accents
and phrase boundaries consistent with the ToBI prosodic annotation standard.
The toolkit includes an acoustic feature extraction frontend and a
classification backend supported by the Weka machine learning toolkit.
Participants: Julia Hirschberg, Andrew Rosenberg
Deception in Speech
This project examined the feasibility of automatic
detection of deception in speech, using linguistic, prosodic, and
other acoustic cues. We were particularly interested in how
individual differences affect the behavior of deceivers, and how
such differences affect the ability of individuals to detect deception.
Our study produced the first cleanly recorded, labeled corpus of
deceptive speech, the Columbia-SRI-Colorado (CSC) Corpus. Our elicitation
paradigm created a context in which the subject was positively
motivated to deceive an interviewer (in contrast to studies in which
subjects are placed in situations where they are led to lie about
potentially guilt inducing behavior). We investigated deception on two
levels: we considered the speaker's overall intention to deceive (or
not) with respect to particular topics, and we examined individual
utterances in terms of their factual content.
Our published work produced a classification system that performs
substantially better than human judges at classifying deceptive and
non-deceptive utterances; a study of the use of filled pauses in
deceptive speech; a method of combining classifiers using different
feature sets; and a perception study showing that the personality
of a listener affects his or her ability to distinguish deceptive
from non-deceptive speech.
Participants: Julia Hirschberg, Frank Enos, Stefan Benus, Jennifer Venditti-Ramprashad, Sarah Friedman, Sarah Gilman, Jared Kennedy, Max Shevyakov, Wayne Thorsen, Alan Yeung, and collaborators from SRI/ICSI and from the University of Colorado at Boulder.
Emotion in Speech
The crux of this research involved characterizing
acoustic and prosodic cues to human emotion, evaluating subjective
judgments of human emotion, as well as exploring when and why certain
emotions become confusable. We conducted on-line surveys designed to
collect subjective judgments of both emotional speech as well as emotional
faces. We observed that machine learning techniques applied to the
prediction of human emotion from acoustic and prosodic information in the
sound tokens yield a prediction rate of 75%-80%.
We also found that our subjects systematically differed in how
they perceived emotion in terms of valence (positive or negative
affect). Furthermore, automatic emotion classification improves if
we model these two groups independently of one another.
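The idea of modeling perceiver groups separately can be pictured with a toy nearest-centroid classifier trained on one group's judgments; the features and labels below are invented, and our actual experiments used different learners.

```python
from statistics import mean

def centroids(data):
    """Compute a mean feature vector per emotion label."""
    by_label = {}
    for features, label in data:
        by_label.setdefault(label, []).append(features)
    return {lab: [mean(col) for col in zip(*rows)] for lab, rows in by_label.items()}

def classify(features, cents):
    """Assign the label whose centroid is nearest in squared distance."""
    return min(cents, key=lambda lab: sum((f - c) ** 2 for f, c in zip(features, cents[lab])))

# (pitch_mean_z, intensity_z) pairs labeled by one perceiver group only;
# the other group would get its own training set and its own centroids.
group_a = [((1.2, 0.8), "angry"), ((1.0, 0.9), "angry"),
           ((-0.8, -0.5), "sad"), ((-1.0, -0.7), "sad")]
cents_a = centroids(group_a)
print(classify((1.1, 0.7), cents_a))  # → angry
```

Training one model per group means each model fits a more internally consistent set of judgments, which is the intuition behind the accuracy gain reported above.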
Participants: Julia Hirschberg, Jennifer Venditti-Ramprashad, Jackson Liscombe, Sarah Gilman, Daniel Vassilev, Agustín Gravano.
Detecting and Responding to Emotion in Intelligent Tutorial Systems
A tutor uses cues from the student to determine whether information
has been successfully learned or not. These cues may be explicit or
implicit. The first goal of this study is to examine cues to student
emotions − such as frustration and uncertainty − in the context of
speech-enabled intelligent tutorial systems. Such cues include
lexical, prosodic, quality of voice, and contextual information. The
second goal of this study is to evaluate the most appropriate
strategies for responding to (negative) emotional states once they are
detected. The ultimate goal is to increase the enjoyment and learning
of Intelligent Tutoring Systems users.
Participants: Julia Hirschberg, Jennifer Venditti-Ramprashad, Jackson Liscombe, Jeansun Lee (Columbia University); Diane Litman, Katherine Forbes, Scott Silliman (University of Pittsburgh).
Identifying Acoustic, Prosodic, and Phonetic Cues to Individual Variation in Spoken Language
A fundamental challenge for current research on speech science and technology is
understanding individual variation in spoken language. Individuals have their own
speaking styles, depending on many factors, including the dialect and socioeconomic
background of the speaker, as well as contextual variables such as the degree of
familiarity between the speaker and hearer and the register of the speaking
situation, from very casual to very formal (Eskenazi 1992). Even within the same
dialect or register individual variation may occur; for example, in spontaneous
speech, some speakers tend to exhibit more articulation reduction (e.g., reducing or
deletion of function words) than others. In this project, we are working on identifying the acoustic-prosodic and phonetic cues that might contribute to clustering speakers based on their speaking style.
Participants: Fadi Biadsy, Julia Hirschberg, William (Yang) Wang
Charismatic Speech
People are instinctively drawn to certain public speakers. What makes
their speech charismatic? Our research examines acoustic and lexical
features of public addresses to locate the sources of charisma. Though the
work so far has been in American English, parallel work in Arabic may shed
light on potential cultural biases in the perception of charisma.
Participants: Julia Hirschberg, Wisam Dakka, Andrew Rosenberg, Fadi Biadsy, Aron Wahl, Judd Sheinholtz, Svetlana Stenchikova.
Speech Summarization
Speech summarization consists of summarizing spoken data: broadcast news,
telephone conversations, meetings, and lectures. We are mainly focusing on
summarization of broadcast news. Our speech summarization research
comprises three aspects: (i) summarization, (ii) information extraction,
and (iii) the user interface. The summarization aspect consists of
extracting significant segments of speech and concatenating them to
provide a coherent summary of a given story in broadcast news. The
information extraction aspect consists of extracting named entities,
headlines, interviews, and different types of speakers. The last part
consists of developing a user interface that lets us combine the summary
of broadcast news and other extracted information in a coherent and
user-friendly speech browser.
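The extractive step described above can be sketched minimally: pick the highest-scoring speech segments and concatenate them in temporal order. The segments and scores here are invented; in practice the significance scores come from an upstream model over lexical and acoustic features.

```python
segments = [
    {"start": 0.0,  "text": "Anchor intro.",       "score": 0.2},
    {"start": 5.1,  "text": "Main story content.", "score": 0.9},
    {"start": 12.3, "text": "Key quote.",          "score": 0.8},
    {"start": 20.0, "text": "Sign-off.",           "score": 0.1},
]

def extractive_summary(segs, k=2):
    """Select the k top-scoring segments, then restore temporal order."""
    top = sorted(segs, key=lambda s: s["score"], reverse=True)[:k]
    return " ".join(s["text"] for s in sorted(top, key=lambda s: s["start"]))

print(extractive_summary(segments))  # → Main story content. Key quote.
```

Restoring temporal order after selection is what keeps the concatenated audio coherent to a listener.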
Participants: Julia Hirschberg, Sameer Maskey, Michel Galley, Martin Jansche, Jeansun Lee, Irina Likhtina, Aaron Roth, Lauren Wilcox.
Prosody of Turn-Taking in Dialogue
In conversation there are implicit
rules specifying whose turn it is to talk, and
conventions for switching the turn from one speaker to the other.
For example, interrupting the interlocutor is a (not necessarily
rude) way of grabbing the turn, while formulating a question is
a way of yielding it.
These rules allow dialogues to develop in a coordinated manner.
The goal of this project is to study and characterize those rules and
conventions, in the Columbia Games Corpus
and other corpora.
Participants: Julia Hirschberg; Stefan Benus (Constantine The Philosopher University); Agustín Gravano (University of Buenos Aires), Héctor Chávez, Michael Mulley, Enrique Henestroza, Lauren Wilcox.
Affirmative Cue Words in Dialogue
In speech, single affirmative cue words such as okay, right and yes
are often used with different functions, including acknowledgment
(meaning "I believe/agree with what you said"), backchannel (indicating
"I'm still here" or "I hear you and please continue"), and beginning of a new
discourse segment (as in "okay, now I will talk about...").
In this project, we analyze how such functions are conveyed
and perceived, and explore how they can be automatically predicted
with Machine Learning algorithms.
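As a toy illustration (much simpler than the machine-learning models we actually use), the functions above can be guessed from coarse positional cues alone: an affirmative cue word standing as a whole turn behaves differently from one that opens a longer turn.

```python
def guess_function(turn_tokens, is_whole_turn):
    """Toy positional heuristic for the function of an affirmative cue word."""
    if is_whole_turn:
        return "backchannel"          # "okay." alone -> likely a backchannel
    if turn_tokens and turn_tokens[0].lower().strip(",") == "okay":
        return "segment-beginning"    # turn-initial "okay, now..." -> new segment
    return "acknowledgment"

print(guess_function(["okay"], True))                          # → backchannel
print(guess_function(["okay,", "now", "moving", "on"], False)) # → segment-beginning
```

Position alone is of course insufficient; the project's findings show that prosodic realization (pitch contour, duration, intensity) is central to how these functions are conveyed and perceived.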
Participants: Julia Hirschberg (Columbia University); Stefan Benus (Constantine The Philosopher University); Agustín Gravano (University of Buenos Aires), Lauren Wilcox, Héctor Chávez, Ilia Vovsha (Columbia University); Shira Mitchell (Harvard University).
Intonational Overload: Uses of the Downstepped Contours
Intonational contours are overloaded, conveying different meanings in different
contexts. We are studying potential uses of the downstepped contours (especially H* !H* L- L%)
in Standard American English, both in read and spontaneous speech. We are investigating
speakers' use of these contours in conveying discourse topic structure and in
signaling given vs. new information, and the relationship between these two
functions. We designed and collected the
Columbia Games Corpus specifically for this project.
Participants: Julia Hirschberg (Columbia University); Agustín Gravano (University of Buenos Aires); Gregory Ward, Elisa Sneed (Northwestern University); Stefan Benus (Constantine The Philosopher University), Ani Nenkova, Michael Mulley.
Characterizing Laughter in Dialogue
Laughter can serve many different purposes in human communication and
occurs in many different forms. This project involved studying the acoustic
characteristics of laughter and the functions different types of
laughter may serve in human dialogue and in spoken dialogue systems.
Participants: Brianne Calandra, Rolf Carlson (KTH, Sweden).