Spoken Language Processing Group
Department of Computer Science - Columbia University

Undergraduate and master's students interested in research projects, please come to the CS Research Fair the first week of the semester and bring your C.V.

Current Projects


Text-to-Speech Synthesis for Low-Resource Languages

The rapid improvement of speech technology over the past few years has resulted in its widespread adoption by consumers, especially in mobile spoken dialogue systems such as Apple Siri and Google Voice Search. This progress has led to very natural and intelligible text-to-speech (TTS) synthesis for a small number of languages, including English, French, and Mandarin. These high-resource languages (HRLs) have been studied extensively by speech researchers, who have built various language tools and collected and annotated massive amounts of speech data in these languages. However, there are thousands of languages in the world (~6,500), many of them spoken by millions of people, that have not been fortunate enough to receive this attention from the speech and natural language processing community. Low-resource languages (LRLs), such as Telugu, Tok Pisin, and Vietnamese, do not enjoy rich computational resources and vast amounts of annotated data. Thus, speakers of these languages are deprived of the benefits of modern speech technology, which enables us to communicate across language barriers.

We are working towards developing methods of building intelligible, natural-sounding TTS voices out of limited data. While most commercial TTS voices are built from audio recorded by a professional speaker in a controlled acoustic environment, this data can be very time-consuming and expensive to collect. We are exploring the use of radio broadcast news, speech recorded with mobile phones, and other found data for building TTS voices, investigating data selection and model adaptation techniques for making the most out of noisy data.

Participants: Julia Hirschberg, Erica Cooper, Alison Chang, Yocheved Levitan (Brooklyn College)


Code Switching

Code switching (CS) is the practice of switching back and forth between the shared languages of bilingual or multilingual speakers. CS is particularly prevalent in geographic regions with linguistic boundaries or where there are large immigrant groups sharing a common first language different from the mainstream language, as in the USA. Different levels of language (phonological, morphological, syntactic, semantic and discourse-pragmatic) may be involved in CS in different language pairs and/or genres. Computational tools trained for a single language, such as automatic speech recognition, information extraction or retrieval, or machine translation systems, quickly break down when the input includes CS. A major barrier to research on CS in computational linguistics has been the lack of large, accurately annotated corpora of CS data. We are part of a larger team which aims to collect a large repository of CS data, consistently annotated across different language pairs at different levels of granularity, from phonology/morphology to pragmatics and discourse, in Modern Standard Arabic with dialectal Arabic, Arabic-English, Hindi-English, Spanish-English, and Mandarin-English. At Columbia we are currently focusing on collecting Mandarin-English CS data in social media and in telephone conversations.

Participants: Julia Hirschberg, Victor Soto, Alison Chang, Mona Diab, Thamar Solorio


DEFT: Anomaly Detection in Speech

This project investigates anomalies in speech by looking at behaviors that break the Gricean maxims of cooperative communication. Specifically, we are looking at hedging behaviors, in which the speaker uses cue words (e.g., 'maybe', 'could', 'think') to show a reduced commitment to their utterance. Initial research included constructing an annotation manual to accurately identify and label such behavior in speech. Ongoing work is looking at automatic labeling of hedges with the help of lexical and acoustic features. The end goal is to use the presence of hedging and disfluencies as a metric through which we can identify anomalous regions in dialogue.
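
As a rough illustration of the lexical side of hedge labeling, the sketch below flags tokens that match a small cue-word list. It is not the project's labeling system: the function names are hypothetical, only the first three cue words come from the description above, and the remaining cue words are assumptions added for the example.

```python
import re

# Hypothetical cue-word list for illustration. "maybe", "could", and "think"
# come from the project description; the other entries are assumptions.
HEDGE_CUES = {"maybe", "could", "think", "perhaps", "possibly", "might"}


def tokenize(utterance):
    """Lowercase the utterance and split it into simple word tokens."""
    return re.findall(r"[a-z']+", utterance.lower())


def find_hedge_candidates(utterance):
    """Return (position, token) pairs for tokens that match a cue word."""
    return [(i, tok) for i, tok in enumerate(tokenize(utterance))
            if tok in HEDGE_CUES]


def hedge_rate(utterance):
    """Fraction of tokens that are hedge candidates (0.0 for empty input)."""
    tokens = tokenize(utterance)
    if not tokens:
        return 0.0
    return len(find_hedge_candidates(utterance)) / len(tokens)


if __name__ == "__main__":
    print(find_hedge_candidates("I think we could maybe leave now"))
    print(hedge_rate("I think we could maybe leave now"))
```

A lexical match only yields candidates: "think" in "I think about it every day" is not a hedge, which is one reason a cue-word list alone cannot decide the label and further evidence, such as the acoustic features mentioned above, is needed.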

Participants: Morgan Ulinski, Anna Prokofieva, Julia Hirschberg, Owen Rambow, Vinod Prabhakaran, Smaranda Muresan, Apoorv Agarwal, Anup Kotalwar, Kathy McKeown, Sara Rosenthal, Weiwei Guo


Text-to-Scene for Field Linguistics

This research aims at creating a novel tool for fieldwork, which we call the WordsEye Linguistics Tool, or WELT. WELT is based on WordsEye, an existing text-to-scene tool which has been developed in the lab. WordsEye allows for the automatic generation of 3D scenes from written input. WELT will have two modes of operation. In the first mode, English input will automatically generate a picture which can be used to elicit a targeted description in the language being studied. In the second mode, linguists will use an intuitive interface to develop a formal grammar of spatial expressions for the language they are researching. The tool will automatically incorporate this grammar into the existing WordsEye infrastructure to create a text-to-scene system for the new language. Linguists can use this system to verify their grammar with native speakers, easily making changes to it in real time.

While we intend the tool to be generally useful, we are initially developing WELT based on scenarios involving Arrernte, an Australian Aboriginal language.

Participants: Morgan Ulinski, Bob Coyne, Julia Hirschberg, Owen Rambow, Alexandra Orth, Inna Fetissova (Northeastern University), Myfany Turpin (University of Queensland), Daniel Kaufman (Endangered Language Alliance), Mark Dras (Macquarie University)


Identifying Deceptive Speech Across Cultures

Project Website

The aim of this research is to increase our scientific understanding of deceptive behavior as it is practiced and perceived within and across cultures. A secondary goal is to develop state-of-the-art techniques to detect deceptive behaviors in spoken language.

We are building a new corpus of deceptive and non-deceptive speech, using adult native speakers of American English, Mandarin, and Arabic as subjects. We will then examine cues to deception, including acoustic, prosodic and lexical features, subject-dependent features, and entrainment. We also plan to investigate personality influences on deceptive behavior.

Participants: Julia Hirschberg, Michelle Levine (Barnard), Andrew Rosenberg (CUNY Queens), Sarah Ita Levitan, Laura Willson, Nishmar Cestero, Elizabeth Petitti, Molly Scott


Speaker Entrainment in Dialogue Systems

In conversation, people entrain to their partner by adopting that partner's word choice, or by adapting aspects of their speaking style, such as speaking rate or pitch range or intensity. Such synchronization is critical to the success of human-human interactions.

While lexical entrainment has been investigated experimentally in a number of studies, other types of entrainment have received less attention. In this project, we are investigating entrainment along dimensions such as intonational contour, pitch accent, phrasing rate, pitch range, intensity, laughter, turn-taking and backchanneling behaviors.

An investigation of these behaviors will support the design of better Spoken Dialogue Systems. While entrainment has been proposed as an important method for inducing users to adopt the system's lexical items, to improve recognition accuracy, few studies have examined the importance of systems entraining to their users, to promote more successful and human-like exchanges.

Participants: Julia Hirschberg; Ani Nenkova (University of Pennsylvania); Agustín Gravano (University of Buenos Aires), Enrique Henestroza, Rivka Levitan, Adele Chase, Laura Willson, Stefan Benus (Constantine the Philosopher University), Jens Edlund (KTH), Mattias Heldner (KTH)

Past Projects



BOLT investigates interactive error handling for speech translation systems. BOLT is a DARPA-funded joint project with SRI International, University of Marseille, and University of Washington. In this project, we introduce an error-recovery dialogue manager component into a spoken translation system. A spoken translation system allows speakers of two different languages to communicate verbally through a translation application. An error-recovery dialogue manager detects errors in the recognition of utterances and asks the speaker a clarification question before translating the potentially erroneous utterance. Most modern dialogue systems employ generic clarification strategies for recovering from recognition errors: asking a user to repeat or rephrase their previous utterance, or asking a yes/no confirmation question. Such generic requests are not natural and tend to frustrate the user. In BOLT, we evaluate the feasibility of using targeted clarification questions that focus specifically on the part of an utterance that contains a predicted recognition error. For example, if a speaker says "Pass me some XXX", where XXX is a misunderstood concept, a system may ask the targeted clarification question "What shall I pass?" instead of a generic request for a repetition. Our approach is based on human strategies for such clarifications. We have collected and analyzed a corpus of human responses to misunderstandings in dialogue (Stoyanchev et al., Interdisciplinary Workshop on Feedback Behaviors in Dialog 2012). In order to create targeted clarifications, it is important to detect the error location in the utterance. We used a combination of ASR confidence, lexical, and prosodic features to help identify which words in a spoken sentence are misrecognized (Stoyanchev et al., SLT 2012).
Although BOLT evaluates a targeted clarification approach with a speech-to-speech translation application, this approach will also benefit spoken dialogue systems, especially AI systems that accept spoken input with a wide range of concepts and topics.
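
The choice between a generic and a targeted clarification can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the BOLT dialogue manager: the function name, the per-word confidence scores, and the 0.5 threshold are hypothetical, a plain confidence threshold stands in for the combined ASR confidence, lexical, and prosodic error detector, and the targeted question is a simple reprise template ("Pass me some what?") rather than a generated question such as "What shall I pass?".

```python
GENERIC_REQUEST = "Could you please repeat that?"


def clarification_question(words, confidences, threshold=0.5):
    """Pick a clarification question from per-word confidence scores.

    words:       recognized words, in order
    confidences: parallel list of scores in [0, 1] (hypothetical inputs)
    threshold:   hypothetical cutoff below which a word counts as a
                 predicted recognition error
    Returns None when no word falls below the threshold.
    """
    if len(words) != len(confidences):
        raise ValueError("words and confidences must have the same length")

    low = [i for i, c in enumerate(confidences) if c < threshold]
    if not low:
        return None  # nothing to clarify

    first, last = low[0], low[-1]
    # A targeted question needs one contiguous error span plus some
    # correctly recognized context to repeat back to the speaker.
    contiguous = low == list(range(first, last + 1))
    if contiguous and len(low) < len(words):
        reprise = words[:first] + ["what"] + words[last + 1:]
        return " ".join(reprise) + "?"

    # Otherwise fall back to the generic strategy.
    return GENERIC_REQUEST


if __name__ == "__main__":
    print(clarification_question(["pass", "me", "some", "salt"],
                                 [0.9, 0.95, 0.9, 0.2]))
```

When every word is doubtful, or the doubtful words are scattered through the utterance, there is no reliable context to build on, so the sketch falls back to the generic request.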

Participants: Svetlana Stoyanchev, Rose Sloan (Yale University), Mei-Vern Then, Alex Liu, Sunil Khanal, Eli Pincus, Ananta Padney (Barnard College), Jingbo Yaung, Philipp Salletmayer (Graz University)



The BABEL program aims to develop spoken keyword search systems for diverse low-resource languages. Our group focuses on the use of prosodic features for improving recognition accuracy and keyword search performance, as well as experiments in cross-lingual adaptation of models for identifying prosodic events.

Participants: Victor Soto, Erica Cooper, Andrew Rosenberg, Julia Hirschberg



AuToBI is a tool for the automatic analysis of Standard American English prosody. Open source and written in Java, AuToBI hypothesizes pitch accents and phrase boundaries consistent with the ToBI prosodic annotation standard. The toolkit includes an acoustic feature extraction frontend, and a classification backend supported by the Weka machine learning toolkit.

Participants: Julia Hirschberg, Andrew Rosenberg


Deception in Speech

This project examined the feasibility of automatic detection of deception in speech, using linguistic, prosodic, and other acoustic cues. We were particularly interested in how individual differences affect the behavior of deceivers, and how such differences affect the ability of individuals to detect deception.

Our study produced the first cleanly recorded, labeled corpus of deceptive speech, the Columbia-SRI-Colorado (CSC) Corpus. Our elicitation paradigm created a context in which the subject was positively motivated to deceive an interviewer (in contrast to studies in which subjects are placed in situations where they are led to lie about potentially guilt-inducing behavior). We investigated deception on two levels: we considered the speaker's overall intention to deceive (or not) with respect to particular topics, and we examined individual utterances in terms of their factual content.

Our published work produced a classification system that performs substantially better than human judges at classifying deceptive and non-deceptive utterances; a study of the use of filled pauses in deceptive speech; a method of combining classifiers using different feature sets; and a perception study showing that the personality of a listener affects his or her ability to distinguish deceptive from non-deceptive speech.

Participants: Julia Hirschberg, Frank Enos, Stefan Benus, Jennifer Venditti-Ramprashad, Sarah Friedman, Sarah Gilman, Jared Kennedy, Max Shevyakov, Wayne Thorsen, Alan Yeung, and collaborators from SRI/ICSI and from the University of Colorado at Boulder.


Emotion in Speech

The crux of this research involved characterizing acoustic and prosodic cues to human emotion, evaluating subjective judgments of human emotion, and exploring when and why certain emotions become confusable. We conducted online surveys designed to collect subjective judgments of both emotional speech and emotional faces. We observed that machine learning techniques applied to the prediction of human emotion from acoustic and prosodic information of the sound tokens yield a prediction rate of 75%-80%. We also found that our subjects fell into two groups that systematically differed in how they perceived emotion in terms of valence (positive or negative affect). Furthermore, automatic emotion classification improves if we model these two groups independently of one another.

Participants: Julia Hirschberg, Jennifer Venditti-Ramprashad, Jackson Liscombe, Sarah Gilman, Daniel Vassilev, Agustín Gravano.


Detecting and Responding to Emotion in Intelligent Tutorial Systems

A tutor uses cues from the student to determine whether information has been successfully learned or not. These cues may be explicit or implicit. The first goal of this study is to examine cues to student emotions, such as frustration and uncertainty, in the context of speech-enabled intelligent tutorial systems. Such cues include lexical, prosodic, voice quality, and contextual information. The second goal of this study is to evaluate the most appropriate strategies for responding to (negative) emotional states once they are detected. The ultimate goal is to increase the enjoyment and learning of Intelligent Tutoring Systems users.

Participants: Julia Hirschberg, Jennifer Venditti-Ramprashad, Jackson Liscombe, Jeansun Lee (Columbia University); Diane Litman, Katherine Forbes, Scott Silliman (University of Pittsburgh).


Identifying Acoustic, Prosodic, and Phonetic Cues to Individual Variation in Spoken Language

A fundamental challenge for current research on speech science and technology is understanding individual variation in spoken language. Individuals have their own speaking styles, depending on many factors, including the dialect and socioeconomic background of the speaker, as well as contextual variables such as the degree of familiarity between the speaker and hearer and the register of the speaking situation, from very casual to very formal (Eskenazi 1992). Even within the same dialect or register, individual variation may occur; for example, in spontaneous speech, some speakers tend to exhibit more articulation reduction (e.g., reduction or deletion of function words) than others. In this project, we are working on identifying the acoustic-prosodic and phonetic cues that might contribute to clustering speakers based on their speaking style.

Participants: Fadi Biadsy, Julia Hirschberg, William (Yang) Wang


Extracting Paraphrase Rules from FrameNet and WordNet

FrameNet organizes lexical units into semantic frames with associated frame elements which represent the core roles of that frame. Each frame also contains annotated sentences mapping grammatical function to frame element role for the sample sentences. In our research we've extracted patterns from these annotated sentences to form paraphrase rules that cover conversives (e.g., "buy" <-> "sell") as well as other meaning-preserving verb transformations and alternations such as "The rats swarmed around the room" <-> "The room was teeming with rats."
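
To make the idea of a conversive rule concrete, the sketch below applies a hand-written "buy" <-> "sell" rule to a pre-parsed clause. It illustrates only rule application, not the extraction of rules from FrameNet annotations. The `Clause` type, the rule table, and the example sentence are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clause:
    """A pre-parsed clause: subject, verb, object, and one prepositional phrase."""
    subject: str
    verb: str
    obj: str
    prep: str
    oblique: str

    def text(self):
        return f"{self.subject} {self.verb} {self.obj} {self.prep} {self.oblique}"


# Hypothetical hand-written rule table:
# (verb, preposition) -> (converse verb, converse preposition).
CONVERSIVE_RULES = {
    ("bought", "from"): ("sold", "to"),
    ("sold", "to"): ("bought", "from"),
}


def apply_conversive(clause):
    """Return the conversive paraphrase, or None if no rule matches.

    The subject and the prepositional-phrase noun swap places, and the
    verb and preposition are replaced by their converse forms.
    """
    rule = CONVERSIVE_RULES.get((clause.verb, clause.prep))
    if rule is None:
        return None
    verb, prep = rule
    return Clause(subject=clause.oblique, verb=verb, obj=clause.obj,
                  prep=prep, oblique=clause.subject)


if __name__ == "__main__":
    original = Clause("Mary", "bought", "a car", "from", "John")
    print(apply_conversive(original).text())
```

Keying each rule on both the verb and its preposition keeps the sketch from rewriting clauses such as "Mary bought a car for John", where swapping the two participants would change the meaning.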

Participants: Bob Coyne, Owen Rambow


WordsEye: Automatic Text-to-Scene Conversion

We live in a vast sea of ever-changing text with few tools available to help us visualize its meaning. The goal of this research is to bridge the gap between graphics and language by developing new theoretical models and supporting technology to create a system that automatically converts descriptive text into rendered 3D scenes representing the meaning of that text. This builds upon previous work done with Richard Sproat on the WordsEye text-to-scene system, which is available online. New research directions include the lexical semantics and knowledge acquisition needed to semi-automatically construct a new scenario-based lexical resource. This resource will be used in decoding and making explicit the oblique contextual elements common in descriptive language for the purposes of graphical depiction.

Participants: Bob Coyne, Owen Rambow, Julia Hirschberg, Gino Micelli, Cecilia Schudel, Daniel Bauer, Morgan Ulinski, Richard Sproat (OHSU), Masoud Rouhizadeh (OHSU), Yilei Yang, Sam Wiseman, Jack Crawford, Kenny Harvey, Mi Zhou, Yen-Han Lin, Margit Bowler (Reed College), Victor Soto.


Charismatic Speech

At an instinctual level, people are drawn to certain public speakers. What is it about their speech that makes it charismatic? Our research is looking at acoustic and lexical features from public addresses to locate the source of the charisma. Though the work so far has been in American English, parallel work in Arabic may shed light on potential cultural biases in the perception of charisma.

Participants: Julia Hirschberg, Wisam Dakka, Andrew Rosenberg, Fadi Biadsy, Aron Wahl, Judd Sheinholtz, Svetlana Stenchikova.


Speech Summarization

Speech summarization consists of summarizing spoken data: broadcast news, telephone conversations, meetings, and lectures. We are mainly focusing on the summarization of broadcast news. Our speech summarization research has three aspects: i) summarization, ii) information extraction, and iii) user interface. The summarization aspect consists of extracting significant segments of speech and concatenating them to provide a coherent summary of a given broadcast news story. The information extraction aspect consists of extracting named entities, headlines, interviews, and different types of speakers. The user interface aspect consists of developing an interface that combines the summary of broadcast news and other extracted information in a coherent and user-friendly speech browser.

Participants: Julia Hirschberg, Sameer Maskey, Michel Galley, Martin Jansche, Jeansun Lee, Irina Likhtina, Aaron Roth, Lauren Wilcox.


Prosody of Turn-Taking in Dialogue

In conversation there are implicit rules specifying whose turn it is to talk, and conventions for switching the turn from one speaker to the other. For example, interrupting the interlocutor is a (not necessarily rude) way of grabbing the turn, while formulating a question is a way of yielding it. These rules allow dialogues to develop in a coordinated manner. The goal of this project is to study and characterize those rules and conventions, in the Columbia Games Corpus and other corpora.

Participants: Julia Hirschberg; Stefan Benus (Constantine The Philosopher University); Agustín Gravano (University of Buenos Aires), Héctor Chávez, Michael Mulley, Enrique Henestroza, Lauren Wilcox.


Affirmative Cue Words in Dialogue

In speech, single affirmative cue words such as okay, right and yes are often used with different functions, including acknowledgment (meaning "I believe/agree with what you said"), backchannel (indicating "I'm still here" or "I hear you and please continue"), and beginning of a new discourse segment (as in "okay, now I will talk about..."). In this project, we analyze how such functions are conveyed and perceived, and explore how they can be automatically predicted with Machine Learning algorithms.

Participants: Julia Hirschberg (Columbia University); Stefan Benus (Constantine The Philosopher University); Agustín Gravano (University of Buenos Aires), Lauren Wilcox, Héctor Chávez, Ilia Vovsha (Columbia University); Shira Mitchell (Harvard University).


Intonational Overload: Uses of the Downstepped Contours

Intonational contours are overloaded, conveying different meanings in different contexts. We are studying potential uses of the downstepped contours (especially H* !H* L- L%) in Standard American English, both in read and spontaneous speech. We are investigating speakers' use of these contours in conveying discourse topic structure and in signaling given vs. new information, and the relationship between these two functions. We designed and collected the Columbia Games Corpus specifically for this project.

Participants: Julia Hirschberg (Columbia University); Agustín Gravano (University of Buenos Aires); Gregory Ward, Elisa Sneed (Northwestern University); Stefan Benus (Constantine The Philosopher University), Ani Nenkova, Michael Mulley.


Characterizing Laughter in Dialogue

Laughter can serve many different purposes in human communication and occurs in many different forms. This project involved studying the acoustic characteristics of laughter and the functions different types of laughter may serve in human dialogue and in spoken dialogue systems.

Participants: Brianne Calandra, Rolf Carlson (KTH, Sweden).

webmaster - vsotox[at] last updated - 05.08.2019