Spoken Language Processing Group
Department of Computer Science - Columbia University


Undergraduate and master's students interested in research projects: please come to the CS Research Fair in the first week of the semester and bring your C.V.

Current Projects


Empathetic Chatbots for Students Learning English

Language learners often experience frustration because they are not immediately proficient in the new language. One approach to combating such frustration is to develop language learning technologies capable of responding empathetically to negative emotions. Prior work has also shown that when students perceive their teachers to be more emotionally supportive, they are more likely to be resilient learners. We are therefore building chatbots for English conversation practice that are empathetic and that cater to individual learners. Specifically, we utilize emotion-labeled data to develop mechanisms for detecting student frustration and other negative emotions, and we use large language models to synthesize empathetic feedback that is adaptive and personalizable. Beyond empathetic feedback, we are also interested in improving the overall quality of the chatbot.

Participants: Siyan Li, Teresa Shao, Brittney Lilly, Zhou Yu, Julia Hirschberg


The Paralinguistics of Code-switching in Speech: Identification and Generation

More than half of the world's population is estimated to speak more than one language. Among multilingual speakers, code-switching is a common linguistic phenomenon in which speakers alternate between languages or dialects within or between utterances. This project studies how and why people code-switch in speech by focusing on its paralinguistic aspects: we investigate how to identify and measure aspects of code-switched communication that cannot be inferred directly or uniquely from a speech signal, and how to apply that information toward generating more natural, human-like multilingual speech. Work in this project explores a number of directions, including the relationship between code-switching and entrainment, the expression of empathy, the influence of named entities, types of dialogue acts, and the production of intonational contours. The overarching goal of this project is for the consolidated results from each of these research directions to contribute to greater innovation in inclusive language technologies that account for and reflect the ways in which real people speak to one another.

Participants: Debasmita Bhattacharya, Siying Ding, Juan Esteban Junco, Anxin Yi, Divya Tadimeti, Aanya Tolat, Shreyas Chatterjee, Julia Hirschberg

Past participants: Alayna Nguyen, Eleanor Lin, Margot Story


Conveying Empathy in Multiple Modalities

Empathy is the ability to understand another's feelings as if we were having those feelings ourselves. Empathetic behavior can encourage users to like a speaker more, to believe the speaker is more intelligent, to actually take the speaker's advice, and to want to speak with the speaker longer and more often. Much research has been done over the past 15 years on creating empathetic responses in text, facial expression, and gesture in conversational systems. However, very little has been done to identify the speech features that create an empathetic-sounding voice. We have been collecting YouTube videos that convey empathy in English and in Mandarin, extracting empathetic and neutral segments from speakers in these videos, and using these segments to build machine learning models that distinguish empathetic from neutral speech using speech and text features.

Participants: Run Chen, Anushka Kulkarni, Tony Chen, Linda Pang, Jun Shin, Julia Hirschberg

Past participants: Andrea Lopez, Divya Tadimeti, Aruj Jain


Detecting Emotion Across Cultures

Cross-cultural studies of emotion by psychologists, ethnographers, and anthropologists have provided much evidence that emotions differ across cultures in multiple ways, including how they are defined and how (or whether) they are expressed. For this project, identifying emotion using multimodal information is essential, since little prior work has successfully combined text, speech, and visual features for emotion identification. It will also be novel and important to identify signals of communicative problems such as misunderstanding or conflict (negative valence, dominance, and degree of arousal), along with more specific emotional categories such as anger, contempt, confusion, disgust, surprise, or even sadness, while expressions of happiness or satisfaction should indicate that communication is proceeding well. We are developing multimodal, cross-lingual, and cross-cultural approaches to continuous emotion detection from acoustic-prosodic speech features, lexical information, facial expression, and body gesture from conversational participants.
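As a sketch of how multimodal predictions can be combined, a simple late-fusion scheme averages per-modality valence/arousal estimates with weights; all names, weights, and scores below are illustrative placeholders, not the project's actual method.

```python
# Hypothetical late-fusion sketch for continuous emotion prediction.
# Each modality (speech, text, face, gesture) produces its own
# (valence, arousal) estimate; the fused prediction is a weighted average.
# Modality names, weights, and scores are illustrative placeholders.

def fuse(predictions, weights):
    """predictions: {modality: (valence, arousal)}; weights: {modality: w}."""
    total = sum(weights[m] for m in predictions)
    v = sum(weights[m] * predictions[m][0] for m in predictions) / total
    a = sum(weights[m] * predictions[m][1] for m in predictions) / total
    return v, a

v, a = fuse({"speech": (0.2, 0.8), "text": (0.4, 0.4)},
            {"speech": 0.5, "text": 0.5})
```

Weighted late fusion is only one option; learned fusion models are common as well.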

Participants: Ziwei (Sara) Gong, Zehui Wu, Julia Hirschberg

Past participants: Jaywon Koo


Re-aligned Text and Speech in the Switchboard Dialogue Act Corpus

The goal of this project is to create the largest publicly available corpus of two-party conversations on general topics with correctly aligned transcripts and speech, by correcting the alignment of the 1155 conversations in the Switchboard Dialog Act (SwDA) corpus. Dialogue act prediction and production is of central importance today in research, government, and industry, as more and more dialogue systems are built to interact with people for training, education, reducing the human workload in call centers, and providing problem-solving advice. However, there are few large labeled corpora of general conversational speech available for model-building and analysis. The transcripts and speech of this corpus, created from the larger Switchboard Corpus in the late 1990s, were originally aligned with a GMM-HMM Switchboard recognizer, and the resulting alignments are very poor, making it extremely difficult to use both speech and text data to predict or generate dialogue acts correctly: most users have found that using the aligned audio information does not improve, and sometimes worsens, their dialogue act prediction or generation scores. We are therefore re-aligning each side of the SwDA transcripts with the speaker's audio, manually correcting the errors of the early automatic alignment, to make the corpus of much greater value to the dialogue research community.
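As an illustration of the kind of alignment errors involved, here is a minimal sketch (with invented function names and thresholds, not the project's tooling) that flags implausible word timings in a forced alignment:

```python
# Hypothetical sketch: flag implausible word alignments in a transcript.
# A forced alignment assigns each word a (start, end) time in seconds; bad
# alignments often show up as overlapping, zero-length, or extremely long
# word intervals. Threshold values here are illustrative.

def find_suspect_alignments(words, max_word_dur=2.0):
    """Return indices of words whose timing looks wrong.

    words: list of (token, start, end) tuples, in transcript order.
    """
    suspects = []
    prev_end = 0.0
    for i, (token, start, end) in enumerate(words):
        if end <= start:                   # zero or negative duration
            suspects.append(i)
        elif end - start > max_word_dur:   # implausibly long single word
            suspects.append(i)
        elif start < prev_end:             # overlaps the previous word
            suspects.append(i)
        prev_end = max(prev_end, end)
    return suspects

aligned = [("okay", 0.00, 0.30), ("so", 0.25, 0.45),
           ("i", 0.45, 0.50), ("think", 0.50, 4.00)]
suspects = find_suspect_alignments(aligned)
```

A check like this only surfaces candidates for manual correction; it cannot repair alignments on its own.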

Participants: Run Chen, Eleanor Lin, Julia Hirschberg

Past participants: Eric Chen, Ilan Falcon, Shivani Ghatge, Shayan Hooshmand, Sophia Horng, Patrick Jiao, Andrea Lopez, Catherine Lyu, Linda Pang, Rose Sloan, Isaac Suh, Alicia Yang


Semafor: Identifying false information in social media

Our goal in this project is to detect false information in social media and to identify the intent or purpose behind it. We are beginning by collecting Twitter data on COVID-19 and the 2020 U.S. elections and extracting information from the tweets about tweeters, their networks, and the information sources they point to. We are also beginning to study the different types of intent behind tweets we identify as false, or as pointing to false information, using emotion detection and sarcasm detection. We divide falsification intent into two major categories: malicious and non-malicious, where non-malicious content may be humorous or sarcastic, or simply misinformation on the part of the tweeter. Other reasons for spreading misinformation include provoking an emotional reaction toward an individual or group, persuading people to support an individual, group, or set of ideas, or “covering up” an embarrassing or criminal act.

Participants: Lin Ai, Zixiaofan (Brenda) Yang, Run Chen, Gitika Bose, Anika Kathuria, Rishabh Narang, Julia Hirschberg


Detecting Deception in Multiple Modalities Across Cultures

Project Website

The aim of this research is to increase our scientific understanding of deceptive behavior as it is practiced and perceived within and across cultures. A secondary goal is to develop state-of-the-art techniques to detect deceptive behavior in spoken language. We have built a new corpus of deceptive and non-deceptive speech from adult native speakers of American English, Mandarin, and Arabic. We have examined many possible cues to deception, including acoustic, prosodic, and lexical features; subject-dependent features; entrainment; and personality differences. We have also compared deception detection techniques across multiple corpora and multiple modalities, including facial features as well as text and speech. In addition, we have developed the LieCatcher game, which we are using to compare the performance of our classification models to human performance more broadly via crowd-sourcing (see Identifying Trusted and Mistrusted Speech below).

Participants: Sarah Ita Levitan, Laura Willson, Nishmar Cestero, Guozhen An (CUNY), Angel Maredia, Elizabeth Petitti, Molly Scott, Yogesh Singh, Jessica Xiang, Jixuan (Gilbert) Zhang, Rivka Levitan (CUNY), Michelle Levine (Barnard), Andrew Rosenberg (CUNY), Julia Hirschberg


Multimodal Research on Radicalization

In this project we are studying the influence of radicalizing content in social media in persuading viewers to adopt extremist right-wing or left-wing political beliefs. Previous research has developed many theories of how and why radicalization occurs, but less work has empirically tested these theories at scale or answered questions about which specific features of group methods are statistically correlated with success in attracting followers. Our goal is to identify, collect, and carefully analyze online radicalization videos, and to develop machine learning classifiers that identify such videos and thus help us assemble a very large corpus that others can also use to test radicalization theories. In addition to identifying aspects of videos which appear to lead to radicalization, we want to explore further questions, including: What are the characteristics of radical video material? Can we measure the effectiveness of different materials? What are the characteristics of individuals who engage with extremist content on YouTube? Do users tend to entrain to (unconsciously imitate) the inciters in the language they use (perhaps about certain groups)? Is this evident in the language they adopt? The symbols they use? Can we track changes over time in material that appears to lead to specific violent incidents? Can we use these findings to predict future incidents, as well as those unlikely to result in violence?

Participants: Lin Ai, Yogesh Singh, Sarah Ita Levitan, Julia Hirschberg


Identifying Trusted and Mistrusted Speech

Humans rarely perform better than chance at lie detection. To better understand human perception of deception, we created a game framework, LieCatcher, to collect ratings of perceived deception over a large corpus of deceptive and truthful interviews. We analyzed the acoustic-prosodic and linguistic characteristics of language trusted and mistrusted by raters and compared these to the characteristics of actual truthful and deceptive language, to understand how perception aligns with reality. With this data we built classifiers that automatically distinguish trusted from mistrusted speech and that perform significantly better than humans. We next evaluated whether the strategies raters said they used to discriminate between truthful and deceptive responses were in fact useful. Our results show that, while several prosodic and lexical features were consistently perceived as trustworthy, they were not reliable cues. Also, the strategies that judges reported using in deception detection were not helpful for the task. Currently we are experimenting with using LieCatcher to help train humans in lie detection, providing helpful cues when they miss a lie.

Participants: Sarah Ita Levitan, Xi (Leslie) Chen, Rebecca Calinsky, Marko Mandic, Xinyue Tan, Michelle Levine, Julia Hirschberg

Past Projects


Sarcasm Detection in Social Media

Sarcasm is a common non-malicious type of false information in social media content. While sarcasm detection has been widely studied, this project focuses on time-sensitive, domain-specific COVID-19-related tweets. We combine state-of-the-art BERT models with previously identified sarcasm markers to classify sarcastic tweets in our collected COVID-19 Twitter dataset. We aim to determine the features, both linguistic and nonlinguistic, that make these tweets sarcastic. We are also interested in extending this work to sarcastic speech detection, where little systematic work has been done. The sarcasm project is part of the Semafor project.

Participants: Run Chen, Ziwei (Sara) Gong, Julia Hirschberg, Tuhin Chakrabarty, Smaranda Muresan


Gender Differences in Debates

We are examining acoustic-prosodic as well as lexical cues in a large collection of intercollegiate debate tournament recordings from 2008-2018, with information on the debaters and their scores. Our initial goal is to identify gender-based differences in debate behavior and success, using transcripts and acoustic-prosodic information. The corpus was collected by our collaborator, Huyen Nguyen.

Participants: Sarah Ita Levitan, Huyen Nguyen, David Lupea, Julia Hirschberg


Prosodic Assignment for TTS

Accurate prosody prediction from text leads to more natural-sounding TTS. In this work, we employ a new set of features to predict ToBI pitch accent and phrase boundaries from text. We investigate a wide variety of text-based features, including many new syntactic features, several types of word embeddings, co-reference features, LIWC features, and specificity information. We focus our work on the Boston Radio News Corpus, a ToBI-labeled corpus of relatively clean news broadcasts, but also test our classifiers on Audix, a smaller corpus of read news, and on the Columbia Games Corpus, a corpus of conversational speech, in order to test the applicability of our model in cross-corpus settings. Our results show strong performance on both tasks, as well as some promising results for cross-corpus applications of our models. Currently we are preparing the Switchboard Corpus for additional analysis.
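To give a flavor of the text-based feature setup, here is a toy sketch of a few lexical features one might extract per word for pitch accent prediction; the function-word list and feature names are illustrative assumptions, not the project's actual feature set.

```python
# Toy lexical features for pitch accent prediction from text. The real
# system uses syntactic parses, word embeddings, coreference, LIWC, and
# specificity features; this sketch shows only the per-word feature-dict
# framing. The function-word list below is a small invented sample.

FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "was"}

def accent_features(tokens, i):
    """Feature dict for predicting whether tokens[i] bears a pitch accent."""
    w = tokens[i].lower()
    return {
        "is_function_word": w in FUNCTION_WORDS,      # rarely accented
        "word_len": len(w),
        "rel_position": i / max(len(tokens) - 1, 1),  # 0.0 start, 1.0 end
        "is_final": i == len(tokens) - 1,             # phrase-final words differ
    }

feats = accent_features(["the", "storm", "hit", "boston"], 1)
```

Feature dicts like this would then feed a standard classifier trained on ToBI-labeled data.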

Participants: Rose Sloan, Shivani Ghatge, Adaeze Adigwe, Isabella Mandis, Sahana Mohandoss, Syed Sarfaraz Ahtar, Bryan Li, Ritvik Shrivastava, Liliann Ulysse, Mariel Werner, Agustin Gravano, Julia Hirschberg


Gender-balanced Charismatic Speech and Charismatic Politician Speech

Following earlier work in our lab, we have started two new projects on charismatic speech: (1) studying the role of demographic information, especially gender, in producing and perceiving charismatic speech, using a gender-balanced dataset; and (2) studying charismatic politician speech on a newly collected, large-scale politician speech dataset. We are interested in how the genre of speech, the speaking style, and the perceiver's demographic information affect perceived traits of politician speech.

Participants: Zixiaofan (Brenda) Yang, Nishmar Cestero, Tomer Aharoni, Brandon Liang, Riku Tabata, Jessica Yin Huynh, Alina Ying, Julia Hirschberg


Modeling Mental Illness from Reddit Posts

We are investigating the problem of automatically detecting psychiatric disorders from the linguistic content of social media posts on Reddit. We have collected a large-scale dataset of Reddit posts from users with eight types of mental disorders and a control group. We have extracted and analyzed the linguistic characteristics of posts and identified differences between these diagnostic groups. We have built strong classification models based on deep contextualized word representations and found that they outperform previously applied statistical models with simple linguistic features by large margins.

Participants: Sarah Ita Levitan, Zhengping Jiang, Julia Hirschberg


Detecting Hate Speech Directed at Female Journalists

Most efforts at identifying abusive speech online have relied on public corpora scraped from websites using keyword-based queries or released by site or platform owners for research purposes. These are typically labeled by crowd-sourced annotators -- not the targets of the abuse. While such data supports fast development of machine learning classifiers, the models built on it often fail in the context of real-world harassment and abuse, which contain nuances less easily identified by non-targets. We are developing a mixed-methods approach to creating classifiers for abuse and harassment that leverages direct engagement with the target group, in order to achieve high quality and ecological validity of datasets and labels, and to generate deeper insights into the key tactics of bad actors. We are focusing initially on women journalists' experiences on Twitter, but we have identified several structural mechanisms of abuse that we believe will generalize to other target communities.

Participants: Sarah Ita Levitan, Ishaan Arora, Julia Guo, Susan McGregor, Julia Hirschberg


Cross-lingual Emotion and Sentiment Detection in Speech

We have built models for emotion detection in speech that predict both valence and arousal continuously, utilizing both waveforms and spectrograms as inputs, for corpora such as the SEMAINE and RECOLA databases; we use models trained on these corpora to predict emotion in low-resource languages. We are currently studying cross-lingual sentiment detection in speech.

Participants: Zixiaofan (Brenda) Yang, Sheryl Mathew, Julia Hirschberg


Speaker Entrainment in Dialogue Systems

In conversation, people entrain to their partner by adopting that partner's word choice, or by adapting aspects of their speaking style, such as speaking rate or pitch range or intensity. Such synchronization is critical to the success of human-human interactions.

While lexical entrainment has been investigated experimentally in a number of studies, other types of entrainment have received less attention. In this project, we are investigating entrainment along dimensions such as intonational contour, pitch accent, phrasing rate, pitch range, intensity, laughter, turn-taking and backchanneling behaviors.

An investigation of these behaviors will support the design of better spoken dialogue systems. While entrainment has been proposed as an important method for inducing users to adopt a system's lexical items in order to improve recognition accuracy, few studies have examined the importance of systems entraining to their users to promote more successful and human-like exchanges.
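As a concrete illustration, one widely used entrainment measure ("proximity") asks whether a speaker's feature values sit closer to their partner's than to those of non-partners; here is a minimal sketch with made-up intensity values (the function name and data are illustrative, not from our studies).

```python
# Sketch of a "proximity" entrainment measure: a positive score means the
# speaker's feature (e.g. mean intensity) is closer to the partner's than
# to randomly paired non-partners. All values below are invented.

def mean(xs):
    return sum(xs) / len(xs)

def proximity(speaker_vals, partner_vals, others_vals):
    """speaker_vals/partner_vals: per-turn feature values for one pair;
    others_vals: list of per-turn value lists for non-partner speakers."""
    s = mean(speaker_vals)
    partner_diff = abs(s - mean(partner_vals))
    other_diff = mean([abs(s - mean(o)) for o in others_vals])
    return other_diff - partner_diff

score = proximity([60, 62], [61, 63], [[70, 72], [50, 49]])
```

Analogous measures exist for convergence and synchrony; this sketch covers only proximity.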

Participants: Julia Hirschberg, Marian Trnka, Eduard Kuric, Lukas Martak, Andreas Weise, Sarah Ita Levitan, Ramiro H. Galvez, Florencia Savoretti, Sakhia Darjaa (Slovak Academy of Sciences), Laura Willson, Shirley Xia (Shanghai Jiao), Ani Nenkova (University of Pennsylvania), Agustín Gravano (University of Buenos Aires), Enrique Henestroza, Rivka Levitan, Adele Chase, Stefan Benus (Constantine the Philosopher University), Jens Edlund (KTH), Mattias Heldner (KTH)


Identifying Hyraxes by Their Song

We obtained a corpus of hyrax songs from a collaborator at Bar-Ilan University, identified by the name tag of the hyrax from which each song was collected. Our goal was to identify individual hyraxes by differences in their songs. While we managed to obtain reasonable results, we faced issues with the amount of data available and with background noise in the recordings, which were made in the wild.

Participants: Lin Ai, Shivani Ghatge, Lee Koren (Bar-Ilan U), Julia Hirschberg


Differences in Demographics and Trust in News Sources

We investigated differences in the degree of trust readers expressed toward news articles in a corpus collected by the Knight Foundation, which asked raters to provide demographic information on gender, age, education, financial level, and political leaning (liberal vs. conservative). We clustered the raters into multiple groups and identified their allocation of trust in media based on their demographics as well as on the news source of the articles rated.

Participants: Sarah Ita Levitan, Eric Bolton, Marko Mandic, Julia Hirschberg


LORELEI: Incident Detection in Low Resource Languages

The goal of this project is to build low-resource-language speech processing systems to support rapid and effective responses to emerging incidents. We approach this goal from two directions: (1) detecting incidents in speech using prosodic characteristics alone. We assume that speech about an incident may contain emotions such as fear or stress, which are captured in prosodic features; we built cross-lingual incident detection models for all 27 languages provided in the LORELEI project. (2) Detecting incidents by spotting related keywords. We proposed a linguistically informed training scheme to obtain acoustic word embeddings that can be easily applied to query-by-example keyword search in languages with minimal resources.
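The query-by-example step can be illustrated with a toy sketch: both the spoken query and each candidate audio segment are mapped to fixed-dimensional acoustic word embeddings (by a model not shown here), and matches are ranked by cosine similarity. The vectors and threshold below are invented.

```python
# Hedged sketch of query-by-example keyword search over acoustic word
# embeddings. The embedding model is assumed; we only show the matching
# step: rank candidate segments by cosine similarity to the query vector.

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def search(query_emb, segment_embs, threshold=0.9):
    """Return indices of segments likely containing the query keyword."""
    return [i for i, e in enumerate(segment_embs)
            if cosine(query_emb, e) >= threshold]

hits = search([1.0, 0.0], [[0.99, 0.1], [0.0, 1.0], [1.0, 0.05]])
```

The appeal of this setup for LRLs is that no transcription of the query is needed, only audio.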

Participants: Zixiaofan (Brenda) Yang, Lin Ai, Julia Hirschberg


Multimodal Humor Prediction

In this project, we proposed a novel approach for generating unsupervised humor labels using time-aligned user comments, and for predicting humor using speech, text, and visual information. We collected 341 videos of comedy movies, gameplay, and satirical talk shows from one of the largest Chinese video-sharing websites. We generated unsupervised humor labels from laughing comments and found high agreement between these labels and human annotations. From these unsupervised labels, we built deep learning models using features from multiple modalities, obtaining an F1-score of 0.73 in predicting humor on a manually annotated test set.
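The unsupervised labeling idea can be sketched as follows: bin time-aligned comments into fixed windows and mark a window as humorous when it contains enough laughing comments. The laughter markers, window size, and threshold here are illustrative assumptions, not the published recipe.

```python
# Toy sketch of unsupervised humor labeling from time-aligned comments.
# A window is labeled humorous when enough of its comments contain a
# laughter marker (e.g. "haha", or "233", a common Chinese laughter token).

LAUGH_MARKERS = ("haha", "233", "lol")

def humor_windows(comments, window=10.0, min_laughs=2):
    """comments: list of (timestamp_sec, text) pairs.
    Returns sorted start times of windows labeled humorous."""
    counts = {}
    for t, text in comments:
        if any(m in text.lower() for m in LAUGH_MARKERS):
            start = int(t // window) * window
            counts[start] = counts.get(start, 0) + 1
    return sorted(s for s, c in counts.items() if c >= min_laughs)

labels = humor_windows([(3.0, "hahaha"), (7.5, "2333"),
                        (12.0, "nice"), (41.0, "lol")])
```

Windows labeled this way can then serve as (noisy) training targets for multimodal humor classifiers.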

Participants: Zixiaofan (Brenda) Yang, Lin Ai, Bingyan Hu, Julia Hirschberg


Detecting Hate Speech Targeting Different Religious Groups

This project detects hate speech in online text, where hate speech is defined as abusive speech targeting specific group characteristics, such as ethnic origin, religion, gender, or sexual orientation. While hate speech against any group may exhibit some common characteristics, we observed that hatred against each group is typically characterized by the use of a small set of high-frequency stereotypical words; however, such words may be used in either a positive or a negative sense, making our task similar to word sense disambiguation. To build a classifier to detect hate speech, we collected and annotated an anti-Semitic hate speech corpus. We also developed a mechanism for detecting some commonly used methods of evading “dirty word” filters. Our pilot classification experiments attained an accuracy of 94%, precision of 68%, and recall of 60%, for an F1 measure of 0.6375.
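The filter-evasion idea can be illustrated with a hypothetical normalization step that undoes common obfuscations (character substitutions, inserted punctuation, stretched letters) before keyword matching; the substitution map below is an invented example, not the actual mechanism from this project.

```python
# Hypothetical de-obfuscation before keyword matching. Abusive posts often
# evade "dirty word" filters with substitutions (h8, i.d.i.o.t, sooooo);
# normalizing tokens first makes simple blocklist matching more robust.
# The substitution map is an illustrative sample.

import re

SUBS = {"0": "o", "1": "i", "3": "e", "4": "a",
        "5": "s", "8": "ate", "@": "a", "$": "s"}

def normalize(token):
    token = token.lower()
    token = "".join(SUBS.get(ch, ch) for ch in token)   # undo substitutions
    token = re.sub(r"[^a-z]", "", token)                # strip inserted punctuation
    token = re.sub(r"(.)\1{2,}", r"\1\1", token)        # squeeze stretched letters
    return token
```

After normalization, tokens can be matched against a blocklist or fed to a classifier as features.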

Participants: William Warner, Matthew Holtzman, Julia Hirschberg, with help from Belle Tseng, Kim Capps-Tanaka, Evgeniy Gabrilovich, and Martin Zinkevich at Yahoo!


Clarification in Dialogue

In this project we modeled human responses to speech recognition errors using a corpus of human clarification strategies. We employed machine learning techniques to study (1) the decision to stop and ask a clarification question or to continue the dialogue without clarification, and (2) the decision to ask a targeted clarification question or a more generic question. Targeted clarification questions focus specifically on the part of an utterance that was misrecognized, in contrast to generic requests to ‘please repeat’ or ‘please rephrase’. Our goal is to generate targeted clarification strategies for handling errors in spoken dialogue systems when needed. Our experiments showed that linguistic features, in particular the inferred part of speech of a misrecognized word, are predictive of human clarification decisions. A combination of linguistic features predicts a user’s decision to continue or stop a dialogue with an accuracy of 72.8%, over a majority baseline accuracy of 59.1%. The same set of features predicts the decision to ask a targeted question with an accuracy of 74.6%, compared with the majority baseline of 71.8%.
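A toy version of the second decision, with invented POS-to-question templates (the project's actual models are learned from data, not rule-based):

```python
# Illustrative rule: if the misrecognized word's inferred part of speech
# supports a targeted question template, ask it; otherwise fall back to a
# generic request. Templates and POS tags here are invented examples.

TARGETED_TEMPLATES = {
    "NOUN": "What {verb}?",                  # e.g. "What shall I pass?"
    "VERB": "What should I do with {obj}?",
}

def choose_clarification(error_pos, context):
    """Return a targeted question when the error's POS supports one,
    otherwise a generic request."""
    template = TARGETED_TEMPLATES.get(error_pos)
    if template is None:
        return "Please rephrase."
    return template.format(**context)

q = choose_clarification("NOUN", {"verb": "shall I pass"})
```

In the learned setting, the same decision is made by a classifier over linguistic features rather than a lookup table.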

Participants: Svetlana Stoyanchev, Alex Liu, Eli Pincus, Julia Hirschberg


Text-to-Speech Synthesis for Low-Resource Languages

The rapid improvement of speech technology over the past few years has resulted in its widespread adoption by consumers, especially in mobile spoken dialogue systems such as Apple Siri and Google Voice Search. This progress has led to very natural and intelligible text-to-speech (TTS) synthesis for a small number of languages, including English, French, and Mandarin. These high-resource languages (HRLs) have been studied extensively by speech researchers, who have built various language tools and collected and annotated massive amounts of speech data in these languages. However, there are thousands of languages in the world (~6500), many spoken by millions of people, which have not received this attention from the speech and natural language processing community. Low-resource languages (LRLs) such as Telugu, Tok Pisin, and Vietnamese do not enjoy rich computational resources and vast amounts of annotated data. Thus, speakers of these languages are deprived of the benefits of modern speech technology, which enables us to communicate across language barriers.

We are working towards developing methods of building intelligible, natural-sounding TTS voices out of limited data. While most commercial TTS voices are built from audio recorded by a professional speaker in a controlled acoustic environment, this data can be very time-consuming and expensive to collect. We are exploring the use of radio broadcast news, speech recorded with mobile phones, and other found data for building TTS voices, investigating data selection and model adaptation techniques for making the most out of noisy data.

Participants: Julia Hirschberg, Erica Cooper, Khai-Zhan Lee, Elshadai Testaye Biru, Yishak Tofik Mohammed, David Tofu, Emily Li, Alison Chang, Yocheved Levitan, Luise Valentin Rygaard, Olivia Lundelius, Xinyue Wang, Mert Ussakli


Code Switching

Code switching (CS) is the practice of switching back and forth between the shared languages of bilingual or multilingual speakers. CS is particularly prevalent in geographic regions with linguistic boundaries or where there are large immigrant groups sharing a common first language different from the mainstream language, as in the USA. Different levels of language (phonological, morphological, syntactic, semantic, and discourse-pragmatic) may be involved in CS in different language pairs and/or genres. Computational tools trained on a single language, such as automatic speech recognition, information extraction or retrieval, or machine translation systems, quickly break down when the input includes CS. A major barrier to research on CS in computational linguistics has been the lack of large, accurately annotated corpora of CS data. We are part of a larger team which aims to collect a large repository of CS data, consistently annotated across different language pairs at different levels of granularity, from phonology/morphology to pragmatics and discourse, in Modern Standard Arabic with dialectal Arabic, Arabic-English, Hindi-English, Spanish-English, and Mandarin-English. At Columbia we are currently focusing on collecting Mandarin-English CS data in social media and in telephone conversations.
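A minimal sketch of the token-level setup such corpora require: tagging each token's language, here with a crude Unicode-range heuristic for Mandarin-English (real annotation is far richer; the tag names and heuristic are illustrative only):

```python
# Crude token-level language ID for Mandarin-English code-switched text:
# tokens containing CJK Unified Ideographs are tagged "zh", tokens with
# ASCII letters "en", anything else "other". Real corpora annotate many
# more levels (morphology, syntax, pragmatics); this shows only the
# token-tagging skeleton.

def tag_language(token):
    if any("\u4e00" <= ch <= "\u9fff" for ch in token):  # CJK ideograph range
        return "zh"
    if any(ch.isascii() and ch.isalpha() for ch in token):
        return "en"
    return "other"

tags = [tag_language(t) for t in ["我", "去", "meeting", "了", "ok", "!"]]
```

Heuristics like this fail for romanized Mandarin or shared scripts, which is exactly why annotated CS corpora are needed.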

Participants: Julia Hirschberg, Victor Soto, Nishmar Cestero, Alison Chang, Mona Diab, Thamar Solorio


DEFT: Anomaly Detection in Speech

This project investigates anomaly in speech by looking at behaviors that break the Gricean maxims of cooperative communication. Specifically, we are looking at hedging behaviors wherein the speaker uses cue words (eg. 'maybe', 'could', 'think', etc) to show a reduced commitment to their utterance. Initial research included constructing an annotation manual to accurately identify and label such behavior in speech. Ongoing work is looking at automatic labeling of hedges with the help of lexical and acoustic features. The end goal is to use the presence of hedging and disfluencies as a metric through which we can identify anomalous regions in dialogue.

Participants: Morgan Ulinski, Anna Prokofieva, Julia Hirschberg, Owen Rambow, Vinod Prabhakaran, Smaranda Muresan, Apoorv Agarwal, Anup Kotalwar, Kathy McKeown, Sara Rosenthal, Weiwei Guo



BOLT: Interactive Error Handling for Speech Translation

BOLT investigates interactive error handling for speech translation systems. BOLT is a DARPA-funded joint project with SRI International, the University of Marseille, and the University of Washington. In this project, we introduce an error-recovery dialogue manager component into a spoken translation system. A spoken translation system allows speakers of two different languages to communicate verbally through a translation application. An error-recovery dialogue manager detects errors in the recognition of utterances and asks the speaker a clarification question before translating the potentially erroneous utterance. Most modern dialogue systems employ generic clarification strategies for recovering from recognition errors, asking a user to repeat or rephrase their previous utterance or asking a yes/no confirmation question. Such generic requests are not natural and tend to frustrate the user. In BOLT, we evaluate the feasibility of using targeted clarification questions that focus specifically on the part of an utterance that contains a predicted recognition error. For example, if a speaker says "Pass me some XXX", where XXX is a misunderstood concept, the system may ask the targeted clarification question "What shall I pass?" instead of a generic request for a repetition. Our approach is based on human strategies for such clarifications. We have collected and analyzed a corpus of human responses to misunderstandings in dialogue (Stoyanchev et al., Interdisciplinary Workshop on Feedback Behaviors in Dialog 2012). In order to create targeted clarifications, it is important to detect the location of the error in the utterance. We used a combination of ASR confidence, lexical, and prosodic features to help identify which words in a spoken sentence are misrecognized (Stoyanchev et al., SLT 2012).
Although BOLT evaluates a targeted clarification approach with a speech-to-speech translation application, this approach will also benefit spoken dialogue systems, especially AI systems that accept spoken input with a wide range of concepts and topics.
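The error-detection step can be sketched crudely: flag words whose per-word ASR confidence falls below a threshold. The real system combines confidence with lexical and prosodic features in a learned model; the function name, values, and threshold here are illustrative.

```python
# Crude sketch of misrecognition detection from per-word ASR confidence.
# A targeted clarification question would then be formed around the
# flagged words. Threshold and scores are invented for illustration.

def flag_errors(words, confidences, threshold=0.5):
    """Return the words whose recognition confidence falls below threshold."""
    return [w for w, c in zip(words, confidences) if c < threshold]

errs = flag_errors(["pass", "me", "some", "XXX"], [0.97, 0.99, 0.92, 0.21])
```

Given the flagged span and its inferred part of speech, the dialogue manager can then decide between a targeted question and a generic request.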

Participants: Svetlana Stoyanchev, Rose Sloan (Yale University), Mei-Vern Then, Alex Liu, Sunil Khanal, Eli Pincus, Ananta Padney (Barnard College), Jingbo Yaung, Philipp Salletmayer (Graz University)


Text-to-Scene for Field Linguistics

WELT is a text-to-scene tool designed to support field linguists in documenting and eliciting language data. While we intend the tool to be generally useful, we are initially developing WELT based on scenarios involving Arrernte, an Australian Aboriginal language.

Participants: Morgan Ulinski, Bob Coyne, Julia Hirschberg, Owen Rambow, Alexandra Orth, Inna Fetissova (Northeastern University), Myfany Turpin (University of Queensland), Daniel Kaufman (Endangered Language Alliance), Mark Dras (Macquarie University)



BABEL: Spoken Keyword Search for Low-Resource Languages

The BABEL program aims to develop spoken keyword search systems for diverse low-resource languages. Our group focuses on the use of prosodic features for improving recognition accuracy and keyword search performance, as well as on experiments in cross-lingual adaptation of models for identifying prosodic events.

Participants: Victor Soto, Erica Cooper, Andrew Rosenberg, Gideon Mendels, Julia Hirschberg



AuToBI

AuToBI is a tool for the automatic analysis of Standard American English prosody. Open source and written in Java, AuToBI hypothesizes pitch accents and phrase boundaries consistent with the ToBI prosodic annotation standard. The toolkit includes an acoustic feature extraction frontend and a classification backend supported by the Weka machine learning toolkit.
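As an illustration of the two-stage shape of such a pipeline (AuToBI itself is Java code on top of Weka), the following Python sketch extracts toy per-word acoustic features and applies an invented threshold rule in place of the real classification backend. The feature names and thresholds are assumptions for illustration, not AuToBI's.

```python
def extract_features(word):
    """word: dict with 'f0' (a list of pitch samples in Hz) and 'duration'
    (seconds). Returns a small per-word feature vector."""
    f0 = word["f0"]
    return {
        "mean_f0": sum(f0) / len(f0),
        "f0_range": max(f0) - min(f0),
        "duration": word["duration"],
    }


def hypothesize_pitch_accent(feats, speaker_mean_f0=180.0):
    """Toy stand-in for the classification backend: call a word accented if
    its pitch is well above the speaker's mean or its pitch range is wide."""
    return feats["mean_f0"] > speaker_mean_f0 * 1.15 or feats["f0_range"] > 60
```

In the real toolkit this decision is made by trained Weka classifiers over a much richer feature set, and phrase boundaries are hypothesized alongside accents.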

Participants: Julia Hirschberg, Andrew Rosenberg


Deception in Speech

This project examined the feasibility of automatic detection of deception in speech, using linguistic, prosodic, and other acoustic cues. We were particularly interested in how individual differences affect the behavior of deceivers, and how such differences affect the ability of individuals to detect deception.

Our study produced the first cleanly recorded, labeled corpus of deceptive speech, the Columbia-SRI-Colorado (CSC) Corpus. Our elicitation paradigm created a context in which the subject was positively motivated to deceive an interviewer (in contrast to studies in which subjects are placed in situations where they are led to lie about potentially guilt-inducing behavior). We investigated deception on two levels: we considered the speaker's overall intention to deceive (or not) with respect to particular topics, and we examined individual utterances in terms of their factual content.

Our published work produced a classification system that performs substantially better than human judges at classifying deceptive and non-deceptive utterances; a study of the use of filled pauses in deceptive speech; a method of combining classifiers using different feature sets; and a perception study showing that the personality of a listener affects his or her ability to distinguish deceptive from non-deceptive speech.

Participants: Julia Hirschberg, Frank Enos, Stefan Benus, Jennifer Venditti-Ramprashad, Sarah Friedman, Sarah Gilman, Jared Kennedy, Max Shevyakov, Wayne Thorsen, Alan Yeung, and collaborators from SRI/ICSI and from the University of Colorado at Boulder.


Emotion in Speech

The crux of this research was characterizing acoustic and prosodic cues to human emotion, evaluating subjective judgments of human emotion, and exploring when and why certain emotions become confusable. We conducted on-line surveys designed to collect subjective judgments of both emotional speech and emotional faces. We observed that machine learning techniques applied to the prediction of human emotion from acoustic and prosodic information of the sound tokens yield a prediction rate of 75%-80%. We also found that our subjects systematically differed in how they perceived emotion in terms of valence (positive or negative affect). Furthermore, automatic emotion classification accuracy increases if we model these two groups independently of one another.
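A minimal sketch of the classification setup described above, using a nearest-centroid classifier over two invented acoustic features (pitch mean and an energy measure); the data, features, and classifier choice are illustrative assumptions, not the project's actual models. The same fit can be run separately per listener group to model the two valence-perception groups independently.

```python
def centroid(vectors):
    """Componentwise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def nearest_centroid_fit(X, y):
    """X: feature vectors (e.g. pitch mean, energy); y: emotion labels.
    Returns one centroid per emotion label."""
    labels = sorted(set(y))
    return {lab: centroid([x for x, l in zip(X, y) if l == lab])
            for lab in labels}


def predict(model, x):
    """Assign x to the emotion whose centroid is closest (squared L2)."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(model, key=lambda lab: dist(model[lab], x))
```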

Participants: Julia Hirschberg, Jennifer Venditti-Ramprashad, Jackson Liscombe, Sarah Gilman, Daniel Vassilev, Agustín Gravano.


Detecting and Responding to Emotion in Intelligent Tutoring Systems

A tutor uses cues from the student to determine whether information has been successfully learned. These cues may be explicit or implicit. The first goal of this study is to examine cues to student emotions, such as frustration and uncertainty, in the context of speech-enabled intelligent tutoring systems. Such cues include lexical, prosodic, voice quality, and contextual information. The second goal is to evaluate the most appropriate strategies for responding to (negative) emotional states once they are detected. The ultimate goal is to increase the enjoyment and learning of Intelligent Tutoring Systems users.

Participants: Julia Hirschberg, Jennifer Venditti-Ramprashad, Jackson Liscombe, Jeansun Lee (Columbia University); Diane Litman, Katherine Forbes, Scott Silliman (University of Pittsburgh).


Identifying Acoustic, Prosodic, and Phonetic Cues to Individual Variation in Spoken Language

A fundamental challenge for current research on speech science and technology is understanding individual variation in spoken language. Individuals have their own speaking styles, depending on many factors, including the dialect and socioeconomic background of the speaker, as well as contextual variables such as the degree of familiarity between the speaker and hearer and the register of the speaking situation, from very casual to very formal (Eskenazi 1992). Even within the same dialect or register, individual variation may occur; for example, in spontaneous speech, some speakers tend to exhibit more articulation reduction (e.g., reduction or deletion of function words) than others. In this project, we are working on identifying the acoustic-prosodic and phonetic cues that might contribute to clustering speakers based on their speaking style.
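As a sketch of the clustering step, assuming invented per-speaker style features (say, speaking rate and a function-word reduction rate) and a minimal k-means, the following groups speakers by style. This is an illustration of the idea, not the project's method, which draws on a much richer acoustic-prosodic and phonetic feature set.

```python
def kmeans(points, k, iters=20):
    """Tiny k-means over per-speaker feature vectors.
    Naive init: the first k points. Returns (assignments, centroids)."""
    cents = [list(p) for p in points[:k]]

    def nearest(p):
        return min(range(k),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(p, cents[i])))

    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p)].append(p)
        for i, c in enumerate(clusters):
            if c:  # keep the old centroid if a cluster empties out
                cents[i] = [sum(p[d] for p in c) / len(c)
                            for d in range(len(c[0]))]
    return [nearest(p) for p in points], cents
```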

Participants: Fadi Biadsy, Julia Hirschberg, William (Yang) Wang


Extracting Paraphrase Rules from FrameNet and WordNet

FrameNet organizes lexical units into semantic frames with associated frame elements, which represent the core roles of that frame. Each frame also contains annotated sentences mapping grammatical function to frame element role. In our research, we have extracted patterns from these annotated sentences to form paraphrase rules that cover conversives (e.g. "buy" <-> "sell") as well as other meaning-preserving verb transformations and alternations, such as "The rats swarmed around the room" <-> "The room was teeming with rats."
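A toy illustration of applying a conversive rule of the kind described above: the grammatical roles around a verb pair such as buy/sell are swapped while the meaning is preserved. The rule format and the tiny lexicon here are invented for illustration; the project derives such rules automatically from FrameNet's annotated sentences.

```python
# (verb, preposition introducing the other role) -> conversive counterpart
RULES = {
    ("buy", "from"): ("sell", "to"),
    ("sell", "to"): ("buy", "from"),
}


def apply_conversive(subj, verb, obj, prep, oblique):
    """Swap subject and oblique around a conversive verb pair, e.g.
    (Mary, buy, a book, from, John) -> (John, sell, a book, to, Mary)."""
    new_verb, new_prep = RULES[(verb, prep)]
    return (oblique, new_verb, obj, new_prep, subj)
```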

Participants: Bob Coyne, Owen Rambow


WordsEye: Automatic Text-to-Scene Conversion

We live in a vast sea of ever-changing text with few tools available to help us visualize its meaning. The goal of this research is to bridge the gap between graphics and language by developing new theoretical models and supporting technology to create a system that automatically converts descriptive text into rendered 3D scenes representing the meaning of that text. This builds upon previous work done with Richard Sproat on the WordsEye text-to-scene system (available online). New research directions include the lexical semantics and knowledge acquisition needed to semi-automatically construct a new scenario-based lexical resource. This resource will be used to decode and make explicit the oblique contextual elements common in descriptive language for the purposes of graphical depiction.

Participants: Bob Coyne, Owen Rambow, Julia Hirschberg, Gino Micelli, Cecilia Schudel, Daniel Bauer, Morgan Ulinski, Richard Sproat (OHSU), Masoud Rouhizadeh (OHSU), Yilei Yang, Sam Wiseman, Jack Crawford, Kenny Harvey, Mi Zhou, Yen-Han Lin, Margit Bowler (Reed College), Victor Soto.


Charismatic Speech

People are drawn, at an instinctual level, to certain public speakers. What is it that makes their speech charismatic? Our research examines acoustic and lexical features of public addresses to locate the sources of charisma. Though the work so far has been on American English, parallel work in Arabic may shed light on potential cultural biases in the perception of charisma.

Participants: Julia Hirschberg, Wisam Dakka, Andrew Rosenberg, Fadi Biadsy, Aron Wahl, Judd Sheinholtz, Svetlana Stenchikova.


Speech Summarization

Speech summarization is the task of summarizing spoken data such as broadcast news, telephone conversations, meetings, and lectures; we focus mainly on broadcast news. Our research consists of three aspects: i) summarization, ii) information extraction, and iii) user interface. The summarization aspect extracts significant segments of speech and concatenates them to provide a coherent summary of a given broadcast news story. The information extraction aspect extracts named entities, headlines, interviews, and different types of speakers. The last aspect is a user interface that combines the summary of broadcast news and the other extracted information in a coherent, user-friendly speech browser.
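The extract-and-concatenate step can be sketched as follows, assuming each segment already has a salience score (in practice the score would come from trained models over lexical, acoustic, and structural features; here it is simply given). This is an illustration of the general extractive approach, not our system.

```python
def summarize(segments, scores, budget):
    """segments: transcript strings for a story; scores: salience per
    segment; budget: number of segments to keep. Selects the top-scoring
    segments, then concatenates them in their original order so the
    summary stays coherent."""
    ranked = sorted(range(len(segments)), key=lambda i: -scores[i])
    keep = sorted(ranked[:budget])  # restore original story order
    return " ".join(segments[i] for i in keep)
```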

Participants: Julia Hirschberg, Sameer Maskey, Michel Galley, Martin Jansche, Jeansun Lee, Irina Likhtina, Aaron Roth, Lauren Wilcox.


Prosody of Turn-Taking in Dialogue

In conversation there are implicit rules specifying whose turn it is to talk, and conventions for switching the turn from one speaker to the other. For example, interrupting the interlocutor is a (not necessarily rude) way of grabbing the turn, while formulating a question is a way of yielding it. These rules allow dialogues to develop in a coordinated manner. The goal of this project is to study and characterize those rules and conventions, in the Columbia Games Corpus and other corpora.

Participants: Julia Hirschberg; Stefan Benus (Constantine The Philosopher University); Agustín Gravano (University of Buenos Aires), Héctor Chávez, Michael Mulley, Enrique Henestroza, Lauren Wilcox.


Affirmative Cue Words in Dialogue

In speech, single affirmative cue words such as okay, right, and yes are often used with different functions, including acknowledgment (meaning "I believe/agree with what you said"), backchannel (indicating "I'm still here" or "I hear you and please continue"), and beginning of a new discourse segment (as in "okay, now I will talk about..."). In this project, we analyze how such functions are conveyed and perceived, and explore how they can be automatically predicted with machine learning algorithms.
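The three functions above can be sketched as a toy decision rule over simple contextual cues. The features (pause length, whether the other speaker holds the floor) and thresholds are invented for illustration; the project's models are learned from labeled data over richer prosodic and contextual features.

```python
def classify_cue_word(pause_after, other_speaker_continues):
    """Guess the function of an affirmative cue word such as 'okay'.
    pause_after: silence (seconds) after the cue word by this speaker;
    other_speaker_continues: True if the interlocutor keeps the floor."""
    if other_speaker_continues:
        return "backchannel"      # "I hear you, please continue"
    if pause_after > 0.5:
        return "acknowledgment"   # closes off the previous contribution
    return "segment_start"        # "okay, now I will talk about..."
```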

Participants: Julia Hirschberg (Columbia University); Stefan Benus (Constantine The Philosopher University); Agustín Gravano (University of Buenos Aires), Lauren Wilcox, Héctor Chávez, Ilia Vovsha (Columbia University); Shira Mitchell (Harvard University).


Intonational Overload: Uses of the Downstepped Contours

Intonational contours are overloaded, conveying different meanings in different contexts. We are studying potential uses of the downstepped contours (especially H* !H* L- L%) in Standard American English, in both read and spontaneous speech. We are investigating speakers' use of these contours in conveying discourse topic structure and in signaling given vs. new information, and the relationship between these two functions. We designed and collected the Columbia Games Corpus specifically for this project.

Participants: Julia Hirschberg (Columbia University); Agustín Gravano (University of Buenos Aires); Gregory Ward, Elisa Sneed (Northwestern University); Stefan Benus (Constantine The Philosopher University), Ani Nenkova, Michael Mulley.


Characterizing Laughter in Dialogue

Laughter can serve many different purposes in human communication and occurs in many different forms. This project involved studying the acoustic characteristics of laughter and the functions different types of laughter may serve in human dialogue and in spoken dialogue systems.

Participants: Brianne Calandra, Rolf Carlson (KTH, Sweden).

webmaster - runchenx[at] last updated - 07.12.2024