In Memoriam: Dragomir R. Radev

The CS Department mourns the loss of Dragomir R. Radev, a 1999 computer science PhD graduate who unexpectedly passed away on March 29th in his home in New Haven, Connecticut. He was 54 years old and leaves behind his wife, Axinia, and children, Laura and Victoria. 

Dragomir R. Radev
Dragomir R. Radev

Radev worked with Professor Kathleen McKeown on seminal multi-document text summarization research, the topic of his PhD dissertation. His first job after Columbia was at IBM TJ Watson Research in Hawthorne, New York, where he worked for a year as a Research Staff Member. Then he spent 16 years on the computer science faculty at the University of Michigan before joining Yale University in 2017 as the A. Bartlett Giamatti Professor of Computer Science and led the Language, Information, and Learning (LILY) Lab at Yale University.

His research and work were influential, from his widely cited paper on LexRank to his most recent papers providing datasets, benchmarks, and evaluation of metrics for text summarization. His wide-ranging research touched many areas beyond summarization. He worked on graph-based methods for natural language processing (NLP), question answering, interfaces to databases, and language generation. 

Over his career, Radev received many honors, including Fellow of the Association for Computational Linguistics (2018), the American Association for the Advancement of Science (2020), the Association for Computing Machinery (2015), and the Association for the Advancement of Artificial Intelligence (2020). He served as the Secretary of ACL from 2006-2015 and was awarded the ACL Distinguished Service Award in 2022. 

Radev co-founded the North American Computational Linguistics Open Competition, an annual competition in which high school students solve brain teasers about language. He organized the contest and traveled with top-ranked students to the International Linguistics Olympiad every year. 

Drago was a very special, incredible person who touched all of us with his energy, his love for NLP, and his kindness,” said Kathleen McKeown. “He touched so many people and has had a huge impact on the field and on the ACL, the primary organization for our field.” 


Dragomir R. Radev and family. Left to right: Laura, Dragomir, Axinia, and Victoria.
Dragomir R. Radev and family. Left to right: Laura, Dragomir, Axinia, and Victoria.

Fundraising note: A small group of faculty members from Columbia University, Yale University, and the University of Michigan have joined forces to raise money and set up a GoFundMe to help the Radev family support Victoria, who has a disability. The fund will help Axinia and the family continue to provide Victoria with the care she needs. If you are interested in and capable of donating in any way, please consider giving to the fundraiser

Voices of CS: Tuhin Chakrabarty 

PhD student Tuhin Chakrabarty talks about how his research is tapping into the creative side of computer science.


The field of natural language processing (NLP) has ramped up by leaps and bounds. This branch of artificial intelligence focuses on the ability of computers to understand and process language as humans do. It has been in the news these past few months because of a chatbot, ChatGPT, that can provide answers and data conversationally. The technology gives us a taste of just how powerful and useful NLP can be.

Tuhin Chakrabarty wants to see how much further he can push NLP in the field of computational creativity to see how computers can generate creative output. This is what ChatGPT had to say about computational creativity:

Computational creativity is a field that uses computational methods to simulate and enhance human-like creativity, producing valuable outputs such as art, music, stories, and scientific discoveries. It aims to understand and replicate the cognitive processes involved in human creativity, combining techniques from AI, cognitive psychology, and philosophy. Examples of computational creativity include generative art and music, game design, natural language processing, and scientific discovery. Ultimately, computational creativity seeks to leverage computers and algorithms to augment and extend human creativity, creating new possibilities for creative expression and innovation.
Tuhin Chakrabarty
Tuhin Chakrabarty

“Generating text beyond a few sentences was almost very difficult two years ago, but things look much better now. It is not perfect, but I am optimistic,” said Tuhin Chakrabarty, who first became interested in computational creativity in 2019. “One of the things that I am excited about is how better we can align models like ChatGPT to human expectations and different cultures.”

Instead of creating text conversationally, Chakrabarty’s research focuses on how AI can be used to create metaphors and detect sarcasm with little to no training data. The fifth-year PhD student advised by Smaranda Muresan has expanded his work to generating long narratives of 2,000-word documents and visual metaphors. We recently sat down with him to learn more about his research and the creative possibilities of NLP.

Q: You mentioned that you became interested in doing research during your MS. What happened that made you interested in doing research?

I did not have much research experience as an undergrad. I got accepted to the CS masters program and I was fortunate enough to take a class offered by my advisor Smaranda Muresan, which still happens to be one of my all-time favorite courses at Columbia. Computational models of Social Meaning was a graduate seminar course about impactful papers in NLP. Reading all the papers in that class made me think about what I want to do with NLP and how so many interesting research questions can be answered computationally by studying language. Alongside this, I was also working with my advisor and my friend Chris Hidey on extracting arguments from social media. That experience was really precious. The enthusiasm everyone shared in trying to solve the problem at hand made me sure of my decision to pursue research.

Q: How did you become interested in computational creativity? And what is it? 

Around 2019, Nanyun Peng and He He, two very important researchers in the field of computational creativity, wrote a paper on generating puns. I happened to attend NAACL 2019 in Minneapolis, where the paper was presented. I thought the paper was beautiful in every possible way and it quantified the surprisal theory in humor algorithmically. This made me really fascinated about how we can use inductive biases to help machines generate creative output. For selfish reasons, I reached out to Nanyun Peng and told her that I wanted to work with her. She was very kind and agreed to mentor me. My PhD advisor Smaranda Muresan is one of the experts in the field of Figurative Language, which deals with creativity. So, of course, that influenced my decision to work in computational creativity too.

Computational creativity is a multidisciplinary endeavor located at the intersection of artificial intelligence, cognitive psychology, philosophy, and the arts. The goal of computational creativity is to model, simulate or replicate creativity using a computer to achieve one of several ends:

  • To construct a program or computer capable of human-level creativity.
  • To better understand human creativity and formulate an algorithmic perspective on human creative behavior.
  • To design programs that can enhance human creativity without necessarily being creative themselves.


Q: How can you train a model or algorithm to interpret creativity or language?

State-of-the-art models are often found to be inadequate for creative tasks. The principal reason for this is that in addition to composing grammatical and fluent sentences to articulate given content, these tasks usually require extensive world and common sense knowledge.

It should also be noted that current approaches to text generation require lots of training data for supervision. However, most existing corpus for creative forms of text is limited in size. Even if such a corpus existed, learning the distribution of existing data and sampling from it is unlikely to lead to truly novel, creative output.

So we have to rely on unsupervised or weakly supervised techniques to train an end-to-end model to interpret or generate creative text. Of course, with the advent of Large Language Models and few-shot learning, we can now prompt a model with a few examples of creative text and it can somewhat generalize (but not as well as humans). My dissertation deals with a lot of this.


Q: Let’s talk about your work with the New York Times. What type of research questions did you have to answer while there? How was it different from what you have been doing?

Over the past several years, a key focus for NYTimes Research and Development has understood how advances in machine learning can extend the capabilities of journalists and unlock reader experiences that aren’t possible today. Questions and answers are central to how humans learn. Times journalism frequently uses FAQ and Q&A-style articles to help readers understand complex topics like the Covid-19 vaccines. To enhance this style of journalism, we experimented with large language models to match questions to answers, even if the reader asks their question in a novel way.

Last year we launched a new research effort to explore generating open-ended questions for news articles. Our hypothesis is that understanding the questions our news articles are implicitly answering may be helpful in the reporting process and may ultimately enable us to create FAQ and Q&A-style articles more efficiently.

You can find more information here:

This was fundamentally different from what I have been doing because I had to work towards upholding journalism values such as accuracy and verifiability. In creativity, your model can generate something that does not require attribution. But, when working on a project that deals with news and journalism, the focus is on factuality.

Q: One of your five research papers at EMNLP was from your time at the NY Times, right? 

Recent work on question generation has primarily focused on factoid questions such as who, what, where, and when about basic facts. Generating open-ended why, how, what, etc., questions that require long-form answers has proven more difficult. To facilitate the generation of open-ended questions, we propose CONSISTENT, a new end-to-end system for generating open-ended questions that are answerable from and faithful to the input text. Using news articles as a trustworthy foundation for experimentation, we demonstrate our model’s strength over several baselines using both automatic and human-based evaluations. We contribute an evaluation dataset of expert-generated open-ended questions and discuss potential downstream applications for news media organizations.


Q: What are you working on now? What are the kinds of research questions that you hope to answer?

Much of my recent and upcoming work is on human-AI collaboration for creativity. I recently worked on developing methods and evaluation frameworks for two creative tasks–poetry generation and visual metaphor generation–by leveraging collaboration between expert humans and state-of-the-art generative models. I further highlighted how collaboration improves the final output over either standalone models or only humans.

I have long focused on developing and evaluating machine learning models aimed at creativity in an isolated setting. This somehow limits their capacity to behave in an interactive setting with real humans. In a creative setting, it is crucial for models to understand human needs and provide assistance to augment human capabilities and improve performance based on human edits or feedback over time. So that is my focus now.

Q: About doing a PhD, what are the things you wished you knew before starting it?

This is a difficult question. Pursuing a PhD can be a really fun experience, but at the same time, it can be daunting. There is a lot of uncertainty around research questions and whether something will work or not. I wish I had been a little easier on myself and not taken everything personally. Like, if an idea didn’t work, instead of spending months trying to make it work, it is okay to give up and move in a different direction.

Q: What are your tips for people who want to pursue a PhD?

One of the things I learned during my PhD is to focus on what you care about. There are hundreds of researchers who might work on slightly dense areas, while your work can feel niche. This is not a problem. When I started working on NLP and creativity, the field still felt very young, but over the past three to four years, it has grown tremendously.

Your advisor will be one of the most important people in your PhD. It is essential to have good communication and working chemistry with them. One of the reasons my PhD felt like so much fun is because my advisor and I cared about the same problems.

Form a community and foster friendships with your lab mates, talk about research, or email a colleague whose work moved you and get a coffee with them at a conference. Also, try for opportunities to work with people in your lab or your community. It helps us learn so much.

12 Research Papers Accepted to EMNLP 2022

Papers from CS researchers were accepted to the Empirical Methods in Natural Language Processing (EMNLP) 2022. EMNLP is a leading conference in artificial intelligence and natural language processing. Aside from presenting their research papers, several researchers also organized workshops to gather conference attendees for discussions about current issues confronting NLP and computer science.   


Massively Multilingual Natural Language Understanding
Jack FitzGerald Amazon Alexa, Kay Rottmann Amazon Alexa, Julia Hirschberg Columbia University, Mohit Bansal University of North Carolina, Anna Rumshisky University of Massachusetts Lowell, and Charith Peris Amazon Alexa

3rd Workshop on Figurative Language Processing
Debanjan Ghosh Educational Testing Service, Beata Beigman Klebanov Educational Testing Service, Smaranda Muresan Columbia University, Anna Feldman Montclair State University, Soujanya Poria Singapore University of Technology and Design, and Tuhin Chakrabarty Columbia University

Sharing Stories and Lessons Learned
Diyi Yang Stanford University, Pradeep Dasigi Allen Institute for AI, Sherry Tongshuang Wu Carnegie Mellon University, Tuhin Chakrabarty Columbia University, Yuval Pinter Ben-Gurion University of the Negev, and Mike Zheng Shou National University of Singapore

Accepted Papers

Help me write a Poem – Instruction Tuning as a Vehicle for Collaborative Poetry Writing
Tuhin Chakrabarty Columbia University, Vishakh Padmakumar New York University, He He New York University

Recent work in training large language models (LLMs) to follow natural language instructions has opened up exciting opportunities for natural language interface design. Building on the prior success of large language models in the realm of computer assisted creativity, in this work, we present CoPoet, a collaborative poetry writing system, with the goal of to study if LLM’s actually improve the quality of the generated content. In contrast to auto-completing a user’s text, CoPoet is controlled by user instructions that specify the attributes of the desired text, such as Write a sentence about ‘love’ or Write a sentence ending in ‘fly’. The core component of our system is a language model fine-tuned on a diverse collection of instructions for poetry writing. Our model is not only competitive to publicly available LLMs trained on instructions (InstructGPT), but also capable of satisfying unseen compositional instructions. A study with 15 qualified crowdworkers shows that users successfully write poems with CoPoet on diverse topics ranging from Monarchy to Climate change, which are preferred by third-party evaluators over poems written without the system.

FLUTE: Figurative Language Understanding through Textual Explanations
Tuhin Chakrabarty Columbia University, Arkadiy Saakyan Columbia University, Debanjan Ghosh Educational Testing Service, and Smaranda Muresan Columbia University

Figurative language understanding has been recently framed as a recognizing textual entailment (RTE) task (a.k.a. natural language inference (NLI)). However, similar to classical RTE/NLI datasets they suffer from spurious correlations and annotation artifacts. To tackle this problem, work on NLI has built explanation-based datasets such as eSNLI, allowing us to probe whether language models are right for the right reasons. Yet no such data exists for figurative language, making it harder to assess genuine understanding of such expressions. To address this issue, we release FLUTE, a dataset of 9,000 figurative NLI instances with explanations, spanning four categories: Sarcasm, Simile, Metaphor, and Idioms. We collect the data through a Human-AI collaboration framework based on GPT-3, crowd workers, and expert annotators. We show how utilizing GPT-3 in conjunction with human annotators (novices and experts) can aid in scaling up the creation of datasets even for such complex linguistic phenomena as figurative language. The baseline performance of the T5 model fine-tuned on FLUTE shows that our dataset can bring us a step closer to developing models that understand figurative language through textual explanations.

Fine-tuned Language Models are Continual Learners
Thomas Scialom Columbia University, Tuhin Chakrabarty Columbia University, and Smaranda Muresan Columbia University

Recent work on large language models relies on the intuition that most natural language processing tasks can be described via natural language instructions and that models trained on these instructions show strong zero-shot performance on several standard datasets. However, these models even though impressive still perform poorly on a wide range of tasks outside of their respective training and evaluation sets.To address this limitation, we argue that a model should be able to keep extending its knowledge and abilities, without forgetting previous skills. In spite of the limited success of Continual Learning, we show that Fine-tuned Language Models can be continual learners.We empirically investigate the reason for this success and conclude that Continual Learning emerges from self-supervision pre-training. Our resulting model Continual-T0 (CT0) is able to learn 8 new diverse language generation tasks, while still maintaining good performance on previous tasks, spanning in total of 70 datasets. Finally, we show that CT0 is able to combine instructions in ways it was never trained for, demonstrating some level of instruction compositionality.

Multitask Instruction-based Prompting for Fallacy Recognition
Tariq Alhindi Columbia University, Tuhin Chakrabarty Columbia University, Elena Musi University of Liverpool, and Smaranda Muresan Columbia University

Fallacies are used as seemingly valid arguments to support a position and persuade the audience about its validity. Recognizing fallacies is an intrinsically difficult task both for humans and machines. Moreover, a big challenge for computational models lies in the fact that fallacies are formulated differently across the datasets with differences in the input format (e.g., question-answer pair, sentence with fallacy fragment), genre (e.g., social media, dialogue, news), as well as types and number of fallacies (from 5 to 18 types per dataset). To move towards solving the fallacy recognition task, we approach these differences across datasets as multiple tasks and show how instruction-based prompting in a multitask setup based on the T5 model improves the results against approaches built for a specific dataset such as T5, BERT or GPT-3. We show the ability of this multitask prompting approach to recognize 28 unique fallacies across domains and genres and study the effect of model size and prompt choice by analyzing the per-class (i.e., fallacy type) results. Finally, we analyze the effect of annotation quality on model performance, and the feasibility of complementing this approach with external knowledge.

CONSISTENT: Open-Ended Question Generation From News Articles
Tuhin Chakrabarty Columbia University, Justin Lewis The New York Times R&D, and Smaranda Muresan Columbia University

Recent work on question generation has largely focused on factoid questions such as who, what, where, when about basic facts. Generating open-ended why, how, what, etc. questions that require long-form answers have proven more difficult. To facilitate the generation of open-ended questions, we propose CONSISTENT, a new end-to-end system for generating open-ended questions that are answerable from and faithful to the input text. Using news articles as a trustworthy foundation for experimentation, we demonstrate our model’s strength over several baselines using both automatic and human=based evaluations. We contribute an evaluation dataset of expert-generated open-ended questions.We discuss potential downstream applications for news media organizations.

SafeText: A Benchmark for Exploring Physical Safety in Language Models
Sharon Levy University of California, Santa Barbara, Emily Allaway Columbia University, Melanie Subbiah Columbia University, Lydia Chilton Columbia University, Desmond Patton Columbia University, Kathleen McKeown Columbia University, and William Yang Wang University of California, Santa Barbara

Understanding what constitutes safe text is an important issue in natural language processing and can often prevent the deployment of models deemed harmful and unsafe. One such type of safety that has been scarcely studied is commonsense physical safety, i.e. text that is not explicitly violent and requires additional commonsense knowledge to comprehend that it leads to physical harm. We create the first benchmark dataset, SafeText, comprising real-life scenarios with paired safe and physically unsafe pieces of advice. We utilize SafeText to empirically study commonsense physical safety across various models designed for text generation and commonsense reasoning tasks. We find that state-of-the-art large language models are susceptible to the generation of unsafe text and have difficulty rejecting unsafe advice. As a result, we argue for further studies of safety and the assessment of commonsense physical safety in models before release.

Learning to Revise References for Faithful Summarization
Griffin Adams Columbia University, Han-Chin Shing Amazon AWS AI, Qing Sun Amazon AWS AI, Christopher Winestock Amazon AWS AI, Kathleen McKeown Columbia University, and Noémie Elhadad Columbia University

In real-world scenarios with naturally occurring datasets, reference summaries are noisy and may contain information that cannot be inferred from the source text. On large news corpora, removing low quality samples has been shown to reduce model hallucinations. Yet, for smaller, and/or noisier corpora, filtering is detrimental to performance. To improve reference quality while retaining all data, we propose a new approach: to selectively re-write unsupported reference sentences to better reflect source data. We automatically generate a synthetic dataset of positive and negative revisions by corrupting supported sentences and learn to revise reference sentences with contrastive learning. The intensity of revisions is treated as a controllable attribute so that, at inference, diverse candidates can be over-generated-then-rescored to balance faithfulness and abstraction. To test our methods, we extract noisy references from publicly available MIMIC-III discharge summaries for the task of hospital-course summarization, and vary the data on which models are trained. According to metrics and human evaluation, models trained on revised clinical references are much more faithful, informative, and fluent than models trained on original or filtered data.

Mitigating Covertly Unsafe Text within Natural Language Systems
Alex Mei University of California, Santa Barbara, Anisha Kabir University of California, Santa Barbara, Sharon Levy University of California, Santa Barbara, Melanie Subbiah Columbia University, Emily Allaway Columbia University, John N. Judge University of California, Santa Barbara, Desmond Patton University of Pennsylvania, Bruce Bimber University of California, Santa Barbara, Kathleen McKeown Columbia University, and William Yang Wang University of California, Santa Barbara

An increasingly prevalent problem for intelligent technologies is text safety, as uncontrolled systems may generate recommendations to their users that lead to injury or life-threatening consequences. However, the degree of explicitness of a generated statement that can cause physical harm varies. In this paper, we distinguish types of text that can lead to physical harm and establish one particularly underexplored category: covertly unsafe text. Then, we further break down this category with respect to the system’s information and discuss solutions to mitigate the generation of text in each of these subcategories. Ultimately, our work defines the problem of covertly unsafe language that causes physical harm and argues that this subtle yet dangerous issue needs to be prioritized by stakeholders and regulators. We highlight mitigation strategies to inspire future researchers to tackle this challenging problem and help improve safety within smart systems.

Affective Idiosyncratic Responses to Music
Sky CH-Wang Columbia University, Evan Li Columbia University, Oliver Li Columbia University, Smaranda Muresan Columbia University, and Zhou Yu Columbia University

Affective responses to music are highly personal. Despite consensus that idiosyncratic factors play a key role in regulating how listeners emotionally respond to music, precisely measuring the marginal effects of these variables has proved challenging. To address this gap, we develop computational methods to measure affective responses to music from over 403M listener comments on a Chinese social music platform. Building on studies from music psychology in systematic and quasi-causal analyses, we test for musical, lyrical, contextual, demographic, and mental health effects that drive listener affective responses. Finally, motivated by the social phenomenon known as 网抑云 (wǎng-yì-yún), we identify influencing factors of platform user self-disclosures, the social support they receive, and notable differences in discloser user activity.

Robots-Dont-Cry: Understanding Falsely Anthropomorphic Utterances in Dialog Systems
David Gros University of California, Davis, Yu Li Columbia University, and Zhou Yu Columbia University

Dialog systems are often designed or trained to output human-like responses. However, some responses may be impossible for a machine to truthfully say (e.g. “that movie made me cry”). Highly anthropomorphic responses might make users uncomfortable or implicitly deceive them into thinking they are interacting with a human. We collect human ratings on the feasibility of approximately 900 two-turn dialogs sampled from 9 diverse data sources. Ratings are for two hypothetical machine embodiments: a futuristic humanoid robot and a digital assistant. We find that for some data-sources commonly used to train dialog systems, 20-30% of utterances are not viewed as possible for a machine. Rating is marginally affected by machine embodiment. We explore qualitative and quantitative reasons for these ratings. Finally, we build classifiers and explore how modeling configuration might affect output permissibly, and discuss implications for building less falsely anthropomorphic dialog systems.

Just Fine-tune Twice: Selective Differential Privacy for Large Language Models
Weiyan Shi Columbia University, Ryan Patrick Shea Columbia University, Si Chen Columbia University, Chiyuan Zhang Google Research, Ruoxi Jia Virginia Tech, and Zhou Yu Columbia University

Protecting large language models from privacy leakage is becoming increasingly crucial with their wide adoption in real-world products. Yet applying *differential privacy* (DP), a canonical notion with provable privacy guarantees for machine learning models, to those models remains challenging due to the trade-off between model utility and privacy loss. Utilizing the fact that sensitive information in language data tends to be sparse, Shi et al. (2021) formalized a DP notion extension called *Selective Differential Privacy* (SDP) to protect only the sensitive tokens defined by a policy function. However, their algorithm only works for RNN-based models. In this paper, we develop a novel framework, *Just Fine-tune Twice* (JFT), that achieves SDP for state-of-the-art large transformer-based models. Our method is easy to implement: it first fine-tunes the model with *redacted* in-domain data, and then fine-tunes it again with the *original* in-domain data using a private training mechanism. Furthermore, we study the scenario of imperfect implementation of policy functions that misses sensitive tokens and develop systematic methods to handle it. Experiments show that our method achieves strong utility compared to previous baselines. We also analyze the SDP privacy guarantee empirically with the canary insertion attack.

Focus! Relevant and Sufficient Context Selection for News Image Captioning
Mingyang Zhou University of California, Davis, Grace Luo University of California, Berkeley, Anna Rohrbach University of California, Berkeley, and Zhou Yu Columbia University

News Image Captioning requires describing an image by leveraging additional context from a news article. Previous works only coarsely leverage the article to extract the necessary context, which makes it challenging for models to identify relevant events and named entities. In our paper, we first demonstrate that by combining more fine-grained context that captures the key named entities (obtained via an oracle) and the global context that summarizes the news, we can dramatically improve the model’s ability to generate accurate news captions. This begs the question, how to automatically extract such key entities from an image? We propose to use the pre-trained vision and language retrieval model CLIP to localize the visually grounded entities in the news article and then capture the non-visual entities via an open relation extraction model. Our experiments demonstrate that by simply selecting a better context from the article, we can significantly improve the performance of existing models and achieve new state-of-the-art performance on multiple benchmarks.

6 Papers from CS Researchers Accepted to NAACL 2022

Researchers from the department presented natural language processing (NLP) papers at the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2022).

Selective Differential Privacy for Language Models
Weiyan Shi, Aiqi Cui, Evan Li, Ruoxi Jia, Zhou Yu

With the increasing applications of language models, it has become crucial to protect these models from leaking private information. Previous work has attempted to tackle this challenge by training RNN-based language models with differential privacy guarantees. However, applying classical differential privacy to language models leads to poor model performance as the underlying privacy notion is over-pessimistic and provides undifferentiated protection for all tokens in the data. Given that the private information in natural language is sparse (for example, the bulk of an email might not carry personally identifiable information), we propose a new privacy notion, selective differential privacy, to provide rigorous privacy guarantees on the sensitive portion of the data to improve model utility. To realize such a new notion, we develop a corresponding privacy mechanism, Selective-DPSGD, for RNN-based language models. Besides language modeling, we also apply the method to a more concrete application–dialog systems. Experiments on both language modeling and dialog system building show that the proposed privacy-preserving mechanism achieves better utilities while remaining safe under various privacy attacks compared to the baselines. The data and code are released at this HTTPS URL to facilitate future research.

Knowledge-Grounded Dialogue Generation with a Unified Knowledge Representation
Yu Li, Baolin Peng, Yelong Shen, Yi Mao, Lars Liden, Zhou Yu, Jianfeng Gao

Knowledge-grounded dialogue systems are challenging to build due to the lack of training data and heterogeneous knowledge sources. Existing systems perform poorly on unseen topics due to limited topics covered in the training data. In addition, heterogeneous knowledge sources make it challenging for systems to generalize to other tasks because knowledge sources in different knowledge representations require different knowledge encoders. To address these challenges, we present PLUG, a language model that homogenizes different knowledge sources to a unified knowledge representation for knowledge-grounded dialogue generation tasks. PLUG is pre-trained on a dialogue generation task conditioned on a unified essential knowledge representation. It can generalize to different downstream knowledge-grounded dialogue generation tasks with a few training examples. The empirical evaluation on two benchmarks shows that our model generalizes well across different knowledge-grounded tasks. It can achieve comparable performance with state-of-the-art methods under a fully-supervised setting and significantly outperforms other methods in zero-shot and few-shot settings.

Database Search Results Disambiguation for Task-Oriented Dialog Systems
Kun Qian, Ahmad Beirami, Satwik Kottur, Shahin Shayandeh, Paul Crook, Alborz Geramifard, Zhou Yu, Chinnadhurai Sankar

As task-oriented dialog systems are becoming increasingly popular in our lives, more realistic tasks have been proposed and explored. However, new practical challenges arise. For instance, current dialog systems cannot effectively handle multiple search results when querying a database, due to the lack of such scenarios in existing public datasets. In this paper, we propose Database Search Result (DSR) Disambiguation, a novel task that focuses on disambiguating database search results, which enhances user experience by allowing them to choose from multiple options instead of just one. To study this task, we augment the popular task-oriented dialog datasets (MultiWOZ and SGD) with turns that resolve ambiguities by (a) synthetically generating turns through a pre-defined grammar, and (b) collecting human paraphrases for a subset. We find that training on our augmented dialog data improves the model’s ability to deal with ambiguous scenarios, without sacrificing performance on unmodified turns. Furthermore, pre-fine tuning and multi-task learning help our model to improve performance on DSRdisambiguation even in the absence of indomain data, suggesting that it can be learned as a universal dialog skill. Our data and code will be made publicly available.

ErAConD: Error Annotated Conversational Dialog Dataset for Grammatical Error Correction
Xun Yuan, Sam Pham, Sam Davidson, Zhou Yu

Currently available grammatical error correction (GEC) datasets are compiled using well-formed written text, limiting the applicability of these datasets to other domains such as informal writing and dialog. In this paper, we present a novel parallel GEC dataset drawn from open-domain chatbot conversations; this dataset is, to our knowledge, the first GEC dataset targeted to a conversational setting. To demonstrate the utility of the dataset, we use our annotated data to fine-tune a state-of-the-art GEC model, resulting in a 16-point increase in model precision. This is of particular importance in a GEC model, as model precision is considered more important than recall in GEC tasks since false positives could lead to serious confusion in language learners. We also present a detailed annotation scheme which ranks errors by perceived impact on comprehensibility, making our dataset both reproducible and extensible. Experimental results show the effectiveness of our data in improving GEC model performance in conversational scenarios.

Improving Conversational Recommendation Systems’ Quality with Context-Aware Item Meta-Information
Bowen Yang, Cong Han, Yu Li, Lei Zuo, Zhou Yu

Conversational recommendation systems (CRS) engage with users by inferring user preferences from dialog history, providing accurate recommendations, and generating appropriate responses. Previous CRSs use knowledge graph (KG) based recommendation modules and integrate KG with language models for response generation. Although KG-based approaches prove effective, two issues remain to be solved. First, KG-based approaches ignore the information in the conversational context but only rely on entity relations and bag of words to recommend items. Second, it requires substantial engineering efforts to maintain KGs that model domain-specific relations, thus leading to less flexibility. In this paper, we propose a simple yet effective architecture comprising a pre-trained language model (PLM) and an item metadata encoder. The encoder learns to map item metadata to embeddings that can reflect the semantic information in the dialog context. The PLM then consumes the semantic-aligned item embeddings together with dialog context to generate high-quality recommendations and responses. Instead of modeling entity relations with KGs, our model reduces engineering complexity by directly converting each item to an embedding. Experimental results on the benchmark dataset ReDial show that our model obtains state-of-the-art results on both recommendation and response generation tasks.

Differentially private decoding in large language models
By Jimit Majmudar, Christophe Dupuy, Charith Peris, Sami Smaili, Rahul Gupta, Richard Zemel

Recent large-scale natural language processing (NLP) systems use a pre-trained Large Language Model (LLM) on massive and diverse corpora as a headstart. In practice, the pre-trained model is adapted to a wide array of tasks via fine-tuning on task-specific datasets. LLMs, while effective, have been shown to memorize instances of training data thereby potentially revealing private information processed during pre-training. The potential leakage might further propagate to the downstream tasks for which LLMs are fine-tuned. On the other hand, privacy-preserving algorithms usually involve retraining from scratch, which is prohibitively expensive for LLMs. In this work, we propose a simple, easy to interpret, and computationally lightweight perturbation mechanism to be applied to an already trained model at the decoding stage. Our perturbation mechanism is model-agnostic and can be used in conjunction with any LLM. We provide a theoretical analysis showing that the proposed mechanism is differentially private, and experimental results show a privacy-utility trade-off.

Research from the NLP & Speech Group Accepted to ACL 2022

CS researchers presented their work at the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022).

Faithful or Extractive? On Mitigating the Faithfulness-Abstractiveness Trade-off in Abstractive Summarization
Faisal Ladhak Columbia University, Esin Durmus Stanford University, He He New York University, Claire Cardie Cornell University, Kathleen McKeown Columbia University

Despite recent progress in abstractive summarization, systems still suffer from faithfulness errors. While prior work has proposed models that improve faithfulness, it is unclear whether the improvement comes from an increased level of extractiveness of the model outputs as one naive way to improve faithfulness is to make summarization models more extractive. In this work, we present a framework for evaluating the effective faithfulness of summarization systems, by generating a faithfulness-abstractiveness trade-off curve that serves as a control at different operating points on the abstractiveness spectrum. We then show that the baseline system as well as recently proposed methods for improving faithfulness, fail to consistently improve over the control at the same level of abstractiveness.

Finally, we learn a selector to identify the most faithful and abstractive summary for a given document, and show that this system can attain higher faithfulness scores in human evaluations while being more abstractive than the baseline system on two datasets. Moreover, we show that our system is able to achieve a better faithfulness-abstractiveness trade-off than the control at the same level of abstractiveness.

Towards Learning (Dis)-Similarity of Source Code from Program Contrasts
Yangruibo Ding Columbia University, Luca Buratti IBM Research, Saurabh Pujar IBM Research, Alessandro Morari IBM Research, Baishakhi Ray Columbia University, and Saikat Chakraborty Columbia University

Understanding the functional (dis)-similarity of source code is significant for code modeling tasks such as software vulnerability and code clone detection. We present DISCO (DISsimilarity of COde), a novel self-supervised model focusing on identifying (dis)similar functionalities of source code. Different from existing works, our approach does not require a huge amount of randomly collected datasets. Rather, we design structure-guided code transformation algorithms to generate synthetic code clones and inject real-world security bugs, augmenting the collected datasets in a targeted way.

We propose to pre-train the Transformer model with such automatically generated program contrasts to better identify similar code in the wild and differentiate vulnerable programs from benign ones. To better capture the structural features of source code, we propose a new cloze objective to encode the local tree-based context (e.g., parents or sibling nodes). We pre-train our model with a much smaller dataset, the size of which is only 5% of the state-of-the-art models’ training datasets, to illustrate the effectiveness of our data augmentation and the pre-training approach. The evaluation shows that, even with much less data, DISCO can still outperform the state-of-the-art models in vulnerability and code clone detection tasks.

Fantastic Questions and Where to Find Them: FairytaleQA — An Authentic Dataset for Narrative Comprehension
Ying Xu University of California Irvine, Dakuo Wang IBM Research, Mo Yu WeChat AI/Tencent, Daniel Ritchie University of California Irvine, Bingsheng Yao Rensselaer Polytechnic Institute, Tongshuang Wu University of Washington, Zheng Zhang University of Notre Dame, Toby Jia-Jun Li University of Notre Dame, Nora Bradford University of California Irvine, Branda Sun University of California Irvine, Tran Bao Hoang University of California Irvine, Yisi Sang Syracuse University, Yufang Hou IBM Research Ireland, Xiaojuan Ma Hong Kong University of Science and Technology, Diyi Yang Georgia Institute of Technology, Nanyun Peng University of California Los Angeles, Zhou Yu Columbia University, Mark Warschauer University of California Irvine

Question answering (QA) is a fundamental means to facilitate the assessment and training of narrative comprehension skills for both machines and young children, yet there is scarcity of high-quality QA datasets carefully designed to serve this purpose. In particular, existing datasets rarely distinguish fine-grained reading skills, such as the understanding of varying narrative elements. Drawing on the reading education research, we introduce FairytaleQA1, a dataset focusing on the narrative comprehension of kindergarten to eighth-grade students.

Generated by educational experts based on an evidence-based theoretical framework, FairytaleQA consists of 10,580 explicit and implicit questions derived from 278 children-friendly stories, covering seven types of narrative elements or relations. Our dataset is valuable in two folds: First, we ran existing QA models on our dataset and confirmed that this annotation helps assess models’ fine-grained learning skills. Second, the dataset supports question generation (QG) task in the education domain. Through benchmarking with QG models, we show that the QG model trained on FairytaleQA is capable of asking high-quality and more diverse questions.

Effective Unsupervised Constrained Text Generation based on Perturbed Masking
Yingwen Fu Guangdong University of Foreign Studies/NetEase Games AI Lab, Wenjie Ou NetEase Games AI Lab, Zhou Yu Columbia University, Yue Lin NetEase Games AI Lab

Unsupervised constrained text generation aims to generate text under a given set of constraints without any supervised data. Current state-of-the-art methods stochastically sample edit positions and actions, which may cause unnecessary search steps.

In this paper, we propose PMCTG to improve effectiveness by searching for the best edit position and action in each step. Specifically, PMCTG extends perturbed masking technique to effectively search for the most incongruent token to edit. Then it introduces four multi-aspect scoring functions to select edit action to further reduce search difficulty. Since PMCTG does not require supervised data, it could be applied to different generation tasks. We show that under the unsupervised setting, PMCTG achieves new state-of-the-art results in two representative tasks, namely keywords-to-sentence generation and paraphrasing.

11 Research Papers Accepted to EMNLP 2021

Papers from CS researchers were accepted to the Empirical Methods in Natural Language Processing (EMNLP) 2021. The Best Short Paper Award was also awarded to a paper from the Spoken Language
Processing Group.

Best Short Paper Award
CHoRaL: Collecting Humor Reaction Labels from Millions of Social Media Users
Zixiaofan Yang, Shayan Hooshmand and Julia Hirschberg


Humor detection has gained attention in recent years due to the desire to understand user-generated content with figurative language. However, substantial individual and cultural differences in humor perception make it very difficult to collect a large-scale humor dataset with reliable humor labels. We propose CHoRaL, a framework to generate perceived humor labels on Facebook posts, using the naturally available user reactions to these posts with no manual annotation needed. CHoRaL provides both binary labels and continuous scores of humor and non-humor. We present the largest dataset to date with labeled humor on 785K posts related to COVID-19. Additionally, we analyze the expression of COVID-related humor in social media by extracting lexico-semantic and affective features from the posts, and build humor detection models with performance similar to humans. CHoRaL enables the development of large-scale humor detection models on any topic and opens a new path to the study of humor on social media.


A Bag of Tricks for Dialogue Summarization
Muhammad Khalifa, Miguel Ballesteros and Kathleen McKeown


Dialogue summarization comes with its own peculiar challenges as opposed to news or scientific articles summarization. In this work, we explore four different challenges of the task: handling and differentiating parts of the dialogue belonging to multiple speakers, negation understanding, reasoning about the situation, and informal language understanding. Using a pretrained sequence-to-sequence language model, we explore speaker name substitution, negation scope highlighting, multi-task learning with relevant tasks, and pretraining on in-domain data. Our experiments show that our proposed techniques indeed improve summarization performance, outperforming strong baselines.


Timeline Summarization based on Event Graph Compression via Time-Aware Optimal Transport
Manling Li, Tengfei Ma, Mo Yu, Lingfei Wu, Tian Gao, Heng Ji and Kathleen McKeown


Timeline Summarization identifies major events from a news collection and describes them following temporal order, with key dates tagged. Previous methods generally generate summaries separately for each date after they determine the key dates of events. These methods overlook the events’ intra-structures (arguments) and inter-structures (event-event connections). Following a different route, we propose to represent the news articles as an event-graph, thus the summarization task becomes compressing the whole graph to its salient sub-graph. The key hypothesis is that the events connected through shared arguments and temporal order depict the skeleton of a timeline, containing events that are semantically related, structurally salient, and temporally coherent in the global event graph. A time-aware optimal transport distance is then introduced for learning the compression model in an unsupervised manner. We show that our approach significantly improves the state of the art on three real-world datasets, including two public standard benchmarks and our newly collected Timeline100 dataset.


Don’t Go Far Off: An Empirical Study on Neural Poetry Translation
Tuhin Chakrabarty, Arkadiy Saakyan and Smaranda Muresan


Despite constant improvements in machine translation quality, automatic poetry translation remains a challenging problem due to the lack of open-sourced parallel poetic corpora, and to the intrinsic complexities involved in preserving the semantics, style and figurative nature of poetry. We present an empirical investigation for poetry translation along several dimensions: 1) size and style of training data (poetic vs. non-poetic), including a zeroshot setup; 2) bilingual vs. multilingual learning; and 3) language-family-specific models vs. mixed-language-family models. To accomplish this, we contribute a parallel dataset of poetry translations for several language pairs. Our results show that multilingual fine-tuning on poetic text significantly outperforms multilingual fine-tuning on non-poetic text that is 35X larger in size, both in terms of automatic metrics (BLEU, BERTScore, COMET) and human evaluation metrics such as faithfulness (meaning and poetic style). Moreover, multilingual fine-tuning on poetic data outperforms bilingual fine-tuning on poetic data.


Implicit Premise Generation with Discourse-aware Commonsense Knowledge Models
Tuhin Chakrabarty, Aadit Trivedi and Smaranda Muresan


Enthymemes are defined as arguments where a premise or conclusion is left implicit. We tackle the task of generating the implicit premise in an enthymeme, which requires not only an understanding of the stated conclusion and premise, but also additional inferences that could depend on commonsense knowledge. The largest available dataset for enthymemes (Habernal et al., 2018) consists of 1.7k samples, which is not large enough to train a neural text generation model. To address this issue, we take advantage of a similar task and dataset: Abductive reasoning in narrative text (Bhagavatula et al., 2020). However, we show that simply using a state-of-the-art seq2seq model fine-tuned on this data might not generate meaningful implicit premises associated with the given enthymemes. We demonstrate that encoding discourse-aware commonsense during fine-tuning improves the quality of the generated implicit premises and outperforms all other baselines both in automatic and human evaluations on three different datasets.


GOLD: Improving Out-of-Scope Detection in Dialogues using Data Augmentation
Derek Chen and Zhou Yu


Practical dialogue systems require robust methods of detecting out-of-scope (OOS) utterances to avoid conversational breakdowns and related failure modes. Directly training a model with labeled OOS examples yields reasonable performance, but obtaining such data is a resource-intensive process. To tackle this limited-data problem, previous methods focus on better modeling the distribution of in-scope (INS) examples. We introduce GOLD as an orthogonal technique that augments existing data to train better OOS detectors operating in low-data regimes. GOLD generates pseudo-labeled candidates using samples from an auxiliary dataset and keeps only the most beneficial candidates for training through a novel filtering mechanism. In experiments across three target benchmarks, the top GOLD model outperforms all existing methods on all key metrics, achieving relative gains of 52.4%, 48.9% and 50.3% against median baseline performance. We also analyze the unique properties of OOS data to identify key factors for optimally applying our proposed method.


Continual Learning in Task-Oriented Dialogue Systems
Andrea Madotto, Zhaojiang Lin, Zhenpeng Zhou, Seungwhan Moon, Paul Crook, Bing Liu, Zhou Yu, Eunjoon Cho, Pascale Fung, and Zhiguang Wang


Continual learning in task-oriented dialogue systems can allow us to add new domains and functionalities through time without incurring the high cost of a whole system retraining. In this paper, we propose a continual learning benchmark for task-oriented dialogue systems with 37 domains to be learned continuously in four settings, such as intent recognition, state tracking, natural language generation, and end-to-end. Moreover, we implement and compare multiple existing continual learning baselines, and we propose a simple yet effective architectural method based on residual adapters. Our experiments demonstrate that the proposed architectural method and a simple replay-based strategy perform comparably well but they both achieve inferior performance to the multi-task learning baseline, in where all the data are shown at once, showing that continual learning in task-oriented dialogue systems is a challenging task. Furthermore, we reveal several trade-off between different continual learning methods in term of parameter usage and memory size, which are important in the design of a task-oriented dialogue system. The proposed benchmark is released together with several baselines to promote more research in this direction.


Zero-Shot Dialogue State Tracking via Cross-Task Transfer
Zhaojiang Lin, Bing Liu, Andrea Madotto, Seungwhan Moon, Zhenpeng Zhou, Paul Crook, Zhiguang Wang, Zhou Yu, Eunjoon Cho, Rajen Subba, and Pascale Fung


Zero-shot transfer learning for dialogue state tracking (DST) enables us to handle a variety of task-oriented dialogue domains without the expense of collecting in-domain data. In this work, we propose to transfer the crosstask knowledge from general question answering (QA) corpora for the zero-shot DST task. Specifically, we propose TransferQA, a transferable generative QA model that seamlessly combines extractive QA and multichoice QA via a text-to-text transformer framework, and tracks both categorical slots and non-categorical slots in DST. In addition, we introduce two effective ways to construct unanswerable questions, namely, negative question sampling and context truncation, which enable our model to handle “none” value slots in the zero-shot DST setting. The extensive experiments show that our approaches substantially improve the existing zero-shot and few-shot results on MultiWoz. Moreover, compared to the fully trained baseline on the Schema-Guided Dialogue dataset, our approach shows better generalization ability in unseen domains.Zero-shot transfer learning for dialogue state tracking (DST) enables us to handle a variety of task-oriented dialogue domains without the expense of collecting in-domain data. In this work, we propose to transfer the crosstask knowledge from general question answering (QA) corpora for the zero-shot DST task. Specifically, we propose TransferQA, a transferable generative QA model that seamlessly combines extractive QA and multichoice QA via a text-to-text transformer framework, and tracks both categorical slots and non-categorical slots in DST. In addition, we introduce two effective ways to construct unanswerable questions, namely, negative question sampling and context truncation, which enable our model to handle “none” value slots in the zero-shot DST setting. The extensive experiments show that our approaches substantially improve the existing zero-shot and few-shot results on MultiWoz. Moreover, compared to the fully trained baseline on the Schema-Guided Dialogue dataset, our approach shows better generalization ability in unseen domains.


Refine and Imitate: Reducing Repetition and Inconsistency in Persuasion Dialogues via Reinforcement Learning and Human Demonstration
Weiyan Shi, Yu Li, Saurav Sahay, and Zhou Yu


Despite the recent success of large-scale language models on various downstream NLP tasks, the repetition and inconsistency problems still persist in dialogue response generation. Previous approaches have attempted to avoid repetition by penalizing the language model’s undesirable behaviors in the loss function. However, these methods focus on tokenlevel information and can lead to incoherent responses and uninterpretable behaviors. To alleviate these issues, we propose to apply reinforcement learning to refine an MLE-based language model without user simulators, and distill sentence-level information about repetition, inconsistency and task relevance through rewards. In addition, to better accomplish the dialogue task, the model learns from human demonstration to imitate intellectual activities such as persuasion, and selects the most persuasive responses. Experiments show that our model outperforms previous state-of-the-art dialogue models on both automatic metrics and human evaluation results on a donation persuasion task, and generates more diverse, consistent and persuasive conversations according to the user feedback.


Attribute Alignment: Controlling Text Generation from Pre-trained Language Models
Dian Yu, Zhou Yu, and Kenji Sagae


Large language models benefit from training with a large amount of unlabeled text, which gives them increasingly fluent and diverse generation capabilities. However, using these models for text generation that takes into account target attributes, such as sentiment polarity or specific topics, remains a challenge. We propose a simple and flexible method for controlling text generation by aligning disentangled attribute representations. In contrast to recent efforts on training a discriminator to perturb the token level distribution for an attribute, we use the same data to learn an alignment function to guide the pre-trained, non-controlled language model to generate texts with the target attribute without changing the original language model parameters. We evaluate our method on sentiment- and topiccontrolled generation, and show large performance gains over previous methods while retaining fluency and diversity.


KERS: A Knowledge-Enhanced Framework for Recommendation Dialog Systems with Multiple Subgoals
Jun Zhang, Yan Yang, Chencai Chen, Liang He, and Zhou Yu


Recommendation dialogs require the system to build a social bond with users to gain trust and develop affinity in order to increase the chance of a successful recommendation. It is beneficial to divide up, such conversations with multiple subgoals (such as social chat, question answering, recommendation, etc.), so that the system can retrieve appropriate knowledge with better accuracy under different subgoals. In this paper, we propose a unified framework for common knowledge-based multi-subgoal dialog: knowledge-enhanced multi-subgoal driven recommender system (KERS). We first predict a sequence of subgoals and use them to guide the dialog model to select knowledge from a sub-set of existing knowledge graph. We then propose three new mechanisms to filter noisy knowledge and to enhance the inclusion of cleaned knowledge in the dialog response generation process. Experiments show that our method obtains state-of-the-art results on DuRecDial dataset in both automatic and human evaluation.

Heroes of Natural Language Processing

Professor Kathy McKeown talks with DeepLearning.AI’s Andrew Ng about how she started in artificial intelligence (AI), her research projects, how her understanding of AI has changed through the decades, and AI career advice for learners of NLP. 

Research That Studied Data From Social Media, Automatic Summarization, and Spatial Relations Accepted to NAACL 2019

The annual conference of the North American Chapter of the Association for Computational Linguistics (NAACL) is the preeminent event in the field of natural language processing. CS researchers in professor Julia Hirschberg’s group won a Best Paper award for a novel resource, SpatialNet, which provides a formal representation of how a language expresses spatial relations. Other accepted papers are detailed below.

Linguistic Analysis of Schizophrenia in Reddit Posts
Jonathan Zomick Hofstra University, Sarah Ita Levitan Columbia University, Mark Serper Hofstra University Mount Sinai School of Medicine

The paper was presented at the Sixth Annual Workshop on Computational Linguistics and Clinical Psychology, at NAACL.

The researchers identified and analyzed unique linguistic characteristics of Reddit posts written by users who claim to have received a diagnosis for schizophrenia. The findings were interpreted in the context of established schizophrenia symptoms and compared with results from previous research that has looked at schizophrenia and language on social media platforms.  

The results showed several differences in language usage between users with schizophrenia and a control group. For example, people with schizophrenia used less punctuation in their Reddit posts. Disorganized language use is a prominent and common symptom of schizophrenia.

A machine learning classifier was trained to automatically identify self-identified users with schizophrenia on Reddit, using linguistic cues.  

“We hope that this work contributes toward the ultimate goal of identifying high risk individuals,” said Sara Ita Levitan, a postdoctoral research scientist with the Spoken Language Processing Group. “Early intervention and diagnosis is important to improve overall treatment outcomes for schizophrenia.”

Fixed That for You: Generating Contrastive Claims with Semantic Edits
Christopher Hidey Columbia University and Kathy McKeown Columbia University

For many people, social media is a primary source of information and it can become a key venue for opinionated discussion. In order to evaluate and analyze these discussions, it is important to understand contrast or a difference in opinions.

As a step towards a better understanding of arguments, the researchers developed a method to automatically generate responses to internet comments containing differences in stance. They created a corpus from over one million contrastive claims mined from the social media site Reddit. In order to obtain training data for the models, they extracted pairs of comments containing the acronym FTFY (“fixed that for you”).  

For example, in a discussion over who should be the next President of the United States, one participant might state “Bernie Sanders for president” and another might state “Hillary Clinton for president. FTFY”

A neural network model was trained on the pairs to edit the original claim and produce a new claim with a different view. 

Claim : Bernie Sanders for president
New claim : Hillary Clinton for president.

“One aspect of this problem that was surprising was that the standard ‘sequence-to-sequence with attention’ baseline performed poorly, often just copying the output or selecting generic responses,” said Christopher Hidey, a fourth year PhD student. While generic response generation is a known problem in neural models, their custom model significantly outperformed this baseline in several metrics including novelty and overlap with human-generated responses.

A Robust Abstractive System for Cross-Lingual Summarization
Jessica Ouyang Columbia University, Boya Song Columbia University, and Kathy McKeown Columbia University

The researchers developed an automatic summarization system that specializes in producing English summaries for documents originally written in three low-resource languages – Somali, Swahili, and Tagalog.

There is little natural language processing work done in low-resource languages and machine translation systems for those languages are of lower quality than those for high-resource languages like French or German.

As a result, the translations are often disfluent and contain errors that make them difficult for a human to understand, much less for a summarization system to process.

An example of machine-translated document
originally written in Swahili :

Mange Kimambi ‘I pray for the parliamentary seat for Kinondoni
constituency for ticket of CCM. Not special seats’ Kinondoni
without drugs is possible I pray for the parliamentary seat for
Kinondoni constituency on the ticket of CCM. Yes, it’s not a
special seats, Khuini Kinondoni, what will I do for Kinondoni?
Tension is many I get but we must remember no good that is
available easily. Kinondoni without drugs is possible. As a friend,
fan or patriotism I urge you to grant your contribution to the
situation and propert. You can use Western Union or money to go
to Mange John Kimambi. Account of CRDB Bank is on blog. Reduce
my profile in my blog understand why I have decided to vie for
Kinondoni constituency. you will understand more.

A standard summarization system’s
output on the document :

Mange Kimambi, who pray for parliamentary seat for Kinondoni
constituency for ticket of CCM, is on blog, and not special seats’
Kinondoni without drugs.

The robust summarization system’s output
on the document :

Mange Kimambi, who pray for parliamentary seat for Kinondoni
constituency for ticket of CCM, comments on his plans to vie for
‘Kinondoni’ without drugs.

“We addressed this challenge by creating large collections of synthetic, errorful “translations” that mimic the output of low-quality machine translations,” said Jessica Ouyang, a seventh year PhD student. They paired the problematic text with high-quality, human-written summaries. The experiment showed that a neural network summarizer trained on this synthetic data was able to correct or elide translation errors and produce fluent English summaries. The error-correcting ability of the system extends to Arabic, a new language previously unseen by the system.

IMHO Fine-Tuning Improves Claim Detection
Tuhin Chakrabarty Columbia University, Christopher Hidey Columbia University, and Kathy McKeown Columbia University

Argument mining, or argumentation mining, is a research area within the natural language processing field. Argument mining is applied in many different genres including the qualitative assessment of social media content (e.g. Twitter, Facebook) – where it provides a powerful tool for policy-makers and researchers in social and political sciences – legal documents, product reviews, scientific articles, online debates, newspaper articles, and dialogical domains. One of the main tasks of argument mining is to detect a claim.

Sentences from each dataset and their nearest neighbor in the IMHO dataset

Claims are the central component of an argument. Detecting claims across different domains or data sets can often be challenging due to their varying conceptualization. The researchers set out to alleviate this problem by fine-tuning a language model. They created a corpus mined from Reddit that is composed of 5.5 million opinionated claims. These claims are self-labeled by their authors using the internet acronyms IMO/IMHO or “In My Humble Opinion”.

By fine-tuning the language on the IMHO dataset they were able to obtain a significant improvement on claim detection of the datasets. As these data sets include diverse domains such as social media and student essays, this improvement demonstrates the robustness of fine-tuning on this novel corpus.

The Answer is Language Model Fine-tuning
Tuhin Chakrabarty Columbia University and Smaranda Muresan Columbia University

Community Question Answering forums such as Yahoo! Answers and Quora are popular nowadays, as they represent effective means for communities to share information around particular topics. But the information often shared on these forums may be incorrect or misleading.

The paper presents the ColumbiaNLP submission for the SemEval-2019 Task 8: Fact-Checking in Community Question Answering Forums. The researchers show how fine-tuning a language model on a large unannotated corpus of old threads from the Qatar Living forum helps to classify question types (factual, opinion, socializing) and to judge the factuality of answers on the shared task labeled data from the same forum. Their system finished 4th and 2nd on Subtask A (question type classification) and B (answer factuality prediction), respectively, based on the official metric of accuracy.

Question classification
Factual : The question is asking for factual information,
which can be answered by checking various information
sources, and it is not ambiguous.
e.g. “What is Ooredoo customer service number?”
Opinion : The question asks for an opinion or advice,
not for a fact.
e.g. “Can anyone recommend a good Vet in Doha?”
Socializing : Not a real question, but intended for
socializing or for chatting. This can also mean expressing
an opinion or sharing some information, without really
asking anything of general interest.
e.g. “What was your first car?”

Answer classification
Factual – TRUE : The answer is True and can be proven
with an external resource.
Q : “I wanted to know if there were any specific shots and
vaccinations I should get before coming over [to Doha].”
A : “Yes there are; though it varies depending on which
country you come from. In the UK; the doctor has a list of
all countries and the vaccinations needed for each.”
Factual – FALSE : The answer gives a factual response, but
it is False, it is partially false or the responder is unsure about
Q : “Can I bring my pitbulls to Qatar?”
A : “Yes, you can bring it but be careful this kind of dog is
very dangerous.”
Non-Factual : When the answer does not provide factual
information to the question; it can be an opinion or an advice
that cannot be verified
e.g. “It’s better to buy a new one.”

“We show that fine-tuning a language model on a large unsupervised corpus from the same community forum helps us achieve better accuracy for question classification,” said Tuhin Chakrabarty, lead researcher of the paper. Most community question-answering forums have such unlabeled data, which can be used in the absence of large labeled training data.

For answer classification they show how to leverage information from previously answered questions on the thread through language model fine-tuning. Their experiments also show that modeling an answer individually is not the best idea for fact-verification and results are improved when considering the context of the question.

“Determining factuality of answers requires modeling of world knowledge or external evidence – the questions asked are often very noisy and require reformulation,” shared Chakrabarty. “As a future step we would want to incorporate external evidence from the internet in the factual answer classification problem.”

Identifying Therapist Conversational Actions Across Diverse Psychotherapeutic Approaches
Fei-Tzin Lee Columbia University, Derrick Hull Talkspace, Jacob Levine Talkspace, Bonnie Ray Talkspace and Kathleen McKeown Columbia University

The paper studied dialogue act classification in therapy transcripts. Dialogue act classification is a task in which the researchers attempt to determine the intention of the speaker at each point in a dialogue, classifying it into one of a fixed number of possible types. This provides a layer of abstraction away from what the speaker is literally saying, giving a higher-level view of the conversation. Ultimately, they hope this work can help analyze the dynamics of text-based therapy on a large scale.
Transcripts of therapy sessions were examined, focusing on the speech of the therapist, using a classification scheme developed for this purpose. On a sentence-by-sentence basis, they found which of the labels best matches the conversational “action” the sentence takes.
For example, if a therapist makes the statement, “It almost feels like if you could do something, anything would be better than…” This would be classified into the Reflection category, as it is rephrasing or restating the experience the client just described, but in a way that makes what they are feeling more explicit.
“One of the interesting results from this research came when we analyzed the performance of our best classifier across different styles of therapy,” said Fei-Tzin Lee, a third year PhD student. Certain styles were markedly easier to classify than others; this was not simply a case where the classifier performed better on therapeutic styles for which there was more data.
Generally, it seemed that therapy styles involving more complex sentence structure were more difficult to classify, although to fully understand the differences between styles further work would be necessary. Continued Lee, “Regardless of the reason, it was interesting to note that there are marked differences that are quantitatively measurable between different styles.”