Natural Language Processing Papers Accepted to EMNLP 2020

Six papers from the Speech & NLP group were accepted to the Empirical Methods in Natural Language Processing (EMNLP) conference. 

Generating Similes Effortlessly Like a Pro: A Style Transfer Approach for Simile Generation
Tuhin Chakrabarty Columbia University, Smaranda Muresan Columbia University, and Nanyun Peng University of Southern California and University of California, Los Angeles

Abstract:
Literary tropes, from poetry to stories, are at the crux of human imagination and communication. Figurative language, such as a simile, goes beyond plain expressions to give readers new insights and inspirations. We tackle the problem of simile generation. Generating a simile requires proper understanding for effective mapping of properties between two concepts. To this end, we first propose a method to automatically construct a parallel corpus by transforming a large number of similes collected from Reddit to their literal counterpart using structured common sense knowledge. We then fine-tune a pre-trained sequence-to-sequence model, BART (Lewis et al., 2019), on the literal-simile pairs to generate novel similes given a literal sentence. Experiments show that our approach generates 88% novel similes that do not share properties with the training data. Human evaluation on an independent set of literal statements shows that our model generates similes better than two literary experts 37% of the time, and better than three baseline systems, including a recent metaphor generation model, 71% of the time when compared pairwise. We also show how replacing literal sentences with similes from our best model in machine-generated stories improves evocativeness and leads to better acceptance by human judges.
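The core of the approach is ordinary sequence-to-sequence fine-tuning and decoding. The sketch below is a rough illustration only, not the authors' released code: it loads the generic facebook/bart-large checkpoint as a stand-in for the fine-tuned literal-to-simile model and shows what generation from a literal sentence might look like with the Hugging Face transformers library.

```python
# Hedged sketch: generating from a BART model given a literal sentence.
# The checkpoint below is the generic pre-trained BART, standing in for the
# fine-tuned literal-to-simile model described in the paper.
from transformers import BartForConditionalGeneration, BartTokenizer

MODEL_NAME = "facebook/bart-large"  # placeholder; the paper fine-tunes BART on literal-simile pairs

tokenizer = BartTokenizer.from_pretrained(MODEL_NAME)
model = BartForConditionalGeneration.from_pretrained(MODEL_NAME)

literal = "The house was very old."
inputs = tokenizer(literal, return_tensors="pt")

# With a fine-tuned checkpoint, beam search would produce a simile-style paraphrase.
output_ids = model.generate(**inputs, num_beams=5, max_length=40, early_stopping=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```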

 

Content Planning for Neural Story Generation with Aristotelian Rescoring
Seraphina Goldfarb-Tarrant University of Southern California and University of Edinburgh, Tuhin Chakrabarty Columbia University, Ralph Weischedel University of Southern California and Nanyun Peng University of Southern California and University of California, Los Angeles

Abstract:
Long-form narrative text generated from large language models manages a fluent impersonation of human writing, but only at the local sentence level, and lacks structure or global cohesion. We posit that many of the problems of story generation can be addressed via high-quality content planning, and present a system that focuses on how to learn good plot structures to guide story generation. We utilize a plot-generation language model along with an ensemble of rescoring models that each implement an aspect of good story-writing as detailed in Aristotle’s Poetics. We find that stories written with our more principled plot structure are both more relevant to a given prompt and higher quality than baselines that do not content plan, or that plan in an unprincipled way.
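The rescoring idea can be pictured independently of the trained models. The sketch below is a toy stand-in, not the paper's system: the candidate plots and the two scoring functions are invented placeholders for learned rescorers, and candidates are simply reranked by a weighted sum of their scores.

```python
# Minimal sketch of ensemble rescoring: each rescorer scores a candidate plot,
# and candidates are reranked by a weighted sum. The scorers are toy placeholders.

def relevance_score(prompt: str, plot: str) -> float:
    # Placeholder: reward word overlap with the prompt.
    prompt_words = set(prompt.lower().split())
    plot_words = set(plot.lower().split())
    return len(prompt_words & plot_words) / max(len(prompt_words), 1)

def event_score(prompt: str, plot: str) -> float:
    # Placeholder: reward longer event chains (stand-in for a learned rescorer).
    return min(len(plot.split("->")) / 10.0, 1.0)

RESCORERS = [(relevance_score, 0.6), (event_score, 0.4)]

def rerank(prompt: str, candidates: list) -> str:
    def total(plot: str) -> float:
        return sum(weight * scorer(prompt, plot) for scorer, weight in RESCORERS)
    return max(candidates, key=total)

candidates = [
    "hero finds map -> hero ignores map",
    "hero finds map -> hero follows map -> hero faces guardian -> hero wins treasure",
]
print(rerank("a hero searches for treasure", candidates))
```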

 

Severing the Edge Between Before and After: Neural Architectures for Temporal Ordering of Events
Miguel Ballesteros Amazon AI, Rishita Anubhai Amazon AI, Shuai Wang Amazon AI, Nima Pourdamghani Amazon AI, Yogarshi Vyas Amazon AI, Jie Ma Amazon AI, Parminder Bhatia Amazon AI, Kathleen McKeown Columbia University and Amazon AI and Yaser Al-Onaizan Amazon AI

Abstract:
In this paper, we propose a neural architecture and a set of training methods for ordering events by predicting temporal relations. Our proposed models receive a pair of events within a span of text as input and identify the temporal relation (Before, After, Equal, Vague) between them. Given that a key challenge with this task is the scarcity of annotated data, our models rely on pre-trained representations (i.e., RoBERTa, BERT, or ELMo), transfer and multi-task learning (by leveraging complementary datasets), and self-training techniques. Experiments on the MATRES dataset of English documents establish a new state-of-the-art on this task.
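At its simplest, the task reduces to four-way classification over a pair of marked events. The sketch below is illustrative only and is not the paper's architecture (which builds dedicated event representations): it marks the two events with ad-hoc bracket tokens and runs an untrained classification head on top of roberta-base, so the predicted label is meaningless until the model is fine-tuned on temporal data.

```python
# Hedged sketch: four-way temporal relation classification with a pre-trained
# encoder. The event markers and the untrained head are illustrative only.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["Before", "After", "Equal", "Vague"]

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=len(LABELS))

text = "She [E1] finished [/E1] the report before she [E2] left [/E2] the office."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
print(LABELS[int(logits.argmax(dim=-1))])  # untrained head: prediction is arbitrary
```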

 

Controllable Meaning Representation to Text Generation: Linearization and Data Augmentation Strategies
Chris Kedzie Columbia University and Kathleen McKeown Columbia University

Abstract:
We study the degree to which neural sequence-to-sequence models exhibit fine-grained controllability when performing natural language generation from a meaning representation. Using two task-oriented dialogue generation benchmarks, we systematically compare the effect of four input linearization strategies on controllability and faithfulness. Additionally, we evaluate how a phrase-based data augmentation method can improve performance. We find that properly aligning input sequences during training leads to highly controllable generation, both when training from scratch and when fine-tuning a larger pre-trained model. Data augmentation further improves control on difficult, randomly generated utterance plans.
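"Linearization" here simply means deciding how to flatten a structured meaning representation into the token sequence the encoder reads. The toy sketch below uses invented slot names and orderings (not the benchmarks' actual schemas) to contrast a fixed slot order with one aligned to the order in which slots are realized in a reference utterance.

```python
# Illustrative sketch of input linearization: the same meaning representation
# is flattened into different input sequences. Slot names are invented.
mr = {"name": "The Punter", "food": "Italian", "rating": "high", "area": "riverside"}

def linearize_fixed(mr: dict) -> str:
    # Fixed, schema-defined slot order.
    order = ["name", "food", "area", "rating"]
    return " ".join(f"{slot}={mr[slot]}" for slot in order if slot in mr)

def linearize_aligned(mr: dict, realization_order: list) -> str:
    # Order slots to match how they appear in the reference utterance,
    # approximating the "aligned" training condition described above.
    return " ".join(f"{slot}={mr[slot]}" for slot in realization_order if slot in mr)

print(linearize_fixed(mr))
print(linearize_aligned(mr, ["name", "area", "rating", "food"]))
```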

Zero-Shot Stance Detection: A Dataset and Model using Generalized Topic Representations
Emily Allaway Columbia University and Kathleen McKeown Columbia University

Abstract:
Stance detection is an important component of understanding hidden influences in everyday life. Since there are thousands of potential topics to take a stance on, most with little to no training data, we focus on zero-shot stance detection: classifying stance from no training examples. In this paper, we present a new dataset for zero-shot stance detection that captures a wider range of topics and lexical variation than in previous datasets. Additionally, we propose a new model for stance detection that implicitly captures relationships between topics using generalized topic representations and show that this model improves performance on a number of challenging linguistic phenomena.
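One way to picture "generalized topic representations" is as cluster centroids over topic embeddings, so that an unseen topic can borrow information from related seen topics. The sketch below is a heavy simplification of the paper's model, with random vectors standing in for real topic embeddings.

```python
# Toy sketch: cluster topic embeddings and map an unseen topic to its nearest
# cluster centroid. Random vectors stand in for learned topic embeddings.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
topic_embeddings = rng.normal(size=(200, 64))   # placeholder embeddings for training topics

kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(topic_embeddings)

unseen_topic = rng.normal(size=(1, 64))          # embedding of a topic never seen in training
cluster = kmeans.predict(unseen_topic)[0]
generalized_representation = kmeans.cluster_centers_[cluster]
print(cluster, generalized_representation.shape)
```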

 

Unsupervised Cross-Lingual Part-of-Speech Tagging for Truly Low-Resource Scenarios
Ramy Eskander Columbia University, Smaranda Muresan Columbia University, and Michael Collins Columbia University

Abstract:
We describe a fully unsupervised cross-lingual transfer approach for part-of-speech (POS) tagging under a truly low resource scenario. We assume access to parallel translations between the target language and one or more source languages for which POS taggers are available. We use the Bible as parallel data in our experiments: small size, out-of-domain, and covering many diverse languages. Our approach innovates in three ways: 1) a robust approach of selecting training instances via cross-lingual annotation projection that exploits best practices of unsupervised type and token constraints, word-alignment confidence and density of projected POS, 2) a Bi-LSTM architecture that uses contextualized word embeddings, affix embeddings and hierarchical Brown clusters, and 3) an evaluation on 12 diverse languages in terms of language family and morphological typology. In spite of the use of limited and out-of-domain parallel data, our experiments demonstrate significant improvements in accuracy over previous work. In addition, we show that using multi-source information, either via projection or output combination, improves the performance for most target languages.
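The first innovation, selecting training instances via annotation projection, can be sketched in a few lines. The thresholds and data structures below are invented for illustration and are not the paper's exact heuristics: source-side POS tags are copied across word alignments, low-confidence links are discarded, and a sentence is kept only if enough of its tokens end up tagged.

```python
# Simplified sketch of cross-lingual annotation projection with confidence
# and density filtering. Thresholds and inputs are illustrative.
source_tags = ["PRON", "VERB", "DET", "NOUN"]                           # tags for "I read the book"
alignments = [(0, 0, 0.97), (1, 1, 0.92), (2, 2, 0.55), (3, 3, 0.99)]   # (src_idx, tgt_idx, confidence)
target_len = 4

def project_tags(source_tags, alignments, target_len, min_conf=0.9, min_density=0.7):
    projected = [None] * target_len
    for src_idx, tgt_idx, conf in alignments:
        if conf >= min_conf:
            projected[tgt_idx] = source_tags[src_idx]
    density = sum(tag is not None for tag in projected) / target_len
    # Keep the sentence as a training instance only if enough tokens received a tag.
    return projected if density >= min_density else None

print(project_tags(source_tags, alignments, target_len))
```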

 

Natural Language Processing and Spoken Language Processing groups present papers at EMNLP 2018

Columbia researchers presented their work at the Empirical Methods in Natural Language Processing (EMNLP) conference in Brussels, Belgium.

Professor Julia Hirschberg gave a keynote talk on the work done by the Spoken Language Processing Group on how to automatically detect deception in spoken language – how to identify cues in trusted speech vs. mistrusted speech and how these features differ by speaker and by listener. Slides from the talk can be viewed here.

Five teams of computer science undergraduate and PhD students from the Natural Language Processing (NLP) group also attended the conference to showcase their work on text summarization, analysis of social media, and fact checking.


Robust Document Retrieval and Individual Evidence Modeling for Fact Extraction and Verification
Tuhin Chakrabarty Computer Science Department, Tariq Alhindi Computer Science Department, and Smaranda Muresan Computer Science Department and Data Science Institute

“Given the difficult times we are living in, it’s extremely necessary to be perfect with our facts,” said Tuhin Chakrabarty, lead researcher of the paper. “Misinformation spreads like wildfire and has long-lasting impacts. This motivated us to delve into the area of fact extraction and verification.”

This paper presents the ColumbiaNLP submission for the FEVER Workshop Shared Task. Their system is an end-to-end pipeline that extracts factual evidence from Wikipedia and infers a decision about the truthfulness of the claim based on the extracted evidence.

Fact checking is a type of investigative journalism where experts examine the claims published by others for their veracity. The claims can range from statements made by public figures to stories reported by other publishers. The end goal of a fact checking system is to provide a verdict on whether the claim is true, false, or mixed. Several organizations such as FactCheck.org and PolitiFact are devoted to such activities.

The FEVER shared task aims to evaluate the ability of a system to verify information using evidence from Wikipedia. Given a claim involving one or more entities (mapping to Wikipedia pages), the system must extract textual evidence (sets of sentences from Wikipedia pages) that supports or refutes the claim and then, using this evidence, label the claim as Supported, Refuted, or NotEnoughInfo.
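The final step of such a pipeline, turning retrieved evidence into one of the three labels, can be sketched as follows. This is a hedged illustration, not the ColumbiaNLP system: classify_pair is a dummy stand-in for a trained claim-evidence classifier, and the aggregation is a simple vote over sentences.

```python
# Simplified sketch of the final FEVER decision step. The per-sentence scorer
# is a dummy stand-in for a trained entailment model.
from collections import Counter

def classify_pair(claim: str, sentence: str) -> str:
    # Placeholder for a trained claim-evidence classifier.
    if "not" in sentence.lower().split():
        return "REFUTES"
    overlap = set(claim.lower().split()) & set(sentence.lower().split())
    return "SUPPORTS" if overlap else "NOT ENOUGH INFO"

def label_claim(claim: str, evidence_sentences: list) -> str:
    if not evidence_sentences:
        return "NOT ENOUGH INFO"
    votes = Counter(classify_pair(claim, s) for s in evidence_sentences)
    return votes.most_common(1)[0][0]

claim = "Brussels is the capital of Belgium."
evidence = ["Brussels is the capital of Belgium and of the Flemish Community."]
print(label_claim(claim, evidence))
```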


Detecting Gang-Involved Escalation on Social Media Using Context
Serina Chang Computer Science Department, Ruiqi Zhong Computer Science Department, Ethan Adams Computer Science Department, Fei-Tzin Lee Computer Science Department, Siddharth Varia Computer Science Department, Desmond Patton School of Social Work, William Frey School of Social Work, Chris Kedzie Computer Science Department, and Kathleen McKeown Computer Science Department

This research is a collaboration between Professor Kathy McKeown’s NLP lab and the Columbia School of Social Work. Professor Desmond Patton, from the School of Social Work and a member of the Data Science Institute, discovered that gang-involved youth in cities such as Chicago increasingly turn to social media to grieve the loss of loved ones, which may escalate into aggression toward rival gangs and plans for violence.

The team created a machine learning system that can automatically detect aggression and loss in the social media posts of gang-involved youth. They developed the approach in the hope that such a system can eventually save critical time, scale its reach, and intervene before more young lives are lost.

The system features word embeddings and lexicons automatically derived from a large domain-specific corpus that the team constructed. They also created context features that capture a user's recent posts, both their semantic and emotional content, and the user's interactions with other users in the dataset. Incorporating these domain-specific resources and context features into a Convolutional Neural Network (CNN) leads to a significant improvement over the prior state of the art.

The dataset used spans the public Twitter posts of nearly 300 users from a gang-involved community in Chicago. Youth volunteers and violence prevention organizations helped identify users and annotate the dataset for aggression and loss. Here are two examples of labeled tweets, both of which the system was able to classify correctly. Names are blocked out to preserve the privacy of users.

Tweet examples

For semantics, which were represented by word embeddings, the researchers found that it was optimal to include 90 days of recent tweet history. For emotion, where an emotion lexicon was employed, only two days of recent tweets were needed. This matched insight from prior social work research, which found that loss is significantly likely to precede aggression within a two-day window. They also found that emotions fluctuate more quickly than semantics, so the tighter context window captures more fine-grained fluctuations.
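The windowing itself is straightforward to illustrate. In the sketch below (a simplification with a placeholder feature function, not the paper's embedding or lexicon features), a user's tweet history is filtered by different look-back windows before being summarized into a context feature.

```python
# Illustrative sketch of windowed context features: recent tweets are filtered
# by a look-back window before being summarized. The feature is a placeholder.
from datetime import datetime, timedelta

def tweets_in_window(history, now, days):
    cutoff = now - timedelta(days=days)
    return [text for timestamp, text in history if timestamp >= cutoff]

def context_feature(tweets):
    # Placeholder summary: average tweet length stands in for averaged
    # word-embedding or emotion-lexicon features.
    return sum(len(t.split()) for t in tweets) / len(tweets) if tweets else 0.0

now = datetime(2018, 10, 31)
history = [
    (datetime(2018, 10, 30), "miss you bro"),
    (datetime(2018, 9, 1), "good game last night"),
]

semantic_context = context_feature(tweets_in_window(history, now, days=90))
emotion_context = context_feature(tweets_in_window(history, now, days=2))
print(semantic_context, emotion_context)
```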

“We took this context-driven approach because we believed that interpreting emotion in a given tweet requires context, including what the users had been saying recently, how they had been feeling, and their social dynamics with others,” said Serina Chang, an undergraduate computer science student. One thing that surprised them was the extent to which different types of context offered different types of information, as demonstrated by the contrasting contributions of the semantic-based user history feature and the emotion-based one. Continued Chang, “As we hypothesized, adding context did result in a significant performance improvement in our neural net model.”


Team SWEEPer: Joint Sentence Extraction and Fact Checking with Pointer Networks
Christopher Hidey Columbia University and Mona Diab Amazon AI Lab

Automated fact checking of textual claims is of increasing interest in today’s world. Previous research has investigated fact checking in political statements, news articles, and community forums.

“Through our model we can fact check claims and find specific statements that support the evidence,” said Christopher Hidey, a fourth year PhD student. “This is a step towards addressing the propagation of misinformation online.”

As part of the FEVER community shared task, the researchers developed models that, given a statement, jointly find a Wikipedia article and a sentence related to the statement, and then predict whether the statement is supported by that sentence.

For example, given the claim “Lorelai Gilmore’s father is named Robert,” one could find the Wikipedia article on Lorelai Gilmore and extract the third sentence “Lorelai has a strained relationship with her wealthy parents, Richard and Emily, after running away as a teen to raise her daughter on her own” to show that the claim is false. 

Credit: Wikipedia – https://en.wikipedia.org/wiki/Lorelai_Gilmore

One aspect of this problem that the team observed was how poorly TF-IDF, a standard technique in information retrieval and natural language processing, performed at retrieving Wikipedia articles and sentences. Their custom model improved recall by 35 points over a TF-IDF baseline, achieving 90% recall when retrieving five articles per claim. Overall, the model retrieved the correct sentence and predicted the veracity of the claim 50% of the time.
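For readers unfamiliar with the baseline being compared against, the sketch below shows what TF-IDF retrieval looks like with scikit-learn, on two toy "articles" rather than a full Wikipedia index: articles and the claim are embedded as sparse TF-IDF vectors and ranked by cosine similarity.

```python
# Sketch of a TF-IDF retrieval baseline of the kind discussed above.
# Toy documents only; real FEVER systems index all of Wikipedia.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = {
    "Lorelai Gilmore": "Lorelai has a strained relationship with her wealthy parents, Richard and Emily.",
    "Brussels": "Brussels is the capital of Belgium.",
}

claim = "Lorelai Gilmore's father is named Robert."

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(articles.values())
claim_vector = vectorizer.transform([claim])

scores = cosine_similarity(claim_vector, doc_matrix)[0]
ranked = sorted(zip(articles.keys(), scores), key=lambda pair: -pair[1])
print(ranked)  # highest-scoring article titles first
```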


Where is your Evidence: Improving Fact-checking by Justification Modeling
Tariq Alhindi Computer Science Department, Savvas Petridis Computer Science Department, and Smaranda Muresan Computer Science Department and Data Science Institute

The rate at which misinformation spreads on the web is faster than the rate of manual fact-checking conducted by organizations like PolitiFact.com and FactCheck.org. For this paper the researchers wanted to explore how to automate parts or all of the fact-checking process. A poster with their findings was presented as part of the FEVER workshop.

“In order to come up with reliable fact-checking systems we need to understand the current manual process and identify opportunities for automation,” said Tariq Alhindi, lead author on the paper. The researchers looked at the LIAR dataset – around 10,000 claims classified by PolitiFact.com into one of six degrees of truth: pants-on-fire, false, mostly-false, half-true, mostly-true, and true. Continued Alhindi, “We also looked at the fact-checking article for each claim, automatically extracted justification sentences for a given verdict, and used them in our models after removing all sentences that contain the verdict (e.g., true or false).”

Excerpt from the LIAR-PLUS dataset

The researchers used both feature-based machine learning models and neural networks to predict whether a given statement is true or false. Results showed that using some form of justification or evidence consistently improves the results of fake-news detection models.
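A minimal version of the claim-plus-justification setup can be sketched with a feature-based classifier. The example below is a toy illustration only: the claims, justifications, and labels are invented, the real LIAR-PLUS data uses six labels rather than two, and the [JUST] separator is an ad-hoc choice.

```python
# Toy sketch: concatenate each claim with its extracted justification, then
# train a simple feature-based classifier. Texts and labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

claims = ["The governor cut school funding by half.",
          "Crime fell for the third year in a row."]
justifications = ["Budget records show a 5 percent reduction, not half.",
                  "Statistics confirm a three-year decline."]
labels = ["false", "true"]

texts = [c + " [JUST] " + j for c, j in zip(claims, justifications)]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["Unemployment doubled last year. [JUST] Labor data show unemployment fell."]))
```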

“What was most surprising about the results is that adding features from the extracted justification sentences consistently improved the results no matter what classifier we used or what other features we included,” shared Alhindi, a PhD student. “However, we were surprised that the improvement was consistent even when we compare traditional feature-based linear machine learning models against state of the art deep learning models.”

Their research extends previous work on this dataset, which only looked at the linguistic cues of the claim and/or the metadata of the speaker (history, venue, party affiliation, etc.). The researchers also released the extended dataset to the community to allow further work with the extracted justifications.


Content Selection in Deep Learning Models of Summarization
Chris Kedzie Columbia University, Kathleen McKeown Columbia University, Hal Daume III University of Maryland, College Park

Recently, a specific type of machine learning, called deep learning, has made strides toward human-level performance on hard-to-articulate problems, that is, things people do subconsciously, like recognizing faces or understanding speech. Natural language processing researchers have therefore turned to these models for the task of identifying the most important phrases and sentences in text documents, and have trained them to imitate the decisions a human editor might make when selecting content for a summary.

“Deep learning models have been successful in summarizing natural language texts, news articles and online comments,” said Chris Kedzie, a fifth year PhD student. “What we wanted to know is how they are doing it.”

While these deep learning models are empirically successful, it is not clear how they are performing this task. By design, they are learning to create their own representation of words and sentences, and then using them to predict whether a sentence is important – if it should go into a summary of the document. But just what kinds of information are they using to create these representations? 

One hypothesis the researchers had was that certain types of words were more informative than others. For example, in a news article, nouns and verbs might be more important than adjectives and adverbs for identifying the most important information, since such articles are typically written in a relatively objective manner.

To see if this was so, they trained models to predict sentence importance on redacted datasets, where nouns, verbs, adjectives, adverbs, or function words were removed, and compared them to models trained on the original data.
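The redaction itself is simple to illustrate. In the sketch below, a sentence is assumed to be already POS-tagged (the tags are hand-written here; the actual experiments tag entire datasets automatically), and one word class at a time is removed before the sentence would be used for training.

```python
# Simplified sketch of the redaction setup: remove one part-of-speech class
# from a (pre-tagged) sentence at a time.
tagged_sentence = [("The", "DET"), ("angry", "ADJ"), ("customer", "NOUN"),
                   ("slammed", "VERB"), ("the", "DET"), ("door", "NOUN"),
                   ("loudly", "ADV")]

def redact(tagged_tokens, remove_class):
    return [word for word, tag in tagged_tokens if tag != remove_class]

for pos_class in ["NOUN", "VERB", "ADJ", "ADV"]:
    print(pos_class, "removed:", " ".join(redact(tagged_sentence, pos_class)))
```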

On a dataset of personal stories published on Reddit, adjectives and adverbs were the key to achieving the best performance. This made intuitive sense in that people tend to use intensifiers to highlight the most important or climactic moments in their stories with sentences like, “And those were the WORST customers I ever served.”

What surprised the researchers were the news articles – removing any one class of words did not dramatically decrease model performance. Either important content was broadly distributed across all kinds of words or there was some other signal that the model was using.

They suspected that sentence order was important because journalists are typically instructed to write according to the inverted pyramid style with the most important information at the top of the article. It was possible that the models were implicitly learning this and simply selecting sentences from the article lead. 

Two pieces of evidence confirmed this. First, looking at a histogram of sentence positions selected as important, the models overwhelmingly preferred the lead of the article. Second, in a follow-up experiment, the sentence order was shuffled to remove sentence position as a viable signal from which to learn. On news articles, model performance dropped significantly, leading to the conclusion that sentence position was most responsible for model performance on news documents.
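Both diagnostics are easy to picture in code. The sketch below uses a toy four-sentence article (the real experiments retrain full models on entire datasets): a lead baseline simply keeps the first k sentences, and shuffling destroys the position signal that such a baseline relies on.

```python
# Sketch of the two diagnostics: a lead-k baseline and sentence shuffling.
import random

article = [
    "A fire damaged a downtown warehouse early Monday.",
    "No injuries were reported, officials said.",
    "The warehouse stored holiday decorations.",
    "Firefighters remained on the scene into the afternoon.",
]

def lead_summary(sentences, k=2):
    # Keep only the first k sentences (the "lead" of the article).
    return sentences[:k]

def shuffled(sentences, seed=0):
    # Randomize order so sentence position is no longer a usable signal.
    copy = list(sentences)
    random.Random(seed).shuffle(copy)
    return copy

print(lead_summary(article))
print(shuffled(article))
```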

The result concerned the researchers as they want models to be trained to truly understand human language and not use simple and brittle heuristics (like sentence position). “To connect this to broader trends in machine learning, we should be very concerned and careful about what signals are being exploited by our models, especially when making sensitive decisions,” Kedzie continued. “The signals identified by the model as helpful may not truly capture the problem we are trying to solve, and worse yet, may be exploiting biases in the dataset that we do not wish it to learn.”

However, Kedzie sees this as an opportunity to improve the utility of word representations so that models are better able to use the article content itself. Along these lines, he hopes in future work to show that by quantifying the surprisal or novelty of a particular word or phrase, models can make better sentence importance predictions, just as people remember the most surprising and unexpected parts of a good story.