CS@CU At ACL 2024

CS researchers presented their work at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024).

Parallel Structures in Pre-training Data Yield In-Context Learning
Yanda Chen Columbia University, Chen Zhao New York University, Zhou Yu Columbia University, Kathleen McKeown Columbia University, He He New York University

Abstract:
Pre-trained language models (LMs) are capable of in-context learning (ICL): they can adapt to a task with only a few examples given in the prompt without any parameter update. However, it is unclear where this capability comes from as there is a stark distribution shift between pre-training text and ICL prompts. In this work, we study what patterns of the pre-training data contribute to ICL. We find that LMs’ ICL ability depends on parallel structures in the pre-training data—pairs of phrases following similar templates in the same context window. Specifically, we detect parallel structures by checking whether training on one phrase improves prediction of the other, and conduct ablation experiments to study their effect on ICL. We show that removing parallel structures in the pre-training data reduces LMs’ ICL accuracy by 51% (vs. 2% from random ablation). This drop persists even when excluding common patterns such as n-gram repetitions and long-range dependencies, showing the diversity and generality of parallel structures. A closer look at the detected parallel structures indicates that they cover diverse linguistic tasks and span long distances in the data.
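
The detection test described in the abstract lends itself to a short sketch: flag a phrase pair as a parallel structure when one gradient step on the first phrase lowers the LM's loss on the second. Below is a minimal illustration assuming a small off-the-shelf model; the learning rate, threshold, and example phrases are illustrative choices, not the paper's settings.

```python
# Sketch: a pair is flagged as a parallel structure when one gradient step on
# phrase_a lowers the LM's loss on phrase_b. Hyperparameters are assumptions.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def lm_loss(m, text):
    """Average token-level cross-entropy of `text` under model `m`."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return m(ids, labels=ids).loss.item()

def is_parallel(phrase_a, phrase_b, lr=1e-4, threshold=0.05):
    """Train on phrase_a for one step; flag the pair if phrase_b's loss drops."""
    probe = copy.deepcopy(model)           # keep the base model untouched
    before = lm_loss(probe, phrase_b)
    ids = tok(phrase_a, return_tensors="pt").input_ids
    opt = torch.optim.SGD(probe.parameters(), lr=lr)
    probe.train()
    probe(ids, labels=ids).loss.backward() # one gradient step on phrase_a
    opt.step()
    probe.eval()
    after = lm_loss(probe, phrase_b)
    return (before - after) > threshold    # loss improvement => parallel structure

print(is_parallel("The capital of France is Paris.",
                  "The capital of Japan is Tokyo."))
```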

 

Getting Serious about Humor: Crafting Humor Datasets with Unfunny Large Language Models
Zachary Horvitz Columbia University, Jingru Chen Columbia University, Rahul Aditya Columbia University, Harshvardhan Srivastava Columbia University, Robert West EPFL, Zhou Yu Columbia University, Kathleen McKeown Columbia University

Abstract:
Humor is a fundamental facet of human cognition and interaction. Yet, despite recent advances in natural language processing, humor detection remains a challenging task that is complicated by the scarcity of datasets that pair humorous texts with similar non-humorous counterparts. We investigate whether large language models (LLMs) can generate synthetic data for humor detection via editing texts. We benchmark LLMs on an existing human dataset and show that current LLMs display an impressive ability to “unfun” jokes, as judged by humans and as measured on the downstream task of humor detection. We extend our approach to a code-mixed English-Hindi humor dataset where we find that GPT-4’s synthetic data is highly rated by bilingual annotators and provides challenging adversarial examples for humor classifiers.
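
A minimal sketch of the "unfunning" step as described in the abstract: ask an LLM to minimally edit a joke so that it is no longer funny, yielding a matched humorous/non-humorous pair for humor detection. The prompt wording and model name below are assumptions for illustration, not the paper's exact setup.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def unfun(joke: str) -> str:
    """Return a minimally edited, non-humorous version of `joke`."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": ("Edit the given joke with as few changes as possible "
                         "so that it is no longer funny. "
                         "Return only the edited text.")},
            {"role": "user", "content": joke},
        ],
    )
    return response.choices[0].message.content.strip()

joke = "I told my wife she was drawing her eyebrows too high. She looked surprised."
pair = (joke, unfun(joke))  # a matched (humorous, non-humorous) training pair
```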

 

ICLEF: In-Context Learning with Expert Feedback for Explainable Style Transfer
Arkadiy Saakyan Columbia University and Smaranda Muresan Columbia University

Abstract:
While state-of-the-art large language models (LLMs) can excel at adapting text from one style to another, current work does not address the explainability of style transfer models. Recent work has explored generating textual explanations from larger teacher models and distilling them into smaller student models. One challenge with such an approach is that LLM outputs may contain errors that require expertise to correct, but gathering and incorporating expert feedback is difficult due to cost and availability. To address this challenge, we propose ICLEF, a novel human-AI collaboration approach to model distillation that incorporates scarce expert human feedback by combining in-context learning and model self-critique. We show that our method leads to generation of high-quality synthetic explainable style transfer datasets for formality (E-GYAFC) and subjective bias (EWNC). Via automatic and human evaluation, we show that specialized student models finetuned on our datasets outperform generalist teacher models on the explainable style transfer task in one-shot settings, and perform competitively compared to few-shot teacher models, highlighting the quality of the data and the role of expert feedback. In an extrinsic task of authorship attribution, we show that explanations generated by smaller models fine-tuned on E-GYAFC are more predictive of authorship than explanations generated by few-shot teacher models.
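
One way to picture the in-context expert-feedback step: a handful of expert corrections are placed in the prompt, and the model is asked to critique and revise its own draft explanation in the same style. The prompt structure, example correction, and model name below are illustrative assumptions, not the ICLEF implementation.

```python
from openai import OpenAI

client = OpenAI()

# A made-up expert correction used as an in-context demonstration.
EXPERT_EXAMPLES = """\
Draft explanation: Swapped 'gonna' for 'going to'.
Expert correction: Replaced the informal contraction 'gonna' with the formal
'going to', raising the register of the sentence.
"""

def self_critique(informal: str, formal: str, draft: str) -> str:
    """Ask the model to critique and revise `draft`, guided by the
    in-context expert corrections above."""
    prompt = (
        f"Expert-corrected examples:\n{EXPERT_EXAMPLES}\n"
        f"Informal: {informal}\n"
        f"Formal: {formal}\n"
        f"Draft explanation: {draft}\n"
        "Critique the draft, then rewrite it in the style of the expert corrections."
    )
    response = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
```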

 

ProLex: A Benchmark for Language Proficiency-oriented Lexical Substitution
Xuanming Zhang Columbia University, Zixun Chen Columbia University, and Zhou Yu Columbia University

Abstract:
Lexical Substitution discovers appropriate substitutes for a given target word in a context sentence. However, the task fails to consider substitutes that are of equal or higher proficiency than the target, an aspect that could be beneficial for language learners looking to improve their writing. To bridge this gap, we propose a new task — language proficiency-oriented lexical substitution. We also introduce ProLex, a novel benchmark designed to assess systems’ ability to generate not only appropriate substitutes but also substitutes that demonstrate better language proficiency. Besides the benchmark, we propose models that can automatically perform the new task. We show that our best model, a Llama2-13B model fine-tuned with task-specific synthetic data, outperforms ChatGPT by an average of 3.2% in F-score and achieves comparable results with GPT-4 on ProLex.
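
For a concrete sense of the reported metric, here is a minimal sketch of a set-overlap F-score between predicted and gold substitutes. The target word and the substitute sets are made-up illustrations, not actual ProLex entries.

```python
def f_score(predicted: set[str], gold: set[str]) -> float:
    """Harmonic mean of precision and recall over substitute sets."""
    if not predicted or not gold:
        return 0.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Target word "good" in "She gave a good presentation."
gold = {"compelling", "excellent", "persuasive"}   # higher-proficiency substitutes
predicted = {"excellent", "nice", "compelling"}    # "nice" is not an improvement
print(round(f_score(predicted, gold), 2))          # 0.67
```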

 

LOCALRQA: From Generating Data to Locally Training, Testing, and Deploying Retrieval-Augmented QA Systems
Xiao Yu Columbia University, Yunan Lu Columbia University, Zhou Yu Columbia University

Abstract:
Retrieval-augmented question-answering systems combine retrieval techniques with large language models to provide answers that are more accurate and informative. Many existing toolkits allow users to quickly build such systems using off-the-shelf models, but they fall short in supporting researchers and developers in customizing the model training, testing, and deployment process. We propose LOCALRQA, an open-source toolkit that features a wide selection of model training algorithms, evaluation methods, and deployment tools curated from the latest research. As a showcase, we build QA systems using online documentation obtained from Databricks and Faire’s websites. We find that 7B models trained and deployed using LOCALRQA reach performance similar to that of OpenAI’s text-ada-002 and GPT-4-turbo.
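
A minimal sketch of the retrieval-augmented QA pattern such toolkits support: embed documents, retrieve the passages most similar to the question, and condition an LLM's answer on them. The encoder, LLM, and document snippets here are illustrative stand-ins, not LOCALRQA's actual components.

```python
from openai import OpenAI
from sentence_transformers import SentenceTransformer

# Made-up snippets standing in for scraped documentation pages.
docs = [
    "To create a cluster, open the Compute page and click Create Cluster.",
    "Sellers can edit shipping settings under Account > Shipping.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = encoder.encode(docs, normalize_embeddings=True)

def answer(question: str, k: int = 1) -> str:
    """Retrieve the k most similar passages, then ask an LLM to answer."""
    q_vec = encoder.encode([question], normalize_embeddings=True)[0]
    top = (doc_vecs @ q_vec).argsort()[::-1][:k]   # cosine similarity ranking
    context = "\n".join(docs[i] for i in top)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

print(answer("How do I create a cluster?"))
```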

 

Large Language Models are Few-Shot Training Example Generators: A Case Study in Fallacy Recognition
Tariq Alhindi Mohamed bin Zayed University of Artificial Intelligence, Smaranda Muresan Columbia University, Preslav Nakov Mohamed bin Zayed University of Artificial Intelligence

Abstract:
Recognizing fallacies is crucial for ensuring the quality and validity of arguments across various domains. However, computational fallacy recognition faces challenges due to the diverse genres, domains, and types of fallacies found in datasets. This leads to a highly multiclass, and even multi-label, setup with substantial class imbalance. In this study, we aim to enhance existing models for fallacy recognition by incorporating additional context and by leveraging large language models to generate synthetic data, thus increasing the representation of the infrequent classes. We experiment with GPT-3.5 to generate synthetic examples and we examine the impact of prompt settings for this. Moreover, we explore zero-shot and few-shot scenarios to evaluate the effectiveness of using the generated examples for training smaller models within a unified fallacy recognition framework. Furthermore, we analyze the overlap between the synthetic data and existing fallacy datasets. Finally, we investigate the usefulness of providing supplementary context for detecting fallacy types that need such context, e.g., diversion fallacies. Our evaluation results demonstrate consistent improvements across fallacy types, datasets, and generators. The code and the synthetic datasets are all publicly available.
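
A minimal sketch of the synthetic-data idea: prompt an LLM to generate additional examples of an infrequent fallacy class. The prompt wording and the example definition are illustrative assumptions, not the paper's prompt settings.

```python
from openai import OpenAI

client = OpenAI()

def synthesize(fallacy: str, definition: str, n: int = 5) -> list[str]:
    """Generate `n` short argument snippets exhibiting the given fallacy."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (f"{fallacy}: {definition}\n"
                        f"Write {n} short, distinct arguments that commit this "
                        "fallacy, one per line."),
        }],
    )
    return response.choices[0].message.content.strip().split("\n")

# Up-sample an infrequent class with synthetic training examples.
examples = synthesize(
    "Red herring",
    "Diverting attention from the issue by introducing an irrelevant point.",
)
```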

 

CS Papers Accepted to ACL 2019

CS researchers will be at the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019) in Florence, Italy. Numerous papers covering computational approaches to natural language were accepted.

Summaries of the papers are below:

Pay “Attention” to Your Context when Classifying Abusive Language
Tuhin Chakrabarty Columbia University, Kilol Gupta Columbia University, and Smaranda Muresan Columbia University

The goal of any social media platform is to facilitate healthy and meaningful interactions among its users. Too often, however, these platforms instead become avenues for wanton attacks.

In the paper the researchers propose an experimental study that has three aims: (1) to provide a deeper understanding of current datasets that focus on different types of abusive language, which are sometimes overlapping (racism, sexism, hate speech, offensive language and personal attacks); (2) to investigate what type of attention mechanism (contextual vs. self-attention) is better for abusive language detection using deep learning architectures; and (3) to investigate whether stacked architectures provide an advantage over simple architectures for this task.

The work uses context attention rather than self-attention for abuse detection. Context attention encapsulates information learned globally across the training data, whereas self-attention looks only at the words of the particular tweet being classified.
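
A minimal PyTorch sketch of context attention in this sense: a context vector learned globally over the training corpus scores each token representation, rather than tokens attending to one another within a single tweet. The class name, dimensions, and interface are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ContextAttention(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)
        # Learned context vector, shared across all examples in the corpus.
        self.context = nn.Parameter(torch.randn(hidden_dim))

    def forward(self, h):                      # h: (batch, seq_len, hidden_dim)
        u = torch.tanh(self.proj(h))           # transformed token representations
        scores = u @ self.context              # (batch, seq_len)
        weights = torch.softmax(scores, dim=1) # attention over tokens
        pooled = (weights.unsqueeze(-1) * h).sum(dim=1)  # (batch, hidden_dim)
        return pooled, weights

# Pool a batch of 4 encoded tweets, 20 tokens each, into tweet vectors.
pooled, attn = ContextAttention(128)(torch.randn(4, 20, 128))
```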

In an example from the paper, the first tweet belongs to the NONE class while the second belongs to the RACISM class. The word “islam” may appear in the realm of racism as well as in any normal conversation. The model successfully identified the two distinct contextual usages of the word “Islam” in the two tweets, as demonstrated by a much higher attention weight in the second case and a relatively smaller one in the first.


Neural Network Alignment for Sentential Paraphrases
Jessica Ouyang Columbia University and Kathleen McKeown Columbia University

The researchers created a system that automatically aligns paraphrases between two input sentences — that is, it detects which parts of the two sentences are paraphrases of each other. Their aligner is unique in that it is able to align phrases of arbitrary length, including full sentences, as well as relatively loose paraphrases, where the two aligned phrases mean approximately, but not necessarily exactly, the same thing.

Paraphrase alignment is the task of detecting parts of two input sentences that mean the same thing. Previous work on this task has focused on a strict definition of paraphrase, which requires that the aligned phrases mean exactly the same thing; previous systems aligned only words that exactly matched, or were close synonyms, between the sentences. In addition, previous work on paraphrase alignment was practically limited to phrases of three or fewer words, due to running time constraints. However, most people’s intuition about what counts as a paraphrase is much less strict, and paraphrases can be much longer than three words. 

Consider an example pair from the paper, of the kind the system can align but previous work could not. The entire phrase “I vaguely recalled him telling me” means the same thing as “I remembered a story” in the context of the two sentences, but there is no one-to-one mapping between the words in the two phrases (e.g., “vaguely” in Sentence 1 has no corresponding word in Sentence 2), which would prevent previous systems from successfully aligning these phrases.

The system aligns these looser and longer paraphrases by first breaking the input sentences into grammatical chunks, such as noun or verb phrases. For each chunk, it calculates a single vector that represents the meaning of that chunk by combining the vectors representing the meanings of the words within it. Then, a neural network is used to align each chunk in one of the input sentences to the chunks in the other sentence. This method allows for the alignment of all of the words within a chunk at once, regardless of the length of the chunk, and small differences in meaning or in individual words are mitigated by the meanings of the other words in the chunk. The system is the first to use a neural network to perform the alignment task, and it can align longer and more loosely matching phrases than previous systems could.
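
A minimal sketch of the chunk-alignment idea: represent each grammatical chunk by one vector (here, an off-the-shelf sentence encoder stands in for the paper's composed word vectors) and align each chunk in one sentence to its most similar chunk in the other. The chunking shown is an illustrative assumption, and the paper uses a trained neural aligner rather than raw similarity.

```python
from sentence_transformers import SentenceTransformer

# Illustrative chunking of the example phrases discussed above.
chunks1 = ["I vaguely recalled", "him telling me"]
chunks2 = ["I remembered", "a story"]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
v1 = encoder.encode(chunks1, normalize_embeddings=True)  # one vector per chunk
v2 = encoder.encode(chunks2, normalize_embeddings=True)

sim = v1 @ v2.T                             # cosine similarities between chunks
for i, j in enumerate(sim.argmax(axis=1)):  # greedy one-directional alignment
    print(f"{chunks1[i]!r} -> {chunks2[j]!r}  (similarity={sim[i, j]:.2f})")
```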


Rubric Reliability and Annotation of Content and Argument in Source-Based Argument Essays
Yanjun Gao Pennsylvania State University, Alex Driban Pennsylvania State University, Brennan Xavier McManus Columbia University,  Elena Musi University of Liverpool, Patricia M. Davies Prince Mohammad Bin Fahd University, Smaranda Muresan Columbia University, and Rebecca J. Passonneau Pennsylvania State University

Students with STEM majors were prompted to write short argumentative essays on topics including cryptocurrencies, cybercrime, and self-driving cars. These essays were graded on a rubric and analyzed for content.

The argumentative structure of these essays was analyzed, which involved breaking the essays down into units of argumentation and indicating whether one argument supports, attacks, or is necessary context for another, from the main claim of the essay down to individual pieces of evidence. The results of this annotation were compared to the results of applying the rubric to each of these essays, leading to a set of argumentative features associated with essays of particular scores.

One simple finding is that essays with the highest overall score (5) tended to have a higher ratio of argumentative to non-argumentative sentences, while essays in the next highest group (4) tended to be longer. Higher- and lower-scoring essays often had similar numbers of claims, but the lower-scoring essays tended to fail to connect those claims to the main argument of the essay.

The goal of research in this area is to assess the eventual effectiveness and usability of automated grading assistants for argumentative essays, and to determine the extent to which a rubric can be fairly applied to analyze the content and argumentative structure of essays, much as automated grading scripts are already used within the CS department here at Columbia.