Celebrating Success at ACL 2025

The department had a strong showing at the 2025 Annual Meeting of the Association for Computational Linguistics (ACL 2025). Kathleen McKeown won the ACL Lifetime Achievement Award, and Julia Hirschberg received the Dragomir Radev Distinguished Service Award, a testament to their impact on the field and the dedication of their research teams.

Several papers authored by faculty, students, and collaborators were accepted to this year’s conference, reflecting the depth and innovation of our ongoing research in natural language processing.

 

Reranking-based Generation for Unbiased Perspective Summarization

Narutatsu Ri (Columbia University), Nicholas Deas (Columbia University), and Kathleen McKeown (Columbia University)

Abstract
Generating unbiased summaries in real-world settings such as political perspective summarization remains a crucial application of Large Language Models (LLMs). Yet, existing evaluation frameworks rely on traditional metrics for measuring key attributes such as coverage and faithfulness without verifying their applicability, and efforts to develop improved summarizers are still nascent. We address these gaps by (1) identifying reliable metrics for measuring perspective summary quality, and (2) investigating the efficacy of LLM-based methods beyond zero-shot inference. Namely, we build a test set for benchmarking metric reliability using human annotations and show that traditional metrics underperform compared to language model–based metrics, which prove to be strong evaluators. Using these metrics, we show that reranking-based methods yield strong results, and preference tuning with synthetically generated and reranking-labeled data further boosts performance. Our findings aim to contribute to the reliable evaluation and development of perspective summarization methods.
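
The reranking recipe the abstract describes is simple to picture. Below is a minimal, hypothetical sketch of best-of-N reranking, not the authors' released code: sample several candidate summaries, score each with a metric, and keep the top one. The `generate` and `judge_score` callables are placeholders for whatever LLM API and LM-based evaluator are in use.

```python
# Minimal sketch of best-of-N reranking for perspective summarization.
# Hypothetical stand-ins: `generate` and `judge_score` are placeholders for
# an LLM API and an LM-based metric; this is not the authors' released code.
from typing import Callable, List

def rerank_summarize(
    documents: List[str],
    generate: Callable[[str], str],            # prompt -> candidate summary
    judge_score: Callable[[str, str], float],  # (source, summary) -> score
    n_candidates: int = 8,
) -> str:
    """Sample N candidate summaries and keep the one the judge scores highest."""
    source = "\n\n".join(documents)
    prompt = f"Summarize the following perspectives without bias:\n\n{source}"
    candidates = [generate(prompt) for _ in range(n_candidates)]
    return max(candidates, key=lambda s: judge_score(source, s))
```

The same scored candidates can also serve as preference pairs for tuning, which is roughly how reranking-labeled data would feed the preference-tuning step the abstract mentions.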

 

Data Caricatures: On the Representation of African American Language in Pretraining Corpora

Nicholas Deas (Columbia University), Blake Vente (Columbia University), Amith Ananthram (Columbia University), Jessica A. Grieser (University of Michigan), Desmond Patton (University of Pennsylvania), Shana Kleiner (University of Pennsylvania), James Shepard (University of Tennessee), Kathleen McKeown (Columbia University)

Abstract
With a combination of quantitative experiments, human judgments, and qualitative analyses, we evaluate the quantity and quality of African American Language (AAL) representation in 12 predominantly English, open-source pretraining corpora. We specifically focus on the sources, variation, and naturalness of included AAL texts representing the AAL-speaking community. We find that AAL is underrepresented in all evaluated pretraining corpora compared to US demographics, constituting as few as 0.007% and at most 0.18% of documents. We also find that more than 25% of AAL texts in C4 may be perceived as inappropriate for LLMs to generate and to reinforce harmful stereotypes. Finally, we find that most automated filters are more likely to conserve White Mainstream English (WME) texts over AAL in pretraining corpora.

 

Akan Cinematic Emotions (AkaCE): A Multimodal Multi-party Dataset for Emotion Recognition in Movie Dialogues

David Sasu (IT University of Copenhagen), Zehui Wu (Columbia University), Ziwei Gong (Columbia University), Run Chen (Columbia University), Pengyuan Shi (Columbia University), Lin Ai (Columbia University), Julia Hirschberg (Columbia University), Natalie Schluter (IT University of Copenhagen)

Abstract
In this paper, we introduce the Akan Cinematic Emotions (AkaCE) dataset, the first multimodal emotion dialogue dataset for an African language, addressing the significant lack of resources for low-resource languages in emotion recognition research. AkaCE, developed for the Akan language, contains 385 emotion-labeled dialogues and 6,162 utterances across audio, visual, and textual modalities, along with word-level prosodic prominence annotations. The presence of prosodic labels in this dataset also makes it the first prosodically annotated African language dataset. We demonstrate the quality and utility of AkaCE through experiments using state-of-the-art emotion recognition methods, establishing solid baselines for future research. We hope AkaCE inspires further work on inclusive, linguistically and culturally diverse NLP resources.

 

CONFIT V2: Improving Resume-Job Matching using Hypothetical Resume Embedding and Runner-Up Hard-Negative Mining

Xiao Yu (Columbia University), Ruize Xu (Columbia University), Chengyuan Xue (University of Toronto), Jinzhong Zhang (Intellipro Group Inc.), Xu Ma (Intellipro Group Inc.), Zhou Yu (Columbia University)

Abstract
A reliable resume-job matching system helps a company recommend suitable candidates from a pool of resumes and helps a job seeker find relevant jobs from a list of job posts. However, since job seekers apply only to a few jobs, interaction labels in resume-job datasets are sparse. We introduce CONFIT V2, an improvement over CONFIT to tackle this sparsity problem. We propose two techniques to enhance the encoder’s contrastive training process: augmenting job data with a hypothetical reference resume generated by a large language model; and creating high-quality hard negatives from unlabeled resume/job pairs using a novel hard-negative mining strategy. We evaluate CONFIT V2 on two real-world datasets and demonstrate that it outperforms CONFIT and prior methods (including BM25 and OpenAI text-embedding-003), achieving an average absolute improvement of 13.8% in recall and 17.5% in nDCG across job-ranking and resume-ranking tasks.
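
For readers unfamiliar with contrastive training, the sketch below shows how mined hard negatives typically enter an InfoNCE-style loss. It is a generic illustration under our own assumptions, not CONFIT V2's implementation, and the tensor names are hypothetical.

```python
# Generic illustration of contrastive training with mined hard negatives
# (InfoNCE-style), under our own assumptions -- not CONFIT V2's implementation.
import torch
import torch.nn.functional as F

def info_nce_with_hard_negatives(job_emb, resume_emb, hard_neg_emb, temperature=0.05):
    """job_emb, resume_emb, hard_neg_emb: (B, D) tensors; row i of resume_emb
    is the positive for row i of job_emb, and row i of hard_neg_emb is a
    mined hard negative for it."""
    job = F.normalize(job_emb, dim=-1)
    res = F.normalize(resume_emb, dim=-1)
    neg = F.normalize(hard_neg_emb, dim=-1)
    logits = job @ res.t()                                   # in-batch negatives; diagonal = positives
    hard = (job * neg).sum(-1, keepdim=True)                 # one mined hard negative per job
    logits = torch.cat([logits, hard], dim=1) / temperature  # (B, B + 1)
    labels = torch.arange(job.size(0), device=job.device)    # positives sit on the diagonal
    return F.cross_entropy(logits, labels)
```

In this framing, an LLM-generated hypothetical reference resume could be encoded and paired with its job post as an extra positive; how CONFIT V2 actually combines the two techniques is detailed in the paper.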

 

Browsing Lost Unformed Recollections: A Benchmark for Tip-of-the-Tongue Search and Reasoning

Sky CH-Wang (Columbia University), Darshan Deshpande (Patronus AI), Smaranda Muresan (Columbia University), Anand Kannappan (Patronus AI), Rebecca Qian (Patronus AI)

Abstract
We introduce BROWSING LOST UNFORMED RECOLLECTIONS (BLUR), a tip-of-the-tongue known-item search and reasoning benchmark for general AI assistants. BLUR introduces a set of 573 real-world validated questions that demand searching and reasoning across multimodal and multilingual inputs, as well as proficient tool use. Humans easily ace these questions (scoring on average 98%), while the best-performing system scores around 56%. To facilitate progress toward addressing this challenging and aspirational use case for general AI assistants, we release 350 questions through a public leaderboard, retain the answers to 250 of them, and keep the rest as a private test set.

 

Pragmatics in the Era of Large Language Models: A Survey on Datasets, Evaluation, Opportunities and Challenges

Bolei Ma (LMU Munich & Munich Center for Machine Learning), Yuting Li (University of Cologne), Wei Zhou (University of Augsburg), Ziwei Gong (Columbia University), Yang Janet Liu (LMU Munich & Munich Center for Machine Learning), Katja Jasinskaja (University of Cologne), Annemarie Friedrich (University of Augsburg), Julia Hirschberg (Columbia University), Frauke Kreuter (LMU Munich & Munich Center for Machine Learning), Barbara Plank (LMU Munich & Munich Center for Machine Learning)

Abstract
Understanding pragmatics—the use of language in context—is crucial for developing NLP systems capable of interpreting nuanced language use. Despite recent advances in language technologies, including large language models, evaluating their ability to handle pragmatic phenomena such as implicatures and references remains challenging. To advance pragmatic abilities in models, it is essential to understand current evaluation trends and identify existing limitations. In this survey, we provide a comprehensive review of resources designed for evaluating pragmatic capabilities in NLP, categorizing datasets by the pragmatic phenomena they address. We analyze task designs, data collection methods, evaluation approaches, and their relevance to real-world applications. By examining these resources in the context of modern language models, we highlight emerging trends, challenges, and gaps in existing benchmarks. Our survey aims to clarify the landscape of pragmatic evaluation and guide the development of more comprehensive and targeted benchmarks, ultimately contributing to more nuanced and context-aware NLP models.

 

The Law of Knowledge Overshadowing: Towards Understanding, Predicting, and Preventing LLM Hallucination

Yuji Zhang (University of Illinois Urbana-Champaign), Sha Li (University of Illinois Urbana-Champaign), Cheng Qian (University of Illinois Urbana-Champaign), Jiateng Liu (University of Illinois Urbana-Champaign), Pengfei Yu (University of Illinois Urbana-Champaign), Chi Han (University of Illinois Urbana-Champaign), Yi R. Fung (University of Illinois Urbana-Champaign), Kathleen McKeown (Columbia University), Chengxiang Zhai (University of Illinois Urbana-Champaign), Manling Li (Northwestern University), Heng Ji (University of Illinois Urbana-Champaign)

Abstract
Hallucination is a persistent challenge in large language models (LLMs), where even with rigorous quality control, models often generate distorted facts. This paradox, in which error generation continues despite high-quality training data, calls for a deeper understanding of the underlying LLM mechanisms. To address it, we propose a novel concept: knowledge overshadowing, in which a model’s dominant knowledge can obscure less prominent knowledge during text generation, causing the model to fabricate inaccurate details. Building on this idea, we introduce a novel framework to quantify factual hallucinations by modeling knowledge overshadowing. Central to our approach is the log-linear law, which predicts that the rate of factual hallucination increases linearly with the logarithmic scale of (1) Knowledge Popularity, (2) Knowledge Length, and (3) Model Size. The law provides a means to preemptively quantify hallucinations, offering foresight into their occurrence even before model training or inference. Building on the overshadowing effect, we propose a new decoding strategy, CoDA, to mitigate hallucinations, which notably enhances model factuality on Overshadow (27.9%), MemoTrap (13.1%), and NQ-Swap (18.3%). Our findings not only deepen our understanding of the mechanisms behind hallucinations but also provide actionable insights for developing more predictable and controllable language models.
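
For readers who want the shape of the law, the abstract's description can be written schematically; the symbols and the linear-in-logs form below are our paraphrase, and the paper defines the exact quantities and fitted constants.

```latex
% Schematic paraphrase of the log-linear law (our notation, not the paper's):
% the factual hallucination rate R grows linearly in the logs of knowledge
% popularity P, knowledge length L, and model size S.
R \;\approx\; \alpha \log P + \beta \log L + \gamma \log S + c
```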

 

Six Papers From the NLP & Speech Group Accepted to NAACL 2024

The 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) is a premier annual venue for natural language research. At the conference, held June 16-21, 2024, in Mexico City, Mexico, researchers from the department presented work covering language models, summarization, social media, code-switching, and sentiment analysis.

Associate Professor Zhou Yu and her team won a Best Paper Award for Teaching Language Models to Self-Improve through Interactive Demonstrations. The paper introduces TRIPOST, a training algorithm that endows smaller models with self-improvement ability, and shows that the interactive experience of learning from and correcting its own mistakes is crucial for a small model to improve its performance.

Below are the abstracts:

Teaching Language Models to Self-Improve through Interactive Demonstrations
Xiao Yu (Columbia University), Baolin Peng (Microsoft Research), Michel Galley (Microsoft Research), Jianfeng Gao (Microsoft Research), Zhou Yu (Columbia University)

Abstract:
The self-improving ability of large language models (LLMs), enabled by prompting them to analyze and revise their own outputs, has garnered significant interest in recent research. However, this ability has been shown to be absent and difficult to learn for smaller models, thus widening the performance gap between state-of-the-art LLMs and more cost-effective and faster ones. To reduce this gap, we introduce TRIPOST, a training algorithm that endows smaller models with such self-improvement ability, and show that our approach can improve LLaMA-7B’s performance on math and reasoning tasks by up to 7.13%. In contrast to prior work, we achieve this by using the smaller model to interact with LLMs to collect feedback and improvements on its own generations. We then replay this experience to train the small model. Our experiments on four math and reasoning datasets show that the interactive experience of learning from and correcting its own mistakes is crucial for small models to improve their performance.
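
The abstract's interaction-then-replay loop can be sketched in a few lines. Everything here is a hypothetical stand-in (`small_model`, `large_model`, and `finetune` are not the paper's interfaces); the sketch only illustrates the idea of collecting the small model's own mistakes and their LLM-guided corrections, then training on those trajectories.

```python
# Hedged sketch of the interaction-then-replay idea behind TRIPOST, as we read
# the abstract. `small_model`, `large_model`, and `finetune` are hypothetical
# stand-ins, not the paper's interfaces.
from typing import Callable, List, Tuple

Trajectory = Tuple[str, str, str, str]  # (problem, attempt, feedback, improved)

def collect_trajectories(small_model, large_model, problems: List[str]) -> List[Trajectory]:
    """The small model attempts each problem; the LLM critiques and improves
    the attempt. The resulting edit trajectories become training data."""
    data = []
    for problem in problems:
        attempt = small_model.solve(problem)
        feedback = large_model.critique(problem, attempt)
        improved = large_model.improve(problem, attempt, feedback)
        data.append((problem, attempt, feedback, improved))
    return data

def tripost_style_training(small_model, large_model, problems, finetune: Callable, rounds: int = 3):
    # Replaying the small model's *own* mistakes and their corrections is the
    # key difference from simply imitating LLM outputs.
    for _ in range(rounds):
        small_model = finetune(small_model, collect_trajectories(small_model, large_model, problems))
    return small_model
```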

 

TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization
Liyan Tang (The University of Texas at Austin), Igor Shalyminov (AWS AI Labs), Amy Wing-mei Wong (AWS AI Labs), Jon Burnsky (AWS AI Labs), Jake W. Vincent (AWS AI Labs), Yuan Yang (AWS AI Labs), Siffi Singh (AWS AI Labs), Song Feng (AWS AI Labs), Hwanjun Song (Korea Advanced Institute of Science & Technology), Hang Su (AWS AI Labs), Lijia Sun (AWS AI Labs), Yi Zhang (AWS AI Labs), Saab Mansour (AWS AI Labs), Kathleen McKeown (Columbia University)

Abstract:
Single-document news summarization has seen substantial progress in faithfulness in recent years, driven by research on the evaluation of factual consistency or hallucinations. We ask whether these advances carry over to other text summarization domains. We propose a new evaluation benchmark on topic-focused dialogue summarization, generated by LLMs of varying sizes. We provide binary sentence-level human annotations of the factual consistency of these summaries along with detailed explanations of factually inconsistent sentences. Our analysis shows that existing LLMs hallucinate significant amounts of factual errors in the dialogue domain, regardless of the model’s size. On the other hand, when LLMs, including GPT-4, serve as binary factual evaluators, they perform poorly and can be outperformed by prevailing state-of-the-art specialized factuality evaluation metrics. Finally, we conducted an analysis of hallucination types with a curated error taxonomy. We find that there are diverse errors and error distributions in model-generated summaries and that non-LLM-based metrics can capture all error types better than LLM-based evaluators.

 

Fair Abstractive Summarization of Diverse Perspectives
Yusen Zhang (Penn State University), Nan Zhang (Penn State University), Yixin Liu (Yale University), Alexander Fabbri (Salesforce Research), Junru Liu (Texas A&M University), Ryo Kamoi (Penn State University), Xiaoxin Lu (Penn State University), Caiming Xiong (Salesforce Research), Jieyu Zhao (University of Southern California), Dragomir Radev (Yale University), Kathleen McKeown (Columbia University), Rui Zhang (Penn State University)

Abstract:
People from different social and demographic groups express diverse perspectives and conflicting opinions on a broad set of topics such as product reviews, healthcare, law, and politics. A fair summary should provide a comprehensive coverage of diverse perspectives without underrepresenting certain groups. However, current work in summarization metrics and Large Language Models (LLMs) evaluation has not explored fair abstractive summarization. In this paper, we systematically investigate fair abstractive summarization for user-generated data. We first formally define fairness in abstractive summarization as not underrepresenting perspectives of any groups of people, and we propose four reference-free automatic metrics by measuring the differences between target and source perspectives. We evaluate nine LLMs, including three GPT models, four LLaMA models, PaLM 2, and Claude, on six datasets collected from social media, online reviews, and recorded transcripts. Experiments show that both the model-generated and the human-written reference summaries suffer from low fairness. We conduct a comprehensive analysis of the common factors influencing fairness and propose three simple but effective methods to alleviate unfair summarization. Our dataset and code are available at https://github.com/psunlpgroup/FairSumm.
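
As a toy illustration of what a reference-free fairness metric can look like (our own construction, not one of the paper's four metrics), one can compare the distribution of perspectives in the source with that in the summary:

```python
# Toy reference-free fairness check (our construction, not one of the paper's
# four metrics): compare the distribution of group perspectives in the source
# with the distribution in the summary.
from collections import Counter
from typing import Callable, Dict, List

def perspective_gap(
    source_texts: List[str],
    summary_sentences: List[str],
    group_of: Callable[[str], str],  # hypothetical classifier: text -> group label
) -> float:
    """Total variation distance between source and summary group distributions;
    0 means the summary covers perspectives in proportion to the source."""
    def dist(items: List[str]) -> Dict[str, float]:
        counts = Counter(group_of(t) for t in items)
        total = sum(counts.values())
        return {g: c / total for g, c in counts.items()}
    src, summ = dist(source_texts), dist(summary_sentences)
    return 0.5 * sum(abs(src.get(g, 0.0) - summ.get(g, 0.0)) for g in set(src) | set(summ))
```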

 

Measuring Entrainment in Spontaneous Code-switched Speech
Debasmita Bhattacharya (Columbia University), Siying Ding (Columbia University), Alayna Nguyen (Columbia University), Julia Hirschberg (Columbia University)

Abstract:
It is well-known that speakers who entrain to one another have more successful conversations than those who do not. Previous research has shown that interlocutors entrain on linguistic features in both written and spoken monolingual domains. More recent work on code-switched communication has also shown preliminary evidence of entrainment on certain aspects of code-switching (CSW). However, such studies of entrainment in code-switched domains have been extremely few and restricted to human-machine textual interactions. Our work studies code-switched spontaneous speech between humans, finding that (1) patterns of written and spoken entrainment in monolingual settings largely generalize to code-switched settings, and (2) some patterns of entrainment on code-switching in dialogue agent-generated text generalize to spontaneous code-switched speech. Our findings give rise to important implications for the potentially “universal” nature of entrainment as a communication phenomenon, and potential applications in inclusive and interactive speech technology.
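
Entrainment studies often quantify convergence by comparing feature proximity on adjacent turns against a permuted baseline. The sketch below is that generic formulation, not the paper's exact measure; `feature_a` and `feature_b` are hypothetical per-turn feature values (e.g., speech rate) for the two speakers.

```python
# Generic illustration of measuring entrainment as adjacent-turn feature
# proximity versus a permuted baseline -- not the paper's exact method.
import random
from typing import List, Tuple

def local_proximity(feature_a: List[float], feature_b: List[float]) -> float:
    """Mean absolute difference between speaker A's turn and speaker B's reply."""
    diffs = [abs(a - b) for a, b in zip(feature_a, feature_b)]
    return sum(diffs) / len(diffs)

def entrainment_vs_chance(feature_a, feature_b, n_perm=1000, seed=0) -> Tuple[float, float]:
    """Real proximity well below the permuted average suggests entrainment."""
    rng = random.Random(seed)
    real = local_proximity(feature_a, feature_b)
    permuted = []
    for _ in range(n_perm):
        shuffled = list(feature_b)
        rng.shuffle(shuffled)
        permuted.append(local_proximity(feature_a, shuffled))
    return real, sum(permuted) / n_perm
```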

 

Multimodal Multi-loss Fusion Network for Sentiment Analysis
Zehui Wu, Ziwei Gong, Jaywon Koo, Julia Hirschberg

Abstract:
This paper investigates the optimal selection and fusion of feature encoders across multiple modalities and combines these in one neural network to improve sentiment detection. We compare different fusion methods and examine the impact of multi-loss training within the multi-modality fusion network, identifying surprisingly important findings relating to subnet performance. We have also found that integrating context significantly enhances model performance. Our best model achieves state-of-the-art performance for three datasets (CMU-MOSI, CMU-MOSEI and CH-SIMS). These results suggest a roadmap toward an optimized feature selection and fusion approach for enhancing sentiment detection in neural networks.
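
The phrase "multi-loss training" has a standard reading: each modality subnet is supervised by its own auxiliary loss in addition to the fused prediction's loss. The PyTorch sketch below is our generic reconstruction under that reading, with made-up dimensions, not the paper's architecture.

```python
# Generic PyTorch reconstruction of "multi-loss" multimodal fusion: each
# modality subnet gets its own auxiliary loss alongside the fused loss.
# Dimensions and architecture are made up for illustration, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLossFusion(nn.Module):
    def __init__(self, dims=None, hidden=256):
        super().__init__()
        dims = dims or {"text": 768, "audio": 128, "video": 64}
        self.subnets = nn.ModuleDict(
            {m: nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for m, d in dims.items()}
        )
        self.heads = nn.ModuleDict({m: nn.Linear(hidden, 1) for m in dims})  # per-modality predictions
        self.fusion_head = nn.Linear(hidden * len(dims), 1)                  # fused prediction

    def forward(self, inputs):
        feats = {m: net(inputs[m]) for m, net in self.subnets.items()}
        per_mod = {m: self.heads[m](h).squeeze(-1) for m, h in feats.items()}
        fused = self.fusion_head(torch.cat(list(feats.values()), dim=-1)).squeeze(-1)
        return fused, per_mod

def multi_loss(fused, per_mod, target, aux_weight=0.3):
    # Auxiliary per-modality losses keep each subnet individually informative,
    # which is one plausible reading of the paper's multi-loss finding.
    return F.mse_loss(fused, target) + aux_weight * sum(
        F.mse_loss(p, target) for p in per_mod.values()
    )
```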

 

Identifying Self-Disclosures of Use, Misuse and Addiction in Community-based Social Media Posts
Chenghao Yang, Tuhin Chakrabarty, Karli R Hochstatter, Melissa N Slavin, Nabila El-Bassel, Smaranda Muresan

Abstract:
In the last decade, the United States has lost more than 500,000 people from an overdose involving prescription and illicit opioids, making it a national public health emergency (USDHHS, 2017). Medical practitioners require robust and timely tools that can effectively identify at-risk patients. Community-based social media platforms such as Reddit allow self-disclosure for users to discuss otherwise sensitive drug-related behaviors. We present a moderate-size corpus of 2500 opioid-related posts from various subreddits labeled with six different phases of opioid use: Medical Use, Misuse, Addiction, Recovery, Relapse, and Not Using. For every post, we annotate span-level extractive explanations and crucially study their role both in annotation quality and model development. We evaluate several state-of-the-art models in supervised, few-shot, and zero-shot settings. Experimental results and error analysis show that identifying the phases of opioid use disorder is highly contextual and challenging. However, we find that using explanations during modeling leads to a significant boost in classification accuracy, demonstrating their beneficial role in a high-stakes domain such as studying the opioid use disorder continuum.

Eleanor Lin and Walter McKelvie Selected for Honorable Mention for the Outstanding Undergraduate Researcher Award

Two CS students were recognized by the Computing Research Association (CRA) with honorable mentions for the 2024 Outstanding Undergraduate Researcher Award, a commendation of their exemplary dedication to research and academic excellence. The honorees, Eleanor Lin and Walter McKelvie, have exhibited exceptional skill and commitment in their respective areas of focus within computer science.

Eleanor Lin (CC ‘24) distinguished herself through groundbreaking research with the Spoken Language Processing Group, where she is advised by Professor Julia Hirschberg. Her work as the lead researcher on the Switchboard Dialogue Act Re-alignment project showcased innovation and contributed significantly to updating the corpus used to identify regional differences among U.S. speakers, which is extremely important for automatic speech recognition, particularly in telephony. Eleanor made substantial contributions to multiple Speech Lab projects while concurrently serving as a teaching assistant for computer science and linguistics. She also collaborated with researchers from Rice University, the University of Southern California, and Teachers College.

Walter McKelvie (SEAS ‘24) earned an honorable mention for his remarkable work in theoretical computer science and cryptography. He worked with Professor Tal Malkin and the Crypto Lab on fixing a problem with proof-of-stake blockchains: making a secret leader election “accountable” so that leaders cannot anonymously refuse to publish a block. His dedication to pushing the boundaries of the field has been commendable; he contributed one of the three paradigms included in the paper and wrote several of its technical sections. McKelvie additionally served as a teaching assistant and collaborated with researchers from Purdue and Harvard.

The honorable mentions serve as a testament to the vibrant research community of the department, where students are encouraged to explore and excel in their chosen fields. Julia Hirschberg, the Percy K. and Vida L. W. Hudson Professor of Computer Science, assembles a team of 15 undergrads with different skills to work on the Speech Lab’s projects. Students can work on data collection and annotation, building large language models (LLMs), or both. Professor Tal Malkin typically has one or two undergraduate students who work on cryptography research. Students need to have mathematical maturity; ideally, they should have taken Malkin’s graduate-level Introduction to Cryptography class.

These recognitions also highlight the department’s commitment to providing students with a robust academic environment that encourages curiosity, creativity, and a passion for discovery.

The one management skill ChatGPT can’t replace

Empathy is one of the most important leadership traits for managers. It helps build trust and connection among teams and demonstrates a leader’s ability to understand the needs of employees.