## Papers from CS Researchers Accepted to SIGDIAL 2021

Research from the department was presented at the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2021). The conference is a forum for academic and industry researchers to discuss their work on discourse and dialogue including discourse processing, dialogue systems, corpora, tools, and methodology.

Professor Julia Hirschberg was one of the invited keynote speakers and during her lecture she talked about how computer systems can encourage user trust for recommender systems, knowledge-delivery systems, and dialogue systems.

Below are the links to the accepted papers and abstracts.

What to Fact-Check: Guiding Check-Worthy Information Detection in News Articles Through Argumentative Discourse Structure
Tariq Alhindi Columbia University, Brennan McManus Columbia University, and Smaranda Muresan Columbia University

Most existing methods for automatic fact checking start with a precompiled list of claims to verify. We investigate the understudied problem of determining what statements in news articles are worthy to factcheck. We annotate the argument structure of 95 news articles in the climate change domain that are fact-checked by climate scientists at climatefeedback.org. We release the first multi-layer annotated corpus for both argumentative discourse structure (argument components and relations) and for fact checked statements in news articles. We discuss the connection between argument structure and check-worthy statements and develop several baseline models for detecting checkworthy statements in the climate change domain. Our preliminary results show that using information about argumentative discourse structure shows slight but statistically significant improvement over a baseline of local discourse structure.

Improving Named Entity Recognition in Spoken Dialog Systems By Context and Speech Pattern Modeling
Minh Nguyen University of California, Davis and Zhou Yu Columbia University

While named entity recognition (NER) from speech has been around as long as NER from written text has, the accuracy of NER from speech has generally been much lower than that of NER from text. The rise in popularity of spoken dialog systems such as Siri or Alexa highlights the need for more accurate NER from speech because NER is a core component for understanding what users said in dialogs. Deployed spoken dialog systems receive user input in the form of automatic speech recognition (ASR) transcripts, and simply applying NER model trained on written text to ASR transcripts often leads to low accuracy because compared to written text, ASR transcripts lack important cues such as punctuation and capitalization. Besides, errors in ASR transcripts also make NER from speech challenging. We propose two models that exploit dialog context and speech pattern clues to extract named entities more accurately from open-domain dialogs in spoken dialog systems. Our results show the benefit of modeling dialog context and speech patterns in two settings: a standard setting with random partition of data and a more realistic but also more difficult setting where many named entities encountered during deployment are unseen during training.

Evaluation of In-Person Counseling Strategies to Develop Physical Activity Chatbot for Women
Kai-Hui Liang Columbia University, Patrick Lange University of California, Davis, Yoo Jung Oh University of California, Davis, Jingwen Zhang University of California, Davis, Yoshimi Fukuoka University of California, San Francisco, and Zhou Yu Columbia University

Artificial intelligence chatbots are the vanguard in technology-based intervention to change people’s behavior. To develop intervention chatbots, the first step is to understand natural language conversation strategies in human conversation. This work introduces an intervention conversation dataset collected from a real-world physical activity intervention program for women. We designed comprehensive annotation schemes in four dimensions (domain, strategy, social exchange, and taskfocused exchange) and annotated a subset of dialogs. We built a strategy classifier with context information to detect strategies from both trainers and participants based on the annotation. To understand how human intervention induces effective behavior changes, we analyzed the relationships between the intervention strategies and the participants’ changes in the barrier and social support for physical activity. We also analyzed how the participant’s baseline weight correlates to the amount of occurrence of the corresponding strategy. This work lays the foundation for developing a personalized physical activity intervention bot.

DialogStitch: Synthetic Deeper and Multi-Context Task-Oriented Dialogs
Satwik Kottur Facebook AI, Chinnadhurai Sankar Facebook AI, Zhou Yu Columbia University, and Alborz Geramifard Facebook AI

Real-world conversational agents must effectively handle long conversations that span multiple contexts. Such context can be interspersed with chitchat (dialog turns not directly related to the task at hand), and potentially grounded in a multimodal setting. While prior work focused on the above aspects in isolation, there is a lack of a unified framework that studies them together. To overcome this, we propose DialogStitch, a novel framework to seamlessly ‘stitch’ multiple conversations and highlight these desirable traits in a task-oriented dialog. After stitching, our dialogs are provably deeper, contain longer-term dependencies, and span multiple contexts, when compared with the source dialogs— all by leveraging existing human annotations! Though our framework generalizes to a variety of combinations, we demonstrate its benefits in two settings: (a) multimodal, image-grounded conversations, and, (b) task-oriented dialogs fused with chit-chat conversations. We benchmark state-of-the-art dialog models on our datasets and find accuracy drops of (a) 12% and (b) 45% respectively, indicating the additional challenges in the stitched dialogs. Our code and data are publicly available.

Annotation Inconsistency and Entity Bias in MultiWOZ
Kun Qian Columbia University, Ahmad Beirami Facebook AI, Zhouhan Lin Facebook AI, Ankita De Facebook AI, Alborz Geramifard Facebook AI, Zhou Yu Columbia University, and Chinnadhurai Sankar Facebook AI

MultiWOZ (Budzianowski et al., 2018) is one of the most popular multi-domain taskoriented dialog datasets, containing 10K+ annotated dialogs covering eight domains. It has been widely accepted as a benchmark for various dialog tasks, e.g., dialog state tracking (DST), natural language generation (NLG) and end-to-end (E2E) dialog modeling. In this work, we identify an overlooked issue with dialog state annotation inconsistencies in the dataset, where a slot type is tagged inconsistently across similar dialogs leading to confusion for DST modeling. We propose an automated correction for this issue, which is present in 70% of the dialogs. Additionally, we notice that there is significant entity bias in the dataset (e.g., “cambridge” appears in 50% of the destination cities in the train domain). The entity bias can potentially lead to named entity memorization in generative models, which may go unnoticed as the test set suffers from a similar entity bias as well. We release a new test set with all entities replaced with unseen entities. Finally, we benchmark joint goal accuracy (JGA) of the state-of-the art DST baselines on these modified versions of the data. Our experiments show that the annotation inconsistency corrections lead to 7- 10% improvement in JGA. On the other hand, we observe a 29% drop in JGA when models are evaluated on the new test set with unseen entities.

## 13 Research Papers Accepted to ICML 2021

Papers from CS researchers have been accepted to the 38th International Conference on Machine Learning (ICML 2021).

Associate Professor Daniel Hsu was one of the publication chairs of the conference and Assistant Professor Elham Azizi helped organize the 2021 ICML Workshop on Computational Biology. The workshop highlighted how machine learning approaches can be tailored to making both translational and basic scientific discoveries with biological data.

Below are the abstracts and links to the accepted papers.

A Proxy Variable View of Shared Confounding
Yixin Wang Columbia University, David Blei Columbia University

Causal inference from observational data can be biased by unobserved confounders. Confounders—the variables that affect both the treatments and the outcome—induce spurious non-causal correlations between the two. Without additional conditions, unobserved confounders generally make causal quantities hard to identify. In this paper, we focus on the setting where there are many treatments with shared confounding, and we study under what conditions is causal identification possible. The key observation is that we can view subsets of treatments as proxies of the unobserved confounder and identify the intervention distributions of the rest. Moreover, while existing identification formulas for proxy variables involve solving integral equations, we show that one can circumvent the need for such solutions by directly modeling the data. Finally, we extend these results to an expanded class of causal graphs, those with other confounders and selection variables.

Unsupervised Representation Learning via Neural Activation Coding
Yookoon Park Columbia University, Sangho Lee Seoul National University, Gunhee Kim Seoul National University, David Blei Columbia University

We present neural activation coding (NAC) as a novel approach for learning deep representations from unlabeled data for downstream applications. We argue that the deep encoder should maximize its nonlinear expressivity on the data for downstream predictors to take full advantage of its representation power. To this end, NAC maximizes the mutual information between activation patterns of the encoder and the data over a noisy communication channel. We show that learning for a noise-robust activation code increases the number of distinct linear regions of ReLU encoders, hence the maximum nonlinear expressivity. More interestingly, NAC learns both continuous and discrete representations of data, which we respectively evaluate on two downstream tasks: (i) linear classification on CIFAR-10 and ImageNet-1K and (ii) nearest neighbor retrieval on CIFAR-10 and FLICKR-25K. Empirical results show that NAC attains better or comparable performance on both tasks over recent baselines including SimCLR and DistillHash. In addition, NAC pretraining provides significant benefits to the training of deep generative models. Our code is available at https://github.com/yookoon/nac.

The Logical Options Framework
Brandon Araki MIT, Xiao Li MIT, Kiran Vodrahalli Columbia University, Jonathan DeCastro Toyota Research Institute, Micah Fry MIT Lincoln Laboratory, Daniela Rus MIT CSAIL

Learning composable policies for environments with complex rules and tasks is a challenging problem. We introduce a hierarchical reinforcement learning framework called the Logical Options Framework (LOF) that learns policies that are satisfying, optimal, and composable. LOF efficiently learns policies that satisfy tasks by representing the task as an automaton and integrating it into learning and planning. We provide and prove conditions under which LOF will learn satisfying, optimal policies. And lastly, we show how LOF’s learned policies can be composed to satisfy unseen tasks with only 10-50 retraining steps on our benchmarks. We evaluate LOF on four tasks in discrete and continuous domains, including a 3D pick-and-place environment.

Estimating Identifiable Causal Effects on Markov Equivalence Class through Double Machine Learning
Yonghan Jung Columbia University, Jin Tian Columbia University, Elias Bareinboim Columbia University

General methods have been developed for estimating causal effects from observational data under causal assumptions encoded in the form of a causal graph. Most of this literature assumes that the underlying causal graph is completely specified. However, only observational data is available in most practical settings, which means that one can learn at most a Markov equivalence class (MEC) of the underlying causal graph. In this paper, we study the problem of causal estimation from a MEC represented by a partial ancestral graph (PAG), which is learnable from observational data. We develop a general estimator for any identifiable causal effects in a PAG. The result fills a gap for an end-to-end solution to causal inference from observational data to effects estimation. Specifically, we develop a complete identification algorithm that derives an influence function for any identifiable causal effects from PAGs. We then construct a double/debiased machine learning (DML) estimator that is robust to model misspecification and biases in nuisance function estimation, permitting the use of modern machine learning techniques. Simulation results corroborate with the theory.

Environment Inference for Invariant Learning
Elliot Creager University of Toronto, Joern Jacobsen Apple Inc., Richard Zemel Columbia University

Learning models that gracefully handle distribution shifts is central to research on domain generalization, robust optimization, and fairness. A promising formulation is domain-invariant learning, which identifies the key issue of learning which features are domain-specific versus domain-invariant. An important assumption in this area is that the training examples are partitioned into domains'' orenvironments”. Our focus is on the more common setting where such partitions are not provided. We propose EIIL, a general framework for domain-invariant learning that incorporates Environment Inference to directly infer partitions that are maximally informative for downstream Invariant Learning. We show that EIIL outperforms invariant learning methods on the CMNIST benchmark without using environment labels, and significantly outperforms ERM on worst-group performance in the Waterbirds dataset. Finally, we establish connections between EIIL and algorithmic fairness, which enables EIIL to improve accuracy and calibration in a fair prediction problem.

SketchEmbedNet: Learning Novel Concepts by Imitating Drawings
Alex Wang University of Toronto, Mengye Ren University of Toronto, Richard Zemel Columbia University

Sketch drawings capture the salient information of visual concepts. Previous work has shown that neural networks are capable of producing sketches of natural objects drawn from a small number of classes. While earlier approaches focus on generation quality or retrieval, we explore properties of image representations learned by training a model to produce sketches of images. We show that this generative, class-agnostic model produces informative embeddings of images from novel examples, classes, and even novel datasets in a few-shot setting. Additionally, we find that these learned representations exhibit interesting structure and compositionality.

Universal Template for Few-Shot Dataset Generalization
Eleni Triantafillou University of Toronto, Hugo Larochelle Google Brain, Richard Zemel Columbia University, Vincent Dumoulin Google

Few-shot dataset generalization is a challenging variant of the well-studied few-shot classification problem where a diverse training set of several datasets is given, for the purpose of training an adaptable model that can then learn classes from \emph{new datasets} using only a few examples. To this end, we propose to utilize the diverse training set to construct a \emph{universal template}: a partial model that can define a wide array of dataset-specialized models, by plugging in appropriate components. For each new few-shot classification problem, our approach therefore only requires inferring a small number of parameters to insert into the universal template. We design a separate network that produces an initialization of those parameters for each given task, and we then fine-tune its proposed initialization via a few steps of gradient descent. Our approach is more parameter-efficient, scalable and adaptable compared to previous methods, and achieves the state-of-the-art on the challenging Meta-Dataset benchmark.

On Monotonic Linear Interpolation of Neural Network Parameters
James Lucas University of Toronto, Juhan Bae University of Toronto, Michael Zhang University of Toronto, Stanislav Fort Google AI, Richard Zemel Columbia University, Roger Grosse University of Toronto

Linear interpolation between initial neural network parameters and converged parameters after training with stochastic gradient descent (SGD) typically leads to a monotonic decrease in the training objective. This Monotonic Linear Interpolation (MLI) property, first observed by Goodfellow et al. 2014, persists in spite of the non-convex objectives and highly non-linear training dynamics of neural networks. Extending this work, we evaluate several hypotheses for this property that, to our knowledge, have not yet been explored. Using tools from differential geometry, we draw connections between the interpolated paths in function space and the monotonicity of the network — providing sufficient conditions for the MLI property under mean squared error. While the MLI property holds under various settings (e.g., network architectures and learning problems), we show in practice that networks violating the MLI property can be produced systematically, by encouraging the weights to move far from initialization. The MLI property raises important questions about the loss landscape geometry of neural networks and highlights the need to further study their global properties.

A Computational Framework For Slang Generation
Zhewei Sun University of Toronto, Richard Zemel Columbia University, Yang Xu University of Toronto

Slang is a common type of informal language, but its flexible nature and paucity of data resources present challenges for existing natural language systems. We take an initial step toward machine generation of slang by developing a framework that models the speaker’s word choice in slang context. Our framework encodes novel slang meaning by relating the conventional and slang senses of a word while incorporating syntactic and contextual knowledge in slang usage. We construct the framework using a combination of probabilistic inference and neural contrastive learning. We perform rigorous evaluations on three slang dictionaries and show that our approach not only outperforms state-of-the-art language models, but also better predicts the historical emergence of slang word usages from 1960s to 2000s. We interpret the proposed models and find that the contrastively learned semantic space is sensitive to the similarities between slang and conventional senses of words. Our work creates opportunities for the automated generation and interpretation of informal language.

Wandering Within A World: Online Contextualized Few-Shot Learning
Mengye Ren University of Toronto, Michael Iuzzolino Google Research, Michael Mozer Google Research, Richard Zemel Columbia University

We aim to bridge the gap between typical human and machine-learning environments by extending the standard framework of few-shot learning to an online, continual setting. In this setting, episodes do not have separate training and testing phases, and instead models are evaluated online while learning novel classes. As in the real world, where the presence of spatiotemporal context helps us retrieve learned skills in the past, our online few-shot learning setting also features an underlying context that changes throughout time. Object classes are correlated within a context and inferring the correct context can lead to better performance. Building upon this setting, we propose a new few-shot learning dataset based on large scale indoor imagery that mimics the visual experience of an agent wandering within a world. Furthermore, we convert popular few-shot learning approaches into online versions and we also propose a new contextual prototypical memory model that can make use of spatiotemporal contextual information from the recent past.

Bayesian Few-Shot Classification With One-Vs-Each Polya-Gamma Augmented Gaussian Processes
Jake Snell University of Toronto, Richard Zemel Columbia University

Few-shot classification (FSC), the task of adapting a classifier to unseen classes given a small labeled dataset, is an important step on the path toward human-like machine learning. Bayesian methods are well-suited to tackling the fundamental issue of overfitting in the few-shot scenario because they allow practitioners to specify prior beliefs and update those beliefs in light of observed data. Contemporary approaches to Bayesian few-shot classification maintain a posterior distribution over model parameters, which is slow and requires storage that scales with model size. Instead, we propose a Gaussian process classifier based on a novel combination of Pólya-Gamma augmentation and the one-vs-each softmax approximation that allows us to efficiently marginalize over functions rather than model parameters. We demonstrate improved accuracy and uncertainty quantification on both standard few-shot classification benchmarks and few-shot domain transfer tasks.

Theoretical Bounds On Estimation Error For Meta-Learning
James Lucas University of Toronto, Mengye Ren University of Toronto, Irene Kameni African Master for Mathematical Sciences, Toni Pitassi Columbia University, Richard Zemel Columbia University

Machine learning models have traditionally been developed under the assumption that the training and test distributions match exactly. However, recent success in few-shot learning and related problems are encouraging signs that these models can be adapted to more realistic settings where train and test distributions differ. Unfortunately, there is severely limited theoretical support for these algorithms and little is known about the difficulty of these problems. In this work, we provide novel information-theoretic lower-bounds on minimax rates of convergence for algorithms that are trained on data from multiple sources and tested on novel data. Our bounds depend intuitively on the information shared between sources of data, and characterize the difficulty of learning in this setting for arbitrary algorithms. We demonstrate these bounds on a hierarchical Bayesian model of meta-learning, computing both upper and lower bounds on parameter estimation via maximum-a-posteriori inference.

A PAC-Bayesian Approach To Generalization Bounds For Graph Neural Networks
Renjie Liao University of Toronto, Raquel Urtasun University of Toronto, Richard Zemel Columbia University

In this paper, we derive generalization bounds for the two primary classes of graph neural networks (GNNs), namely graph convolutional networks (GCNs) and message passing GNNs (MPGNNs), via a PAC-Bayesian approach. Our result reveals that the maximum node degree and spectral norm of the weights govern the generalization bounds of both models. We also show that our bound for GCNs is a natural generalization of the results developed in arXiv:1707.09564v2 [cs.LG] for fully-connected and convolutional neural networks. For message passing GNNs, our PAC-Bayes bound improves over the Rademacher complexity based bound in arXiv:2002.06157v1 [cs.LG], showing a tighter dependency on the maximum node degree and the maximum hidden dimension. The key ingredients of our proofs are a perturbation analysis of GNNs and the generalization of PAC-Bayes analysis to non-homogeneous GNNs. We perform an empirical study on several real-world graph datasets and verify that our PAC-Bayes bound is tighter than others.

## Software Systems Laboratory Wins Best Paper Awards at the OSDI and USENIX ATC Conferences

Researchers from the Software Systems Laboratory bagged Best Paper Awards at the 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2021) and the 2021 USENIX Annual Technical Conference (USENIX ATC 2021).

#### Jay Lepreau Best Paper Award, OSDI’21

DistAI: Data-Driven Automated Invariant Learning for Distributed Protocols
Jianan Yao, Runzhou Tao, Ronghui Gu, Jason Nieh, Suman Jana, and Gabriel Ryan

Abstract:

Distributed systems are notoriously hard to implement correctly due to non-determinism. Finding the inductive invariant of the distributed protocol is a critical step in verifying the correctness of distributed systems, but takes a long time to do even for simple protocols. We present DistAI, a data-driven automated system for learning inductive invariants for distributed protocols. DistAI generates data by simulating the distributed protocol at different instance sizes and recording states as samples. Based on the observation that invariants are often concise in practice, DistAI starts with small invariant formulas and enumerates all strongest possible invariants that hold for all samples. It then feeds those invariants and the desired safety properties to an SMT solver to check if the conjunction of the invariants and the safety properties is inductive. Starting with small invariant formulas and strongest possible invariants avoids large SMT queries, improving SMT solver performance. Because DistAI starts with the strongest possible invariants, if the SMT solver fails, DistAI does not need to discard failed invariants, but knows to monotonically weaken them and try again with the solver, repeating the process until it eventually succeeds. We prove that DistAI is guaranteed to find the ∃-free inductive invariant that proves the desired safety properties in finite time, if one exists. Our evaluation shows that DistAI successfully verifies 13 common distributed protocols automatically and outperforms alternative methods both in the number of protocols it verifies and the speed at which it does so, in some cases by more than two orders of magnitude.

#### USENIX ATC Best Paper Award, ATC’21

Argus: Debugging Performance Issues in Modern Desktop Applications with Annotated Causal Tracing
Lingmei Weng, Peng Huang, Jason Nieh, and Junfeng Yang

Abstract:

Modern desktop applications involve many asynchronous, concurrent interactions that make performance issues difficult to diagnose. Although prior work has used causal tracing for debugging performance issues in distributed systems, we find that these techniques suffer from high inaccuracies for desktop applications. We present Argus, a fast, effective causal tracing tool for debugging performance anomalies in desktop applications. Argus introduces a novel notion of strong and weak edges to explicitly model and annotate trace graph ambiguities, a new beam-search-based diagnosis algorithm to select the most likely causal paths in the presence of ambiguities, and a new way to compare causal paths across normal and abnormal executions. We have implemented Argus across multiple versions of macOS and evaluated it on 12 infamous spinning pinwheel issues in popular macOS applications. Argus diagnosed the root causes for all issues, 10 of which were previously unknown, some of which have been open for several years. Argus incurs less than 5% CPU overhead when its system-wide tracing is enabled, making always-on tracing feasible.

## Making Our Computers More Secure

Associate Professor Simha Sethumadhavan, Mohamed Tarek, and Miguel Arroyo design new techniques to bolster memory safety; ideas are now being used by Air Force Research Lab.

## 9 Papers From CS Researchers Accepted to CVPR 2021

Research from the department has been accepted to the 2021 Computer Vision and Pattern Recognition (CVPR) Conference. The annual event explores machine learning, artificial intelligence, and computer vision research and its applications.

Open-Vocabulary Object Detection Using Captions
Alireza Zareian Snap Inc. and Columbia University, Kevin Dela Rosa Snap Inc., Derek Hao Hu Snap Inc., Shih-Fu Chang Columbia University

Abstract
Despite the remarkable accuracy of deep neural networks in object detection, they are costly to train and scale due to supervision requirements. Particularly, learning more object categories typically requires proportionally more bounding box annotations. Weakly supervised and zero-shot learning techniques have been explored to scale object detectors to more categories with less supervision, but they have not been as successful and widely adopted as supervised models. In this paper, we put forth a novel formulation of the object detection problem, namely open-vocabulary object detection, which is more general, more practical, and more effective than weakly supervised and zero-shot approaches. We propose a new method to train object detectors using bounding box annotations for a limited set of object categories, as well as image-caption pairs that cover a larger variety of objects at a significantly lower cost. We show that the proposed method can detect and localize objects for which no bounding box annotation is provided during training, at a significantly higher accuracy than zero-shot approaches. Meanwhile, objects with bounding box annotation can be detected almost as accurately as supervised methods, which is significantly better than weakly supervised baselines. Accordingly, we establish a new state-of-the-art for scalable object detection.

Vx2Text: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs
Xudong Lin Columbia University, Gedas Bertasius Facebook AI, Jue Wang Facebook AI, Shih-Fu Chang Columbia University, Devi Parikh Facebook AI and Georgia Tech, Lorenzo Torresani Facebook AI and Dartmouth

Abstract
We present Vx2Text, a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio. In order to leverage transformer networks, which have been shown to be effective at modeling language, each modality is first converted into a set of language embeddings by a learnable tokenizer. This allows our approach to perform multimodal fusion in the language space, thus eliminating the need for ad-hoc cross-modal fusion modules. To address the non-differentiability of tokenization on continuous inputs (e.g., video or audio), we utilize a relaxation scheme that enables end-to-end training. Furthermore, unlike prior encoder-only models, our network includes an autoregressive decoder to generate open-ended text from the multimodal embeddings fused by the language encoder. This renders our approach fully generative and makes it directly applicable to different “video+x to text” problems without the need to design specialized network heads for each task. The proposed framework is not only conceptually simple but also remarkably effective: experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks—captioning, question answering, and audio-visual scene-aware dialog. Our code will be made publicly available.

Co-Grounding Networks With Semantic Attention for Referring Expression Comprehension in Videos
Sijie Song Wangxuan Institute of Computer Technology, Xudong Lin Columbia University, Jiaying Liu Wangxuan Institute of Computer Technology, Zongming Guo Wangxuan Institute of Computer Technology, Shih-Fu Chang Columbia University

Abstract
In this paper, we address the problem of referring expression comprehension in videos, which is challenging due to complex expression and scene dynamics. Unlike previous methods which solve the problem in multiple stages (i.e., tracking, proposal-based matching), we tackle the problem from a novel perspective, co-grounding, with an elegant one-stage framework. We enhance the single-frame grounding accuracy by semantic attention learning and improve the cross-frame grounding consistency with co-grounding feature learning. Semantic attention learning explicitly parses referring cues in different attributes to reduce the ambiguity in the complex expression. Co-grounding feature learning boosts visual feature representations by integrating temporal correlation to reduce the ambiguity caused by scene dynamics. Experiment results demonstrate the superiority of our framework on the video grounding datasets VID and OTB in generating accurate and stable results across frames. Our model is also applicable to referring expression comprehension in images, illustrated by the improved performance on the RefCOCO dataset. Our project is available at https://sijiesong.github.io/co-grounding.

Seeing in Extra Darkness Using a Deep-Red Flash
Jinhui Xiong KAUST, Jian Wang Snap Research, Wolfgang Heidrich KAUST, Shree Nayar Snap Research and Columbia University

Abstract
We propose a new flash technique for low-light imaging, using deep-red light as an illuminating source. Our main observation is that in a dim environment, the human eye mainly uses rods for the perception of light, which are not sensitive to wavelengths longer than 620nm, yet the camera sensor still has a spectral response. We propose a novel modulation strategy when training a modern CNN model for guided image filtering, fusing a noisy RGB frame and a flash frame. This fusion network is further extended for video reconstruction. We have built a prototype with minor hardware adjustments and tested the new flash technique on a variety of static and dynamic scenes. The experimental results demonstrate that our method produces compelling reconstructions, even in extra dim conditions.

UC2: Universal Cross-Lingual Cross-Modal Vision-and-Language Pre-Training
Mingyang Zhou University of California, Davis, Luowei Zhou Microsoft Dynamics 365 AI Research, Shuohang Wang Microsoft Dynamics 365 AI Research, Yu Cheng Microsoft Dynamics 365 AI Research, Linjie Li Microsoft Dynamics 365 AI Research, Zhou Yu University of California, Davis and Columbia University, Jingjing Liu Microsoft Dynamics 365 AI Research

Abstract
Vision-and-language pre-training has achieved impressive success in learning multimodal representations between vision and language. To generalize this success to non-English languages, we introduce UC^2, the first machine translation-augmented framework for cross-lingual cross-modal representation learning. To tackle the scarcity problem of multilingual captions for image datasets, we first augment existing English-only datasets with other languages via machine translation (MT). Then we extend the standard Masked Language Modeling and Image-Text Matching training objectives to multilingual setting, where alignment between different languages is captured through shared visual context (eg. using image as pivot). To facilitate the learning of a joint embedding space of images and all languages of interest, we further propose two novel pre-training tasks, namely Maksed Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM), leveraging MT-enhanced translated data. Evaluation on multilingual image-text retrieval and multilingual visual question answering benchmarks demonstrates that our proposed framework achieves new state of the art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.

Learning Goals From Failure
Dave Epstein Columbia University and Carl Vondrick Columbia University

Abstract
We introduce a framework that predicts the goals behind observable human action in video. Motivated by evidence in developmental psychology, we leverage video of unintentional action to learn video representations of goals without direct supervision. Our approach models videos as contextual trajectories that represent both low-level motion and high-level action features. Experiments and visualizations show our trained model is able to predict the underlying goals in video of unintentional action. We also propose a method to “automatically correct” unintentional action by leveraging gradient signals of our model to adjust latent trajectories. Although the model is trained with minimal supervision, it is competitive with or outperforms baselines trained on large (supervised) datasets of successfully executed goals, showing that observing unintentional action is crucial to learning about goals in video.

Generative Interventions for Causal Learning
Chengzhi Mao Columbia University, Augustine Cha Columbia University, Amogh Gupta Columbia University, Hao Wang Rutgers University, Junfeng Yang Columbia University, Carl Vondrick Columbia University

Abstract
We introduce a framework for learning robust visual representations that generalize to new viewpoints, backgrounds, and scene contexts. Discriminative models often learn naturally occurring spurious correlations, which cause them to fail on images outside of the training distribution. In this paper, we show that we can steer generative models to manufacture interventions on features caused by confounding factors. Experiments, visualizations, and theoretical results show this method learns robust representations more consistent with the underlying causal relationships. Our approach improves performance on multiple datasets demanding out-of-distribution generalization, and we demonstrate state-of-the-art performance generalizing from ImageNet to ObjectNet dataset.

Learning the Predictability of the Future
Didac Suris Columbia University, Ruoshi Liu Columbia University, Carl Vondrick Columbia University

Abstract
We introduce a framework for learning from unlabeled video what is predictable in the future. Instead of committing up front to features to predict, our approach learns from data which features are predictable. Based on the observation that hyperbolic geometry naturally and compactly encodes hierarchical structure, we propose a predictive model in hyperbolic space. When the model is most confident, it will predict at a concrete level of the hierarchy, but when the model is not confident, it learns to automatically select a higher level of abstraction. Experiments on two established datasets show the key role of hierarchical representations for action prediction. Although our representation is trained with unlabeled video, visualizations show that action hierarchies emerge in the representation.

Linear Semantics in Generative Adversarial Networks
Jianjin Xu Columbia University, Changxi Zheng Columbia University

Abstract
Generative Adversarial Networks (GANs) are able to generate high-quality images, but it remains difficult to explicitly specify the semantics of synthesized images. In this work, we aim to better understand the semantic representation of GANs, and thereby enable semantic control in GAN’s generation process. Interestingly, we find that a well-trained GAN encodes image semantics in its internal feature maps in a surprisingly simple way: a linear transformation of feature maps suffices to extract the generated image semantics. To verify this simplicity, we conduct extensive experiments on various GANs and datasets; and thanks to this simplicity, we are able to learn a semantic segmentation model for a trained GAN from a small number (e.g., 8) of labeled images. Last but not least, leveraging our finding, we propose two few-shot image editing approaches, namely Semantic-Conditional Sampling and Semantic Image Editing. Given a trained GAN and as few as eight semantic annotations, the user is able to generate diverse images subject to a user-provided semantic layout, and control the synthesized image semantics. We have made the code publicly available.

## Five CS professors awarded Amazon Research Awards

Faculty from the department have been named recipients of the 2020 Amazon Research Awards.

## 6 Papers From the Department Accepted to the EACL 2021

Six papers from CS researchers were accepted to the 16th conference of the European Chapter of the Association for Computational Linguistics (EACL).  As the flagship European conference in the field of computational linguistics, EACL welcomes European and international researchers covering a broad spectrum of research areas that are concerned with computational approaches to natural language.

Below are brief descriptions and links to the papers.

Event-Driven News Stream Clustering using Entity-Aware Contextual Embeddings
Kailash Karthik Saravanakumar Columbia University, Miguel Ballesteros Amazon AI, Muthu Kumar Chandrasekaran Amazon AI, Kathleen McKeown Columbia University & Amazon AI

This paper presents a new clustering paradigm for news streams, where clusters have a one-to-one correspondence with real-world events (for example, the Suez canal blockage). An important aspect of this problem is that the number of clusters is unknown and varies with time (new events occur and old events cease to be of relevance). The proposed paradigm follows a pipeline approach – where representations are built for each new article, comparisons are made with existing clusters to pick the most compatible one, and finally, a clustering decision is produced.

A surprising observation from this work is that contextual embeddings (from models like BERT), in contrast to their overwhelming success in many NLP problems, achieve sub-par performance by themselves on this clustering problem. However, when combined with other representations (like TF-IDF and timestamps) and fine-tuned with task-specific augmentations, they achieve new state-of-the-art performance. Another interesting observation is that the widely reported B-Cubed metrics are biased towards large clusters and hence don’t capture cluster fragmentation on smaller clusters as well. Since clusters corresponding to emerging events are small and errors made on such clusters are highly undesirable, the authors suggest using an additional metric CEAF-e to evaluate models for this task.

Segmenting Subtitles for Correcting ASR Segmentation Errors
David Wan Columbia University, Chris Kedzie Columbia University, Faisal Ladakh Columbia University, Elsbeth Turcan Columbia University, Petra Galuszkova University of Maryland, Elena Zotkina University of Maryland, Zhengping Jiang Columbia University, Peter Bell University of Edinburgh, and Kathleen McKeown Columbia University

For the task of spoken language translation, the usual approach is to have a pipeline consisting of Automatic Speech Recognition (ASR) that transforms audio files into words and utterances in the original language and a Machine Translation (MT) that translate the utterances into the target language. However this setup may suffer from input-output mismatches: ASR segments utterances by acoustic information such as pauses, and thus may produce run-on sentences or sentence fragments, but MT is usually trained on proper sentences without such issues and may not perform well under such setting. This paper proposes the use of an intermediate model to segment utterances into sentences to improve performance in MT as well as other downstream tasks.

One crucial problem for developing such models is the lack of suitable training data for segmentation, especially when the languages involved are low-resourced. To this end, this paper also proposes a way to use subtitles dataset as proxy speech data as well as creating synthetic acoustic utterances that mimic common ASR errors for the model to learn to fix. Using a simple neural tagging model, the authors of this paper show improvement over the baseline ASR segmentation on MT for Lithuanian, Bulgarian, and Farisi. A surprising finding is that the segmentation model most improves the translation quality of more syntactically complex segments.

“Talk to me with left, right, and angles”: Lexical entrainment in spoken Hebrew dialogue
Andreas Weise CUNY Graduate Center, Vered Silber-Varod The Open University of Israel, Anat Lerner The Open University of Israel, Julia Hirschberg Columbia University, and Rivka Levitan Columbia University

It has been well-documented for several languages that human interlocutors tend to adapt their linguistic productions to become more similar to each other. This behavior, known as entrainment, affects lexical choice as well, both with regard to specific words, such as referring expressions, and overall style.

Lexical entrainment is the behavior that causes the words that speakers use in a conversation to become more similar over time. Entrainment more broadly is a human behavior causing interlocutors to adapt to each other to become more similar. Its effects are measurable but entrainment itself is not a measure.

This paper offers the first investigation of such lexical entrainment in Hebrew.

The analysis of Hebrew speakers interacting in a Map Task, a popular experimental setup, provides rich evidence of lexical entrainment. No clear pattern of differences is found between speaker pairs by the combination of their genders, nor between speakers by their individual gender. However, speakers in a position of less power are found to entrain more than those with greater power, which matches theoretical accounts.

Overall, the results mostly accord with those for American English. There is, however, a surprising lack of entrainment on a list of hedge words that were previously found to be highly entrained in English. This might be due to cultural differences between American and Israeli speakers that render adoption of a more tentative style less appropriate in the Hebrew context.

Entity-level Factual Consistency of Abstractive Text Summarization
Feng Nan Amazon Web Services, Ramesh Nallapati Amazon Web Services, Zhiguo Wang Amazon Web Services, Cicero Nogueira dos Santos Amazon Web Services, Henghui Zhu Amazon Web Services, Dejiao Zhang Amazon Web Services, Kathleen McKeown Amazon Web Services & Columbia University, Bing Xiang Amazon Web Services

A key challenge for abstractive summarization is ensuring factual consistency of the generated summary with respect to the original document. For example, state-of-the-art models trained on existing datasets exhibit entity hallucination, generating names of entities that are not present in the source document.

The paper proposes a set of new metrics to quantify the entity-level factual consistency of generated summaries and shows that the entity hallucination problem can be alleviated by simply filtering the training data. In addition, the paper introduces a summary-worthy entity classification task to the training process as well as a joint entity and summary generation approach, which yields further improvements in entity-level metrics.

“Laughing at you or with you”: The Role of Sarcasm in Shaping the Disagreement Space
Debanjan Ghosh Educational Testing Service, Ritvik Shrivastava MindMeld, Cisco Systems & Columbia University, and Smaranda Muresan Columbia University

Detecting arguments in online interactions is useful to understand how conflicts arise and get resolved. Users often use figurative language, such as sarcasm, either as persuasive devices or to attack the opponent by an ad hominem argument. To further our understanding of the role of sarcasm in shaping the disagreement space, the paper presents a thorough experimental setup using a corpus annotated with both argumentative moves (agree/disagree) and sarcasm. The research exploits joint modeling in terms of (a) applying discrete features that are useful in detecting sarcasm to the task of argumentative relation classification (agree/disagree/none), and (b) multitask learning for argumentative relation classification and sarcasm detection using deep learning architectures (e.g., dual Long ShortTerm Memory (LSTM) with hierarchical attention and Transformer-based architectures). The paper shows that modeling sarcasm improves the argumentative relation classification task (agree/disagree/none) in all setups.

A Unified Feature Representation for Lexical Connotations
Emily Allaway Columbia University and Kathleen McKeown Columbia University

Ideological attitudes and stances are often expressed through subtle meanings of words and phrases. Understanding these connotations is critical to recognize the cultural and emotional perspectives of the speaker. In this paper, the researchers use distant labeling to create a new lexical resource representing connotation aspects for nouns and adjectives. Their analysis shows that it aligns well with human judgments. Additionally, they present a method for creating lexical representations that capture connotations within the embedding space and show that using the embeddings provides a statistically significant improvement on the task of stance detection when data is limited.