## 13 Research Papers Accepted to ICML 2021

Papers from CS researchers have been accepted to the 38th International Conference on Machine Learning (ICML 2021).

Associate Professor Daniel Hsu was one of the publication chairs of the conference and Assistant Professor Elham Azizi helped organize the 2021 ICML Workshop on Computational Biology. The workshop highlighted how machine learning approaches can be tailored to making both translational and basic scientific discoveries with biological data.

Below are the abstracts and links to the accepted papers.

A Proxy Variable View of Shared Confounding
Yixin Wang Columbia University, David Blei Columbia University

Causal inference from observational data can be biased by unobserved confounders. Confounders, the variables that affect both the treatments and the outcome, induce spurious non-causal correlations between the two. Without additional conditions, unobserved confounders generally make causal quantities hard to identify. In this paper, we focus on the setting where there are many treatments with shared confounding, and we study the conditions under which causal identification is possible. The key observation is that we can view subsets of treatments as proxies of the unobserved confounder and identify the intervention distributions of the rest. Moreover, while existing identification formulas for proxy variables involve solving integral equations, we show that one can circumvent the need for such solutions by directly modeling the data. Finally, we extend these results to an expanded class of causal graphs, those with other confounders and selection variables.

Unsupervised Representation Learning via Neural Activation Coding
Yookoon Park Columbia University, Sangho Lee Seoul National University, Gunhee Kim Seoul National University, David Blei Columbia University

We present neural activation coding (NAC) as a novel approach for learning deep representations from unlabeled data for downstream applications. We argue that the deep encoder should maximize its nonlinear expressivity on the data for downstream predictors to take full advantage of its representation power. To this end, NAC maximizes the mutual information between activation patterns of the encoder and the data over a noisy communication channel. We show that learning for a noise-robust activation code increases the number of distinct linear regions of ReLU encoders, hence the maximum nonlinear expressivity. More interestingly, NAC learns both continuous and discrete representations of data, which we respectively evaluate on two downstream tasks: (i) linear classification on CIFAR-10 and ImageNet-1K and (ii) nearest neighbor retrieval on CIFAR-10 and FLICKR-25K. Empirical results show that NAC attains better or comparable performance on both tasks over recent baselines including SimCLR and DistillHash. In addition, NAC pretraining provides significant benefits to the training of deep generative models. Our code is available at https://github.com/yookoon/nac.

The Logical Options Framework
Brandon Araki MIT, Xiao Li MIT, Kiran Vodrahalli Columbia University, Jonathan DeCastro Toyota Research Institute, Micah Fry MIT Lincoln Laboratory, Daniela Rus MIT CSAIL

Learning composable policies for environments with complex rules and tasks is a challenging problem. We introduce a hierarchical reinforcement learning framework called the Logical Options Framework (LOF) that learns policies that are satisfying, optimal, and composable. LOF efficiently learns policies that satisfy tasks by representing the task as an automaton and integrating it into learning and planning. We provide and prove conditions under which LOF will learn satisfying, optimal policies. And lastly, we show how LOF’s learned policies can be composed to satisfy unseen tasks with only 10-50 retraining steps on our benchmarks. We evaluate LOF on four tasks in discrete and continuous domains, including a 3D pick-and-place environment.

Estimating Identifiable Causal Effects on Markov Equivalence Class through Double Machine Learning
Yonghan Jung Columbia University, Jin Tian Columbia University, Elias Bareinboim Columbia University

General methods have been developed for estimating causal effects from observational data under causal assumptions encoded in the form of a causal graph. Most of this literature assumes that the underlying causal graph is completely specified. However, only observational data is available in most practical settings, which means that one can learn at most a Markov equivalence class (MEC) of the underlying causal graph. In this paper, we study the problem of causal estimation from an MEC represented by a partial ancestral graph (PAG), which is learnable from observational data. We develop a general estimator for any identifiable causal effect in a PAG. The result fills a gap for an end-to-end solution to causal inference from observational data to effects estimation. Specifically, we develop a complete identification algorithm that derives an influence function for any identifiable causal effect from PAGs. We then construct a double/debiased machine learning (DML) estimator that is robust to model misspecification and biases in nuisance function estimation, permitting the use of modern machine learning techniques. Simulation results corroborate the theory.

Environment Inference for Invariant Learning
Elliot Creager University of Toronto, Joern Jacobsen Apple Inc., Richard Zemel Columbia University

Learning models that gracefully handle distribution shifts is central to research on domain generalization, robust optimization, and fairness. A promising formulation is domain-invariant learning, which identifies the key issue of learning which features are domain-specific versus domain-invariant. An important assumption in this area is that the training examples are partitioned into “domains” or “environments”. Our focus is on the more common setting where such partitions are not provided. We propose EIIL, a general framework for domain-invariant learning that incorporates Environment Inference to directly infer partitions that are maximally informative for downstream Invariant Learning. We show that EIIL outperforms invariant learning methods on the CMNIST benchmark without using environment labels, and significantly outperforms ERM on worst-group performance in the Waterbirds dataset. Finally, we establish connections between EIIL and algorithmic fairness, which enables EIIL to improve accuracy and calibration in a fair prediction problem.

SketchEmbedNet: Learning Novel Concepts by Imitating Drawings
Alex Wang University of Toronto, Mengye Ren University of Toronto, Richard Zemel Columbia University

Sketch drawings capture the salient information of visual concepts. Previous work has shown that neural networks are capable of producing sketches of natural objects drawn from a small number of classes. While earlier approaches focus on generation quality or retrieval, we explore properties of image representations learned by training a model to produce sketches of images. We show that this generative, class-agnostic model produces informative embeddings of images from novel examples, classes, and even novel datasets in a few-shot setting. Additionally, we find that these learned representations exhibit interesting structure and compositionality.

Universal Template for Few-Shot Dataset Generalization
Eleni Triantafillou University of Toronto, Hugo Larochelle Google Brain, Richard Zemel Columbia University, Vincent Dumoulin Google

Few-shot dataset generalization is a challenging variant of the well-studied few-shot classification problem where a diverse training set of several datasets is given, for the purpose of training an adaptable model that can then learn classes from new datasets using only a few examples. To this end, we propose to utilize the diverse training set to construct a universal template: a partial model that can define a wide array of dataset-specialized models, by plugging in appropriate components. For each new few-shot classification problem, our approach therefore only requires inferring a small number of parameters to insert into the universal template. We design a separate network that produces an initialization of those parameters for each given task, and we then fine-tune its proposed initialization via a few steps of gradient descent. Our approach is more parameter-efficient, scalable and adaptable compared to previous methods, and achieves the state-of-the-art on the challenging Meta-Dataset benchmark.

On Monotonic Linear Interpolation of Neural Network Parameters
James Lucas University of Toronto, Juhan Bae University of Toronto, Michael Zhang University of Toronto, Stanislav Fort Google AI, Richard Zemel Columbia University, Roger Grosse University of Toronto

Linear interpolation between initial neural network parameters and converged parameters after training with stochastic gradient descent (SGD) typically leads to a monotonic decrease in the training objective. This Monotonic Linear Interpolation (MLI) property, first observed by Goodfellow et al. 2014, persists in spite of the non-convex objectives and highly non-linear training dynamics of neural networks. Extending this work, we evaluate several hypotheses for this property that, to our knowledge, have not yet been explored. Using tools from differential geometry, we draw connections between the interpolated paths in function space and the monotonicity of the network — providing sufficient conditions for the MLI property under mean squared error. While the MLI property holds under various settings (e.g., network architectures and learning problems), we show in practice that networks violating the MLI property can be produced systematically, by encouraging the weights to move far from initialization. The MLI property raises important questions about the loss landscape geometry of neural networks and highlights the need to further study their global properties.
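To make the interpolation procedure in this abstract concrete, here is a minimal sketch of how the MLI property can be checked. It is not the authors' code: it assumes a toy least-squares model trained with full-batch gradient descent rather than a neural network trained with SGD, and all function names are hypothetical. Because this toy objective is convex, a monotonic decrease is expected here, whereas the paper's point is that the property often holds even for non-convex networks.

```python
import numpy as np

def mse_loss(theta, X, y):
    """Mean squared error of a linear model with parameters theta."""
    residual = X @ theta - y
    return float(np.mean(residual ** 2))

def train_gd(theta0, X, y, lr, steps):
    """Plain full-batch gradient descent on the MSE objective."""
    theta = theta0.copy()
    n = len(y)
    for _ in range(steps):
        grad = (2.0 / n) * (X.T @ (X @ theta - y))
        theta = theta - lr * grad
    return theta

def interpolation_losses(theta0, theta_final, X, y, num_points=21):
    """Loss along the straight line from initial to final parameters."""
    alphas = np.linspace(0.0, 1.0, num_points)
    return [mse_loss((1 - a) * theta0 + a * theta_final, X, y) for a in alphas]

def is_monotonically_non_increasing(losses, tol=1e-12):
    """True if the loss never rises between consecutive interpolation points."""
    return all(b <= a + tol for a, b in zip(losses, losses[1:]))

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

theta0 = rng.normal(size=5)
theta_final = train_gd(theta0, X, y, lr=0.05, steps=500)
losses = interpolation_losses(theta0, theta_final, X, y)
print(is_monotonically_non_increasing(losses))
```

The same check applies to a neural network by flattening its weights into a single vector before interpolating.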

A Computational Framework For Slang Generation
Zhewei Sun University of Toronto, Richard Zemel Columbia University, Yang Xu University of Toronto

Slang is a common type of informal language, but its flexible nature and paucity of data resources present challenges for existing natural language systems. We take an initial step toward machine generation of slang by developing a framework that models the speaker’s word choice in slang context. Our framework encodes novel slang meaning by relating the conventional and slang senses of a word while incorporating syntactic and contextual knowledge in slang usage. We construct the framework using a combination of probabilistic inference and neural contrastive learning. We perform rigorous evaluations on three slang dictionaries and show that our approach not only outperforms state-of-the-art language models, but also better predicts the historical emergence of slang word usages from the 1960s to the 2000s. We interpret the proposed models and find that the contrastively learned semantic space is sensitive to the similarities between slang and conventional senses of words. Our work creates opportunities for the automated generation and interpretation of informal language.

Wandering Within A World: Online Contextualized Few-Shot Learning
Mengye Ren University of Toronto, Michael Iuzzolino Google Research, Michael Mozer Google Research, Richard Zemel Columbia University

We aim to bridge the gap between typical human and machine-learning environments by extending the standard framework of few-shot learning to an online, continual setting. In this setting, episodes do not have separate training and testing phases, and instead models are evaluated online while learning novel classes. As in the real world, where the presence of spatiotemporal context helps us retrieve learned skills in the past, our online few-shot learning setting also features an underlying context that changes throughout time. Object classes are correlated within a context and inferring the correct context can lead to better performance. Building upon this setting, we propose a new few-shot learning dataset based on large scale indoor imagery that mimics the visual experience of an agent wandering within a world. Furthermore, we convert popular few-shot learning approaches into online versions and we also propose a new contextual prototypical memory model that can make use of spatiotemporal contextual information from the recent past.

Bayesian Few-Shot Classification With One-Vs-Each Polya-Gamma Augmented Gaussian Processes
Jake Snell University of Toronto, Richard Zemel Columbia University

Few-shot classification (FSC), the task of adapting a classifier to unseen classes given a small labeled dataset, is an important step on the path toward human-like machine learning. Bayesian methods are well-suited to tackling the fundamental issue of overfitting in the few-shot scenario because they allow practitioners to specify prior beliefs and update those beliefs in light of observed data. Contemporary approaches to Bayesian few-shot classification maintain a posterior distribution over model parameters, which is slow and requires storage that scales with model size. Instead, we propose a Gaussian process classifier based on a novel combination of Pólya-Gamma augmentation and the one-vs-each softmax approximation that allows us to efficiently marginalize over functions rather than model parameters. We demonstrate improved accuracy and uncertainty quantification on both standard few-shot classification benchmarks and few-shot domain transfer tasks.

Theoretical Bounds On Estimation Error For Meta-Learning
James Lucas University of Toronto, Mengye Ren University of Toronto, Irene Kameni African Master for Mathematical Sciences, Toni Pitassi Columbia University, Richard Zemel Columbia University

Machine learning models have traditionally been developed under the assumption that the training and test distributions match exactly. However, recent successes in few-shot learning and related problems are encouraging signs that these models can be adapted to more realistic settings where train and test distributions differ. Unfortunately, there is severely limited theoretical support for these algorithms and little is known about the difficulty of these problems. In this work, we provide novel information-theoretic lower-bounds on minimax rates of convergence for algorithms that are trained on data from multiple sources and tested on novel data. Our bounds depend intuitively on the information shared between sources of data, and characterize the difficulty of learning in this setting for arbitrary algorithms. We demonstrate these bounds on a hierarchical Bayesian model of meta-learning, computing both upper and lower bounds on parameter estimation via maximum-a-posteriori inference.

A PAC-Bayesian Approach To Generalization Bounds For Graph Neural Networks
Renjie Liao University of Toronto, Raquel Urtasun University of Toronto, Richard Zemel Columbia University

In this paper, we derive generalization bounds for the two primary classes of graph neural networks (GNNs), namely graph convolutional networks (GCNs) and message passing GNNs (MPGNNs), via a PAC-Bayesian approach. Our result reveals that the maximum node degree and spectral norm of the weights govern the generalization bounds of both models. We also show that our bound for GCNs is a natural generalization of the results developed in arXiv:1707.09564v2 [cs.LG] for fully-connected and convolutional neural networks. For message passing GNNs, our PAC-Bayes bound improves over the Rademacher complexity based bound in arXiv:2002.06157v1 [cs.LG], showing a tighter dependency on the maximum node degree and the maximum hidden dimension. The key ingredients of our proofs are a perturbation analysis of GNNs and the generalization of PAC-Bayes analysis to non-homogeneous GNNs. We perform an empirical study on several real-world graph datasets and verify that our PAC-Bayes bound is tighter than others.

## 6 Papers From the Department Accepted to EACL 2021

Six papers from CS researchers were accepted to the 16th conference of the European Chapter of the Association for Computational Linguistics (EACL). As the flagship European conference in the field of computational linguistics, EACL welcomes European and international researchers covering a broad spectrum of research areas that are concerned with computational approaches to natural language.

Below are brief descriptions and links to the papers.

Event-Driven News Stream Clustering using Entity-Aware Contextual Embeddings
Kailash Karthik Saravanakumar Columbia University, Miguel Ballesteros Amazon AI, Muthu Kumar Chandrasekaran Amazon AI, Kathleen McKeown Columbia University & Amazon AI

This paper presents a new clustering paradigm for news streams, where clusters have a one-to-one correspondence with real-world events (for example, the Suez Canal blockage). An important aspect of this problem is that the number of clusters is unknown and varies with time (new events occur and old events cease to be of relevance). The proposed paradigm follows a pipeline approach: representations are built for each new article, comparisons are made with existing clusters to pick the most compatible one, and finally a clustering decision is produced.

A surprising observation from this work is that contextual embeddings (from models like BERT), in contrast to their overwhelming success in many NLP problems, achieve sub-par performance by themselves on this clustering problem. However, when combined with other representations (like TF-IDF and timestamps) and fine-tuned with task-specific augmentations, they achieve new state-of-the-art performance. Another interesting observation is that the widely reported B-Cubed metrics are biased towards large clusters and hence don’t capture cluster fragmentation on smaller clusters as well. Since clusters corresponding to emerging events are small and errors made on such clusters are highly undesirable, the authors suggest using an additional metric CEAF-e to evaluate models for this task.
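The bias toward large clusters can be seen in a small worked example. The sketch below uses the standard item-averaged definition of B-Cubed precision and recall; it is not the authors' evaluation code, and the article IDs, event labels, and function name are hypothetical. A two-article event is completely fragmented into singletons, yet the overall score stays high because the eight articles of the large event dominate the average.

```python
from collections import Counter

def b_cubed(gold, predicted):
    """Item-averaged B-Cubed precision, recall, and F1.

    gold and predicted map each item ID to its gold event label and
    its predicted cluster ID, respectively.
    """
    items = list(gold)
    gold_sizes = Counter(gold.values())
    pred_sizes = Counter(predicted.values())
    # Number of items sharing both the same gold label and predicted cluster.
    overlap = Counter((gold[i], predicted[i]) for i in items)

    precision = sum(
        overlap[(gold[i], predicted[i])] / pred_sizes[predicted[i]] for i in items
    ) / len(items)
    recall = sum(
        overlap[(gold[i], predicted[i])] / gold_sizes[gold[i]] for i in items
    ) / len(items)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical stream: a large event A (8 articles) and a small event B (2 articles).
gold = {f"a{k}": "A" for k in range(8)}
gold.update({"b0": "B", "b1": "B"})

# The large event is clustered perfectly; the small event is split into singletons.
predicted = {f"a{k}": 0 for k in range(8)}
predicted.update({"b0": 1, "b1": 2})

precision, recall, f1 = b_cubed(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here precision is 1.0 and recall is 0.9, so F1 remains near 0.95 even though the emerging event was never assembled into a single cluster.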

Segmenting Subtitles for Correcting ASR Segmentation Errors
David Wan Columbia University, Chris Kedzie Columbia University, Faisal Ladhak Columbia University, Elsbeth Turcan Columbia University, Petra Galuszkova University of Maryland, Elena Zotkina University of Maryland, Zhengping Jiang Columbia University, Peter Bell University of Edinburgh, and Kathleen McKeown Columbia University

For the task of spoken language translation, the usual approach is a pipeline consisting of an Automatic Speech Recognition (ASR) system that transforms audio files into words and utterances in the original language and a Machine Translation (MT) system that translates the utterances into the target language. However, this setup may suffer from input-output mismatches: ASR segments utterances by acoustic information such as pauses, and thus may produce run-on sentences or sentence fragments, but MT is usually trained on proper sentences without such issues and may not perform well in such a setting. This paper proposes the use of an intermediate model to segment utterances into sentences to improve performance in MT as well as other downstream tasks.

One crucial problem for developing such models is the lack of suitable training data for segmentation, especially when the languages involved are low-resourced. To this end, this paper also proposes a way to use subtitle datasets as proxy speech data and to create synthetic acoustic utterances that mimic common ASR errors for the model to learn to fix. Using a simple neural tagging model, the authors of this paper show improvement over the baseline ASR segmentation on MT for Lithuanian, Bulgarian, and Farsi. A surprising finding is that the segmentation model most improves the translation quality of more syntactically complex segments.

“Talk to me with left, right, and angles”: Lexical entrainment in spoken Hebrew dialogue
Andreas Weise CUNY Graduate Center, Vered Silber-Varod The Open University of Israel, Anat Lerner The Open University of Israel, Julia Hirschberg Columbia University, and Rivka Levitan Columbia University

It has been well-documented for several languages that human interlocutors tend to adapt their linguistic productions to become more similar to each other. This behavior, known as entrainment, affects lexical choice as well, both with regard to specific words, such as referring expressions, and overall style.

Lexical entrainment is the behavior that causes the words that speakers use in a conversation to become more similar over time. Entrainment more broadly is a human behavior causing interlocutors to adapt to each other to become more similar. Its effects are measurable but entrainment itself is not a measure.

This paper offers the first investigation of such lexical entrainment in Hebrew.

The analysis of Hebrew speakers interacting in a Map Task, a popular experimental setup, provides rich evidence of lexical entrainment. No clear pattern of differences is found between speaker pairs by the combination of their genders, nor between speakers by their individual gender. However, speakers in a position of less power are found to entrain more than those with greater power, which matches theoretical accounts.

Overall, the results mostly accord with those for American English. There is, however, a surprising lack of entrainment on a list of hedge words that were previously found to be highly entrained in English. This might be due to cultural differences between American and Israeli speakers that render adoption of a more tentative style less appropriate in the Hebrew context.

Entity-level Factual Consistency of Abstractive Text Summarization
Feng Nan Amazon Web Services, Ramesh Nallapati Amazon Web Services, Zhiguo Wang Amazon Web Services, Cicero Nogueira dos Santos Amazon Web Services, Henghui Zhu Amazon Web Services, Dejiao Zhang Amazon Web Services, Kathleen McKeown Amazon Web Services & Columbia University, Bing Xiang Amazon Web Services

A key challenge for abstractive summarization is ensuring factual consistency of the generated summary with respect to the original document. For example, state-of-the-art models trained on existing datasets exhibit entity hallucination, generating names of entities that are not present in the source document.

The paper proposes a set of new metrics to quantify the entity-level factual consistency of generated summaries and shows that the entity hallucination problem can be alleviated by simply filtering the training data. In addition, the paper introduces a summary-worthy entity classification task to the training process as well as a joint entity and summary generation approach, which yields further improvements in entity-level metrics.
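As a rough illustration of what an entity-level consistency check might look like, the sketch below computes the fraction of summary entities that also appear in the source and uses it to filter training pairs. This is an assumption-laden toy, not the paper's metric definitions or filtering procedure: entities are assumed to be already extracted by some named-entity recognizer, matching is by lowercased exact string, and the function names and threshold are hypothetical.

```python
def entity_precision(summary_entities, source_entities):
    """Fraction of distinct summary entities that also occur in the source.

    Both arguments are iterables of entity strings, assumed to come from
    a named-entity recognizer that is run beforehand.
    """
    summary_set = {e.lower() for e in summary_entities}
    source_set = {e.lower() for e in source_entities}
    if not summary_set:
        # Convention chosen here: no entities means nothing was hallucinated.
        return 1.0
    return len(summary_set & source_set) / len(summary_set)

def filter_training_pairs(pairs, threshold=1.0):
    """Keep pairs whose reference summary meets the precision threshold.

    Each pair is (source_entities, summary_entities). The threshold is a
    hypothetical knob, not a value taken from the paper.
    """
    return [
        (src, summ) for src, summ in pairs
        if entity_precision(summ, src) >= threshold
    ]

# Hypothetical example: "Stanford" never appears in the source document.
source = ["Columbia University", "ICML"]
summary = ["Columbia University", "Stanford"]
score = entity_precision(summary, source)
print(score)

kept = filter_training_pairs([(source, summary), (source, ["ICML"])])
print(len(kept))
```

A production version would need entity normalization (aliases, partial names), which this exact-match sketch deliberately leaves out.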

“Laughing at you or with you”: The Role of Sarcasm in Shaping the Disagreement Space
Debanjan Ghosh Educational Testing Service, Ritvik Shrivastava MindMeld, Cisco Systems & Columbia University, and Smaranda Muresan Columbia University

Detecting arguments in online interactions is useful to understand how conflicts arise and get resolved. Users often use figurative language, such as sarcasm, either as persuasive devices or to attack the opponent by an ad hominem argument. To further our understanding of the role of sarcasm in shaping the disagreement space, the paper presents a thorough experimental setup using a corpus annotated with both argumentative moves (agree/disagree) and sarcasm. The research exploits joint modeling in terms of (a) applying discrete features that are useful in detecting sarcasm to the task of argumentative relation classification (agree/disagree/none), and (b) multitask learning for argumentative relation classification and sarcasm detection using deep learning architectures (e.g., dual Long Short-Term Memory (LSTM) with hierarchical attention and Transformer-based architectures). The paper shows that modeling sarcasm improves the argumentative relation classification task (agree/disagree/none) in all setups.

A Unified Feature Representation for Lexical Connotations
Emily Allaway Columbia University and Kathleen McKeown Columbia University

Ideological attitudes and stances are often expressed through subtle meanings of words and phrases. Understanding these connotations is critical to recognizing the cultural and emotional perspectives of the speaker. In this paper, the researchers use distant labeling to create a new lexical resource representing connotation aspects for nouns and adjectives. Their analysis shows that it aligns well with human judgments. Additionally, they present a method for creating lexical representations that capture connotations within the embedding space and show that using the embeddings provides a statistically significant improvement on the task of stance detection when data is limited.