The 37th International Conference on Machine Learning (ICML), the annual gathering of the machine learning research community, is taking place virtually this week. Professor David Blei is the conference's General Chair.
Professor Elias Bareinboim presented a tutorial entitled “Towards Causal Reinforcement Learning,” where he discussed a new approach for decision-making under uncertainty in sequential settings. The tutorial is based on his observation, in his own words, “Causal inference (CI) and reinforcement learning (RL) are two disciplines that have evolved independently and achieved great successes in the past decades, but, in reality, are much closer than one would expect. Many new learning problems emerge whenever we understand and formally acknowledge their connection. I believe this understanding will make them closer and more applicable to real-world settings.”
In particular, Bareinboim explained that CI provides a set of tools and principles that allows one to combine structural invariances about the environment with data to reason about questions of counterfactual nature — i.e., what would have happened had reality been different, even when no data about this imagined reality is available. On the other hand, RL is concerned with efficiently finding a policy that optimizes a specific function (e.g., reward, regret) in interactive environments under sampling constraints, i.e., regarding the agent’s number of interactions with the environment. These two disciplines have evolved with virtually no interaction between them. Bareinboim noticed that even though they operate under different aspects of reality (an “environment,” as usually said in RL jargon), they are two sides of the same coin, namely, counterfactual relations.
His group at Columbia, the Causal Artificial Intelligence Lab, has been investigating the implications of this observation for decision-making in partially observable environments over the last several years. In the first part of his tutorial, Bareinboim introduced the basics of CI and RL, and then explained how they fit into the larger picture through the lens of causal reinforcement learning (CRL). In particular, he used a newly proved result (joint with collaborators at Stanford University), the Causal Hierarchy Theorem (CHT), to discuss the different types of "interactions" an agent could have with the environment it is deployed in, namely, "seeing", "doing", and "imagining." He then placed the RL literature on on- and off-policy learning, as well as the CI type of learning, called do-calculus learning, under the same formal umbrella. Having formalized CI and RL in a shared notation, in the second part of the tutorial he observed that CRL opens up a new family of learning problems that are both real and pervasive, but that had not been acknowledged before. He discussed many of these problems, including the combination of online and offline learning (called generalized policy learning), when and where the agent should intervene in a system, counterfactual decision-making, generalizability across environments, and causal imitation learning, to name a few.
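To make the distinction between these modes of interaction concrete, here is a minimal sketch (not taken from the tutorial) that contrasts "seeing" and "doing" in a toy structural causal model with a hidden confounder; all variable names and probabilities are invented for illustration.

```python
# A minimal sketch contrasting "seeing" and "doing" in a toy structural
# causal model with an unobserved confounder U (U -> X, U -> Y, X -> Y).
import random

random.seed(0)

def sample(do_x=None):
    """One draw from the SCM; do_x forces the value of X (an intervention)."""
    u = random.random() < 0.5                       # hidden confounder
    x = do_x if do_x is not None else (u or random.random() < 0.1)
    y = random.random() < (0.8 if x and u else 0.6 if x else 0.5 if u else 0.2)
    return x, y

n = 100_000
# "Seeing": P(Y=1 | X=1) estimated from observational draws.
obs = [y for x, y in (sample() for _ in range(n)) if x]
# "Doing": P(Y=1 | do(X=1)) estimated by intervening on X.
exp = [y for _, y in (sample(do_x=True) for _ in range(n))]

print(f"P(Y=1 | X=1)     ~ {sum(obs)/len(obs):.3f}")   # confounded estimate
print(f"P(Y=1 | do(X=1)) ~ {sum(exp)/len(exp):.3f}")   # interventional quantity
# "Imagining" (counterfactuals) needs one more step: abducing the latent U
# for a particular individual before re-running the mutated model.
```

The two printed quantities differ because conditioning on X=1 also tells us something about the hidden U, while intervening on X does not; that gap is exactly what separates the "seeing" and "doing" layers of the hierarchy.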
Bareinboim's program is to develop a principled framework for designing causal AI systems that integrate observational, experimental, and counterfactual data, modes of reasoning, and knowledge, which he believes will lead to "a natural treatment to human-like explainability and rational decision-making." If you are interested in joining the effort, he is actively recruiting graduate students, postdoctoral scholars, and collaborators who believe in Causal AI. The tutorial's material and further resources about CRL can be found here: https://crl.causalai.net.
Below are links to the accepted papers and brief research descriptions.
One of the central challenges in the data sciences is to compute the effect of a novel (never experienced before) intervention from a combination of observational and experimental distributions, under the assumptions encoded in the form of a causal graph. Formally speaking, this problem appears under the rubric of causal effect identification and has been studied under various conditions in the theory of data-fusion (Survey: Bareinboim & Pearl, PNAS 2016; Video: Columbia Talk). Most of this literature, however, implicitly assumes that every variable modeled in the graph is measured throughout the datasets. This is a very stringent condition. In practice, the data collected in different studies are usually not consistent, i.e., each dataset measures a different set of variables.
For concreteness, consider a health care researcher who wants to estimate the effect of exercise (X) on cardiac stroke (Y) from two existing datasets. The first dataset is from an experimental study of the effect of exercise (X) on blood pressure (B) for different age groups (A). The second dataset is from an observational study that investigated the association between BMI (C), blood pressure (B), and stroke (Y). Currently, no algorithm can take these two datasets over {X, Y, A, B, C} and systematically combine them so as to identify the targeted causal effect — how exercise (X) affects cardiac stroke (Y).
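As a concrete illustration of the partial-observability setup, the sketch below simulates a single underlying model and then exposes only the variables each study actually measures. The structural equations are hypothetical and are not taken from the paper; they only mimic the two datasets described above.

```python
# Hypothetical simulation of two partially observed studies over {X, Y, A, B, C}.
import numpy as np

rng = np.random.default_rng(0)

def world(n, randomize_x=False):
    a = rng.integers(0, 3, n)                       # age group
    c = rng.normal(25, 4, n)                        # BMI
    x = (rng.random(n) < 0.5) if randomize_x else (rng.random(n) < 1/(1+np.exp(0.1*(c-25))))
    b = 120 - 5*x + 2*a + 0.8*(c-25) + rng.normal(0, 5, n)                  # blood pressure
    y = (rng.random(n) < 1/(1+np.exp(-(0.05*(b-120) - 0.5*x)))).astype(int)  # stroke
    return dict(A=a, X=x.astype(int), B=b, C=c, Y=y)

full_exp = world(50_000, randomize_x=True)   # study 1: exercise X randomized
full_obs = world(50_000)                     # study 2: purely observational

dataset1 = {v: full_exp[v] for v in ("X", "B", "A")}   # measures only X, B, A
dataset2 = {v: full_obs[v] for v in ("C", "B", "Y")}   # measures only C, B, Y

# Neither dataset contains both X and Y, so P(Y | do(X)) cannot be read off
# either one directly; GID-PO asks when and how the two can be combined.
```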
In this paper, the researchers studied the causal effect identifiability problem when the available distributions encompass different sets of variables, which they refer to as causal effect identification under partial-observability (GID-PO). They derived properties of the factors that comprise a causal effect under various levels of abstraction (known as projections), and then characterized how these factors relate to one another with respect to the identifiability of a targeted intervention. They established a sufficient graphical criterion for determining whether the effects are identifiable from partially-observed distributions. Finally, building on these graphical properties, they developed an algorithm that returns a formula for a causal effect in terms of the available distributions whenever the effect is found to be identifiable.
One metaphor used throughout the paper is that of a jigsaw puzzle: the targeted effect is composed of pieces, and each dataset contains chunks, where each chunk is one or more pieces merged together. The task is to combine the chunks so as to solve the puzzle, even though not all puzzles are solvable. They conjecture that the identifiability problem under partial-observability is NP-complete.
“This conjecture is particularly surprising since many existing classes of identifiability problems (the decision version of the problem, not estimation) have been solved entirely in polynomial time,” said Sanghack Lee, an associate research scientist in the CausalAI Lab. “The time complexity of an identification problem may depend crucially on which variables are missing from the datasets, not on the number of available datasets nor on the size of the causal graph.”
Linear regression and its generalizations have been the go-to tool of a generation of data scientists trying to understand the relationships among variables in multivariate systems, and the workhorse behind countless breakthroughs in science. Modern statistics and machine learning techniques allow one to scale regression up to thousands of variables simultaneously. Despite the power of this family of methods, it only explains the association (or correlation) between variables, while remaining silent with respect to causation. In practice, a health scientist may be interested in knowing the effect of a new treatment on the survival of their patients, while an economist may attempt to understand the unintended consequences of a new policy on a nation’s gross domestic product. If regression analysis doesn’t allow scientists to answer their most pressing questions, which technique or framework could legitimize such inferences?
The answer lies in the tools of causal inference, specifically in the methods developed to solve the problem of “identification”. In the context of linear systems, formally speaking, identification asks whether a specific causal coefficient can be uniquely computed from the observational covariance matrix and the causal model. There exists a rich literature on linear identification, which includes a general algebraic solution (based on Groebner bases) that is doubly-exponential. In practice, and given this hardness result, the most common approach in the field is known as the method of instrumental variables (IVs), which looks for a particular substructure within the covariance matrix.
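For readers unfamiliar with the method, here is a minimal numpy sketch of the instrumental-variable idea in a toy linear model with an unobserved confounder. The variable names and coefficients are invented for illustration; this is not the paper's algorithm.

```python
# Toy linear SCM: Z -> X -> Y with an unobserved confounder U of X and Y.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

u = rng.normal(size=n)                         # unobserved confounder
z = rng.normal(size=n)                         # instrument: affects X, not Y directly
x = 0.8 * z + 1.0 * u + rng.normal(size=n)
y = 2.0 * x - 1.5 * u + rng.normal(size=n)     # true causal coefficient is 2.0

# Ordinary regression of Y on X is biased by U:
ols = np.cov(x, y)[0, 1] / np.var(x)
# The IV estimate uses only covariance entries involving the instrument Z:
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(f"OLS estimate ~ {ols:.2f}   IV estimate ~ {iv:.2f}   (truth = 2.0)")
```

The IV ratio recovers the causal coefficient because the instrument Z is correlated with X but has no path to Y except through X, which is exactly the kind of covariance-matrix substructure an identification algorithm searches for.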
The research develops a new polynomial-time algorithm for linear identification that generalizes IVs and all known efficient methods. The machinery underlying this algorithm unifies several disparate methods developed in the literature over the last two decades. Perhaps surprisingly, it turns out that this new method also subsumes identification approaches based on conditioning, some of which have high identification power but undetermined computational complexity, with certain variants shown to be NP-hard.
Finally, this paper also allows one to efficiently solve for total effects based on a new decomposition strategy of the covariance matrix. It is still an open problem whether there exists an algorithm capable of identifying all causal coefficients in polynomial time.
This paper investigates the problem of learning decision rules that dictate what action should be taken at each state, given an evolving set of covariates (features) and actions; such rules are usually called Dynamic Treatment Regimes (DTRs). DTRs have become increasingly popular in healthcare decision-making settings.
Consider the treatment of alcohol dependence. Based on the condition of alcohol-dependent patients (S1), the physician may prescribe medication or behavioral therapy (X1). Patients are then classified as responders or non-responders (S2) based on their level of heavy drinking within the next two months. The physician then must decide whether to continue the initial treatment or to switch to an augmented plan combining both medication and behavioral therapy (X2). The primary outcome Y is the percentage of abstinent days over a 12-month period. A DTR here is a sequence of decision rules x1 = f1(s1), x2 = f2(x1, s1, s2) that decides the treatment X so as to optimize the expected outcome Y.
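A DTR of this form can be written directly as a pair of policy functions. The sketch below is a hypothetical toy version of the alcohol-dependence example, with invented dynamics, meant only to show how a DTR is evaluated by its expected outcome; it is not the paper's model.

```python
# A toy two-stage DTR: x1 = f1(s1), x2 = f2(x1, s1, s2), evaluated by E[Y].
import random

MED, BEHAVIORAL, AUGMENTED = "med", "behavioral", "augmented"

def f1(s1):
    # first-stage rule: medication for severe cases, behavioral therapy otherwise
    return MED if s1 == "severe" else BEHAVIORAL

def f2(x1, s1, s2):
    # second-stage rule: keep the plan for responders, augment otherwise
    return x1 if s2 == "responder" else AUGMENTED

def expected_outcome(f1, f2, n=100_000):
    random.seed(0)
    total = 0.0
    for _ in range(n):
        s1 = random.choice(["mild", "severe"])                        # initial condition
        x1 = f1(s1)
        s2 = "responder" if random.random() < (0.6 if x1 == MED else 0.5) else "non-responder"
        x2 = f2(x1, s1, s2)
        base = 0.5 if s2 == "responder" else 0.3                      # toy outcome model
        y = base + (0.1 if x2 == AUGMENTED and s2 == "non-responder" else 0.0)
        total += y
    return total / n          # expected fraction of abstinent days (toy scale)

print(f"E[Y] under this DTR ~ {expected_outcome(f1, f2):.3f}")
```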
In this paper, the researchers studied how to find the optimal DTR in the context of online reinforcement learning. It has been shown that finding the optimal DTR requires Ω(|X ⋃ S|) trials, where |X ⋃ S| is the size of the treatment and state variables. In many applications, the domains of X and S could be significantly large, which means that the time required to ensure appropriate learning may be unattainable. They develop a novel strategy that leverages the causal diagram of the underlying environment to improve this sample complexity, leading to online algorithms that require an exponentially smaller number of trials. They also introduce novel methods to derive bounds over the underlying system dynamics (i.e., the causal effects) from the combination of the causal diagram and the observational (nonexperimental) data (e.g., from the physicians’ prior decisions). The derived bounds are then incorporated into proposed online decision-making algorithms.
“This result is surprising since most of the existing methods would suggest discarding the observational data when it is plagued with the unobserved confounding, and the causal effect is not uniquely identifiable,” said Junzhe Zhang, a research assistant in the CausalAI lab. “We show that the performance of online learners could be consistently improved by leveraging abundant, yet biased observational data.”
Stochastic Flows and Geometric Optimization on the Orthogonal Group Krzysztof Choromanski Google Brain Robotics, Valerii Likhosherstov University of Cambridge, Jared Q Davis Google Research, David Cheikhi Columbia University, Achille Nazaret Columbia University, Xingyou Song Google Brain, Achraf Bahamou Columbia University, Jack Parker-Holder University of Oxford, Mrugank Akarte Columbia University, Yuan Gao Columbia University, Jacob Bergquist Columbia University, Aldo Pacchiano UC Berkeley, Vikas Sindhwani Google, Tamas Sarlos Google, Adrian Weller University of Cambridge, Alan Turing Institute
This paper presents a new class of stochastic, geometrically driven optimization algorithms for optimization problems defined on the orthogonal group. The objective is to maximize a function F(X), where F takes arguments from a matrix manifold. The paper solves this problem by computing and discretizing on-manifold gradient flows, and shows an intriguing connection between efficient stochastic optimization on the orthogonal group and graph theory.
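As a rough illustration of the general idea (this is plain Riemannian gradient ascent with a matrix-exponential retraction, not the paper's algorithm), one can discretize an on-manifold gradient flow while keeping the iterate exactly orthogonal:

```python
# Maximize F(X) = trace(M^T X) over orthogonal X by discretizing the
# on-manifold gradient flow: map the Euclidean gradient to a skew-symmetric
# generator and step with a matrix exponential, which preserves orthogonality.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
n = 8
M = rng.normal(size=(n, n))           # arbitrary target matrix
X = np.eye(n)                         # start on the manifold
eta = 0.1                             # step size

for _ in range(500):
    G = M                             # Euclidean gradient of trace(M^T X) w.r.t. X
    A = G @ X.T - X @ G.T             # skew-symmetric tangent direction
    X = expm(eta * A) @ X             # discretized on-manifold gradient flow

# Sanity check: X stays orthogonal and F(X) should approach the known maximum,
# attained at the orthogonal factor of the polar decomposition of M.
U, _, Vt = np.linalg.svd(M)
print(np.allclose(X.T @ X, np.eye(n), atol=1e-8),
      np.trace(M.T @ X), np.trace(M.T @ (U @ Vt)))
```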
The paper compares the FLOPs needed by classification models on the MNIST and CIFAR-10 datasets for the proposed algorithm and for state-of-the-art algorithms. The proposed algorithm achieves substantial gains in time complexity compared to other methods without much loss in accuracy. The paper also shows the effectiveness of the algorithm on a variety of continuous reinforcement learning tasks, such as Humanoid from the OpenAI Gym.
Students from diverse, nontraditional backgrounds can now take a one-year transition before entering the MS program in computer science.
The CS@CU MS Bridge Program in Computer Science launched this summer with students from a variety of backgrounds ranging from the sciences to architecture, economics, finance, and management. The 22 students have one thing in common – a desire to learn computer science (CS) to jump-start a career in technology.
This program offers prospective applicants from non-computer science backgrounds, including those without programming experience, the opportunity to acquire the knowledge and skills necessary to prepare them for the CS master’s program. A year of rigorous “bridge” coursework is composed of introductory CS courses. This one-year transition prepares students for a seamless entrance into the MS program after the bridge year. The program is highly competitive with only 22 students accepted out of 88 applicants.
The virtual orientation of the first cohort.
The main inspiration for the program is Northeastern’s Align MS in Computer Science program. Columbia is one of a small number of schools that received seed funding from Facebook to help bring bridge programs for students from diverse backgrounds to other universities.
“Unlike other similar programs, MS Bridge students start their program by taking the same core CS courses as our undergraduate students,” said Tony Dear, a Lecturer in Discipline and the CS@CU MS Bridge Program Director. The cohort has access to the same instruction, resources, and interactions with other students who are starting CS for the first time. In addition, MS Bridge students are afforded as much flexibility as they need, taking anywhere from one to three or four courses per semester until they are ready to move on to the MS portion of the program.
Another goal of the program is to advance diversity in the master’s program and this caught the attention of CS alum Janet Lustgarten (MS ‘85). “As a woman and an immigrant, I know how important it is to open up opportunities for different types of people,” said Lustgarten. After emigrating from Colombia at a young age, she found the universal language of mathematics a gateway to excel in school. She went on to become the first graduate with a degree in Computer Science and Mathematical Logic from Mount Holyoke and then received her MS degree from Columbia Engineering. She has led an exceptional career as co-founder & CEO of Kx Systems and President of Shakti Software, making numerous contributions to science, engineering, business, and beyond.
The Lustgarten-Whitney Family Fellowship was established specifically for the CS@CU MS Bridge Program. The generous gift provides fellowship support to three highly qualified students from underrepresented and nontraditional backgrounds pursuing a career in computer science. Said Lustgarten, “I hope that this gift from my family will play a vital role in launching the program, ensuring that the most qualified students are able to attend regardless of financial need, and inspiring a new generation of leaders in technology.”
Students convened virtually in May and most are now in the thick of summer classes. About half of the group is interested in pursuing the software systems and machine learning tracks for their master’s degree. Some have expressed interest in computational biology and natural language processing. A few already have plans to pursue a PhD afterward and they are interested in doing research and/or a thesis during the MS. Because there is flexibility when it comes to courses, this first cohort can explore and figure out the right path for their careers in technology.
Papers from the Speech and Natural Language Processing groups were accepted to the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020). The research includes systems that improve fact-checking, generate sarcasm, and detect deception, as well as a neural network that extracts knowledge about products on Amazon.
A team led by Shih-Fu Chang, the Richard Dicker Professor and Senior Executive Vice Dean of the School of Engineering, won a Best Demonstration Paper award for GAIA: A Fine-grained Multimedia Knowledge Extraction System, the first comprehensive, open-source multimedia knowledge extraction system.
Kathleen McKeown, the Henry and Gertrude Rothschild Professor of Computer Science, also gave a keynote presentation – “Rewriting the Past: Assessing the Field through the Lens of Language Generation”.
The paper proposes an unsupervised approach for sarcasm generation based on a non-sarcastic input sentence. This method employs a retrieve-and-edit framework to instantiate two major characteristics of sarcasm: reversal of valence and semantic incongruity with the context, which could include shared commonsense or world knowledge between the speaker and the listener.
While prior works on sarcasm generation predominantly focus on context incongruity, this research shows that combining valence reversal and semantic incongruity based on commonsense knowledge generates sarcastic messages of higher quality based on several criteria. In human evaluation, the system’s sarcasm was judged better than human-generated sarcasm 34% of the time, and better than a reinforced hybrid baseline 90% of the time.
Humans are very poor lie detectors. Study after study has found that humans perform only slightly above chance at detecting lies. This paper aims to understand why people are so poor at detecting lies, and what factors affect human perception of deception. To conduct this research, the researchers created a web-based lie detection game called LieCatcher. Players were presented with recordings of people telling lies and truths and had to guess which were truthful and which were deceptive. The game is entertaining for players and simultaneously provides labels of perceived deception.
LieCatcher
Judgments from more than 400 players were collected. This perception data was used to identify the linguistic and prosodic (intonation, tone, stress, and rhythm) characteristics of speech that lead listeners to trust or mistrust what they hear.
The researchers studied how the characteristics of trusted speech align or misalign with indicators of deception, and they trained machine learning classifiers to automatically identify speech that is likely to be trusted or mistrusted. They found several intuitive indicators of mistrust, such as disfluencies (e.g., “um”, “uh”) and speaking slowly. An indicator of trust was a faster speaking rate: people tended to trust speech that was faster. They also observed a mismatch between the features of trusted speech and those of truthful speech, and found the strategies players used to be ineffective.
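The sketch below shows the general shape of such a classifier on invented placeholder features; it does not use the LieCatcher feature set or data.

```python
# Hypothetical trust/mistrust classifier on toy linguistic and prosodic features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000

# toy features: [speaking_rate, num_disfluencies, mean_pitch, loudness_var]
X = np.column_stack([
    rng.normal(4.5, 1.0, n),     # syllables per second
    rng.poisson(2.0, n),         # "um"/"uh" counts
    rng.normal(180, 30, n),      # mean pitch (Hz)
    rng.normal(5, 2, n),         # loudness variability
])
# toy label: listeners tend to trust faster, more fluent speech
logit = 0.8 * (X[:, 0] - 4.5) - 0.5 * (X[:, 1] - 2.0) + rng.normal(0, 1, n)
y = (logit > 0).astype(int)      # 1 = trusted, 0 = mistrusted

clf = LogisticRegression(max_iter=1000)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
print("feature weights:", clf.fit(X, y).coef_.round(2))
```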
What surprised the researchers is that prosodic indicators of deception, such as pitch and loudness changes, were the most difficult for humans to perceive and interpret correctly. In fact, many prosodic cues to deception were perceived by human raters as cues to the truth. For example, utterances with a higher pitch were perceived as more truthful, when in fact this is a cue to deception.
This work is useful for a number of applications, such as generating trustworthy speech for virtual assistants or dialogue systems.
The paper presents TXtract, a neural network that extracts knowledge about products across Amazon’s taxonomy of thousands of product categories, ranging from groceries to health products and electronics.
For example, given an ice cream product at Amazon.com with text description “Ben & Jerry’s Strawberry Cheesecake Ice Cream 16 oz”, TXtract extracts text phrases as product attributes, such as the flavor (“Strawberry Cheesecake”) and the size (“16 oz”).
Values extracted through TXtract lead to a richer catalog with millions of products and their corresponding attributes, which is promising to improve the experience of customers searching for products at Amazon.com or asking product-specific questions through Amazon Alexa.
DeSePtion: Dual Sequence Prediction and Adversarial Examples for Improved Fact-Checking Christopher Hidey Columbia University, Tuhin Chakrabarty Columbia University, Tariq Alhindi Columbia University, Siddharth Varia Columbia University, Kriste Krstovski Columbia University, Mona Diab Facebook AI & George Washington University, and Smaranda Muresan Columbia University
The increased focus on misinformation has spurred the development of data and systems for detecting the veracity of a claim as well as retrieving authoritative evidence. The Fact Extraction and VERification (FEVER) dataset provides such a resource for evaluating end-to-end fact-checking, requiring retrieval of evidence from Wikipedia to validate a veracity prediction.
This paper shows that current systems for FEVER are vulnerable to three categories of realistic challenges for fact-checking – multiple propositions, temporal reasoning, and ambiguity and lexical variation – and introduces a resource with these types of claims.
Fact-checking where the fact needs to be verified with evidence from two different sources (Michelle Obama and Barack Obama).
The researchers present a system designed to be resilient to these “attacks” using multiple pointer networks for document selection and jointly modeling a sequence of evidence sentences and veracity relation predictions. They found that in handling these attacks they obtained state-of-the-art results on FEVER, largely due to improved evidence retrieval.
In a story for the Simons Institute, Daniel Hsu writes about deep learning and how it has attracted new attention to several basic questions in statistical machine learning.
This paper considers the following setting: A data analysis algorithm is given access to noisy labeled data points (x, w·x + noise), where x is a high-dimensional vector, w·x is some linear function of x, and the true labels w·x are corrupted by noise. The goal is to determine whether the coefficient vector w = (w1, …, wn) is sparse (i.e., has almost all of its entries equal to zero) or is far from sparse (has many nonzero entries).
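To fix the setting, here is a small sketch that generates labeled data under a sparse w and under a dense w; it illustrates the decision problem only and is not the paper's test.

```python
# Generate (x, w·x + noise) samples under a "sparse" world and a "dense" world.
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 1000, 50, 3            # dimension, number of samples, sparsity

w_sparse = np.zeros(n)
w_sparse[rng.choice(n, k, replace=False)] = rng.normal(size=k)
w_dense = rng.normal(size=n) / np.sqrt(n)      # comparable norm, many nonzeros

def draw(w, m):
    X = rng.normal(size=(m, n))                # data points x
    y = X @ w + rng.normal(scale=0.5, size=m)  # noisy labels w·x + noise
    return X, y

X1, y1 = draw(w_sparse, m)       # "sparse" world
X2, y2 = draw(w_dense, m)        # "far from sparse" world
# The question: with how few samples m (relative to n) can an algorithm decide
# which world it is in, without reconstructing w itself?
```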
A long line of work in a field known as “compressed sensing” has addressed a related version of this problem in which the goal is to (approximately) discover the vector w assuming that it is sparse. It is known that algorithms for this “reconstruction” problem require a number of data points which grows with n. This work considers a potentially easier “decision” question, of just deciding whether or not the unknown vector w actually is sparse as described above…but is this problem actually easier, or not?
The main result of this paper gives a sharp characterization of the conditions under which the decision problem actually is easier than the reconstruction problem. It shows that for many distributions of data points, it is possible to solve the decision problem with no dependence on the dimension n, and explains why in the other cases a dependence on n is necessary.
“It was a bit of a surprise that we were able to come up with a sharp characterization of the problem variants for which there needs to be a dependence on the distribution,” said Rocco Servedio, a computer science professor who worked on the paper. “This characterization relied on some results from probability theory that were proved 80 years ago – we were lucky that the right tools for the job had already been prepared for us in the math literature.”
This paper is aimed at understanding the extent to which randomness can help computers solve problems more efficiently. A long line of research in this area aims at coming up with efficient deterministic (non-random) algorithms for various problems that are known to have efficient randomized algorithms; this research program is sometimes known as “derandomization,” as its goal is to show that randomness is not really necessary as a fundamental computational resource.
The most commonly pursued avenue of work along these lines is to design certain particular kinds of deterministic algorithms known as “pseudorandom generators”. Roughly speaking, a pseudorandom generator for a certain type of Boolean (0/1-valued) function gives an efficient deterministic way of estimating the fraction of inputs that make the function output 1 as opposed to 0. A natural way to estimate this fraction is just to do random sampling, but the key challenge is to do it without using randomness — i.e. to “derandomize” the natural random-sampling-based approach.
This particular paper develops new pseudorandom generators for a class of geometrically defined functions called “polynomial threshold functions”. A polynomial threshold function is defined by a multivariable polynomial, like p(x,y,z) = 3x^2 – 6 x^2 y^2 z^3 + 4xyz; the most important parameter of such a polynomial is its degree. (The degree of the example polynomial p(x,y,z) given above is 7, because of the x^2 y^2 z^3 term.) The higher the degree of a polynomial, the more complicated it is. The main result in this paper is a pseudorandom generator for polynomial threshold functions which can handle polynomials of exponentially higher degree than could be achieved in previous work.
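For concreteness, the snippet below evaluates a degree-7 polynomial threshold function (built from the example polynomial above) on random inputs from {-1, +1}^n and forms the naive sampling estimate of the fraction of 1-outputs. A pseudorandom generator would replace the truly random samples with deterministically generated ones; this sketch only shows the quantity being estimated, not the paper's construction.

```python
# Naive randomized estimate of Pr[f(x) = 1] for a polynomial threshold function f.
import numpy as np

rng = np.random.default_rng(0)

def ptf(v):
    x, y, z = v[0], v[1], v[2]
    p = 3 * x**2 - 6 * x**2 * y**2 * z**3 + 4 * x * y * z   # degree-7 polynomial
    return 1 if p >= 0 else 0

n, samples = 20, 100_000
inputs = rng.choice([-1, 1], size=(samples, n))             # uniform Boolean inputs
estimate = np.mean([ptf(v) for v in inputs])
print(f"estimated fraction of inputs mapped to 1: {estimate:.3f}")
```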
The researchers designed a new parallel approximate undirected shortest path algorithm which has polylog(n) depth and near linear total work. This improved a long-standing upper bound obtained by Edith Cohen (STOC’94).
To further illustrate the efficiency of the algorithm, consider the case when it is run on a single processor: the running time is almost the same as that of Dijkstra’s algorithm, the best known sequential shortest path algorithm. If the algorithm is run on m processors, where m is the size of the graph, it takes only a polylog(n) number of rounds, which is also near optimal (many simple problems, such as computing the XOR of n bits, need at least log(n) rounds).
The paper develops a new concept called the low hop emulator to solve the parallel approximate shortest paths problem. It shows that for any graph G and any integer k >= 1, there exists another graph G’ of size n^(1+2/k) such that (1) any shortest path in G’ uses only log(k) edges, and (2) the distance between each pair of vertices in G’ is a poly(k) approximation of their original distance in G.
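To see why a small hop count helps, recall that Bellman-Ford-style relaxation finds all shortest paths that use at most h edges in h rounds, and each round parallelizes naturally over the edges; on a low hop emulator, roughly log(k) rounds already suffice. The toy sequential sketch below is only meant to illustrate that point and is not the paper's algorithm.

```python
# Hop-limited shortest distances: h rounds of edge relaxations find all
# shortest paths that use at most h edges.
import math

def h_hop_distances(n, edges, source, h):
    """Shortest distances from `source` using at most h hops (undirected edges)."""
    dist = [math.inf] * n
    dist[source] = 0.0
    for _ in range(h):                          # h rounds, each parallelizable
        new = dist[:]
        for u, v, w in edges:                   # relaxations are independent
            if dist[u] + w < new[v]:
                new[v] = dist[u] + w
            if dist[v] + w < new[u]:
                new[u] = dist[v] + w
        dist = new
    return dist

edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (0, 3, 10.0)]
print(h_hop_distances(4, edges, source=0, h=2))   # [0, 1, 2, 10]: 3-hop path not yet found
print(h_hop_distances(4, edges, source=0, h=3))   # [0, 1, 2, 3]: true shortest distances
```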
“This result is surprising and I believe this is an important discovery in graph theory,” said Peilin Zhong, a fourth-year PhD student. “We also show that G’ can be constructed efficiently.”
Smoothed complexity of local Max-Cut and binary Max-CSP Emmanouil-Vasileios Vlatakis-Gkaragkounis Columbia University, Xi Chen Columbia University, Chenghao Guo IIIS, Tsinghua University, Mihalis Yannakakis Columbia University, Xinzhi Zhang IIIS, Tsinghua University
This work provides insights into the inner-workings of some of the most widely used local search methods.
Local search is one of the most prominent algorithm design paradigms for combinatorial optimization problems. A local search algorithm begins with an initial solution and then follows a path by iteratively moving to a better neighboring solution until a local optimum is reached. A recurring phenomenon, however, is that local search algorithms are usually efficient in practice while worst-case analysis indicates the opposite: because of delicate pathological instances that one may never encounter in practice, the algorithm is classified as an exponential-time method.
To bridge theory and practice, the smoothed complexity framework was introduced as a hybrid model between pessimistic worst-case analysis and optimistic average-case analysis.
This work analyzes the Max-Cut problem, one of Karp’s famous 21 NP-complete (“hard”) problems. Given a weighted undirected graph, the Max-Cut problem seeks to partition the vertices of the graph into two sets such that the sum of the weights of the edges that join the two sets is maximized. Its applications come from diverse areas, including the layout of electronic circuitry, social networks, and statistical physics.
The researchers improved the smoothed complexity of the local search methods for the Max-Cut problem. The result is based on an analysis of long sequences of local changes (flips), which shows that it is very unlikely for every flip in a long sequence to incur a positive but small improvement in the cut weight. They also extend the same upper bound on the smoothed complexity of FLIP to all binary Maximum Constraint Satisfaction Problems.
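For reference, the FLIP local search analyzed here is simple to state: start from an arbitrary two-way partition and keep flipping single vertices as long as a flip strictly increases the cut weight. The sketch below is a toy sequential version on a small weighted graph.

```python
# FLIP local search for Max-Cut: flip one vertex at a time while the cut improves.
import random

def flip_local_search(n, edges, seed=0):
    """edges: list of (u, v, weight) for an undirected weighted graph."""
    random.seed(seed)
    side = [random.randint(0, 1) for _ in range(n)]        # initial partition
    incident = [[] for _ in range(n)]
    for u, v, w in edges:
        incident[u].append((v, w))
        incident[v].append((u, w))

    improved = True
    while improved:
        improved = False
        for v in range(n):
            # gain of moving v to the other side: uncut incident edges become
            # cut (+w), cut incident edges become uncut (-w)
            gain = sum(w if side[v] == side[u] else -w for u, w in incident[v])
            if gain > 0:
                side[v] = 1 - side[v]                      # one "flip"
                improved = True
    cut = sum(w for u, v, w in edges if side[u] != side[v])
    return side, cut

edges = [(0, 1, 2.0), (1, 2, 3.0), (2, 3, 1.0), (3, 0, 4.0), (0, 2, 5.0)]
print(flip_local_search(4, edges))     # a locally optimal cut for this toy graph
```

Smoothed analysis asks how many such flips are needed when the edge weights are slightly perturbed, which is exactly the quantity the paper bounds.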
Alum Miral Kotb (CC ’01) talks to Newsweek about combining coding and dance to create Iluminate, wearable mini-computers that sync music, dance moves, and light.
Eugene Wu writes about his paper, “Complaint-driven Training Data Debugging,” and examines how to debug complex workflows that combine traditional data analytics with machine learning predictions.
A team led by professor Christos Papadimitriou proposes a new computational system to expand the understanding of the brain at an intermediate level, between neurons and cognitive phenomena such as language.