Amey Pasarkar has been selected as a Finalist for his work on computational models that examine the microbiome and its role in health and disease.
Directional Gaussian Mixture Models of the Gut Microbiome Elucidate Microbial Spatial Structure
Amey P. Pasarkar, Tyler A. Joseph, Itsik Pe’er
The human gut microbiome contains trillions of bacteria that play a key role in preventing and fighting disease. However, the behaviors of these bacteria are still not well understood. One of these behaviors is the spatial arrangement of microbes in the gut. Numerous experimental studies have hinted that the various unique environments along the gut lead to certain bacterial spatial patterns. The research team created a consolidated computational model of the microbiome’s spatial arrangement. Their model can identify many of the previously discovered patterns and propose new ones as well. For example, the models found that the pH levels in the small intestine are likely promoting the growth of a specific family of bacteria.
The results demonstrate that the gut microbiome, while exceptionally large, has predictable and quantifiable spatial patterns that can be used to help us understand its role in health and disease.
For this year’s Outstanding Undergraduate Researcher Award, Payal Chandak, Sophia Kolak, and Yanda Chen were among students recognized by the Computing Research Association (CRA) for their work in an area of computing research.
Using Machine Learning to Identify Adverse Drug Effects Posing Increased Risk to Women
Payal Chandak Columbia University, Nicholas Tatonetti Columbia University
The researchers developed AwareDX – Analysing Women At Risk for Experiencing Drug toXicity – a machine learning algorithm that identifies and predicts differences in adverse drug effects between men and women by analyzing 50 years’ worth of reports in an FDA database. The algorithm automatically corrects for biases in these data that stem from an overrepresentation of male subjects in clinical research trials.
Though men and women can have different responses to medications – the sleep aid Ambien, for example, metabolizes more slowly in women, causing next-day grogginess – doctors may not know about these differences because most clinical trial data itself is biased toward men. This trickles down to impact prescribing guidelines, drug marketing, and ultimately, patients’ health. Unfortunately, pharmaceutical companies have a history of ignoring complex problems and clinical trials have singularly studied men, not even including women. As a result, there is a lot less information about how women respond to drugs compared to men. The research tries to bridge this information gap.
It Takes a Village to Build a Robot: An Empirical Study of The ROS Ecosystem
Sophia Kolak Columbia University, Afsoon Afzal Carnegie Mellon University, Claire Le Goues Carnegie Mellon University, Michael Hilton Carnegie Mellon University, Christopher Steven Timperley Carnegie Mellon University
The Robot Operating System (ROS) is the most popular framework for robotics development. In this paper, the researchers conducted the first major empirical study of ROS, with the goal of understanding how developers collaborate across the many technical disciplines that coalesce in robotics.
Building a complete robot is a difficult task that involves bridging many technical disciplines. ROS aims to simplify development by providing reusable libraries, tools, and conventions for building a robot. Still, as building a robot requires domain expertise in software, mechanical, and electrical engineering, as well as artificial intelligence and robotics, ROS faces knowledge-based barriers to collaboration. The researchers wanted to understand how the necessity of domain-specific knowledge impacts the open-source collaboration model in ROS.
Virtually no one is an expert in every subdomain of robotics: experts who create computer vision packages likely need to rely on software designed by mechanical engineers to implement motor control. As a result, the researchers found that development in ROS is centered around a few unique subgroups each devoted to a different specialty in robotics (i.e. perception, motion). This is unlike other ecosystems, where competing implementations are the norm.
Detecting Performance Patterns with Deep Learning
Sophia Kolak Columbia University
Performance has a major impact on the overall quality of a software project. Performance bugs—bugs that substantially decrease run-time—have long been studied in software engineering, and yet they remain incredibly difficult for developers to handle. In this project, the researchers leveraged contemporary methods in machine learning to create graph embeddings of Python code that can be used to automatically predict performance.
Using un-optimized programming language concepts can lead to performance bugs and the researchers hypothesized that statistical language embeddings could help reveal these patterns. By transforming code samples into graphs that captured the control and data flow of a program, the researchers studied how various unsupervised embeddings of these graphs could be used to predict performance.
Implementing “sort” by hand as opposed to using the built-in Python sort function is an example of a choice that typically slows down a program’s run-time. When the researchers embedded the AST and data flow of a code snippet in Euclidean space (using DeepWalk), patterns like this were captured in the embedding and allowed classifiers to learn which structures are correlated with various levels of performance.
“I was surprised by how often research changes directions,” said Sophia Kolak. In both projects, they started out with one set of questions but answered completely different ones by the end. “It showed me that, in addition to persistence, research requires open-mindedness.”
Cross-language Sentence Selection Via Data Augmentation and Rationale Training
Yanda Chen Columbia University, Chris Kedzie Columbia University, Suraj Nair University of Maryland, Petra Galuscakova University of Maryland, Rui Zhang Yale University, Douglas Oard University of Maryland, and Kathleen McKeown Columbia University
In this project, the researchers proposed a new approach to cross-language sentence selection, where they used models to predict sentence-level query relevance with English queries over sentences within document collections in low-resource languages such as Somali, Swahili, and Tagalog.
The system is used as part of cross-lingual information retrieval and query-focused summarization system. For example, if a user puts in a query word “business activity” and specifies Swahili as the language of source documents, then the system will automatically retrieve the Swahili documents that are related to “business activity” and produce short summaries that are then translated from Swahili to English.
A major challenge of the project was the lack of training data for low-resource languages. To tackle this problem, the researchers proposed to generate a relevance dataset of query-sentence pairs through data augmentation based on parallel corpora collected from the web. To mitigate the spurious correlations learned by the model, they proposed the idea of rationale training where they first trained a phrase-based statistical machine translation system and used the alignment information to provide additional supervision for the models.
The approach achieved state-of-the-art results on both text and speech across three languages – Somali, Swahili, and Tagalog.
For this year’s Outstanding Undergraduate Researcher Award, three computer science students received honorable mentions – Lalita Devadas, Dave Epstein, and Jessy Xinyi Han. The Computing Research Association (CRA) recognized the undergraduates for their work in an area of computing research.
Secure Montgomery Multiplication and Repeated Squares for Modular Exponentiation
Lalita Devadas Columbia University and Justin Bloom Oregon State University
The researchers worked on using some recent advances in
garbling of arithmetic circuits for secure exponentiation mod N,
a vital operation in many cryptosystems, including in the RSA public-key
A garbled circuit is a cryptographic protocol which allows for
secure two-party computation, in which two parties, Alice and Bob, each with a
private input, want to compute some shared function of their inputs
without either party learning the other’s input.
Their novel approach implemented the Montgomery multiplication method, which uses clever arithmetic to avoid costly division by the modulus being multiplied in. The best method they found had each wire in a circuit representing one digit of a number in base p. They developed a system of base p arithmetic which is asymptotically more efficient in the given garbled circuit architecture than any existing protocols.
They measured performance for both approaches by counting
the ciphertexts communicated for a single multiplication (a typical
measure of efficiency for garbled circuit operations). They found that the
base p Montgomery multiplication implementation
vastly outperformed all other implementations for values
of N with bit length greater than 500 (i.e.,
all N used for applications like RSA encryption).
“Unfortunately, our best implementations showed only incremental improvement over existing non-Montgomery-based implementations for values of N used in practice,” said Lalita Devadas. “We are still looking into further optimizations using Montgomery multiplication.”
Secure multiparty computation has many applications outside of computer science. For example, suppose five friends want to know their cumulative net worth without anyone learning anyone else’s individual net worth. This is actually a secure computation problem, since the friends want to perform some computation of their inputs while keeping said inputs private from other parties.
Oops! Predicting Unintentional Action in Video
Dave Epstein Columbia University, Boyuan Chen Columbia University, and Carl Vondrick Columbia University
The paper trains models to detect when human action is unintentional using self-supervised computer vision, an important step towards machines that can intelligently reason about the intentions behind complex human actions.
Despite enormous scientific progress over the last five to ten years, machines still struggle with tasks learned quickly and autonomously by young children, such as understanding human behavior or learning to speak a language. Epstein’s research tackles these types of problems by using self-supervised computer vision, a paradigm that predicts information naturally present in large amounts of input data such as images or videos. This stands in contrast with supervised learning, which relies on humans manually labelling data (e.g. “this is a picture of a dog”).
“I was surprised to learn that failure is an expected part of research and that it can take a long time to realize you’re failing,” said Dave Epstein. “Taking a failed idea, identifying the promising parts, and trying again leads to successful research.”
Seeding Network Influence in Biased Networks and the Benefits of Diversity
Ana-Andreea Stoica Columbia University, Jessy Xinyi Han Columbia University, and Augustin Chaintreau Columbia University
The paper explores the problem of social influence maximization
and how information is diffused in a social network.
For example, it might be about what kind of news people read on social media, how many people know about job opportunities or who hears about the latest loan options from a bank. So given a social network, classical algorithms are focused on picking the best k early-adopters based on how central they are in a network, say, based on their number of connections, to maximize outreach.
However, since social inequalities are reflected in
the uneven networks, classical algorithms which ignore demographics often
amplify such inequalities in information access.
“We were wondering if we can do better than an algorithm that ignores demographics,” said Jessy Xinyi Han. “‘Better’ here means more people in total and more people from the disadvantaged group can receive the information.”
Through a network model with unequal communities, they developed
new heuristics to take demographics into account, showing that
including sensitive features in the input of most natural seed selection
algorithms substantially improves diversity but also often leaves efficiency
untouched or even provides a small gain.
Such analytical condition turned out to be a closed-form
condition on the number of early adopters. They also validated this result on
the real CS co-authorship network from DBLP.