16 Papers Accepted To NeurIPS 2023

Researchers from the department presented machine learning and artificial intelligence research at the thirty-sixth Conference on Neural Information Processing Systems (NeurIPS 2023).

Outstanding Dataset Paper

ClimSim: An Open Large-Scale Dataset For Training High-Resolution Physics Emulators In Hybrid Multi-Scale Climate Models
Sungduk Yu, Walter Hannah, Liran Peng, Jerry Lin, Mohamed Aziz Bhouri, Ritwik Gupta, Björn Lütjens, Justus C. Will, Gunnar Behrens, Nora Loose, Charles Stern, Tom Beucler, Bryce Harrop, Benjamin Hillman, Andrea Jenney, Savannah L. Ferretti, Nana Liu, Animashree Anandkumar, Noah Brenowitz, Veronika Eyring, Nicholas Geneva, Pierre Gentine, Stephan Mandt, Jaideep Pathak, Akshay Subramaniam, Carl Vondrick, Rose Yu, Laure Zanna, Ryan Abernathey, Fiaz Ahmed, David Bader, Pierre Baldi, Elizabeth Barnes, Christopher Bretherton, Julius Busecke, Peter Caldwell, Wayne Chuang, Yilun Han, YU HUANG, Fernando Iglesias-Suarez, Sanket Jantre, Karthik Kashinath, Marat Khairoutdinov, Thorsten Kurth, Nicholas Lutsko, Po-Lun Ma, Griffin Mooers, J. David Neelin, David Randall, Sara Shamekh, Mark Taylor, Nathan Urban, Janni Yuval, Guang Zhang, Tian Zheng, Mike Pritchard

Abstract:
Modern climate projections lack adequate spatial and temporal resolution due to computational constraints. A consequence is inaccurate and imprecise predictions of critical processes such as storms. Hybrid methods that combine physics with machine learning (ML) have introduced a new generation of higher fidelity climate simulators that can sidestep Moore’s Law by outsourcing compute-hungry, short, high-resolution simulations to ML emulators. However, this hybrid ML-physics simulation approach requires domain-specific treatment and has been inaccessible to ML experts because of lack of training data and relevant, easy-to-use workflows. We present ClimSim, the largest-ever dataset designed for hybrid ML-physics research. It comprises multi-scale climate simulations, developed by a consortium of climate scientists and ML researchers. It consists of 5.7 billion pairs of multivariate input and output vectors that isolate the influence of locally-nested, high-resolution, high-fidelity physics on a host climate simulator’s macro-scale physical state. The dataset is global in coverage, spans multiple years at high sampling frequency, and is designed such that resulting emulators are compatible with downstream coupling into operational climate simulators. We implement a range of deterministic and stochastic regression baselines to highlight the ML challenges and their scoring. The data (https://huggingface.co/datasets/LEAP/ClimSim_high-res) and code (https://leap-stc.github.io/ClimSim) are released openly to support the development of hybrid ML-physics and high-fidelity climate simulations for the benefit of science and society.

Objaverse-XL: A Colossal Universe of 3D Objects
Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl Vondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, Ali Farhadi

Abstract:
Natural language processing and 2D vision models have attained remarkable proficiency on many tasks primarily by escalating the scale of training data. However, 3D vision tasks have not seen the same progress, in part due to the challenges of acquiring high-quality 3D data. In this work, we present Objaverse-XL, a dataset of over 10 million 3D objects. Our dataset comprises deduplicated 3D objects from a diverse set of sources, including manually designed objects, photogrammetry scans of landmarks and everyday items, and professional scans of historic and antique artifacts. Representing the largest scale and diversity in the realm of 3D datasets, Objaverse-XL enables significant new possibilities for 3D vision. Our experiments demonstrate the improvements enabled with the scale provided by Objaverse-XL. We show that by training Zero123 on novel view synthesis, utilizing over 100 million multi-view rendered images, we achieve strong zero-shot generalization abilities. We hope that releasing Objaverse-XL will enable further innovations in the field of 3D vision at scale.

Causal discovery from observational and interventional data across multiple environments
Adam Li, Amin Jaber, Elias Bareinboim

Abstract:
A fundamental problem in many sciences is the learning of causal structure underlying a system, typically through observation and experimentation. Commonly, one even collects data across multiple domains, such as gene sequencing from different labs, or neural recordings from different species. Although there exist methods for learning the equivalence class of causal diagrams from observational and experimental data, they are meant to operate in a single domain. In this paper, we develop a fundamental approach to structure learning in non-Markovian systems (i.e. when there exist latent confounders) leveraging observational and interventional data collected from multiple domains. Specifically, we start by showing that learning from observational data in multiple domains is equivalent to learning from interventional data with unknown targets in a single domain. But there are also subtleties when considering observational and experimental data. Using causal invariances derived from do-calculus, we define a property called S-Markov that connects interventional distributions from multiple-domains to graphical criterion on a selection diagram. Leveraging the S-Markov property, we introduce a new constraint-based causal discovery algorithm, S-FCI, that can learn from observational and interventional data from different domains. We prove that the algorithm is sound and subsumes existing constraint-based causal discovery algorithms.

A Causal Framework for Decomposing Spurious Variations
Drago Plecko, Elias Bareinboim

Abstract:
One of the fundamental challenges found throughout the data sciences is to explain why things happen in specific ways, or through which mechanisms a certain variable X exerts influences over another variable Y. In statistics and machine learning, significant efforts have been put into developing machinery to estimate correlations across variables efficiently. In causal inference, a large body of literature is concerned with the decomposition of causal effects under the rubric of mediation analysis. However, many variations are spurious in nature, including different phenomena throughout the applied sciences. Despite the statistical power to estimate correlations and the identification power to decompose causal effects, there is still little understanding of the properties of spurious associations and how they can be decomposed in terms of the underlying causal mechanisms. In this manuscript, we develop formal tools for decomposing spurious variations in both Markovian and Semi-Markovian models. We prove the first results that allow a non-parametric decomposition of spurious effects and provide sufficient conditions for the identification of such decompositions. The described approach has several applications, ranging from explainable and fair AI to questions in epidemiology and medicine, and we empirically demonstrate its use on a real-world dataset.

Nonparametric Identifiability of Causal Representations from Unknown Interventions
Julius von Kügelgen, Michel Besserve, Liang Wendong, Luigi Gresele, Armin Kekić, Elias Bareinboim, David Blei, Bernhard Schölkopf

Abstract:
We study causal representation learning, the task of inferring latent causal variables and their causal relations from high-dimensional mixtures of the variables. Prior work relies on weak supervision, in the form of counterfactual pre- and post-intervention views or temporal structure; places restrictive assumptions, such as linearity, on the mixing function or latent causal model; or requires partial knowledge of the generative process, such as the causal graph or intervention targets. We instead consider the general setting in which both the causal model and the mixing function are nonparametric. The learning signal takes the form of multiple datasets, or environments, arising from unknown interventions in the underlying causal model. Our goal is to identify both the ground truth latents and their causal graph up to a set of ambiguities which we show to be irresolvable from interventional data. We study the fundamental setting of two causal variables and prove that the observational distribution and one perfect intervention per node suffice for identifiability, subject to a genericity condition. This condition rules out spurious solutions that involve fine-tuning of the intervened and observational distributions, mirroring similar conditions for nonlinear cause-effect inference. For an arbitrary number of variables, we show that at least one pair of distinct perfect interventional domains per node guarantees identifiability. Further, we demonstrate that the strengths of causal influences among the latent variables are preserved by all equivalent solutions, rendering the inferred representation appropriate for drawing causal conclusions from new data. Our study provides the first identifiability results for the general nonparametric setting with unknown interventions, and elucidates what is possible and impossible for causal representation learning without more direct supervision.

Estimating Causal Effects Identifiable from Combination of Observations and Experiments
Yonghan Jung, Ivan Diaz, Jin Tian, Elias Bareinboim

Abstract:
Learning cause and effect relations is arguably one of the central challenges found throughout the data sciences. Formally, determining whether a collection of observational and interventional distributions can be combined to learn a target causal relation is known as the problem of generalized identification (or g-identification) [Lee et al., 2019]. Although g-identification has been well understood and solved in theory, it turns out to be challenging to apply these results in practice, in particular when considering the estimation of the target distribution from finite samples. In this paper, we develop a new, general estimator that exhibits multiply robustness properties for g-identifiable causal functionals. Specifically, we show that any g-identifiable causal effect can be expressed as a function of generalized multioutcome sequential back-door adjustments that are amenable to estimation. We then construct a corresponding estimator for the g-identification expression that exhibits robustness properties to bias. We analyze the asymptotic convergence properties of the estimator. Finally, we illustrate the use of the proposed estimator in experimental studies. Simulation results corroborate the theory.

Causal Fairness for Outcome Control
Drago Plecko, Elias Bareinboim

Abstract:
As society transitions towards an AI-based decision-making infrastructure, an ever-increasing number of decisions once under control of humans are now delegated to automated systems. Even though such developments make various parts of society more efficient, a large body of evidence suggests that a great deal of care needs to be taken to make such automated decision-making systems fair and equitable, namely, taking into account sensitive attributes such as gender, race, and religion. In this paper, we study a specific decision-making task called outcome control in which an automated system aims to optimize an outcome variable Y while being fair and equitable. The interest in such a setting ranges from interventions related to criminal justice and welfare, all the way to clinical decision-making and public health. In this paper, we first analyze through causal lenses the notion of benefit, which captures how much a specific individual would benefit from a positive decision, counterfactually speaking, when contrasted with an alternative, negative one. We introduce the notion of benefit fairness, which can be seen as the minimal fairness requirement in decision-making, and develop an algorithm for satisfying it. We then note that the benefit itself may be influenced by the protected attribute, and propose causal tools which can be used to analyze this. Finally, if some of the variations of the protected attribute in the benefit are considered as discriminatory, the notion of benefit fairness may need to be strengthened, which leads us to articulating a notion of causal benefit fairness. Using this notion, we develop a new optimization procedure capable of maximizing Y while ascertaining causal fairness in the decision process.

Distribution-Free Statistical Dispersion Control for Societal Applications
Zhun Deng, Thomas Zollo, Jake Snell, Toniann Pitassi, Richard Zemel

Abstract:
Explicit finite-sample statistical guarantees on model performance are an important ingredient in responsible machine learning. Previous work has focused mainly on bounding either the expected loss of a predictor or the probability that an individual prediction will incur a loss value in a specified range. However, for many high-stakes applications, it is crucial to understand and control the dispersion of a loss distribution, or the extent to which different members of a population experience unequal effects of algorithmic decisions. We initiate the study of distribution-free control of statistical dispersion measures with societal implications and propose a simple yet flexible framework that allows us to handle a much richer class of statistical functionals beyond previous work. Our methods are verified through experiments in toxic comment detection, medical imaging, and film recommendation.

Representational Strengths and Limitations of Transformers
Clayton Sanford, Daniel Hsu, Matus Telgarsky

Abstract:
Attention layers, as commonly used in transformers, form the backbone of modern deep learning, yet there is no mathematical description of their benefits and deficiencies as compared with other architectures. In this work we establish both positive and negative results on the representation power of attention layers, with a focus on intrinsic complexity parameters such as width, depth, and embedding dimension. On the positive side, we present a sparse averaging task, where recurrent networks and feedforward networks all have complexity scaling polynomially in the input size, whereas transformers scale merely logarithmically in the input size; furthermore, we use the same construction to show the necessity and role of a large embedding dimension in a transformer. On the negative side, we present a triple detection task, where attention layers in turn have complexity scaling linearly in the input size; as this scenario seems rare in practice, we also present natural variants that can be efficiently solved by attention layers. The proof techniques emphasize the value of communication complexity in the analysis of transformers and related models, and the role of sparse averaging as a prototypical attention task, which even finds use in the analysis of triple detection.

Fast Attention Requires Bounded Entries
Josh Alman, Zhao Song

Abstract:
In modern machine learning, inner product attention computation is a fundamental task for training large language models such as Transformer, GPT-1, BERT, GPT-2, GPT-3 and ChatGPT. Formally, in this problem, one is given as input three matrices Q,K,V∈[−B,B]n×d, and the goal is to construct the matrix Att(Q,K,V):=diag(A1n)−1AV∈ℝn×d, where A=exp(QK⊤/d) is the `attention matrix’, and exp is applied entry-wise. Straightforward methods for this problem explicitly compute the n×n attention matrix A, and hence require time Ω(n2) even when d=no(1) is small.
In this paper, we investigate whether faster algorithms are possible by implicitly making use of the matrix A. We present two results, showing that there is a sharp transition at B=Θ(logn‾‾‾‾‾√).
∙ If d=O(logn) and B=o(logn‾‾‾‾‾√), there is an n1+o(1) time algorithm to approximate Att(Q,K,V) up to 1/poly(n) additive error.
∙ If d=O(logn) and B=Θ(logn‾‾‾‾‾√), assuming the Strong Exponential Time Hypothesis from fine-grained complexity theory, it is impossible to approximate Att(Q,K,V) up to 1/poly(n) additive error in truly subquadratic time n2−Ω(1).
This gives a theoretical explanation for the phenomenon observed in practice that attention computation is much more efficient when the input matrices have smaller entries.

Bypass Exponential Time Preprocessing: Fast Neural Network Training via Weight-Data Correlation Preprocessing
Josh Alman, Jiehao Liang, Zhao Song, Ruizhe Zhang, Danyang Zhuo

Abstract:
Over the last decade, deep neural networks have transformed our society, and they are already widely applied in various machine learning applications. State-of-art deep neural networks are becoming larger in size every year to deliver increasing model accuracy, and as a result, model training consumes substantial computing resources and will only consume more in the future. Using current training methods, in each iteration, to process a data point x∈ℝd in a layer, we need to spend Θ(md) time to evaluate all the m neurons in the layer. This means processing the entire layer takes Θ(nmd) time for n data points. Recent work [Song, Yang and Zhang, NeurIPS 2021] reduces this time per iteration to o(nmd), but requires exponential time to preprocess either the data or the neural network weights, making it unlikely to have practical usage.

In this work, we present a new preprocessing method that simply stores the weight-data correlation in a tree data structure in order to quickly, dynamically detect which neurons fire at each iteration. Our method requires only O(nmd) time in preprocessing and still achieves o(nmd) time per iteration. We complement our new algorithm with a lower bound, proving that assuming a popular conjecture from complexity theory, one could not substantially speed up our algorithm for dynamic detection of firing neurons.

Differentially Private Approximate Near Neighbor Counting in High Dimensions
Alexandr Andoni, Piotr Indyk, Sepideh Mahabadi, Shyam Narayanan

Abstract:
Range counting (e.g., counting the number of data points falling into a given query ball) under differential privacy has been studied extensively. However, the current algorithms for this problem are subject to the following dichotomy. One class of algorithms suffers from an additive error that is a fixed polynomial in the number of points. Another class of algorithms allows for polylogarithmic additive error, but the error grows exponentially in the dimension. To achieve the latter, the problem is relaxed to allow a “fuzzy” definition of the range boundary, e.g., a count of the points in a ball of radius r might also include points in a ball of radius cr for some c > 1.

In this paper, we present an efficient algorithm that offers a sweet spot between these two classes. The algorithm has an additive error that is an arbitrary small power of the data set size, depending on how fuzzy the range boundary is, as well as a small (1 + o(1)) multiplicative error. Crucially, the amount of noise added has no dependence on the dimension. Our algorithm introduces a variant of Locality-Sensitive Hashing, utilizing it in a novel manner.

Variational Inference with Gaussian Score Matching
Chirag Modi, Robert Gower, Charles Margossian, Yuling Yao, David Blei, Lawrence Saul

Abstract:
Variational inference (VI) is a method to approximate the computationally intractable posterior distributions that arise in Bayesian statistics. Typically, VI fits a simple parametric distribution to the target posterior by minimizing an appropriate objective such as the evidence lower bound (ELBO). In this work, we present a new approach to VI based on the principle of score matching, that if two distributions are equal then their score functions (i.e., gradients of the log density) are equal at every point on their support. With this, we develop score matching VI, an iterative algorithm that seeks to match the scores between the variational approximation and the exact posterior. At each iteration, score matching VI solves an inner optimization, one that minimally adjusts the current variational estimate to match the scores at a newly sampled value of the latent variables.

We show that when the variational family is a Gaussian, this inner optimization enjoys a closed form solution, which we call Gaussian score matching VI (GSM-VI). GSM-VI is also a “black box” variational algorithm in that it only requires a differentiable joint distribution, and as such it can be applied to a wide class of models. We compare GSM-VI to black box variational inference (BBVI), which has similar requirements but instead optimizes the ELBO. We study how GSM-VI behaves as a function of the problem dimensionality, the condition number of the target covariance matrix (when the target is Gaussian), and the degree of mismatch between the approximating and exact posterior distribution. We also study GSM-VI on a collection of real-world Bayesian inference problems from the posteriorDB database of datasets and models. In all of our studies we find that GSM-VI is faster than BBVI, but without sacrificing accuracy. It requires 10-100x fewer gradient evaluations to obtain a comparable quality of approximation.

Practical and Asymptotically Exact Conditional Sampling in Diffusion Models
Luhuan Wu, Brian Trippe, Christian Naesseth, David Blei, John Cunningham

Abstract:
Diffusion models have been successful on a range of conditional generation tasks including molecular design and text-to-image generation. However, these achievements have primarily depended on task-specific conditional training or error-prone heuristic approximations. Ideally, a conditional generation method should provide exact samples for a broad range of conditional distributions without requiring task-specific training. To this end, we introduce the Twisted Diffusion Sampler, or TDS. TDS is a sequential Monte Carlo (SMC) algorithm that targets the conditional distributions of diffusion models. The main idea is to use twisting, an SMC technique that enjoys good computational efficiency, to incorporate heuristic approximations without compromising asymptotic exactness. We first find in simulation and on MNIST image inpainting and class-conditional generation tasks that TDS provides a computational statistical trade-off, yielding more accurate approximations with many particles but with empirical improvements over heuristics with as few as two particles. We then turn to motif-scaffolding, a core task in protein design, using a TDS extension to Riemannian diffusion models. On benchmark test cases, TDS allows flexible conditioning criteria and often outperforms the state-of-the-art.

Causal-structure Driven Augmentations for Text OOD Generalization
Amir Feder, Yoav Wald, Claudia Shi, Suchi Saria, David Blei

Abstract:
The reliance of text classifiers on spurious correlations can lead to poor generalization at deployment, raising concerns about their use in safety-critical domains such as healthcare. In this work, we propose to use counterfactual data augmentation, guided by knowledge of the causal structure of the data, to simulate interventions on spurious features and to learn more robust text classifiers. We show that this strategy is appropriate in prediction problems where the label is spuriously correlated with an attribute. Under the assumptions of such problems, we discuss the favorable sample complexity of counterfactual data augmentation, compared to importance re-weighting. Pragmatically, we match examples using auxiliary data, based on diff-in-diff methodology, and use a large language model (LLM) to represent a conditional probability of text. Through extensive experimentation on learning caregiver-invariant predictors of clinical diagnoses from medical narratives and on semi-synthetic data, we demonstrate that our method for simulating interventions improves out-of-distribution (OOD) accuracy compared to baseline invariant learning algorithms.

Evaluating the Moral Beliefs Encoded in LLMs
Nino Scherrer, Claudia Shi, Amir Feder, David Blei

Abstract:
This paper presents a case study on the design, administration, post-processing, and evaluation of surveys on large language models (LLMs). It comprises two components: (1) A statistical method for eliciting beliefs encoded in LLMs. We introduce statistical measures and evaluation metrics that quantify the probability of an LLM “making a choice”, the associated uncertainty, and the consistency of that choice. (2) We apply this method to study what moral beliefs are encoded in different LLMs, especially in ambiguous cases where the right choice is not obvious. We design a large-scale survey comprising 680 high-ambiguity moral scenarios (e.g., “Should I tell a white lie?”) and 687 low-ambiguity moral scenarios (e.g., “Should I stop for a pedestrian on the road?”). Each scenario includes a description, two possible actions, and auxiliary labels indicating violated rules (e.g., “do not kill”). We administer the survey to 28 open- and closed-source LLMs. We find that (a) in unambiguous scenarios, most models “choose” actions that align with commonsense. In ambiguous cases, most models express uncertainty. (b) Some models are uncertain about choosing the commonsense action because their responses are sensitive to the question-wording. (c) Some models reflect clear preferences in ambiguous scenarios. Specifically, closed-source models tend to agree with each other.

The Pro Gamer Who Has To Rely Upon Sound Alone

Brian Smith talks to the BBC about making video games that are accessible to blind gamers.

The Race to Develop Quantum Computers

Team Led by Elias Bareinboim Wins $5M NSF Grant to Transform AI Decision-making

The multi-institutional team will use causal modeling techniques to build AI systems that better communicate with people and react to unforeseen circumstances.

Voices of CS: Wei Hao

Last August, Wei Hao stepped onto the Google Campus in Sunnyvale, California, as part of the inaugural MLCommons Rising Stars cohort.

Thirty-five recipients, out of over 100 applicants, were invited to this two-day in-person workshop. The cohort had the chance to listen to talks by researchers from Google, Intel, and Meta, and professors from Havard, UC Berkeley, and Cornell about trendy research topics, such as ML for ML systems, software-hardware codesign, and responsible machine learning. They also had the chance to do a poster presentation of their work, where they got useful feedback. The aim of the workshop was to develop community, foster research and career growth, enable collaborations, and discuss career opportunities among the rising generation of researchers at the intersection of machine learning and systems.

The first cohort of the MLCommons Rising Stars

“It was a great experience,” said Wei, a third-year PhD student who works with Junfeng Yang and Asaf Cidon. “I always feel the fastest way of developing research ideas is to talk to people and brainstorm, and the workshop was one of the perfect occasions for that.”

His main objective was to make connections, and by the end of the workshop, he came out of it with a potential research collaboration. Along with Amber Liu, a University of Michigan PhD student, they came up with the idea of using a combination of machine learning (ML) models of various sizes to accelerate the inference process of causal language modeling.

We caught up with Wei to talk about his experience at the machine learning workshop and how his PhD life has been.

Q: How did you become part of the workshop?
I applied to the workshop months ago with my resume and a research plan. During the application process, I was not asked to talk about a specific project but an overview of the research I was doing. Looking back, I think this contributed to the diversity of the selected cohort, as people’s work covered the whole stack of ML systems from chip design to application-level ML.

The project I presented at the workshop was titled Nazar: Monitoring and Adapting ML Models on Mobile Devices. The setup is that machine learning models are more and more commonly being pushed to mobile devices due to the convenience of low latency. However, they are often undermined by unpredictable distribution shifts after deployment, such as moderate to severe weather conditions and demographic changes.

We are the first to provide a systematic solution to mitigate the performance degradation of post-deployment models by building a three-stage system that continuously monitors, analyzes, and adapts to distribution shifts without needing user feedback.

Wei Hao at the poster presentation — Wei Hao (in the middle back) at the poster presentation

Q: Can you talk about your background and why you decided to pursue a PhD?
I engaged in doing research when I was an undergraduate student at the University of Wisconsin-Madison. At the very beginning, getting paid and sharpening my resume were two of my main objectives. However, during the process, I developed an interest in solving open problems that are intellectually challenging.

Moreover, I enjoy defining new problems, which requires a lot of logical thinking but is very rewarding. These two characteristics made me think I am a good candidate for the PhD position. I also really enjoyed the professors I worked with and was encouraged to pursue a PhD. After talking to my current advisors, Junfeng Yang and Asaf Cidon, I was impressed by their enthusiasm and finally made up my mind.

Q: What are your research interests?
My research interest is building efficient and secure systems for machine learning workloads. The reason for pursuing this type of research is my belief in realizing artificial general intelligence (AGI), which requires reliable system support. I decided to focus on it since I found satisfaction in interacting with ML workload while building practical system components while in undergrad.

Q: What sort of research questions or issues do you hope to answer?
Besides the technical questions on how to make ML deployment ubiquitous, I also hope to answer some philosophical questions: What do people expect from using artificial intelligence (AI)? Are there capacity and efficiency boundaries of AI? Which boundaries should I focus on pushing forward in the future?

Q: What are you working on now?
I am building an ML model versioning and management system called MGit.

Models derived from other models are extremely common in machine learning today. For example, transfer learning is used to create task-specific models from “pre-trained” models through finetuning. This has led to an ecosystem where models are related to each other, sharing structure and often even parameter values.

However, it is hard to manage these model derivatives: the storage overhead of storing all derived models quickly becomes onerous, prompting users to get rid of intermediate models that might be useful for further analysis. Additionally, undesired behaviors in models are hard to track down (e.g., is a bug inherited from an upstream model?).

In the current project I am working on, we propose a model versioning and management system called MGit that makes it easier to store, test, update, and collaborate on model derivatives. MGit introduces a lineage graph that records provenance and versioning information between models, optimizations to efficiently store model parameters, as well as abstractions over this lineage graph that facilitate relevant testing, updating, and collaboration functionality. MGit is able to reduce the lineage graph’s storage footprint by up to 7× and automatically update downstream models in response to updates to upstream models.”

Q: How do you decide what to work on, and what is it like doing research?
I have written four research papers during my PhD so far: Clockworks, DIVA, Nazar, and MGit. All of them are in the field of ML systems and relate to improving the efficiency and robustness of ML applications.

To decide the topics, I always start by brainstorming with my mentors and advisors to derive possible choices. Then, I read related works and define the concrete problem to tackle. The problem definition that I derive at the beginning is usually not exactly the final version before a lot of trial and error.

For example, when we started work on DIVA, we were originally attempting to tame non-determinisms during the model training process. However, I detoured when I read about quantization and found it super interesting. The research morphed into an adversarial attack that tries to enlarge the deviations between ML models and their adapted version on edge devices

Overall, I found the most time-consuming and difficult part of doing research is to define the concrete problem that is logically valid and attractive to me. It can take me up to half a year, while the solutions and corresponding implementations are relatively easy to come up with.

Left to right: Amber Liu (University of Michigan), Han Guo (Carnegie Mellon Univeristy), Hanrui Wang(MIT), Wei Hao (Columbia University), Di Wu (University of Wisconsin-Madison)

Q: How did your previous experiences prepare you for a PhD?
I started to do research when I was a freshman in college, and I felt well-prepared before my PhD. Since the structure of research projects is more or less the same – brainstorming, defining problems, finding and evaluating solutions, and polishing papers – I get more and more familiar after each project, which makes me confident and not stressed about temporary slow-downs.

Q: Why did you apply to Columbia, and how was that process?
Aside from the prestigious reputation of Columbia and the research interests match, I really appreciate the proactiveness of my advisors during the recruitment process. I still remember that Asaf reached out to me before the application deadline, which made me feel very welcome. Because of him and my previous advisor at Madison, my stress was hugely alleviated during the application process. Thus, I encourage reaching out to whom you are really interested in working with early on, to both students and faculty.

Q: What has been the highlight of your time at Columbia?
The highlight of my time at Columbia so far is when I get the chance to share my research with a wide audience, such as at the CAIT symposium, DSI poster session, or during this interview. I also expect my research to have some real impact, and I believe that day is coming soon.

Q: Was there anything difficult that you had to face while taking your PhD?
So far, there have been three. I think one of the hardest things is to fight the feeling of low self-worth when a paper is rejected by a conference. Then, when a field I am working on attracts too many people, it becomes competitive, and I sometimes feel stressed about this kind of speed race of everyone trying to be the first to come up with something. And some loneliness when seeing friends my age bid farewell to their student life and start a career.

But since I have chosen this road of taking my PhD, I have to bear with these and find other ways to release stress. For example, I recently started indoor cycling at the gym as it is an effective way to burn both calories and overthinking.

Q: Looking back, what would you have done differently?
I would have thought less and got my hands dirty early. Sometimes, I spend too much time reading papers before doing experiments. No one was born prepared, and the earlier one fails, the sooner one can find a way out.

Q: Do you think your skills have been enhanced by your time at Columbia? In which ways?
I think I am more and more confident in delivering my thoughts in a structural way due to the training process of defining concrete problems and writing papers. I also feel that I have gained expertise in my field through the different projects I have taken on.

Q: What is your advice to students on how to navigate their time at Columbia? If they want to do research, what should they know or do to prepare?
My advice to students is to engage in what they feel passionate about as early as possible and not be afraid of failure. For those who are interested in doing research, talk to professors and PhD students proactively about your interests and how you think we can help. Do not be afraid of being an amateur and assume we know everything as the world is moving so fast, especially with the new wave of AI. I think most of us, or at least myself, value vision and passion more than the ability to solve problems, which can definitely be fostered during the PhD journey.

Q: Is there anything else that you think people should know?
My personal goal is to create start-ups that are impactful to society. If you have similar goals or related sources at Columbia that you would like to share, please reach out. Thanks!

Universities Can’t Accommodate All the Computer Science Majors

The AI Boom’s Chip Shortage Has An Unlikely Hero: The Blockchain

MS Bridge student Ishan Dhanani was interviewed by Semafor on how his startup Agora Labs helps researchers book time to rent GPUs for their projects.

19 PhD Students Awarded Prestigious Fellowships

Graduate students from the department have been selected to receive scholarships. The diverse group is a mix of those new to Columbia and students who have received fellowships for the year.

IBM PhD Fellowship Award

IBM has recognized and rewarded outstanding PhD students around the world through its highly competitive IBM PhD Fellowship Award program. The award recipients demonstrated academic excellence as well as provided innovative and exceptional research proposals.

Yangruibo Ding
Yangruibo Ding is a fourth-year PhD student working with Baishakhi Ray and Gail Kaiser. His research focuses on source code modeling, specifically learning the semantic perspective of software programs to automate software engineering tasks, such as automatic code generation and program analysis. His research has been awarded the IBM PhD Fellowship and the ACM SIGSOFT Distinguished Paper Award.

Ding received an MS in Computer Science from Columbia University in 2019 and a BE in Software Engineering from the University of Electronic Science and Technology of China in 2018. In his free time, he enjoys various sports, regularly playing basketball and table tennis, but he is always looking for new sports to try.

Google Fellowship

The Google PhD Fellowship Program was created to recognize outstanding graduate students doing exceptional and innovative research in areas relevant to computer science and related fields.

Zachary Huang
Zachary Huang is a fifth-year PhD student working on database management systems, advised by Eugene Wu. His previous projects involved building interactive dashboards, machine learning systems, and data search tools on top of join graphs. Currently, he is also exploring solutions to data problems with large language models and accelerating query processing with GPUs.

Zachary Huang graduated with a BS degree in Computer Science from the University of Wisconsin-Madison in 2019. Besides the Google Ph.D. Fellowship, he also received the Columbia Data Science Institute’s Avanessian PhD Fellowship. In his leisure time, he develops video games.

DoD NDSEG Fellow

The Department of Defense National Defense Science and Engineering Graduate Fellowship is awarded annually to U.S. citizens pursuing doctoral degrees in science and engineering disciplines.

Jeremy Klotz
Jeremy Klotz is a second-year PhD student who works with Shree Nayar on computational imaging. His research combines the design of cameras and software to solve computer vision tasks.

Klotz graduated with a BS and MS in electrical and computer engineering from Carnegie Mellon University in 2022.

Rafael Sofaer
Raphael Sofaer is a third-year PhD student in the Software Systems Lab. The focus of his research is software system reliability, dependency management, and reducing the cost of building dependable software. He is co-advised by Junfeng Yang, Jason Nieh, and Ronghui Gu.

Sofaer graduated from New York University with a B.A. in Math and Computer Science in 2015. He bakes bread every week and loves to try new recipes.

NSF Graduate Research Fellowships Program

The GRFP is a three-year fellowship that recognizes and supports outstanding graduate students in NSF-supported STEM disciplines who are pursuing research-based master’s and doctoral degrees.

Jacob Blindenbach
Jacob Blindenbach is a first-year PhD student interested in applied cryptography and designing practical and deployable secure solutions. He will be working with Gamze Gürsoy to design new privacy-preserving protocols for biomedical data, focusing on genomic data.

In May 2022, Blindenbach received a BS with Highest Distinction in Math and Computer Science from the University of Virginia. He is an avid swimmer who placed 19th at Dutch Nationals in the 100m butterfly and enjoys playing ragtime piano.

Charlie Carver
Charlie Carver is a sixth-year PhD student working with Zia Zhou on laser-based light communication and sensing in mobile systems and networking.

Carver received an MS in Computer Science from Dartmouth College in 2022 and a BS in Physics from Fordham University in 2018. Charlie won a Best Paper Award at NSDI’20, Best Demo at HotMobile’20, and the Grand Prize at the 2022 Dartmouth Innovation and Technology Festival. While at Fordham, he received the Victor F. Hess Award for the best record of achievement and service in Physics. He loves skiing, sailing, playing guitar, and caring for his two awesome cats.

Gabriel Chuang
Gabriel Chuang is a first-year PhD student co-advised by Augustin Chaintreau and Cliff Stein. He is generally interested in fairness-oriented algorithm design, especially in the context of social networks and in fairness in redistricting, i.e., identifying and preventing gerrymandering.

Chuang graduated from Carnegie Mellon University with a BS in Computer Science in 2022. In his free time, he likes to draw and play board games.

Samir Gadre
Samir Gadre is interested in large-scale dataset construction and model training with an emphasis on understanding how model performance improves predictably with better datasets and bigger models. Nowadays, he investigates these interests in the context of multimodal models and language models. He is a fourth-year PhD student advised by Shuran Song.

Gadre graduated from Brown University with a ScB Computer Science in 2018. Before joining Columbia, he worked as a Software Engineer at Microsoft HoloLens.

Toma Itagaki
Toma Itagaki is a first-year PhD student interested in human-computer interaction and mobile computing. He will work with Zia Xhou to develop mobile computing systems and wearable tech that will enable personalized health, wellness, and productivity.

Itagaki graduated in 2023 from the University of Washington with a BS in Neuroscience.

Tal Zussman
Tal Zussman is a first-year PhD student working on operating systems and storage systems for cloud computing. He is advised by Asaf Cidon.

Zussman graduated from Columbia University in May 2023 with a BS in Computer Science with Minors in Applied Mathematics and Political Science. He was a C.P. Davis Scholar and received the Department of Computer Science’s Andrew P. Kosoresow Memorial Award for Excellence in Teaching and Service, the Data Science Institute’s Outstanding Course Assistant Award, and the Columbia University Leadership and Excellence Award for Principled Action.

NSF CISE Graduate Fellowship (CSGrad4US)

The CSGrad4US program aims to increase the number and diversity of domestic graduate students pursuing research and innovation careers in computer and information science and engineering fields. The program helps bachelor’s degree holders return to academia and pursue their research interests, enabling them to engage in innovative and high-impact projects without the burden of financial constraints.

Daniel Meyer
Daniel Mayer is a first-year PhD student advised by David Knowles. His research interests are machine learning and gene regulation, with a focus on understanding polygenic disease.

After receiving a BS in Computer Science from Tufts University in 2018, Meyer worked as a Computational Associate at the Broad Institute for five years. Meyer is a proud dog parent, enjoys talking about Linux, and plays the bassoon.

Sarah Mundy
Sarah is a first-year PhD student advised by Salvatore Stolfo. Her research interests are cybersecurity applied to quantum computing, specifically looking at potential malware attack vectors. Previously, Sarah worked with NASA’s Office of the Chief Human Capital Officer in the workforce planning group, the Pentagon’s Office of the Undersecretary of Defense Research & Engineering under the Principal Director of AI, on DARPA’s Media Forensic program, and with various military and intelligence research groups focused in the AI and ML spaces.

She graduated from the University of Nevada, Reno, with a BS in Electrical Engineering in 2013. She has received the Echostar Spot Award for outstanding performance on a satellite networking project, NAVAIR’s Flight Test Excellence Award for her work planning Tomahawk missile software test flights, the UNR Outstanding Student Service Awards for both the College of Engineering and the Department of Electrical Engineering, 1st and 2nd place in the IEEE Region 6 paper and design competition, respectively, and is a Tau Beta Pi engineering honors society lifetime member.

Her hobbies include running, lifting, hiking, reading science fiction and non-fiction, and caring for her orchids and potted fruit tree.

Argha Talukder
Argha Talukder is interested in machine learning in computational biology, specifically modeling the impact of evolutionary genomics on diseases. She is a first-year PhD student advised by Itsik Pe’er and David Knowles.

In 2021, she earned a BS in Electrical Engineering from Texas A&M University, College Station. In her spare time, she learns new languages by watching foreign films.

Graduate Fellowships for STEM Diversity (GFSD)

The GFSD was founded in 1989 “to increase the number of American citizens with graduate degrees in STEM fields, emphasizing recruitment of a diverse applicant pool.”

Max Chen
Max Chen is a third-year PhD student interested in dialogue systems, conversation modeling, and human-centric artificial intelligence. He works with Zhou Yu to develop better models and systems for multi-party conversations and mixed-initiative contexts.

Chen graduated cum laude in 2021 from Cornell University with a BA in Computer Science and BA in Statistical Science. He also received an NSF Graduate Research Fellowship in 2021. He likes to keep active by going for runs and playing various sports like basketball and ultimate frisbee, enjoys listening to all sorts of music, and plays the violin, piano, and ukulele.

SEAS Fellowships

The School of Engineering and Applied Sciences established the Presidential and SEAS fellowships to recruit outstanding students from around the world to pursue graduate studies at the school.

Mudd Fellows

Siyan “Sylvia” Li
Siyan “Sylvia” Li is a first-year PhD student working on empathetic dialogues in both speech and text modalities and their applications. She is co-advised by Julia Hirschberg and Zhou Yu.

Li completed her BS in Computer Science at Georgia Institute of Technology in 2020 and an MS in Computer Science at Stanford University in 2023. Li enjoys arts and crafts, movies, musicals, and comedy. She is a comedic improviser and is a frequent visitor to Broadway shows.

Jingwen Liu
Jingwen Liu is a first-year PhD student interested in understanding the theoretical properties of current machine learning models and developing algorithms with theoretical guarantees. She is co-advised by Daniel Hsu and Alex Andoni.

Liu graduated summa cum laude with a BS in Mathematics and Computer Science from UC San Diego in 2023. She loves skiing, playing ping pong, and reading fiction in her spare time.

Greenwood Fellow

Matthew Beveridge
Matthew Beveridge is a first-year doctoral student in the CAVE Lab working with Shree Nayar. His research focuses on computer vision, computational imaging, and machine learning for robust perception of the physical environment. Beyond research, Matthew has been involved with startups in the field of autonomy, organized community events around energy and climate, and worked on human spaceflight at NASA. In addition to the Greenwoods Fellowship, he is also a recipient of the LEAP Momentum Fellowship to study the optical properties of atmospheric aerosols.

In 2021, Matthew completed an MEng and BS in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology (MIT) with a double major in Mathematics and a minor in Theater Arts.

Tang Fellow

Cyrus Illick
Cyrus Illickis a first-year PhD student co-advised by Vishal Misra and Dan Rubenstein. He is interested in network systems and will do research on fairness and reliability in congestion control protocols.

In 2023, Illick graduated with a BA in Computer Science from Columbia University. He enjoys playing squash and gardening.

SEAS Fellow

Xiaofeng Yan
Xiaofeng Yan is a first-year PhD student in the MobileX Lab, advised by Xia Zhou. Her research interests are in human-computer interaction and the Internet of Things, with the aim to design and build mobile sensing systems with better usability.

Xiaofeng earned an MS in Information Networking in 2023 from Carnegie Mellon University. In 2021, she graduated from Tsinghua University with a BS in Automation and a second degree in Philosophy.

Distinguished Lecture Series 2023

The Distinguished Lecture series brings computer scientists to Columbia to discuss current issues and research that are affecting their particular research fields.

Cognitive Workforce Revolution with Trustworthy and Self-Learning Generative AI

Monica Lam, Stanford University
CS Auditorium (CSB 451)
November 15, 2023
11:40 AM to 12:40 PM

Generative AI, and in particular Large Language Models (LLMs), have already changed how we work and study. To truly transform the cognitive workforce however, LLMs need to be trustworthy so they can operate autonomously without human oversight. Unfortunately, language models are not grounded and have a tendency to hallucinate.

Our research hypothesis is that we can turn LLM into useful workers across different domains if we (1) teach them how to acquire and apply knowledge in external corpora such as written documents, knowledge bases, and APIs; (2) have them self-learn through model distillation of simulated conversations. We showed that by supplying different external corpora to our Genie assistant framework, we can readily create trustworthy agents that can converse about topics in open domains from Wikidata, Wikipedia, or StackExchange; help navigate services and products such as restaurants or online stores; persuade users to donate to charities; and improve the social skills of people with autism spectrum disorder.

Watch the Video of the Lecture

Causal Representation Learning and Optimal Intervention Design

Caroline Uhler, MIT
CS Auditorium (CSB 451)
November 8, 2023
11:40 AM to 12:40 PM

Massive data collection holds the promise of a better understanding of complex phenomena and, ultimately, of better decisions. Representation learning has become a key driver of deep learning applications since it allows learning latent spaces that capture important properties of the data without requiring any supervised annotations. While representation learning has been hugely successful in predictive tasks, it can fail miserably in causal tasks, including predicting the effect of an intervention. This calls for a marriage between representation learning and causal inference. An exciting opportunity in this regard stems from the growing availability of interventional data (in medicine, advertisement, education, etc.). However, these datasets are still minuscule compared to the action spaces of interest in these applications (e.g. interventions can take on continuous values like the dose of a drug or can be combinatorial as in combinatorial drug therapies). In this talk, we will present initial ideas towards building a statistical and computational framework for causal representation learning and discuss its applications to optimal intervention design in the context of drug design and single-cell biology.

Watch the Video of the Lecture

SmartBook: an AI Prophetess for Disaster Reporting and Forecasting

Heng Ji, University of Illinois at Urbana-Champaign
CS Auditorium (CSB 451)
November 1, 2023
11:40 AM to 12:40 PM

Abstract:
We propose SmartBook, a novel framework that cannot be solved by ChatGPT, targeting situation report generation which consumes large volumes of news data to produce a structured situation report with multiple hypotheses (claims) summarized and grounded with rich links to factual evidence by claim detection, fact checking, misinformation detection and factual error correction. Furthermore, SmartBook can also serve as a novel news event simulator, or an intelligent prophetess. Given “What-if” conditions and dimensions elicited from a domain expert user concerning a disaster scenario, SmartBook will induce schemas from historical events, and automatically generate a complex event graph along with a timeline of news articles that describe new simulated events based on a new Λ-shaped attention mask that can generate text with infinite length. By effectively simulating disaster scenarios in both event graph and natural language format, we expect SmartBook will greatly assist humanitarian workers and policymakers to exercise reality checks (what would the next disaster look like under these given conditions?), and thus better prevent and respond to future disasters.

Watch the Video of the Lecture

Enabling the Era of Immersive Computing

Sarita Adve, University of Illinois at Urbana-Champaign
CS Auditorium (CSB 451)
October 25, 2023
11:40 AM to 12:40 PM

Computing is on the brink of a new immersive era. Recent innovations in virtual/augmented/mixed reality (extended reality or XR) show the potential for a new immersive modality of computing that will transform most human activities and change how we design, program, and use computers. There is, however, an orders of magnitude gap between the power/performance/quality-of-experience attributes of current and desirable immersive systems. Bridging this gap requires an inter-disciplinary research agenda that spans end-user devices, edge, and cloud, is based on hardware-software-algorithm co-design, and is driven by end-to-end human-perceived quality of experience.

The ILLIXR (Illinois Extended Reality) project has developed an open source end-to-end XR system to enable such a research agenda. ILLIXR is being used in academia and industry to quantify the research challenges for desirable immersive experiences and provide solutions to address these challenges. To further push the interdisciplinary frontier for immersive computing, we recently established the IMMERSE center at Illinois to bring together research, education, and infrastructure activities in immersive technologies, applications, and human experience. This talk will give an overview of IMMERSE and a deeper dive into the ILLIXR project, including the ILLIXR infrastructure, its use to identify XR systems research challenges, and cross-system solutions to address several of these challenges.

Watch the Video of the Lecture

Protecting Human Users from Misused AI

Ben Zhao, University of Chicago
CS Auditorium (CSB 451)
October 9, 2023
11:40 AM to 12:40 PM

Abstract:
Recent developments in machine learning and artificial intelligence have taken nearly everyone by surprise. The arrival of arguably the most transformative wave of AI did not bring us smart cities full of self-driving cars, or robots that do our laundry and mow our lawns. Instead, it brought us over-confident token predictors that hallucinate, deepfake generators that produce realistic images and video, and ubiquitous surveillance. In this talk, I’ll describe some of our recent efforts to warn, and later defend against some of the darker side of AI.

In particular, I will tell the story of how our efforts to disrupt unauthorized facial recognition models led unexpectedly to Glaze, a tool to defend human artists against art mimicry by generative image models. I will share some of the ups and downs of implementing and deploying an adversarial ML tool to a global user base, and reflect on mistakes and lessons learned.

Watch the Video of the Lecture

New Tool Automates the Formal Verification of Systems Software

The Software Systems Laboratory developed Spoq, a tool that significantly reduces the time and manual effort needed to verify software systems.

Year: 2023