Research From the Robotics and Vision Group Accepted to CVPR

The 2024 Computer Vision and Pattern Recognition (CVPR) Conference recognizes top research in computer vision, artificial intelligence (AI), machine learning (ML), augmented, virtual, and mixed reality (AR/VR/MR), deep learning, and more.

Assistant Professor Carl Vondrick won a Young Researcher Award, which recognizes researchers within seven years of receiving their Ph.D. who have made distinguished research contributions to computer vision.

New faculty member Aleksander Holynski won a Best Paper Award for work done with Google Research. The paper, Generative Image Dynamics, presents a new approach for modeling natural oscillation dynamics from a single still image. The approach produces photo-realistic animations that significantly outperform prior baselines and shows the potential to enable downstream applications such as seamlessly looping or interactive image dynamics.

 

Below are the abstracts:

pix2gestalt: Amodal Segmentation by Synthesizing Wholes 
Ege Ozguroglu (Columbia University), Ruoshi Liu (Columbia University), Dídac Surís (Columbia University), Dian Chen (Toyota Research Institute), Achal Dave (Toyota Research Institute), Pavel Tokmakov (Toyota Research Institute), Carl Vondrick (Columbia University)

Abstract:
We introduce pix2gestalt, a framework for zero-shot amodal segmentation, which learns to estimate the shape and appearance of whole objects that are only partially visible behind occlusions. By capitalizing on large-scale diffusion models and transferring their representations to this task, we learn a conditional diffusion model for reconstructing whole objects in challenging zero-shot cases, including examples that break natural and physical priors, such as art. As training data, we use a synthetically curated dataset containing occluded objects paired with their whole counterparts. Experiments show that our approach outperforms supervised baselines on established benchmarks. Our model can furthermore be used to significantly improve the performance of existing object recognition and 3D reconstruction methods in the presence of occlusions.

 

GES: Generalized Exponential Splatting for Efficient Radiance Field Rendering
Abdullah J. Hamdi (University of Oxford), Luke Melas-Kyriazi (University of Oxford), Jinjie Mai (King Abdullah University of Science and Technology), Guocheng Qian (King Abdullah University of Science and Technology), Ruoshi Liu (Columbia University), Carl Vondrick (Columbia University), Bernard Ghanem (King Abdullah University of Science and Technology), Andrea Vedaldi (University of Oxford)

Abstract:
Advancements in 3D Gaussian Splatting have significantly accelerated 3D reconstruction and generation. However, it may require a large number of Gaussians, which creates a substantial memory footprint. This paper introduces GES (Generalized Exponential Splatting), a novel representation that employs the Generalized Exponential Function (GEF) to model 3D scenes, requiring far fewer particles to represent a scene and thus significantly outperforming Gaussian Splatting methods in efficiency, with a plug-and-play replacement ability for Gaussian-based utilities. GES is validated theoretically and empirically in both a principled 1D setup and realistic 3D scenes. It is shown to represent signals with sharp edges more accurately, which are typically challenging for Gaussians due to their inherent low-pass characteristics. Our empirical analysis demonstrates that GEF outperforms Gaussians in fitting naturally occurring signals (e.g., squares, triangles, parabolic signals), thereby reducing the need for extensive splitting operations that increase the memory footprint of Gaussian Splatting. With the aid of a frequency-modulated loss, GES achieves competitive performance in novel-view synthesis benchmarks while requiring less than half the memory storage of Gaussian Splatting and increasing the rendering speed by up to 39%. The code is available on the project website https://abdullahamdi.com/ges.
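For intuition, here is a minimal Python sketch (not the authors' code) of a one-dimensional generalized exponential kernel of the form exp(-(|x - mu| / alpha)^beta): beta = 2 recovers a Gaussian profile, while larger beta gives flatter tops and sharper falloff, which is why fewer such primitives can cover hard edges. The exact parameterization used in GES may differ.

import numpy as np

def gef(x, mu=0.0, alpha=1.0, beta=2.0):
    # Generalized exponential kernel; beta = 2.0 gives a Gaussian-shaped profile,
    # larger beta approaches a box with sharp edges.
    return np.exp(-np.abs((x - mu) / alpha) ** beta)

x = np.linspace(-3.0, 3.0, 7)
print(gef(x, beta=2.0))  # smooth, Gaussian-like falloff
print(gef(x, beta=8.0))  # nearly flat inside |x| < alpha, then drops off sharply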

 

MoDE: CLIP Data Experts via Clustering
Jiawei Ma (Columbia University), Po-Yao Huang (FAIR, Meta), Saining Xie (New York University), Shang-Wen Li (FAIR, Meta), Luke Zettlemoyer (University of Washington), Shih-Fu Chang (Columbia University), Wen-tau Yih (FAIR, Meta), Hu Xu (FAIR, Meta)

Abstract:
The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, making it less sensitive to false-negative noise in other clusters. At inference time, we ensemble their outputs by applying weights determined through the correlation between task metadata and cluster conditions. To estimate the correlation precisely, the samples in one cluster should be semantically similar, but the number of data experts should still be reasonable for training and inference. As such, we consider the ontology in human language and propose to use fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 by OpenAI CLIP and OpenCLIP on zero-shot image classification at less than 35% of the training cost. Meanwhile, MoDE can train all data experts asynchronously and can flexibly include new data experts. The code is available at https://github.com/facebookresearch/MetaCLIP/tree/main/mode.
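The clustering-and-ensembling idea can be illustrated with a small Python sketch; the embeddings, experts, and weighting function below are illustrative stand-ins, not the MoDE implementation.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
caption_emb = rng.normal(size=(1000, 64))                   # stand-in caption embeddings
kmeans = KMeans(n_clusters=4, n_init=10).fit(caption_emb)   # one data expert per cluster

def ensemble_logits(expert_logits, task_meta_emb, centers, tau=0.1):
    # Weight each data expert by how well the task metadata matches its cluster center.
    sims = centers @ task_meta_emb
    weights = np.exp((sims - sims.max()) / tau)
    weights /= weights.sum()
    return sum(w * logits for w, logits in zip(weights, expert_logits))

experts = [rng.normal(size=10) for _ in range(4)]           # stand-in per-expert classification logits
print(ensemble_logits(experts, rng.normal(size=64), kmeans.cluster_centers_))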

 

What, When, and Where? Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
Brian Chen (Columbia University), Nina Shvetsova (Goethe University Frankfurt), Andrew Rouditchenko (MIT CSAIL), Daniel Kondermann (Quality Match GmbH), Samuel Thomas (IBM Research AI), Shih-Fu Chang (Columbia University), Rogerio Feris (IBM Research AI), James Glass (MIT CSAIL), Hilde Kuehne (MIT-IBM Watson AI Lab)

Abstract:
Spatio-temporal grounding describes the task of localizing events in space and time, e.g., in video data, based on verbal descriptions only. Models for this task are usually trained with human-annotated sentences and bounding box supervision. This work addresses this task from a multimodal supervision perspective, proposing a framework for spatio-temporal action grounding trained on loose video and subtitle supervision only, without human annotation. To this end, we combine local representation learning, which focuses on leveraging fine-grained spatial information, with a global representation encoding that captures higher-level representations and incorporates both in a joint approach. To evaluate this challenging task in a real-life setting, a new benchmark dataset is proposed, providing dense spatio-temporal grounding annotations in long, untrimmed, multi-action instructional videos for over 5K events. We evaluate the proposed approach and other methods on the proposed and standard downstream tasks, showing that our method improves over current baselines in various settings, including spatial, temporal, and untrimmed multi-action spatio-temporal grounding.

 

GDA: Generalized Diffusion for Robust Test-time Adaptation
Yun-Yun Tsai (Columbia University), Fu-Chen Chen (Amazon), Albert Chen (Amazon), Junfeng Yang (Columbia University), Che-Chun Su (Amazon), Min Sun (Amazon), Cheng-Hao Kuo (Amazon)

Abstract:
Machine learning models struggle with generalization when encountering out-of-distribution (OOD) samples with unexpected distribution shifts. For vision tasks, recent studies have shown that test-time adaptation employing diffusion models can achieve state-of-the-art accuracy improvements on OOD samples by generating new samples that align with the model’s domain without the need to modify the model’s weights. Unfortunately, those studies have primarily focused on pixel-level corruptions, thereby lacking the generalization to adapt to a broader range of OOD types. We introduce Generalized Diffusion Adaptation (GDA), a novel diffusion-based test-time adaptation method robust against diverse OOD types. Specifically, GDA iteratively guides the diffusion by applying a marginal entropy loss derived from the model, in conjunction with style and content preservation losses during the reverse sampling process. In other words, GDA considers the model’s output behavior with the semantic information of the samples as a whole, which can reduce ambiguity in downstream tasks during the generation process. Evaluation across various popular model architectures and OOD benchmarks shows that GDA consistently outperforms prior work on diffusion-driven adaptation. Notably, it achieves the highest classification accuracy improvements, ranging from 4.4% to 5.02% on ImageNet-C and 2.5% to 7.4% on the Rendition, Sketch, and Stylized benchmarks. This performance highlights GDA’s generalization to a broader range of OOD benchmarks.
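The kind of guidance objective described above can be sketched in a few lines of PyTorch; the specific loss terms, weights, and tiny classifier below are illustrative stand-ins, not the GDA implementation.

import torch
import torch.nn.functional as F

def guidance_loss(x_hat, x_orig, classifier, w_ent=1.0, w_style=0.1, w_content=0.1):
    # Marginal entropy of the classifier's average prediction over the batch:
    # minimizing it pushes generated samples toward confident, consistent predictions.
    probs = classifier(x_hat).softmax(dim=-1).mean(dim=0)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum()
    # Crude style/content preservation terms keeping the sample close to the original input.
    style = F.mse_loss(x_hat.mean(dim=(2, 3)), x_orig.mean(dim=(2, 3)))
    content = F.mse_loss(x_hat, x_orig)
    return w_ent * entropy + w_style * style + w_content * content

classifier = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
x_hat = torch.rand(4, 3, 8, 8, requires_grad=True)
loss = guidance_loss(x_hat, x_hat.detach(), classifier)
loss.backward()  # the gradient w.r.t. x_hat would steer each reverse diffusion step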

 

Generating Illustrated Instructions
Sachit Menon (Columbia University), Ishan Misra (GenAI, Meta), Rohit Girdhar (GenAI, Meta)

Abstract:
We introduce a new task of generating “Illustrated Instructions”, i.e. visual instructions customized to a user’s needs. We identify desiderata unique to this task, and formalize it through a suite of automatic and human evaluation metrics, designed to measure the validity, consistency, and efficacy of the generations. We combine the power of large language models (LLMs) together with strong text-to-image generation diffusion models to propose a simple approach called StackedDiffusion, which generates such illustrated instructions given text as input. The resulting model strongly outperforms baseline approaches and state-of-the-art multimodal LLMs; and in 30% of cases, users even prefer it to human-generated articles. Most notably, it enables various new and exciting applications far beyond what static articles on the web can provide, such as personalized instructions complete with intermediate steps and pictures in response to a user’s individual situation.
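At a high level, the pipeline couples a language model with a text-to-image model. The Python sketch below only illustrates that structure with placeholder functions; it is not the StackedDiffusion architecture itself.

def plan_steps(goal):
    # Stand-in for the LLM: break the user's goal into ordered instruction steps.
    return [f"Step {i} toward '{goal}' (placeholder text)" for i in range(1, 4)]

def illustrate(step_text):
    # Stand-in for the text-to-image diffusion model.
    return f"<generated image conditioned on: {step_text}>"

def illustrated_instructions(goal):
    # Pair every generated step with a matching generated illustration.
    return [(step, illustrate(step)) for step in plan_steps(goal)]

for step, image in illustrated_instructions("repot a houseplant"):
    print(step, image)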

Three Columbia Engineering Researchers Win Amazon Research Awards

Proposals from Mechanical Engineering Professor Matei Ciocarlie and Computer Science Professors Tal Malkin and Carl Vondrick focus on wearable robotic devices for stroke victims, cryptography techniques for LLMs, and improvements in computer vision queries.

Five Papers From The Computer Vision Group Accepted To ICCV 2023

Research papers from the Computer Vision Group were accepted to the International Conference on Computer Vision (ICCV ’23), one of the premier international conferences in computer vision, which also features workshops and tutorials.

 

ViperGPT: Visual Inference via Python Execution for Reasoning
Dídac Surís (Columbia University), Sachit Menon (Columbia University), Carl Vondrick (Columbia University)

Answering visual queries is a complex task that requires both visual processing and reasoning. End-to-end models, the dominant approach for this task, do not explicitly differentiate between the two, limiting interpretability and generalization. Learning modular programs presents a promising alternative, but has proven challenging due to the difficulty of learning both the programs and modules simultaneously. We introduce ViperGPT, a framework that leverages code-generation models to compose vision-and-language models into subroutines to produce a result for any query. ViperGPT utilizes a provided API to access the available modules, and composes them by generating Python code that is later executed. This simple approach requires no further training, and achieves state-of-the-art results across various complex visual tasks.
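A toy version of that loop, with a single stand-in vision module and a hard-coded "generated" program, looks like the Python below; the real system exposes a much richer image-patch API and prompts a code-generation model with it.

def find(image, object_name):
    # Stand-in for a vision module such as an open-vocabulary detector.
    return [obj for obj in image["objects"] if obj["name"] == object_name]

def generate_code(query):
    # Stand-in for the code-generation model, which sees the API docs plus the query
    # and returns an executable Python program.
    return "result = len(find(image, 'mug'))"

image = {"objects": [{"name": "mug"}, {"name": "mug"}, {"name": "spoon"}]}
scope = {"find": find, "image": image}
exec(generate_code("How many mugs are on the table?"), scope)
print(scope["result"])  # -> 2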

 

Zero-1-to-3: Zero-shot One Image to 3D Object
Ruoshi Liu (Columbia University), Rundi Wu (Columbia University), Basile Van Hoorick (Columbia University), Pavel Tokmakov (Toyota Research Institute), Sergey Zakharov (Toyota Research Institute), Carl Vondrick (Columbia University)

We introduce Zero-1-to-3, a framework for changing the camera viewpoint of an object given just a single RGB image. To perform novel view synthesis in this under-constrained setting, we capitalize on the geometric priors that large-scale diffusion models learn about natural images. Our conditional diffusion model uses a synthetic dataset to learn controls of the relative camera viewpoint, which allow new images to be generated of the same object under a specified camera transformation. Even though it is trained on a synthetic dataset, our model retains a strong zero-shot generalization ability to out-of-distribution datasets as well as in-the-wild images, including impressionist paintings. Our viewpoint-conditioned diffusion approach can further be used for the task of 3D reconstruction from a single image. Qualitative and quantitative experiments show that our method significantly outperforms state-of-the-art single-view 3D reconstruction and novel view synthesis models by leveraging Internet-scale pre-training.
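As a toy illustration of viewpoint conditioning, the relative camera change can be expressed as a small vector (here in spherical coordinates) and fed to the diffusion model as conditioning; the actual parameterization used by Zero-1-to-3 may differ.

import numpy as np

def relative_viewpoint(src, dst):
    # src, dst: (azimuth_deg, elevation_deg, radius) of the input and target cameras.
    d_azimuth = (dst[0] - src[0]) % 360.0
    d_elevation = dst[1] - src[1]
    d_radius = dst[2] - src[2]
    return np.array([d_azimuth, d_elevation, d_radius], dtype=np.float32)

cond = relative_viewpoint(src=(30.0, 10.0, 1.5), dst=(120.0, 25.0, 1.5))
print(cond)  # this relative-pose vector would condition the denoising network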

 

Muscles in Action
Mia Chiquier (Columbia University), Carl Vondrick (Columbia University)

Human motion is created by, and constrained by, our muscles. We take a first step at building computer vision methods that represent the internal muscle activity that causes motion. We present a new dataset, Muscles in Action (MIA), to learn to incorporate muscle activity into human motion representations. The dataset consists of 12.5 hours of synchronized video and surface electromyography (sEMG) data of 10 subjects performing various exercises. Using this dataset, we learn a bidirectional representation that predicts muscle activation from video, and conversely, reconstructs motion from muscle activation. We evaluate our model on in-distribution subjects and exercises, as well as on out-of-distribution subjects and exercises. We demonstrate how advances in modeling both modalities jointly can serve as conditioning for muscularly consistent motion generation. Putting muscles into computer vision systems will enable richer models of virtual humans, with applications in sports, fitness, and AR/VR.

 

SurfsUp: Learning Fluid Simulation for Novel Surfaces
Arjun Mani (Columbia University), Ishaan Preetam Chandratreya (Columbia University), Elliot Creager (University of Toronto), Carl Vondrick (Columbia University), Richard Zemel (Columbia University)

Modeling the mechanics of fluid in complex scenes is vital to applications in design, graphics, and robotics. Learning-based methods provide fast and differentiable fluid simulators; however, most prior work is unable to accurately model how fluids interact with genuinely novel surfaces not seen during training. We introduce SURFSUP, a framework that represents objects implicitly using signed distance functions (SDFs), rather than an explicit representation of meshes or particles. This continuous representation of geometry enables more accurate simulation of fluid-object interactions over long time periods while simultaneously making computation more efficient. Moreover, SURFSUP trained on simple shape primitives generalizes considerably out-of-distribution, even to complex real-world scenes and objects. Finally, we show we can invert our model to design simple objects to manipulate fluid flow.
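The convenience of the SDF representation for fluid-object interaction can be seen in a short sketch: the sign tells inside from outside, the value gives distance to the surface, and the gradient gives the outward normal used to resolve penetration. The sphere SDF and projection step below are generic illustrations, not the SURFSUP model.

import numpy as np

def sphere_sdf(p, center=np.zeros(3), radius=1.0):
    # Signed distance to a sphere: negative inside, positive outside.
    return np.linalg.norm(p - center) - radius

def sdf_normal(sdf, p, eps=1e-4):
    # Finite-difference gradient of the SDF, normalized to a unit surface normal.
    grad = np.array([
        sdf(p + np.array([eps, 0.0, 0.0])) - sdf(p - np.array([eps, 0.0, 0.0])),
        sdf(p + np.array([0.0, eps, 0.0])) - sdf(p - np.array([0.0, eps, 0.0])),
        sdf(p + np.array([0.0, 0.0, eps])) - sdf(p - np.array([0.0, 0.0, eps])),
    ])
    return grad / np.linalg.norm(grad)

particle = np.array([0.5, 0.2, 0.1])  # a fluid particle that has penetrated the sphere
d = sphere_sdf(particle)
if d < 0:
    particle = particle - d * sdf_normal(sphere_sdf, particle)  # project back onto the surface
print(d, particle)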

 

Landscape Learning for Neural Network Inversion
Ruoshi Liu (Columbia University), Chengzhi Mao (Columbia University), Purva Tendulkar (Columbia University), Hao Wang (Rutgers University), Carl Vondrick (Columbia University)

Many machine learning methods operate by inverting a neural network at inference time, which has become a popular technique for solving inverse problems in computer vision, robotics, and graphics. However, these methods often involve gradient descent through a highly non-convex loss landscape, causing the optimization process to be unstable and slow. We introduce a method that learns a loss landscape where gradient descent is efficient, bringing massive improvement and acceleration to the inversion process. We demonstrate this advantage on a number of methods for both generative and discriminative tasks, including GAN inversion, adversarial defense, and 3D human pose reconstruction.
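A minimal sketch of the general idea: instead of running gradient descent directly on the latent code, descend on an auxiliary variable and map it through a small learned network whose job is to make the effective landscape easier to optimize. The generator, mapping network, and target below are stand-ins, not the paper's models.

import torch

G = torch.nn.Linear(16, 64)   # stand-in for a frozen generator (e.g., for GAN inversion)
g = torch.nn.Linear(16, 16)   # learned mapping that reshapes the optimization landscape
target = torch.randn(64)      # observation we want to invert

u = torch.zeros(16, requires_grad=True)
optimizer = torch.optim.SGD([u], lr=0.1)
for _ in range(100):
    optimizer.zero_grad()
    loss = ((G(g(u)) - target) ** 2).mean()  # reconstruction loss measured in output space
    loss.backward()
    optimizer.step()
print(loss.item())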

 

AI Learns to Predict Human Behavior from Videos

Assistant Professor Carl Vondrick, Dídac Surís, and Ruoshi Liu developed a computer vision algorithm for predicting human interactions and body language in video, a capability that could have applications for assistive technology, autonomous vehicles, and collaborative robots.

Carl Vondrick Wins NSF CAREER Award

Assistant Professor Carl Vondrick has won the National Science Foundation’s (NSF) Faculty Early Career Development award for his proposal to develop machine perception systems that robustly detect and track objects even when they disappear from sight, thereby enabling machines to build spatial awareness of their surroundings.

Robot Displays a Glimmer of Empathy to a Partner Robot

A Columbia Engineering robot has learned to predict its partner robot’s future actions and goals based on just a few initial video frames. The study is part of a broader effort to endow robots with the ability to understand and anticipate the goals of other robots, purely from visual observations.

CS Welcomes New Faculty

The department welcomes Baishakhi Ray, Ronghui Gu, Carl Vondrick, and Tony Dear.

Baishakhi Ray
Assistant Professor, Computer Science
PhD, University of Texas, Austin, 2013; MS, University of Colorado, Boulder, 2009; BTech, Calcutta University, India, 2004; BSc, Presidency College, India, 2001

Baishakhi Ray works on end-to-end software solutions, treating the entire software system as a whole: anything from debugging, patching, security, and performance to development methodology and even the user experience of developers and users.

At the moment her research is focused on bias in machine learning. For example, some models see a picture of a baby and a man and identify it as a woman and child. Her team is developing ways to train systems so they avoid such errors and solve practical problems.

Ray previously taught at the University of Virginia and was a postdoctoral fellow in computer science at the University of California, Davis. In 2017, she received Best Paper Awards at the SIGSOFT Symposium on the Foundations of Software Engineering and the International Conference on Mining Software Repositories.

Ronghui Gu
Assistant Professor, Computer Science
PhD, Yale University, 2017; Tsinghua University, China, 2011

Ronghui Gu focuses on programming languages and operating systems, specifically language-based support for safety and security, certified system software, certified programming and compilation, formal methods, and concurrency reasoning. He seeks to build certified concurrent operating systems that can resist cyberattacks.

Gu previously worked at Google and co-founded CertiK, a formal verification platform for smart contracts and blockchain ecosystems. The startup grew out of his thesis, which proposed CertiKOS, a comprehensive verification framework. CertiKOS is used in the high-profile DARPA programs CRASH and HACMS, is a core component of DeepSpec, an NSF Expeditions in Computing project, and has been widely considered “a real breakthrough” toward hacker-resistant systems.

Carl Vondrick
Assistant Professor, Computer Science
PhD, Massachusetts Institute of Technology, 2017; BS, University of California, Irvine, 2011

Carl Vondrick’s research focuses on computer vision and machine learning. His work often uses large amounts of unlabeled data to teach perception to machines. Other interests include interpretable models, high-level reasoning, and perception for robotics.

His past research developed computer systems that watch video in order to anticipate human actions, recognize ambient sounds, and visually track objects. Computer vision is enabling applications across health, security, and robotics, but current systems require large labeled datasets, which are expensive to collect. Instead, Vondrick’s research develops systems that learn from unlabeled data, which will enable computer vision systems to efficiently scale up and tackle versatile tasks. His research, which included training computer vision models by binge-watching TV shows, has been featured on CNN, in Wired, and in a skit on the Late Show with Stephen Colbert.

Recently, three research papers he worked on were presented at the European Conference on Computer Vision (ECCV). Vondrick comes to Columbia from Google Research, where he was a research scientist.

Tony Dear
Lecturer in Discipline, Computer Science
PhD, Carnegie Mellon University, 2018; MS, Carnegie Mellon University, 2015; BS, University of California, Berkeley, 2012

Tony Dear’s research and pedagogical interests lie in bringing theory into practice. His PhD research applied analytical tools to motion planning for physical locomoting robotic systems that violate certain ideal assumptions but still exhibit some structure: how to get unconventional robots to move with the stealth of animals and other biological organisms, how to simplify those tools and extend them to other systems, and how to generalize mathematical models for use across multiple robots.

In his teaching, Dear strives to engage students with relatable examples and projects, as well as alternative ways of learning, such as an online curriculum with lecture videos. He completed the Future Faculty Program at the Eberly Center for Teaching Excellence at Carnegie Mellon and has been the recipient of a National Defense Science and Engineering Graduate Fellowship.

At Columbia, Dear is looking forward to teaching computer science, robotics, and AI. He hopes to continue small-scale research projects in robotic locomotion and conduct outreach to teach teens STEM and robotics courses.