Research From the Robotics and Vision Group Accepted to CVPR

The 2024 Computer Vision and Pattern Recognition (CVPR) Conference recognizes top research in computer vision, artificial intelligence (AI), machine learning (ML), augmented, virtual, and mixed reality (AR/VR/MR), deep learning, and more.

Assistant Professor Carl Vondrick won a Young Researcher Award, which recognizes researchers within seven years of receiving their Ph.D. who have made distinguished research contributions to computer vision.

New faculty member Aleksander Holynski won a Best Paper Award for work done with Google Research. The paper, Generative Image Dynamics, presents a new approach for modeling natural oscillation dynamics from a single still picture. This approach produces photo-realistic animations from a single picture and significantly outperforms prior baselines. It also demonstrates the potential to enable several downstream applications, such as creating seamlessly looping or interactive image dynamics.


Below are the abstracts:

pix2gestalt: Amodal Segmentation by Synthesizing Wholes 
Ege Ozguroglu Columbia University, Ruoshi Liu Columbia University, Dídac Surís Columbia University, Dian Chen Toyota Research Institute, Achal Dave Toyota Research Institute, Pavel Tokmakov Toyota Research Institute, Carl Vondrick Columbia University

We introduce pix2gestalt, a framework for zero-shot amodal segmentation, which learns to estimate the shape and appearance of whole objects that are only partially visible behind occlusions. By capitalizing on large-scale diffusion models and transferring their representations to this task, we learn a conditional diffusion model for reconstructing whole objects in challenging zero-shot cases, including examples that break natural and physical priors, such as art. As training data, we use a synthetically curated dataset containing occluded objects paired with their whole counterparts. Experiments show that our approach outperforms supervised baselines on established benchmarks. Our model can furthermore be used to significantly improve the performance of existing object recognition and 3D reconstruction methods in the presence of occlusions.


GES: Generalized Exponential Splatting for Efficient Radiance Field Rendering
Abdullah J Hamdi University of Oxford, Luke Melas-Kyriazi University of Oxford, Jinjie Mai King Abdullah University of Science and Technology, Guocheng Qian King Abdullah University of Science and Technology, Ruoshi Liu Columbia University, Carl Vondrick Columbia University, Bernard Ghanem King Abdullah University of Science and Technology, Andrea Vedaldi University of Oxford

Advancements in 3D Gaussian Splatting have significantly accelerated 3D reconstruction and generation. However, the method may require a large number of Gaussians, which creates a substantial memory footprint. This paper introduces GES (Generalized Exponential Splatting), a novel representation that employs the Generalized Exponential Function (GEF) to model 3D scenes, requiring far fewer particles to represent a scene and thus significantly outperforming Gaussian Splatting methods in efficiency, with a plug-and-play replacement ability for Gaussian-based utilities. GES is validated theoretically and empirically in both a principled 1D setup and realistic 3D scenes. It is shown to represent signals with sharp edges more accurately, which are typically challenging for Gaussians due to their inherent low-pass characteristics. Our empirical analysis demonstrates that GEF outperforms Gaussians in fitting naturally occurring signals (e.g., squares, triangles, and parabolic signals), thereby reducing the need for extensive splitting operations that increase the memory footprint of Gaussian Splatting. With the aid of a frequency-modulated loss, GES achieves competitive performance in novel-view synthesis benchmarks while requiring less than half the memory storage of Gaussian Splatting and increasing the rendering speed by up to 39%. The code is available on the project website.
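
To illustrate why a generalized exponential can capture sharp edges with fewer primitives, here is a minimal 1D sketch. It assumes the common form A * exp(-(|x - mu| / alpha) ** beta), which reduces to a Gaussian shape when beta = 2; the abstract does not spell out the formula, so the parameterization, the function names, and the square-pulse test signal are illustrative assumptions rather than the authors' implementation.

```python
import math


def gef(x, mu=0.0, alpha=1.0, beta=2.0, amplitude=1.0):
    """Assumed generalized exponential: amplitude * exp(-(|x - mu| / alpha) ** beta).

    With beta = 2 this has a Gaussian shape; larger beta flattens the top
    and steepens the sides.
    """
    return amplitude * math.exp(-((abs(x - mu) / alpha) ** beta))


def square_pulse(x, half_width=1.0):
    """Hypothetical sharp-edged target: 1 inside [-half_width, half_width], else 0."""
    return 1.0 if abs(x) <= half_width else 0.0


def mean_squared_error(beta, n=2001, lo=-3.0, hi=3.0):
    """MSE between a single GEF primitive and the square pulse on a uniform grid."""
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * step
        total += (gef(x, beta=beta) - square_pulse(x)) ** 2
    return total / n


if __name__ == "__main__":
    for beta in (2.0, 4.0, 8.0):
        print(f"beta={beta}: MSE vs. square pulse = {mean_squared_error(beta):.4f}")
```

In this toy setting a single primitive with a larger beta tracks the square pulse more closely than the Gaussian-shaped one (beta = 2), which is the intuition behind needing fewer splitting operations.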


MoDE: CLIP Data Experts via Clustering
Jiawei Ma Columbia University, Po-Yao Huang FAIR, Meta, Saining Xie New York University, Shang-Wen Li FAIR, Meta, Luke Zettlemoyer University of Washington, Shih-Fu Chang Columbia University, Wen-tau Yih FAIR, Meta, Hu Xu FAIR, Meta

The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, making it less sensitive to false-negative noise in other clusters. At inference time, we ensemble their outputs by applying weights determined through the correlation between task metadata and cluster conditions. To estimate the correlation precisely, the samples in one cluster should be semantically similar, but the number of data experts should still be reasonable for training and inference. As such, we consider the ontology in human language and propose to use fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 by OpenAI CLIP and OpenCLIP on zero-shot image classification but with lower (<35%) training cost. Meanwhile, MoDE can train all data experts asynchronously and can flexibly include new data experts. The code is available at https: // tree/main/mode.
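
The inference-time ensembling described above can be sketched roughly as follows. This is a simplified illustration, not the released MoDE code: it assumes one center vector per expert (the paper uses several fine-grained centers per expert), cosine similarity as the "correlation" measure, and a softmax with a hypothetical temperature to turn similarities into weights.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_logits(task_embedding, cluster_centers, expert_logits, temperature=0.1):
    """Weight each expert's logits by how close the task is to that expert's cluster.

    task_embedding:  embedding of the task metadata (e.g., class names), assumed given.
    cluster_centers: one center per expert (a simplification of the paper's setup).
    expert_logits:   one list of class logits per expert.
    """
    similarities = [cosine(task_embedding, c) for c in cluster_centers]
    weights = softmax(similarities, temperature)
    n_classes = len(expert_logits[0])
    combined = [
        sum(w * logits[k] for w, logits in zip(weights, expert_logits))
        for k in range(n_classes)
    ]
    return weights, combined


if __name__ == "__main__":
    weights, combined = ensemble_logits(
        task_embedding=[1.0, 0.0],
        cluster_centers=[[1.0, 0.0], [0.0, 1.0]],
        expert_logits=[[2.0, 0.0], [0.0, 2.0]],
    )
    print("weights:", weights)
    print("combined logits:", combined)
```

Because the weights depend only on the task metadata and the cluster centers, a new expert can be added by appending its center and logits without retraining the others, which matches the flexibility the abstract highlights.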


What When and Where? Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
Brian Chen Columbia University, Nina Shvetsova Goethe University Frankfurt, Andrew Rouditchenko MIT CSAIL, Daniel Kondermann Quality Match GmbH, Samuel Thomas IBM Research AI, Shih-Fu Chang Columbia University, Rogerio Feris IBM Research AI, James Glass MIT CSAIL, Hilde Kuehne MIT-IBM Watson AI Lab

Spatio-temporal grounding describes the task of localizing events in space and time, e.g., in video data, based on verbal descriptions only. Models for this task are usually trained with human-annotated sentences and bounding box supervision. This work addresses this task from a multimodal supervision perspective, proposing a framework for spatio-temporal action grounding trained on loose video and subtitle supervision only, without human annotation. To this end, we combine local representation learning, which focuses on leveraging fine-grained spatial information, with a global representation encoding that captures higher-level representations and incorporates both in a joint approach. To evaluate this challenging task in a real-life setting, a new benchmark dataset is proposed, providing dense spatio-temporal grounding annotations in long, untrimmed, multi-action instructional videos for over 5K events. We evaluate the proposed approach and other methods on the proposed and standard downstream tasks, showing that our method improves over current baselines in various settings, including spatial, temporal, and untrimmed multi-action spatio-temporal grounding.


GDA: Generalized Diffusion for Robust Test-time Adaptation
Yun-Yun Tsai Columbia University, Fu-Chen Chen Amazon, Albert Chen Amazon, Junfeng Yang Columbia University, Che-Chun Su Amazon, Min Sun Amazon, Cheng-Hao Kuo Amazon

Machine learning models struggle with generalization when encountering out-of-distribution (OOD) samples with unexpected distribution shifts. For vision tasks, recent studies have shown that test-time adaptation employing diffusion models can achieve state-of-the-art accuracy improvements on OOD samples by generating new samples that align with the model’s domain without the need to modify the model’s weights. Unfortunately, those studies have primarily focused on pixel-level corruptions, thereby lacking the generalization to adapt to a broader range of OOD types. We introduce Generalized Diffusion Adaptation (GDA), a novel diffusion-based test-time adaptation method robust against diverse OOD types. Specifically, GDA iteratively guides the diffusion by applying a marginal entropy loss derived from the model, in conjunction with style and content preservation losses during the reverse sampling process. In other words, GDA considers the model’s output behavior with the semantic information of the samples as a whole, which can reduce ambiguity in downstream tasks during the generation process. Evaluation across various popular model architectures and OOD benchmarks shows that GDA consistently outperforms prior work on diffusion-driven adaptation. Notably, it achieves the highest classification accuracy improvements, ranging from 4.4% to 5.02% on ImageNet-C and 2.5% to 7.4% on Rendition, Sketch, and Stylized benchmarks. This performance highlights GDA’s generalization to a broader range of OOD benchmarks.
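
For readers unfamiliar with the term, the sketch below computes a marginal entropy in the sense commonly used in the test-time adaptation literature: the entropy of the class distribution averaged over several views of the same input. The abstract does not define the loss precisely, so this definition, the function names, and the toy logits are assumptions; the sketch also only evaluates the quantity and does not show how GDA uses it to guide reverse diffusion sampling.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def marginal_entropy(logits_per_view):
    """Entropy of the class distribution averaged over views (assumed definition).

    logits_per_view: one list of class logits per augmented view of the same sample.
    Low values mean the classifier is confident and consistent across views.
    """
    probs = [softmax(logits) for logits in logits_per_view]
    n_views = len(probs)
    n_classes = len(probs[0])
    marginal = [sum(p[k] for p in probs) / n_views for k in range(n_classes)]
    return -sum(p * math.log(p) for p in marginal if p > 0.0)


if __name__ == "__main__":
    consistent = [[5.0, 0.0, 0.0], [5.0, 0.0, 0.0]]   # views agree on class 0
    conflicting = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]  # views disagree
    print("consistent: ", marginal_entropy(consistent))
    print("conflicting:", marginal_entropy(conflicting))
```

Minimizing this quantity favors generated samples on which the classifier is both confident and consistent, which is one plausible reading of how the loss reduces ambiguity for the downstream task.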


Generating Illustrated Instructions
Sachit Menon Columbia University, Ishan Misra GenAI, Meta, Rohit Girdhar GenAI, Meta

We introduce a new task of generating “Illustrated Instructions”, i.e., visual instructions customized to a user’s needs. We identify desiderata unique to this task, and formalize it through a suite of automatic and human evaluation metrics, designed to measure the validity, consistency, and efficacy of the generations. We combine the power of large language models (LLMs) together with strong text-to-image generation diffusion models to propose a simple approach called StackedDiffusion, which generates such illustrated instructions given text as input. The resulting model strongly outperforms baseline approaches and state-of-the-art multimodal LLMs; and in 30% of cases, users even prefer it to human-generated articles. Most notably, it enables various new and exciting applications far beyond what static articles on the web can provide, such as personalized instructions complete with intermediate steps and pictures in response to a user’s individual situation.