Voices of CS: Purva Tendulkar

The fourth-year PhD student’s love of cartoons became the springboard for a research career advancing realism in human-centric generative computer vision.

As kids, many of us spent countless hours watching cartoons, getting lost in the colorful worlds and playful characters. For most, these shows are simply a source of entertainment, a fun escape into magical realms. But for Purva Tendulkar, those endless hours of watching animated movies and shows became something more—a spark that ignited a lifelong curiosity about how animation works.

While other kids pressed play to enjoy the story, Tendulkar found herself drawn to the process behind the scenes. She would watch “making of” videos, fascinated by the creative techniques that brought her favorite characters to life. This early passion for animation has stayed with her, growing into a dedicated pursuit in her PhD studies, where she now works with Carl Vondrick on using computer vision and graphics techniques to make films and video games come alive. Her research, recognized with the prestigious Apple Scholars in AIML PhD fellowship, explores how to create digital humans that interact more authentically with their environments, aiming to push the boundaries of perceptive and generative tools for designers and developers.

Photos: (1) Tendulkar at CVPR; (2) Purva Tendulkar, Samir Gadre, Revant Teotia, Basile Van Hoorick, Ruoshi Liu, and Sachit Menon; (3) Dídac Surís, Purva Tendulkar, Scott Geng, Arjun Mani, Sachit Menon, Ishaan Chandratreya, Basile Van Hoorick, Carl Vondrick, Sruthi Sudhakar, Mia Chiquier, and Revant Teotia.

In this interview, Tendulkar delves into the inspiration behind her research and her vision for the future of realistic human motion in digital environments.

Q: Can you describe your research focus and what motivates your work?
My research interests lie at the intersection of Computer Vision, Machine Learning, and Computer Graphics. My vision is to emulate the varied facets of human behavior authentically. I work on understanding and synthesizing humans and how they interact with their physical surroundings. This has applications in developing cutting-edge video games, robotic simulators, and immersive AR/VR experiences – all of which cater to human needs and behaviors.

I grew up watching Disney movies and have been fascinated with the magic that could be created on-screen by bringing animated characters to life and the emotions they evoked in me. I would find myself watching hours of behind-the-scenes footage of these movies – how the artists hand-designed each character’s personality and refined the animations (e.g., sword fights) to make them more realistic.

This directly shaped my choice of topic in the first year of my PhD – synthesizing the realistic ways humans might interact with the world. This theme has stayed consistent throughout my PhD.

 

Q: What challenges and questions drive your research?
Humans interact with the world, as well as each other, in a variety of physically and logically plausible ways that come very naturally to us. However, it is difficult to teach machines what is plausible and what is not – one of the biggest challenges is the data. Presently, human motion data arises from highly accurate but extremely expensive motion capture systems that are catered to a specific scene. On the other hand, there is abundant information present in internet videos (e.g., on YouTube) that cannot possibly be captured in a studio. Through my research, I aim to creatively combine the benefits of complementary data sources to build powerful generative models that wouldn’t be possible with just one source.

I have worked on generating 3D human-object interactions. Concretely, given a 3D object in a 3D environment (e.g., a mug on a kitchen countertop), my method, called FLEX, generates a number of diverse human avatars grasping the object realistically. Such a tool could serve as a template for animation artists to work with.

Earlier works that tackled this problem collected expensive full-body 3D human-object interaction data and trained models on it. However, such methods suffer from the limitations of the dataset and do not scale when the object appears in a configuration unseen during training. For example, when an object was placed on the floor, previous methods would generate humans sinking into the floor rather than kneeling or squatting to reach it.

Instead of creating a model that has to be trained to learn a full-body pose to grasp an object, we decided to combine two information sources – systems that can generate “hand-only” grasps and full-body pose data captured without objects. We then developed an algorithm that optimizes a full-body pose to match the “hand-only” grasp. We found that our approach, which does not use full-body grasping data, outperforms methods trained on it, challenging existing data collection paradigms for 3D humans and advocating for better utilization of existing sources.
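To make the idea concrete, here is a minimal toy sketch of this kind of optimization, assuming a drastically simplified body model. The function, joint indices, and regularizer below are placeholders for illustration only, not the actual FLEX implementation, which optimizes over learned pose and grasp priors.

```python
# Toy sketch: fit a full-body pose so its hand joints match a "hand-only" grasp,
# while a simple regularizer stands in for a learned full-body pose prior.
import torch

torch.manual_seed(0)

N_JOINTS = 22            # toy full-body skeleton (placeholder)
HAND_IDX = [20, 21]      # toy indices of the wrist/hand joints (placeholder)

def forward_kinematics(pose_params):
    """Stand-in for a real body model: maps pose parameters to 3D joints."""
    return pose_params.view(N_JOINTS, 3)

# Target hand joints produced by a separate "hand-only" grasping model.
target_hand_joints = torch.randn(len(HAND_IDX), 3)

pose = torch.zeros(N_JOINTS * 3, requires_grad=True)
rest_pose = torch.zeros(N_JOINTS * 3)
opt = torch.optim.Adam([pose], lr=0.05)

for step in range(200):
    joints = forward_kinematics(pose)
    grasp_loss = ((joints[HAND_IDX] - target_hand_joints) ** 2).mean()
    prior_loss = ((pose - rest_pose) ** 2).mean()   # crude stand-in for a pose prior
    loss = grasp_loss + 0.01 * prior_loss
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final grasp error: {grasp_loss.item():.4f}")
```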

 

Q: Do you play video games, and how do they influence your research?
I quite enjoy playing video games in my free time. Still, I often find myself spending more time appreciating/critiquing the design rather than finishing the game. Some games that I think have great graphics are Horizon Zero Dawn, God of War, and Uncharted.

I think the visuals in current video games are truly impressive from a graphics/rendering standpoint – clothing, skin, object textures, and lighting effects are all very realistic and convincing. But there’s still a lot of room for improvement when it comes to how characters move and interact.

Most characters walk the same way, which strips away their unique personalities. Even human-object interaction feels templated and unrealistic – objects sometimes appear to just “stick” to a hand instead of being grasped realistically. Sometimes, you see characters behave in unpredictable ways when interacting with the world around them. For instance, if you try to move your character into a spot that isn’t pre-programmed, like a tight space in a wall, the game just freaks out, and you might end up walking through the wall! And when it comes to how characters react to each other beyond scripted scenes, it feels a bit off. Say, if you have your character run circles around others or bump into them, the other characters barely react – they just go on about their business as if nothing’s happening.

I find all these problems very fascinating! Getting characters to behave in a truly convincing way would really prove how deeply we understand human behavior. After all, our world is built by humans, for humans, and I am excited to continue pushing the frontiers of 3D human research.

Research From the Robotics and Vision Group Accepted to CVPR

The 2024 Conference on Computer Vision and Pattern Recognition (CVPR) recognizes top research in computer vision, artificial intelligence (AI), machine learning (ML), augmented, virtual, and mixed reality (AR/VR/MR), deep learning, and more.

Assistant Professor Carl Vondrick won a Young Researcher Award, which recognizes researchers within seven years of receiving their Ph.D. who have made distinguished research contributions to computer vision.

New faculty member Aleksander Holynski won a Best Paper Award for work done with Google Research. The paper, Generative Image Dynamics, presents a new approach for modeling natural oscillation dynamics from a single still picture. The approach produces photo-realistic animations from a single picture and significantly outperforms prior baselines. It also demonstrates the potential to enable several downstream applications, such as creating seamlessly looping or interactive image dynamics.

 

Below are the abstracts:

pix2gestalt: Amodal Segmentation by Synthesizing Wholes 
Ege Ozguroglu (Columbia University), Ruoshi Liu (Columbia University), Dídac Surís (Columbia University), Dian Chen (Toyota Research Institute), Achal Dave (Toyota Research Institute), Pavel Tokmakov (Toyota Research Institute), Carl Vondrick (Columbia University)

Abstract:
We introduce pix2gestalt, a framework for zero-shot amodal segmentation, which learns to estimate the shape and appearance of whole objects that are only partially visible behind occlusions. By capitalizing on large-scale diffusion models and transferring their representations to this task, we learn a conditional diffusion model for reconstructing whole objects in challenging zero-shot cases, including examples that break natural and physical priors, such as art. As training data, we use a synthetically curated dataset containing occluded objects paired with their whole counterparts. Experiments show that our approach outperforms supervised baselines on established benchmarks. Our model can furthermore be used to significantly improve the performance of existing object recognition and 3D reconstruction methods in the presence of occlusions.
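As a rough illustration of what such a synthetically curated training pair can look like, the toy sketch below composites an occluder over a fully visible object to produce an (occluded view, whole object) pair; the arrays and the "object" itself are placeholders, not the paper's actual data pipeline or model.

```python
# Illustrative sketch: build one (occluded image, whole object) training pair
# by pasting a synthetic occluder over a fully visible object. A conditional
# diffusion model would then be trained to recover the whole object.
import numpy as np

H, W = 64, 64

whole_object = np.zeros((H, W), dtype=np.float32)
whole_object[16:48, 16:48] = 1.0          # fully visible "object" (placeholder)

occluder = np.zeros((H, W), dtype=np.float32)
occluder[8:40, 32:56] = 1.0               # synthetic occluder pasted on top

occluded_view = np.where(occluder > 0, 0.5, whole_object)   # occluder hides part of the object
visible_mask = (whole_object > 0) & (occluder == 0)

pair = {
    "input_image": occluded_view,     # conditioning: what the model sees
    "visible_mask": visible_mask,     # which object pixels remain visible
    "target": whole_object,           # supervision: the whole (amodal) object
}
print({k: v.shape for k, v in pair.items()})
```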

 

GES: Generalized Exponential Splatting for Efficient Radiance Field Rendering
Abdullah J Hamdi (University of Oxford), Luke Melas-Kyriazi (University of Oxford), Jinjie Mai (King Abdullah University of Science and Technology), Guocheng Qian (King Abdullah University of Science and Technology), Ruoshi Liu (Columbia University), Carl Vondrick (Columbia University), Bernard Ghanem (King Abdullah University of Science and Technology), Andrea Vedaldi (University of Oxford)

Abstract:
Advancements in 3D Gaussian Splatting have significantly accelerated 3D reconstruction and generation. However, it may require a large number of Gaussians, which creates a substantial memory footprint. This paper introduces GES (Generalized Exponential Splatting), a novel representation that employs the Generalized Exponential Function (GEF) to model 3D scenes, requiring far fewer particles to represent a scene and thus significantly outperforming Gaussian Splatting methods in efficiency with a plug-and-play replacement ability for Gaussian-based utilities. GES is validated theoretically and empirically in both a principled 1D setup and realistic 3D scenes. It is shown to represent signals with sharp edges more accurately, which are typically challenging for Gaussians due to their inherent low-pass characteristics. Our empirical analysis demonstrates that GEF outperforms Gaussians in fitting naturally occurring signals (e.g., squares, triangles, parabolic signals), thereby reducing the need for extensive splitting operations that increase the memory footprint of Gaussian Splatting. With the aid of a frequency-modulated loss, GES achieves competitive performance in novel-view synthesis benchmarks while requiring less than half the memory storage of Gaussian Splatting and increasing the rendering speed by up to 39%. The code is available on the project website https://abdullahamdi.com/ges.
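For intuition, the sketch below uses the textbook generalized exponential form exp(-(|x - μ| / α)^β), where β = 2 recovers a Gaussian-shaped kernel; the paper's exact parameterization and its 3D splatting machinery are not reproduced here, but the toy 1D fit shows why larger shape parameters capture sharp edges better.

```python
# 1D illustration: larger shape parameters beta give flatter tops and steeper
# falloff, so a generalized exponential fits a sharp-edged "square" signal
# better than a Gaussian (beta = 2).
import numpy as np

def gef(x, mu=0.0, alpha=1.0, beta=2.0):
    """Generalized exponential kernel; beta = 2 is Gaussian-shaped."""
    return np.exp(-np.abs((x - mu) / alpha) ** beta)

x = np.linspace(-3, 3, 601)
square = (np.abs(x) < 1.0).astype(float)   # sharp-edged target signal

for beta in (2.0, 4.0, 8.0):
    err = np.mean((gef(x, alpha=1.0, beta=beta) - square) ** 2)
    print(f"beta={beta:>4}: mean squared error vs square = {err:.4f}")
```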

 

MoDE: CLIP Data Experts via Clustering
Jiawei Ma (Columbia University), Po-Yao Huang (FAIR, Meta), Saining Xie (New York University), Shang-Wen Li (FAIR, Meta), Luke Zettlemoyer (University of Washington), Shih-Fu Chang (Columbia University), Wen-tau Yih (FAIR, Meta), Hu Xu (FAIR, Meta)

Abstract:
The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, being less sensitive to false negative noises in other clusters. At inference time, we ensemble their outputs by applying weights determined through the correlation between task metadata and cluster conditions. To estimate the correlation precisely, the samples in one cluster should be semantically similar, but the number of data experts should still be reasonable for training and inference. As such, we consider the ontology in human language and propose to use fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 by OpenAI CLIP and OpenCLIP on zero-shot image classification but with less (<35%) training cost. Meanwhile, MoDE can train all data experts asynchronously and can flexibly include new data experts. The code is available at https://github.com/facebookresearch/MetaCLIP/tree/main/mode.
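The toy sketch below illustrates the inference-time ensembling idea with random placeholder embeddings and logits; the real system clusters captions at a fine-grained level and trains a CLIP data expert per coarse cluster, neither of which is reproduced here.

```python
# Toy sketch: weight each data expert by how well its cluster center matches
# the task metadata (e.g., class-name embeddings), then ensemble their scores.
import numpy as np

rng = np.random.default_rng(0)
n_experts, n_classes, dim = 4, 10, 64

cluster_centers = rng.normal(size=(n_experts, dim))     # one per data expert (placeholder)
task_metadata = rng.normal(size=(n_classes, dim))       # class-name embeddings (placeholder)

# Weight each expert by its correlation with the task metadata.
sim = task_metadata @ cluster_centers.T                 # (classes, experts)
weights = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)

# Per-expert logits for one test image (placeholders for real CLIP scores).
expert_logits = rng.normal(size=(n_experts, n_classes))

# Ensemble: weight each expert's score per class, then pick the best class.
ensembled = (weights.T * expert_logits).sum(axis=0)     # (classes,)
print("predicted class:", int(ensembled.argmax()))
```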

 

What, When, and Where? Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
Brian Chen (Columbia University), Nina Shvetsova (Goethe University Frankfurt), Andrew Rouditchenko (MIT CSAIL), Daniel Kondermann (Quality Match GmbH), Samuel Thomas (IBM Research AI), Shih-Fu Chang (Columbia University), Rogerio Feris (IBM Research AI), James Glass (MIT CSAIL), Hilde Kuehne (MIT-IBM Watson AI Lab)

Abstract:
Spatio-temporal grounding describes the task of localizing events in space and time, e.g., in video data, based on verbal descriptions only. Models for this task are usually trained with human-annotated sentences and bounding box supervision. This work addresses this task from a multimodal supervision perspective, proposing a framework for spatio-temporal action grounding trained on loose video and subtitle supervision only, without human annotation. To this end, we combine local representation learning, which focuses on leveraging fine-grained spatial information, with a global representation encoding that captures higher-level representations and incorporates both in a joint approach. To evaluate this challenging task in a real-life setting, a new benchmark dataset is proposed, providing dense spatio-temporal grounding annotations in long, untrimmed, multi-action instructional videos for over 5K events. We evaluate the proposed approach and other methods on the proposed and standard downstream tasks, showing that our method improves over current baselines in various settings, including spatial, temporal, and untrimmed multi-action spatio-temporal grounding.
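As a toy illustration of combining the two streams, the sketch below scores frames with a global representation to localize an event in time and regions with a local representation to localize it in space; all features are random placeholders rather than the learned representations described in the paper.

```python
# Toy sketch: ground a described event in time (frames) and space (regions)
# by scoring text against global frame features and local region features.
import numpy as np

rng = np.random.default_rng(0)
T, R, dim = 30, 5, 32                        # frames, regions per frame, embed dim

text = rng.normal(size=(dim,))               # sentence embedding (placeholder)
frame_feats = rng.normal(size=(T, dim))      # global stream (placeholder)
region_feats = rng.normal(size=(T, R, dim))  # local stream (placeholder)

# Temporal grounding: which frames best match the description.
frame_scores = frame_feats @ text
top_frames = np.argsort(frame_scores)[-5:]

# Spatial grounding: within those frames, which region best matches.
region_scores = region_feats[top_frames] @ text   # (5, R)
best_regions = region_scores.argmax(axis=1)
print("frames:", sorted(top_frames.tolist()), "regions:", best_regions.tolist())
```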

 

GDA: Generalized Diffusion for Robust Test-time Adaptation
Yun-Yun Tsai (Columbia University), Fu-Chen Chen (Amazon), Albert Chen (Amazon), Junfeng Yang (Columbia University), Che-Chun Su (Amazon), Min Sun (Amazon), Cheng-Hao Kuo (Amazon)

Abstract:
Machine learning models struggle with generalization when encountering out-of-distribution (OOD) samples with unexpected distribution shifts. For vision tasks, recent studies have shown that test-time adaptation employing diffusion models can achieve state-of-the-art accuracy improvements on OOD samples by generating new samples that align with the model’s domain without the need to modify the model’s weights. Unfortunately, those studies have primarily focused on pixel-level corruptions, thereby lacking the generalization to adapt to a broader range of OOD types. We introduce Generalized Diffusion Adaptation (GDA), a novel diffusion-based test-time adaptation method robust against diverse OOD types. Specifically, GDA iteratively guides the diffusion by applying a marginal entropy loss derived from the model, in conjunction with style and content preservation losses during the reverse sampling process. In other words, GDA considers the model’s output behavior with the semantic information of the samples as a whole, which can reduce ambiguity in downstream tasks during the generation process. Evaluation across various popular model architectures and OOD benchmarks shows that GDA consistently outperforms prior work on diffusion-driven adaptation. Notably, it achieves the highest classification accuracy improvements, ranging from 4.4% to 5.02% on ImageNet-C and 2.5% to 7.4% on Rendition, Sketch, and Stylized benchmarks. This performance highlights GDA’s generalization to a broader range of OOD benchmarks.
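The heavily simplified sketch below conveys the general flavor of guidance during reverse sampling: each step nudges the sample with gradients of a marginal-entropy term from the downstream classifier plus a content-preservation term. The denoiser, classifier, schedules, and loss weights are placeholders, not the models or exact losses used in GDA.

```python
# Heavily simplified sketch of diffusion-guided test-time adaptation with
# placeholder networks; illustrates guiding the reverse process with an
# entropy loss and a content-preservation loss.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
classifier = torch.nn.Linear(3 * 8 * 8, 10)      # placeholder classifier
denoiser = torch.nn.Conv2d(3, 3, 3, padding=1)   # placeholder denoising network

x_orig = torch.randn(1, 3, 8, 8)                 # the (corrupted) test image
x = x_orig + 0.5 * torch.randn_like(x_orig)      # partially noised starting point

for t in range(10):                              # toy reverse-sampling loop
    x = x.detach().requires_grad_(True)
    x_denoised = x - 0.1 * denoiser(x)           # one crude "reverse step"

    logits = classifier(x_denoised.flatten(1))
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum()   # entropy of predictions
    content = F.mse_loss(x_denoised, x_orig)                 # content preservation

    loss = entropy + 10.0 * content
    grad, = torch.autograd.grad(loss, x)
    x = x_denoised.detach() - 0.05 * grad        # guide the next step

print("adapted sample shape:", tuple(x.shape))
```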

 

Generating Illustrated Instructions
Sachit Menon (Columbia University), Ishan Misra (GenAI, Meta), Rohit Girdhar (GenAI, Meta)

Abstract:
We introduce a new task of generating “Illustrated Instructions”, i.e., visual instructions customized to a user’s needs. We identify desiderata unique to this task, and formalize it through a suite of automatic and human evaluation metrics, designed to measure the validity, consistency, and efficacy of the generations. We combine the power of large language models (LLMs) with strong text-to-image generation diffusion models to propose a simple approach called StackedDiffusion, which generates such illustrated instructions given text as input. The resulting model strongly outperforms baseline approaches and state-of-the-art multimodal LLMs; and in 30% of cases, users even prefer it to human-generated articles. Most notably, it enables various new and exciting applications far beyond what static articles on the web can provide, such as personalized instructions complete with intermediate steps and pictures in response to a user’s individual situation.
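At a purely conceptual level, the pipeline can be pictured as pairing step text from a language model with one generated image per step, as in the placeholder sketch below; both model calls are trivial stand-ins, not the StackedDiffusion architecture itself.

```python
# Conceptual sketch only: pair generated step text with a generated image per
# step. Both "models" below are placeholders that return dummy outputs.
from dataclasses import dataclass

@dataclass
class IllustratedStep:
    text: str
    image: bytes  # placeholder for generated pixels

def llm_generate_steps(goal: str) -> list[str]:
    # Placeholder: a real system would prompt an LLM here.
    return [f"Step {i + 1} toward: {goal}" for i in range(3)]

def text_to_image(prompt: str) -> bytes:
    # Placeholder: a real system would call a text-to-image diffusion model here.
    return prompt.encode("utf-8")

def illustrate(goal: str) -> list[IllustratedStep]:
    return [IllustratedStep(s, text_to_image(s)) for s in llm_generate_steps(goal)]

for step in illustrate("repot a houseplant"):
    print(step.text, f"({len(step.image)} bytes of image data)")
```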