In a world fixated on ever-higher resolutions and increasingly detailed images, a revolutionary new camera takes a daring step in the opposite direction. The Minimalist Camera is an innovation designed to prioritize efficiency and privacy over unnecessary detail. By capturing only the minimal data needed for a specific task, this groundbreaking technology challenges conventional thinking about imaging—and redefines what cameras don’t need to do.
Developed by Jeremy Klotz, a third-year PhD student in the CAVE Lab, in collaboration with Professor Shree Nayar, the minimalist camera forgoes traditional images. Instead, it relies on a handful of custom-shaped “freeform” pixels, carefully tailored to the task at hand. The result? A device that preserves privacy by avoiding the capture of identifiable details while consuming so little power that it’s entirely self-sustaining. Whether monitoring traffic flow or analyzing crowd movements, this camera captures only the essential data—empowering practical applications without compromising individual privacy.
The innovation has not gone unnoticed, earning a Best Paper Award at the European Conference on Computer Vision (ECCV 2024). More than just an accolade, the minimalist camera signals a paradigm shift in how cameras can function in our increasingly interconnected world.
We caught up with Klotz to explore the story behind the minimalist camera, the ups and downs of PhD life, and what it means to push the boundaries of imaging technology in the name of privacy and efficiency.
Q: How did you develop the idea for the minimalist camera? When I started my PhD, my advisor and I began brainstorming research directions. After discussing different ideas, we landed on the high-level concept of creating a camera that captures the least information necessary to perform a vision task. In contrast to a traditional camera that uses millions of tiny square pixels, our idea was to let each pixel take on an arbitrary shape (which we call a freeform pixel). Once we evaluated this idea in simulation, we found that freeform pixels can solve vision tasks with significantly fewer pixels than traditional cameras.
I worked with my advisor on every aspect of the project, from refining the high-level idea to building a prototype camera. This involved careful thinking about how to design freeform pixels, simulating them in software, and then building a camera that uses a very small number of freeform pixels to solve real-world vision tasks. This project took about one and a half years from start to finish.
Q: Can you describe your research focus and what motivates your work? My research is in computational imaging, where we design new cameras using novel hardware and software. This area is particularly exciting since it merges research in computer vision (typically all software) with imaging hardware. In particular, I love building prototypes to demonstrate our research ideas. Working with hardware is definitely challenging, but seeing a prototype work at the end of the day makes it even more rewarding.
I’m interested in asking questions like, “What are the fewest measurements needed to solve a vision task?” and “How can we build a camera that captures the fewest measurements?” These questions are particularly relevant right now. Most cameras produce exceptionally high-quality images, but this comes at a cost: high-resolution images often reveal too much information about the world, and the cameras consume so much power that they can only be deployed on buildings (with a tether for power) or with a battery that needs to be recharged.
Q: Why did you decide to pursue a PhD? Before coming to Columbia, I studied electrical and computer engineering at Carnegie Mellon. While I was an undergrad, I was introduced to research in computational imaging. I didn’t plan to pursue a PhD at the time, but after this foray into research, I found that I really enjoyed the open-ended problems and decided to pursue a PhD.
My undergraduate research was the most important experience that prepared me for my PhD. Although it’s hard to completely understand what a PhD entails until you start, my undergrad research introduced me to how it feels to do research full-time and what it’s like to work with a professor rather than for a professor.
Now as a PhD student, my work’s direction is completely up to me. If I believe that an idea is worth pursuing, then I can commit all of my time to working on it. This freedom is incredible, and it allows me to choose the most interesting problems to work on.
Q: What standout moments or experiences have shaped your journey at Columbia so far? I’ve really enjoyed going to conferences—presenting research and meeting others in the field is a blast. I’ve also enjoyed attending department seminars on research outside my area. It’s helped me to ask thoughtful questions about work in other fields.
With my research, we’ve had quite a few ideas that simply don’t work out. My strategy is to try to determine if a new idea is viable as early as possible, and quickly pivot if it isn’t.
Q: What is your advice to students on how to navigate their time at Columbia? If you want to do research, keep an open mind to explore areas you may not be familiar with. A lot of research can appear intimidating at first, but the students and faculty working in the area are extremely passionate and excited to chat if you ask.
CS researchers won a Best Paper Award at the European Conference on Computer Vision (ECCV) 2024, one of the premier international conferences in the fields of computer vision and machine learning. As a biennial event, ECCV attracts leading researchers, scholars, and practitioners from around the world, presenting cutting-edge advancements and breakthroughs. This year’s accepted papers from the department showcase groundbreaking innovations and high-impact research that push the boundaries of computer vision and artificial intelligence.
Minimalist Vision with Freeform Pixels Jeremy Klotz Columbia University and Shree K. Nayar Columbia University
Abstract A minimalist vision system uses the smallest number of pixels needed to solve a vision task. While traditional cameras use a large grid of square pixels, a minimalist camera uses freeform pixels that can take on arbitrary shapes to increase their information content. We show that the hardware of a minimalist camera can be modeled as the first layer of a neural network, where the subsequent layers are used for inference. Training the network for any given task yields the shapes of the camera’s freeform pixels, each of which is implemented using a photodetector and an optical mask. We have designed minimalist cameras for monitoring indoor spaces (with 8 pixels), measuring room lighting (with 8 pixels), and estimating traffic flow (with 8 pixels). The performance demonstrated by these systems is on par with a traditional camera with orders of magnitude more pixels. Minimalist vision has two major advantages. First, it naturally tends to preserve the privacy of individuals in the scene since the captured information is inadequate for extracting visual details. Second, since the number of measurements made by a minimalist camera is very small, we show that it can be fully self-powered, i.e., function without an external power supply or a battery.
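To make the abstract's framing concrete, the sketch below models the camera hardware as a trainable first layer, as the paper describes: each freeform pixel is an optical mask read out by a single photodetector, and a small network on top performs the task. This is a minimal illustration rather than the authors' implementation; the scene resolution, the sigmoid mask parameterization, and the dummy training target are assumptions chosen for the example, while the count of eight freeform pixels echoes the systems mentioned in the abstract.

```python
import torch
import torch.nn as nn

class MinimalistCameraSketch(nn.Module):
    """Illustrative sketch: optical masks as a trainable first layer.

    Each of the K "freeform pixels" is modeled as a transmissive mask over the
    scene, so a photodetector reading is a mask-weighted sum of scene
    intensities. The mask shapes and the inference head are trained jointly.
    (Layer sizes and the sigmoid parameterization are assumptions.)
    """

    def __init__(self, scene_pixels: int = 128 * 128, k_freeform: int = 8,
                 num_outputs: int = 1):
        super().__init__()
        # Unconstrained weights; a sigmoid keeps mask transmittance in [0, 1].
        self.mask_logits = nn.Parameter(torch.randn(k_freeform, scene_pixels) * 0.01)
        # Subsequent layers perform inference from the K measurements.
        self.head = nn.Sequential(
            nn.Linear(k_freeform, 32), nn.ReLU(),
            nn.Linear(32, num_outputs),
        )

    def forward(self, scene: torch.Tensor) -> torch.Tensor:
        # scene: (batch, scene_pixels) flattened intensity image.
        masks = torch.sigmoid(self.mask_logits)      # (K, scene_pixels)
        measurements = scene @ masks.t()             # (batch, K) photodetector readings
        return self.head(measurements)


# Training the whole network end to end on a task (here a dummy regression
# target) yields mask shapes that could then be fabricated as optical masks.
model = MinimalistCameraSketch()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scene_batch = torch.rand(16, 128 * 128)
target = torch.rand(16, 1)
loss = nn.functional.mse_loss(model(scene_batch), target)
loss.backward()
optimizer.step()
```

After training, only the learned masks and their photodetectors need to exist in hardware; the later layers operate on the handful of measurements they produce, which is what keeps both the power draw and the captured information so low.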
How Video Meetings Change Your Expression Sumit Sarin Columbia University, Utkarsh Mall Columbia University, Purva Tendulkar Columbia University, Carl Vondrick Columbia University
Abstract Do our facial expressions change when we speak over video calls? Given two unpaired sets of videos of people, we seek to automatically find spatio-temporal patterns that are distinctive of each set. Existing methods use discriminative approaches and perform post-hoc explainability analysis. Such methods are insufficient as they are unable to provide insights beyond obvious dataset biases, and the explanations are useful only if humans themselves are good at the task. Instead, we tackle the problem through the lens of generative domain translation: our method generates a detailed report of learned, input-dependent spatio-temporal features and the extent to which they vary between the domains. We demonstrate that our method can discover behavioral differences between conversing face-to-face (F2F) and on video calls (VCs). We also show the applicability of our method to discovering differences in presidential communication styles. Additionally, we are able to predict temporal change-points in videos that decouple expressions in an unsupervised way, which increases the interpretability and usefulness of our model. Finally, our method, being generative, can be used to transform a video call to appear as if it were recorded in an F2F setting. Experiments and visualizations show our approach is able to discover a range of behaviors, taking a step toward a deeper understanding of human behavior.
Controlling the World by Sleight of Hand Sruthi Sudhakar Columbia University, Ruoshi Liu Columbia University, Basile Van Hoorick Columbia University, Carl Vondrick Columbia University, and Richard Zemel Columbia University
Abstract Humans naturally build mental models of object interactions and dynamics, allowing them to imagine how their surroundings will change if they take a certain action. While generative models today have shown impressive results on generating/editing images unconditionally or conditioned on text, current methods do not provide the ability to perform object manipulation conditioned on actions, an important tool for world modeling and action planning. Therefore, we propose to learn an action-conditional generative model from unlabeled videos of human hands interacting with objects. The vast quantity of such data on the internet allows for efficient scaling, which can enable high-performing action-conditional models. Given an image and the shape/location of a desired hand interaction, CosHand synthesizes an image of the future after the interaction has occurred. Experiments show that the resulting model can predict the effects of hand-object interactions well, with strong generalization particularly to translation, stretching, and squeezing interactions of unseen objects in unseen environments. Further, CosHand can be sampled many times to predict multiple possible effects, modeling the uncertainty of forces in the interaction/environment. Finally, the method generalizes to different embodiments, including non-human hands such as robot hands, suggesting that generative video models can be powerful models for robotics.
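As a rough illustration of the interface the abstract describes, the toy module below takes an image plus a mask encoding the shape and location of the desired hand interaction and outputs an image of the scene after the interaction. It is a hypothetical stand-in, not CosHand itself: the real system is a generative model trained on large-scale video of hands, whereas this sketch only fixes the inputs and outputs, and the mask encoding and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class FutureAfterInteraction(nn.Module):
    """Toy stand-in for an action-conditional image generator.

    Interface only: the current image and a mask giving the shape/location of
    the desired hand interaction go in, and an image of the scene after the
    interaction comes out. The real model is generative and far larger; this
    small network does not attempt to reproduce it.
    """

    def __init__(self, hidden: int = 32):
        super().__init__()
        # 3 RGB channels + 1 mask channel in, 3-channel future image out.
        self.net = nn.Sequential(
            nn.Conv2d(3 + 1, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 3, kernel_size=3, padding=1),
        )

    def forward(self, image: torch.Tensor, hand_mask: torch.Tensor) -> torch.Tensor:
        # image: (batch, 3, H, W); hand_mask: (batch, 1, H, W)
        return self.net(torch.cat([image, hand_mask], dim=1))


model = FutureAfterInteraction()
image = torch.rand(1, 3, 64, 64)
hand_mask = torch.zeros(1, 1, 64, 64)
hand_mask[:, :, 20:40, 20:40] = 1.0      # desired hand shape/location
# A stochastic generator could be sampled repeatedly here to model the
# uncertainty of forces in the interaction, as the abstract notes.
future = model(image, hand_mask)
```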
Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis Basile Van Hoorick Columbia University, Rundi Wu Columbia University, Ege Ozguroglu Columbia University, Kyle Sargent Stanford University, Ruoshi Liu Columbia University, Pavel Tokmakov Toyota Research Institute, Achal Dave Toyota Research Institute, Changxi Zheng Columbia University, Carl Vondrick Columbia University
Abstract Accurate reconstruction of complex dynamic scenes from just a single viewpoint continues to be a challenging task in computer vision. Current dynamic novel view synthesis methods typically require videos from many different camera viewpoints, necessitating careful recording setups and significantly restricting their utility in the wild as well as in embodied AI applications. In this paper, we propose GCD, a controllable monocular dynamic view synthesis pipeline that leverages large-scale diffusion priors to generate, given a video of any scene, a synchronous video from any other chosen perspective, conditioned on a set of relative camera pose parameters. Our model does not require depth as input, and does not explicitly model 3D scene geometry, instead performing end-to-end video-to-video translation in order to achieve its goal efficiently. Despite being trained on synthetic multi-view video data only, zero-shot real-world generalization experiments show promising results in multiple domains, including robotics, object permanence, and driving environments. We believe our framework can potentially unlock powerful applications in rich dynamic scene understanding, perception for robotics, and interactive 3D video viewing experiences for virtual reality.
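In the same spirit, the sketch below illustrates only the conditioning interface described above: source video frames plus a relative camera pose go in, and frames from the new viewpoint come out. It is a toy stand-in rather than the GCD architecture, which builds on large-scale diffusion priors; the six-dimensional pose encoding and the layer sizes are assumptions made for the example.

```python
import torch
import torch.nn as nn

class PoseConditionedVideoTranslator(nn.Module):
    """Toy stand-in for a monocular dynamic view synthesis model.

    Interface only: a source video and a relative camera pose go in, and a
    synchronous video from the new viewpoint comes out. No depth input and no
    explicit 3D geometry, matching the abstract's description of the setup.
    """

    def __init__(self, pose_dim: int = 6, hidden: int = 32):
        super().__init__()
        self.pose_embed = nn.Linear(pose_dim, hidden)   # assumed rotation + translation encoding
        self.encoder = nn.Conv3d(3, hidden, kernel_size=3, padding=1)
        self.decoder = nn.Conv3d(hidden, 3, kernel_size=3, padding=1)

    def forward(self, video: torch.Tensor, rel_pose: torch.Tensor) -> torch.Tensor:
        # video: (batch, 3, frames, H, W); rel_pose: (batch, pose_dim)
        feat = torch.relu(self.encoder(video))
        pose = self.pose_embed(rel_pose)[:, :, None, None, None]
        return self.decoder(feat + pose)                # video from the new viewpoint


model = PoseConditionedVideoTranslator()
src = torch.rand(1, 3, 8, 64, 64)                       # 8-frame input clip
pose = torch.tensor([[0.0, 0.1, 0.0, 0.2, 0.0, 0.0]])   # assumed 6-DoF relative pose
out = model(src, pose)                                  # same shape as src
```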
Abstract Multimodal pre-trained models, such as CLIP, are popular for zero-shot classification due to their open-vocabulary flexibility and high performance. However, vision-language models, which compute similarity scores between images and class labels, are largely black-box, with limited interpretability, risk for bias, and inability to discover new visual concepts not written down. Moreover, in practical settings, the vocabulary for class names and attributes of specialized concepts will not be known, preventing these methods from performing well on images uncommon in large-scale vision-language datasets. To address these limitations, we present a novel method that discovers interpretable yet discriminative sets of attributes for visual recognition. We introduce an evolutionary search algorithm that uses a large language model and its in-context learning abilities to iteratively mutate a concept bottleneck of attributes for classification. Our method produces state-of-the-art, interpretable fine-grained classifiers. We outperform the latest baselines by 18.4% on five fine-grained iNaturalist datasets and by 22.2% on two KikiBouba datasets, despite the baselines having access to privileged information about class names.
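The evolutionary search the abstract outlines can be summarized in a short sketch. The helper callables are assumptions introduced for illustration, not the authors' code: `llm_propose_mutations` stands for a call to a large language model that uses in-context examples to mutate an attribute set, and `fitness` stands for a score of how well a candidate concept bottleneck separates the classes (for example, via image-text similarity from a CLIP-like model).

```python
import random

def evolve_attribute_bottleneck(initial_attributes, train_images, train_labels,
                                llm_propose_mutations, fitness,
                                generations: int = 20, population_size: int = 8):
    """Sketch of an LLM-driven evolutionary search over attribute sets.

    `llm_propose_mutations(attrs, k)` is assumed to ask a language model for k
    mutated candidates (adding, dropping, or rewording attributes), and
    `fitness(attrs, images, labels)` is assumed to score how discriminative the
    attributes are when used as a concept bottleneck for classification.
    """
    population = [initial_attributes]
    for _ in range(generations):
        # Mutate a few current candidates via the language model.
        parents = random.sample(population, k=min(3, len(population)))
        children = []
        for parent in parents:
            children.extend(llm_propose_mutations(parent, k=population_size))
        # Keep the most discriminative attribute sets for the next generation.
        population = sorted(population + children,
                            key=lambda attrs: fitness(attrs, train_images, train_labels),
                            reverse=True)[:population_size]
    return population[0]
```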
RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos Ali Zare Columbia University, Yulei Niu Columbia University, Hammad Ayyubi Columbia University, and Shih-Fu Chang Columbia University
Abstract Procedure Planning in instructional videos entails generating a sequence of action steps based on visual observations of the initial and target states. Despite the rapid progress in this task, there remain several critical challenges to be solved: (1) Adaptive procedures: Prior works hold an unrealistic assumption that the number of action steps is known and fixed, leading to non-generalizable models in real-world scenarios where the sequence length varies. (2) Temporal relation: Understanding the temporal relations between steps is essential for producing reasonable and executable plans. (3) Annotation cost: Annotating instructional videos with step-level labels (i.e., timestamps) or sequence-level labels (i.e., action categories) is demanding and labor-intensive, limiting generalizability to large-scale datasets. In this work, we propose a new and practical setting, called adaptive procedure planning in instructional videos, where the procedure length is not fixed or pre-determined. To address these challenges, we introduce the Retrieval-Augmented Planner (RAP) model. Specifically, for adaptive procedures, RAP adaptively determines when to conclude the action sequence using an auto-regressive model architecture. For temporal relation, RAP establishes an external memory module to explicitly retrieve the most relevant state-action pairs from the training videos and revises the generated procedures. To tackle the high annotation cost, RAP uses weakly-supervised learning to expand the training dataset to other task-relevant, unannotated videos by generating pseudo-labels for action steps. Experiments on CrossTask and COIN benchmarks show the superiority of RAP over traditional fixed-length models, establishing it as a strong baseline solution for adaptive procedure planning.
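The two mechanisms highlighted in the abstract, adaptive plan length and retrieval from an external memory, can be sketched as a simple decoding loop. The helper callables (`predict_next_action`, `retrieve_similar`, `revise_with_memory`) and the end-of-plan token are hypothetical placeholders for illustration, not the RAP code.

```python
def plan_procedure(start_obs, goal_obs, predict_next_action,
                   retrieve_similar, revise_with_memory,
                   end_token: str = "<END>", max_steps: int = 20):
    """Sketch of adaptive, retrieval-augmented procedure planning.

    Actions are decoded auto-regressively and the model itself decides when the
    plan is complete (adaptive length); an external memory of state-action
    pairs from training videos is then used to revise the draft plan.
    All helper callables and the end token are assumptions for illustration.
    """
    plan = []
    for _ in range(max_steps):
        # Auto-regressive step: condition on start/goal observations and the plan so far.
        action = predict_next_action(start_obs, goal_obs, plan)
        if action == end_token:          # the planner signals that the plan is complete
            break
        plan.append(action)
    # Retrieve the most relevant state-action pairs and revise the draft plan.
    memory_hits = retrieve_similar(start_obs, goal_obs, plan)
    return revise_with_memory(plan, memory_hits)
```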
PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation Tianyuan Zhang Massachusetts Institute of Technology, Hong-Xing Yu Stanford University, Rundi Wu Columbia University, Brandon Y. Feng Massachusetts Institute of Technology, Changxi Zheng Columbia University, Noah Snavely Cornell University, Jiajun Wu Stanford University, William T. Freeman Massachusetts Institute of Technology
Abstract Realistic object interactions are crucial for creating immersive virtual experiences, yet synthesizing realistic 3D object dynamics in response to novel interactions remains a significant challenge. Unlike unconditional or text-conditioned dynamics generation, action-conditioned dynamics requires perceiving the physical material properties of objects and grounding the 3D motion prediction on these properties, such as object stiffness. However, estimating physical material properties is an open problem due to the lack of material ground-truth data, as measuring these properties for real objects is highly difficult. We present PhysDreamer, a physics-based approach that endows static 3D objects with interactive dynamics by leveraging the object dynamics priors learned by video generation models. By distilling these priors, PhysDreamer enables the synthesis of realistic object responses to novel interactions, such as external forces or agent manipulations. We demonstrate our approach on diverse examples of elastic objects and evaluate the realism of the synthesized interactions through a user study. PhysDreamer takes a step towards more engaging and realistic virtual experiences by enabling static 3D objects to dynamically respond to interactive stimuli in a physically plausible manner. See our project page at https://physdreamer.github.io/.