Scaling Up and Distilling Down:
Language-Guided Robot Skill Acquisition

1Columbia University, 2Google DeepMind
Conference on Robot Learning 2023
Oral Presentation, LangRob Workshop CoRL 2023
Oral Presentation, Cognitive Science & Robot Learning Workshop CoRL 2023

Scaling Up and Distilling Down is a framework for language-guided skill learning. Give it a task description, and it will automatically generate rich, diverse robot trajectories, complete with success labels and dense language labels.

The best part? It uses no expert demonstrations, no manual reward supervision, and no manual language annotation.

Language-guided Robot Skill Learning

Our framework efficiently scales up data generation of language-labeled robot data and effectively distills this data down into a robust multi-task language-conditioned visuomotor policy.


For scaling up data generation, we use a language model to guide high-level planning and sampling-based robot planners to generate rich and diverse manipulation trajectories (b). To make this data-collection process robust, the language model also infers a code snippet for the success condition of each task. This lets the data-collection process detect failures and retry, and automatically label trajectories with success or failure (c).
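The retry-and-label loop described above can be sketched roughly as follows. This is a minimal illustration, not the framework's actual code: `Trajectory`, `drawer_is_open`, `make_flaky_pull`, `collect_with_retry`, the `drawer_joint` state key, and the 0.15 threshold are all hypothetical names and values chosen for the example.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, float]  # toy stand-in for privileged simulation state


@dataclass
class Trajectory:
    task: str
    states: List[State] = field(default_factory=list)
    success: bool = False  # filled in automatically by the success condition


def drawer_is_open(state: State) -> bool:
    """Hypothetical example of an LLM-inferred success condition snippet."""
    return state["drawer_joint"] > 0.15


def make_flaky_pull(rng: random.Random) -> Callable[[State], State]:
    """Toy attempt function: the pull only moves the drawer half of the time."""
    def pull(state: State) -> State:
        delta = 0.2 if rng.random() < 0.5 else 0.0
        return {"drawer_joint": state["drawer_joint"] + delta}
    return pull


def collect_with_retry(
    task: str,
    attempt: Callable[[State], State],
    is_success: Callable[[State], bool],
    state: State,
    max_attempts: int = 3,
) -> Trajectory:
    """Attempt the task, check the success condition, and retry on failure."""
    traj = Trajectory(task=task)
    for _ in range(max_attempts):
        state = attempt(state)
        traj.states.append(state)
        if is_success(state):
            traj.success = True  # automatic success label
            break
    return traj


if __name__ == "__main__":
    rng = random.Random(0)
    traj = collect_with_retry(
        "open the drawer", make_flaky_pull(rng), drawer_is_open, {"drawer_joint": 0.0}
    )
    print(traj.task, traj.success, len(traj.states))
```

Because failed attempts stay in the trajectory, the recorded data contains both the failure and the recovery, and the final `success` flag comes for free from the same check that triggers the retry.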

For distilling down into a policy for real-world deployment (d), we extend the diffusion policy single-task behavior-cloning approach to multi-task settings with language conditioning.

Robustness In, Robustness Out

We use a language model to predict each task's success condition code snippet, which allows the robot to retry failed tasks.

The result is demonstrations of robust behavior, which teach the policy to recover after failed attempts, resulting in more successful trajectories when given more time.


Language-guided, not language-constrained

On their own, language-model planners are language-constrained when it comes to rich, 6 DoF manipulation. Many things robotic systems need to understand, like geometry and articulation structure, are challenging to describe in natural language. That is where sampling-based planners come in.

Approach 6 DoF Manipulation Common-sense No Sim State
Sampling-based Planners
LLM Planners
Our Data Generation
Our Policy

A New Multi-Task Benchmark

We introduce a new multi-task benchmark to test ⌛ long-horizon behavior, 🧠 common-sense reasoning, tool-use, and intuitive physics. Running our language-guided skill learning framework in the benchmark gives an infinite amount of language-labeled robot experience.


Run it in Real

with zero fine-tuning

Using domain randomization, our diffusion policy can be deployed on a real robot with no fine-tuning.

So You Think Your Policy Can Scale? 📈

Our framework is a step towards putting robotics on the same scaling trend as large language models while not compromising on rich low-level manipulation and control. As LLMs keep getting better, how can our robot policies keep up?
See how well your policy scales with infinite language-labeled, diverse robot trajectories.


    title={Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition}, 
    author={Huy Ha and Pete Florence and Shuran Song},
    booktitle={Proceedings of the 2023 Conference on Robot Learning},

If you have any questions, please contact Huy Ha.

Questions & Answers

What can't this framework do?

The framework uses privileged simulation state information for the data generation process. This is why the language model can infer good reward functions using simulation contact and joint information. While we have successfully demonstrated its application in a transport task in the real world using a domain randomized policy, there remains room for improvement in terms of perfecting Sim2Real transferability. This represents an exciting challenge to tackle, and is currently our main focus for enhancement.

How can the distilled policy's performance be better than its data collection policy?

At data collection time, the language model also predicts a success condition used to label its experience with success or failure. During distillation, the replay buffer is filtered using this automatically generated success label, so the policy learns from only successful experiences.
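As a rough illustration of success filtering, the sketch below keeps only trajectories whose automatic label is positive. The dictionary layout, the `success` key, and the toy buffer contents are assumptions made for this example, not the framework's actual data format.

```python
from typing import Dict, List


def filter_successful(replay_buffer: List[Dict]) -> List[Dict]:
    """Keep only trajectories that the success condition labeled as successful."""
    return [traj for traj in replay_buffer if traj["success"]]


def success_rate(trajectories: List[Dict]) -> float:
    """Fraction of trajectories labeled successful (0.0 for an empty list)."""
    if not trajectories:
        return 0.0
    return sum(1 for traj in trajectories if traj["success"]) / len(trajectories)


# Toy replay buffer written by a hypothetical data-collection policy.
replay_buffer = [
    {"task": "open the drawer", "success": True},
    {"task": "open the drawer", "success": False},
    {"task": "put the block in the bin", "success": True},
    {"task": "put the block in the bin", "success": False},
]

training_set = filter_successful(replay_buffer)
print(success_rate(replay_buffer), success_rate(training_set))
```

The data-collection policy fails on some of its attempts, but the filtered training set contains only successes, which is why a policy cloned from it can outperform the collector.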

How can language models do that?

Tasks in our benchmark are contact-rich and require fine-grained, 6 DoF behavior to solve. Instead of getting language models to output actions for such tasks directly, we use them for high-level planning over API calls to sampling-based planners, such as rapidly-exploring random trees and grasp samplers.

The result is a data generation approach that combines the best of both worlds: low-level geometric reasoning and diverse trajectories from sampling-based planners, plus the flexibility of a language model.
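To make "sampling-based planner" concrete, here is a minimal 2D rapidly-exploring random tree (RRT) with a single circular obstacle. It is a textbook-style sketch under simplifying assumptions (point robot, collision checks at nodes only, hypothetical function and parameter names), not the 6 DoF planner used in the framework.

```python
import math
import random
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def in_collision(p: Point, center: Point, radius: float) -> bool:
    """True if the point lies inside the circular obstacle."""
    return math.dist(p, center) <= radius


def steer(src: Point, dst: Point, step: float) -> Point:
    """Move from src toward dst by at most `step`."""
    d = math.dist(src, dst)
    if d <= step:
        return dst
    t = step / d
    return (src[0] + t * (dst[0] - src[0]), src[1] + t * (dst[1] - src[1]))


def rrt(
    start: Point,
    goal: Point,
    obstacle_center: Point,
    obstacle_radius: float,
    bounds: Tuple[float, float] = (0.0, 10.0),
    step: float = 0.5,
    goal_tolerance: float = 0.5,
    goal_bias: float = 0.1,
    max_iters: int = 3000,
    seed: int = 0,
) -> Optional[List[Point]]:
    """Grow a tree from start; return a path to the goal region, or None."""
    rng = random.Random(seed)
    nodes: List[Point] = [start]
    parents: List[int] = [-1]
    for _ in range(max_iters):
        # Sample the goal occasionally, otherwise a uniform random point.
        if rng.random() < goal_bias:
            target = goal
        else:
            target = (rng.uniform(*bounds), rng.uniform(*bounds))
        nearest = min(range(len(nodes)), key=lambda i: math.dist(nodes[i], target))
        new = steer(nodes[nearest], target, step)
        # For brevity only the new node is checked, not the whole edge.
        if in_collision(new, obstacle_center, obstacle_radius):
            continue
        nodes.append(new)
        parents.append(nearest)
        if math.dist(new, goal) <= goal_tolerance:
            # Walk parent pointers back to the root to recover the path.
            path: List[Point] = []
            i = len(nodes) - 1
            while i != -1:
                path.append(nodes[i])
                i = parents[i]
            return path[::-1]
    return None


if __name__ == "__main__":
    path = rrt((1.0, 1.0), (9.0, 9.0), (5.0, 5.0), 1.5)
    print("waypoints:", None if path is None else len(path))
```

Changing `seed` yields a different valid path around the obstacle, which hints at why sampling-based planners are a natural source of diverse trajectories.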

What are these colorful lines?

Our policy builds on Diffusion Policy, a behavior cloning approach for learning from diverse, multi-modal demonstrations. Each action inference is sampled from a pseudo-random diffusion process over action sequences. The action sequence samples are visualized here as lines, where blue is the start of the action sequence while red is the end.

You can generate them yourself too! Check out our codebase for visualization instructions.
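For intuition, the blue-to-red coloring along an action sequence can be approximated with a linear RGB interpolation like the one below. This is only an illustrative sketch; `action_sequence_colors` is a hypothetical helper, and the codebase may use a different colormap.

```python
from typing import List, Tuple

RGB = Tuple[float, float, float]


def action_sequence_colors(horizon: int) -> List[RGB]:
    """One color per action step: blue at the start, red at the end."""
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    if horizon == 1:
        return [(0.0, 0.0, 1.0)]
    colors: List[RGB] = []
    for k in range(horizon):
        t = k / (horizon - 1)  # 0.0 at the first step, 1.0 at the last
        colors.append((t, 0.0, 1.0 - t))
    return colors


if __name__ == "__main__":
    for step, color in enumerate(action_sequence_colors(5)):
        print(step, color)
```

Each sampled action sequence would be drawn as one line whose segments take these colors in order, so many samples from the diffusion process show up as a fan of blue-to-red lines.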

Isn't this just SayCan or Code as Policies?

These prior works use language models as zero-shot planners and policies, which limits their inference-time performance to the language model's planning robustness. This also means they do not improve with more experience.

In contrast, our approach uses language models as zero-shot data collection policies, supplied with an API to sampling-based robot planners. The generated data is then distilled into a robust, multi-task visuomotor policy, which performs better than its data collection policy.