CLIP-RT: Learning Language-Conditioned Robotic Policies from Natural Language Supervision

Seoul National University

*Equal Contribution, Corresponding Authors

Close the laptop
Draw a line from the star to the circle
Erase the whiteboard
Hide the Pooh with the green cup
Open the cabinet
Open the trash can
Play with the orange car
Pour the dog food in the bowl
Stamp near the circle

TL;DR

We explore how non-experts can teach robots skills using only natural language supervision.

We propose a VLA model that learns visuomotor policies directly from this supervision.

Abstract

Teaching robots desired skills in real-world environments remains challenging, especially for non-experts. A key bottleneck is that collecting robotic data often requires expertise or specialized hardware, limiting accessibility and scalability. We posit that natural language offers an intuitive and accessible interface for robot learning. To this end, we study two aspects: (1) enabling non-experts to collect robotic data through natural language supervision (e.g., “move the arm to the right”) and (2) learning robotic policies directly from this supervision. Specifically, we introduce a data collection framework that collects robot demonstrations based on natural language supervision and further augments these demonstrations. We then present CLIP-RT, a vision-language-action (VLA) model that learns language-conditioned visuomotor policies from this supervision. CLIP-RT adapts the pretrained CLIP models and learns to predict language-based motion primitives via contrastive imitation learning. We train CLIP-RT on the Open X-Embodiment dataset and finetune it on in-domain data collected by our framework to learn diverse skills. CLIP-RT demonstrates strong capabilities in learning novel manipulation skills, outperforming the state-of-the-art model, OpenVLA (7B parameters), by 24% in average success rates, while using 7x fewer parameters (1B). We further observe that CLIP-RT shows significant improvements in few-shot generalization. Finally, through collaboration with humans or large pretrained models, CLIP-RT can further improve its generalization on challenging robotic tasks.

Motivation

Recent work has made significant progress toward generalist robotic policies using large-scale demonstration datasets. However, collecting these demonstrations often requires expertise and access to specialized hardware (e.g., teleoperation systems), limiting the accessibility and scalability of robot learning. How can non-experts train robotic policies without relying on specialized expertise or devices for data collection?

We already have an intuitive and accessible interface for robot learning: natural language. In this work, we explore an approach for teaching robots skills solely through natural language. It includes:

Robotic data collection through natural language supervision

A VLA model that learns visuomotor skills directly from natural language supervision

Data Collection

Language-Based Teleoperation


We propose a language-based teleoperation method that leverages the in-context learning capabilities of large language models (LLMs). We collect 10 episodes for each skill through language-based teleoperation, which proceeds as follows (a minimal code sketch appears after the steps):

1. Provide an initial language instruction (e.g., “Pour the dog food into the bowl”).

2. Provide natural language supervision at each state to complete the instruction.

3. The LLM translates the language supervision into a low-level end-effector action based on the text prompt.

4. Repeat steps 2 and 3 until the episode ends.
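
To make the loop concrete, below is a minimal sketch of language-based teleoperation in Python. The robot interface (robot.observe, robot.execute, robot.episode_done), the query_llm call, the prompt template, and the JSON action schema are all hypothetical placeholders, not the exact prompt or API used in our framework.

# Minimal sketch of the language-based teleoperation loop (steps 1-4 above).
import json

# Hypothetical prompt; the real prompt contains in-context examples for the LLM.
PROMPT_TEMPLATE = """You translate natural language supervision for a robot arm
into a low-level end-effector action. Reply with JSON of the form
{{"dx": 0.0, "dy": 0.0, "dz": 0.0, "droll": 0.0, "dpitch": 0.0, "dyaw": 0.0, "gripper": "keep"}}.

Instruction: {instruction}
Supervision: {supervision}
Action:"""

def collect_episode(instruction, robot, query_llm):
    """Collect one demonstration episode from natural language supervision."""
    episode = []
    while not robot.episode_done():                    # step 4: repeat until the episode ends
        image = robot.observe()                        # current scene
        supervision = input("Language supervision: ")  # step 2: e.g., "move the arm to the right"
        prompt = PROMPT_TEMPLATE.format(instruction=instruction, supervision=supervision)
        action = json.loads(query_llm(prompt))         # step 3: LLM -> end-effector action
        robot.execute(action)                          # apply the low-level command
        episode.append({"image": image, "instruction": instruction,
                        "supervision": supervision, "action": action})
    return episode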

Stochastic Trajectory Diversification (STD)


We augment the human-collected demonstration data with a method called stochastic trajectory diversification (STD), which consists of two phases (a rough sketch follows):

(1) The Diversification phase diversifies the expert trajectory into multiple alternative trajectories.

(2) The Recovery phase intentionally deviates from the original trajectory and then executes a recovery action to return to the original path. The recovery action is used in training.
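
Below is a rough sketch of STD, under the simplifying assumption that a demonstration is a sequence of 3D end-effector waypoints. The function names, noise scales, and number of variants are illustrative choices, not the settings used in the paper.

import numpy as np

def diversify(trajectory, num_variants=4, noise_scale=0.01):
    """Diversification phase: perturb the expert waypoints into alternative trajectories."""
    trajectory = np.asarray(trajectory)                 # shape: (T, 3)
    return [trajectory + np.random.normal(0.0, noise_scale, trajectory.shape)
            for _ in range(num_variants)]

def recovery_pairs(trajectory, deviation_scale=0.03):
    """Recovery phase: deviate from each waypoint, then record the action that
    returns to the original path; this recovery action is the training target."""
    trajectory = np.asarray(trajectory)
    pairs = []
    for t in range(len(trajectory) - 1):
        deviated = trajectory[t] + np.random.normal(0.0, deviation_scale, 3)
        recovery_action = trajectory[t + 1] - deviated  # displacement back toward the path
        pairs.append((deviated, recovery_action))
    return pairs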

CLIP-RT Model

Overview of CLIP-RT.

We propose CLIP-RT, a new VLA model that extends the idea of CLIP to robot learning in order to learn language-conditioned policies from natural language. Unlike other VLA models, CLIP-RT learns to predict robotic actions represented in natural language (e.g., “Move arm left”) through contrastive imitation learning. Furthermore, CLIP-RT is a discriminative VLA model that predicts a language-based motion primitive from a predefined list of motion primitives.

Contrastive Imitation Learning

The goal of contrastive imitation learning is to optimize the pairwise similarity between language supervision and contextual information (i.e., the current scene and the language instruction). We first train CLIP-RT on the Open X-Embodiment (OXE) dataset and then finetune it on our collected in-domain data. Since the OXE dataset does not contain natural language supervision, we transform its low-level end-effector actions into natural language supervision to train CLIP-RT.
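
For illustration, here is a minimal sketch of one contrastive imitation learning step in PyTorch, assuming a CLIP-style model that exposes encode_image and encode_text (as in the openai/CLIP and open_clip interfaces). The additive fusion of scene and instruction and the symmetric cross-entropy loss are simplifications, not the exact training objective from the paper.

import torch
import torch.nn.functional as F

def contrastive_imitation_step(clip_model, images, instruction_tokens, supervision_tokens, temperature=0.07):
    """images: (B, 3, H, W); instruction_tokens / supervision_tokens: tokenized text of shape (B, L)."""
    # Context embedding: current scene + language instruction.
    img_emb = F.normalize(clip_model.encode_image(images), dim=-1)
    inst_emb = F.normalize(clip_model.encode_text(instruction_tokens), dim=-1)
    context = F.normalize(img_emb + inst_emb, dim=-1)   # simple additive fusion for this sketch

    # Target embedding: natural language supervision (e.g., "move the arm to the right").
    sup_emb = F.normalize(clip_model.encode_text(supervision_tokens), dim=-1)

    # Pairwise similarity; matched (context, supervision) pairs lie on the diagonal.
    logits = context @ sup_emb.t() / temperature
    labels = torch.arange(images.shape[0], device=logits.device)
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
    return loss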

Closed-Loop Robot Control

At each time step, CLIP-RT selects the language-based motion primitive with the highest similarity score. The selected motion primitive is translated into low-level end-effector commands via a predefined lookup table. Since CLIP-RT is a discriminative model, it predicts an action in a single forward pass without autoregressive decoding. The model requires 7GB of GPU memory and runs at 16Hz on one H100 GPU and 8Hz on one NVIDIA RTX 3090 GPU (both in float32 precision), without speed-up tricks such as model quantization or compilation.
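
A single closed-loop control step might look like the sketch below. The primitives and displacement values in MOTION_LOOKUP are hypothetical examples, not the actual list of 58 primitives or the lookup table used in our experiments; clip_model and tokenizer follow the CLIP-style interface assumed above.

import torch
import torch.nn.functional as F

# Hypothetical lookup table from language-based motion primitives to end-effector deltas (meters).
MOTION_LOOKUP = {
    "move arm left":  {"dx": 0.00, "dy": -0.05, "dz": 0.00, "gripper": "keep"},
    "move arm right": {"dx": 0.00, "dy": 0.05, "dz": 0.00, "gripper": "keep"},
    "lower arm":      {"dx": 0.00, "dy": 0.00, "dz": -0.05, "gripper": "keep"},
    "close gripper":  {"dx": 0.00, "dy": 0.00, "dz": 0.00, "gripper": "close"},
}

@torch.no_grad()
def select_action(clip_model, tokenizer, image, instruction):
    """Score every primitive in a single forward pass; no autoregressive decoding."""
    primitives = list(MOTION_LOOKUP.keys())
    img_emb = F.normalize(clip_model.encode_image(image.unsqueeze(0)), dim=-1)
    inst_emb = F.normalize(clip_model.encode_text(tokenizer([instruction])), dim=-1)
    context = F.normalize(img_emb + inst_emb, dim=-1)
    prim_emb = F.normalize(clip_model.encode_text(tokenizer(primitives)), dim=-1)
    scores = (context @ prim_emb.t()).squeeze(0)        # similarity to every primitive
    best = primitives[scores.argmax().item()]
    return best, MOTION_LOOKUP[best]                    # language action + low-level command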

Experiments

Main results on 18 robot manipulation tasks.

We evaluate CLIP-RT on 9 Common tasks (top) and 9 Novel tasks (bottom), arranged in ascending order of average steps per episode.

Key Takeaways

1. CLIP-RT outperforms OpenVLA in average success rates by 3% (Common) and 24% (Novel).

2. Natural language supervision significantly increases performance (vs. CLIP-RT-Action).

3. STD boosts overall performance (vs. CLIP-RT-Passive).

Why does CLIP-RT perform well on Novel manipulation tasks?

Comparison between multi-task and single-task policies.

Where does the significant performance gap between CLIP-RT and OpenVLA on Novel tasks come from? One hypothesis is that CLIP-RT effectively learns the shared structure across diverse robotic tasks by using language-based motion primitives as basic building blocks. To verify this, we train a single-task policy for each Novel task and evaluate each model; OpenVLA-Single and CLIP-RT-Single denote the single-task policies. CLIP-RT gains more from multi-task training (+11.1%) than OpenVLA does (+3.3%), indicating that CLIP-RT benefits more from knowledge shared across diverse robotic tasks. In other words, learning more generalizable and transferable representations is one reason CLIP-RT performs well on Novel tasks.

Another factor is the use of natural language supervision. As shown in the main results, CLIP-RT consistently outperforms CLIP-RT-Action, a baseline that learns action tokens from scratch, similar to existing VLA models. This observation implies that natural language supervision enhances CLIP-RT’s generalization capabilities. To examine this more closely, we visualize how these models embed each motion primitive.

t-SNE visualization of 58 motion primitive embeddings from CLIP-RT (left) and CLIP-RT-Action (right).

The figure above shows the t-SNE projection of 58 motion primitives for CLIP-RT (left) and CLIP-RT-Action (right). We categorize the motion primitives into 16 groups based on the type of displacement. As shown in the figure, CLIP-RT tends to embed motions from the same group close together, whereas CLIP-RT-Action shows no clear structure in its embeddings. This implies that natural language supervision enables CLIP-RT to leverage the language priors inherent in the pretrained vision-language model (i.e., CLIP), facilitating the learning of more structured and semantically meaningful action representations. We conjecture that these language priors improve CLIP-RT’s generalization capabilities.
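
For reference, below is a minimal sketch of this analysis, assuming the 58 primitive embeddings have already been extracted from a model as a (58, D) array and that each primitive carries one of 16 group labels; the function name, t-SNE settings, and output file name are illustrative.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_primitive_embeddings(embeddings, group_ids, out_path="tsne_primitives.png"):
    """embeddings: (58, D) motion primitive embeddings; group_ids: (58,) integer labels in [0, 16)."""
    coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(np.asarray(embeddings))
    plt.figure(figsize=(5, 5))
    plt.scatter(coords[:, 0], coords[:, 1], c=group_ids, cmap="tab20", s=30)
    plt.title("t-SNE of motion primitive embeddings")
    plt.savefig(out_path, dpi=200)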

Our paper describes other interesting results, such as (1) collaboration of CLIP-RT with humans or large pretrained models and (2) few-shot generalization capabilities.

Conclusion

This paper presents CLIP-RT, enabling non-experts to teach robots new manipulation skills through natural language, making robot learning more accessible and scalable for everyday users.

BibTeX citation

@article{kang2024cliprt,
  title={CLIP-RT: Learning Language-Conditioned Robotic Policies from Natural Language Supervision},
  author={Kang, Gi-Cheon and Kim, Junghyun and Shim, Kyuhwan and Lee, Jun Ki and Zhang, Byoung-Tak},
  journal={arXiv preprint arXiv:2411.00508},
  year={2024}
}

Acknowledgements

This work was partly supported by the IITP (RS-2021-II212068-AIHub/10%, RS-2021-II211343-GSAI/20%, 2022-0-00951-LBA/20%, 2022-0-00953-PICA/20%) and NRF (RS-2024-00353991/20%, RS-2023-00274280/10%) grant funded by the Korean government.