CLIP-RT: Learning Language-Conditioned Robotic Policies from Natural Language Supervision

¹IPAI, Seoul National University, ²AI Institute, Seoul National University (AIIS)

*Equal Contribution, Corresponding Authors

Demo videos: Close the laptop · Draw a line from the star to the circle · Erase the whiteboard · Hide the Pooh with the green cup · Open the cabinet · Open the trash can · Play with the orange car · Pour the dog food in the bowl · Stamp near the circle

TL;DR

We explore how non-experts can teach robots new skills using only natural language supervision.

Our VLA model, CLIP-RT, learns end-to-end policies directly from this supervision.

Abstract

This paper explores how non-experts can teach robots desired skills in their environments. We argue that natural language is an intuitive and accessible interface for robot learning. To this end, we investigate two key aspects: (1) how non-experts collect robotic data using natural language supervision and (2) how pre-trained vision-language models learn end-to-end policies directly from this supervision. We propose a data collection framework that collects robot demonstrations based on natural language supervision (e.g., “move forward”) and further augments these demonstrations. Next, we introduce a model that learns language-conditioned policies from natural language supervision called CLIP-RT. Our model employs pre-trained CLIP models and learns to predict actions represented in language via contrastive imitation learning. We first train CLIP-RT on large-scale robotic data and then enable it to learn desired skills using data collected from our framework. CLIP-RT shows strong capabilities in acquiring novel manipulation skills, outperforming the state-of-the-art model, OpenVLA (7B parameters), by 17% in average success rates, while using 7x fewer parameters (1B).

Motivation

Challenges in Language-Conditioned Robotic Policies

(1) Obtaining real-world robot data often relies on experts who can operate robots or teleoperation systems

(2) Existing policies struggle to rapidly expand their set of manipulation skills to perform a wide range of real-world tasks

Research Question

How can non-experts teach robots desired skills intuitively, using natural language as an interface?

Approach

(1) Collect robotic data through natural language

(2) Train end-to-end policies on the collected data

Data Collection

Language-Based Teleoperation


We propose a language-based teleoperation method that leverages the in-context learning capabilities of large language models (LLMs). We collect 10 episodes for each skill through language-based teleoperation, which proceeds as follows (a code sketch of this loop appears after the steps):

(1) Provide initial language instruction (e.g., “Pour the dog food”)

(2) Provide natural language supervision in specific states to complete the instruction (e.g., “Move left a lot”)

(3) The LLM translates the natural language supervision into a low-level end-effector command based on a detailed text prompt

(4) Repeat steps 2 and 3 until the episode ends
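Below is a minimal sketch of this teleoperation loop. The query_llm helper, the JSON command format, and the robot.step interface are illustrative assumptions, not the exact prompt or robot API used in the paper.

import json

# Hypothetical prompt and command format (not the authors' exact prompt).
SYSTEM_PROMPT = (
    "Translate the natural language supervision into a low-level end-effector command. "
    'Respond with JSON: {"dx": m, "dy": m, "dz": m, "droll": rad, "dpitch": rad, "dyaw": rad, "gripper": 0 or 1}'
)

def query_llm(prompt: str) -> str:
    # Hypothetical wrapper around any chat LLM, prompted with in-context examples.
    raise NotImplementedError

def collect_episode(robot, instruction: str):
    episode = []
    print(f"Instruction: {instruction}")                           # step (1)
    while True:
        supervision = input("Language supervision (or 'done'): ")  # step (2)
        if supervision == "done":
            break
        reply = query_llm(f"{SYSTEM_PROMPT}\nInstruction: {instruction}\nSupervision: {supervision}")  # step (3)
        command = json.loads(reply)
        observation = robot.step(command)                          # execute the low-level command
        episode.append({"image": observation, "instruction": instruction,
                        "supervision": supervision, "command": command})
    return episode                                                 # step (4): loop until the episode ends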

Stochastic Trajectory Diversification (STD)


We augment the human-collected demonstration data using a method called stochastic trajectory diversification (STD), which consists of two phases (a sketch follows the list):

(1) The Diversification phase diversifies the expert trajectory into multiple alternative trajectories

(2) The Recovery phase intentionally deviates from the original trajectory and then executes a recovery action to return to the original path; the recovery action is used in training
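A minimal sketch of STD, under the assumption that a demonstration is a sequence of end-effector waypoints stored as a NumPy array; the noise scales and variant counts are illustrative, not the paper's settings.

import numpy as np

def diversify(expert_traj, n_variants=5, sigma=0.01, seed=0):
    # Diversification phase: sample alternative trajectories around the expert one.
    rng = np.random.default_rng(seed)
    return [expert_traj + rng.normal(0.0, sigma, expert_traj.shape) for _ in range(n_variants)]

def recovery_pairs(expert_traj, sigma=0.03, seed=0):
    # Recovery phase: deviate from each waypoint, then label the action that returns
    # to the original path; these (deviated state, recovery action) pairs are added to training.
    rng = np.random.default_rng(seed)
    pairs = []
    for t in range(len(expert_traj) - 1):
        deviated = expert_traj[t] + rng.normal(0.0, sigma, expert_traj[t].shape)
        recovery_action = expert_traj[t + 1] - deviated  # step back toward the original path
        pairs.append((deviated, recovery_action))
    return pairs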

CLIP-RT Model

Overview of CLIP-RT.

We propose a new VLA model, CLIP-RT, which extends the idea of CLIP to robot learning in order to learn language-conditioned policies from natural language. Unlike other VLA models, CLIP-RT learns to predict robotic actions represented in natural language (e.g., “Move arm left”) through contrastive imitation learning.

Contrastive Imitation Learning

The goal of contrastive imitation learning is to optimize the pairwise similarity between language supervision and contextual information (i.e., the current scene and the language instruction). We first train CLIP-RT on the Open X-Embodiment (OXE) dataset and then finetune it on our collected in-domain data. Since the OXE dataset does not contain natural language supervision, we transform its low-level end-effector actions into natural language supervision to train CLIP-RT.
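A minimal sketch of this objective, assuming a CLIP-style encoder pair (encode_image / encode_text, as in open_clip) and a simple additive fusion of image and instruction features; the paper's exact formulation and fusion scheme may differ. Each context (scene + instruction) in a batch is pulled toward its own language supervision and pushed away from the others' (OXE low-level actions would first be mapped to such language strings).

import torch
import torch.nn.functional as F

def contrastive_imitation_loss(model, tokenizer, images, instructions, supervisions, temperature=0.07):
    # images: (B, 3, H, W) tensor; instructions / supervisions: lists of B strings.
    # Context embedding: image features fused (here, simply added) with instruction features.
    image_features = model.encode_image(images)
    instruction_features = model.encode_text(tokenizer(instructions))
    context = F.normalize(image_features + instruction_features, dim=-1)

    # Embeddings of the natural language supervision (e.g., "move arm left a lot").
    actions = F.normalize(model.encode_text(tokenizer(supervisions)), dim=-1)

    # Pairwise similarity matrix; matching (context, supervision) pairs lie on the diagonal.
    logits = context @ actions.t() / temperature
    labels = torch.arange(len(supervisions), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2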


Closed-Loop Robot Control

At each time step, CLIP-RT selects the language action class with the highest similarity score. The selected action class is translated into low-level end-effector commands based on a predefined lookup table. Since CLIP-RT is a discriminative model, it can predict actions in a single forward pass without autoregressive decoding. The model requires 7GB of GPU memory and runs at 16Hz (one H100 GPU using float32 precision) or 8Hz (one NVIDIA RTX 3090 GPU using float32 precision) without applying any speed-up tricks, such as model quantization or compilation.
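A minimal sketch of one closed-loop control step; the action classes and end-effector deltas in the lookup table are illustrative placeholders rather than the authors' actual table, and the model / tokenizer / preprocess triple is assumed to follow an open_clip-style interface.

import torch
import torch.nn.functional as F

# Illustrative lookup table mapping language action classes to end-effector deltas (meters).
LOOKUP_TABLE = {
    "move arm left": {"dy": -0.05},
    "move arm right": {"dy": 0.05},
    "move arm forward": {"dx": 0.05},
    "lower arm": {"dz": -0.05},
    "close the gripper": {"gripper": 0},
}
ACTION_CLASSES = list(LOOKUP_TABLE.keys())

@torch.no_grad()
def control_step(model, tokenizer, preprocess, image, instruction, robot):
    # Score every language action class in a single forward pass (no autoregressive decoding)
    # and execute the best one via the lookup table.
    context = F.normalize(
        model.encode_image(preprocess(image).unsqueeze(0)) + model.encode_text(tokenizer([instruction])),
        dim=-1)
    actions = F.normalize(model.encode_text(tokenizer(ACTION_CLASSES)), dim=-1)
    best = ACTION_CLASSES[(context @ actions.t()).argmax().item()]
    robot.step(LOOKUP_TABLE[best])  # translate the language action into a low-level command
    return best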

Experimental Results


We evaluate CLIP-RT on 9 common tasks (top) and 10 novel tasks (bottom), arranged in ascending order of average steps per episode.

Key Takeaways

(1) CLIP-RT outperforms OpenVLA in average success rates by 3% on common tasks and 17% on novel manipulation tasks. Note that we also finetune OpenVLA on the same in-domain data as CLIP-RT through low-rank adaptation (LoRA).

(2) Vision-language models (VLMs) trained on Internet-scale data favor action representations specified in language over existing low-level action encodings (vs. CLIP-RT-Action)

(3) STD boosts the overall performance (vs. CLIP-RT-Passive)

Conclusion

This paper presents CLIP-RT, a VLA model that enables non-experts to teach robots new manipulation skills through natural language, making robot learning more accessible and scalable for everyday users.

BibTeX citation

@article{kang2024cliprt,
  title={CLIP-RT: Learning Language-Conditioned Robotic Policies from Natural Language Supervision},
  author={Kang, Gi-Cheon and Kim, Junghyun and Shim, Kyuhwan and Lee, Jun Ki and Zhang, Byoung-Tak},
  journal={arXiv preprint arXiv:2411.00508},
  year={2024}
}

  

Acknowledgements

This work was partly supported by the IITP (RS-2021-II212068-AIHub/10%, RS-2021-II211343-GSAI/20%, 2022-0-00951-LBA/20%, 2022-0-00953-PICA/20%) and NRF (RS-2024-00353991/20%, RS-2023-00274280/10%) grant funded by the Korean government.