¹IPAI, Seoul National University, ²AI Institute, Seoul National University (AIIS)
*Equal Contribution, †Corresponding Authors
TL;DR
We explore how non-experts can teach robots skills solely through natural language supervision.
Our VLA model, CLIP-RT, learns end-to-end policies directly from this supervision.
This paper explores how non-experts can teach robots desired skills in their environments. We argue that natural language is an intuitive and accessible interface for robot learning. To this end, we investigate two key aspects: (1) how non-experts collect robotic data using natural language supervision and (2) how pre-trained vision-language models learn end-to-end policies directly from this supervision. We propose a framework that collects robot demonstrations based on natural language supervision (e.g., “move forward”) and further augments these demonstrations. Next, we introduce CLIP-RT, a model that learns language-conditioned policies from natural language supervision. Our model employs pre-trained CLIP models and learns to predict actions represented in language via contrastive imitation learning. We first train CLIP-RT on large-scale robotic data and then enable it to learn desired skills from data collected with our framework. CLIP-RT shows strong capabilities in acquiring novel manipulation skills, outperforming the state-of-the-art model, OpenVLA (7B parameters), by 17% in average success rate while using 7x fewer parameters (1B).
(1) Obtaining real-world robot data often relies on experts who can operate robots or teleoperation systems
(2) Existing systems struggle to rapidly expand their set of manipulation skills to perform a wide range of real-world tasks
How can non-experts teach robots desired skills intuitively, using natural language as an interface?
(1) Collect robotic data through natural language
(2) Train a policy on the collected data
We propose a language-based teleoperation method that leverages the in-context learning capabilities of large language models (LLMs). We collect 10 episodes for each skill through language-based teleoperation:
(1) Provide an initial language instruction (e.g., “Pour the dog food”)
(2) Provide natural language supervision in specific states to complete the instruction (e.g., “Move left a lot”)
(3) The LLM translates the natural language supervision into a low-level end-effector command based on a detailed text prompt (a minimal sketch follows this list)
(4) Repeat steps 2 and 3 until the episode ends
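The snippet below is a minimal sketch of step 3, assuming a JSON-style action format and an abstract `call_llm` function; the field names, magnitudes, and prompt wording are illustrative placeholders, not the prompt used in the paper.

```python
# Sketch: translate natural language supervision into a low-level command via an LLM.
import json

PROMPT_TEMPLATE = """You control a robot arm through end-effector deltas.
Translate the teacher's supervision into a JSON action of the form
{{"dx": 0.0, "dy": 0.0, "dz": 0.0, "gripper": "keep"}} (meters; gripper is
"open", "close", or "keep"). "a lot" means roughly 0.10 m, "a little" 0.02 m.

Instruction: {instruction}
Supervision: {supervision}
JSON action:"""

def supervision_to_command(instruction: str, supervision: str, call_llm) -> dict:
    """Use an LLM's in-context learning to map language supervision
    (e.g., "Move left a lot") to a low-level end-effector command.
    `call_llm` is any function mapping a prompt string to the model's reply."""
    prompt = PROMPT_TEMPLATE.format(instruction=instruction, supervision=supervision)
    return json.loads(call_llm(prompt))

# Example with a stub LLM (replace with a real API call):
fake_llm = lambda p: '{"dx": 0.0, "dy": -0.10, "dz": 0.0, "gripper": "keep"}'
cmd = supervision_to_command("Pour the dog food", "Move left a lot", fake_llm)
print(cmd["dy"])  # -0.1
```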
We augment the human-collected demonstration data with a method called stochastic trajectory diversification (STD), which consists of two parts:
(1) The diversification phase diversifies the expert trajectory into multiple alternative trajectories
(2) The recovery phase intentionally deviates from the original trajectory and then executes a recovery action to return to the original path; these recovery actions are used in training (see the sketch after this list)
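The following is an illustrative sketch of the two STD phases; the waypoint representation, noise scales, and perturbation scheme below are assumptions for brevity, not the paper's exact procedure.

```python
# Sketch of stochastic trajectory diversification (STD).
import numpy as np

def diversify(trajectory: np.ndarray, n_variants: int = 4, noise: float = 0.02):
    """Diversification: sample alternative trajectories around the expert one.
    `trajectory` is a (T, D) array of end-effector waypoints."""
    return [trajectory + np.random.normal(0.0, noise, trajectory.shape)
            for _ in range(n_variants)]

def recovery_pairs(trajectory: np.ndarray, noise: float = 0.05):
    """Recovery: deviate from a waypoint, then record the action that returns
    to the original path. The (deviated state, recovery action) pairs serve as
    additional training data."""
    pairs = []
    for t in range(len(trajectory) - 1):
        deviated = trajectory[t] + np.random.normal(0.0, noise, trajectory.shape[1])
        recovery_action = trajectory[t + 1] - deviated  # step back toward the expert path
        pairs.append((deviated, recovery_action))
    return pairs

# Dummy usage on a straight-line expert trajectory
expert = np.linspace([0.0, 0.0, 0.3], [0.2, 0.1, 0.1], num=20)
variants = diversify(expert)
pairs = recovery_pairs(expert)
```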
We propose a new VLA model, CLIP-RT, which extends the idea of CLIP to robot learning to learn language-conditioned policies from natural language. Unlike other VLA models, CLIP-RT learns to predict robotic actions represented in natural language (e.g., “Move arm left”) through contrastive imitation learning.
The goal of contrastive imitation learning is to optimize the pairwise similarity between language supervision and contextual information (i.e., the current scene and language instruction). We first train CLIP-RT on the Open X-Embodiment (OXE) dataset and then fine-tune it on our collected in-domain data. Since the OXE dataset does not contain natural language supervision, we transform its low-level end-effector actions into natural language supervision to train CLIP-RT.
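Below is a minimal sketch of a CLIP-style contrastive objective consistent with this description; how the scene and instruction embeddings are fused (a simple sum here) and the symmetric InfoNCE form are assumptions, and the paper's exact loss may differ.

```python
# Sketch of a contrastive imitation learning objective.
import torch
import torch.nn.functional as F

def contrastive_imitation_loss(image_emb, instruction_emb, action_emb, temperature=0.07):
    """image_emb, instruction_emb: (B, D) embeddings of the current scene and the
    language instruction; action_emb: (B, D) embeddings of the natural language
    supervision (e.g., "move arm left"). Matched pairs lie on the diagonal."""
    context = F.normalize(image_emb + instruction_emb, dim=-1)  # contextual information
    actions = F.normalize(action_emb, dim=-1)                   # language actions
    logits = context @ actions.t() / temperature                # (B, B) pairwise similarities
    labels = torch.arange(len(logits), device=logits.device)
    # symmetric cross-entropy over context->action and action->context
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# Dummy usage with random embeddings
B, D = 8, 512
loss = contrastive_imitation_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
```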
At each time step, CLIP-RT selects the language action class with the highest similarity score. The selected action class is translated into low-level end-effector commands based on a pre-defined lookup table (see the sketch below). Since CLIP-RT is a discriminative model, it can predict actions in a single forward pass without autoregressive decoding. The model requires 7GB of GPU memory and runs at 16Hz (one NVIDIA H100 GPU, float32 precision) or 8Hz (one NVIDIA RTX 3090 GPU, float32 precision) without any speed-up tricks such as model quantization or compilation.
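The sketch below illustrates this inference step with hypothetical action classes and lookup-table entries; the actual action vocabulary and command values come from CLIP-RT's configuration, not from this example.

```python
# Sketch: score language action classes against the context and map the best
# one to a low-level end-effector command via a lookup table.
import torch
import torch.nn.functional as F

ACTION_CLASSES = ["move arm left", "move arm right", "lower arm",
                  "close gripper", "open gripper"]
LOOKUP = {  # language action -> (dx, dy, dz, gripper) placeholder deltas
    "move arm left":  (0.00, -0.05,  0.00, 0),
    "move arm right": (0.00,  0.05,  0.00, 0),
    "lower arm":      (0.00,  0.00, -0.05, 0),
    "close gripper":  (0.00,  0.00,  0.00, 1),
    "open gripper":   (0.00,  0.00,  0.00, -1),
}

@torch.no_grad()
def select_action(context_emb: torch.Tensor, action_embs: torch.Tensor):
    """One forward pass, no autoregressive decoding: pick the action class with
    the highest cosine similarity to the context embedding."""
    sims = F.normalize(context_emb, dim=-1) @ F.normalize(action_embs, dim=-1).t()
    idx = sims.argmax(dim=-1).item()
    return ACTION_CLASSES[idx], LOOKUP[ACTION_CLASSES[idx]]

# Dummy usage with random embeddings (replace with CLIP-RT encoder outputs)
action, command = select_action(torch.randn(1, 512), torch.randn(len(ACTION_CLASSES), 512))
```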
We evaluate CLIP-RT on 9 common tasks (top) and 10 novel tasks (bottom), arranged in ascending order of average steps per episode.
(1) CLIP-RT outperforms OpenVLA in average success rate by 3% on common tasks and by 17% on novel manipulation tasks. Note that we also fine-tune OpenVLA on the same in-domain data as CLIP-RT using low-rank adaptation (LoRA).
(2) Vision-language models (VLMs) trained on Internet-scale data favor action representations specified in language over existing low-level action encodings (vs. CLIP-RT-Action)
(3) STD boosts overall performance (vs. CLIP-RT-Passive)
This paper presents CLIP-RT, which enables non-experts to teach robots new manipulation skills through natural language, making robot learning more accessible and scalable for everyday users.
@article{kang2024cliprt,
  title={CLIP-RT: Learning Language-Conditioned Robotic Policies from Natural Language Supervision},
  author={Kang, Gi-Cheon and Kim, Junghyun and Shim, Kyuhwan and Lee, Jun Ki and Zhang, Byoung-Tak},
  journal={arXiv preprint arXiv:2411.00508},
  year={2024}
}
This work was partly supported by the IITP (RS-2021-II212068-AIHub/10%, RS-2021-II211343-GSAI/20%, 2022-0-00951-LBA/20%, 2022-0-00953-PICA/20%) and NRF (RS-2024-00353991/20%, RS-2023-00274280/10%) grant funded by the Korean government.