CLIP-RT: Learning Language-Conditioned Robotic Policies from Natural Language Supervision


Gi-Cheon Kang, Junghyun Kim, Kyuhwan Shim, Jun Ki Lee, Byoung-Tak Zhang

Paper ID 16

Session 2. VLA Models

Poster Session (Day 1): Saturday, June 21, 6:30-8:00 PM

Abstract: Teaching robots desired skills in real-world environments remains challenging, especially for non-experts. Current robot learning methods often require expert demonstrations or complex programming, limiting their accessibility. We posit that natural language offers an intuitive and accessible interface for robot learning. To this end, we study two aspects: (1) enabling non-experts to collect robotic data through natural language supervision (e.g., “move the arm up”) and (2) learning robotic policies directly from this supervision. Specifically, we introduce a data collection framework that gathers robot demonstrations based on natural language supervision and further augments these demonstrations. We then present CLIP-RT, a vision-language-action (VLA) model that learns language-conditioned visuomotor policies from this supervision. CLIP-RT adapts pre-trained CLIP models and learns to predict language-based motion primitives via contrastive imitation learning. We train CLIP-RT on the Open X-Embodiment dataset and fine-tune it on in-domain data collected by our framework to learn diverse skills. CLIP-RT demonstrates strong capabilities in acquiring novel manipulation skills, outperforming the state-of-the-art model, OpenVLA (7B parameters), by 24% in average success rate while using 7x fewer parameters (1B). We further observe that CLIP-RT shows significant improvements in few-shot imitation learning. Finally, CLIP-RT demonstrates its adaptability by collaborating with humans through corrections or by incorporating predictions from foundation models for improved generalization.
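
To make the inference idea in the abstract concrete, the sketch below shows how a CLIP-style contrastive model could score a closed set of language-based motion primitives against the current camera image and instruction, then execute the highest-scoring one. This is a minimal illustration under stated assumptions, not the authors' released code: the checkpoint name, the prompt format that folds the instruction into the text side, and the primitive vocabulary are all illustrative.

```python
# Hedged sketch of contrastive primitive selection with an off-the-shelf CLIP
# model from Hugging Face transformers. The actual CLIP-RT architecture,
# training objective, and primitive set may differ.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical closed vocabulary of language-based motion primitives.
PRIMITIVES = [
    "move the arm up",
    "move the arm down",
    "move the arm to the left",
    "move the arm to the right",
    "open the gripper",
    "close the gripper",
]

def select_primitive(image: Image.Image, instruction: str) -> str:
    """Return the primitive whose text embedding best matches the observation.

    The instruction is folded into each candidate text prompt; CLIP's
    image-text similarity then ranks the candidates (greedy, single step).
    """
    texts = [f"Instruction: {instruction}. Action: {p}" for p in PRIMITIVES]
    inputs = processor(text=texts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (1, num_candidates): one score per primitive.
    scores = outputs.logits_per_image.squeeze(0)
    return PRIMITIVES[int(scores.argmax())]

# Example usage with a placeholder camera frame:
# frame = Image.open("observation.png")
# print(select_primitive(frame, "pick up the red block"))
```

In a closed-loop setting, a controller would call a routine like this at each step, execute the returned primitive on the robot, and re-query with the new observation until the task is judged complete.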