Learning to Act Anywhere with Task-centric Latent Actions


Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, Hongyang Li

Paper ID 14

Session 2. VLA Models

Poster Session (Day 1): Saturday, June 21, 6:30-8:00 PM

Abstract: The development of generalist robotic models that can execute diverse tasks across varied environments and embodiments has been impeded by the dependence on large-scale labeled datasets and the inherent heterogeneity of action and observation spaces. To address these challenges, we introduce UniVLA, a framework for learning omni-purpose vision-language-action (VLA) policies that plan scalably and efficiently across diverse environments and tasks. Our methodology comprises three pivotal stages: 1) Task-Centric Latent Action Learning, where we derive task-relevant action representations from extensive cross-embodiment videos in an unsupervised manner, using DINOv2 features and language instructions to filter out task-irrelevant dynamics; 2) Latent Action Pretraining, where we train an auto-regressive vision-language model on discretized latent action tokens to enable embodiment-agnostic planning; and 3) Latent Action Decoding, where we translate latent plans into executable behaviors for deployment across heterogeneous robotic systems. UniVLA achieves state-of-the-art performance on multiple manipulation and navigation benchmarks, surpassing existing VLAs at a lower computational cost. Extensive evaluations underscore the efficiency, scalability, and generalizability of UniVLA, presenting a promising pathway toward next-generation generalist policies.
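
For orientation, the sketch below illustrates the three-stage structure described in the abstract as plain PyTorch stubs. Every class name, tensor shape, and hyperparameter (LatentActionTokenizer, VOCAB_SIZE, LATENT_HORIZON, ACTION_DIM, etc.) is a hypothetical placeholder chosen for the example, not the authors' architecture or training setup.

```python
# Minimal sketch of a UniVLA-style three-stage pipeline, following the abstract.
# All modules, shapes, and hyperparameters are illustrative placeholders.

import torch
import torch.nn as nn

VOCAB_SIZE = 32          # assumed size of the discretized latent-action codebook
LATENT_HORIZON = 4       # assumed number of latent action tokens per plan
ACTION_DIM = 7           # assumed low-level action dimension (e.g., a 7-DoF arm)


class LatentActionTokenizer(nn.Module):
    """Stage 1 (placeholder): maps visual dynamics between consecutive frames
    to discrete task-centric latent action tokens via a learned codebook."""

    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.codebook = nn.Embedding(VOCAB_SIZE, feat_dim)
        self.encoder = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, feat_t: torch.Tensor, feat_t1: torch.Tensor) -> torch.Tensor:
        # Encode the change between frame features, then assign each sample to
        # its nearest codebook entry (a crude stand-in for VQ-style training).
        z = self.encoder(torch.cat([feat_t, feat_t1], dim=-1))
        dists = torch.cdist(z, self.codebook.weight)            # (B, VOCAB_SIZE)
        return dists.argmin(dim=-1)                             # discrete token ids


class LatentActionPolicy(nn.Module):
    """Stage 2 (placeholder): predicts latent action tokens from a fused
    observation/instruction embedding, standing in for the auto-regressive VLM."""

    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.head = nn.Linear(feat_dim, VOCAB_SIZE)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # Greedy token choice repeated over a short horizon; a real VLM would
        # condition each step on previously generated tokens.
        logits = self.head(context)                              # (B, VOCAB_SIZE)
        return logits.argmax(dim=-1, keepdim=True).repeat(1, LATENT_HORIZON)


class ActionDecoder(nn.Module):
    """Stage 3 (placeholder): translates latent tokens into executable
    low-level actions for a specific embodiment."""

    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, feat_dim)
        self.to_action = nn.Linear(feat_dim, ACTION_DIM)

    def forward(self, latent_tokens: torch.Tensor) -> torch.Tensor:
        return self.to_action(self.embed(latent_tokens))         # (B, H, ACTION_DIM)


if __name__ == "__main__":
    batch, feat_dim = 2, 768
    # Stand-ins for DINOv2 frame features and a fused vision-language context.
    feat_t, feat_t1 = torch.randn(batch, feat_dim), torch.randn(batch, feat_dim)
    context = torch.randn(batch, feat_dim)

    tokens = LatentActionTokenizer()(feat_t, feat_t1)            # pseudo-labels for pretraining
    latent_plan = LatentActionPolicy()(context)                  # embodiment-agnostic plan
    actions = ActionDecoder()(latent_plan)                       # embodiment-specific actions
    print(tokens.shape, latent_plan.shape, actions.shape)
```

Under these assumptions, only the Stage 3 decoder would need to change per robot, which is the property that lets the latent plan remain embodiment-agnostic.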