Abstract: The advancement of generalist robotic models capable of executing diverse tasks across varied environments and embodiments has been impeded by their dependence on large-scale labeled datasets and by the inherent heterogeneity of action and observation spaces. To address these challenges, we introduce UniVLA, a framework for developing omni-purpose vision-language-action (VLA) policies that enable scalable and efficient planning across diverse environments and tasks. Our methodology comprises three pivotal stages: 1) Task-Centric Latent Action Learning, where we derive task-relevant action representations from extensive cross-embodiment videos in an unsupervised manner, using DINOv2 features and language instructions to filter out task-irrelevant dynamics; 2) Latent Action Pretraining, where we train an auto-regressive vision-language model on discretized latent action tokens to enable embodiment-agnostic planning; and 3) Latent Action Decoding, where we translate latent plans into executable behaviors for deployment across diverse and heterogeneous robotic systems. UniVLA achieves state-of-the-art performance on multiple manipulation and navigation benchmarks, surpassing existing VLAs at a lower computational cost. Extensive evaluations underscore the efficiency, scalability, and generalizability of UniVLA, presenting a promising pathway toward next-generation generalist policies.
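As a rough illustration of the three-stage pipeline summarized above, the following PyTorch sketch wires together a latent action model, an autoregressive latent-action policy head, and an embodiment-specific decoder. All module names, layer sizes, the simplified nearest-neighbor vector quantization, and the use of precomputed visual features are assumptions made for illustration only; they are not the paper's actual implementation.

```python
# Minimal, illustrative sketch of a three-stage latent-action pipeline.
# Assumptions: visual features (e.g., DINOv2 patch/CLS features) are precomputed,
# the VLM backbone is replaced by a plain context vector, and dimensions are arbitrary.
import torch
import torch.nn as nn


class LatentActionModel(nn.Module):
    """Stage 1 (sketch): infer a discrete latent action token from the change
    between consecutive visual features, intended to capture task-relevant dynamics."""
    def __init__(self, feat_dim=384, codebook_size=16, code_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.GELU(), nn.Linear(256, code_dim)
        )
        self.codebook = nn.Embedding(codebook_size, code_dim)  # discrete latent actions

    def forward(self, feat_t, feat_tp1):
        z = self.encoder(torch.cat([feat_t, feat_tp1], dim=-1))
        # Simplified VQ step: pick the nearest codebook entry as the latent action token.
        dists = torch.cdist(z, self.codebook.weight)            # (B, codebook_size)
        return dists.argmin(dim=-1)                             # (B,) token indices


class LatentActionPolicy(nn.Module):
    """Stage 2 (sketch): predict the next latent action token from a pooled
    vision-language context vector (stand-in for an autoregressive VLM)."""
    def __init__(self, ctx_dim=512, codebook_size=16):
        super().__init__()
        self.head = nn.Linear(ctx_dim, codebook_size)

    def forward(self, vl_context):
        return self.head(vl_context)                            # logits over latent actions


class ActionDecoder(nn.Module):
    """Stage 3 (sketch): map a latent action token to an executable,
    embodiment-specific command (e.g., a 7-DoF end-effector action)."""
    def __init__(self, codebook_size=16, code_dim=128, action_dim=7):
        super().__init__()
        self.embed = nn.Embedding(codebook_size, code_dim)
        self.mlp = nn.Sequential(
            nn.Linear(code_dim, 128), nn.GELU(), nn.Linear(128, action_dim)
        )

    def forward(self, token):
        return self.mlp(self.embed(token))


# Toy end-to-end pass with random tensors standing in for real features.
lam, policy, decoder = LatentActionModel(), LatentActionPolicy(), ActionDecoder()
feat_t, feat_tp1 = torch.randn(2, 384), torch.randn(2, 384)
pseudo_label = lam(feat_t, feat_tp1)            # stage 1: pseudo latent-action labels
logits = policy(torch.randn(2, 512))            # stage 2: predict latent action tokens
action = decoder(logits.argmax(dim=-1))         # stage 3: decode to robot commands
```

In this sketch, stage 1 would supply pseudo-labels for supervising the stage-2 policy, while only the lightweight stage-3 decoder needs embodiment-specific data, which is the separation of concerns the abstract describes.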