Abstract: This work highlights that video world modeling, alongside vision-language pre-training, establishes a distinct foundation for robot learning. By capturing environmental dynamics, a video world model lets the robot “imagine” near-future states, a capability essential for effective planning. Motivated by this, we introduce CauVA, an autoregressive diffusion framework that unifies frame prediction and action inference. Our approach features three key innovations: (1) an autoregressive Mixture-of-Transformers (MoT) that processes visual frames and actions as a single causal sequence; (2) a history integration mechanism that uses a KV cache to maintain temporal context from real-world interactions; and (3) a noisy latent augmentation strategy that enables actions to be decoded directly from intermediate, partially denoised videos for fast inference. Evaluations on simulation and real-world benchmarks demonstrate that CauVA excels in complex manipulation, including long-horizon, high-precision, and deformable-object tasks. Our code and models are publicly available.
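
To make item (1) concrete, the following is a minimal sketch of one way an attention mask could treat frames and actions as a single causal sequence. It is an illustration under stated assumptions, not the CauVA implementation: the interleaved layout (frame tokens followed by action tokens at each step), the block-level causality, and the name `block_causal_mask` with its parameters are all hypothetical and do not come from the abstract.

```python
from typing import List


def block_causal_mask(
    num_steps: int, frame_tokens: int, action_tokens: int
) -> List[List[bool]]:
    """Return mask[i][j], True when query token i may attend to key token j."""
    # Assumed (hypothetical) layout: [frame_0, action_0, frame_1, action_1, ...]
    block_ids: List[int] = []
    for step in range(num_steps):
        block_ids += [2 * step] * frame_tokens       # frame block of this step
        block_ids += [2 * step + 1] * action_tokens  # action block of this step
    # A token sees its own block and every earlier block, never a later one.
    return [
        [key_block <= query_block for key_block in block_ids]
        for query_block in block_ids
    ]


if __name__ == "__main__":
    # Two steps, two frame tokens and one action token per step.
    for row in block_causal_mask(num_steps=2, frame_tokens=2, action_tokens=1):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this assumed layout, tokens within a frame attend to one another, each action block sees the frame that precedes it, and nothing attends to future blocks. That ordering is also what would allow keys and values from earlier steps to be cached and reused, as item (2) describes.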