Causal World Modeling for Robot Control

Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Zhangluyao, Mingrui Yu, Zelin Gao, Nan Xue, Boyu Zhou, Xing Zhu, Mingyu Ding, Yujun Shen, Yinghao Xu

Paper ID: 16

Session: World Models & Memory

Posters presented in the poster session following their oral. Locations not assigned.

Abstract: This work highlights that video world modeling, alongside vision-language pre-training, establishes a distinct foundation for robot learning. By capturing environmental dynamics, video world models enable the robot to “imagine” near-future states, a capability essential for effective planning. Motivated by this, we introduce CauVA, an autoregressive diffusion framework that unifies frame prediction and action inference. Our approach features three key innovations: (1) an autoregressive Mixture-of-Transformers (MoT) that processes visual frames and actions as a single causal sequence; (2) a history integration mechanism that uses a KV cache to maintain temporal context from real-world interactions; and (3) a noisy latent augmentation strategy that enables decoding actions directly from intermediate denoised videos for fast inference. Evaluations on simulation and real-world benchmarks demonstrate that CauVA excels in complex manipulation, including long-horizon, high-precision, and deformable-object tasks. Our code and models are publicly available.
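
To make the "single causal sequence" in item (1) concrete, the following is a minimal sketch of a block-causal attention mask over interleaved frame and action tokens. It is an illustration, not the authors' implementation. The function name `block_causal_mask`, the token counts, and the choice to let tokens attend bidirectionally within their own block are all assumptions, since the abstract does not specify them.

```python
import numpy as np

def block_causal_mask(block_sizes):
    """Build a boolean attention mask for a sequence made of consecutive blocks.

    mask[i, j] is True when token i may attend to token j. A token sees every
    token in its own block and in all earlier blocks, never in later blocks.
    (Bidirectional attention inside a block is an assumption of this sketch.)
    """
    # Assign each token the index of the block it belongs to.
    block_ids = np.repeat(np.arange(len(block_sizes)), block_sizes)
    # Token i may attend to token j if j's block is not later than i's block.
    return block_ids[:, None] >= block_ids[None, :]

# Hypothetical layout: frame_0 (3 tokens), action_0 (1), frame_1 (3), action_1 (1).
mask = block_causal_mask([3, 1, 3, 1])
print(mask.astype(int))
```

Because no token attends to later blocks, the keys and values computed for past frames and actions stay valid as the sequence grows. That property is what allows a KV cache to carry history forward, as in item (2).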