Causal World Modeling for Robot Control


Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Zhangluyao, Mingrui Yu, Zelin Gao, Nan Xue, Boyu Zhou, Xing Zhu, Mingyu Ding, Yujun Shen, Yinghao Xu

Paper ID: 16

Session: World Models & Memory

Poster session details TBA

Abstract: This work argues that video world modeling, alongside vision-language pre-training, provides a distinct foundation for robot learning. By capturing environmental dynamics, video world models let the robot “imagine” near-future states, a capability essential for effective planning. Motivated by this, we introduce CauVA, an autoregressive diffusion framework that unifies frame prediction and action inference. Our approach features three key innovations: (1) an autoregressive Mixture-of-Transformers (MoT) that processes visual frames and actions as a single causal sequence; (2) a history integration mechanism that uses a KV cache to maintain temporal context from real-world interactions; and (3) a noisy latent augmentation strategy that enables decoding actions directly from partially denoised videos for fast inference. Evaluations on simulation and real-world benchmarks demonstrate that CauVA excels in complex manipulation, including long-horizon, high-precision, and deformable object tasks. Our code and models are publicly available.
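
To illustrate what treating frames and actions as "a single causal sequence" could look like, the following is a minimal sketch of a block-causal attention mask. It is not taken from the CauVA code. The token layout (each frame's latent tokens followed by that step's action tokens), the token counts, and the helper name `block_causal_mask` are all assumptions made for illustration.

```python
from typing import List


def block_causal_mask(block_sizes: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    Tokens attend freely within their own block and to every earlier
    block, but never to a later block (block-level causality).
    """
    # Map each token position to the index of the block it belongs to.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]


# Hypothetical layout (an assumption, not the paper's configuration):
# two time steps, each a frame of 3 latent tokens followed by 2 action tokens.
FRAME_TOKENS, ACTION_TOKENS = 3, 2
blocks = [FRAME_TOKENS, ACTION_TOKENS] * 2
mask = block_causal_mask(blocks)

# Print the mask as a grid: '1' = may attend, '.' = masked out.
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

Under this layout, tokens from earlier steps are never affected by later ones, which is the property that lets their attention keys and values be computed once and reused from a KV cache as new observations arrive.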