Collaborating Visual and Parameter Spaces for Consistent Long-Horizon Embodied World Model

Longyu Chen, Heng Li, Wei Yang, Manqi Zhao, Dongsheng Jiang

Paper ID: 14

Session: World Models & Memory

Posters are presented in the poster session following their oral presentation. Poster locations have not been assigned.

Abstract: Embodied World Models (EWMs) have emerged as a scalable and risk-free paradigm for evaluating Vision-Language-Action (VLA) systems. However, their reliability as evaluation benchmarks is often limited by the representation gap between low-dimensional actions and high-dimensional video synthesis. This gap leads to a lack of geometric correspondence, manifesting as accumulated trajectory drift and inconsistent object-robot interactions in long-horizon rollouts. To bridge this gap, we propose ViPSim, a framework that achieves consistent long-horizon generation through the synergistic collaboration of Visual and Parameter Spaces. We define the Visual Space as a domain of explicit spatial priors, integrating pixel-aligned projections of actions, camera perspectives, depth-informed scene geometry, and robot morphological masks to provide dense structural constraints. Concurrently, the Parameter Space is defined as a domain of numerical drivers that injects raw action sequences and camera matrices to provide precise motion guidance. By unifying these two spaces, ViPSim ensures that the generated states are simultaneously constrained by geometric boundaries and driven by precise numerical commands. Extensive experiments demonstrate that ViPSim is backbone-agnostic and significantly enhances trajectory consistency. Notably, our approach exhibits emergent capabilities in generating complex interactions with deformable objects (e.g., cloth folding) and maintains robust performance in out-of-distribution and cross-embodiment scenarios, providing a high-fidelity foundation for the automated evaluation of embodied intelligence.
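To make the "pixel-aligned projections of actions" idea concrete, the following is a minimal sketch of how a 3D action waypoint (for example, an end-effector position) could be mapped into image coordinates with a standard pinhole camera model and then drawn into an image-sized mask. It is an illustration under stated assumptions, not the ViPSim implementation: the function names `project_point` and `rasterize_waypoints`, the intrinsics values, and the binary-mask encoding are all hypothetical, and the abstract does not specify how the authors render their visual-space inputs.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]

# Hypothetical world-to-camera rotation: camera axes aligned with the world.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def project_point(
    point_world: Vec3,
    rotation: Sequence[Sequence[float]],  # 3x3 world-to-camera rotation
    translation: Vec3,                    # world-to-camera translation
    fx: float, fy: float, cx: float, cy: float,  # pinhole intrinsics
) -> Tuple[float, float]:
    """Project a 3D world point to pixel coordinates (u, v)."""
    # Camera-frame coordinates: p_cam = R @ p_world + t
    x, y, z = (
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    if z <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division followed by the intrinsic scaling and offset.
    return (fx * x / z + cx, fy * y / z + cy)


def rasterize_waypoints(
    pixels: Sequence[Tuple[float, float]], width: int, height: int
) -> List[List[int]]:
    """Mark each in-bounds projected waypoint with 1 in a height x width mask."""
    mask = [[0] * width for _ in range(height)]
    for u, v in pixels:
        col, row = int(round(u)), int(round(v))
        if 0 <= row < height and 0 <= col < width:
            mask[row][col] = 1
    return mask


# Assumed example values: two waypoints in front of a 64x64 camera.
waypoints = [(0.0, 0.0, 1.0), (0.5, -0.5, 2.0)]
pixels = [
    project_point(p, IDENTITY, (0.0, 0.0, 0.0), 100.0, 100.0, 32.0, 32.0)
    for p in waypoints
]
mask = rasterize_waypoints(pixels, width=64, height=64)
```

In this sketch the raw waypoints and camera parameters play the role the abstract assigns to the Parameter Space (numerical drivers), while the rasterized mask stands in for a Visual Space input that is aligned pixel by pixel with the generated frames.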