Abstract: Imitation learning has emerged as a promising approach toward building generalist robots. However, its reliance on high-quality expert demonstrations makes it difficult to scale to large robot foundation models. Meanwhile, large amounts of video data depicting a wide range of environments and diverse behaviors are readily available. This data is a rich source of information about real-world dynamics and agent-environment interactions. Leveraging it efficiently for robotics is difficult, however, because it lacks the action annotations that current imitation learning methods require. In this work, we present Unified World Models (UWMs), a framework that allows for leveraging video data for policy learning. Specifically, a UWM integrates an action diffusion process and a video diffusion process within a unified transformer architecture, where independent diffusion timesteps govern each modality. By controlling each diffusion timestep, UWMs can flexibly generate samples from the forward dynamics, the inverse dynamics, as well as marginal and joint distributions. Through simulated and real-world experiments, we show that: (1) UWMs are an effective policy class for behavior cloning, achieving performance comparable to state-of-the-art behavior cloning methods, (2) UWMs enable efficient pretraining on large-scale multitask robot datasets, where finetuned policies outperform baselines in terms of generalization and robustness, and (3) UWMs naturally enable learning from action-free video data through independent control of modality-specific diffusion timesteps, further improving the performance of finetuned policies. Our results suggest that UWMs offer a promising step toward harnessing large, heterogeneous datasets for scalable robot learning.
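To make the role of the two independent diffusion timesteps concrete, the sketch below shows one plausible way that choosing a timestep per modality could select among the distributions listed in the abstract. It is an illustration under stated assumptions, not the implementation from this work: the constant `T`, the mode names, and the helpers `schedule` and `training_timesteps` are all hypothetical. It assumes that timestep 0 means a modality is supplied clean (conditioned on), timestep `T` means it is pure noise (marginalized out), and an intermediate timestep means it is being denoised (sampled). It also assumes that action-free video is handled during training by fixing the action timestep at `T`.

```python
import random
from dataclasses import dataclass

T = 1000  # hypothetical number of diffusion steps: 0 = clean, T = pure noise


@dataclass(frozen=True)
class TimestepPlan:
    """Role of each modality: 'denoise' (sampled), 'clean' (given), 'noise' (ignored)."""
    action: str
    next_obs: str


# Assumed mapping from inference mode to the role of each modality.
MODES = {
    "policy": TimestepPlan(action="denoise", next_obs="noise"),
    "video_prediction": TimestepPlan(action="noise", next_obs="denoise"),
    "forward_dynamics": TimestepPlan(action="clean", next_obs="denoise"),
    "inverse_dynamics": TimestepPlan(action="denoise", next_obs="clean"),
    "joint": TimestepPlan(action="denoise", next_obs="denoise"),
}


def timestep_for(role, step):
    """Translate a modality's role into the timestep fed to the model."""
    if role == "clean":
        return 0
    if role == "noise":
        return T
    if role == "denoise":
        return step
    raise ValueError(f"unknown role: {role}")


def schedule(mode, num_steps=5):
    """Return (action_timestep, next_obs_timestep) pairs for a sampling run."""
    plan = MODES[mode]
    steps = [T - i * (T // num_steps) for i in range(num_steps)]
    return [
        (timestep_for(plan.action, s), timestep_for(plan.next_obs, s))
        for s in steps
    ]


def training_timesteps(has_actions, rng):
    """Draw independent timesteps; action-free video pins the action to pure noise."""
    t_obs = rng.randint(1, T)
    t_act = rng.randint(1, T) if has_actions else T
    return t_act, t_obs


if __name__ == "__main__":
    for mode in MODES:
        print(mode, schedule(mode))
    print(training_timesteps(has_actions=False, rng=random.Random(0)))
```

Under these assumptions, a single network serves every mode: only the pair of timesteps passed alongside the inputs changes, which is also what lets unlabeled video contribute to training without any action labels.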