Abstract: Action-conditioned video prediction models (often referred to as world models) have shown strong potential for robotics applications, but existing world models are often slow and struggle to capture accurate physical interactions over long horizons, which limits their use for scalable robot policy training and evaluation. We present Interactive World Simulator, a framework that builds interactive world models from a moderate-sized robot interaction dataset to enable scalable robot policy training and evaluation. In our experiments, the resulting world models 1) produce physically accurate pixel-level predictions and 2) support stable long-horizon interaction for more than 10 minutes at 15 FPS on a single RTX 4090 GPU. Our framework uses data collected by interacting with the world models to train imitation policies, including Diffusion Policy, Action Chunking Transformer, and π-series policies. Through extensive real-world evaluation on tasks involving rigid objects, deformable objects, object piles, and their interactions, we find that imitation policies trained on world-model-generated data perform comparably to policies trained on the same amount of real-world data. We also evaluate the same policies both within the world models and in the real world, and observe a strong correlation between the two. Together, these results establish Interactive World Simulator as a stable and physically consistent surrogate that enables scalable robotic data generation and faithful, reproducible policy evaluation.
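As an illustration of how the agreement between world-model and real-world evaluation could be quantified, the following is a minimal sketch that computes the Pearson correlation between paired per-policy scores (for example, success rates). The abstract does not state which correlation measure is used, so the choice of Pearson correlation, the function name `pearson_correlation`, and its arguments are assumptions made for illustration only.

```python
import math
from typing import Sequence


def pearson_correlation(sim_scores: Sequence[float], real_scores: Sequence[float]) -> float:
    """Pearson correlation between paired scores.

    sim_scores[i] and real_scores[i] are assumed (hypothetically) to be the
    score of the same policy, measured in the world model and in the real world.
    """
    if len(sim_scores) != len(real_scores):
        raise ValueError("score lists must have the same length")
    n = len(sim_scores)
    if n < 2:
        raise ValueError("need at least two paired scores")

    mean_sim = sum(sim_scores) / n
    mean_real = sum(real_scores) / n

    # Unnormalized covariance and variances; the 1/n factors cancel in the ratio.
    cov = sum((s - mean_sim) * (r - mean_real) for s, r in zip(sim_scores, real_scores))
    var_sim = sum((s - mean_sim) ** 2 for s in sim_scores)
    var_real = sum((r - mean_real) ** 2 for r in real_scores)

    if var_sim == 0 or var_real == 0:
        raise ValueError("correlation is undefined when one set of scores is constant")
    return cov / math.sqrt(var_sim * var_real)
```

A value close to 1 would indicate that policies ranked highly in the world model also tend to rank highly in the real world. A rank-based measure could be substituted if only the ordering of policies matters.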