FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

Donghu Kim, Youngdo Lee, Minho Park, Kinam Kim, Takuma Seno, I. Made Aswin Nahrendra, Sehee Min, Daniel Palenicek, Florian Vogt, Danica Kragic, Jan Peters, Jaegul Choo, Hojoon Lee

Paper ID 99

Session Control & Dynamics

Posters are presented in the poster session that follows their oral presentation. Poster locations have not been assigned.

Abstract: Simulation-based reinforcement learning (RL) is central to robotic control when expert demonstrations are unavailable. However, scaling RL to high-dimensional robots remains challenging. On-policy methods such as PPO are reliable but require large amounts of simulation because they discard past data. Off-policy methods can reuse experience and are more sample-efficient, but they often become unstable in high-dimensional control due to critic errors that are amplified during bootstrapped updates. We introduce FlashSAC, a fast and stable off-policy RL algorithm for high-dimensional robotic control. FlashSAC improves training stability in two ways: (1) it explicitly bounds weight, feature, and gradient norms to limit critic error amplification, and (2) it increases data coverage through large-scale parallel simulation, a high-capacity replay buffer, and strong exploration. These design choices preserve the sample efficiency of off-policy learning while improving training stability. Across 50+ state-based and vision-based tasks in 10 simulators, FlashSAC consistently surpasses PPO and strong off-policy baselines in both final performance and wall-clock efficiency, with larger gains on higher-dimensional tasks. In sim-to-real humanoid walking, FlashSAC reduces training time from hours to minutes while maintaining stable real-world deployment. Our results show that stabilizing off-policy learning enables scalable sim-to-real RL for high-dimensional robotic systems.
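
The abstract does not say how the weight, feature, and gradient norms are bounded. As a rough illustration only, the sketch below shows two generic ways a norm can be bounded: rescaling a vector to unit L2 norm (applicable to a weight row or a feature vector) and clipping a gradient by its L2 norm. The function names, the unit-norm target, the learning rate, and the clipping threshold are all assumptions made for this example and are not taken from FlashSAC.

```python
import math


def l2_norm(vec):
    """Euclidean (L2) norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def project_to_unit_norm(vec, eps=1e-8):
    """Rescale vec to (approximately) unit L2 norm.

    Hypothetical stand-in for bounding a weight or feature norm.
    """
    n = l2_norm(vec)
    return [x / (n + eps) for x in vec]


def clip_by_norm(grad, max_norm):
    """Scale grad down so that its L2 norm is at most max_norm."""
    n = l2_norm(grad)
    if n <= max_norm:
        return list(grad)
    scale = max_norm / n
    return [g * scale for g in grad]


def bounded_sgd_step(weights, grad, lr=0.1, max_grad_norm=1.0):
    """One SGD step with a clipped gradient, then a weight-norm projection.

    Illustrative assumption: both bounds are applied on every update.
    """
    clipped = clip_by_norm(grad, max_grad_norm)
    updated = [w - lr * g for w, g in zip(weights, clipped)]
    return project_to_unit_norm(updated)


if __name__ == "__main__":
    # A very large gradient cannot move the weights far, and the
    # weight norm stays bounded after the update.
    new_w = bounded_sgd_step([3.0, 4.0], [100.0, 0.0])
    print(l2_norm(new_w))
```

In this toy setting, however large the raw gradient is, the update magnitude is capped by `lr * max_grad_norm` and the weight norm stays near one after each step, which is the kind of bound the abstract credits with limiting critic error amplification.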