RLux-VLA: A Unified and Efficient Framework for Reinforcement Learning of Vision-Language-Action Models


Hongzhi Zang, Mingjie Wei, Si Xu, Yongji Wu, Zhen Guo, Yuanqing Wang, Hao Lin, Peihong Wang, Hua Yuan, Yixian Zhang, Liangzhi Shi, Yuqing Xie, Zhexuan Xu, Zhihao Liu, Kang Chen, Wenhao Tang, Quanlu Zhang, Weinan Zhang, Chao Yu, Yu Wang

Paper ID 89

Session VLA Models

Poster session details TBA

Abstract: Recent advances in vision-language-action (VLA) models have motivated the extension of their capabilities to embodied settings, where reinforcement learning (RL) offers a principled way to optimize task success through interaction. However, existing methods remain fragmented: they lack both a unified platform for fair comparison across architectures and algorithms and an efficient system design for scalable training. To address these challenges, we introduce RLux-VLA, a unified and efficient framework for scalable RL training of VLA models. RLux-VLA provides a single interface that standardizes the integration of diverse VLA architectures, multiple RL algorithms, and heterogeneous simulators, which makes the framework extensible. For efficiency, the system adopts a flexible resource allocation architecture for the rendering, inference, and training workloads in RL pipelines. In particular, for GPU-parallelized simulators, RLux-VLA introduces a hybrid fine-grained pipeline allocation strategy that yields a 1.61x–1.88x training speedup. Models trained with RLux-VLA show consistent performance improvements of approximately 20–85% across multiple simulation benchmarks, including LIBERO, ManiSkill, and RoboTwin. Furthermore, we distill a set of training practices for effective RL-based VLA training. We position RLux-VLA as a foundational system for efficient, unified, and reproducible research in embodied intelligence.
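
The abstract does not show the interface itself, so the following is only a minimal sketch of what a unified interface of this kind might look like: a training loop that depends on three abstract roles (simulator, policy, algorithm) rather than on concrete implementations. Every name here (`Simulator`, `Policy`, `Algorithm`, `ToyReachSim`, `GainPolicy`, `StepGain`, `rollout`, `train`) is hypothetical and is not taken from RLux-VLA, and the update rule is a placeholder, not an RL algorithm from the paper.

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple

Transition = Tuple[float, float, float]  # (observation, action, reward)


class Simulator(Protocol):
    """Hypothetical simulator role: any environment exposing reset/step."""
    def reset(self) -> float: ...
    def step(self, action: float) -> Tuple[float, float, bool]: ...


class Policy(Protocol):
    """Hypothetical policy role: maps an observation to an action."""
    def act(self, observation: float) -> float: ...


class Algorithm(Protocol):
    """Hypothetical algorithm role: updates a policy from one trajectory."""
    def update(self, policy, trajectory: List[Transition]) -> None: ...


@dataclass
class ToyReachSim:
    """Toy 1-D task (assumption for illustration): move a point to the origin."""
    start: float = 4.0
    horizon: int = 10

    def reset(self) -> float:
        self.pos, self.t = self.start, 0
        return self.pos

    def step(self, action: float) -> Tuple[float, float, bool]:
        self.pos += action
        self.t += 1
        done = abs(self.pos) < 1e-6 or self.t >= self.horizon
        return self.pos, -abs(self.pos), done  # reward is negative distance


@dataclass
class GainPolicy:
    """Toy stand-in for a VLA model: proportional controller with one weight."""
    gain: float = 0.5

    def act(self, observation: float) -> float:
        return -self.gain * observation


@dataclass
class StepGain:
    """Placeholder update rule (not PPO/GRPO): raise the gain while return < 0."""
    step: float = 0.25

    def update(self, policy: GainPolicy, trajectory: List[Transition]) -> None:
        if sum(r for _, _, r in trajectory) < 0:
            policy.gain = min(1.0, policy.gain + self.step)


def rollout(sim: Simulator, policy: Policy) -> List[Transition]:
    """Collect one episode using only the Simulator and Policy roles."""
    obs, done, trajectory = sim.reset(), False, []
    while not done:
        action = policy.act(obs)
        next_obs, reward, done = sim.step(action)
        trajectory.append((obs, action, reward))
        obs = next_obs
    return trajectory


def train(sim: Simulator, policy: Policy, algo: Algorithm, iterations: int) -> List[float]:
    """Generic loop: any simulator, policy, and algorithm can be swapped in."""
    returns = []
    for _ in range(iterations):
        trajectory = rollout(sim, policy)
        returns.append(sum(r for _, _, r in trajectory))
        algo.update(policy, trajectory)
    return returns


if __name__ == "__main__":
    print(train(ToyReachSim(), GainPolicy(0.5), StepGain(0.25), iterations=3))
```

Because `train` and `rollout` reference only the three roles, replacing `ToyReachSim` with a different environment wrapper or `StepGain` with a different update rule requires no change to the loop, which is the kind of extensibility the abstract describes.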