Semantically Structured Mixture-of-Experts for Compositional Robotic Manipulation

Chengyu Deng, Guanqi Chen, Chen Yizhou, Zejia Liu, Zhiwen Ruan, Guanhua Chen, Jia Pan

Paper ID 57

Session Manipulation 2

Poster session details TBA

Abstract: Diffusion-based policies have established a new standard for precise robotic manipulation but face a critical scalability bottleneck: high-performance models are computationally expensive, while lightweight alternatives often fail to generalize across diverse multi-task environments. Mixture-of-Experts (MoE) architectures offer a promising path to efficiency by activating only a subset of parameters. However, existing MoE routing mechanisms typically rely on low-level noise or latent statistics and ignore the compositional nature of manipulation tasks. The result is redundant experts that fail to capture reusable skills, which limits both interpretability and transfer. We introduce Semantically Structured Mixture-of-Experts Diffusion Policy for Compositional Robotic Manipulation (SMoDP), a framework that grounds expert specialization in semantic task structure. SMoDP uses a lightweight inference-time skill predictor, distilled from offline vision-language model (VLM) annotations, to route action chunks to experts specialized for specific behavioral phases. To ensure robust assignment, we propose a dual contrastive alignment strategy that grounds multi-modal observations in language-defined skill semantics (inter-modal) while enforcing routing consistency across visually distinct but functionally identical behaviors (intra-modal). Our approach achieves state-of-the-art performance on multi-task benchmarks with significantly improved parameter efficiency and demonstrates effective compositional transfer to novel tasks through parameter-efficient fine-tuning.
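
To make the routing idea in the abstract concrete, the following is a minimal sketch of skill-conditioned hard (top-1) routing, not the authors' implementation. It assumes that a skill predictor has already produced per-chunk logits over a fixed set of skills, that there is exactly one expert per skill, and that each expert can be stood in for by a single linear map; in the actual method the experts would be parts of a diffusion policy network. The function name `route_by_skill` and all shapes are hypothetical.

```python
import numpy as np


def route_by_skill(skill_logits, chunk_features, expert_weights):
    """Send each action chunk to the expert matching its predicted skill.

    skill_logits:   (batch, num_skills) scores from a hypothetical skill predictor
    chunk_features: (batch, dim) features of each action chunk
    expert_weights: (num_skills, dim, dim) one linear map per expert
                    (an assumption: a stand-in for a real expert sub-network)
    """
    # Top-1 routing: the predicted skill index selects exactly one expert.
    skill_ids = np.argmax(skill_logits, axis=-1)

    out = np.empty_like(chunk_features)
    for k in range(expert_weights.shape[0]):
        mask = skill_ids == k
        if mask.any():
            # Only the chunks assigned to skill k touch expert k's parameters.
            out[mask] = chunk_features[mask] @ expert_weights[k]
    return out, skill_ids


# Illustrative usage with random data (all sizes are arbitrary).
rng = np.random.default_rng(0)
num_skills, dim, batch = 4, 8, 6
expert_weights = rng.normal(size=(num_skills, dim, dim))
chunk_features = rng.normal(size=(batch, dim))
skill_logits = rng.normal(size=(batch, num_skills))

routed, skill_ids = route_by_skill(skill_logits, chunk_features, expert_weights)
print(routed.shape, skill_ids)
```

Because the expert index comes from a semantic skill label rather than from noise level or latent statistics, two chunks that belong to the same behavioral phase are processed by the same parameters regardless of how different they look, which is the property the intra-modal alignment term is described as encouraging.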