PartInstruct: Part-level Instruction Following for Fine-grained Robot Manipulation


Yifan Yin, Zhengtao Han, Shivam Aarya, Shuhang Xu, Jianxin Wang, Jiawei Peng, Angtian Wang, Alan Yuille, Tianmin Shu

Paper ID 148

Session 16. Manipulation III

Poster Session (Day 4): Tuesday, June 24, 12:30-2:00 PM

Abstract: Fine-grained robot manipulation, such as lifting and rotating a bottle to display the label on the cap, requires robust reasoning about object parts and their relationships with intended tasks. Despite recent advances in training general-purpose robot manipulation policies guided by language instructions, there is a notable lack of large-scale datasets for fine-grained manipulation tasks with part-level instructions and diverse 3D object instances annotated with part-level labels. In this work, we introduce PartInstruct, the first large-scale benchmark for both training and evaluating fine-grained robot manipulation models using part-level instructions. PartInstruct comprises 513 object instances across 14 categories, each annotated with part-level information, and 1302 fine-grained manipulation tasks organized into 16 task classes. We generated a training set that includes over 10,000 expert demonstrations synthesized in a 3D simulator, each annotated with an overall task instruction, a chain of basic part-based skill instructions, and ground-truth 3D information about the object and its parts. Additionally, we designed a comprehensive test suite to evaluate the generalizability of learned policies across new states, objects, and tasks. We evaluated several state-of-the-art vision-language policy learning methods for robot manipulation on our benchmark. The experimental results reveal that current models struggle to robustly ground part concepts in 3D vision and motion planning, and face challenges when manipulating object parts in long-horizon tasks.
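To make the annotation structure described above concrete, here is a minimal sketch of what a single expert demonstration's record might look like, based only on the abstract's description (an overall task instruction, a chain of basic part-based skill instructions, and ground-truth 3D information about the object and its parts). The field names and layout below are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative sketch only: names and structure are assumptions,
# not the actual PartInstruct data format.

@dataclass
class SkillStep:
    """One basic part-based skill instruction in the chain."""
    instruction: str   # e.g. "grasp the bottle by its body"
    target_part: str   # part label the skill refers to


@dataclass
class Demonstration:
    """One expert demonstration, as characterized in the abstract."""
    task_instruction: str                                   # overall task instruction
    skill_chain: List[SkillStep] = field(default_factory=list)
    object_category: str = ""                                # one of the 14 object categories
    part_geometry: Dict[str, list] = field(default_factory=dict)  # ground-truth 3D info per part


# Hypothetical example record for the task mentioned in the abstract.
demo = Demonstration(
    task_instruction="Lift and rotate the bottle to display the label on the cap",
    skill_chain=[
        SkillStep("grasp the bottle by its body", "body"),
        SkillStep("lift the bottle", "body"),
        SkillStep("rotate the bottle so the cap label faces the camera", "cap"),
    ],
    object_category="bottle",
)

print(demo.task_instruction, "->", len(demo.skill_chain), "skill steps")
```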