Concept2Robot: Learning Manipulation Concepts from Instructions and Human Demonstrations

Lin Shao (Stanford University); Toki Migimatsu (Stanford University); Qiang Zhang (Shanghai Jiao Tong University); Kaiyuan Yang (Stanford University); Jeannette Bohg (Stanford University)


We aim to endow a robot with the ability to learn manipulation concepts that link natural language instructions to motor skills. Our goal is to learn a single multi-task policy that takes as input a natural language instruction and an image of the initial scene and outputs a robot motion trajectory to achieve the specified task. This policy has to generalize over different instructions and environments. Our insight is that we can approach this problem through Learning from Demonstration by leveraging large-scale video datasets of humans performing manipulation actions. Thereby, we avoid more time-consuming processes such as teleoperation or kinesthetic teaching. We also avoid having to manually design task-specific rewards. We propose a two-stage learning process where we first learn single-task policies through reinforcement learning. The reward is provided by scoring how well the robot visually appears to perform the task. This score is given by a video-based action classifier trained on a large-scale human activity dataset. In the second stage, we train a multi-task policy through imitation learning to imitate all the single-task policies. In extensive simulation experiments, we show that the multi-task policy learns to perform a large percentage of the 78 different manipulation tasks on which it was trained. The tasks are of greater variety and complexity than previously considered robot manipulation tasks. We show that the policy generalizes over variations of the environment. We also show examples of successful generalization over novel but similar instructions.
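The two-stage process described in the abstract can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: `classifier_score` is a hypothetical stand-in for the video-based action classifier, each "trajectory" is reduced to a single scalar parameter, stage 1 uses plain hill climbing in place of the reinforcement learning algorithm used in the paper, and the multi-task policy is a one-weight-per-instruction model rather than a network conditioned on language and images.

```python
import random


def classifier_score(task_id, param):
    """Hypothetical stand-in for the video-based action classifier.

    Returns a score in (0, 1] that peaks when the scalar trajectory
    parameter matches a hidden per-task optimum (assumed: 0.1 * task_id).
    """
    target = 0.1 * task_id
    return 1.0 / (1.0 + (param - target) ** 2)


def train_single_task(task_id, rng, steps=500, step_size=0.1):
    """Stage 1 (toy): improve one task's parameter using only the score.

    Hill climbing is used here as a simple stand-in for reinforcement
    learning; it is not the algorithm from the paper.
    """
    param = 0.0
    best = classifier_score(task_id, param)
    for _ in range(steps):
        candidate = param + rng.gauss(0.0, step_size)
        score = classifier_score(task_id, candidate)
        if score > best:
            param, best = candidate, score
    return param


def train_multi_task(experts, epochs=200, lr=0.5):
    """Stage 2 (toy): one policy imitates every single-task policy.

    The policy holds one weight per instruction id (a one-hot linear
    model) and is fit by gradient steps on the squared imitation error.
    """
    weights = {task_id: 0.0 for task_id in experts}
    for _ in range(epochs):
        for task_id, expert_param in experts.items():
            error = weights[task_id] - expert_param
            weights[task_id] -= lr * error  # gradient of 0.5 * error**2
    return weights


rng = random.Random(0)
experts = {task_id: train_single_task(task_id, rng) for task_id in range(5)}
multi_task_policy = train_multi_task(experts)
```

The point of the sketch is the division of labor: stage 1 never sees a hand-written reward, only the classifier's score, and stage 2 never queries the classifier, only the single-task policies. A lookup-style policy like this one cannot generalize to new instructions; that ability would have to come from a shared model over instruction and image features.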

Live Paper Discussion Information

Start Time: 07/16 15:00 UTC
End Time: 07/16 17:00 UTC

Paper Reviews

Review 3

The paper contributes to the important area of multi-modal, multi-task learning. The exposition is clear and well positioned. The method is novel and technically sound. The evaluations are nicely constructed to assess each component of the method, although a more recent baseline would have strengthened the comparison. The parametrization of the trajectories is a clever way to ensure dynamic feasibility.

Questions:
- It is interesting that the model works without an attention layer. Could the authors comment on why that is?
- Why is CEM applied after Q-network learning in single-task learning?
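For context on the second question, one common way to combine the cross-entropy method (CEM) with a learned Q-function is to use CEM as a sampling-based maximizer over continuous actions. The sketch below shows that general pattern only; it is not taken from the paper. `q_value` is a hypothetical stand-in for a trained Q-network, and the population size, elite count, and iteration count are arbitrary illustrative choices.

```python
import random
import statistics


def q_value(state, action):
    """Hypothetical stand-in for a learned Q-network.

    Assumed for illustration: the value is highest at action = 2 * state.
    """
    return -(action - 2.0 * state) ** 2


def cem_select_action(q_fn, state, rng, iterations=10, population=64, elite_count=8):
    """Approximately maximize q_fn(state, action) over a scalar action.

    Each iteration samples actions from a Gaussian, keeps the
    highest-scoring ones, and refits the Gaussian to those elites.
    """
    mean, std = 0.0, 1.0
    for _ in range(iterations):
        samples = [rng.gauss(mean, std) for _ in range(population)]
        elites = sorted(samples, key=lambda a: q_fn(state, a), reverse=True)[:elite_count]
        mean = statistics.fmean(elites)
        std = statistics.pstdev(elites) + 1e-6  # keep the spread strictly positive
    return mean


rng = random.Random(0)
action = cem_select_action(q_value, state=0.5, rng=rng)
```

In this pattern the Q-function only scores candidate actions, so no gradient through the network with respect to the action is needed.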