AVID: Learning Multi-Stage Tasks via Pixel-Level Translation of Human Videos

Laura Smith, Nikita Dhawan, Marvin Zhang, Pieter Abbeel, Sergey Levine


Robotic reinforcement learning (RL) holds the promise of enabling robots to learn complex behaviors through experience. However, realizing this promise for long-horizon tasks in the real world requires mechanisms to reduce human burden in terms of defining the task and scaffolding the learning process. In this paper, we study how these challenges can be alleviated with an automated robotic learning framework, in which multi-stage tasks are defined simply by providing videos of a human demonstrator and then learned autonomously by the robot from raw image observations. A central challenge in imitating human videos is the difference in appearance between the human and robot, which typically requires manual correspondence. We instead take an automated approach and perform pixel-level image translation via CycleGAN to convert the human demonstration into a video of a robot, which can then be used to construct a reward function for a model-based RL algorithm. The robot then learns the task one stage at a time, automatically learning how to reset each stage to retry it multiple times without human-provided resets. This makes the learning process largely automatic, from intuitive task specification via a video to automated training with minimal human intervention. We demonstrate that our approach is capable of learning complex tasks, such as operating a coffee machine, directly from raw image observations, requiring only 20 minutes to provide human demonstrations and about 180 minutes of robot interaction.
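The stage-wise procedure described above can be illustrated with a minimal control-flow sketch. It is not the authors' implementation: it assumes a success classifier that returns a confidence for each attempt, a human who is asked to confirm a predicted success, and a learned reset behavior, all passed in as callables. The names `run_task`, `attempt_stage`, `reset_to_stage_start`, `ask_human`, `threshold`, and `max_attempts` are hypothetical.

```python
from typing import Callable, List, Tuple


def run_task(
    num_stages: int,
    attempt_stage: Callable[[int], float],
    reset_to_stage_start: Callable[[int], None],
    ask_human: Callable[[int], bool],
    threshold: float = 0.5,
    max_attempts: int = 100,
) -> Tuple[bool, List[Tuple[int, str]]]:
    """Work through the stages in order, retrying a stage after each failure.

    attempt_stage(stage) runs the policy for one stage and returns the
    classifier's confidence that the stage goal was reached (assumed API).
    """
    stage = 0
    log: List[Tuple[int, str]] = []
    for _ in range(max_attempts):
        if stage == num_stages:
            break  # every stage has been completed
        confidence = attempt_stage(stage)
        # Only query the human when the classifier already predicts success.
        if confidence >= threshold and ask_human(stage):
            log.append((stage, "advance"))
            stage += 1
        else:
            # No human reset: the robot returns to the start of the stage itself.
            reset_to_stage_start(stage)
            log.append((stage, "reset"))
    return stage == num_stages, log
```

In this sketch the human is involved only to confirm a success the classifier has already predicted, which is one simple way to keep interventions rare; the actual querying rule in the paper may differ.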

Live Paper Discussion Information

Start Time         End Time
07/14 15:00 UTC    07/14 17:00 UTC

Paper Reviews

Review 1

(Originality) The main originality in the work is the full system to go from videos of human demonstrations to a robot policy that can solve a multi-stage task. The authors break this problem up into several pieces. They propose to train an image translation model to translate from human demonstration videos to robot videos to account for differences in morphology. Next, they manually segment the human videos at stage boundaries into instruction images that are used for training a model-based RL agent to reach these instruction images and solve each part of a task in sequence. The authors leverage a latent-space model-based RL algorithm along with a learned classifier that provides rewards for reaching instruction images. They use the classifier confidence to dictate whether additional human labels of success and failure should be solicited, and whether to try to autonomously reset to the start of a stage to try again, in order to limit human intervention in the system. There have already been several works that tackle each of the aforementioned pieces, including using CycleGAN for image translation, latent-space MPC for model-based RL, doing RL with autonomous resets, and using a learned classifier as a reward function in real robot RL along with human queries. The main novelty of the paper is a system that combines all of these approaches, together with a demonstration of the efficacy of the system over alternative choices.

(Quality) The authors compare against a large set of baselines and ablations, and the empirical evaluations are good. They showcase the value of instruction images, of latent-space planning (compared to planning directly in image space), and of stage-wise training and resetting through the classifier (when compared to a method like BC). However, although the tasks shown are multi-stage, the action space seems to have been severely restricted, making the tasks significantly easier. The robot appears to move purely in a 2D vertical plane at a very low rate. While the authors mention that end effector velocity control is used, the robot appears to have just 3 dimensions of control: a delta position in 2 dimensions and a grasping signal. The quality of the final policies is also pretty poor (this is perhaps due to the frequency of control being low; the policy is very choppy in execution and seems to fail often).

(Clarity) The clarity of the paper is sufficient. A few parts of the method could use some more detail. For example, the MPC-CEM subroutine could use some more explanation.

(Significance) While the basic motivation is clear (to ease the burden on humans while allowing the system to learn as much as it can on its own from videos of human demonstrations and a modest amount of human labels), there are significant concerns about why the proposed methodology should be used in practice over alternatives. Only a subset of the CycleGAN-translated images is used: the instruction images. Consequently, the CycleGAN does not need to generate accurate translations everywhere; they only need to be accurate at the task segment boundaries. Furthermore, the ease of providing supervision for the CycleGAN is questionable when compared to an alternative such as kinesthetic teaching. The authors mention that only a modest number of human videos are used for training. In that case, it seems as though kinesthetic teaching could be used to just collect a variety of robot images at stage boundaries. The CycleGAN supervision already requires placing items into the robot's hand; it is a small step from there to moving the robot to the appropriate locations in the environment. It seems to me that not much technical human ability would be required to capture a modest number of frames of the robot in each of the stages, compared to ensuring that diverse and varied robot data is captured for training the full CycleGAN. That being said, the generated robot translations seem to be of surprisingly high quality.
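
As a generic illustration of the MPC-CEM subroutine mentioned in the Clarity comments, the following sketch shows the cross-entropy method (CEM) used inside a receding-horizon (MPC) loop. It is not the paper's latent-space planner: the point-mass dynamics, the cost on final distance to a goal, and all names and hyperparameters (`cem_plan`, `rollout_cost`, `run_mpc`, `population`, `num_elites`) are assumptions chosen only to keep the example self-contained.

```python
import numpy as np


def cem_plan(cost_fn, horizon, action_dim, iterations=10,
             population=200, num_elites=20, seed=0):
    """Cross-entropy method: refit a Gaussian over action sequences to the elites."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iterations):
        # Sample candidate action sequences around the current mean.
        noise = rng.standard_normal((population, horizon, action_dim))
        samples = mean + std * noise
        costs = np.array([cost_fn(seq) for seq in samples])
        # Keep the lowest-cost sequences and refit the sampling distribution.
        elites = samples[np.argsort(costs)[:num_elites]]
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6
    return mean


def rollout_cost(start, goal, actions):
    """Toy model (assumption): each action is a position delta; cost is final distance."""
    final_position = start + actions.sum(axis=0)
    return float(np.linalg.norm(final_position - goal))


def run_mpc(start, goal, steps=5, horizon=3):
    """Receding horizon: plan a full sequence, execute only its first action, replan."""
    position = np.array(start, dtype=float)
    goal = np.array(goal, dtype=float)
    for step in range(steps):
        plan = cem_plan(lambda a: rollout_cost(position, goal, a),
                        horizon, action_dim=len(position), seed=step)
        position = position + plan[0]
    return position
```

In a latent-space variant, `rollout_cost` would instead roll the candidate actions through a learned latent dynamics model and score them with a learned reward; those components are outside the scope of this sketch.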