Scaling data-driven robotics with reward sketching and batch reinforcement learning

Serkan Cabi (DeepMind); Sergio Gómez Colmenarejo (DeepMind); Alexander Novikov (DeepMind); Ksenia Konyushova (DeepMind); Scott Reed (DeepMind); Rae Jeong (DeepMind); Konrad Zolna (DeepMind); Yusuf Aytar (DeepMind); David Budden (DeepMind); Mel Vecerik (Deepmind); Oleg Sushkov (DeepMind); David Barker (DeepMind); Jonathan Scholz (DeepMind); Misha Denil (DeepMind); Nando de Freitas (DeepMind); Ziyu Wang (Google Research, Brain Team)


By harnessing a growing dataset of robot experience, we learn control policies for a diverse and increasing set of related manipulation tasks. To make this possible, we introduce reward sketching: an effective way of eliciting human preferences to learn the reward function for a new task. This reward function is then used to retrospectively annotate all historical data, collected for different tasks, with predicted rewards for the new task. The resulting massive annotated dataset can then be used to learn manipulation policies with batch reinforcement learning (RL) from visual input in a completely off-line way, i.e., without interactions with the real robot. This approach makes it possible to scale up RL in robotics, as we no longer need to run the robot for each step of learning. We show that the trained batch RL agents, when deployed in real robots, can perform a variety of challenging tasks involving multiple interactions among rigid or deformable objects. Moreover, they display a significant degree of robustness and generalization. In some cases, they even outperform human teleoperators.

Live Paper Discussion Information

Start Time End Time
07/16 15:00 UTC 07/16 17:00 UTC

Virtual Conference Presentation

Supplementary Video

Paper Reviews

Review 1

The dataset collected here has no description of data quality. The authors claim that there is a large number of mixed quality examples, however the details are a bit vague. It would help to understand, how much of data is teleoperated, how much is random, how much is interactive (within interaction, what fraction of time interaction happens). The authors also claim that multiple task labels can be provided to the same daat examples. both qualitative examples and quantitative assessment of performance of policies on off-task relabelled data would useful to understand the benefits. Also policy learning with only relabeled data should be compared to on-task data and both. Reward sketching and ranking based reward regression is indeed an insightful idea from this paper, at least novel in this context if not broadly. In the loss computation the reward model, currently there appear to be too many parameters -- 6! perhaps ranking loss should be pairwise ordering rather than a scalar difference? During the iterative data collection and retraining -- it is unclear what and how often the human annotators needs to intervene. "to fix a reward delusion...." -- what is the setup, and a supervision or all the outlier cases seems rather inefficient and ad hoc! Experiments: The tasks of choice are simpler than other similar scaled robotics datasets MIME and Roboturk. How would RL compare on these tasks? Training Detail: "We train multiple RL agent..." how many? how many roll-outs. The experimental evaluation suggests thats Non-distributional RL is a key component along with adding off-task data to the training set. Details of the non-distributional rl algorithmic setup (either with explicit deatils of the models and insights, or with code) will make this paper much more impactful. Other similar papers in the area of RL, such SAC have claimed similar successes -- "real robot rl in matter of hours". A quantitative comparison or atleast a discussion on the topic would help the reader place this work in the broadere context. Big picture concerns: The execution and evaulation step is -on-task+on-robot. how is the dataset useful if the evaluation setup (the robot setup) is not available? How does only the dataset, if at all, help the community to work on Batch RL, unless the bigger framework is released for use by independent researchers in their setups. Citations: Datasets: Multiple Interactions Made Easy (MIME) CoRL 2019, (real) -- Many tasks Roboturk, CoRL 2018 (simulated) and IROS 2019 (real) -- Stacking, cloth, search in cluttered bin -- See table in IROS paper for other datatsets RoboNet, CoRL 2019 (real) -- Pushing Batch RL: Off-policy policy gradient with state distribution correction.(Liu et al. 1904.08473) Empirical Study of Off policy Pollicy eval. (Voloshin,1911.06854) IRIS (Mandlekar, 1911.05321) Quality: Well done paper, with intruguing and interesting results. However the takeaway for the reader is uncertain, other than the fact that batch rl can be made to work! However the data itself is not useful per se, it would be the system architechture and implemetation details to reproduce such a system that would be really helpful. Clarity:Good and well written with well made figures, video, and descriptive language

Review 2

This work proposes a framework that enables continuous data collection through demonstrations, reward sketching and experience evaluation. It tries to tackle an important problem in robot learning: Large amounts of data are not available and very costly to get. Reward sketching is a very neat idea and it can be useful in many different applications. However, I have had a hard time understanding the contributions of the paper. This might be because the related work is given at the very end. Below are my comments to the authors: - A list of contribution in the Introduction would be really helpful. Reward sketching is definitely one (and a very good one), but it seems the batch RL method used in this paper is from existing literature. In that sense, the paper might be overselling itself, and it could be better to focus on the advantages of using reward sketching (possibly over other alternatives). - I appreciate the effort put in the experiments, they are really good. It is very interesting to see that data from other tasks and random trajectories significantly help learn a task. - How is the NES system innovative? As someone who is not familiar with the research on database systems, I don't see how the used NES is different from existing databases that allow SQL queries. Moreover, the description of NES in Section II is a little distractive for the readers as it has only loose connections with the real purpose of the paper, so I would suggest completely removing Section 2B. - Reward sketching may not always be possible, because humans are not always good at assigning reward values. The authors have already discussed this, which I appreciate. It would be also good to discuss alternative ways of how to achieve reward sketching. For example, would it be possible to do something similar to preference-based learning where the reward values are obtained through comparisons between the states at different time steps? - Is "annotating data from prior tasks for a new task" (Section 2C) always possible? When the two tasks are very different, one may be completely unrelated to the other, and the humans may not be able to sketch rewards. - In the Experiments, it was not clear to me whether the human sketches reward for all the trajectories from random_watcher. If not, how are the data that will be sketched selected? Below are my minor comments: - The second paragraph of Section 2 uses enumeration from A to G, whereas Fig. 5 uses 1-7 for the corresponding list. It would be better to use the same format. - In Section 2A, should "optionally" be replaced with "occasionally"? Otherwise it may mean that human has the option to always provide unsuccessful examples. - Typo in Section 2A: interprete --> interpret - Ref 18: It seems there is a syntax error in the list of authors. The first two authors are combined and misordered.

Review 3

This paper does not have many truly original ideas, but has very good quality, clarity, and significance. The system described is quite sensible and its implementation is clear. I do not have many details comments, except to say that the experiments demonstrating the effectiveness of this system are quite well done and make this a likely significant work as one of the first to demonstrate such impressive results with large scale demonstration collection and Batch RL on real robots. The main drawback I see in this paper is a poor discussion of the human elements of the systems (teleoperation and reward sketching). It is never discussed how much prior practice or overall expertise the humans involved in the experiments had, and in general no discussion of interface considerations are present -- from a user study perspective, this is quite poor compared to prior works in Learning from Demonstrations literature. Another significant flaw is that the related work section is a bit too heavy on recent RL and does not give enough credit to related work in Learning from Demonstrations or data collection via teleoperation; in particular it is surprising that RoboTurk ( is not cited or compared to, and more generally many other approaches to using annotations and demonstrations from humans are not cited (