Compositional Transfer in Hierarchical Reinforcement Learning


Markus Wulfmeier (DeepMind); Abbas Abdolmaleki (Google DeepMind); Roland Hafner (Google DeepMind); Jost Tobias Springenberg (DeepMind); Michael Neunert (Google DeepMind); Noah Siegel (DeepMind); Tim Hertweck (DeepMind); Thomas Lampe (DeepMind); Nicolas Heess (DeepMind); Martin Riedmiller (DeepMind)

Abstract

The successful application of general reinforcement learning algorithms to real-world robotics is often limited by their high data requirements. We introduce Regularized Hierarchical Policy Optimization (RHPO) to improve data efficiency in domains with multiple dominant tasks and ultimately reduce required platform time. To this end, we employ compositional inductive biases on multiple levels and corresponding mechanisms for sharing off-policy transition data across low-level controllers and tasks, as well as for scheduling of tasks. The presented algorithm enables stable and fast learning in complex, real-world domains, in both the parallel multitask and the sequential transfer setting. We show that the investigated types of hierarchy enable positive transfer while partially mitigating negative interference, and we evaluate the benefits of additional incentives for efficient, compositional task solutions in single-task domains. Finally, we demonstrate substantial gains in data efficiency and final performance over competitive baselines in a week-long, physical robot stacking experiment.
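To make the compositional structure concrete, below is a minimal sketch of a policy with the factorization described in the abstract and reviews: a set of low-level Gaussian components shared across all tasks, and a task-conditioned high-level categorical that mixes them. The PyTorch parameterization, module names (CompositionalPolicy, high_level, low_level), network sizes, and component count are illustrative assumptions, not taken from the paper.

```python
# Sketch only: a mixture policy with task-independent low-level Gaussian
# components and a task-conditioned high-level controller over them.
import torch
import torch.nn as nn
import torch.distributions as D


class CompositionalPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, num_tasks, num_components, hidden=256):
        super().__init__()
        # Low-level components: shared across tasks, each maps state -> Gaussian action.
        self.low_level = nn.ModuleList(
            nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, 2 * action_dim))
            for _ in range(num_components)
        )
        # High-level controller: task-conditioned categorical over the components.
        self.high_level = nn.Sequential(
            nn.Linear(state_dim + num_tasks, hidden), nn.ReLU(),
            nn.Linear(hidden, num_components)
        )

    def forward(self, state, task_onehot):
        logits = self.high_level(torch.cat([state, task_onehot], dim=-1))
        means, log_stds = [], []
        for component in self.low_level:
            mean, log_std = component(state).chunk(2, dim=-1)
            means.append(mean)
            log_stds.append(log_std.clamp(-5, 2))
        means = torch.stack(means, dim=-2)            # [batch, K, action_dim]
        stds = torch.stack(log_stds, dim=-2).exp()
        # Task-dependent mixture over task-independent Gaussian components.
        mixture = D.Categorical(logits=logits)
        components = D.Independent(D.Normal(means, stds), 1)
        return D.MixtureSameFamily(mixture, components)
```

Sampling from the returned distribution first draws a component from the task-conditioned categorical and then an action from that shared Gaussian component, mirroring the high-level/low-level split that the reviews below describe.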

Live Paper Discussion Information

Start Time: 07/15 15:00 UTC
End Time: 07/15 17:00 UTC

Virtual Conference Presentation

Paper Reviews

Review 1

The authors propose an interesting approach to HRL and present a well-thought-out approach to the problem of compositionally learning skills and their sequencing. The preliminaries and methods sections are well written and describe the approach clearly, and extensive simulation and ablation studies are provided. The approach is promising and I look forward to future iterations of it.

Regarding the technical content of the paper, I have two comments that may improve the paper if addressed.

First, the authors acknowledge that the number of components has to be specified externally and demonstrate robustness to this choice in the appendix. However, the tasks studied and tested on are all within distribution, and I wonder how well the approach would work for out-of-distribution tasks or variations. For example, if the block sizes were to change or their physical properties were altered, or if the task were to build pyramids instead of a vertical stack, or to balance a block on its edge as its side rests on another block, how well would the learned sub-policies generalize?

Second, the approach's sample efficiency is clearly demonstrated when significant sequential sub-tasks are required to achieve a specific overall task (see Fig. 3 and Fig. 4, lower panels). I am curious why the authors did not choose to benchmark their approach against the three works cited below rather than the monolithic SAC or SAC-Independent. Comparing policies with additional information (the existence and number of components) to those without would be expected to yield performance increases (which is great); however, it does not necessarily demonstrate state-of-the-art results. I understand that implementing other people's work may prove a significant challenge, but it would be great if it were possible to see the relative performance of the approaches.

I am also curious whether, in the cases where the policies do not asymptote to the same expected return, the optimal solution is actually found, or whether a refactoring/change of components might yield a better policy with higher expected return.

References:
- Kulkarni, T. D., Narasimhan, K., Saeedi, A., and Tenenbaum, J. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 3675-3683, 2016.
- Vezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., and Kavukcuoglu, K. FeUdal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 3540-3549, 2017.
- Nachum, O., Gu, S. S., Lee, H., and Levine, S. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3303-3313, 2018.

Review 3

This paper proposes a way to train a hierarchical policy with compositional structure from off-policy data and in a multi-task setting. The proposed structure factorizes the policy into a high-level one, which depends on the task index and determines a discrete option, and a low-level one, which is conditioned on the high-level options but independent of the tasks. To make training tractable, the paper proposes to first estimate a non-parametric policy as an intermediate target from off-policy data (SAC-U learning), and then perform an EM-style update to fit the parametric policy towards this target. Both stages adopt trust-region-like constraints to regularize the policy update for robust optimization. The results, on piling and cleaning blocks in both simulation and the real world, demonstrate the effectiveness of the proposed compositional hierarchy and policy-shift constraints.

The clarity of the paper is okay. The authors are trying to condense many things to respect the page limit, referring to many previous works and a lengthy appendix. I don't have much experience with the specific techniques that are used as building blocks, so I didn't find this an easy read, but I was still able to get the main messages.

The topic of improving data efficiency in robot learning is important, and the idea of composing/transferring modular sub-task skills is not groundbreakingly novel. However, the algorithm is developed with many practical considerations, such as allowing for off-policy data, a flexible policy form, trust-region constraints, and fewer asynchronous actors. These give the algorithm the potential to be used as a standard framework.

The experiments seem thorough to me, especially the week-long real-robot experiment. I also appreciate the experimental details and the assessment of regularization sensitivity, which facilitate reproduction for those with sufficient hardware. I roughly checked the derivations in the paper and appendix; they looked all right to me.

It is not clear to me how the point made in the introduction (Page 1, 4th paragraph) is supported: "(3) switching between the execution of policies for different tasks within a single episode leads to effective exploration". There seems to be no related discussion of exploration effectiveness in the experiments section.

The results on multitask learning and on composing low-level policies to solve a new task show different difficulty levels across sub-tasks. It would be great if the authors could discuss the potential of non-uniform task sampling, e.g., in the context of curriculum learning. Also, reporting statistics over 3 runs concerns me a little; I would suggest trying 5 or 10 random seeds, at least for the simulation experiments.

Other minor issues:
1. Annotations in Fig. 2 and Appendix Figs. 10 & 11 are hardly readable.
2. Appendix, Page 1, last paragraph, "where \mathcal{D}...": should be \mathcal{T}.
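Following the reviewer's summary of the two training stages, here is a rough sketch of a single policy-improvement step along those lines, assuming an MPO-style formulation: actions sampled from the previous policy are reweighted by exponentiated Q-values to form a non-parametric target, and the parametric hierarchical policy is then fit to these weighted samples under a KL penalty to the previous policy. The temperature eta, coefficient alpha, sample count, critic interface, and all function names are illustrative assumptions; the paper's exact objective and constraint handling may differ.

```python
# Sketch only: non-parametric E-step followed by a KL-regularized parametric M-step.
import torch


def rhpo_style_update(policy, old_policy, critic, states, task_onehots,
                      num_action_samples=20, eta=1.0, alpha=1e-3):
    # E-step: sample actions from the previous policy and reweight them by
    # exponentiated Q-values to obtain a non-parametric improved target.
    with torch.no_grad():
        old_dist = old_policy(states, task_onehots)
        actions = old_dist.sample((num_action_samples,))   # [N, batch, action_dim]
        # The critic is assumed to broadcast states/tasks over the sample dimension.
        q_values = critic(states, actions, task_onehots)   # [N, batch]
        weights = torch.softmax(q_values / eta, dim=0)     # normalized per state

    # M-step: fit the parametric hierarchical policy to the reweighted samples via
    # weighted maximum likelihood, with a Monte Carlo KL estimate to the previous
    # policy acting as a trust region.
    new_dist = policy(states, task_onehots)
    weighted_nll = -(weights * new_dist.log_prob(actions)).sum(dim=0).mean()
    kl_estimate = (old_dist.log_prob(actions) - new_dist.log_prob(actions)).mean()
    return weighted_nll + alpha * kl_estimate
```

Here `policy` and `old_policy` are assumed to return a torch distribution over actions, e.g. the task-conditioned mixture sketched after the abstract above, so the same update applies to both the high-level weighting and the shared low-level components through the mixture's log-likelihood.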