Residual Policy Learning for Shared Autonomy


Charles Schaff, Matthew R. Walter

Abstract

Shared autonomy provides an effective framework for human-robot collaboration that takes advantage of the complementary strengths of humans and robots to achieve common goals. Many existing approaches to shared autonomy make restrictive assumptions that the goal space, environment dynamics, or human policy are known a priori, or are limited to discrete action spaces, preventing those methods from scaling to complicated real-world environments. We propose a model-free, residual policy learning algorithm for shared autonomy that alleviates the need for these assumptions. Our agents are trained to minimally adjust the human’s actions such that a set of goal-agnostic constraints is satisfied. We test our method in two continuous control environments: Lunar Lander, a 2D flight control domain, and a 6-DOF quadrotor reaching task. In experiments with human and surrogate pilots, our method significantly improves task performance without any knowledge of the human’s goal beyond the constraints. These results highlight the ability of model-free deep reinforcement learning to realize assistive agents suited to continuous control settings with little knowledge of user intent.
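To make the residual formulation described in the abstract concrete, the following is a minimal sketch of how such a copilot can combine a human command with a learned correction at each control step. It is an illustrative reading of the abstract, not the authors' implementation; the function residual_policy, the action bounds, and the array shapes are assumptions.

import numpy as np

# Assumed continuous action bounds (e.g., normalized thrust commands).
ACTION_LOW, ACTION_HIGH = -1.0, 1.0

def residual_policy(state, human_action):
    """Placeholder for a learned policy pi(a_r | s, a_h); a trained network would go here."""
    return np.zeros_like(human_action)  # no correction by default

def shared_control_step(state, human_action):
    """Execute the human's command plus the copilot's residual correction, clipped to valid bounds."""
    correction = residual_policy(state, human_action)
    return np.clip(human_action + correction, ACTION_LOW, ACTION_HIGH)

During training, the residual policy would be optimized to keep the goal-agnostic constraints satisfied (e.g., avoiding crashes) while penalizing large corrections, so that the executed action stays close to the human's intent.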

Live Paper Discussion Information

Start Time: 07/16 15:00 UTC
End Time: 07/16 17:00 UTC

Virtual Conference Presentation

Paper Reviews

Review 1

Summary: This paper develops a method for shared autonomy between a human and a robot. Prior work generally assumes that the agent has access to information about the set of goals for the task or the environment dynamics. The authors’ approach does not rely on this information and instead aims to learn a policy that augments the human while satisfying some constraints. They add the agent’s action as a residual correction to the human’s actions and use a formulation based on constrained MDPs to ensure that the agent satisfies the constraints encoded by a reward function R_{general} (a generic form of this objective is sketched after this review). The method is evaluated on multiple domains with both simulated and real human users, highlighting that having an agent copilot a human leads to significantly better performance than having the human act alone.

Originality: The work has some original components, such as taking existing methods (e.g., residual policy learning) and applying them to a shared autonomy scenario. However, there is a lot of work in the space of human-agent joint RL, so the authors should spend more time discussing how this work differs from that body of work, not just from the most closely related papers (e.g., Reddy et al. and Broad et al.).

Clarity: The paper is well-written and clear. The figures are well done (e.g., Figure 2).

Quality: The paper is of high quality. The experiments show that the agent copilot helps the human in both simulated and real human settings. There could have been more baselines to show that the agent is really helping at the right points (more below), such as comparisons with randomly assisting agents or agents that provide bad assistance. One point to fix, though, is that the introduction and other places in the text slightly oversell the work by claiming that it does not require any information about goals, environment dynamics, etc., when the main goal of the paper is to support the human in the task. In this setting, the agent is not learning the goals autonomously; it is instead deferring to the human, which is not what one expects when reading the abstract and introduction.

Significance: The work is significant for the community, as the method can support effective shared autonomy between humans and robots. I appreciate that the authors included both simulated-human and real-human experiments, as that leads to a stronger evaluation and results that generalize better.

Other comments:
- Related to the point in the quality section, there is a statement about prior work often “requir[ing] access to demonstrations or the user’s policy for achieving each goal,” but this work also requires similar data from humans. I recommend being careful, here and throughout the paper, about overselling. The paper seems to be about developing an agent that can assist humans rather than an agent that can act under unknown goals and dynamics.
- In Results, this point comes up again, where the authors state that the agent generalizes despite having no explicit or implicit knowledge of the task. This is debatable, as the agent does have R_{general} and the rest of the task-specific components are handled by a human. This statement should be revised to make this clear.
- For the experiments, the authors compared a human acting alone against a human acting with the copilot. It would have been nice to include baselines such as random assistance (randomly providing assistance to the human at about the same frequency as the assisted condition) and bad assistance (assisting at exactly the moments the assisted condition would not).
- Before the conclusion, the authors say that the agent responded slowly to user commands. Why was this? Was it because the algorithm ran slowly, or did the authors introduce a time delay in the experiments?
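For readers unfamiliar with the constrained-MDP formulation summarized above, a generic objective of this kind can be written as follows (the sketch referenced in the summary). The symbols a^h_t (human action), a^r_t (residual correction), and the threshold C_0 are illustrative choices; the paper’s exact equations are not reproduced here.

\min_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t}\,\lVert a^{r}_{t} \rVert^{2}\right]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t}\, R_{\mathrm{general}}\!\left(s_{t},\, a^{h}_{t} + a^{r}_{t}\right)\right] \;\ge\; C_{0},
\qquad a^{r}_{t} \sim \pi(\cdot \mid s_{t},\, a^{h}_{t})

That is, the copilot keeps its corrections as small as possible subject to the goal-agnostic reward R_{general} (e.g., staying upright, avoiding crashes) remaining above a threshold.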

Review 2

Regarding originality, this paper provides a novel perspective on shared autonomy, in that it focuses on training a task-agnostic agent that can effectively help human users without having to know the user's intent. This can be applied to any domain that has constraints (e.g., avoid collisions, stay upright) that are independent of what the goal is. The algorithm itself is not especially novel: it combines residual policy learning with constrained policy optimization, both of which have been proposed by previous work (a generic Lagrangian relaxation of such a constrained objective is sketched after this review). On that note, it is worth briefly citing related work on constrained policy optimization in the paper, e.g., [1] and [2]. In addition, the contribution of the paper is somewhat overstated. Even though the agent itself is model-free and thus does not learn a model of the environment dynamics, the simulator used for training implicitly contains knowledge of the environment dynamics. In particular, there are no guarantees that the agent will work if the dynamics at test time (e.g., in the real world) differ from those in the simulator. This limits the ability to use this approach for real-world applications.

Regarding quality and clarity, this paper is well-motivated, written very well, and easy to understand. Both the approach and the experimental design are explained clearly, and the experiments are well-designed. The results show that agents trained with this approach indeed help humans perform better, both in the task that the agent was trained on (i.e., Lunar Lander and Drone Reacher) and in a new task (Lunar Reacher). The latter shows that the agent is indeed task-agnostic, since the goals in Lunar Reacher are quite different from those in Lunar Lander.

There are a few details missing that are necessary for reproducibility; these could be included in an appendix if there is not enough space in the main paper. Specifically, what was the weighting on the different task-agnostic reward components? Is the learning sensitive to how these weights are chosen? And how were the shaping terms used in Lunar Lander and Drone Reacher computed?

I also have a few concerns regarding experiment design. First, the baseline is weak. The only comparison is to an agent that does not assist the human at all. It would be beneficial to compare against agents trained with different human models (e.g., laggy or noisy). Even better, it would be great to compare against Reddy et al. (citation [41] in the paper) in the Lunar Lander environment, as a best-case scenario of how helpful a non-task-agnostic agent can be, and perhaps even take an agent trained with that approach in Lunar Lander and see how it performs in assisting people in Lunar Reacher. Second, why wasn't a single agent trained on all models of human behavior together (i.e., noisy, laggy, and all imitation policies)? It seems that this would result in an agent that is more robust than the agents considered in the paper, each of which is trained on only one of those three models of human behavior. Finally, a drawback is that the user sample is gender-biased (all male) and possibly age-biased (what was the standard deviation in age? Please include this in the paper as well).

Minor comments / questions:
- The direction of the inequality sign for the constraint is greater-than in Equations 6 and 7 but less-than-or-equal-to in Equation 3, which is inconsistent.
- Why did users assisted by the trained agent experience more timeouts in Drone Reacher? Participants said the agent "responded slowly to user commands," but why was this the case in Drone Reacher and not in Lunar Lander or Lunar Reacher?
- Figure 4 plots the success vs. crash vs. timeout proportions; I would also like to see a plot of rewards and an analysis of whether those differences are statistically significant.

[1] Tessler et al. Reward constrained policy optimization. ICLR 2019.
[2] Bohez et al. Value constrained model-free continuous control. https://arxiv.org/abs/1902.04623
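As context for the constrained-policy-optimization connection raised in this review (the relaxation referenced above), reward-constrained methods such as [1] typically handle a constraint of the form J_C(\pi) \le \alpha by optimizing a Lagrangian. The generic form below is illustrative only and is not taken from the paper under review.

L(\lambda, \pi) \;=\; J_{R}(\pi) \;-\; \lambda\left(J_{C}(\pi) - \alpha\right),
\qquad \text{solved as} \quad \min_{\lambda \ge 0}\; \max_{\pi}\; L(\lambda, \pi),
\quad \text{with} \quad
J_{R}(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t} R(s_{t}, a_{t})\right],
\quad
J_{C}(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t} C(s_{t}, a_{t})\right]

Here \lambda is a Lagrange multiplier that grows when the constraint is violated and shrinks otherwise, which is how approaches like [1] and [2] trade off task reward against constraint satisfaction.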