Authors: Dorsa Sadigh, Anca Dragan, Shankar Sastry, Sanjit Seshia
Our goal is to efficiently learn reward functions encoding a human’s preferences for how a dynamical system should act. There are two challenges with this. First, in many problems it is difficult for people to provide demonstrations of the desired system trajectory (like a high-DOF robot arm motion or an aggressive driving maneuver), or to even assign how much numerical reward an action or trajectory should get. We build on work in label ranking and propose to learn from preferences (or comparisons) instead: the person provides the system a relative preference between two trajectories. Second, the learned reward function strongly depends on what environments and trajectories were experienced during the training phase. We thus take an active learning approach, in which the system decides on what preference queries to make. A novel aspect of our work is the complexity and continuous nature of the queries: continuous trajectories of a dynamical system in environments with other moving agents (humans or robots). We contribute a method for actively synthesizing queries that satisfy the dynamics of the system. Further, we learn the reward function from a continuous hypothesis space by maximizing the volume removed from the hypothesis space by each query. We assign weights to the hypothesis space in the form of a log-concave distribution and provide a bound on the number of iterations required to converge. We show that our algorithm converges faster to the desired reward compared to approaches that are not active or that do not synthesize queries in an autonomous driving domain. We then run a user study to put our method to the test with real people.