GTI: Learning to Generalize across Long-Horizon Tasks from Human Demonstrations

Ajay Mandlekar, Danfei Xu, Roberto Martín-Martín, Silvio Savarese, Li Fei-Fei


Imitation learning is an effective and safe technique to train robot policies in the real world because it does not depend on an expensive random exploration process. However, due to the lack of exploration, learning policies that generalize beyond the demonstrated behaviors is still an open challenge. We present a novel imitation learning framework to enable robots to 1) learn complex real-world manipulation tasks efficiently from a small number of human demonstrations, and 2) synthesize new behaviors not contained in the collected demonstrations. Our key insight is that multi-task domains often present a latent structure, where demonstrated trajectories for different tasks intersect at common regions of the state space. We present Generalization Through Imitation (GTI), a two-stage offline imitation learning algorithm that exploits this intersecting structure to train goal-directed policies that generalize to unseen start and goal state combinations. In the first stage of GTI, we train a stochastic policy that leverages trajectory intersections so that it can compose behaviors from different demonstration trajectories. In the second stage of GTI, we collect a small set of rollouts from the unconditioned stochastic policy of the first stage, and train a goal-directed agent to generalize to novel start and goal configurations. We validate GTI in both simulated domains and a challenging long-horizon robotic manipulation domain in the real world. Additional results and videos are available at
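
To make the intersecting structure described in the abstract concrete, here is a toy sketch. It is not the authors' implementation: it treats each demonstration as a sequence of discrete state labels, merges the demonstrations into a directed graph, and lists the start and goal pairs that become reachable once trajectories are allowed to cross at a shared state. The state names and the helpers `build_graph`, `reachable`, and `reachable_pairs` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque


def build_graph(demos):
    """Merge demonstrations into a directed graph over state labels."""
    graph = defaultdict(set)
    for demo in demos:
        for state, next_state in zip(demo, demo[1:]):
            graph[state].add(next_state)
    return graph


def reachable(graph, start):
    """Breadth-first search: every state reachable from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for next_state in graph.get(state, ()):
            if next_state not in seen:
                seen.add(next_state)
                queue.append(next_state)
    return seen


def reachable_pairs(demos):
    """All (start, goal) pairs connected through the merged graph."""
    graph = build_graph(demos)
    starts = {demo[0] for demo in demos}
    goals = {demo[-1] for demo in demos}
    return sorted(
        (s, g) for s in starts for g in goals if g in reachable(graph, s)
    )


# Two hypothetical demonstrations that share the state "mid".
demos = [
    ["start_A", "mid", "goal_A"],
    ["start_B", "mid", "goal_B"],
]
demonstrated = sorted((demo[0], demo[-1]) for demo in demos)
composed = reachable_pairs(demos)
novel = [pair for pair in composed if pair not in demonstrated]
print(novel)
```

With two demonstrations crossing at one shared state, the merged graph connects four start and goal pairs, two of which never appear in the data. The real problem is harder because states are continuous and high-dimensional, which is why a learned policy is needed rather than a graph search.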

Live Paper Discussion Information

Start Time: 07/15 15:00 UTC
End Time: 07/15 17:00 UTC

Paper Reviews

Review 2

Originality: The paper presents an original idea that addresses a problem that might not be well known to the wider community but definitely exists in the field of imitation learning. The authors also do a good job of providing a comprehensive overview of the related work.

Quality: This is a well-written paper that presents good results and a single, well-justified story. The authors don't try to overreach for additional contributions but rather clearly present the problem that they're interested in and show an approach that addresses this problem. I think that there are two aspects that slightly diminish the quality of the paper:
- the lack of strong baselines. The authors compare to well-known techniques that are destined to fail. However, they could introduce an LMP-based baseline [29] that should be able to better deal with the multi-modality in the data, as well as other versions of their method, e.g. where there is just a single Gaussian prior.
- the figures (esp. Fig. 2) are often confusing and don't provide additional help with understanding the concepts in the paper. For example, what do the colors refer to in Fig. 2? Which parts of the networks are shared? If this is a conditional VAE, why isn't the prior conditioned on the goal? The same applies to the algorithm boxes, which could be significantly shortened and made more accessible.

Clarity: The paper is relatively clear and well written but, as mentioned above, I think it could benefit a lot from better figures and algorithm boxes. There are also minor issues such as typos and duplicates in the Related Work section.

Significance: I would grade the significance of this paper as medium. It addresses a problem but it doesn't fully show comparisons to competitive baselines.
Here are a few suggestions on how to improve it:
- introduce comparisons to simpler versions of your method: a single Gaussian prior, or a single network that directly produces actions (even with the GMM prior)
- introduce comparisons to LMP [28] and GCBC with a multi-modal output
- the authors potentially missed an important point: even a multi-modal BC-like method might not be able to deal with the presented problem because of the mode-covering behavior of forward KL.
- discuss the comparison to Q-learning-based methods, which in principle should be able to merge trajectories like the ones presented in the paper. Why can't a GC-Batch RL method achieve the same result?
- remove IRIS in the Fig. 3 description. It's supposed to be GTI.
- the architecture of Stage 1 is very unclear given the current figure. I believe that the prior should be goal-conditioned. I would also suggest a comparison to a single-stage process with a multi-modal prior.
- the motivation of the paper is rather strong, but the authors cite the work of [28] as an example of large amounts of annotated demonstrations, which is not true, since it relies on unlabelled play data.
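
The mode-covering behavior of forward KL mentioned in the suggestions can be seen in a deliberately simplified setting. The sketch below is an illustration only, not an experiment from the paper: it fits a single Gaussian by maximum likelihood (which minimizes the forward KL from the empirical data distribution to the model) to one-dimensional "actions" drawn from two well-separated modes. The mode locations, widths, and sample counts are assumptions made for the example. It shows the basic effect with a unimodal model and does not settle the reviewer's stronger claim about multi-modal policies.

```python
import math
import random
import statistics


def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def expert_pdf(action):
    """Assumed bimodal expert: equal-weight modes at -2 and +2."""
    return 0.5 * normal_pdf(action, -2.0, 0.3) + 0.5 * normal_pdf(action, 2.0, 0.3)


rng = random.Random(0)  # fixed seed so the run is repeatable
actions = [rng.gauss(-2.0, 0.3) for _ in range(2000)]
actions += [rng.gauss(2.0, 0.3) for _ in range(2000)]

# Maximum-likelihood Gaussian fit: sample mean and population std.
mu_hat = statistics.fmean(actions)
sigma_hat = statistics.pstdev(actions)

print(f"fitted mean  = {mu_hat:.3f}")
print(f"fitted std   = {sigma_hat:.3f}")
print(f"expert density at fitted mean = {expert_pdf(mu_hat):.2e}")
print(f"expert density at a true mode = {expert_pdf(2.0):.2e}")
```

The fitted mean lands between the two modes and the fitted standard deviation stretches to cover both, so the model's most likely action is one the expert almost never takes.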

Review 3

The paper is generally well written and easy to follow. Below are some suggestions that will help improve the manuscript.

1) Section 3, definition of trajectory intersection: This definition is loosely stated and not well integrated with the rest of the paper. S_i^1 = S_j^2 will hardly ever hold in noisy stochastic systems, especially in high-dimensional systems with continuous states. The paper later goes on to use image observations instead of states. Equivalence in observation doesn't necessarily imply equivalence in states. These details are currently overlooked, and the paper would benefit from paying close attention to them.

2) GTI doesn't explicitly model the temporal structure of the demonstrations. Some temporal details, however, need further clarification. Section 3: "H timestep" and "T length subsequence" are mentioned without much clarity. It's unclear how "H/T" is chosen, and what its assumptions and constraints are with respect to the overall task horizon and the trajectory intersection point.

3) The idea of leveraging trajectory intersections to amplify novel behaviors is quite interesting and is backed by real-world experiments in a few tasks. The current tasks are at their bare minimum, consisting of 2 start states, 2 goal states, and 1 intersection point. The paper, however, lacks evidence on some critical questions.
a) As the complexity of tasks grows, there will be multiple intersection points. There is no clear evidence that appropriate intersection points can be identified and effectively leveraged in planning.
b) The intersection point is currently explicitly provided in the form of a green bowl. There is only one intersection point between the task and the goal. The intersection point is temporally equidistant from the start and the goal. These assumptions are not necessarily always true in real tasks.
There are multiple ways to address these questions. A suggestion here could be to advance the experimental setup by one level: 3 start states, 3 goal states, and 2 intersection points (one at an early stage of the task and one at a late stage of the task). This would be the minimal setup that can provide good insight into the method without significantly increasing the complexity of experimentation and engineering effort.

4) The paper could use some insights on the goal proposal model: it can provide goals that are out of distribution for the low-level controller, or even provide a physically implausible goal. How does the method get around these challenges?

5) Typos:
page 1: independently -> independent
page 1: "different phases" -> the phase of a demonstration isn't defined
page 3: "arrive at intersecting states from different goal" -> arrive at intersecting states from different initial states

6) Figure 3: GTI is quite noisy in reaching the goals. Some insights here would be helpful.
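
The concern in point 1 about exact state equality can be illustrated with a small sketch. This is not the paper's definition: it relaxes the equality test to a distance threshold, so that two trajectories count as intersecting wherever their states lie within `eps` of each other. The function `find_intersections`, the threshold value, and the two-dimensional toy trajectories are all hypothetical. The sketch also says nothing about the separate issue of observations versus underlying states.

```python
import math


def find_intersections(traj_a, traj_b, eps):
    """Index pairs (i, j) whose states are within `eps` in Euclidean distance."""
    return [
        (i, j)
        for i, state_a in enumerate(traj_a)
        for j, state_b in enumerate(traj_b)
        if math.dist(state_a, state_b) <= eps
    ]


# Two toy trajectories that pass near the same point, but never exactly through it.
traj_a = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
traj_b = [(0.0, 2.0), (1.02, 0.97), (2.0, 0.0)]

exact = find_intersections(traj_a, traj_b, eps=0.0)
approx = find_intersections(traj_a, traj_b, eps=0.05)

print("exact matches: ", exact)
print("within eps=0.05:", approx)
```

Exact equality finds no intersection even though the trajectories nearly cross, while a small tolerance recovers it. Choosing the tolerance, and a distance that is meaningful for images, is exactly the kind of detail the reviewer asks the authors to spell out.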