Any-point Trajectory Modeling for Policy Learning


Chuan Wen, Xingyu Lin, John Ian Reyes So, Kai Chen, Qi Dou, Yang Gao, Pieter Abbeel
Paper Website

Paper ID 92

Session 12. Robot learning foundation models

Poster Session day 3 (Thursday, July 18)

Abstract: Learning from demonstration is a powerful method for teaching robots new skills, and having more demonstration data often improves policy learning. However, the high cost of collecting demonstration data is a significant bottleneck. Videos, as a rich data source, contain knowledge of behaviors, physics, and semantics, but extracting control-specific information from them is challenging due to the lack of action labels. In this work, we introduce a novel framework, Any-point Trajectory Modeling (ATM), that utilizes video demonstrations by pre-training a trajectory model to predict future trajectories of arbitrary points within a video frame. Once trained, these trajectories provide detailed control guidance, enabling the learning of robust visuomotor policies with minimal action-labeled data. Across the 130 language-conditioned tasks we evaluated in both simulation and the real world, ATM outperforms strong video pre-training baselines by 80% on average. Furthermore, we show effective transfer learning of manipulation skills from human videos.
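The abstract describes a two-stage recipe: first pre-train a trajectory model on action-free video to predict future positions of arbitrary query points, then train a policy on a small action-labeled set conditioned on those predicted point tracks. The sketch below illustrates that structure under stated assumptions; it is not the authors' implementation. The module names (TrackPredictor, PointTrackPolicy), the MLP architectures, and all dimensions are hypothetical placeholders for the paper's actual trajectory model and track-guided policy.

```python
# Minimal sketch of the two-stage ATM idea from the abstract (assumed architecture).
import torch
import torch.nn as nn


class TrackPredictor(nn.Module):
    """Stage 1 (assumed): predict future 2D trajectories for arbitrary query points.

    Trained on action-free video: given a frame encoding, a task/language embedding,
    and N query points, regress each point's position over the next T steps.
    """

    def __init__(self, img_dim=512, lang_dim=384, horizon=16, hidden=256):
        super().__init__()
        self.horizon = horizon
        self.net = nn.Sequential(
            nn.Linear(img_dim + lang_dim + 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, horizon * 2),  # (x, y) offset per future step
        )

    def forward(self, img_feat, lang_feat, query_points):
        # img_feat: (B, img_dim), lang_feat: (B, lang_dim), query_points: (B, N, 2)
        B, N, _ = query_points.shape
        ctx = torch.cat([img_feat, lang_feat], dim=-1)        # (B, img_dim + lang_dim)
        ctx = ctx.unsqueeze(1).expand(B, N, ctx.shape[-1])    # broadcast to each query point
        x = torch.cat([ctx, query_points], dim=-1)            # (B, N, ctx + 2)
        offsets = self.net(x).view(B, N, self.horizon, 2)     # future (x, y) offsets per point
        return query_points.unsqueeze(2) + offsets            # absolute future positions


class PointTrackPolicy(nn.Module):
    """Stage 2 (assumed): a track-guided policy trained on a small action-labeled set.

    Conditions on the current frame features and the predicted point trajectories,
    which carry the control-relevant motion cues extracted from video.
    """

    def __init__(self, img_dim=512, n_points=32, horizon=16, action_dim=7, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + n_points * horizon * 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, img_feat, tracks):
        # tracks: (B, N, T, 2) produced by the (frozen) track predictor
        flat = tracks.flatten(start_dim=1)
        return self.net(torch.cat([img_feat, flat], dim=-1))  # predicted action


if __name__ == "__main__":
    B, N, T = 2, 32, 16
    tracker = TrackPredictor(horizon=T)
    policy = PointTrackPolicy(n_points=N, horizon=T)
    img_feat = torch.randn(B, 512)      # stand-in for a visual encoder output
    lang_feat = torch.randn(B, 384)     # stand-in for a language-instruction embedding
    points = torch.rand(B, N, 2)        # arbitrary query points in normalized image coords
    tracks = tracker(img_feat, lang_feat, points)
    action = policy(img_feat, tracks)
    print(tracks.shape, action.shape)   # (2, 32, 16, 2) (2, 7)
```

The separation mirrors the abstract's claim: the track predictor can be pre-trained on large unlabeled video corpora, while only the lightweight track-conditioned policy needs action-labeled demonstrations.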