Memory Retrieval in Visuomotor Policies for Long-Horizon Robot Control

Rutav Shah, Yisu Li, Femi Bello, Yuke Zhu, Roberto Martín-Martín

Paper ID 10

Session World Models & Memory

Posters presented in the poster session following their oral. Locations not assigned.

Abstract: General-purpose robots operating in partially observable environments such as homes require memory to support long-term autonomy. They must recall different types of past information, such as where objects were placed, which subtasks have already been completed by a human partner, and when an appliance was turned on. This capability requires an effective memory retrieval mechanism. However, hand-designed or heuristic-based retrieval methods often fail to generalize across tasks. Attention-based retrieval provides a promising alternative, as both queries and keys are learned from data without making task-specific assumptions. However, directly applying an attention-based memory retrieval mechanism in imitation learning introduces two key challenges: (1) the policy may learn spurious correlations between the information retrieved from the past and predicted actions, and (2) errors accumulated over time in the memory due to prediction inaccuracies, compounded by interactions with the environment, lead to model drift and cascading failures in long-horizon control. To address these challenges, we introduce HALO, a visuomotor policy equipped with an attention-based memory retrieval mechanism for long-horizon control. First, to mitigate spurious correlations, HALO leverages vision-language models (VLMs) by generating task-relevant question–answer pairs from demonstration trajectories and jointly training the policy with a video question–answering objective. This supervision encourages the retrieval module to focus on information that is relevant to the task. Second, to reduce the impact of accumulated errors in memory during closed-loop control, HALO uses sparse attention that restricts retrieval to only the most relevant parts of the history. Together, these components enable more reliable long-horizon control by guiding the policy to retrieve task-relevant information from up to two minutes of past experience.
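
As an illustration of what "sparse attention that restricts retrieval to only the most relevant parts of the history" can look like, here is a minimal sketch of top-k attention over a memory of key/value pairs. The abstract does not specify how HALO implements sparsity, so the top-k selection rule, the function name `topk_sparse_retrieval`, and the toy vectors are all assumptions made for this example, not the authors' method.

```python
import math


def topk_sparse_retrieval(query, keys, values, k):
    """Attend over only the k memory slots whose keys best match the query.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not keys or len(keys) != len(values):
        raise ValueError("keys and values must be non-empty and equal length")
    k = min(k, len(keys))
    scale = math.sqrt(len(query))
    # Scaled dot-product score between the query and every stored key.
    scores = [sum(q * x for q, x in zip(query, key)) / scale for key in keys]
    # Indices of the k highest-scoring slots; all other slots are ignored.
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected scores only (shifted by the max for stability).
    peak = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - peak) for i in top}
    total = sum(exps.values())
    weights = {i: e / total for i, e in exps.items()}
    # Weighted sum of the selected values.
    dim = len(values[0])
    out = [sum(weights[i] * values[i][d] for i in top) for d in range(dim)]
    return out, weights


# Toy memory: four past timesteps with 2-D keys and 1-D values (made-up data).
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0], [40.0]]
retrieved, weights = topk_sparse_retrieval([1.0, 0.0], keys, values, k=2)
```

With `k=2`, only the two slots whose keys align best with the query receive nonzero weight, so an erroneous entry elsewhere in the memory cannot influence the retrieved vector. That is the intuition behind limiting retrieval to a small part of the history, under the assumptions stated above.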