RAG-Diff: Adapting Diffusion Policies to Dynamic Constraints with Retrieval-Augmented Guidance

Ruolin Ye, Nayoung Ha, Shuaixing Chen, Qiandao Liu, Gavin Chen, Shaoyang Stassen, Mark Zolotas, Jose Barreiros, Tapomayukh Bhattacharjee

Paper ID 11

Session: World Models & Memory

Posters are presented in the poster session following their oral session. Locations are not assigned.

Abstract: Robots operating in unstructured environments must satisfy dynamic constraints that can change across tasks and even within a single execution. While diffusion policies can learn multimodal behaviors from demonstrations, adapting a trained policy at runtime to newly encountered or evolving constraints remains an open challenge.
We propose RAG-Diff, a runtime adaptation framework that leverages retrieval-augmented memory to steer a frozen transformer-based diffusion policy. RAG-Diff maintains PrefMem, a memory bank that stores vision-language embeddings together with (i) state-action snippets and (ii) constraint annotations. At test time, RAG-Diff queries PrefMem to retrieve the nearest entry and uses it to steer sampling in two complementary ways. First, I-Atten (in-place attention recomputation) inserts the retrieved snippet as additional cross-attention memory tokens and performs a classifier-free-guidance-style update, biasing denoising toward preference-consistent motion. Second, a predictive guidance mechanism incorporates the retrieved constraint parameters during diffusion sampling to discourage constraint violations.
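The retrieval and I-Atten steps can be illustrated with a toy sketch. It assumes cosine similarity as the PrefMem lookup metric, single-head attention with keys equal to values (no learned projections), and a scalar guidance weight `w`; `MemoryEntry`, `retrieve`, and `guided_attention` are hypothetical names, not the authors' code. The classifier-free-guidance-style update here combines the attention output computed without the retrieved tokens and the output computed with them appended as extra memory tokens.

```python
import math
from dataclasses import dataclass

Vector = list[float]


@dataclass
class MemoryEntry:
    """Hypothetical PrefMem entry: embedding, snippet tokens, constraint annotations."""
    embedding: Vector
    snippet_tokens: list[Vector]   # state-action snippet, assumed already in token space
    constraints: dict[str, float]  # e.g. {"max_force": 5.0}


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)


def retrieve(memory: list[MemoryEntry], query: Vector) -> MemoryEntry:
    """Return the single nearest entry by cosine similarity (assumed metric)."""
    return max(memory, key=lambda e: cosine(e.embedding, query))


def cross_attention(query: Vector, keys: list[Vector], values: list[Vector]) -> Vector:
    """Single-head scaled dot-product attention for one query token."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)  # subtract max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for i, vi in enumerate(v):
            out[i] += (w / total) * vi
    return out


def guided_attention(query: Vector, context: list[Vector],
                     retrieved_tokens: list[Vector], w: float) -> Vector:
    """CFG-style update: base + w * (with_memory - base)."""
    base = cross_attention(query, context, context)
    augmented = context + retrieved_tokens  # retrieved snippet as extra memory tokens
    with_mem = cross_attention(query, augmented, augmented)
    return [b + w * (m - b) for b, m in zip(base, with_mem)]


# Usage: retrieve the nearest entry, then bias one attention output toward it.
memory = [
    MemoryEntry([1.0, 0.0], [[0.0, 1.0]], {"max_force": 5.0}),
    MemoryEntry([0.0, 1.0], [[1.0, 0.0]], {"max_force": 2.0}),
]
entry = retrieve(memory, [0.9, 0.1])
print(entry.constraints)
print(guided_attention([1.0, 0.0], [[1.0, 0.0]], entry.snippet_tokens, w=1.5))
```

With `w = 0` the output equals the unconditioned attention, `w = 1` reproduces attention over the augmented context, and `w > 1` extrapolates further toward the retrieved snippet, mirroring how guidance weights behave in classifier-free guidance.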
To demonstrate the effectiveness of RAG-Diff, we focus on physical robot caregiving, a domain with personalized and time-varying constraints. We first benchmark on an adapted PushT environment in simulation with contact-force limits and region-to-avoid constraints. We then evaluate our method on a suite of physical caregiving tasks spanning diverse preference types: (i) interaction and affordance preferences in bed bathing, (ii) range-of-motion (ROM)-based assistance-level preferences in medicine delivery, (iii) semantic preferences in shelf cleaning, and (iv) trajectory preferences in feeding, both in RCareWorld simulation and on a real robot. We further conduct real-world user studies on the bed-bathing task. Results show that RAG-Diff improves both task success and constraint satisfaction compared to a range of baselines, including unguided diffusion and other guidance- or sampling-based variants.
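For constraints like the PushT contact-force limits and regions to avoid, a common form of cost guidance nudges the sample between denoising steps by descending the gradient of a constraint penalty. The sketch below assumes a circular region to avoid and a per-waypoint contact-force limit, with hypothetical parameter names (`center`, `radius`, `max_force`) standing in for retrieved constraint annotations. It is a generic gradient-guidance illustration, not the paper's predictive guidance mechanism, and the denoiser itself is omitted.

```python
import math

Waypoint = tuple[float, float, float]  # (x, y, commanded contact force), assumed layout


def constraint_penalty(traj: list[Waypoint], center: tuple[float, float],
                       radius: float, max_force: float) -> float:
    """Squared hinge penalties for entering the region and exceeding the force limit."""
    total = 0.0
    for x, y, f in traj:
        d = math.hypot(x - center[0], y - center[1])
        total += max(0.0, radius - d) ** 2
        total += max(0.0, f - max_force) ** 2
    return total


def guidance_step(traj: list[Waypoint], center: tuple[float, float],
                  radius: float, max_force: float, eta: float = 0.1) -> list[Waypoint]:
    """One gradient-descent step on constraint_penalty, applied waypoint by waypoint."""
    guided = []
    for x, y, f in traj:
        dx, dy = x - center[0], y - center[1]
        d = math.hypot(dx, dy)
        if 0.0 < d < radius:
            # grad of (r - d)^2 w.r.t. p is -2 (r - d) (p - c) / d; descending pushes p outward
            scale = 2.0 * eta * (radius - d) / d
            x, y = x + scale * dx, y + scale * dy
        if f > max_force:
            # grad of (f - f_max)^2 is 2 (f - f_max); descending lowers the force
            f = f - eta * 2.0 * (f - max_force)
        guided.append((x, y, f))
    return guided


# Usage: hypothetical retrieved annotation applied to a two-waypoint plan.
constraints = {"center": (0.0, 0.0), "radius": 1.0, "max_force": 5.0}
traj = [(0.5, 0.0, 6.0), (3.0, 0.0, 1.0)]
print(constraint_penalty(traj, **constraints))
for _ in range(20):
    traj = guidance_step(traj, **constraints)
print(constraint_penalty(traj, **constraints))  # shrinks toward 0
```

Waypoints that already satisfy both constraints are left untouched, so the penalty only reshapes the parts of the sample that would violate the retrieved limits.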