Abstract: Recent advances in robot learning have shown promise in enabling robots to perform a variety of manipulation tasks and generalize to novel scenarios. One of the key contributing factors to this progress is the scale of robot data used to train the models. To obtain large-scale datasets, prior approaches have relied on either demonstrations requiring high human involvement or engineering-heavy autonomous data collection schemes, both of which being challenging in scaling up the space of new tasks and skills needed for building generalist robots. To mitigate this issue, we propose to take an alternative route and leverage text-to-image foundation models widely used in computer vision and natural language processing to obtain meaningful data for robot learning without requiring additional robot data. Specifically, we make use of the state of the art text-to-image diffusion models and perform aggressive data augmentation on top of our existing robotic manipulation datasets via inpainting of various unseen objects for manipulation, backgrounds, and distractors with pure text guidance. Through extensive real-world experiments, we show that manipulation policies trained on the augmented data are able to solve completely unseen tasks with new objects and can behave more robustly w.r.t. novel distractors. In addition, we also find that we can improve the robustness and generalization of high-level robot learning tasks such as success detection through training with the diffusion-based data augmentation.