Spatial Action Maps for Mobile Manipulation

Jimmy Wu, Xingyuan Sun, Andy Zeng, Shuran Song, Johnny Lee, Szymon Rusinkiewicz, Thomas Funkhouser


Typical end-to-end formulations for learning robotic navigation involve predicting a small set of steering command actions (e.g., step forward, turn left, turn right, etc.) from images of the current state (e.g., a bird's-eye view of a SLAM reconstruction). Instead, we show that it can be advantageous to learn with dense action representations defined in the same domain as the state. In this work, we present "spatial action maps," in which the set of possible actions is represented by a pixel map (aligned with the input image of the current state), where each pixel represents a local navigational endpoint at the corresponding scene location. Using ConvNets to infer spatial action maps from state images, action predictions are thereby spatially anchored on local visual features in the scene, enabling significantly faster learning of complex behaviors for mobile manipulation tasks with reinforcement learning. In our experiments, we task a robot with pushing objects to a goal location, and find that policies learned with spatial action maps achieve much better performance than traditional alternatives.
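As a rough illustration of the idea in the abstract, the sketch below picks an action from a dense per-pixel value map by taking the highest-valued pixel and converting it to a navigational endpoint. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the robot sitting at the centre of the map, the "up is forward" orientation, and the `metres_per_pixel` scale are all hypothetical.

```python
from typing import Tuple

import numpy as np


def select_action(q_map: np.ndarray) -> Tuple[int, int]:
    """Return the (row, col) of the highest-valued pixel in a 2D value map."""
    flat_index = int(np.argmax(q_map))
    row, col = np.unravel_index(flat_index, q_map.shape)
    return int(row), int(col)


def pixel_to_offset(
    row: int, col: int, map_shape: Tuple[int, int], metres_per_pixel: float
) -> Tuple[float, float]:
    """Convert a pixel to a (forward, left) offset in metres.

    Assumption (hypothetical): the robot is at the map centre, image "up"
    is the robot's forward direction, and image "left" is the robot's left.
    """
    centre_row = (map_shape[0] - 1) / 2.0
    centre_col = (map_shape[1] - 1) / 2.0
    forward = (centre_row - row) * metres_per_pixel
    left = (centre_col - col) * metres_per_pixel
    return forward, left


# Toy value map standing in for the output of a fully convolutional network.
q_map = np.zeros((5, 5))
q_map[1, 3] = 1.0
row, col = select_action(q_map)
endpoint = pixel_to_offset(row, col, q_map.shape, metres_per_pixel=0.5)
```

Because the value map has the same spatial layout as the state image, each candidate action is tied to the visual features at its own location, which is the property the abstract credits for faster learning.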

Live Paper Discussion Information

Start Time: 07/15 15:00 UTC
End Time: 07/15 17:00 UTC

Paper Reviews

Review 1

This paper proposes an action representation (called ‘spatial action maps’) for robots learning to manipulate objects using deep reinforcement learning. The work is inspired by previous work on dense action representations. An agent is trained in simulation to push objects to a target location. While a standard algorithm (DDQN) is used for training, the policy is represented by a fully convolutional neural network. Experimental results against a few baselines show some promise for the proposed approach. The paper is generally well written, and I was very excited when reading Sections I and II. This excitement decreased from Section III onward. For example, the reward function assumes that the distance between the objects and the target location is known, which may not be the case in more complex scenarios. In addition, and most importantly, the experimental setting is a very toy-like environment. This environment assumes that the action representation is available in one image, which is a strong assumption in more complex (real-world) environments. Last but not least, the experiments do not compare against strong baselines. It would have been interesting to see methods X, Y and Z with and without spatial action maps, where the use of spatial action maps makes a substantial difference, not only in a toy environment but also in a more realistic one.
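For readers unfamiliar with the training algorithm the review mentions, the following is a minimal sketch of the standard double DQN target applied to dense value maps, where every pixel is one candidate action. The function name, argument names, and the default discount factor are assumptions made for illustration; the paper's actual hyperparameters and reward are not given in this text.

```python
import numpy as np


def ddqn_target(
    reward: float,
    done: bool,
    next_q_online: np.ndarray,
    next_q_target: np.ndarray,
    gamma: float = 0.99,  # assumed discount factor, not taken from the paper
) -> float:
    """Standard double DQN target with one action per pixel.

    The online network's map chooses the best next pixel, and the target
    network's map supplies the value used for bootstrapping.
    """
    best_flat_index = int(np.argmax(next_q_online))
    bootstrap_value = float(next_q_target.ravel()[best_flat_index])
    if done:
        return float(reward)
    return float(reward + gamma * bootstrap_value)


# Toy 2x2 maps standing in for network outputs on the next state.
next_q_online = np.array([[0.0, 2.0], [1.0, 0.0]])
next_q_target = np.array([[5.0, 3.0], [7.0, 9.0]])
target = ddqn_target(1.0, False, next_q_online, next_q_target, gamma=0.5)
```

Here the online map selects pixel (0, 1), so the bootstrap value is read from the target map at that same pixel rather than from the target map's own maximum.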

Review 2

Originality: The authors propose a novel representation for actions in mobile manipulation settings and discuss several advantages of the proposed representation. They also present an empirical study that showcases the advantages of the approach. The action representation is indeed novel, to the best of my knowledge.

Quality: The results are impressive and the evaluation is thorough, especially the ablations. Figure 7 clearly shows the value of the proposed action space. Taken together, Tables 1, 2, 3, and 4 show the effect of different kinds of ablations (using straight-line paths instead of shortest paths in the movement primitives, using a fixed step size, etc.). A consistent trend is that the design choices make the most difference in the Large Divider environment. This makes sense, since the agent must navigate around the large divider in order to push all of the blocks successfully. The section on limitations of the approach is an important inclusion and is appreciated. As for the supplementary website, the videos are useful to watch. The emergent behavior of grouping items against the wall and then sweeping multiple objects with a long trajectory is indeed interesting, as discussed by the authors.

Clarity: Overall, the authors provide a very thorough explanation of their method, the state and action representations used, and their results. One minor point: the paper could use more details on how gradients are passed only through the state pixel corresponding to a selected action pixel.

Significance: Overall, this is a good paper that proposes a nice idea for an action space and presents thorough validation that the proposed action space outperforms other choices.
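On the reviewer's point about gradients passing only through the pixel of the selected action, one common way to achieve this is to compute the loss on that single pixel, so that every other pixel of the output map receives zero gradient. The sketch below shows this with a squared-error loss in NumPy. It is an illustration under assumptions, not the authors' code: the function name and the choice of squared error are hypothetical, and the paper's actual loss is not stated in this text.

```python
from typing import Tuple

import numpy as np


def selected_pixel_loss_and_grad(
    q_map: np.ndarray, action: Tuple[int, int], target: float
) -> Tuple[float, np.ndarray]:
    """Squared-error loss on the selected pixel and its gradient w.r.t. q_map.

    Only the pixel at `action` contributes to the loss, so the gradient is
    zero at every other location of the map.
    """
    row, col = action
    td_error = float(q_map[row, col] - target)
    loss = 0.5 * td_error ** 2
    grad = np.zeros_like(q_map, dtype=float)
    grad[row, col] = td_error  # d(0.5 * e^2) / dq = e at the selected pixel
    return loss, grad


# Toy 2x3 map standing in for a network's output; pixel (1, 2) holds 5.0.
q_map = np.arange(6, dtype=float).reshape(2, 3)
loss, grad = selected_pixel_loss_and_grad(q_map, action=(1, 2), target=3.0)
```

In an automatic-differentiation framework the same effect typically follows from indexing the output map at the chosen pixel before computing the loss, since backpropagation then reaches the network only through that one output location.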