Deep Visual Reasoning: Learning to Predict Action Sequences for Task and Motion Planning from an Initial Scene Image

Danny Driess, Jung-Su Ha, Marc Toussaint


In this paper, we propose a deep convolutional recurrent neural network that predicts action sequences for task and motion planning (TAMP) from an initial scene image. Typical TAMP problems are formalized by combining reasoning on a symbolic, discrete level (e.g. first-order logic) with continuous motion planning such as nonlinear trajectory optimization.Due to the great combinatorial complexity of possible discrete action sequences, a large number of optimization/motion planning problems have to be solved to find a solution, which limits the scalability of these approaches.To circumvent this combinatorial complexity, we develop a neural network which, based on an initial image of the scene, directly predicts promising discrete action sequences such that ideally only one motion planning problem has to be solved to find a solution to the overall TAMP problem.A key aspect is that our method generalizes to scenes with many and varying number of objects, although being trained on only two objects at a time.This is possible by encoding the objects of the scene in images as input to the neural network, instead of a fixed feature vector.Results show runtime improvements of several magnitudes.

Live Paper Discussion Information

Start Time End Time
07/14 15:00 UTC 07/14 17:00 UTC

Virtual Conference Presentation

Supplementary Video

Paper Reviews

Review 1

This paper addresses an important issue in manipulation planning, namely the fact that there's a combinatorial explosion as a function of plan length. This is traditionally addressed by a heuristic function and there is a growing body of work on learning heuristics for manipulation planning (aka TAMP). In this paper, the key novelty is formulating this learning problem as learning a convolutional RNN based on an image representation of the start state (a depth map with separate channels for object masks). Care has been taken with the learning setting so that the learned heuristic generalizes over number of objects in the scene, something which has been problematic for some earlier approaches. One observation is that the learning is being done with a very large data set of plans (for 30,000 scenes of two objects). Presumably because the appearances have to span the range of placements in the workspace and the relationship of the two objects. I note that in a more realistic setting, e.g. a mobile manipulation robot in a household, might require a prohibitive number of images to "span" it's operating space. In any case, this training is a substantial investment, so the question is does it pay back? That is, how well does it generalize? The authors show that adding other objects to the scene does not have a substantial impact on performance and they test for multiple goal locations. But all of these tasks have a very similar structure, i.e. the number of solution sequences (ignoring the discrete grasp choice) is relatively small, I believe (if it were a single arm there's only up to 3 copies of [grasp, place]). Most of the combinatorics comes from the choice of grasps (and arm). The paper shows that the vast majority of sequences are infeasible - can you give us some insight as to why? Is it due to kinematic limits? Presumably not due to motion planning failures in that simple setting. What is the network learning? The paper stresses that the approach mostly does away with search altogether. This seems an overly strong claim based on the limited testing. Yes, in their experiments there is little search needed, but the setting is limited. I would recommend toning down the claims to a more realistic level. Clarifications: 1. I'm assuming that the odd length sequences involve grasps with different arms, so the arm is another discrete parameter in the action. This also explains why there are 8 sequences of length 2 - 2 arms with 4 grasps each? 2. The initial handover illustration in Figure 1 does not seem to fit into the class of Fig 3, unless you have different grasps for each arm on the objects? 3. When counting sequences, is the assumption that only two objects can matter (the goal object and the one blocking the target)? 4. In Table I, is this the size of the search space or the number of solutions? The title of the table makes it sound as if it's the number of solutions, but huge numbers of solutions would argue that the problem is easy. 5. You need a "perfect" object detector as part of the framework; you should make it clear, especially when comparing to other methods for planning for image input. You are using images as a flexible representation for state, not really addressing realistic sensor-based manipulation. 6. Clarify the discussion on the relation to Q-functions. There's no uncertainty in action or effects being modeled, right? So, presumably it's a POMDP because the discrete actions are only partially specified?

Review 2

The paper is well written and presents an interesting variation of prior TAMP heuristic learning methods. Rather than learning feasibility of actions, this approach learns whether an action leads to the goal, and uses this as a task-level search heuristic instead. The learned model uses an image-space representation of the scene along with a clever parameterization of the action, manipulated object, and goal, which allows the method to generalize to arbitrary numbers of objects. The experiments are clear and show a significant benefit to using the proposed approach, but involve rather limited object-object interactions (cuboids and cylinders) in toy rearrangement scenarios. It is not hard to generate heuristics manually in this case, as all the images are top-down and all the objects can be grasped from top-down with very little interaction. I can imagine the approach breaking down when, for example, the goal requires multiple objects to be packed tightly together. The paper could be improved by more discussion about the limitations of learning. The real-robot experiments are not very illuminating, since real images are not being used. Basically, this is equivalent to a playback of a plan generated offline. It would be helpful to discuss how the approach could be used with real images. Minor comments: - "loosing" => "losing" - Fig 6 should have a legend, as it is not clear what bars are from which comparison group (especially if printed in B&W).

Review 3

This paper presents an interesting solution to overcome the exponential increase of computational complexity of TAMP approaches for long sequence lengths and large numbers of objects. The proposed method uses a recurrent neural network to predict sequences of high-level actions given an image of the scene at the first time-step and the goal. To generate the training data for the neural network, a large number of scenes and goal-states are generated programmatically and the corresponding problems are solved using an existing state-of-the-art TAMP approach. While the dataset generation is offline, allowing for significantly larger computational budgets than in the case of online-planning, it might still require unreasonable amounts of computation to generate plans in complex, real-world scenarios -- even if computation is offline. The authors should clarify whether physics simulation was used when running the quantitative evaluations or if only kinematics were considered. This would be especially interesting to know with regards to the generalization experiments with cylinders. To justify the claim of generalization to slightly different geometries, it would be good if the authors could add a real-world experiment with cylinders. It would be good if the authors could also discuss how the presented framework could be extended to objects with more complex shapes, without dramatically increasing the required amount of computation.