Deep Differentiable Grasp Planner for High-DOF Grippers

Min Liu (National University of Defense Technology); Zherong Pan (University of North Carolina at Chapel Hill); Kai Xu (National University of Defense Technology); Kanishka Ganguly (University of Maryland at College Park); Dinesh Manocha (University of North Carolina at Chapel Hill)


We present an end-to-end algorithm for training deep neural networks to grasp novel objects. Our algorithm builds all the essential components of a grasping system using a forward-backward automatic differentiation approach, including the forward kinematics of the gripper, the collisions between the gripper and the target object, and the metric for grasp poses. In particular, we show that a generalized Q1 grasp metric can be defined and is differentiable for the inexact grasps generated by a neural network, and that the derivatives of our generalized Q1 metric can be computed from a sensitivity analysis of the induced optimization problem. We also show that the derivatives of the (self-)collision terms can be computed efficiently from a low-quality watertight triangle mesh. Altogether, our algorithm either computes grasp poses for high-DOF grippers in an unsupervised mode with no ground-truth data, or improves the results in a supervised mode using a small dataset. Our new learning algorithm significantly simplifies the data preparation for learning-based grasping systems and leads to higher-quality learned grasps on common 3D shape datasets [7, 49, 26, 25], achieving a 22% higher success rate on physical hardware and a 0.12 higher value on the Q1 grasp quality metric.
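For readers unfamiliar with the baseline being generalized: the standard Ferrari-Canny Q1 metric is the radius of the largest origin-centered ball contained in the convex hull of the contact wrenches. Below is a minimal sketch of that standard metric, not the paper's generalized differentiable version; the friction coefficient, cone discretization, and torque scaling are placeholder choices.

```python
import numpy as np
from scipy.spatial import ConvexHull

def q1_metric(contact_points, contact_normals, mu=0.5, num_edges=8, torque_scale=1.0):
    """Classic Ferrari-Canny Q1: radius of the largest origin-centered ball
    inside the convex hull of contact wrenches, with each friction cone
    approximated by a pyramid with num_edges edges."""
    wrenches = []
    for p, n in zip(contact_points, contact_normals):
        n = n / np.linalg.norm(n)
        # build an orthonormal tangent basis at the contact
        t1 = np.cross(n, [1.0, 0.0, 0.0])
        if np.linalg.norm(t1) < 1e-6:
            t1 = np.cross(n, [0.0, 1.0, 0.0])
        t1 /= np.linalg.norm(t1)
        t2 = np.cross(n, t1)
        for k in range(num_edges):
            a = 2.0 * np.pi * k / num_edges
            f = n + mu * (np.cos(a) * t1 + np.sin(a) * t2)  # cone edge
            f /= np.linalg.norm(f)
            tau = torque_scale * np.cross(p, f)
            wrenches.append(np.concatenate([f, tau]))
    hull = ConvexHull(np.array(wrenches))
    # hull.equations rows are (normal, offset) with normal.x + offset <= 0
    # inside the hull; the origin's distance to each facet is -offset.
    # Q1 is the minimum facet distance, negative if the grasp is not in
    # force closure (origin outside the hull).
    return -np.max(hull.equations[:, -1])
```

The paper's contribution is precisely that this quantity, defined by an optimization problem, is made differentiable and extended to inexact (not-yet-in-contact) grasps.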

Live Paper Discussion Information

Start Time: 07/15 15:00 UTC
End Time: 07/15 17:00 UTC

Paper Reviews

Review 1

Strengths:
1) Because it is hard for grasp learning to predict grasping points that make exact contact with the target object, the paper proposes a generalized grasping metric with inexact contacts based on the Q1 metric from citation [15].
2) It is interesting to combine analytical grasp planning and deep learning for multi-fingered grasp planning. This makes it possible for the method to be used both for locally optimal grasp planning and for grasp deep learning.
3) The paper shows the proposed grasping metric is locally differentiable. It derives the sub-gradient of the generalized Q1 grasping metric in two different ways based on previous work (citations [41] and [13]).
4) It is nice to consider collision avoidance and forward kinematics in the grasp planning, in addition to the grasping metric.
5) The paper performs physical experiments on 50 YCB objects to show the proposed method outperforms an existing grasp planning work (citation [26]) in terms of grasp success rate.
6) The paper's writing is clear.
7) The supplementary video shows cool real-robot grasping demos, though objects are always placed in roughly the same location on the table.

Weaknesses:
1) The paper misses important related work [1][2][3][4], listed below. Multi-fingered grasp planning is formulated as a continuous optimization problem over a learned grasping-success function in [1][2]. In particular, [1] predates the work cited in this paper as introducing the optimization-based learning approach to grasping. [3] presents a multi-fingered grasp optimization approach leveraging a learned grasping function and analytical constraints such as the reconstructed object signed distance field. [4] proposes a grasp learning approach for objects in clutter with parallel-jaw grippers.
[1] Qingkai Lu, Mark Van der Merwe, Balakumar Sundaralingam, and Tucker Hermans. Multi-Fingered Grasp Planning via Inference in Deep Neural Networks. IEEE Robotics & Automation Magazine (RAM) Special Issue: Deep Learning and Machine Learning in Robotics.
[2] Qingkai Lu, Kautilya Chenna, Balakumar Sundaralingam, and Tucker Hermans. Planning Multi-Fingered Grasps as Probabilistic Inference in a Learned Deep Network. International Symposium on Robotics Research (ISRR), 2017.
[3] Mark Van der Merwe, Qingkai Lu, Balakumar Sundaralingam, Martin Matak, and Tucker Hermans. Learning Continuous 3D Reconstructions for Geometrically Aware Grasping. IEEE International Conference on Robotics and Automation (ICRA), 2020.
[4] Marcus Gualtieri, Andreas ten Pas, Kate Saenko, and Robert Platt. High Precision Grasp Pose Detection in Dense Clutter. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016, pp. 598-605.
2) There are at least five hyperparameters in the optimization objective function.
3) Qualitatively, the grasping results are on par with those of GraspIt! (citation [30]) rather than outperforming it.
4) Each YCB object is only tested for one trial, at roughly the same location, in the physical experiments. Testing each object over multiple trials at different object poses would make the results and conclusion more convincing.
5) The paper does not mention what cameras, and how many, are used for the physical experiments.
6) The experimental setup section says there is an offline testing set, but the paper does not report the offline testing performance (e.g., the loss values) on that set.
7) The proposed grasp learning method is limited to predicting a single successful grasp for each object, which might fail due to environment or task constraints.

Questions:
1) How do you choose the object pose for the physical experiments?
2) Are all tested YCB objects unseen during training?
3) The last sentence of the Experimental Results section is unclear: "our neural network failed on the 5 objects due to slippage". Does "neural network" refer to the grasp optimization with learning? Which 5 objects?
4) Is the grasping neural network with the ResNet-50 structure trained from scratch, or is it fine-tuned from an existing computer vision model?
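The sub-gradient machinery mentioned in strength 3) follows the standard pattern for optimal-value functions (Danskin-style): evaluate the derivative of the objective at the optimizer. A toy sketch of that pattern, with an illustrative quadratic objective that is not the paper's metric:

```python
import numpy as np

def optimal_value_and_subgradient(theta, xs):
    """For f(theta) = max_x g(x, theta), a subgradient of f is
    dg/dtheta evaluated at the maximizer x*. Here the toy objective is
    g(x, theta) = -(x - theta)^2 over a discrete candidate set xs."""
    g = -(xs - theta) ** 2
    i = int(np.argmax(g))          # maximizer x*
    value = g[i]
    subgrad = 2.0 * (xs[i] - theta)  # dg/dtheta at x*
    return value, subgrad
```

The generalized Q1 metric is likewise defined as the optimum of an induced problem, which is why its sub-gradient can be obtained from a sensitivity analysis at the optimizer.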

Review 2

The paper "Deep Differentiable Grasp Planner for High DoF Grippers" proposes a network architecture that takes as input a set of depth images of a single object from multiple viewpoints and outputs a grasp pose and finger configuration for the Shadow hand. The main contribution of the paper is the derivation of a loss function that is a differentiable version of the standard Ferrari/Canny metric Q. Specifically, the authors allow gradients to flow even when there is not yet contact between hand and object. They also handle self-collisions. The authors show that their algorithm can be used as a local grasp optimizer by differentiating the novel loss with respect to the grasp pose and joint configuration. They also show that their algorithm can directly regress to a grasp pose and configuration given depth images. Additionally, the authors demonstrate their approach on a real platform and compare to a prior method.

Strengths/Contribution:
- Derivation of a fully differentiable Q metric which can be used to (i) optimize the full hand configuration from an initial hand pose, thereby serving as a local grasp planner, and (ii) output a grasp pose and joint configuration directly.

Weaknesses:
- Related work:
  - The reviewed related work is mostly sufficient and mentions important pros and cons of prior approaches. However, the authors completely ignore the fact that many of the reviewed works actually solve much more complex grasping tasks than shown in this paper, especially grasping in clutter with more than just one object on a clean table. Furthermore, the authors also ignore that these works do not require multiple depth images but just one. While it is hard to directly compare these very different approaches, the authors should at least comment on these shortcomings of their approach relative to related work.
  - Minor: missing related work on high-DoF hands grasping single objects [a]; missing related work that is not sampling-based but uses a fully-convolutional architecture to make pixel-wise predictions for grasping in clutter [b]. The authors do not seem to be aware of this kind of work, which also seeks alternatives to sampling-based approaches.
- Experiments:
  - Because of the aforementioned two assumptions (single object, depth images from multiple viewpoints), it is very hard to see how this work can have impact in comparison to work that grasps complex objects in complex, cluttered scenes when provided only a depth or RGB image from a single viewpoint. It would have been helpful for comparison to prior work if the authors had focused less on the dexterous hand and more on the differentiation of the Q metric and how this newly gained loss helps to train superior learned grasp planners. For example, could a model that outputs a pose for a parallel-jaw gripper in a cluttered scene be trained better with this loss and achieve a higher success rate than a sampling approach like DexNet? As network architecture, the authors could assume something similar to [b], which predicts grasp quality and optimal pose per pixel in a depth image and works in real time. These experiments would let the reader really assess the impact of the work in relation to prior work. Once this is shown, the authors could also show the benefits of the approach for a dexterous hand. It is clear that, because it is a more brittle hand, it is difficult to control for grasping in clutter.
  - The authors make some vague claims about a comparison to a sampling-based method in Section VI.C. It is unclear whether the authors use the EigenGrasp planner from [30] or just sample a set of approach normals from the mesh surface and close the fingers. Furthermore, Fig. 7 only shows a few solutions of the proposed approach but no comparison to [30]. Therefore the statement that the results are on par is vague. It would have been more useful to compare to other sampling-based approaches that also use depth images as input. Also, the EigenGrasp planner (if used) uses a lower-dimensional space for the high DoF, so a sampling-based approach should not suffer too much in the 2D space of grasp synergies.

Other comments:
- Cite [a, b] appropriately, especially in relation to the paragraph in related work, page 3, column 1.
- The authors criticize sampling-based approaches for inferring a good grasp, yet they suffer from the problem that an object may be grasped equally well by multiple grasps while they ask their network to output only one. The authors solve this by pre-computing 100 grasps per training object and then using a chamfer-based loss to find representative grasps. They may want to emphasize this more in the related work. Furthermore, the way the training data is generated based on this data loss is unclear in the first part of Section V (specifically, "... pick which grasp pose is most representable" is unclear).
- In general, the paper is somewhat vague about when depth images are used in the experiments.
- Why does fine-tuning the network take so long and require so many more episodes than pre-training?
- "Success Plan" seems meaningless, as the geometry of the environment was not taken into account during training. How would any difference be a meaningful indicator of grasp quality for the two compared approaches?

References:
[a] Q. Lu, K. Chenna, B. Sundaralingam, and T. Hermans. "Planning Multi-Fingered Grasps as Probabilistic Inference in a Learned Deep Network." Int. Symp. on Robotics Research, 2017.
[b] Douglas Morrison, Juxi Leitner, and Peter Corke. "Closing the Loop for Robotic Grasping: A Real-time, Generative Grasp Synthesis Approach." RSS 2018.
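The chamfer-based loss the reviewer describes (100 pre-computed grasps per object, the network penalized only against the nearest one) is commonly implemented as a "min-of-N" loss. A minimal sketch under that assumption; the grasp parameterization and Euclidean distance here are placeholders, not the paper's exact formulation:

```python
import numpy as np

def min_of_n_grasp_loss(predicted, ground_truth_set):
    """Chamfer-style min-of-N loss: penalize the predicted grasp only
    against the closest of the N pre-computed ground-truth grasps, so a
    single-output network can commit to any one mode of a multi-modal
    target distribution.

    predicted:        (d,) vector (e.g. gripper pose + joint angles)
    ground_truth_set: (N, d) array of pre-computed grasps for the object
    """
    dists = np.linalg.norm(ground_truth_set - predicted, axis=1)
    nearest = int(np.argmin(dists))     # the "most representable" grasp
    return 0.5 * dists[nearest] ** 2, nearest
```

This is why the network only ever outputs one grasp per object: the loss selects, per training example, which of the N targets the prediction is matched to.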

Review 3

The authors present a grasp metric definition that is fully differentiable and includes forward kinematics and terms to avoid undesired collisions. The function can be applied in the context of an optimization framework as a grasp planner for known geometries, or in a learning framework. The experiments show improved performance in simulation and on a robotic platform compared to a prior version [26].

Contribution: The contributions are stated in the introduction. The paper presents a differentiable grasp metric that includes collision terms. It shows how to calculate its derivatives, and gives a method to compute the collision terms without using signed distance fields, having access only to a watertight triangle mesh of the target object. The paper is well written and the contributions are clearly stated. The method and results are clearly of interest to the wider community. A few recommended improvements follow:

Related Work:
- The following statement is incorrect: "All the existing grasp quality metrics have discontinuities". A lot of prior work has used optimization to find grasps using differentiable metrics, e.g. [30], "Grasping Unknown Objects by Exploiting Complementarity with Robot Hand Geometry" by Kiatos and Malassiotis, and "Robotic grasping of unmodeled objects using time-of-flight range data and finger torque information" by Maldonado et al. These are just a few examples off the top of my head; I am convinced a thorough literature search will reveal significantly more.
- "A grasp quality metric is only defined when the gripper and the target object have exact contact": again, an incorrect statement (see the metrics above). Another relevant line of work uses learned metrics, e.g. "Multi-Fingered Grasp Planning via Inference in Deep Neural Networks" by Lu et al.
- "This is in contrast with prior works [...]" (Sec. III): I think it makes sense to distinguish your approach from other works. But I think it would also help to show works that use the same or a similar formulation as you do. Otherwise the reader might be left with the impression that your approach is unique in that regard.

Method:
- Is $m$ (the scalar in matrix $M$, equation (1)) the constant that scales the importance of torques w.r.t. forces? If so, how do you choose it? (Often it is inversely proportional to the extents of the object.) Maybe add a sentence on what it means; right now you only write "user-provided metric tensor".

Experiments:
- The grasps all seem to be dominated by fingertip contacts (i.e., precision grasps), without any palm contacts (at least none are visible in the figures or video). I assume this is a side effect of the optimization scheme / a local minimum. Is there any idea of how power grasps could be favored?
- I can't find any comparable numbers from [30]. Instead you write: "We have compared the (standard) Q_1 metric of our method and that generated using sampling-based grasp planner [30] in Figure 7. The results show that the qualities of our grasp poses are on par with those of [30]." How is this shown? Fig. 7 shows a few examples, and they have nothing to do with [30]. Would it be possible to add the results of [30] to Table II?
- "Our depth cameras are calibrated beforehand to make the camera pose exactly same as the poses used for training." This sounds non-trivial. How can this be replicated exactly?
- Please add percentages for successes in Table II.

Video:
- How is success measured in the real-world robot trials? I am asking because in the video the banana is clearly slipping out of the hand, yet the video cuts to the next object (I assume this was counted as a success?).

Language / Typos: The paper is well written and easy to follow; two small typos:
- P. 7: "our planner provides locally optimal and"
- P. 7: "that generated using sampling-based grasp planner"
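On the reviewer's question about $m$: a common convention (an assumption here, not necessarily the paper's choice) is a diagonal metric tensor $M = \mathrm{diag}(1,1,1,m,m,m)$ with $m$ inversely proportional to the squared object extent, so that the force and torque components of a wrench are commensurate and the metric is invariant to object scale. A minimal sketch:

```python
import numpy as np

def wrench_norm(force, torque, object_extent):
    """Norm of a wrench (force, torque) under a diagonal metric tensor
    M = diag(1,1,1, m,m,m). The choice m = 1/extent^2 is one common
    convention (a placeholder here): torques scale with the lever arm,
    so dividing by the squared extent makes the two terms commensurate."""
    m = 1.0 / object_extent ** 2
    return np.sqrt(np.dot(force, force) + m * np.dot(torque, torque))
```

A sentence along these lines in the paper would answer how $m$ trades off torques against forces.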