OverlapNet: Loop Closing for LiDAR-based SLAM

Xieyuanli Chen, Thomas Läbe, Andres Milioto, Timo Röhling, Olga Vysotska, Alexandre Haag, Jens Behley, Cyrill Stachniss

Abstract

Simultaneous localization and mapping (SLAM) is a fundamental capability required by most autonomous systems. In this paper, we address the problem of loop closing for SLAM based on 3D laser scans recorded by autonomous cars. Our approach utilizes a deep neural network exploiting different cues generated from LiDAR data for finding loop closures. It estimates an image overlap generalized to range images and provides a relative yaw angle estimate between pairs of scans. Based on such predictions, we tackle loop closure detection and integrate our approach into an existing SLAM system to improve its mapping results. We evaluate our approach on sequences of the KITTI odometry benchmark and the Ford campus dataset. We show that our method can effectively detect loop closures surpassing the detection performance of state-of-the-art methods. To highlight the generalization capabilities of our approach, we evaluate our model on the Ford campus dataset while using only KITTI for training. The experiments show that the learned representation is able to provide reliable loop closure candidates, also in unseen environments.

Live Paper Discussion Information

	Start Time	End Time
	07/14 15:00 UTC	07/14 17:00 UTC

Virtual Conference Presentation

Supplementary Video

Paper Reviews

Review 1

This paper proposes an interesting solution to loop closure detection in 3D range data. It is very clearly written and generally easy to follow. The reported results are significant, with high recall rates and accurate angle and position estimates, in what appears to be a competitive run-time. I appreciate that you perform ablation studies and compare to different input modalities (Table V) and different variants (Table III). The related work section is mostly relevant and complete. However, I would recommend to also relate to recent work on learning-based place recognition of Sun et al. (Li Sun et al., "Localising Faster: Efficient and precise lidar-based robot localisation in large-scale environments", ICRA 2020, arXiv:2003.01875) Furthermore, I think that the claim that methods using handcrafted features are "therefore often scene-specific" needs to be better substantiated. Can you cite evidence that they do not generalise to different environments? In particular, your evaluation is on two similar urban datasets, so I think you cannot claim that OverlapNet is less scene specific without further substantiation. In Sec III-A you state that overlap percentage may be a better measure of loop closure than using the estimated spatial distance. This is certainly true, and is also the rationale for the many appearance-based methods (that you also cite in Sec II). In Fig 3, I was surprised to see that there is not even a local maximum at the correct position in (a). Could you please clarify why that is so? Regarding the comparison to learning-based methods (OREOS) it would be interesting to see not only KITTI sequence 00, but also other experiments from the OREOS paper; e.g., the NCLT dataset. As you also state in the paper, it appears that it is the use of additional sensor modalities that gives the greatest benefits for OverlapNet. Looking at the results in Tables IV and V, the depth-only OverlapNet has similar performance as OREOS (and higher variance in the yaw estimate) and lower than the method of Sun et al. (above). This is an interesting conclusion. Looking at the results for geometry only (and using prior pose information), the difference between OverlapNet and the baselines is not significant. The precision/recall curves for the Ford dataset in Fig 5b are indistinguishable ("ours", M2DP, and "histogram"), and rather close for KITTI (5a). In Table II, the numbers are also nearly indistinguishable. In Fig 7, what are the precision rates? (Reporting recall without precision is not very informative.) Regarding the execution time, please clarify if the reported 630/550 ms is for the whole dataset, and what the processing time is as a function of map size. Minor edits/typos: 1) Below eq (3): misformed sentence "using the maximum overall these overlaps" 2) Fig 3(c): "Groud" -> "Ground" 3) "the most 100 recent scans" -> "the 100 most recent scans" 4) Check capitalisation in the reference list ("Pointnetvlad") 5) Some arXiv URLs are missing in the reference list.

Review 2

The paper presents an interesting new technique for estimating the overlap of 3D lidar scans. This is an important problem for localization of automated vehicles in urban and suburban environment, and is made challenging due to the highly dynamic nature of these scenes, and susceptibility for aliasing due to many urban roads and intersections looking alike (e.g. Manhattan road layouts). The new method provides good initializations for ICP and integrates readily with pose graph optimization, yielding a capable system. Strong results are obtained for Kitti and impressively the technique operates successfully on the Ford campus dataset without retraining. The paper is written very clearly and conveys strong technical strength, with novel insights in the design of the Siamese network and the use of the spherical projection representation for lidar scan matching to achieve a lightweight approach. The paper has an extensive evaluation and this is one of the strengths of the paper. Each claim for the new technique is carefully buttressed with appropriate experimental results, including strong precision-recall curves, comparison against other methods, interesting qualitative results and a careful analysis of the yaw estimation errors and ICP registration errors. (A minor point, I'm a little surprised that the Kitti yaw estimation errors are as high as 20 degrees for the lower overlap percentage while only as high as 5 degrees for the Ford campus data set, this must be a property of one data set vs. the other?) The figures and tables are all prepared with great care, and the video is nicely done. It would be ideal to see how this technique worked in a true Manhattan-style environment (larger-scale map with repetitive grid-structure) and to see how it degrades when the aliasing is really bad (e.g. nested turns around and around parallel and perpendicular linear city blocks).

Review 3

The work takes a step toward removing the need for using state estimates in place recognition. I subscribe to the idea and do think SLAM approaches that rely on the current estimate and geometric nearest neighbor search, despite impressive results, are not a true general solution to the problem. This is simply due to the fact that if the drift is large, the algorithm is doomed to fail. In contrast, the detection of a match between two frames, here using LIDAR, has a better chance of adding useful loop-closures to the SLAM graph. My main concern is the claim of 3D SLAM which is not verified. The datasets and results are for cars. Therefore, estimating the relative yaw angle is sufficient to have a good result in ICP. I expect the method to be useful for a more general 3D SLAM problem where all motion axes are excited, but the paper claims it without showing it. But of course, the 3D LIDAR data is processed as input and name can be used anyway. I'm also interested in seeing the side views of Figure 6. Is the height estimation as good as the top view?