Tim Tang (University of Oxford); Daniele De Martini (University of Oxford); Shangzhe Wu (University of Oxford); Paul Newman (University of Oxford)
Publicly available satellite imagery can be a ubiquitous, cheap, and powerful tool for vehicle localisation when a prior sensor map is unavailable. However, satellite images are not directly comparable to data from ground range sensors because of their starkly different modalities. We present a learned metric localisation method that not only handles the modality difference, but is cheap to train, learning in a self-supervised fashion without metrically accurate ground truth. By evaluating across multiple real-world datasets, we demonstrate the robustness and versatility of our method for various sensor configurations. We pay particular attention to the use of millimetre-wave radar, which, owing to its complex interaction with the scene and its immunity to weather and lighting, makes for a compelling and valuable use case.
Start Time | End Time
---|---
07/15 15:00 UTC | 07/15 17:00 UTC
This paper addresses an important and interesting topic, and is very well written and clear. Overall it is a very good paper, but please find some comments below.

The related work section is generally clear and comprehensive. However, I would appreciate a discussion of the expected performance difference between the proposed method and, in particular, the methods cited in II-A (e.g. [18], [20]) and II-D ([35], [42], [43]). In connection with references [23] and [25], it would also make sense to cite Parsley and Julier ("Towards the Exploitation of Prior Information in SLAM", IROS 2010). In Section II-B it would make sense to also cite Mielle et al. ("The Auto-Complete Graph: Merging and Mutual Correction of Sensor and Prior Maps for SLAM", MDPI Robotics 2019) in connection with references [37], [7], [38].

Regarding the results shown in Fig. 10, it is said that the robot is never "getting lost", but that is a vague term. In several cases, the position estimate appears to be off by more than 10, or even more than 30, metres. Not hopelessly lost, perhaps, but certainly not correct. I think you should revise this statement and add a short discussion of it.

The paper does not mention training time, although it is indicated in the video. Please also discuss the amount of training needed in the paper.

In Sec. IV, you describe how to find the rototranslation between the map and the live data, but as far as I could see, you do not mention scale. Do you assume to have accurate pixels-per-metre scale information in both modalities, and that the scale is uniformly correct? Please clarify or discuss this.

What is the significance of the parameter $n$ (number of rotations)? How have you selected it, and how does it affect the results?

There are some further places where clarifications might help:
1) In Figs. 3-4, adding labels of what is A and what is B in the figure itself (not just the caption) would help.
2) The plots in Fig. 10 could be clearer, e.g. make them larger (and maybe cut some of Fig. 12) and/or make the lines thicker.

Minor edits/typos:
1) Fig. 4 caption: "An loss" -> "a loss".
2) Sec. IV-B, 3rd paragraph: "two random image" -> "two random images".
The cross-modal (ground vs. satellite) data correlation approach in this paper appears to be original and useful, building off [36]; it adapts state-of-the-art neural network architectures to the cross-modality correlation problem by following a multi-stage approach in which rotation is aligned first, and translation alignment is then performed with rotation-aligned synthetic images. This is a key novel aspect of the paper. It would perhaps be helpful to highlight which aspects of the Pose-Aware Separable Encoder Decoder CNN architecture (e.g. Figure 6) the authors consider most novel, beyond separating rotation and translation.

The performance evaluation is fairly extensive, both quantitatively (Tables I, II, and III) and qualitatively (e.g. Figure 12), but I was also expecting to see precision-recall curves to help build my intuition for how the technique performs as key parameters are varied. For example (page 7, column 2): "A large value of $d_{intro}$ indicates the generated images are erroneous... our system falls back to using odometry for dead-reckoning when $d_{intro}$ exceeds a threshold." What is the threshold, and how does system performance vary when that threshold is changed (i.e. too low vs. too high)?

Are there numerical values for key parameters that a researcher would need to know to replicate the results? Will a public implementation be made available? I feel there are some open questions for anyone trying to reimplement this, such as how many layers are in the encoder-decoder.

The system only uses a single GPS pose at the start of the trajectory; in practice, I wonder if it is somewhat unrealistic not to use GPS at all in a real system. I think the question of how to robustly fuse many inputs, including GPS, in such a system is paramount. Also, not making use of metrically accurate ground truth for training may be something that a practitioner would not do.

Overall I consider this an impressive system (but still a bit preliminary, and I would hope to see more details of the implementation in a longer version).
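To make the rotation-then-translation correlation idea discussed in the reviews concrete, the sketch below illustrates it in a minimal, classical form. This is not the paper's implementation: the function names (`best_rotation`, `best_translation`, `localise`), the `d_intro_threshold` parameter, and the use of raw single-channel images with a shared metres-per-pixel scale are all assumptions for illustration, whereas the paper relies on learned networks and synthetic images to bridge the modality gap.

```python
# Minimal sketch (not the paper's method): resolve rotation first by scoring a
# discrete set of rotated candidates, then resolve translation by FFT
# cross-correlation, with an odometry fall-back gated on an introspection
# distance. Assumes both inputs are single-channel arrays at the same
# metres-per-pixel scale.
import numpy as np
from scipy.ndimage import rotate


def best_rotation(live, synthetic, n_rotations=64):
    """Pick, out of n discrete candidates, the rotation that maximises
    zero-mean correlation between the live image and the rotated synthetic one."""
    angles = np.linspace(0.0, 360.0, n_rotations, endpoint=False)
    live_zm = live - live.mean()
    scores = []
    for a in angles:
        rot = rotate(synthetic, a, reshape=False, order=1)
        rot = rot - rot.mean()
        scores.append(float((live_zm * rot).sum()))
    best = int(np.argmax(scores))
    return angles[best], scores[best]


def best_translation(live, synthetic_aligned):
    """FFT cross-correlation to recover the (dy, dx) shift after rotation alignment."""
    f = np.fft.fft2(live - live.mean())
    g = np.fft.fft2(synthetic_aligned - synthetic_aligned.mean())
    corr = np.fft.ifft2(f * np.conj(g)).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Wrap offsets into roughly [-H/2, H/2) and [-W/2, W/2).
    dy = peak[0] if peak[0] < live.shape[0] // 2 else peak[0] - live.shape[0]
    dx = peak[1] if peak[1] < live.shape[1] // 2 else peak[1] - live.shape[1]
    return dy, dx, corr[peak]


def localise(live, synthetic, d_intro, d_intro_threshold, odometry_pose):
    """Fall back to dead-reckoning when the introspection distance is too large
    (the threshold value is a free parameter, not specified in the paper)."""
    if d_intro > d_intro_threshold:
        return odometry_pose, "odometry"
    angle, _ = best_rotation(live, synthetic)
    aligned = rotate(synthetic, angle, reshape=False, order=1)
    dy, dx, _ = best_translation(live, aligned)
    return (dx, dy, angle), "satellite"
```

Note that the two free parameters of this sketch, the number of discrete rotations and the fall-back threshold on $d_{intro}$, are exactly the quantities the reviewers ask the authors to report and analyse.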