*Aditya Dhawale, Nathan Michael*

State-of-the-art dense mapping approaches cannot be deployed on Size, Weight, and Power (SWaP) constrained platforms because of their large memory and compute requirements. In this paper, we present an accurate and efficient approach to dense multi-fidelity 3D mapping using Gaussian distributions as volumetric primitives. The proposed mapping approach supports both high-fidelity dense surface reconstruction and lower-fidelity volumetric environment representation for fundamental robotics applications. We exploit the inherent working characteristics of an off-the-shelf depth sensor and approximate the distribution of approximately planar points using Gaussian distributions. Explicit modeling of the sensor noise characteristics enables us to incrementally update the map representation in real time with high accuracy. We present the advantages of our proposed map representation over other well-known state-of-the-art representations by highlighting its superior performance in terms of reconstruction accuracy, completeness, and map compression properties via quantitative and qualitative metrics.

| Start Time | End Time |
|---|---|
| 07/16 15:00 UTC | 07/16 17:00 UTC |

# Efficient Parametric Multi-Fidelity Surface Mapping

The paper presents a mapping approach for RGB-D sensors which achieves good-quality reconstruction while being computationally efficient. The map is represented as a Gaussian mixture model (GMM) which is updated incrementally from new depth images within a hierarchical scheme. This allows the approach to avoid heavy computations and provides frame-rate performance on a laptop CPU. The approach performs mapping assuming the poses of the sensor to be known.

The interesting aspect of this work is that it provides a mapping method with a couple of user-defined parameters such that it can be adapted to different applications based on the computational resources and the memory available. For example, it would allow for planning/navigation purposes with lower-quality maps, or a more precise reconstruction of the scene at the cost of slightly higher computation. The authors also demonstrate this capability for room-sized maps using both simulated and real-world data.

Here are some comments about the overall work and the different sections of the paper:

- The main contribution of the paper is that it combines ideas from different works, such as using Gaussian distributions for map representation as in NDT maps [1,17], projective association similar to KinectFusion/ElasticFusion [9,25], and a hierarchical scheme [21] to reduce computations. These ideas have been put together with the goal of achieving comparable/better reconstruction quality with a smaller memory footprint as compared to other state-of-the-art mapping approaches.
- In addition to building upon these works, the new idea in this paper is to exploit the neighborhood information in the depth image to fit the Gaussian mixture model and avoid the computationally expensive optimization procedures used in previous works.
- The paper seems to provide sufficient theoretical and implementation details to reproduce the work.
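As a side note on why this neighborhood-based fitting is cheap: a patch of back-projected points can be summarized by its sample mean and covariance, and two such summaries can be merged exactly in closed form, with no iterative optimization. A minimal sketch of my own (illustrative function names, not the authors' implementation):

```python
import numpy as np

def fit_patch_gaussian(points):
    """Summarize an (N, 3) array of back-projected patch points by the
    sample mean and (biased) sample covariance of a 3D Gaussian."""
    mu = points.mean(axis=0)
    d = points - mu
    sigma = d.T @ d / len(points)
    return len(points), mu, sigma

def merge_gaussians(n1, mu1, s1, n2, mu2, s2):
    """Moment-matched merge of two weighted Gaussians; the outer-product
    terms account for the separation between the two component means."""
    n = n1 + n2
    w1, w2 = n1 / n, n2 / n
    mu = w1 * mu1 + w2 * mu2
    d1, d2 = mu1 - mu, mu2 - mu
    sigma = (w1 * (s1 + np.outer(d1, d1))
             + w2 * (s2 + np.outer(d2, d2)))
    return n, mu, sigma
```

Merging this way reproduces exactly the Gaussian fitted to the union of the two point sets, which is what makes a purely incremental update possible.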
In my opinion, the impact of the work could be enhanced by providing an implementation of the system, so that it can be used both to compare against other mapping approaches and to build other applications on top.

- Overall, the paper is well structured and clearly written. I appreciate the introduction section, where the present work is placed in the context of previous works and their relations are explained.
- The assumptions in terms of input data, modeling, and performance are clearly spelled out in most sections of the paper. I would recommend adding a short table with the values of all the fixed parameters, such as alpha_n, alpha_e, alpha_conf, Sigma_unc (min and max), and the other parameters used in the approach.
- The derivations, as well as the notation in the paper, seem consistent. Please add the relevant reference for equation 2, as it may not be trivial. Also add the reference for equation 3 (or a link to eq. 1?) to explain how it can be derived.
- The experiments back up the claims made in the paper in terms of accuracy and memory footprint as compared to other state-of-the-art approaches. The accuracy measures look impressive, particularly given the size of the maps. The approach shows better accuracy and a lower memory footprint than occupancy-based maps as well as NDT maps. It would be interesting to compare the accuracy against TSDF-based methods (such as KinectFusion/ElasticFusion), as the GMM maps look visually messier than typical TSDF maps. This would also fill a gap in the types of maps used for comparison. You may include this additional experiment depending on your space constraints.
- In addition to the run-time analysis performed in Sec. V.E, it would help to report the timing of the different methods (yours, NDT, occupancy-based mapping) on the datasets at comparable resolution.
Although it is not the main contribution, I appreciate the experiment showing that the reconstructed maps allow for frame-to-model pose estimation.

Here are some minor comments/corrections:

- Sec. II.B (end of page 3): it should be bxb pixel patches instead of pxp.
- Fig. 2: The correspondence between the explanation in the caption and the figure above it is a bit confusing. It may help to explicitly label each part with a number.
- Sec. II.D, Line 13: \theta^{K} --> should K be k?
- Sec. III.E: Please provide a reference for the Bhattacharyya measure of similarity.
- Fig. 8 is too small to see the numbers.
- Sec. IV.A (Lines 3-5): The sentence construction is confusing. It would help to rephrase it.
- Tables 2, 3: Please provide the thresholds used for computing the precision and recall values for map accuracy.
- Sec. IV.F: The accuracy for D1 is reported as 0.0004 m, whereas in Fig. 8 it is reported as 0.004 m. Please make them consistent.
- Just a comment on aesthetics: the text looks super cramped on pages 1 and 2. Maybe go easy on the vspaces ;).
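For reference on the Bhattacharyya point above: between two Gaussians the measure has a well-known closed form, which could simply be cited or stated. A quick sketch of my own for the 3D case (the authors may of course use a different variant):

```python
import numpy as np

def bhattacharyya(mu1, s1, mu2, s2):
    """Bhattacharyya distance between two multivariate Gaussians:
    a Mahalanobis-like mean-separation term plus a term penalizing
    covariance mismatch. Zero iff the two Gaussians are identical."""
    s = 0.5 * (s1 + s2)           # average covariance
    d = mu1 - mu2
    term_mean = 0.125 * d @ np.linalg.solve(s, d)
    term_cov = 0.5 * np.log(
        np.linalg.det(s) / np.sqrt(np.linalg.det(s1) * np.linalg.det(s2)))
    return term_mean + term_cov
```

For identical covariances the second term vanishes and the distance reduces to a scaled squared Mahalanobis distance between the means.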

The work seems to be an original contribution. The paper is technically sound (except for some minor issues). Perhaps the paper could be clearer; some parts, like Section I, are a bit cryptic. The paper compares the proposed approach with other state-of-the-art methods and outperforms them in the provided scenarios and metrics. However, the evaluation suffers a bit from being limited to four datasets and lacking some plots comparing different metrics. An evaluation with more samples and a greater variety of scenarios would be desirable.

Some more specific comments:

- In the listed contributions at the end of Section I, there seems to be some redundancy. It would be nice if the paper could condense them and/or point to which section corresponds to each listed contribution (likely, they are II.C.1, II.C.2, and II.E) to make them clear.
- In general, it is not clear what the text means by "fidelity", since it is never explicitly stated. And it is not accuracy, since the two terms often appear together but not as synonyms. It appears that "fidelity" is a synonym for "model complexity" and equivalent to "level of detail" or "resolution" in computer graphics; however, this definition or clarification should be given by the authors.
- The introduction could be split. It is a rather long introduction that goes over many different types of representations used for 3D scene modeling. I think this part could be carved out into its own section on "Related work" or "Representations of 3D scenes".

Understanding the methodology:

- In general, the approach of using probabilistic filters to represent and refine depth in 3D reconstruction is similar to the idea of Bayesian "depth filters" in the ICRA 2014 papers by Pizzoli et al., SVO (Fast Semi-Direct Monocular Visual Odometry) and REMODE (Probabilistic, Monocular Dense Reconstruction in Real Time). However, the paper does not mention these works, and therefore does not compare against REMODE.
- I understand that the resulting systems are quite different, but the ideas for depth fusion may not be, and they are worth a short discussion.
- Fig. 2 should be more explicit about what theta_0 and theta_t are in the four different plots. Why is there a sudden jump of color in the floor of the room in Fig. 2, from the left-most image to the one immediately to its right? It would also be better to keep the same size for the chair, to better see how the 3D reconstruction groups in extension.
- What do the three colors used for the blocks in Fig. 3 represent? Knowing this would help better understand the system and its input/output.
- Because many figures are not referenced in the text and their captions contain explanations, some things are not clear just from following the main text. For example, it is not clear that the poses are given.
- The image in the eye-catcher is never referenced, and the right plot with the sample plots is not repeated in the experiments, so it makes one wonder how necessary it is.
- In Eq. (3), how is the condition of Sigma being positive semi-definite enforced, so that it is a valid covariance matrix?

Evaluation / Experiments:

- Evaluation metrics such as precision and recall should be defined as well, for completeness.
- What is the trade-off in accuracy vs. speed? Could the authors provide a plot for it?
- What is the trade-off in accuracy vs. completeness? This is a standard plot to assess 3D reconstruction algorithms. See, for example, Vogiatzis et al., "Video-based, real-time multi-view stereo", CVIU 2011, or Pizzoli et al., REMODE, ICRA 2014.
- It would be nice to have a trade-off plot that illustrates the multi-fidelity aspect of the proposed method. One axis would be fidelity; the other axis could be amount of texture detail, accuracy, memory, etc.
- How many distributions are used in Fig. 10? I assume that the execution time is highly influenced by this parameter. What are the units of the vertical axis of this plot (Fig. 10)?
- How much is "sensor rates" at the end of Section IV.E? 50 Hz? 100 Hz?
- Please include units as much as possible: alpha_e sometimes has units, sometimes not. The same applies to alpha_len. "b, the patch size is set to 8." [pixels?]

Other comments:

- Adopting units of cm to measure absolute error, rather than meters, seems more appropriate. It would also be useful to provide a relative measure, such as the error normalized by the scene size or by the depth with respect to the camera.
- Figures do not seem to appear in the order in which they are mentioned in the text.
- The orange shading and the small size of most plots do not help much in visualizing the reconstruction or the error (at least on printed paper). Insets with a zoom, as in Fig. 9, are somewhat helpful. Other papers use color-coded error maps (i.e., plotting the differences with respect to the most accurate model).
- Section II: The first time it is used, "Given an image I_0 of size..." should be "Given a *depth* image I_0 ..."; otherwise the type of information contained in I_0 is confusing.
- Section II: "(the back-projection) can be viewed as a linear transformation" --> I would rather say that it can be *approximated* by such a transformation.
- Section II: "at time t = t ..." looks like a tautology.
- Section II: redundunt --> redundant
- Is there a reference for the use of the acronym SWaP? I think it is not standard in computer vision; it may be better known in aerospace and military contexts.
- Some sentences are very long (multiple verbs) and therefore difficult to follow. Example: the last sentence of Section IV.A.
- Some references do not have a publication venue, e.g., [22]. Also, in [22], gmm -> GMM. Check for other lowercase acronyms: slam, rgb-d.

Possible typos:

- The indices (subscripts, superscripts) in the variables of Sections II.B-II.C are not always consistent, which makes it a bit difficult to understand the details of the update rules of the GMM parameters. I suggest reviewing such indices.
- I think there is a missing term (x-mu)^T in the update of Sigma in Eq. (1); otherwise Sigma is just a vector, not a covariance matrix.
- Section II.C.1: I could not find an intuitive interpretation for the 99.97% confidence. Was it instead intended to be the usual 3-sigma rule, i.e., 99.7%? Note that the rule of 99.7% probability corresponding to +-3 standard deviations holds for a 1D Gaussian distribution. For higher-dimensional distributions, such as the 2D projection (ellipse), the confidence rule changes (see "On the Mahalanobis distance classification criterion for multidimensional normal distributions", IEEE Trans. Signal Processing, 2013), and so 99.7% confidence corresponds to a Mahalanobis distance of 3.44 rather than 3. This number grows with the dimension.
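To make the dimension dependence concrete: for a 2D Gaussian, the probability mass inside Mahalanobis radius k is 1 - exp(-k^2/2), so the confidence radius can be inverted in closed form. A quick sanity check of my own (not from the paper):

```python
import math

def maha_radius_2d(p):
    """Mahalanobis radius k enclosing probability p of a 2D Gaussian:
    P(d <= k) = 1 - exp(-k**2 / 2), inverted in closed form."""
    return math.sqrt(-2.0 * math.log(1.0 - p))
```

Evaluating this at p = 0.997 already gives a radius noticeably above 3, consistent with the point that the 1D 3-sigma rule does not carry over unchanged to higher dimensions.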