Md Jahidul Islam (University of Minnesota Twin Cities); Peigen Luo (University of Minnesota-Twin Cities); Junaed Sattar (University of Minnesota)
In this paper, we introduce and tackle the simultaneous enhancement and super-resolution (SESR) problem for underwater robot vision and provide an efficient solution for near real-time applications. We present Deep SESR, a residual-in-residual network-based generative model that can learn to restore perceptual image qualities at 2x, 3x, or 4x higher spatial resolution. We supervise its training by formulating a multi-modal objective function that addresses the chrominance-specific underwater color degradation, lack of image sharpness, and loss in high-level feature representation. It is also supervised to learn salient foreground regions in the image, which in turn guides the network to learn global contrast enhancement. We design an end-to-end training pipeline to jointly learn the saliency prediction and SESR on a shared hierarchical feature space for fast inference. Moreover, we present UFO-120, the first dataset to facilitate large-scale SESR learning; it contains over 1500 training samples and a benchmark test set of 120 samples. By thorough experimental evaluation on UFO-120 and several other standard datasets, we demonstrate that Deep SESR outperforms the existing solutions for underwater image enhancement and super-resolution. We also validate its generalization performance on several test cases that include underwater images with diverse spectral and spatial degradation levels, and also terrestrial images with unseen natural objects. Lastly, we analyze its computational feasibility for single-board deployments and demonstrate its operational benefits for visually-guided underwater robots.
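The abstract describes a generator that maps a low-resolution distorted input to three jointly learned outputs on a shared feature space: a saliency map, an enhanced low-resolution image, and an enhanced super-resolved image, trained with a weighted multi-term objective. Below is a minimal PyTorch sketch of that interface only, not the authors' architecture; the residual-block layout, head names, and loss weights are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with a local skip connection."""

    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)


class DeepSESRSketch(nn.Module):
    """Shared feature trunk with three heads: saliency, enhancement, and SESR."""

    def __init__(self, scale=2, channels=64, num_blocks=8):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)
        self.trunk = nn.Sequential(*[ResidualBlock(channels) for _ in range(num_blocks)])
        self.saliency_head = nn.Sequential(nn.Conv2d(channels, 1, 3, padding=1), nn.Sigmoid())
        self.enhance_head = nn.Conv2d(channels, 3, 3, padding=1)
        # Sub-pixel (pixel-shuffle) upsampling for the enhanced, super-resolved output.
        self.sr_head = nn.Sequential(
            nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, lr_distorted):
        feats = self.trunk(self.head(lr_distorted))  # shared hierarchical features
        saliency = self.saliency_head(feats)         # S: saliency map (LR)
        enhanced_lr = self.enhance_head(feats)       # X: enhanced LR image
        enhanced_hr = self.sr_head(feats)            # Y: enhanced HR image
        return saliency, enhanced_lr, enhanced_hr


def combined_loss(outputs, targets, weights=(1.0, 0.5, 0.25)):
    """Placeholder weighted objective; the paper's multi-modal loss also includes
    chrominance-specific color, sharpness, and feature-space (perceptual) terms."""
    saliency, enhanced_lr, enhanced_hr = outputs
    saliency_gt, lr_gt, hr_gt = targets
    return (weights[0] * F.l1_loss(enhanced_hr, hr_gt)
            + weights[1] * F.l1_loss(enhanced_lr, lr_gt)
            + weights[2] * F.binary_cross_entropy(saliency, saliency_gt))


if __name__ == "__main__":
    model = DeepSESRSketch(scale=2)
    x = torch.rand(1, 3, 60, 80)        # low-resolution distorted input
    s, xe, y = model(x)
    print(s.shape, xe.shape, y.shape)   # (1,1,60,80), (1,3,60,80), (1,3,120,160)
```

In the paper, the objective additionally penalizes chrominance-specific color degradation, loss of sharpness, and discrepancies in high-level feature representations, and the saliency branch guides global contrast enhancement.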
Start Time | End Time
---|---
07/14 15:00 UTC | 07/14 17:00 UTC
The problem of simultaneous super-resolution and image enhancement is relevant because it avoids one step amplifying the artifacts introduced by the other, especially for underwater images. The paper is well written, and the authors provide a thorough comparison with the SOTA for both underwater and terrestrial images. The integration of additional saliency constraints is also a good idea. The qualitative results are appreciated, although the image quality prevents the reader from inspecting the images' details; this is expected. Some experimental aspects lack details that would make the reading smoother.

* The data generation is not explicit enough. The authors should state that the images collected from oceanic explorations are the 'undistorted and high-resolution' ground truth. The distorted images are synthetically generated with CycleGAN: its generator is trained to transform these images into distorted underwater images so that the discriminator cannot tell whether an image is an actual distorted image or a synthetic one.
* The low-resolution image generation is explained in the paper. However, I do not understand the relevance of the F split, as it is made of images from both U and O.
* The network takes a low-resolution distorted image and has three outputs: a saliency map S, a low-resolution enhanced image X, and a high-resolution (HR) enhanced image Y. To compare their model against existing approaches, the authors evaluate the enhancement in Y (or X, I am not sure) against previous enhancement methods, and they evaluate the super-resolution of Y against previous HR methods. The main motivation of the paper is to address the resolution and enhancement problems simultaneously, so I would have liked to see the enhanced HR image Y compared with an image that is first enhanced and then upsampled, and vice versa, using the top SOTA method. A comparison in terms of both image quality and computational time would have been relevant. Also, the enhancement in the high-resolution output (i.e., Y) is not evaluated.
* The authors rely on various metrics that assess image quality. Additional information on their interpretation and relevance would help the reader better appreciate the results.
* In addition to the main contributions listed above, the authors announce that they will release the underwater data used for training and testing: it consists of 150 testing images and 1500 training images. The images are deemed undistorted and high-resolution, they are transformed into distorted ones using an existing style-transfer network, and low-resolution versions are then generated manually. This dataset release is well appreciated.
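To make the data-generation point in the review above concrete, the following is a hedged sketch of a pairing pipeline consistent with that description: an undistorted high-resolution image is degraded by a pre-trained style-transfer (CycleGAN-type) generator and then downsampled to form the low-resolution distorted input, with clean targets kept at both scales. The `distortion_generator` argument and the bicubic downsampling are illustrative assumptions, not the authors' exact procedure.

```python
import torch
import torch.nn.functional as F


def make_training_pair(hr_clean, distortion_generator, scale=2):
    """hr_clean: (N, 3, H, W) tensor in [0, 1].
    Returns (low-res distorted input, low-res clean target, high-res clean target)."""
    with torch.no_grad():
        hr_distorted = distortion_generator(hr_clean)   # synthetic underwater degradation
    lr_distorted = F.interpolate(hr_distorted, scale_factor=1 / scale,
                                 mode="bicubic", align_corners=False)
    lr_clean = F.interpolate(hr_clean, scale_factor=1 / scale,
                             mode="bicubic", align_corners=False)
    return lr_distorted, lr_clean, hr_clean


if __name__ == "__main__":
    # An identity module stands in for a trained CycleGAN-style generator here.
    fake_generator = torch.nn.Identity()
    hr = torch.rand(1, 3, 240, 320)
    lr_in, lr_gt, hr_gt = make_training_pair(hr, fake_generator, scale=2)
    print(lr_in.shape, lr_gt.shape, hr_gt.shape)  # (1,3,120,160) twice, then (1,3,240,320)
```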
Overall the paper is well written and clearly presents the structure of the proposed learning network and the rationale behind it. The figures illustrate the results well, and the results suggest that the system may be suitable for real-time operation in future robotic deployments. However, the question of how the ground truth was actually generated is not addressed in the paper: comparisons against ground-truth imagery are presented, but there is no discussion of whether these ground-truth images were hand-labelled or derived from some other source. A discussion of the ground-truth generation process is warranted, given that obtaining ground-truth imagery for underwater datasets is a challenging task in itself.