A Semi-Supervised Approach to Monocular Depth Estimation, Depth Refinement, and Semantic Segmentation of Driving Scenes using a Siamese Triple Decoder Architecture
DOI: https://doi.org/10.31449/inf.v44i4.3018

Abstract

Depth estimation and semantic segmentation are two fundamental tasks in scene understanding. They are usually solved separately, even though they are highly correlated and have complementary properties. Solving them jointly benefits real-world applications that require both geometric and semantic information. Within this context, this paper presents a unified learning framework that produces a refined depth map and a semantic segmentation map from a single image. Specifically, it proposes a novel architecture called JDSNet, a Siamese triple-decoder network that simultaneously performs depth estimation, depth refinement, and semantic labeling of a scene by exploiting the interaction between depth and semantic information. JDSNet is trained in a semi-supervised manner: the depth estimation task is supervised by geometry-based image reconstruction rather than ground-truth depth labels, while the semantic segmentation task requires ground-truth semantic labels. The KITTI driving dataset is used to evaluate the effectiveness of the proposed approach. Experimental results show that the model achieves excellent performance on both tasks, indicating that it effectively exploits both geometric and semantic information.
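The semi-supervised objective described above combines an unsupervised, geometry-based image-reconstruction term for the depth branch with a supervised term for the segmentation branch. The sketch below illustrates what such a joint loss could look like in NumPy. The paper's exact formulation is not given here, so the `photometric_loss`, `cross_entropy`, and `joint_loss` helpers, the loss weights, and the 0.85 SSIM/L1 weighting (common in reconstruction-based depth estimation, e.g. Godard et al.'s left-right consistency work) are illustrative assumptions, not JDSNet's actual loss.

```python
import numpy as np

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Simplified single-window SSIM (structural similarity index);
    # real implementations compute it over local sliding windows.
    mu_x, mu_y = x.mean(), y.mean()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (x.var() + y.var() + c2)
    return num / den

def photometric_loss(target, reconstructed, alpha=0.85):
    # Appearance-matching loss: weighted SSIM + L1 between the input
    # view and the view reconstructed from the predicted depth.
    ssim_term = (1.0 - ssim(target, reconstructed)) / 2.0
    l1_term = np.abs(target - reconstructed).mean()
    return alpha * ssim_term + (1.0 - alpha) * l1_term

def cross_entropy(probs, labels, eps=1e-8):
    # Per-pixel cross-entropy for the supervised segmentation branch.
    # probs: (H, W, C) softmax outputs; labels: (H, W) integer class ids.
    h, w = labels.shape
    picked = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return -np.log(picked + eps).mean()

def joint_loss(target_img, reconstructed_img, seg_probs, seg_labels,
               w_depth=1.0, w_seg=1.0):
    # Total semi-supervised objective: unsupervised reconstruction term
    # for depth plus supervised cross-entropy for segmentation.
    return (w_depth * photometric_loss(target_img, reconstructed_img)
            + w_seg * cross_entropy(seg_probs, seg_labels))
```

In practice the reconstructed view is typically obtained by warping the other stereo image with the predicted disparity (e.g., via a differentiable spatial transformer), so no depth ground truth is ever needed for the reconstruction term.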
Copyright © Slovenian Society Informatika