§ Browse Thesis Bibliographic Record
System ID U0002-0303202110461100
DOI 10.6846/TKU.2021.00070
Thesis title (Chinese) 基於ESP之聯合目標檢測與部件分割網路模型的設計與實現
Thesis title (English) Design and Implementation of an ESP-Based Joint Object Detection and Affordance Segmentation Network Model
Thesis title (third language)
University Tamkang University
Department (Chinese) 電機工程學系機器人工程碩士班
Department (English) Master's Program in Robotics Engineering, Department of Electrical and Computer Engineering
Foreign degree university
Foreign degree college
Foreign degree institute
Academic year 109
Semester 1
Publication year 110
Student (Chinese) 林翰博
Student (English) Han-Po Lin
Student ID 606470101
Degree Master's
Language Traditional Chinese
Second language
Oral defense date 2021-01-14
Number of pages 50
Oral defense committee Advisor - 蔡奇謚 (chiyi_tsai@gms.tku.edu.tw)
Committee member - 周永山
Committee member - 許陳鑑
Keywords (Chinese) 深度學習
部件分割
ESPNetv2
多任務影像識別
Keywords (English) Deep Learning
Affordance Segmentation
ESPNetv2
Multi-Task Image Recognition
Keywords (third language)
Subject classification
Chinese abstract
Multi-affordance segmentation has been one of the most challenging and widely studied research topics in robotic vision in recent years. Although existing affordance segmentation methods can achieve accurate segmentation results, their parameter counts and network architectures are very large, so most of them cannot reach high-speed real-time performance. To reduce the number of network parameters and simplify the network architecture, this thesis proposes a lightweight affordance segmentation model designed on the basis of ESPNet, which effectively increases the processing speed of affordance segmentation and lowers the computational requirements at runtime. The proposed network model adopts an anchor-based one-stage object detection model as the backbone network, integrated with a semantic segmentation branch. Because the proposed model retains a relatively simple one-stage architecture, its network structure is easier to implement. In addition, we use ESPNet as the main module of the object detection and affordance segmentation model, replacing all convolutional modules in the proposed network with lightweight ESP modules to reduce the number of parameters and the computational complexity. Experimental results show that, on the IIT-AFF dataset, the proposed method achieves an object detection accuracy of 90% mAP and an affordance segmentation accuracy of 60.66% WBF. In the processing speed test, with a 512x512 RGB image as input, the proposed method reaches 35 FPS on a platform with an NVIDIA GeForce 1080Ti, which is more than five times faster than the current AffordanceNet (6.6 FPS).
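To make the "lightweight ESP module" mentioned above more concrete, the following is a minimal PyTorch-style sketch of a depth-wise dilated separable convolution and a simplified ESP-like block built from parallel dilated branches. The module names, channel counts, dilation rates, and choice of framework are illustrative assumptions; this is not the thesis implementation, which follows ESPNetv2 and is detailed in Chapters 3 and 4.

import torch
import torch.nn as nn

class DepthwiseDilatedSeparableConv(nn.Module):
    # A per-channel (depth-wise) 3x3 dilated convolution followed by a
    # 1x1 point-wise convolution that mixes channels. Channel counts and
    # dilation rates here are illustrative, not the thesis settings.
    def __init__(self, in_channels, out_channels, dilation=1):
        super().__init__()
        self.depthwise = nn.Conv2d(
            in_channels, in_channels, kernel_size=3,
            padding=dilation, dilation=dilation,
            groups=in_channels, bias=False)  # one filter per channel
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.PReLU(out_channels)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class ESPLikeBlock(nn.Module):
    # Very rough ESP-style block: a 1x1 reduction, parallel depth-wise
    # dilated branches with increasing dilation rates, and concatenation
    # of the branch outputs. The hierarchical feature fusion and strided
    # variants used in ESPNetv2 are omitted for brevity.
    def __init__(self, in_channels, out_channels, dilations=(1, 2, 4, 8)):
        super().__init__()
        branch_ch = out_channels // len(dilations)
        self.reduce = nn.Conv2d(in_channels, branch_ch, kernel_size=1, bias=False)
        self.branches = nn.ModuleList(
            DepthwiseDilatedSeparableConv(branch_ch, branch_ch, d) for d in dilations)

    def forward(self, x):
        x = self.reduce(x)
        return torch.cat([b(x) for b in self.branches], dim=1)

if __name__ == "__main__":
    block = ESPLikeBlock(64, 64)
    y = block(torch.randn(1, 64, 128, 128))
    print(y.shape)  # torch.Size([1, 64, 128, 128])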
English abstract
Affordance segmentation is one of the most challenging and popular research topics in the field of robotic vision in recent years. Although existing affordance segmentation methods can achieve accurate segmentation results, their network parameters and architectures are quite large, and most of these methods cannot achieve real-time performance. To reduce the number of network parameters and simplify the network architecture, this thesis proposes a lightweight affordance segmentation model based on ESPNetv2, which effectively increases the processing speed and reduces the computational requirements at runtime. The proposed method adopts a one-stage anchor-based object detection model as the backbone network, integrated with a semantic segmentation branch. Because of the advantage of the one-stage architecture, the proposed network model can be implemented with a relatively simple structure. In addition, we use a lightweight ESP module as the basic module in the object detection and affordance segmentation model to reduce the number of network parameters and the computational complexity. Experimental results show that, on the IIT-AFF dataset, the proposed method reaches a high segmentation accuracy of 60.66% WBF and an object detection accuracy of 90% mAP. In the processing speed test, when the network takes a 512x512 RGB image as input, the proposed method achieves a real-time processing speed of 35 FPS on a platform with an NVIDIA GeForce 1080Ti, which is more than five times faster than the existing AffordanceNet.
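As context for the reported 35 FPS figure, the snippet below sketches how such a throughput number is typically measured in PyTorch on a CUDA device: a few warm-up passes, GPU synchronization, then timing repeated forward passes over a 512x512 RGB tensor. The placeholder model, warm-up count, and iteration count are assumptions for illustration, not the benchmark protocol used in the thesis.

import time
import torch
import torch.nn as nn

def measure_fps(model, input_size=(1, 3, 512, 512), warmup=10, iters=100):
    # Roughly estimate inference throughput (frames per second).
    # torch.cuda.synchronize() ensures asynchronous GPU kernels have
    # finished before the clock is read.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    with torch.no_grad():
        for _ in range(warmup):  # warm-up passes (not timed)
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return iters * input_size[0] / elapsed

if __name__ == "__main__":
    # Placeholder network; substitute the actual detection/segmentation model.
    dummy = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 8, 1))
    print(f"{measure_fps(dummy):.1f} FPS")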
Third-language abstract
Table of contents
Chinese abstract	I
English abstract	II
Table of contents	IV
List of figures	VII
List of tables	VIII
Chapter 1 Introduction	1
1.1 Research background	1
1.2 Research motivation and objectives	2
1.3 Thesis organization	4
Chapter 2 Related work and the proposed workflow	6
2.1 Object detection	6
2.1.1 The R-CNN family	6
2.1.2 YOLOv3	7
2.1.3 SSD	8
2.2 Semantic segmentation	9
2.3 Affordance segmentation	11
2.4 Lightweight networks	12
2.5 Literature summary	13
2.6 Workflow of the proposed method	14
Chapter 3 ESP encoder modules	17
3.1 Depth-wise dilated separable convolution layer	17
3.2 ESP-S1 and ESP-S2 encoder modules	18
Chapter 4 Object detection decoder and affordance segmentation decoder architectures	22
4.1 ESP-Encoder feature extraction network	22
4.2 Object detection decoder branch	24
4.3 Affordance segmentation decoder branch	27
4.4 Training method and parameters	28
4.4.1 Loss functions and optimizer	29
4.4.2 Training strategy	30
Chapter 5 Experimental results and analysis	34
5.1 Experimental platform	34
5.2 Evaluation metrics	35
5.3 Dataset	35
5.4 Comparison with existing methods	36
5.5 Qualitative results	39
5.6 Real-world test results	41
Chapter 6 Conclusions and future work	45
References	47

List of figures
Figure 2.1 Architecture of the proposed lightweight object detection and affordance segmentation system	16
Figure 3.1 Architectures of the ESP-S1 and ESP-S2 encoder modules	20
Figure 4.1 Schematic of the ESP-Encoder feature extraction network	24
Figure 4.2 Schematic of the object detection decoder branch	26
Figure 4.3 Schematic of the affordance segmentation decoder branch	28
Figure 4.4 Schematic of the cyclical learning rate strategy	31
Figure 5.1 NVIDIA GeForce 1080Ti	34

List of tables
Table 4.1 Effect of the cyclical learning rate strategy	31
Table 5.1 Experimental hardware specifications	34
Table 5.2 Contents and statistics of the IIT-AFF dataset	36
Table 5.3 Comparison of accuracy, model size, and speed of object detection and affordance segmentation models	37
Table 5.4 Comparison of object detection accuracy (mAP)	37
Table 5.5 Comparison of affordance segmentation accuracy (WBF)	38
Table 5.6 Qualitative results	40
Table 5.7 Real-world test results	42
References
[1]	R. Girdhar, G. Gkioxari, L. Torresani, M. Paluri, and D. Tran, “Detect-and-Track: Efficient Pose Estimation in Videos,” in IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, June 18-22, 2018, pp. 350-359.
[2]	D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A Closer Look at Spatiotemporal Convolutions for Action Recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, June 18-22, 2018, pp. 6450-6459.
[3]	S. Ahmed, M.N. Huda, S. Rajbhandari, C. Saha, M. Elshawa and S. Kanarachos, “Pedestrian and Cyclist Detection and Intent Estimation for Autonomous Vehicles: A Survey,” Applied Sciences, Vol. 9, No. 11, 2019, pp. 3990.
[4]	S. Mehta, M. Rastegari, A. Caspi, L.G. Shapiro, and H. Hajishirzi, “ESPNetv2: A Light-weight, Power Efficient, and General Purpose Convolutional Neural Network,” in IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, June 16-20, 2019, pp. 9190-9200.
[5]	O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” in Medical Image Computing and Computer-Assisted Intervention, Springer, LNCS, Vol. 9351, 2015, pp. 234-241.
[6]	A. Nguyen, D. Kanoulas, D. G. Caldwell, and N. Tsagarakis, “Object-Based Affordances Detection with Convolutional Neural Networks and Dense Conditional Random Fields,” in International Conference on Intelligent Robots and Systems, Vancouver, BC, September 24-28, 2017.
[7]	R. Margolin, L. Zelnik-Manor, and A. Tal, “How to Evaluate Foreground Maps,” in IEEE Conference on Computer Vision and Pattern Recognition, Columbus, USA, June 23-28, 2014, pp. 248-255.
[8]	T.-T. Do, A. Nguyen, and I. Reid, “AffordanceNet: An End-to-End Deep Learning Approach for Object Affordance Detection,” in International Conference on Robotics and Automation, Brisbane, Australia, May 21-26, 2018.
[9]	L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 40, No. 4, April 1, 2018, pp. 834-848.
[10]	A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications,” in IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Hawaii, July 22-25, 2017.
[11]	Q. Zhao, T. Sheng, Y. Wang, Z. Tang, Y. Chen, L. Cai, and H. Ling, “M2Det: A Single-Shot Object Detector Based on Multi-Level Feature Pyramid Network,” in Thirty-Third AAAI Conference on Artificial Intelligence, Honolulu, Hawaii, January 27 - February 1, 2019.
[12]	L. Tychsen-Smith and L. Petersson, “Improving Object Localization with Fitness NMS and Bounded IoU Loss,” in IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, June 18-22, 2018, pp. 6877-6885.
[13]	M. Tan, R. Pang and Q. V. Le, “EfficientDet: Scalable and Efficient Object Detection,” in IEEE Conference on Computer Vision and Pattern Recognition, June 14-19, 2020, pp. 10781-10790.
[14]	R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, Columbus, USA, Jun 23-28, 2014, pp. 580-587.
[15]	R. Girshick, “Fast R-CNN,” in IEEE International Conference on Computer Vision, Santiago, Chile, December 13-16, 2015, pp. 1440-1448.
[16]	S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 39, No. 6, 2017, pp. 1137-1149.
[17]	J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement,” arXiv preprint arXiv:1804.02767, 2018.
[18]	K. He, X. Zhang, S. Ren and J. Sun, “Deep Residual Learning for Image Recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, June 26- July 1, 2016, pp. 770-778.
[19]	T. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature Pyramid Networks for Object Detection,” in IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Hawaii, July 22-25, 2017, pp. 936-944.
[20]	W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single Shot Multibox Detector,” in European Conference on Computer Vision, Amsterdam, October 8-16, 2016, pp. 21-37.
[21]	J. Long, E. Shelhamer, and T. Darrell, “Fully Convolutional Networks for Semantic Segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, Boston, Massachusetts, June 7-12, 2015, pp. 3431-3440.
[22]	V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 39, No. 12, December 2017, pp. 2481-2495.
[23]	A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, “ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation,” arXiv preprint arXiv:1606.02147, 2016.
[24]	H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, “ICNet for Real-Time Semantic Segmentation on High-Resolution Images,” in European Conference on Computer Vision, Munich, Germany, September 8-14, 2018, pp. 405-420. 
[25]	C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “BiSeNet: Bilateral Segmentation Network for Real-Time Semantic Segmentation,” in European Conference on Computer Vision, Munich, Germany, September 8-14, 2018, pp. 325-341.
[26]	K. He, G. Gkioxari, P. Dollar, and R. B. Girshick, “Mask R-CNN,” in International Conference on Computer Vision, Venice, Italy, October 22-29, 2017, pp. 2961-2969.
[27]	F. Chu, R. Xu, P.A. Vela, “Learning Affordance Segmentation for Real-World Robotic Manipulation via Synthetic Images,” IEEE Robotics and Automation Letters, vol. 4, no. 2, 2019, pp. 1140–1147.
[28]	T. Lüddecke, T. Kulvicius, and F. Wörgötter, “Context-Based Affordance Segmentation from 2D Images for Robot Actions,” in IEEE International Conference on Computer Vision Workshops, Venice, Italy, October 22-29, 2017.
[29]	H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid Scene Parsing Network,” in IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Hawaii, July 22-25, 2017, pp. 6230-6239.
[30]	F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally and K. Keutzer, “SqueezeNet: AlexNet-Level Accuracy with 50x Fewer Parameters and <0.5MB Model Size,” in International Conference on Learning Representations, Puerto Rico, May 2-4, 2016.
[31]	G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely Connected Convolutional Networks,” in IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Hawaii, July 22-25, 2017, pp. 4700-4708.
[32]	X. Zhang, X. Zhou, M. Lin, and J. Sun, “ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices,” in IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, June 18-22, 2018, pp. 6848-6856.
[33]	N. Ma, X. Zhang, H. T. Zheng, and J. Sun, “ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design,” in European Conference on Computer Vision, Munich, Germany, September 8-14, 2018, pp. 116-131.
[34]	M. Sandler, A. Howard, M. Zhu, A. Zhmoginov and L. Chen, “MobileNetV2: Inverted Residuals and Linear Bottlenecks,” in IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, June 18-22, 2018, pp. 4510-4520.
[35]	S. Mehta, M. Rastegari, A. Caspi, L. Shapiro, and H. Hajishirzi, “ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation,” in European Conference on Computer Vision, Munich, Germany, September 8-14, 2018, pp. 552-568.
[36]	L.N. Smith, “Cyclical Learning Rates for Training Neural Networks,” arXiv preprint arXiv:1506.01186v3, 2016.
[37]	J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object Detection via Region-Based Fully Convolutional Networks,” in Neural Information Processing Systems, Barcelona, Spain, December 5-10, 2016.
[38]	A. Nguyen, D. Kanoulas, D. G. Caldwell, and N. G. Tsagarakis, “Detecting Object Affordances with Convolutional Neural Networks,” in International Conference on Intelligent Robots and Systems, Daejeon, Korea, October 9-14, 2016, pp. 2765-2770.
Full-text access rights
On campus
Release of the printed thesis on campus is postponed until 2026-04-01.
The author consents to making the electronic full text publicly available on campus.
Release of the electronic thesis on campus is postponed until 2026-04-01.
The bibliographic record is available on campus immediately.
Off campus
Authorization granted.
Release of the electronic thesis off campus is postponed until 2026-04-01.
