§ Browse Thesis Bibliographic Record
  
System ID U0002-1207202118024200
DOI 10.6846/TKU.2021.00254
Title (Chinese) 基於多任務串接卷積神經網路之人臉檢測與特徵點估計
Title (English) Joint Face Detection and Facial Landmark Estimation Based on Multi-task Cascaded Convolutional Neural Networks
Title (third language)
Institution 淡江大學 (Tamkang University)
Department (Chinese) 電機工程學系碩士班
Department (English) Department of Electrical and Computer Engineering
Foreign degree - university
Foreign degree - college
Foreign degree - graduate institute
Academic year 109 (ROC calendar)
Semester 2
Year of publication 110 (2021)
Author (Chinese) 楊昀晟
Author (English) Yun-Cheng Yang
Student ID 608450192
Degree Master's
Language Traditional Chinese
Second language
Defense date 2021-07-02
Number of pages 83
Committee Advisor - 易志孝 (chyih@ee.tku.edu.tw)
Member - 劉鴻裕 (hongyuliu@ee.fju.edu.tw)
Member - 楊淳良 (clyang@mail.tku.edu.tw)
Keywords (Chinese) 人臉檢測
人臉特徵點估計
卷積神經網路
串接卷積神經網路
深度學習
多任務學習
Keywords (English) Face detection
Facial Landmark Estimation
Convolutional Neural Network
Cascaded Convolutional Neural Network
Deep Learning
Multi-task Learning
Keywords (third language)
Subject classification
Chinese Abstract
This thesis proposes face detection and facial landmark estimation based on multi-task cascaded convolutional neural networks and investigates the relationship between the face bounding box regression vectors and the facial landmarks. The architecture consists of three convolutional networks: a proposal network (P-Net), a refine network (R-Net), and an output network (O-Net). The P-Net performs preliminary detection of face candidate boxes, computes bounding box regression vectors to calibrate them, and then merges highly overlapping candidates using non-maximum suppression (NMS). The merged candidates are fed into the R-Net, which rejects a large number of false candidates and again calibrates and merges the boxes using the regression vectors and NMS. The surviving candidates are passed to the O-Net, which adds facial landmark information (left eye, right eye, nose, left mouth corner, right mouth corner); after the candidate boxes and facial landmarks are calibrated by the regression vectors and the boxes are merged by NMS, the network finally outputs face boxes together with the five facial landmarks. During training, a multi-task learning method is adopted, and online hard sample mining is introduced to improve model performance.
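The candidate-box merging described above is standard non-maximum suppression. The following is a minimal illustrative sketch in Python/NumPy, not the thesis implementation; the IoU threshold of 0.5 is an assumed example value:

    import numpy as np

    def nms(boxes, scores, iou_threshold=0.5):
        # boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
        # Keeps the highest-scoring box in each group of overlapping boxes.
        x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
        areas = (x2 - x1) * (y2 - y1)
        order = scores.argsort()[::-1]  # highest confidence first
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(i)
            # Intersection of the kept box with every remaining candidate
            xx1 = np.maximum(x1[i], x1[order[1:]])
            yy1 = np.maximum(y1[i], y1[order[1:]])
            xx2 = np.minimum(x2[i], x2[order[1:]])
            yy2 = np.minimum(y2[i], y2[order[1:]])
            inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
            iou = inter / (areas[i] + areas[order[1:]] - inter)
            # Discard candidates that overlap the kept box too strongly
            order = order[1:][iou <= iou_threshold]
        return keep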
    The WIDER FACE dataset is used to train face detection, and the CelebA dataset is used to train facial landmark estimation. The validation sets provided by WIDER FACE are used for evaluation, and the precision-recall curves are plotted. On the easy validation set, recall reaches 0.695 at a precision of 0.6; on the medium validation set, recall reaches 0.670 at a precision of 0.78; on the hard validation set, recall reaches 0.402 at a precision of 0.8, outperforming methods such as Faceness, Two-stage CNN, and ACF. For facial landmark estimation, the validation sets of Deep Convolutional Network Cascade for Facial Point Detection and of CelebA are used. On the former, the mean error percentages (%) for the left eye, right eye, nose, left mouth corner, and right mouth corner are 10.89, 11.88, 14.27, 15.6, and 16.06, respectively; on the latter, 10.60, 10.95, 13.36, 14.15, and 14.44. The corresponding failure rates (%) are 27.71, 28.11, 45.38, 53.82, and 54.22 on the former, and 27.72, 28.67, 43.30, 49.96, and 51.04 on the latter.
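Precision and recall are computed from confusion-matrix counts, and the precision-recall curve is traced by sweeping the detector's confidence threshold. A small sketch; the example counts are hypothetical, chosen only to land near the easy-set operating point quoted above:

    def precision_recall(tp, fp, fn):
        # Precision = TP / (TP + FP); recall = TP / (TP + FN).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return precision, recall

    # Hypothetical counts per 1000 ground-truth faces: 695 hits,
    # 463 false alarms, 305 misses -> precision ~ 0.60, recall ~ 0.695.
    print(precision_recall(695, 463, 305))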
English Abstract
This thesis proposes a face detection and facial landmark estimation method based on multi-task cascaded convolutional neural networks (MTCNN) and discusses the effects of multi-task learning on the accuracy of the face bounding box regression vectors and facial landmark locations.
The proposed model consists of three convolutional neural networks, namely the proposal network (P-Net), the refine network (R-Net), and the output network (O-Net). First, P-Net performs face bounding box detection and calculates bounding box regression vectors to calibrate the candidate boxes. Moreover, P-Net employs the non-maximum suppression (NMS) method to merge overlapping candidate boxes. The remaining bounding boxes are input to R-Net, which further rejects a large number of incorrect candidate boxes. Similar to P-Net, R-Net also uses the bounding box regression vectors to calibrate candidate boxes and merges them by NMS. The remaining bounding boxes of R-Net are sent to O-Net, which is trained on both the face bounding boxes and the facial landmark localization information (left eye, right eye, nose, left mouth corner, right mouth corner). Finally, O-Net outputs the detected face bounding boxes and the estimated facial landmark locations. In the training process, we use the multi-task learning method and the online hard sample mining technique to improve performance.
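As a sketch of the control flow of the three-stage cascade just described (the p_net, r_net, o_net, nms, and calibrate arguments are hypothetical callables standing in for the trained networks and box utilities, not the thesis code):

    def cascade_detect(image, scales, p_net, r_net, o_net, nms, calibrate):
        # Stage 1: P-Net proposes boxes at every image-pyramid scale,
        # calibrates them with its regression vectors, and merges by NMS.
        candidates = []
        for scale in scales:
            boxes, scores, regs = p_net(image, scale)
            candidates.extend(nms(calibrate(boxes, regs), scores))

        # Stage 2: R-Net rejects false candidates, then calibrates
        # and merges the survivors again.
        boxes, scores, regs = r_net(image, candidates)
        candidates = nms(calibrate(boxes, regs), scores)

        # Stage 3: O-Net refines the boxes and additionally predicts the
        # five facial landmarks (eyes, nose, mouth corners).
        boxes, scores, regs, landmarks = o_net(image, candidates)
        return nms(calibrate(boxes, regs), scores), landmarks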
    The WIDER FACE dataset is employed to train the MTCNN for the task of face detection, and the CelebA dataset is adopted to train the same network for the task of facial landmark estimation. We use the easy, medium, and hard validation sets in the WIDER FACE dataset to verify the performance of the MTCNN and plot the precision-recall curves. For the easy, medium, and hard validation sets, the recall rates of our trained MTCNN are 0.695, 0.67, and 0.402, and the precision rates are 0.6, 0.78, and 0.8, respectively, which are better than the results obtained by the Faceness, Two-stage CNN, and ACF methods.
To evaluate the performance of the MTCNN on the task of facial landmark estimation, we use the validation sets of the Deep Convolutional Network Cascade for Facial Point Detection (DCNN) dataset and the CelebA dataset to calculate the mean error and failure rate. For the DCNN validation set, the mean errors (in percentage) of the left eye, right eye, nose, left mouth corner, and right mouth corner are 10.89, 11.88, 14.27, 15.6, and 16.06, respectively, and the failure rates (in percentage) are 27.71, 28.11, 45.38, 53.82, and 54.22, respectively. For the CelebA validation set, the mean errors (in percentage) are 10.60, 10.95, 13.36, 14.15, and 14.44, respectively, and the failure rates (in percentage) are 27.72, 28.67, 43.30, 49.96, and 51.04, respectively.
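One plausible way to compute these per-landmark statistics is sketched below; the normalization length and the failure threshold are assumptions for illustration, since the abstract does not state the exact settings used in the thesis:

    import numpy as np

    def landmark_metrics(pred, gt, ref_len, fail_threshold=0.1):
        # pred, gt: (N, 5, 2) predicted / ground-truth landmark coordinates.
        # ref_len: (N,) per-face normalization lengths (assumed here to be
        # the face bounding-box width; the thesis may normalize differently).
        err = np.linalg.norm(pred - gt, axis=2) / ref_len[:, None]  # (N, 5)
        mean_error_pct = 100.0 * err.mean(axis=0)             # per landmark
        failure_rate_pct = 100.0 * (err > fail_threshold).mean(axis=0)
        return mean_error_pct, failure_rate_pct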
Third-language abstract
Table of Contents
Acknowledgments	I
Chinese Abstract	II
English Abstract	IV
Table of Contents	VI
List of Figures	IX
List of Tables	XIII
Chapter 1 Introduction	1
1.1  Research Background	1
1.2  Research Motivation	5
Chapter 2 Background	7
2.1  Convolutional Neural Networks	7
2.1.1  Convolutional Layer	8
2.1.2  Pooling Layer	10
2.1.3  Rectified Linear Units Layer (ReLU Layer)	12
2.2  Overlap Ratio	13
2.3  Non-Maximum Suppression	13
Chapter 3 System Architecture	16
3.1  Cascaded Convolutional Neural Networks	16
3.2  Detection Pipeline	18
3.2.1  Image Pyramid	20
3.2.2  Proposal Network Detection Pipeline	21
3.3  Training Pipeline	24
3.3.1  Training Data	26
3.3.2  Random Cropping of Training Data	31
3.3.3  Normalized Offsets	35
3.3.4  Training Method	37
Chapter 4 Experiments and Discussion	40
4.1  Datasets	40
4.2  Experimental Environment	43
4.3  Training Data	44
4.4  Network Outputs at Different Stages	46
4.4.1  Training Stage	46
4.4.2  Detection Stage	51
4.5  Experimental Results	53
4.5.1  Confusion Matrix and Precision-Recall Curves	54
4.5.2  Validation on WIDER FACE	55
4.5.3  Facial Landmark Estimation Validation	64
4.5.4  Model Characteristics and Detection Results	69
Chapter 5 Conclusion	78
References	79

List of Figures
Figure 2-1 Neuron computation architecture	7
Figure 2-2 Convolutional layer operation	9
Figure 2-3 Zero padding	9
Figure 2-4 Pooling layer operation	10
Figure 2-5 Pooling after translation	11
Figure 2-6 Pooling after rotation	11
Figure 2-7 Enlarged receptive field	12
Figure 2-8 ReLU function	12
Figure 2-9 Two normally intersecting boxes	14
Figure 2-10 Enclosing box	14
Figure 2-11 Non-maximum suppression procedure	15
Figure 3-1 Proposal network architecture [15]	16
Figure 3-2 Refine network architecture [15]	17
Figure 3-3 Output network architecture [15]	17
Figure 3-4 Detection flowchart	19
Figure 3-5 Image pyramid	20
Figure 3-6 Building the image pyramid	21
Figure 3-7 Proposal network prediction (1)	22
Figure 3-8 Proposal network prediction (2)	23
Figure 3-9 Results of actual proposal network prediction	23
Figure 3-10 Training flowchart	24
Figure 3-11 Proposal network training data generation pipeline	27
Figure 3-12 Refine network training data generation pipeline	28
Figure 3-13 Output network training data generation pipeline	29
Figure 3-14 Data generation flowchart	31
Figure 3-15 Cropping box illustration	32
Figure 3-16 Cropping box illustration	34
Figure 3-17 Face box offset computation	35
Figure 3-18 Facial landmark position offset computation	36
Figure 4-1 WIDER FACE dataset	41
Figure 4-2 Crowd	41
Figure 4-3 Single face	42
Figure 4-4 Complex background with small-scale faces	42
Figure 4-5 Deep CNN dataset	42
Figure 4-6 CelebA dataset	43
Figure 4-7 Training data	44
Figure 4-8 Storage of training data offsets	45
Figure 4-9 Proposal network training data	46
Figure 4-10 Comparison of proposal network candidate box calibration results	47
Figure 4-11 Refine network training data collection results	47
Figure 4-12 Comparison of refine network output calibration results	49
Figure 4-13 Output network training data	50
Figure 4-14 Facial landmark position training data	50
Figure 4-15 Building the image pyramid	51
Figure 4-16 Proposal network output (left: before regression calibration; right: after)	52
Figure 4-17 Refine network output (left: before regression calibration; right: after)	52
Figure 4-18 Output network output (left: before regression calibration; right: after calibration with facial landmarks)	53
Figure 4-19 Easy validation set comparison	56
Figure 4-20 Medium validation set comparison	57
Figure 4-21 Hard validation set comparison	58
Figure 4-22 Effect of batch size adjustment on training	60
Figure 4-23 WIDER FACE validation results after adjustment	61
Figure 4-24 Effect of parameter adjustment on model predictions	62
Figure 4-25 Effect of parameter adjustment on model predictions	63
Figure 4-26 Effect of parameter adjustment on model predictions	63
Figure 4-27 (a), (b): Comparison of WIDER FACE validation results	66
Figure 4-28 (a), (b): Bar charts of facial landmark estimation validation	67
Figure 4-29 (a)-(g): Detection results	77

List of Tables
Table 4-1 Confusion matrix	54
Table 4-2 Parameter adjustment settings	60
Table 4-3 Parameter settings	65
References
[1] 	Y. LeCun, Y. Bengio and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436-444, 2015. 
[2] 	Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, pp. 2278-2324, 1998. 
[3] 	A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Neural Information Processing Systems, vol. 1, pp. 1097-1105, 2012. 
[4] 	O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg and F. Li, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision, vol. 115, pp. 211-252, Sept. 2015. 
[5] 	R. Girshick, J. Donahue, T. Darrell and J. Malik, "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation," IEEE Conference on Computer Vision and Pattern Recognition, pp. 580-587, 2014. 
[6] 	J. Redmon, S. Divvala, R. Girshick and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779-788, 2016. 
[7] 	W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Y. Fu and A. C. Berg, "SSD: Single Shot MultiBox Detector," Proceedings of the European Conference on Computer Vision (ECCV), vol. 9905, pp. 21-37, 2016. 
[8] 	S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, 1 June 2017. 
[9] 	P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137-154, 2004. 
[10] 	Y. Freund and R. E. Schapire, "A brief introduction to boosting," Proceedings of the 16th international joint conference on Artificial intelligence, vol. 2, pp. 1401-1406, 1999. 
[11] 	J. Yan, Z. Lei, L. Wen and S. Z. Li, "The Fastest Deformable Part Model for Object Detection," IEEE Conference on Computer Vision and Pattern Recognition, pp. 2497-2504, 2014. 
[12] 	M. Mathias, R. Benenson, M. Pedersoli and L. Van Gool, "Face Detection without Bells and Whistles," European Conference on Computer Vision, pp. 720-735, 2014. 
[13] 	X. Zhu and D. Ramanan, "Face detection, pose estimation, and landmark localization in the wild," IEEE Conference on Computer Vision and Pattern Recognition, pp. 2879-2886, 2012. 
[14] 	H. Li, Z. Lin, X. Shen, J. Brandt and G. Hua, "A convolutional neural network cascade for face detection," IEEE Conference on Computer Vision and Pattern Recognition, pp. 5325-5334, 2015. 
[15] 	K. Zhang, Z. Zhang, Z. Li and Y. Qiao, "Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks," IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499-1503, Oct. 2016. 
[16] 	J. Long, E. Shelhamer and T. Darrell, "Fully convolutional networks for semantic segmentation," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431-3440, 2015. 
[17] 	S. Yang, P. Luo, C. C. Loy and X. Tang, "WIDER FACE: A Face Detection Benchmark," IEEE Conference on Computer Vision and Pattern Recognition, pp. 5525-5533, 2016. 
[18] 	Y. Sun, X. Wang and X. Tang, "Deep Convolutional Network Cascade for Facial Point Detection," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3476-3483, 2013. 
[19] 	G. B. Huang, M. Ramesh, T. Berg and E. Learned-Miller, "Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments," University of Massachusetts, Amherst, Technical Report 07-49, 2007. 
[20] 	Z. Liu, P. Luo, X. Wang and X. Tang, "Deep Learning Face Attributes in the Wild," 2015 IEEE International Conference on Computer Vision (ICCV), pp. 3730-3738, 2015. 
[21] 	Z. Yang and R. Nevatia, "A multi-scale cascade fully convolutional network face detector," 23rd International Conference on Pattern Recognition (ICPR), pp. 633-638, 2016. 
[22] 	S. Yang, P. Luo, C. C. Loy and X. Tang, "Faceness-Net: Face Detection through Deep Facial Part Responses," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018. 
[23] 	B. Yang, J. Yan, Z. Lei and S. Z. Li, "Aggregate channel features for multi-view face detection," IEEE International Joint Conference on Biometrics, pp. 1-8, 2014. 
[24] 	D. Chen, S. Ren, Y. Wei, X. Cao and J. Sun, "Joint cascade face detection and alignment," European Conference on Computer Vision, 2014. 
[25] 	J. Uijlings, K. van de Sande, T. Gevers and A. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154-171, Sept. 2013. 
[26] 	M. Köstinger, P. Wohlhart, P. M. Roth and H. Bischof, "Annotated Facial Landmarks in the Wild: A large-scale, real-world database for facial landmark localization," IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 2144-2151, 2011. 
[27] 	R. Girshick, "Fast R-CNN," 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1440-1448, 2015. 
[28] 	C. Zhang and Z. Zhang, "Improving multiview face detection with multi-task deep convolutional neural networks," IEEE Winter Conference on Applications of Computer Vision, pp. 1036-1041, 2014.
Full-Text Usage Permissions
On campus
Printed thesis available on campus immediately
Electronic full text authorized for release on campus
On-campus electronic thesis available immediately
Off campus
Authorization granted
Off-campus electronic thesis available immediately
