§ Browse Thesis Bibliographic Record
  
System ID U0002-2108202308080900
DOI 10.6846/tku202300588
Title (Chinese) 基於影像處理、深度學習協同合作對劣質影片進行標準行為評鑑
Title (English) The Standard Behavioral Evaluation of Low-Quality Videos Based on Collaborative of Image Processing and Deep Learning
Title (Third Language)
University Tamkang University
Department (Chinese) 資訊工程學系全英語碩士班
Department (English) Master's Program, Department of Computer Science and Information Engineering (English-taught program)
Foreign University Name
Foreign College Name
Foreign Graduate Institute Name
Academic Year 111
Semester 2
Year of Publication 112
Student Name (Chinese) 尹矞田
Student Name (English) Yu-Tien Inn
Student ID 610780099
Degree Master's
Language English
Second Language
Oral Defense Date 2023-06-09
Number of Pages 81
Committee Members Advisor - 武士戎 (wushihjung@mail.tku.edu.tw)
Committee Member - 陳宗禧
Committee Member - 陳裕賢
Committee Member - 張志勇 (cychang@mail.tku.edu.tw)
Keywords (Chinese) Object Detection
Image Processing
Action Recognition
Deep Learning
Object Tracking
Siamese Network
3DCNN
YOLO
Mystery Shopper Audit
Keywords (English) Image Processing
Object Detection
Deep Learning
Siamese Network
Action Recognition
YOLO
3DCNN
Keywords (Third Language)
Subject Classification
Abstract (Chinese)
In countries and regions with highly developed service industries, such as Taiwan, Japan, South Korea, mainland China, Europe, and the United States, standard-behavior evaluation of employee service and sales is usually carried out through mystery-shopper audits. Auditors visit the site in person to verify whether employees follow the company's service standards and put the corporate culture into practice, and to gauge how satisfied customers are with their experience; during the audit, the mystery shopper can further offer suggestions for business management or innovation from the customer's point of view.
In a mystery-shopper audit, an auditor secretly records the service staff on video and then judges and scores their actions. For example, the auditor covertly films attendants in a department store or restaurant to check whether their standing posture is incorrect (e.g., the slouched "three-seven" stance), whether they keep their fingers together when directing a customer, or whether they hand items to customers with both hands. Such judgments often depend on the auditor's subjective perception: although standard answers are defined, the evaluations are expressed with adjectives such as "enthusiastic" or "proactive", which makes the scoring inefficient and subjective. Another major pain point is the large amount of time spent watching the videos and locating instances of improper service.
With the progress of technology and the rapid advances in artificial intelligence and deep learning, this thesis develops an AI standard-action evaluation system that combines image processing, object detection, and a video Siamese network to improve both the speed and the quality of service auditing. The system first uses a self-trained YOLO object-detection model to locate the targets to be recognized in low-quality videos, extract their features, and classify their actions; it then applies a feature-enhancement technique to the extracted targets to suppress noise; finally, a 3D-CNN Siamese network compares the similarity between the reference (correct-action) video and the video under inspection, enabling fine-grained action comparison. In addition, a newly designed automatic video filtering and classification algorithm extracts the segments of a full video that contain targets and sorts them into the corresponding training sets, greatly reducing data pre-processing time.
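To make the pipeline above concrete, the following is a minimal Python sketch of the first stage only: running a custom-trained YOLO detector on an audit video and cropping each detected target region, keeping the frame index so clips can later be cut around the detected segments. It is an illustrative assumption rather than the thesis's actual implementation; the ultralytics package, the weights file "staff_best.pt", the input file "audit_video.mp4", and the 0.5 confidence threshold are all placeholders.

# Minimal sketch (assumed setup): detect and crop service-staff targets with a
# custom-trained YOLO model, frame by frame, from a low-quality audit video.
import cv2
from ultralytics import YOLO

model = YOLO("staff_best.pt")                 # placeholder custom-trained weights
cap = cv2.VideoCapture("audit_video.mp4")     # placeholder mystery-shopper recording

crops, frame_idx = [], 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    result = model(frame, verbose=False)[0]   # run detection on the current frame
    for box, conf in zip(result.boxes.xyxy.tolist(), result.boxes.conf.tolist()):
        if conf < 0.5:                        # keep only confident detections
            continue
        x1, y1, x2, y2 = map(int, box)
        crops.append((frame_idx, frame[y1:y2, x1:x2]))  # cropped target patch
    frame_idx += 1
cap.release()
# `crops` now holds (frame index, target patch) pairs; downstream stages would
# reassemble them into per-action clips for feature enhancement and the
# 3D-CNN Siamese comparison.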
Abstract (English)
In countries with well-developed service industries, such as Taiwan, Japan, South Korea, mainland China, Europe, and the United States, standard behavior evaluation during employee service and sales is commonly carried out through mystery-shopper audits. Auditors visit sites in person to confirm whether employees follow the company's service standards, whether the corporate culture is put into practice, and how satisfied customers are with their experience; during the audit, the mystery shopper can also offer suggestions for business management or innovation from the customer's perspective.
In a mystery-shopper audit, the auditor secretly videotapes service personnel and then judges and scores their actions. For example, the auditor covertly films attendants in a department store or restaurant to check whether their standing posture is incorrect (the slouched "three-seven" stance), whether they keep their fingers together when directing a customer, or whether they hand items to customers with both hands. Such judgments often rely on the auditor's subjective perception: although standard answers are defined, the evaluations are expressed as adjectives such as "enthusiastic" or "proactive", which makes scoring inefficient and subjective. Another major pain point is the large amount of time spent watching videos and locating improper service behavior.
With the advancement of technology and the rapid progress of artificial intelligence and deep learning, this thesis builds an AI standard-action evaluation system based on image processing, object detection, and a video Siamese network to improve both the speed and the quality of service auditing. The system uses a self-trained YOLO object-detection model to capture the position and features of the targets to be recognized from low-quality videos and to classify their actions; a feature-enhancement technique then further refines the extracted targets to remove noise; finally, a 3D-CNN Siamese network compares the similarity between the correct-action video and the video to be inspected, enabling fine-grained action comparison. In addition, a newly designed automatic video filtering and classification algorithm filters out the target segments from a full video and sorts them into the corresponding training sets, saving substantial data pre-processing time.
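The final stage described in both abstracts, matching a reference "correct action" clip against a clip taken from the audit video with a shared-weight 3D CNN, can be sketched as follows. This is only a schematic PyTorch example under assumed clip shapes and a toy encoder; it is not the C3D backbone or the scoring thresholds used in the thesis.

# Schematic 3D-CNN Siamese comparison with a contrastive loss (toy encoder,
# assumed clip shape: (batch, channels, frames, height, width)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Clip3DEncoder(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),          # collapse (T, H, W) into one vector
        )
        self.fc = nn.Linear(32, embed_dim)

    def forward(self, clip):                  # clip: (B, 3, T, H, W)
        return self.fc(self.features(clip).flatten(1))

def contrastive_loss(dist, same, margin=1.0):
    # same = 1 for matching action pairs, 0 for mismatched pairs
    return (same * dist.pow(2) + (1 - same) * F.relu(margin - dist).pow(2)).mean()

encoder = Clip3DEncoder()                      # both branches share these weights
ref = torch.randn(1, 3, 16, 112, 112)          # reference "standard action" clip
test = torch.randn(1, 3, 16, 112, 112)         # clip extracted from the audit video
dist = F.pairwise_distance(encoder(ref), encoder(test))
loss = contrastive_loss(dist, torch.tensor([1.0]))  # training step for a "same" pair
score = 1.0 / (1.0 + dist)                     # toy similarity score for evaluation
print(float(dist), float(score))

At evaluation time, a distance (or similarity) threshold decides whether the audited action matches the standard one; the thesis compares several pairing operations (subtract, Euclidean distance, dot product) for this step, as listed in Tables 5-7.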
Abstract (Third Language)
Table of Contents
LIST OF CONTENTS
LIST OF CONTENTS	VII
LIST OF FIGURES	X
LIST OF TABLES	XIII
CHAPTER 1 INTRODUCTION	1
1-1 Research Background and Motivation	1
1-2 Research Objectives	3
1-3 Research Contributions	4
CHAPTER 2 RELATED WORKS	8
2-1 Low-Quality Image Processing	8
2-1-1 Video Denoising	8
2-1-2 Video Super-Resolution, VSR	11
2-1-3 Resolution Knowledge Distillation, ResKD	15
2-2 Object Detection	17
2-2-1 Anchor-based	17
2-2-2 Anchor-free	18
2-3 Behavior Recognition	20
2-3-1 Two-Stream Network	20
2-3-2 Skeleton Keypoint-Based Action Recognition	21
2-3-3 3D Convolutional Network (3D CNN)	23
2-3-4 Multimodal Behavior Recognition	25
2-3-5 Siamese Network	26
CHAPTER 3 BACKGROUND KNOWLEDGE	30
3-1 Object Detection	30
3-1-1 YOLO	30
3-2 Behavior Recognition	33
3-2-1 3DCNN	33
3-2-2 C3D	34
3-3 Action Scoring System	36
3-3-1 Siamese Network	36
CHAPTER 4 RESEARCH METHODOLOGY	38
4-1 Problem Description	38
4-1-1 Context and Problem Description	38
4-1-2 Goal	39
4-2 System Architecture	39
CHAPTER 5 EXPERIMENT ANALYSIS	60
5-1 Experimental Environment	60
5-2 Dataset	60
5-3 Experimental Results	61
CHAPTER 6 CONCLUSION	77
REFERENCES	79

LIST OF FIGURES
Figure 1: Overall research architecture diagram	4
Figure 2: ViDeNN network architecture (taken from [16])	10
Figure 3: TGA network architecture (taken from [19])	12
Figure 4: TinyVIRAT contains a single target with low background noise (taken from [28])	14
Figure 5:Schematic diagram of ResKD (taken from [21])	16
Figure 6: Schematic diagram of SSD (taken from [10])	18
Figure 7: TSN network based on the dual-stream architecture (taken from [25])	21
Figure 8: Using 3D skeletal joints in deep learning models (taken from [26])	23
Figure 9: Schematic diagram of 3D convolution (taken from [13])	24
Figure 10: Schematic diagram of the multimodal classifier architecture for text, audio, and video (taken from [35])	25
Figure 11: Siamese network architecture diagram (taken from [36])	27
Figure 12: C3D network structure (taken from [14])	35
Figure 13: The Siamese network architecture proposed in this paper	37
Figure 14: This system will be designed based on artificial intelligence technology to achieve the objective	39
Figure 15: Overall research framework diagram	40
Figure 16: The data is divided into three main categories along with a description of the data characteristics	41
Figure 17: Semi-automatic annotation algorithm	42
Figure 18: Automated video filtering and classification algorithm	43
Figure 19: YOLO automatically segments target clips and records target information	44
Figure 20: Automatically segment target clips and record target information	45
Figure 21: Feature enhancement algorithm	46
Figure 22: Comparison before and after adjusting the target box	47
Figure 23: Reassembling clips after feature enhancement	48
Figure 24: Training phase of the mystery-shopper target action classifier	49
Figure 25: Example of using the mystery-shopper target action classifier	50
Figure 26: The C3D Siamese network established in this paper	51
Figure 27: C3D Siamese network training flowchart	54
Figure 28: C3D Siamese network training process	55
Figure 29: Usage phase and scoring algorithm of the C3D Siamese network	56
Figure 30: Schematic diagram of the output error segment	57
Figure 31: Outputting error intervals through target data and the recorded timeline	57
Figure 32: Implementation example of the system GUI designed in this paper	59
Figure 33: Training time and model size for various models using the dataset from this paper	62
Figure 34: Comparison of training parameters for various models using the dataset from this paper	63
Figure 35: Comparison of training for various models using the dataset from this paper	64
Figure 36: Comparison of resolution enhancement for different levels of features	66
Figure 37: Comparison of parameters using feature enhancement algorithms for different models	67
Figure 38: Accuracy comparison of mystery-shopper action matching with the Siamese model	70
Figure 39: Precision comparison of mystery-shopper action matching with the Siamese model	71
Figure 40: Recall comparison of mystery-shopper action matching with the Siamese model	71
Figure 41: F1-score comparison of mystery-shopper action matching with the Siamese model	72
Figure 42: Comparison of accuracy for different actions with various threshold values	73
Figure 43: Comparison of precision for different actions with various threshold values	73
Figure 44: Comparison of recall for different actions with various threshold values	74
Figure 45: Comparison of F1-score for different actions with various threshold values	74

LIST OF TABLES
Table 1: Comparison of related works	29
Table 2: Symbol definitions of the Siamese network model	51
Table 3: Symbol definitions of Contrastive Loss	52
Table 4: Confusion matrix table	68
Table 5: Subtract - results of the mystery-shopper action comparison using the Siamese model	69
Table 6: Euclidean distance - results of the mystery-shopper action comparison using the Siamese model	69
Table 7: Dot product - results of the mystery-shopper action comparison using the Siamese model	70
Table 8: Comparison of the methodology in this paper with methods from other papers	75
References
[1]	S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," Advances in neural information processing systems, vol. 28, 2015.
[2]	J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779-788. 
[3]	M. J. Shafiee, B. Chywl, F. Li, and A. Wong, "Fast YOLO: A fast you only look once system for real-time embedded object detection in video," arXiv preprint arXiv:1709.05943, 2017.
[4]	J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[5]	A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," arXiv preprint arXiv:2004.10934, 2020.
[6]	C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, "Scaled-YOLOv4: Scaling cross stage partial network," in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 13029-13038. 
[7]	G. Jocher et al., "ultralytics/yolov5: v7.0 - YOLOv5 SOTA realtime instance segmentation," Zenodo, 2022.
[8]	C. Li et al., "YOLOv6: A single-stage object detection framework for industrial applications," arXiv preprint arXiv:2209.02976, 2022.
[9]	C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7464-7475. 
[10]	W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, 2016: Springer, pp. 21-37. 
[11]	K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, "CenterNet: Keypoint triplets for object detection," in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 6569-6578. 
[12]	K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," Advances in neural information processing systems, vol. 27, 2014.
[13]	S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 1, pp. 221-231, 2012.
[14]	D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3d convolutional networks," in Proceedings of the IEEE international conference on computer vision, 2015, pp. 4489-4497. 
[15]	S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[16]	M. Claus and J. van Gemert, "ViDeNN: Deep blind video denoising," in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2019, pp. 0-0. 
[17]	X. Xu, M. Li, and W. Sun, "Learning deformable kernels for image and video denoising," arXiv preprint arXiv:1904.06903, 2019.
[18]	M. Maggioni, Y. Huang, C. Li, S. Xiao, Z. Fu, and F. Song, "Efficient multi-stage video denoising with recurrent spatio-temporal fusion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3466-3475. 
[19]	T. Isobe et al., "Video super-resolution with temporal group attention," in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 8008-8017. 
[20]	A. Kappeler, S. Yoo, Q. Dai, and A. K. Katsaggelos, "Video super-resolution with convolutional neural networks," IEEE transactions on computational imaging, vol. 2, no. 2, pp. 109-122, 2016.
[21]	C. Ma, Q. Guo, Y. Jiang, P. Luo, Z. Yuan, and X. Qi, "Rethinking resolution in the context of efficient video recognition," Advances in Neural Information Processing Systems, vol. 35, pp. 37865-37877, 2022.
[22]	M. Gao, Y. Shen, Q. Li, and C. C. Loy, "Residual knowledge distillation," arXiv preprint arXiv:2002.09168, 2020.
[23]	M. S. Sajjadi, R. Vemulapalli, and M. Brown, "Frame-recurrent video super-resolution," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6626-6634. 
[24]	D. Purwanto, R. Renanda Adhi Pramono, Y.-T. Chen, and W.-H. Fang, "Extreme low resolution action recognition with spatial-temporal multi-head self-attention and knowledge distillation," in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019, pp. 0-0. 
[25]	L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, "Temporal segment networks: Towards good practices for deep action recognition," in European conference on computer vision, 2016: Springer, pp. 20-36. 
[26]	B. Ren, M. Liu, R. Ding, and H. Liu, "A survey on 3d skeleton-based action recognition using learning method," arXiv preprint arXiv:2002.05907, 2020.
[27]	W. Du, Y. Wang, and Y. Qiao, "RPAN: An end-to-end recurrent pose-attention network for action recognition in videos," in Proceedings of the IEEE international conference on computer vision, 2017, pp. 3725-3734. 
[28]	U. Demir, Y. S. Rawat, and M. Shah, "TinyVIRAT: Low-resolution video action recognition," in 2020 25th international conference on pattern recognition (ICPR), 2021: IEEE, pp. 7387-7394. 
[29]	R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580-587. 
[30]	R. Girshick, "Fast R-CNN," in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440-1448. 
[31]	K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961-2969. 
[32]	Z. Zhang, "Microsoft kinect sensor and its effect," IEEE multimedia, vol. 19, no. 2, pp. 4-10, 2012.
[33]	W. Yang, W. Ouyang, H. Li, and X. Wang, "End-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3073-3082. 
[34]	X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang, "Multi-context attention for human pose estimation," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1831-1840. 
[35]	A. Nagrani, C. Sun, D. Ross, R. Sukthankar, C. Schmid, and A. Zisserman, "Speech2Action: Cross-modal supervision for action recognition," in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 10317-10326. 
[36]	S. Chopra, R. Hadsell, and Y. LeCun, "Learning a similarity metric discriminatively, with application to face verification," in 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR'05), 2005, vol. 1: IEEE, pp. 539-546.
Full-Text Access Permissions
National Central Library
The author agrees to grant the National Central Library a royalty-free license; the bibliographic record and full-text electronic file are made publicly available on the Internet immediately after the authorization form is submitted.
On campus
The printed thesis is made available on campus immediately.
The author agrees to make the full electronic thesis publicly available worldwide.
The electronic thesis is made available on campus immediately.
Off campus
The author agrees to license the thesis to database vendors.
The electronic thesis is made available off campus immediately.
