| System ID | U0002-2108202308080900 |
|---|---|
| DOI | 10.6846/tku202300588 |
| Title (Chinese) | 基於影像處理、深度學習協同合作對劣質影片進行標準行為評鑑 |
| Title (English) | The Standard Behavioral Evaluation of Low-Quality Videos Based on the Collaboration of Image Processing and Deep Learning |
| Title (Third Language) | |
| University | Tamkang University (淡江大學) |
| Department (Chinese) | 資訊工程學系全英語碩士班 |
| Department (English) | Master's Program, Department of Computer Science and Information Engineering (English-taught program) |
| Foreign Degree: University | |
| Foreign Degree: College | |
| Foreign Degree: Graduate Institute | |
| Academic Year (ROC) | 111 |
| Semester | 2 |
| Year of Publication (ROC) | 112 |
| Author (Chinese) | 尹矞田 |
| Author (English) | Yu-Tien Inn |
| Student ID | 610780099 |
| Degree | Master's |
| Language | English |
| Second Language | |
| Date of Oral Defense | 2023-06-09 |
| Number of Pages | 81 |
| Oral Defense Committee | Advisor: 武士戎 (wushihjung@mail.tku.edu.tw); Committee members: 陳宗禧, 陳裕賢, 張志勇 (cychang@mail.tku.edu.tw) |
| Keywords (Chinese) | Object detection; Image processing; Action recognition; Deep learning; Object tracking; Siamese network; 3DCNN; YOLO; Mystery shopper audit |
| Keywords (English) | Image processing; Object detection; Deep learning; Siamese network; Action recognition; YOLO; 3DCNN |
| Keywords (Third Language) | |
| Subject Classification | |
| Abstract (Chinese) |
In countries with highly developed service industries, such as Taiwan, Japan, South Korea, mainland China, Europe, and the United States, standard behavior evaluation of employee-customer service and sales is mostly carried out through mystery shopper audits: an auditor visits the store in person to confirm whether employees follow the company's service standards and put its corporate culture into practice, and how satisfied customers feel about the consumption experience. During the audit, the mystery shopper can further offer suggestions on business management or innovation from the customer's point of view.

In a mystery shopper service audit, the auditor secretly videotapes the service staff and then judges and scores their actions from the recording. For example, an auditor may covertly film the attendants in a department store or restaurant to see whether they stand with an incorrect posture (the uneven "三七步" stance), fail to keep their fingers together when directing a customer, or fail to hand items to a customer with both hands. Such judgments, however, often depend on the auditor's subjective awareness: although standard answers are defined, the comments are adjectives such as "enthusiastic" or "proactive," which makes the evaluation less efficient and more subjective. Another major pain point is the large amount of time required to watch the videos and find the instances of improper service.

With the progress of technology and the leap forward of artificial intelligence and deep learning networks, this thesis aims to build an AI standard action evaluation system through image processing, object detection, and a video Siamese network, raising both the speed and the quality of the service audit system. Using image recognition, the system first applies a self-trained YOLO object detection model to extract the position and features of the target to be recognized from low-quality video and to classify the target's action; a feature enhancement technique then further refines the features of the captured target in order to remove noise; finally, a 3DCNN Siamese network compares the similarity between the correct-action video and the video to be inspected, so that fine-grained action comparison can be achieved. In addition, this thesis uses a newly designed automatic video filtering and classification algorithm that filters out the segments containing targets from a complete video and sorts them into the corresponding training sets, saving a large amount of data preprocessing time. |
| Abstract (English) |
In countries with well-developed service industries, such as Taiwan, Japan, South Korea, mainland China, Europe, and the United States, standard behavior evaluation of employee-customer service and sales is usually conducted through mystery shopper audits. Auditors visit stores in person to verify whether employees follow the company's service standards and embody its corporate culture, and to assess customer satisfaction with the consumption experience; during the audit, the mystery shopper can even offer suggestions on business management or innovation from the customer's point of view. In a mystery shopper audit, the auditor secretly videotapes the service staff and then judges and scores their actions from the footage. For example, an auditor may covertly film the attendants in a department store or restaurant to check whether they stand with an improper posture (the uneven "三七步" stance), fail to keep their fingers together when directing a customer, or fail to hand items to a customer with both hands. Such judgments, however, depend heavily on the auditor's subjective perception: although standard answers are defined, the evaluations are expressed as adjectives such as "enthusiastic" or "proactive," which makes the assessment inefficient and subjective. Another major pain point is the large amount of time spent watching the videos and locating improper service behaviors. With the advancement of technology and the rapid progress of artificial intelligence and deep learning, this thesis develops an AI standard action evaluation system based on image processing, object detection, and a video Siamese network, in order to improve both the speed and the quality of the service audit process. The system uses a self-trained YOLO object detection model to locate the target, extract its features, and classify its action from low-quality video; it then applies a feature enhancement technique to the extracted targets to refine the features and suppress noise; finally, a 3DCNN Siamese network compares the similarity between the reference video of the correct action and the video under inspection, achieving fine-grained action comparison. In addition, this thesis presents a newly designed automatic video filtering and classification algorithm that extracts the segments containing targets from a full-length video and sorts them into the corresponding training sets, saving a large amount of data preprocessing time. |
| Abstract (Third Language) | |
| Table of Contents |
LIST OF CONTENTS
LIST OF CONTENTS VII
LIST OF FIGURES X
LIST OF TABLES XIII
CHAPTER 1 INTRODUCTION 1
1-1 Research Background and Motivation 1
1-2 Research Objectives 3
1-3 Research Contributions 4
CHAPTER 2 RELATED WORKS 8
2-1 Low-Quality Image Processing 8
2-1-1 Video Denoising 8
2-1-2 Video Super-Resolution, VSR 11
2-1-3 Resolution Knowledge Distillation, ResKD 15
2-2 Object Detection 17
2-2-1 Anchor-based 17
2-2-2 Anchor-free 18
2-3 Behavior Recognition 20
2-3-1 Two-Stream Network 20
2-3-2 Skeleton Keypoint-Based Action Recognition 21
2-3-3 3D Convolutional Network (3D CNN) 23
2-3-4 Multimodal Behavior Recognition 25
2-3-5 Siamese Network 26
CHAPTER 3 BACKGROUND KNOWLEDGE 30
3-1 Object Detection 30
3-1-1 YOLO 30
3-2 Behavior Recognition 33
3-2-1 3DCNN 33
3-2-2 C3D 34
3-3 Action Scoring System 36
3-3-1 Siamese Network 36
CHAPTER 4 RESEARCH METHODOLOGY 38
4-1 Problem Description 38
4-1-1 Context and Problem Description 38
4-1-2 Goal 39
4-2 System Architecture 39
CHAPTER 5 EXPERIMENT ANALYSIS 60
5-1 Experimental Environment 60
5-2 Dataset 60
5-3 Experimental Results 61
CHAPTER 6 CONCLUSION 77
REFERENCES 79
LIST OF FIGURES
Figure 1: Overall research architecture diagram 4
Figure 2: ViDeNN network architecture (taken from [16]) 10
Figure 3: TGA network architecture (taken from [19]) 12
Figure 4: TinyVIRAT is for a single target and has low background noise (taken from [28]) 14
Figure 5: Schematic diagram of ResKD (taken from [21]) 16
Figure 6: Schematic diagram of SSD (taken from [10]) 18
Figure 7: TSN network based on the dual-stream architecture (taken from [25]) 21
Figure 8: Using 3D skeletal joints in deep learning models (taken from [26]) 23
Figure 9: Schematic diagram of 3D convolution (taken from [13]) 24
Figure 10: Schematic diagram of the multimodal classifier architecture for text, audio, and video (taken from [35]) 25
Figure 11: Siamese network architecture diagram (taken from [36]) 27
Figure 12: C3D network structure (taken from [14]) 35
Figure 13: The Siamese network architecture proposed in this paper 37
Figure 14: This system will be designed based on artificial intelligence technology to achieve the objective 39
Figure 15: Overall research framework diagram 40
Figure 16: The data is divided into three main categories along with a description of the data characteristics 41
Figure 17: Semi-automatic annotation algorithm 42
Figure 18: Automated video filtering and classification algorithm 43
Figure 19: YOLO automatically segments target clips and records target information 44
Figure 20: Automatically segment target clips and record target information 45
Figure 21: Feature enhancement algorithm 46
Figure 22: Comparison before and after adjusting the target box 47
Figure 23: Reassembling clips after feature enhancement 48
Figure 24: Mystery shopper target action classifier training phase 49
Figure 25: Example of using the mystery shopper target action classifier 50
Figure 26: The C3D Siamese network established in this paper 51
Figure 27: C3D Siamese network training flowchart 54
Figure 28: C3D Siamese network training process 55
Figure 29: Usage phase and scoring algorithm of the C3D Siamese network 56
Figure 30: Schematic diagram of the output error segment 57
Figure 31: Outputting error intervals through target data and the recorded timeline 57
Figure 32: Implementation example of the system GUI designed in this paper 59
Figure 33: Training time and model size for various models using the dataset from this paper 62
Figure 34: Comparison of training parameters for various models using the dataset from this paper 63
Figure 35: Comparison of training for various models using the dataset from this paper 64
Figure 36: Comparison of resolution enhancement for different levels of features 66
Figure 37: Comparison of parameters using feature enhancement algorithms for different models 67
Figure 38: Accuracy comparison of the mystery shopper action matching with the Siamese model 70
Figure 39: Precision comparison of the mystery shopper action matching with the Siamese model 71
Figure 40: Recall comparison of the mystery shopper action matching with the Siamese model 71
Figure 41: F1-score comparison of the mystery shopper action matching with the Siamese model 72
Figure 42: Comparison of accuracy for different actions with various threshold values 73
Figure 43: Comparison of precision for different actions with various threshold values 73
Figure 44: Comparison of recall for different actions with various threshold values 74
Figure 45: Comparison of F1-score for different actions with various threshold values 74
LIST OF TABLES
Table 1: Comparison of related works 29
Table 2: Symbol definitions of the Siamese network model 51
Table 3: Symbol definitions of the contrastive loss 52
Table 4: Confusion matrix table 68
Table 5: Subtract - Results of the mystery shopper action comparison using the Siamese model 69
Table 6: Euclidean distance - Results of the mystery shopper action comparison using the Siamese model 69
Table 7: Dot product - Results of the mystery shopper action comparison using the Siamese model 70
Table 8: Comparison of the methodology in this paper with methods from other papers 75 |
| References |
[1] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," Advances in Neural Information Processing Systems, vol. 28, 2015.
[2] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779-788.
[3] M. J. Shafiee, B. Chywl, F. Li, and A. Wong, "Fast YOLO: A fast you only look once system for real-time embedded object detection in video," arXiv preprint arXiv:1709.05943, 2017.
[4] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[5] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," arXiv preprint arXiv:2004.10934, 2020.
[6] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, "Scaled-YOLOv4: Scaling cross stage partial network," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13029-13038.
[7] G. Jocher et al., "ultralytics/yolov5: v7.0 - YOLOv5 SOTA realtime instance segmentation," Zenodo, 2022.
[8] C. Li et al., "YOLOv6: A single-stage object detection framework for industrial applications," arXiv preprint arXiv:2209.02976, 2022.
[9] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7464-7475.
[10] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in Computer Vision - ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I, Springer, 2016, pp. 21-37.
[11] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, "CenterNet: Keypoint triplets for object detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6569-6578.
[12] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," Advances in Neural Information Processing Systems, vol. 27, 2014.
[13] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221-231, 2012.
[14] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489-4497.
[15] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[16] M. Claus and J. van Gemert, "ViDeNN: Deep blind video denoising," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[17] X. Xu, M. Li, and W. Sun, "Learning deformable kernels for image and video denoising," arXiv preprint arXiv:1904.06903, 2019.
[18] M. Maggioni, Y. Huang, C. Li, S. Xiao, Z. Fu, and F. Song, "Efficient multi-stage video denoising with recurrent spatio-temporal fusion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3466-3475.
[19] T. Isobe et al., "Video super-resolution with temporal group attention," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8008-8017.
[20] A. Kappeler, S. Yoo, Q. Dai, and A. K. Katsaggelos, "Video super-resolution with convolutional neural networks," IEEE Transactions on Computational Imaging, vol. 2, no. 2, pp. 109-122, 2016.
[21] C. Ma, Q. Guo, Y. Jiang, P. Luo, Z. Yuan, and X. Qi, "Rethinking resolution in the context of efficient video recognition," Advances in Neural Information Processing Systems, vol. 35, pp. 37865-37877, 2022.
[22] M. Gao, Y. Shen, Q. Li, and C. C. Loy, "Residual knowledge distillation," arXiv preprint arXiv:2002.09168, 2020.
[23] M. S. Sajjadi, R. Vemulapalli, and M. Brown, "Frame-recurrent video super-resolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6626-6634.
[24] D. Purwanto, R. Renanda Adhi Pramono, Y.-T. Chen, and W.-H. Fang, "Extreme low resolution action recognition with spatial-temporal multi-head self-attention and knowledge distillation," in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019.
[25] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, "Temporal segment networks: Towards good practices for deep action recognition," in European Conference on Computer Vision, Springer, 2016, pp. 20-36.
[26] B. Ren, M. Liu, R. Ding, and H. Liu, "A survey on 3D skeleton-based action recognition using learning method," arXiv preprint arXiv:2002.05907, 2020.
[27] W. Du, Y. Wang, and Y. Qiao, "RPAN: An end-to-end recurrent pose-attention network for action recognition in videos," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3725-3734.
[28] U. Demir, Y. S. Rawat, and M. Shah, "TinyVIRAT: Low-resolution video action recognition," in 2020 25th International Conference on Pattern Recognition (ICPR), IEEE, 2021, pp. 7387-7394.
[29] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580-587.
[30] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440-1448.
[31] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961-2969.
[32] Z. Zhang, "Microsoft Kinect sensor and its effect," IEEE MultiMedia, vol. 19, no. 2, pp. 4-10, 2012.
[33] W. Yang, W. Ouyang, H. Li, and X. Wang, "End-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3073-3082.
[34] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang, "Multi-context attention for human pose estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1831-1840.
[35] A. Nagrani, C. Sun, D. Ross, R. Sukthankar, C. Schmid, and A. Zisserman, "Speech2Action: Cross-modal supervision for action recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10317-10326.
[36] S. Chopra, R. Hadsell, and Y. LeCun, "Learning a similarity metric discriminatively, with application to face verification," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1, IEEE, 2005, pp. 539-546. |
| Full-Text Access Rights | |
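The automatic video filtering and classification step described in the abstracts above (run a detector over a full-length recording, keep only the segments in which a target is detected, and sort them by detected class) can be illustrated with the minimal sketch below. This is not the thesis implementation: the `ultralytics` package stands in for the self-trained YOLO model, and the confidence threshold, minimum clip length, and output layout are assumptions made for the example.

```python
# Minimal sketch of the "filter a full video into per-class target clips" idea.
# NOT the thesis implementation: the ultralytics package stands in for the
# self-trained YOLO model; thresholds and file layout are assumptions.

from pathlib import Path

import cv2
from ultralytics import YOLO

CONF_THRESHOLD = 0.5      # assumed detector confidence cut-off
MIN_CLIP_FRAMES = 16      # assumed minimum clip length worth keeping


def filter_and_classify(video_path: str, out_dir: str, weights: str = "yolov8n.pt") -> None:
    """Cut out the segments of `video_path` that contain a detected target and
    save each segment under a folder named after the detected class."""
    model = YOLO(weights)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")

    writer = None          # a VideoWriter is open while a target is visible
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame_idx += 1
        result = model(frame, verbose=False)[0]
        boxes = [b for b in result.boxes if float(b.conf) >= CONF_THRESHOLD]

        if boxes:
            if writer is None:                   # a target appeared: start a new clip
                label = result.names[int(boxes[0].cls)]
                clip_dir = Path(out_dir) / label
                clip_dir.mkdir(parents=True, exist_ok=True)
                h, w = frame.shape[:2]
                writer = cv2.VideoWriter(str(clip_dir / f"clip_{frame_idx:06d}.mp4"),
                                         fourcc, fps, (w, h))
            writer.write(frame)
        elif writer is not None:                 # the target left the frame: close the clip
            writer.release()
            writer = None
            # a real system could discard clips shorter than MIN_CLIP_FRAMES here

    if writer is not None:
        writer.release()
    cap.release()
```

In the thesis, this stage also records the target's position and timeline so that error intervals can later be reported (Figures 19-20 and 30-31); the sketch only shows the filter-and-sort idea.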
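The scoring stage, a 3DCNN Siamese network that embeds both the reference clip of the correct action and the clip under inspection and then thresholds the Euclidean distance (with a contrastive loss used during training), might look roughly like the following PyTorch sketch. The layer sizes, clip shape (16 RGB frames at 112x112), embedding dimension, margin, and threshold are illustrative assumptions, not the values used in the thesis.

```python
# Minimal PyTorch sketch of a 3DCNN Siamese comparison (illustrative only).
# Assumptions: clip shape, embedding size, layer configuration, margin, and
# pass/fail threshold are all made up for this example.

import torch
import torch.nn as nn
import torch.nn.functional as F


class C3DEncoder(nn.Module):
    """A small 3D-convolutional encoder shared by both branches of the Siamese network."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((2, 2, 2)),
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, 3, frames, height, width)
        x = self.features(clip).flatten(1)
        return F.normalize(self.fc(x), dim=1)


def contrastive_loss(d: torch.Tensor, same: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Standard contrastive loss: pull matching pairs together, push others beyond the margin."""
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()


def action_score(encoder: C3DEncoder, reference: torch.Tensor, candidate: torch.Tensor,
                 threshold: float = 0.5) -> tuple[float, bool]:
    """Embed both clips, measure the Euclidean distance, and flag whether the action is standard."""
    encoder.eval()
    with torch.no_grad():
        d = F.pairwise_distance(encoder(reference), encoder(candidate)).item()
    return d, d <= threshold     # smaller distance = closer to the standard action


if __name__ == "__main__":
    enc = C3DEncoder()
    ref = torch.randn(1, 3, 16, 112, 112)      # stand-in for the correct-action clip
    cand = torch.randn(1, 3, 16, 112, 112)     # stand-in for the clip under inspection
    print(action_score(enc, ref, cand))
```

The thesis additionally evaluates subtract and dot-product comparison variants and several threshold settings (Tables 5-7, Figures 42-45); the sketch shows only a Euclidean-distance variant.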