| System ID | U0002-2002202312522900 |
|---|---|
| DOI | 10.6846/TKU.2023.00097 |
| Title (Chinese) | 設計與實作基於Transformer之AI技術以實現自動提取及剪輯精彩運動賽事 |
| Title (English) | Design and Implementation of Transformer-Based AI Technologies for Automatically Extracting Wonderful Clips from Sporting Videos |
| Title (third language) | |
| University | 淡江大學 (Tamkang University) |
| Department (Chinese) | 資訊工程學系全英語碩士班 |
| Department (English) | Master's Program, Department of Computer Science and Information Engineering (English-taught program) |
| Foreign degree school | |
| Foreign degree college | |
| Foreign degree institute | |
| Academic year | 111 (2022-23) |
| Semester | 1 |
| Year of publication | 112 (2023) |
| Author (Chinese) | 李欣樺 |
| Author (English) | Hsin-Hua Lee |
| Student ID | 610780024 |
| Degree | Master's |
| Language | English |
| Second language | |
| Defense date | 2023-01-06 |
| Pages | 41 |
| Committee | Advisor: 張志勇 (cychang@mail.tku.edu.tw); Member: 廖文華; Member: 武士戎 |
| Keywords (Chinese) | 棒球、精華擷取、人工智慧、深度學習、電腦視覺、行為辨識、特徵挖掘、Transformer、BERT |
| Keywords (English) | Baseball; Highlight Extraction; Artificial Intelligence; Deep Learning; Computer Vision; Behavior Recognition; Feature Extraction; Transformer; BERT |
| Keywords (third language) | |
| Subject classification | |
| Abstract (Chinese) | Baseball has been rooted in Taiwan for more than a hundred years and has long been the sport Taiwanese people are most passionate about; it is not only the island's most iconic sport but also a source of shared identity that unites Taiwanese communities. In recent years, public attention to sporting events of all kinds has surged. Beyond watching the games themselves, fans love to rewatch the highlight clips in which star players show off their skills; such clips can draw tens of millions of views and create enormous business opportunities for related industries. A baseball game often lasts several hours, yet within those hours the moments in which athletes display spectacular plays are relatively brief. In the past, extracting highlights such as strikeouts, diving catches, and home runs relied entirely on the experience of video editors, at great cost in time and labor. With advances in artificial intelligence and deep learning, and in particular the rapid progress of the Transformer in processing images, video, and natural language, this work can now be automated. This thesis applies the Self-Attention mechanism of the Transformer, with its parallel computation, to baseball game videos. Self-Attention not only mines contextual relationships in natural language; it has in recent years been shown to extract image features as well, with convolution-like capabilities for mining features along both the spatial and temporal dimensions. We therefore mine and recognize features of player behavior and automatically cut clips for the user-defined highlight events. Whereas most behavior recognition relies on visual-feature processing or on the convolution and pooling of deep CNNs or RNNs, this thesis instead uses Transformer Self-Attention to learn sentence context and to mine both global and local feature information in video, and applies transfer learning with a BERT QA model so that the model can identify the start and end points of a highlight within a video. Compared with traditional behavior recognition approaches this is more novel, and it offers a significant impact and contribution to both the sports industry and academic research. |
| Abstract (English) | Baseball has been deeply rooted in Taiwan for more than a hundred years. It has always been the most popular sport among the Taiwanese; it is not only the island's most iconic sport but also a shared identity that unites Taiwanese communities. In recent years, the spread of Internet and social media has made all kinds of sports easily accessible, and supporters like to rewatch highlight clips on multimedia platforms, which creates advertising-driven business opportunities. A baseball game often lasts several hours, yet the moments in which athletes display their finest skills are brief. In the past, extracting highlights such as strikeouts, catches, and home runs relied heavily on the experience of expert editors and consumed considerable time and labor. With the rapid advances of artificial intelligence and deep learning, and of the Transformer in particular for processing images, video, and natural language, this work can now be automated. This thesis applies the Self-Attention mechanism of the Transformer, with its parallel computation, to baseball game videos. Self-Attention not only captures contextual relevance in natural language; it has been shown in recent years to extract features from images as well, with convolution-like capabilities in both the spatial and temporal dimensions. This thesis therefore performs feature extraction and recognition on player behavior and automatically cuts clips for predefined highlight events, yielding a smart video-editing system. Most prior behavior recognition was based on visual feature processing or on the convolution and pooling of deep CNNs or RNNs for behavior classification or prediction. Different from these studies, this thesis mainly utilizes Transformer Self-Attention to learn context and to extract global and local feature information from video, and obtains the start and end points of each highlight through a BERT QA model. Compared with traditional behavior recognition methods, this approach is more novel and is expected to make a significant impact and contribution to sports-related industries and academic research. (Minimal illustrative sketches of this span-prediction pipeline and of the IoU evaluation follow this record.) |
| Abstract (third language) | |
| Table of contents |

Table of Contents VI
List of Figures VIII
List of Tables IX
Chapter 1 Introduction 1
Chapter 2 Related Work 6
2-1 Image Processing 6
2-2 Deep Learning 8
Chapter 3 Background Knowledge 11
3-1 BERT 11
3-2 Video Vision Transformer 13
3-3 BERT Question Answering 14
Chapter 4 System Architecture 16
4-1 Environment and Problem Description 16
4-1-1 Problem to Be Solved 16
4-1-2 Target 16
4-2 System Architecture 16
A. Data Collection 17
B. Data Preprocessing 18
C. Model Building and Training 20
D. Model Evaluation 24
Chapter 5 Experiment Analysis 28
5-1 Environment Settings 28
5-2 Experimental Data 28
5-3 Experimental Results 29
5-4 Future Work 36
Chapter 6 Conclusion 38
References 39

List of Figures
Figure 1: Research Purpose 2
Figure 2: Research Architecture 4
Figure 3: Word Embedding 12
Figure 4: NSP and MLM 13
Figure 5: ViT Model Architecture 14
Figure 6: BERT Question Answering 15
Figure 7: System Architecture 17
Figure 8: Data Source 18
Figure 9: Natural Language Preprocessing 19
Figure 10: Image Preprocessing 20
Figure 11: Overall Structure 21
Figure 12: Tubelet Embedding 23
Figure 13: ViViT Model Architecture 23
Figure 14: BERT Question Answering Model Architecture 24
Figure 15: IOU Example Diagram 26
Figure 16: Experimental Ratio 31
Figure 17: Confusion Matrix Experiment Results 31
Figure 18: IOU Code 33
Figure 19: IOU Experiment Results 36
Figure 20: IOU Experiment Results 37

List of Tables
Table 1: Comparison of Related Work 10
Table 2: Confusion Matrix 26
Table 3: Package of Training Environment 28
Table 4: Experimental Parameters 29
Table 5: Confusion Matrix Example 30
Table 6: IOU Experiment Results 33
| References |

[1] Z. Jiang, F. Zhang, and L. Sun, "Sports action recognition based on image processing technology and analysis of the development of sports industry pattern," Scientific Programming, vol. 2021, pp. 1-11, 2021.
[2] Y. Ke, R. Sukthankar, and M. Hebert, "Event detection in crowded videos," in IEEE 11th International Conference on Computer Vision (ICCV), Brazil, Oct. 2007.
[3] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, China, Oct. 2005.
[4] S. Yan, Y. Teng, J. S. Smith, and B. Zhang, "Driver behavior recognition based on deep convolutional neural networks," in 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), China, Aug. 2016.
[5] C. Gao, "Athlete behavior recognition technology based on Siamese-RPN tracker model," Computational Intelligence and Neuroscience, vol. 2021, Oct. 2021.
[6] J. Lu, M. Nguyen, and W. Q. Yan, "Deep learning methods for human behavior recognition," in 35th International Conference on Image and Vision Computing New Zealand (IVCNZ), New Zealand, Nov. 2020.
[7] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in International Conference on Machine Learning (ICML), pp. 6105-6114, Long Beach, CA, USA, Jun. 2019.
[8] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, Las Vegas, NV, USA, Jun. 2016.
[9] A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, Oct. 2020.
[10] G. Bertasius, H. Wang, and L. Torresani, "Is space-time attention all you need for video understanding?" in International Conference on Machine Learning (ICML), Vienna, Austria, Jul. 2021.
[11] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid, "ViViT: A video vision transformer," in IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, Canada, Oct. 2021.
[12] W. Kim, B. Son, and I. Kim, "ViLT: Vision-and-language transformer without convolution or region supervision," in International Conference on Machine Learning (ICML), Jul. 2021.
[13] J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu, "CoCa: Contrastive captioners are image-text foundation models," arXiv preprint arXiv:2205.01917, Jun. 2022.
[14] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[15] A. Hanjalic, "Adaptive extraction of highlights from a sport video based on excitement modeling," IEEE Transactions on Multimedia, vol. 7, no. 6, pp. 1114-1122, Dec. 2005.
[16] D. Chauhan, N. M. Patel, and M. Joshi, "Automatic summarization of basketball sport video," in 2nd International Conference on Next Generation Computing Technologies (NGCT), India, Oct. 2016.
| Full-text access rights | |
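
The abstracts describe a pipeline in which video is cut into spatio-temporal tokens, Transformer Self-Attention mines their global and local features, and a BERT-QA-style head marks where a highlight begins and ends. Below is a minimal PyTorch sketch of that idea, assuming a ViViT-style tubelet embedding; all module names, sizes, and the spatial mean-pooling step are illustrative assumptions, not the thesis's actual implementation.

```python
# Minimal sketch: tubelet embedding -> Transformer encoder -> QA-style
# span head scoring each temporal step as a highlight start or end.
# (Positional embeddings and training code are omitted for brevity.)
import torch
import torch.nn as nn

class HighlightSpanModel(nn.Module):
    def __init__(self, dim=256, n_heads=4, n_layers=4,
                 tubelet=(4, 16, 16), in_ch=3):
        super().__init__()
        # Tubelet embedding: a 3D convolution maps each (t x h x w)
        # spatio-temporal patch to one token (cf. ViViT).
        self.embed = nn.Conv3d(in_ch, dim, kernel_size=tubelet, stride=tubelet)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # QA-style head: one start logit and one end logit per position,
        # mirroring how BERT QA scores answer-span boundaries.
        self.span_head = nn.Linear(dim, 2)

    def forward(self, video):                        # video: (B, C, T, H, W)
        tok = self.embed(video)                      # (B, dim, T', H', W')
        b, d, t, h, w = tok.shape
        tok = tok.flatten(2).transpose(1, 2)         # (B, T'*H'*W', dim)
        enc = self.encoder(tok)                      # self-attention over all tokens
        # Average the spatial tokens so each temporal step gets one vector.
        enc = enc.reshape(b, t, h * w, d).mean(dim=2)    # (B, T', dim)
        start_logits, end_logits = self.span_head(enc).unbind(dim=-1)
        return start_logits, end_logits              # each (B, T')

model = HighlightSpanModel()
clip = torch.randn(1, 3, 16, 64, 64)                 # 16 frames of 64x64 video
start, end = model(clip)
print(start.argmax(-1), end.argmax(-1))              # predicted span, in tubelet steps
```

A real decoder would additionally constrain the end position to come after the start, exactly as BERT QA does when selecting the best (start, end) pair.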
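
The thesis scores predicted spans with IoU (see Figures 15 and 18-20 and Table 6 in the table of contents). As a companion to the sketch above, here is a minimal temporal-IoU function under the standard intersection-over-union definition; the function name and the example numbers are illustrative, not taken from the thesis.

```python
def temporal_iou(pred, gt):
    """IoU between a predicted and a ground-truth (start, end) interval."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))      # overlap length
    union = (pe - ps) + (ge - gs) - inter            # combined length
    return inter / union if union > 0 else 0.0

# A prediction one second short on each side of an 11-second highlight:
print(temporal_iou((12.0, 21.0), (11.0, 22.0)))      # 9 / 11 = 0.818...
```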