§ Thesis Bibliographic Record
System ID U0002-2002202312522900
DOI 10.6846/TKU.2023.00097
Title (Chinese) 設計與實作基於Transformer之AI技術以實現自動提取及剪輯精彩運動賽事
Title (English) Design and Implementation of Transformer-Based AI Technologies for Automatically Extracting Wonderful Clips from Sporting Videos
Title (Third Language)
Institution Tamkang University
Department (Chinese) 資訊工程學系全英語碩士班
Department (English) Master's Program, Department of Computer Science and Information Engineering (English-taught program)
Foreign Degree School
Foreign Degree College
Foreign Degree Institute
Academic Year 111 (2022-2023)
Semester 1
Publication Year 112 (2023)
Author (Chinese) 李欣樺
Author (English) Hsin-Hua Lee
Student ID 610780024
Degree Master's
Language English
Second Language
Defense Date 2023-01-06
Pages 41
Committee Advisor - 張志勇 (cychang@mail.tku.edu.tw)
Committee Member - 廖文華
Committee Member - 武士戎
Keywords (Chinese) 棒球
精華擷取
人工智慧
深度學習
電腦視覺
行為辨識
特徵挖掘
Transformer
BERT
Keywords (English) Baseball
Highlight Extraction
Artificial Intelligence
Deep Learning
Computer Vision
Behavior Recognition
Feature Extraction
Transformer
BERT
Keywords (Third Language)
Subject Classification
Abstract (Chinese)
Baseball has taken root in Taiwan for more than a century and has long been the sporting event Taiwanese people are most passionate about; it is not only the island's most iconic sport but also a source of shared identity across Taiwanese society. In recent years, public attention to sporting events of every kind has surged. Beyond watching the games themselves, fans love to rewatch the highlight clips in which star players show off their skills; such clips can draw tens of millions of views and open up vast business opportunities for related industries.
A baseball game often lasts several hours, yet within those hours the moments in which players display spectacular skills are comparatively brief. In the past, highlights such as strikeouts, diving catches, and home runs could only be cut by editors relying on experience, at a heavy cost in time and labor. With the progress of technology, the leaps made by artificial intelligence and deep learning, and in particular the rapid advances of the Transformer in processing images, video, and natural language, this thesis applies the Transformer's self-attention mechanism, computed in parallel, to full baseball game videos. Self-attention not only mines contextual relationships in natural language; in recent years it has also been shown to mine image features, with convolution-like capabilities that capture features along both the spatial and temporal dimensions. We therefore mine and recognize features of player behavior and automatically cut clips for the self-defined highlight classes.
Whereas most behavior recognition rests on handcrafted visual-feature processing or on the convolution and pooling scans of deep CNNs or RNNs, this thesis differs from previous studies in that it mainly uses Transformer self-attention to learn the context of sentences and to mine both global and local feature information in images, and it applies transfer learning through a BERT QA model so that the model can identify the start and end points of a highlight within a video. Compared with traditional behavior recognition approaches, this is more novel and innovative, and it brings significant impact and contributions to both sports-related industries and academic development.
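Per the list of figures and tables below (Figure 15, Figure 18, Table 6), the thesis scores predicted highlight spans against ground truth with IOU. The following is a minimal illustrative sketch of temporal IoU over (start, end) spans; the function name and the second-based units are assumptions of this note, not the thesis's actual code (cf. Figure 18):

    def temporal_iou(pred, truth):
        """Temporal IoU between two (start, end) spans, e.g. in seconds.

        Illustrative sketch only; the thesis's own implementation
        (Figure 18) may differ in its details.
        """
        inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
        union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
        return inter / union if union > 0 else 0.0

    # Example: predicted highlight 12s-30s vs. labeled highlight 15s-33s
    print(temporal_iou((12.0, 30.0), (15.0, 33.0)))  # ~0.714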
Abstract (English)
Baseball has been deeply rooted in Taiwan for more than a hundred years. It has long been the most popular sport among Taiwanese people; it is not only the island's most iconic sport but also a source of shared identity for the Taiwanese community. In recent years, the spread of the Internet and social media has made all kinds of sports easily accessible. Fans tend to rewatch highlight clips of star players on multimedia platforms, and these clips create advertising-driven business opportunities.
Baseball games often last for several hours, yet the moments in which athletes display their spectacular skills are brief. In the past, extracting highlights such as strikeouts, catches, and home runs relied heavily on the experience of expert editors and consumed a great deal of time and labor. With the rapid advancement of artificial intelligence and deep learning, especially of the Transformer in processing images, videos, and natural language, this task can now be automated. This thesis applies the self-attention mechanism of the Transformer, which can be computed in parallel, to baseball game videos. Self-attention not only captures contextual relevance in natural language; in recent years it has also been shown to extract features from images, with convolution-like capabilities that capture features along both the spatial and temporal dimensions. This thesis therefore performs feature extraction and recognition on player behavior and automatically cuts clips for the predefined highlight classes, forming a smart video editing system.
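To make the idea concrete, here is a minimal PyTorch sketch of self-attention over video in the spirit of ViViT (Section 3-2): the clip is split into spatio-temporal patches ("tubelets"), embedded with a 3D convolution, and passed to a Transformer encoder so every tubelet attends to every other across space and time. All sizes, names, and the use of torch.nn.TransformerEncoder are illustrative assumptions, not the thesis's actual architecture:

    import torch
    import torch.nn as nn

    class TubeletEmbedding(nn.Module):
        # Project each spatio-temporal tubelet to a token, as in ViViT.
        # Kernel/stride sizes are illustrative assumptions.
        def __init__(self, dim=256, t=2, p=16):
            super().__init__()
            self.proj = nn.Conv3d(3, dim, kernel_size=(t, p, p), stride=(t, p, p))

        def forward(self, video):                 # (B, 3, T, H, W)
            x = self.proj(video)                  # (B, dim, T/t, H/p, W/p)
            return x.flatten(2).transpose(1, 2)   # (B, num_tubelets, dim)

    # Self-attention lets every tubelet attend to every other, capturing
    # global context in both the spatial and the temporal dimension.
    embed = TubeletEmbedding()
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
        num_layers=4,
    )
    clip = torch.randn(1, 3, 16, 224, 224)        # 16 RGB frames
    tokens = encoder(embed(clip))                 # (1, 1568, 256) contextual features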
Most prior behavior recognition is based on handcrafted visual features or on the convolution and pooling operations of deep CNNs or RNNs for behavior classification or prediction. Different from these studies, this thesis mainly utilizes the Transformer self-attention mechanism to learn context and to extract global and local feature information from images, and obtains the start and end points of each highlight through a BERT QA model. Compared with traditional behavior recognition methods, this approach is more novel and innovative, and it is expected to make a significant impact and contribution to sports-related industries and academic development.
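The BERT QA formulation can likewise be sketched: just as BERT QA scores each text token as a possible start or end of the answer span, a linear head over per-segment video features can score each segment as the start or end of a highlight. The sketch below assumes one feature vector per video segment; the head, names, and dimensions are illustrative assumptions, not the thesis's exact model:

    import torch
    import torch.nn as nn

    class SpanHead(nn.Module):
        # QA-style span head: two logits per token (start, end), as in BERT QA.
        def __init__(self, dim=256):
            super().__init__()
            self.qa = nn.Linear(dim, 2)

        def forward(self, feats):                 # (B, num_segments, dim)
            start, end = self.qa(feats).split(1, dim=-1)
            return start.squeeze(-1), end.squeeze(-1)

    head = SpanHead()
    feats = torch.randn(1, 120, 256)              # features for 120 video segments
    s_logits, e_logits = head(feats)
    start = int(s_logits.argmax(dim=-1))          # predicted first highlight segment
    end = int(e_logits.argmax(dim=-1))            # predicted last highlight segment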
Abstract (Third Language)
Table of Contents
Table of Contents	VI
List of Figures	VIII
List of Tables	IX
Chapter 1 Introduction	1
Chapter 2 Related Work	6
2-1 Image Processing	6
2-2 Deep Learning	8
Chapter 3 Background Knowledge	11
3-1 BERT	11
3-2 Video Vision Transformer	13
3-3 BERT Question Answering	14
Chapter 4 System Architecture	16
4-1 Environment and Problem Description	16
4-1-1 Problem to be solved	16
4-1-2 Target	16
4-2 System Architecture	16
A.	Data Collection	17
B.	Data Preprocessing	18
C.	Model Building and Training	20
D.	Model Evaluation	24
Chapter 5 Experiment Analysis	28
5-1 Environment Settings	28
5-2 Experimental Data	28
5-3 Experimental Results	29
5-4 Future Work	36
Chapter 6 Conclusion	38
References	39
 
List of Figures
Figure 1: Research Purpose	2
Figure 2: Research Architecture	4
Figure 3: Word Embedding	12
Figure 4: NSP and MLM	13
Figure 5: ViT Model Architecture	14
Figure 6: BERT Question Answering	15
Figure 7: System Architecture	17
Figure 8: Data Source	18
Figure 9: Natural Language Preprocessing	19
Figure 10: Image Preprocessing	20
Figure 11: Overall Structure	21
Figure 12: Tubelet Embedding	23
Figure 13: ViViT Model Architecture	23
Figure 14: BERT Question Answering Model Architecture	24
Figure 15: IOU Example Diagram	26
Figure 16: Experimental Ratio	31
Figure 17: Confusion Matrix Experiment Results	31
Figure 18: IOU Code	33
Figure 19: IOU Experiment Results	36
Figure 20: IOU Experiment Results	37
 
List of Tables
Table 1: Comparison of Related Work	10
Table 2: Confusion Matrix	26
Table 3: Packages of the Training Environment	28
Table 4: Experimental Parameters	29
Table 5: Confusion Matrix Example	30
Table 6: IOU Experiment Results	33
References
[1]	Z. Jiang, F. Zhang, and L. Sun, "Sports action recognition based on image processing technology and analysis of the development of sports industry pattern," Scientific Programming, vol. 2021, pp. 1-11, 2021.
[2]	Y. Ke, R. Sukthankar, and M. Hebert, "Event detection in crowded videos," in 2007 IEEE 11th International Conference on Computer Vision (ICCV), Brazil, Oct. 2007.
[3]	P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, China, Oct. 2005.
[4]	S. Yan, Y. Teng, J. S. Smith, and B. Zhang, "Driver behavior recognition based on deep convolutional neural networks," in 2016 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), China, Aug. 2016.
[5]	C. Gao, "Athlete behavior recognition technology based on Siamese-RPN tracker model," Computational Intelligence and Neuroscience, vol. 2021, Oct. 2021.
[6]	J. Lu, M. Nguyen, and W. Q. Yan, "Deep learning methods for human behavior recognition," in 2020 35th International Conference on Image and Vision Computing New Zealand (IVCNZ), New Zealand, Nov. 2020.
[7]	M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in International Conference on Machine Learning (ICML), pp. 6105-6114, Long Beach, CA, USA, Jun. 2019.
[8]	K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, Las Vegas, NV, USA, Jun. 2016.
[9]	A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, Oct. 2020.
[10]	G. Bertasius, H. Wang, and L. Torresani, "Is space-time attention all you need for video understanding?" in International Conference on Machine Learning (ICML), Vienna, Austria, Jul. 2021.
[11]	A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid, "ViViT: A video vision transformer," in IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, Canada, Oct. 2021.
[12]	W. Kim, B. Son, and I. Kim, "ViLT: Vision-and-language transformer without convolution or region supervision," in International Conference on Machine Learning (ICML), Jul. 2021.
[13]	J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu, "CoCa: Contrastive captioners are image-text foundation models," arXiv preprint arXiv:2205.01917, Jun. 2022.
[14]	J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[15]	A. Hanjalic, "Adaptive extraction of highlights from a sport video based on excitement modeling," IEEE Transactions on Multimedia, vol. 7, no. 6, pp. 1114-1122, Dec. 2005.
[16]	D. Chauhan, N. M. Patel, and M. Joshi, "Automatic summarization of basketball sport video," in 2016 2nd International Conference on Next Generation Computing Technologies (NGCT), India, Oct. 2016.
Full-Text Usage Authorization
National Central Library
Gratis authorization to the National Central Library: not granted
On campus
Print copy: available on campus immediately
Electronic full text: authorization not granted
Bibliographic record: available on campus immediately
Off campus
Authorization to database vendors: not granted
Bibliographic record: available off campus immediately
