§ Browse Thesis Bibliographic Record
  
System ID U0002-2208202317021900
DOI 10.6846/tku202300595
Title (Chinese) 基於影像特徵與深度學習之棒球三振影片自動剪輯
Title (English) Automatic Clipping of Baseball Strikeout Videos Based on Image Features and Deep Learning
Title (third language)
University 淡江大學 (Tamkang University)
Department (Chinese) 資訊工程學系碩士班
Department (English) Department of Computer Science and Information Engineering
Foreign degree school
Foreign degree college
Foreign degree institute
Academic year 111
Semester 2
Year of publication 112
Author (Chinese) 范緒謙
Author (English) Shiu-Chien Fan
Student ID 611410027
Degree Master's
Language Traditional Chinese
Second language
Defense date 2023-06-09
Number of pages 67
Committee Advisor - 張志勇 (cychang@mail.tku.edu.tw)
Committee member - 陳宗禧
Committee member - 陳裕賢
Co-advisor - 郭經華 (chkuo@mail.tku.edu.tw)
Keywords (Chinese) Baseball
Clip segmentation
Artificial intelligence
Deep learning
Image recognition
Natural language processing
YOLO
DeepSORT
OpenPose
LSTM
3DCNN
BERT
Keywords (English) You Only Look Once (YOLO)
Baseball
Clip segmentation
Artificial intelligence
Deep learning
Image recognition
Natural language processing
DeepSORT
OpenPose
Long Short-Term Memory (LSTM)
3D Convolutional Neural Networks (3DCNN)
BERT
Keywords (third language)
Subject classification
Chinese Abstract
Among the many sporting events in Taiwan, baseball is one of the most popular sports: it has a history of more than a hundred years and is even known as the "national sport." In recent years, with the development and spread of social media, more and more people not only watch complete live broadcasts and recordings of games but also enjoy rewatching athletes' highlight clips and compilations on social platforms. Many companies have seized on this exposure and place product marketing in these videos, creating enormous business opportunities.
  A complete baseball game often lasts several hours and contains many outstanding plays. One of the plays viewers most enjoy is the moment a pitcher strikes out a batter, yet such highlights last only a few minutes, or even a few seconds. Extracting these clips from such a long game has therefore long been a major problem for broadcasters. Today this can only be done by editors who, relying on past experience, spend a great deal of time searching hours of footage to judge which periods contain strikeout highlights and to determine the exact start and end times, at a large cost in time and labor.
  With the development of artificial intelligence, this thesis uses image recognition, object tracking, and natural language processing to design solutions tailored to the characteristics of strikeout segments in baseball games, letting the models learn the important features and realizing a system that automatically edits strikeout highlights from baseball broadcasts.
  The system first uses a YOLO model to recognize the out count on screen; the instant the count changes is taken as the end of each segment. The system then traces back 20 seconds and uses YOLO, DeepSORT, OpenPose, and LSTM models together to find the pitcher's set position before the pitch, which serves as the start of each segment. These segments are called "strikeout candidate segments," meaning segments that may contain a strikeout. Next, a YOLO model detects the "K" symbol in the frame, a 3DCNN model extracts features from the whole frame, and a BERT model performs semantic analysis of the announcer's play-by-play commentary. The results of the three models are combined with a weighted average to judge whether the input clip is a strikeout segment. Finally, the strikeout clips are selected from all candidate segments and merged into a highlight compilation, achieving the goal of automatically editing a strikeout highlight video from hours of baseball game footage.
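The following is a minimal sketch of the coarse-grain step described above: scan the broadcast frame by frame and, whenever the on-screen out count changes, cut a 20-second "strikeout candidate segment" ending at that moment. The function detect_out_count is a hypothetical stand-in for the thesis's YOLO-based out-count recognizer, not an actual interface from the work; the real system additionally refines each segment's start with DeepSORT, OpenPose, and LSTM to locate the pitcher's set position.

```python
# Sketch of candidate-segment extraction based on out-count changes.
# Assumes `detect_out_count(frame)` returns 0/1/2 when the scoreboard is
# visible and None otherwise (hypothetical helper, not from the thesis).
import cv2

LOOKBACK_SECONDS = 20  # window length stated in the abstract

def extract_candidate_segments(video_path, detect_out_count):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    segments = []          # list of (start_sec, end_sec) candidate windows
    prev_outs = None
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        outs = detect_out_count(frame)
        if outs is not None and prev_outs is not None and outs != prev_outs:
            end = frame_idx / fps                     # moment the out count changes
            segments.append((max(0.0, end - LOOKBACK_SECONDS), end))
        if outs is not None:
            prev_outs = outs
        frame_idx += 1
    cap.release()
    return segments
```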
English Abstract
Among the many sports played in Taiwan, baseball is one of the most popular: it has a history of more than a hundred years and is even referred to as the "national sport." In recent years, with the development and widespread use of social media, more and more people not only watch complete live broadcasts or recordings of games but also enjoy rewatching clips and compilations of athletes' outstanding performances on social platforms. Many companies have recognized the exposure these videos attract and use them for product placement, creating substantial business opportunities.
A full baseball game often lasts several hours and contains many impressive plays. One of viewers' favorite highlights is the moment a pitcher strikes out a batter, yet these exciting moments last only a few minutes, or even a few seconds. Extracting such clips from a lengthy game has therefore long been a major challenge for broadcasters. Today the task still falls to video editors, who must comb through hours of footage, rely on experience to locate the strikeout highlights, and spend considerable effort determining the exact start and end time of each clip, at a high cost in time and labor.
With the advancement of artificial intelligence, this thesis applies image recognition, object tracking, and natural language processing techniques to design solutions tailored to the characteristics of strikeout segments in baseball games. By having the models learn these key features, the proposed system automatically edits strikeout highlight clips from full game videos.
The proposed system first uses a YOLO model to recognize the out count shown on screen; the moment the count changes marks the end of a segment. The system then traces back 20 seconds and uses YOLO, DeepSORT, OpenPose, and LSTM models together to find the pitcher's set position before the pitch, which marks the start of the segment. These segments are called "strikeout candidate segments," i.e., segments that may contain a strikeout. Next, a YOLO model detects the "K" symbol in the frame, a 3DCNN model extracts features from the whole frame, and a BERT model performs semantic analysis of the play-by-play commentary. The outputs of the three models are combined with a weighted average to decide whether the candidate segment is a strikeout. Finally, the true strikeout segments are selected from all candidates and merged into a highlight reel, so that hours of baseball game footage are automatically condensed into a strikeout highlight video.
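Below is a minimal sketch of the fine-grain fusion described above: the three per-segment scores (YOLO "K" detection, 3DCNN visual features, BERT commentary analysis) are combined by a weighted average and compared against a threshold. The weights and threshold shown are illustrative placeholders, not values reported in the thesis.

```python
# Sketch of the weighted-average fusion over three model scores.
# Weights and threshold below are assumed placeholders for illustration.

def is_strikeout(score_k: float, score_3dcnn: float, score_bert: float,
                 weights=(0.4, 0.3, 0.3), threshold=0.5) -> bool:
    """Each score is a probability in [0, 1] that the segment is a strikeout."""
    w_k, w_v, w_t = weights
    fused = w_k * score_k + w_v * score_3dcnn + w_t * score_bert
    return fused >= threshold

# Example: a clearly detected "K", a moderately confident visual model,
# and supportive commentary keep the segment in the highlight reel.
keep = is_strikeout(score_k=0.9, score_3dcnn=0.6, score_bert=0.7)
```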
Third-language Abstract
Table of Contents
Contents
Contents  IV
List of Figures  VI
List of Tables  IX
Chapter 1: Introduction  1
Chapter 2: Related Work  8
2-1 Image Processing  8
2-2 Deep Learning  11
Chapter 3: Background Knowledge  16
3-1 Coarse-Grain: Strikeout Candidate Segment Recognition  16
3-1-1 YOLO  16
3-1-2 DeepSORT  20
3-1-3 OpenPose  21
3-1-4 LSTM  22
3-2 Candidate Segment Preprocessing  22
3-2-1 Gamma Correction  22
3-2-2 Histogram Equalization  23
3-2-3 Gaussian Denoising  23
3-2-4 SpeechRecognition  24
3-2-5 CKIP  24
3-3 Fine-Grain: Multi-Dimensional Strikeout Feature Recognition  24
3-3-1 3DCNN  24
3-3-2 BERT  25
Chapter 4: Research Methods  28
4-1 Problem Description  28
4-1-1 Scenario and Problem Description  28
4-1-2 Objectives  28
4-2 Problems and Challenges  29
4-3 System Architecture  30
Chapter 5: Experimental Analysis  50
5-1 Environment Setup  50
5-2 Experimental Data  51
5-3 Experimental Results  51
Chapter 6: Conclusion  62
References  63

 
List of Figures
Figure 1: Research objectives  2
Figure 2: Research framework  5
Figure 3: YOLOv1 network architecture (figure from [4])  17
Figure 4: Illustration of OpenPose (figure from [29])  21
Figure 5: Two-stage training of BERT  25
Figure 6: BERT word embedding (figure from [14])  27
Figure 7: System architecture overview  30
Figure 8: Flowchart for extracting the last pitch before an out  31
Figure 9: Out-count model (training phase)  32
Figure 10: Out-count model (inference phase)  33
Figure 11: Locating the pitcher (training phase)  34
Figure 12: Locating the pitcher (training phase)  34
Figure 13: Pitching-preparation posture model (training phase)  35
Figure 14: Pitching-preparation posture model (training phase)  36
Figure 15: Image preprocessing  37
Figure 16: Natural language preprocessing  38
Figure 17: Strikeout classification models  39
Figure 18: Detecting the strikeout feature "K" with the YOLO model (training phase)  40
Figure 19: Detecting the strikeout feature "K" with the YOLO model (inference phase)  40
Figure 20: 3DCNN (training phase)  42
Figure 21: 3DCNN (inference phase)  42
Figure 22: BERT model (training phase)  44
Figure 23: BERT model (inference phase)  44
Figure 24: Workflow of the combined model decision  45
Figure 25: Example of IoU usage  49
Figure 26: Confusion matrix of the pitching-preparation posture model  52
Figure 27: Confusion matrix of the out-count model  53
Figure 28: IoU under different model accuracies  54
Figure 29: IoU values for different numbers of test videos  55
Figure 30: Accuracy of the strikeout classification model  56
Figure 31: Loss curve during training  56
Figure 32: Effect of dataset size and frame rate on precision  57
Figure 33: Effect of dataset size and frame count on recall  58
Figure 34: Effect of dataset size and frame rate on F1-score  58
Figure 35: Effect of threshold and dataset size on precision  59
Figure 36: Effect of threshold and dataset size on recall  60
Figure 37: Effect of threshold and dataset size on F1-score  60
 
List of Tables
Table 1: Comparison of related work  15
Table 2: Notation for the candidate segment model  36
Table 3: Notation for the candidate segment filtering model  46
Table 4: Confusion matrix  47
Table 5: Environment and package versions  50
Table 6: Overall system performance overview  61
References
[1] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221-231, 2012.
[2] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[3] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid, "ViViT: A video vision transformer," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6836-6846.
[4] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779-788.
[5] M. J. Shafiee, B. Chywl, F. Li, and A. Wong, "Fast YOLO: A fast you only look once system for real-time embedded object detection in video," arXiv preprint arXiv:1709.05943, 2017.
[6] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[7] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," arXiv preprint arXiv:2004.10934, 2020.
[8] G. Jocher et al., "ultralytics/yolov5: v7.0 - YOLOv5 SOTA realtime instance segmentation," Zenodo, 2022.
[9] C. Li et al., "YOLOv6: A single-stage object detection framework for industrial applications," arXiv preprint arXiv:2209.02976, 2022.
[10] C. Wang, A. Bochkovskiy, and H. Liao, "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors," arXiv preprint arXiv:2207.02696, 2022.
[11] N. Wojke, A. Bewley, and D. Paulus, "Simple online and realtime tracking with a deep association metric," in 2017 IEEE International Conference on Image Processing (ICIP), 2017, pp. 3645-3649.
[12] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in European Conference on Computer Vision, 2020, pp. 213-229.
[13] A. Vaswani et al., "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[14] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[15] N. Crasto, P. Weinzaepfel, K. Alahari, and C. Schmid, "MARS: Motion-augmented RGB stream for action recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7882-7891.
[16] C. Feichtenhofer, H. Fan, J. Malik, and K. He, "SlowFast networks for video recognition," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6202-6211.
[17] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," Advances in Neural Information Processing Systems, vol. 27, 2014.
[18] L. Wang et al., "Temporal segment networks: Towards good practices for deep action recognition," in European Conference on Computer Vision, 2016, pp. 20-36.
[19] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005, pp. 65-72.
[20] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 12, pp. 2247-2253, 2007.
[21] I. Laptev, "On space-time interest points," International Journal of Computer Vision, vol. 64, pp. 107-123, 2005.
[22] A. F. Bobick and J. W. Davis, "The recognition of human movement using temporal templates," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 3, pp. 257-267, 2001.
[23] Y. Du, W. Wang, and L. Wang, "Hierarchical recurrent neural network for skeleton based action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1110-1118.
[24] J. Lin, C. Gan, and S. Han, "TSM: Temporal shift module for efficient video understanding," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7083-7093.
[25] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440-1448.
[26] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[27] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904-1916, 2015.
[28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[29] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7291-7299.
[30] W. Zaremba, I. Sutskever, and O. Vinyals, "Recurrent neural network regularization," arXiv preprint arXiv:1409.2329, 2014.
Full-Text Usage Authorization
National Central Library
The author agrees to grant the National Central Library a royalty-free license to make the bibliographic record and electronic full text publicly available on the Internet immediately after the authorization form is submitted.
On campus
The printed thesis is available on campus immediately.
The author agrees to make the electronic full text publicly available worldwide.
The electronic thesis is available on campus immediately.
Off campus
The author agrees to license the thesis to database vendors.
The electronic thesis is available off campus immediately.
