| System ID | U0002-0309202321423100 |
|---|---|
| DOI | 10.6846/tku202300630 |
| Title (Chinese) | 基於影像、光流、及骨架之深度學習模型自動剪輯棒球撲接精彩影片 |
| Title (English) | Automated Baseball Diving Catch Highlights Clip with Image, Optical Flow, and Skeleton by Deep Learning Models |
| Title (third language) | |
| University | Tamkang University (淡江大學) |
| Department (Chinese) | 資訊工程學系全英語碩士班 |
| Department (English) | Master's Program, Department of Computer Science and Information Engineering (English-taught program) |
| Foreign degree school name | |
| Foreign degree college name | |
| Foreign degree institute name | |
| Academic year | 111 (2022-2023) |
| Semester | 2 |
| Year of publication | 112 (2023) |
| Student name (Chinese) | 李世邦 |
| Student name (English) | Shih-Pang Lee |
| Student ID | 611780023 |
| Degree type | Master |
| Language | Traditional Chinese |
| Second language | English |
| Oral defense date | 2023-06-09 |
| Number of pages | 70 |
| Committee | Advisor: 張志勇 (Chih-Yung Chang, cychang@mail.tku.edu.tw); Committee member: 陳裕賢; Committee member: 陳宗禧; Co-advisor: 郭經華 (chkuo@mail.tku.edu.tw) |
| Keywords (Chinese) | 棒球、自動剪輯、精華片段分析、SRT、自然語言、OpenPose骨架、光流、比對模型、孿生模型 |
| Keywords (English) | baseball, auto clipping, highlight analysis, SRT, NLP, OpenPose skeleton, optical flow, comparison model, Siamese model |
| Keywords (third language) | |
| Discipline classification | |
| Abstract (Chinese) |
Thesis title: Automated Baseball Diving Catch Highlights Clip with Image, Optical Flow, and Skeleton by Deep Learning Models. Pages: 70. Department: Master's Program, Department of Computer Science and Information Engineering (English-taught program), Tamkang University. Graduation: second semester, academic year 111; master's thesis abstract. Student: 李世邦 (Shih-Pang Lee). Advisor: Dr. 張志勇 (Chih-Yung Chang).

Main content: In Taiwan, baseball has always been the sport people are most passionate about. It not only enjoys great popularity but is also an important symbol that unites Taiwanese identity and values. With the rise of online live streaming, public attention to all kinds of sporting events keeps climbing; people are not only keen to watch games on site but also love to rewatch the brilliant moments delivered by star players. These highlights receive hundreds of millions of replays on online platforms, which in turn brings enormous business opportunities to related industries.

A game often lasts several hours, yet the athletes' spectacular plays pass in an instant. Professional editors must rely on experience and time to cut out highlight moments such as diving catches, home runs, and strikeouts. With the progress of deep learning for video processing and the maturing of models such as BERT, this thesis uses three-hour baseball game videos and three main techniques, natural language processing, OpenPose skeleton coordinates, and optical flow, to automatically extract the potential diving-catch segments from the full game video; we call these "candidate diving catch clips". We then apply a sliding window to all candidate clips: following our proposed algorithm, each candidate clip is cut by the sliding window into finer segments, and the segmented results are fed into our comparison model and Siamese model so that the deep neural networks can identify which segment is most similar to a diving catch. To account for runtime efficiency, we additionally design two identification modes, "simple and fast diving-catch comparison" and "fine-grained diving-catch comparison". In the simple and fast mode, we use the 3DCNN of the Siamese comparison model for identification; in the fine-grained mode, we run all classifiers to identify the video globally.

According to our experimental results, we test the proposed methods and models on several baseball datasets, including MLB and ELTA (愛爾達) baseball datasets, and evaluate the models with Precision, Accuracy, and Recall. One of our indicators is the AUC threshold. For the same threshold value, every metric improves as the amount of data increases, and the improvement of our approach becomes significant once the threshold exceeds 0.6. For Recall, when the AUC threshold is above 0.6 and the number of negative samples grows, Recall rises with the threshold while the overall value decreases. The skeleton and natural language analysis methods we propose are clearly positively correlated with the amount of data. Besides evaluating on datasets alone, we also compare performance and accuracy with models designed for baseball or video analysis in earlier studies, such as I3D and SEA. When compared with the I3D model, both our comparison model and our deep classification model reach an accuracy of roughly 90%, a significant improvement in both performance and accuracy over previous models.

Keywords: baseball, automated clipping, highlight analysis, SRT, natural language, OpenPose skeleton, optical flow, comparison model, Siamese model |
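The abstract describes cutting each candidate diving-catch clip with a sliding window into finer segments before they are sent to the comparison models. The exact window parameters are not given here, so the following is only a minimal sketch of that step, assuming a hypothetical window length of 64 frames and stride of 16 frames and using OpenCV for frame extraction.

```python
# Minimal sketch of the sliding-window cutting step described in the abstract.
# Window length and stride are illustrative assumptions, not the thesis's values.
import cv2


def cut_candidate_clip(video_path, window_len=64, stride=16):
    """Split a candidate diving-catch clip into overlapping fixed-length segments."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()

    # Slide a fixed-length window over the clip; each window becomes one
    # finer-grained segment that is later scored by the comparison models.
    segments = []
    for start in range(0, max(len(frames) - window_len + 1, 1), stride):
        segments.append(frames[start:start + window_len])
    return segments


if __name__ == "__main__":
    segments = cut_candidate_clip("candidate_clip.mp4")
    print(f"{len(segments)} segments of up to 64 frames each")
```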
| Abstract (English) |
Title of Thesis: Automated Baseball Diving Catch Highlights Clip with Image, Optical Flow, and Skeleton by Deep Learning Models. Total pages: 70. Keywords: baseball, automated video clipping, highlight analysis, SRT, natural language processing, OpenPose skeleton, optical flow, comparison model, Siamese model. Name of Institute: Master's Program, Department of Computer Science and Information Engineering, Tamkang University (English-taught Program). Graduation date: June 2023. Degree conferred: Master. Name of student: Shih-Pang Lee (李世邦). Advisor: Dr. Chih-Yung Chang (張志勇).

Abstract: In Taiwan, baseball has always been the most popular sport. It not only enjoys great popularity but also serves as an important symbol of Taiwanese identity and values. With the development of online live broadcasting, public attention to sporting events keeps rising; audiences are not only keen to watch games live but also like to rewatch the highlight moments of star players over and over. These highlights attract hundreds of millions of replays on online platforms and bring huge business opportunities to related industries.

A baseball game often lasts several hours, yet the moments in which athletes actually show their skills are fleeting. In the past, professional editors relied on experience and spent a great deal of time and money to cut out those moments, such as exciting strikeouts, flashy diving catches, and morale-boosting home runs. With the progress of deep learning in computer vision, especially in video processing, and the maturing of natural language models such as BERT, this thesis takes three-hour videos of complete baseball games and applies three techniques, natural language processing, OpenPose skeleton coordinates, and optical flow, to obtain the potential diving-catch segments that may appear in the full game video, which we call "candidate diving catch clips". A sliding window then slides over and cuts each candidate clip according to our proposed algorithm, segmenting the video more finely, and the resulting segments are sent to our comparison model and Siamese model so that the deep neural networks can identify which segment is most similar to the ground-truth diving catch. To account for running efficiency, we additionally design two identification modes, "simple and fast comparison" and "fine-grained comparison": the simple and fast mode uses only the 3DCNN of the Siamese model for identification, while the fine-grained mode runs all classifiers to identify the video globally.

According to our experiments, we test the proposed methods and models on several baseball datasets, including MLB and ELTA baseball datasets, and evaluate them with Precision, Accuracy, and Recall. One of our indicators is the AUC threshold: for the same threshold value, every metric improves as the amount of data increases, and the improvement of our approach becomes significant once the threshold exceeds 0.6. For Recall, when the AUC threshold is above 0.6 and the number of negative samples grows, Recall rises with the threshold while the overall value decreases. The skeleton and natural language analysis methods we propose are clearly positively correlated with the amount of data. In addition to evaluating on the datasets alone, we also compare performance and accuracy with other models designed for baseball or video analysis in previous research, such as I3D and SEA. When compared with the I3D model, both our comparison model and our deep classification model reach an accuracy of roughly 90%; compared with previous models, both performance and accuracy are significantly improved. |
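The "simple and fast comparison" described above scores each cut segment against a reference diving-catch clip by comparing 3DCNN hidden features with cosine similarity. The thesis's own Siamese network weights and layer choices are not reproduced in this record, so the sketch below stands in a generic torchvision `r3d_18` backbone purely as an assumed feature extractor.

```python
# Sketch of the "simple and fast comparison": embed two clips with a shared
# 3D CNN and score them with cosine similarity. The r3d_18 backbone is an
# assumption; the thesis trains its own Siamese 3DCNN.
import torch
import torch.nn.functional as F
from torchvision.models.video import r3d_18


class SiameseClipComparator(torch.nn.Module):
    def __init__(self):
        super().__init__()
        backbone = r3d_18(weights=None)      # one backbone shared by both branches
        backbone.fc = torch.nn.Identity()    # keep the 512-d hidden feature
        self.backbone = backbone

    def embed(self, clip):
        # clip: (batch, channels=3, frames, height, width)
        return self.backbone(clip)

    def forward(self, clip_a, clip_b):
        # Cosine similarity between the two hidden features; values near 1
        # mean the segment closely resembles the reference diving catch.
        return F.cosine_similarity(self.embed(clip_a), self.embed(clip_b), dim=1)


if __name__ == "__main__":
    model = SiameseClipComparator().eval()
    segment = torch.randn(1, 3, 16, 112, 112)    # one cut candidate segment
    reference = torch.randn(1, 3, 16, 112, 112)  # a known diving-catch clip
    with torch.no_grad():
        print(model(segment, reference).item())
```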
| Abstract (third language) | |
| Table of contents |
List of Content

Content (p. VI)
Figure Content (p. VIII)
Table Content (p. X)
Chapter 1 - Introduction (p. 1)
Chapter 2 - Related Work (p. 14)
2-1 Temporality change but not spatiality (p. 14)
2-2 Temporality and spatiality change (p. 18)
2-3 Action recognition with specific movement (p. 23)
Chapter 3 - System Model (p. 28)
3-1 Collecting the baseball dataset (p. 29)
3-2 Natural language processing with BERT (p. 30)
3-3 Marking the human skeleton (p. 32)
3-4 Generating the human optical flow (p. 33)
Chapter 4 - System Architecture (p. 35)
I. Generating diving and non-diving catch diversity data samples (p. 36)
II. Establishment and training of deep models for classification and comparison (p. 37)
A. Design of classification models (p. 37)
B. Training of classification models (p. 39)
C. Design of the Siamese comparison model (p. 40)
D. Training of Siamese comparison models (p. 42)
III. Simple and fast comparison of diving catch (p. 44)
A. Candidate video normalization and sample cutting process (p. 44)
B. First round of hidden feature comparison (p. 46)
IV. Fine-grained comparison of diving catch (p. 47)
A. Fine-grained comparison (p. 47)
Chapter 5 - Experiment Analysis (p. 49)
5-1 Basic system environment configuration and evaluation standards (p. 49)
5-2 Datasets and experimental data (p. 52)
5-3 Experimental numerical results (p. 53)
5-4 Future development and work (p. 63)
Chapter 6 - Conclusion and Contribution (p. 65)
References (p. 67)

Figure Content

Figure 1 - Research goals to be achieved (p. 2)
Figure 2 - Flow chart of this research (p. 4)
Figure 3 - Overall system architecture diagram (p. 28)
Figure 4 - Baseball dataset sources (p. 30)
Figure 5 - Automatically generated diving catch sentences, ready for effective training of the BERT model (p. 31)
Figure 6 - Using automatically generated sentences to train the BERT model (p. 32)
Figure 7 - Schematic diagram of character skeleton markers (p. 33)
Figure 8 - Capturing the optical flow vectors of the diving catch movement (p. 34)
Figure 9 - Generating diversity samples (p. 36)
Figure 10 - Diving and non-diving RGB frames, skeletal point coordinates, and pose optical flow vectors used as input data for the classification model (p. 38)
Figure 11 - Comparison neural network (p. 40)
Figure 12 - Comparison neural network training (p. 41)
Figure 13 - Training of the LSTM skeleton comparison network (p. 44)
Figure 14 - Normalization, cutting, and segmenting of target videos (p. 45)
Figure 15 - The brief and quick comparison uses only the features of different 3DCNN layers for fast cosine-similarity comparison (p. 46)
Figure 16 - Fine-grained comparison (p. 47)
Figure 17 - Classification and comparison models working together (p. 48)
Figure 18 - Confusion matrix (p. 50)
Figure 19 - Confusion matrix diagram (p. 54)
Figure 20 - Comparison between the precision of this paper's approach and the I3D algorithm (p. 55)
Figure 21 - Comparison between the recall of this paper's approach and the I3D algorithm (p. 55)
Figure 22 - Comparison between the F1-score of this paper's approach and the I3D algorithm (p. 56)
Figure 23 - Accuracy of different datasets in this paper's approach versus the amount of data (p. 57)
Figure 24 - Recall of different datasets in this paper's approach versus the amount of data (p. 58)
Figure 25 - F1-score of different datasets in this paper's approach versus the amount of data (p. 58)
Figure 26 - Relationship between the number of generated skeletons and LSTM accuracy (p. 60)
Figure 27 - Relationship between the number of generated video texts and BERT accuracy (p. 61)
Figure 28 - Comparison of the accuracy of different datasets under this paper's approach (p. 62) |
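Chapter 5 in the outline above evaluates the models with a confusion matrix (Figure 18) and the Precision, Recall, Accuracy, and F1-score comparisons against I3D (Figures 20-22). As a reference for how those quantities relate, here is a minimal worked sketch; the counts are made-up placeholders, not results from the thesis.

```python
# How the reported metrics follow from a diving-catch confusion matrix.
# All counts below are placeholders for illustration only.

def confusion_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)                  # correct detections among all detections
    recall = tp / (tp + fn)                     # detected catches among all true diving catches
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # overall correctness
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1


if __name__ == "__main__":
    p, r, a, f1 = confusion_metrics(tp=90, fp=10, fn=12, tn=88)
    print(f"precision={p:.2f} recall={r:.2f} accuracy={a:.2f} f1={f1:.2f}")
```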
| References |
[1] B. Zhou et al., "Temporal relational reasoning in videos," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 803-818.
[2] Y. Wang et al., "Spatiotemporal pyramid network for video action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1529-1538.
[3] I. Dave et al., "TCLR: Temporal contrastive learning for video representation," Computer Vision and Image Understanding, vol. 219, art. 103406, 2022.
[4] A. J. Piergiovanni and M. S. Ryoo, "Fine-grained activity recognition in baseball videos," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 1740-1748.
[5] V. Escorcia et al., "DAPs: Deep action proposals for action understanding," in Computer Vision - ECCV 2016, Springer International Publishing, 2016, pp. 768-784.
[6] E. Fish, J. Weinbren, and A. Gilbert, "Two-stream transformer architecture for long form video understanding," in British Machine Vision Conference (BMVC), 2022.
[7] A. Ullah et al., "One-shot learning for surveillance anomaly recognition using Siamese 3D CNN," in 2020 International Joint Conference on Neural Networks (IJCNN), IEEE, 2020, pp. 1-8.
[8] L. Wang et al., "Temporal segment networks: Towards good practices for deep action recognition," in European Conference on Computer Vision (ECCV), Springer, 2016, pp. 20-36.
[9] C. Feichtenhofer, A. Pinz, and R. P. Wildes, "Spatiotemporal multiplier networks for video action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4768-4777.
[10] C. Feichtenhofer, A. Pinz, and A. Zisserman, "Convolutional two-stream network fusion for video action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1933-1941.
[11] J. K. Aggarwal and M. S. Ryoo, "Human activity analysis: A review," ACM Computing Surveys, vol. 43, pp. 16:1-16:43, Apr. 2011.
[12] H. Xu, A. Das, and K. Saenko, "R-C3D: Region convolutional 3D network for temporal activity detection," arXiv preprint arXiv:1703.07814, 2017.
[13] A. Hanjalic, "Adaptive extraction of highlights from a sport video based on excitement modeling," IEEE Transactions on Multimedia, vol. 7, no. 6, pp. 1114-1122, Dec. 2005.
[14] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the Kinetics dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[15] D. Chauhan, N. M. Patel, and M. Joshi, "Automatic summarization of basketball sport video," in 2016 2nd International Conference on Next Generation Computing Technologies (NGCT), India, Oct. 2016.
[16] E. Fenil, G. Manogaran, et al., "Real time violence detection framework for football stadium comprising of big data analysis and deep learning through bidirectional LSTM," Computer Networks, vol. 151, pp. 191-200, Mar. 2019.
[17] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2625-2634.
[18] V. Escorcia, F. C. Heilbron, J. C. Niebles, and B. Ghanem, "DAPs: Deep action proposals for action understanding," in Proceedings of the European Conference on Computer Vision (ECCV), Springer, 2016, pp. 768-784.
[19] Z. Jiang, F. Zhang, and L. Sun, "Sports action recognition based on image processing technology and analysis of the development of sports industry pattern," Scientific Programming, vol. 2021, pp. 1-11, 2021.
[20] D. Tran, J. Ray, Z. Shou, S.-F. Chang, and M. Paluri, "ConvNet architecture search for spatiotemporal feature learning," arXiv preprint arXiv:1708.05038, 2017.
[21] Y. Ke, R. Sukthankar, and M. Hebert, "Event detection in crowded videos," in 2007 IEEE 11th International Conference on Computer Vision (ICCV), Brazil, Oct. 2007.
[22] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields," in CVPR, 2017.
[23] F. Qi, X. Yang, and C. Xu, "Emotion knowledge driven video highlight detection," IEEE Transactions on Multimedia, vol. 23, pp. 3999-4013, Nov. 2020.
[24] Y. Shi, Y. Tian, et al., "Sequential deep trajectory descriptor for action recognition with three-stream CNN," IEEE Transactions on Multimedia, vol. 19, no. 7, pp. 1510-1520, Jul. 2017.
[25] A. J. Piergiovanni and M. S. Ryoo, "Recognizing actions in videos from unseen viewpoints," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, pp. 4124-4132, Jun. 2021.
[26] P. Wang et al., "RGB-D-based human motion recognition with deep learning: A survey," Computer Vision and Image Understanding, vol. 171, pp. 118-139, Jun. 2018.
[27] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles, "ActivityNet: A large-scale video benchmark for human activity understanding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 961-970.
[28] S. Xu, H. Rao, et al., "Attention-based multilevel co-occurrence graph convolutional LSTM for 3-D action recognition," IEEE Internet of Things Journal, vol. 8, no. 21, pp. 15990-16001, Nov. 2021.
[29] Z. Shou, D. Wang, and S. Chang, "Temporal action localization in untrimmed videos via multi-stage CNNs," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[30] I. Atmosukarto, B. Ghanem, and N. Ahuja, "Trajectory-based Fisher kernel representation for action recognition in videos," in 2012 21st International Conference on Pattern Recognition (ICPR), IEEE, 2012, pp. 3333-3336.
[31] I. Lillo, J. C. Niebles, and A. Soto, "A hierarchical pose-based approach to complex action understanding using dictionaries of actionlets and motion poselets," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016.
[32] G. Yu and J. Yuan, "Fast action proposals for human action detection and search," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1302-1311. |
| Full-text availability | |