| System ID | U0002-0508202515315200 |
|---|---|
| DOI | 10.6846/tku202500665 |
| Thesis Title (Chinese) | 手動實作教學影片之問答系統 |
| Thesis Title (English) | An Intelligent Question-Answering System for Hands-on Instructional Videos |
| Thesis Title (Third Language) | |
| University | Tamkang University |
| Department (Chinese) | 資訊工程學系全英語碩士班 |
| Department (English) | Master's Program, Department of Computer Science and Information Engineering (English-taught program) |
| Foreign Degree School | |
| Foreign Degree College | |
| Foreign Degree Institute | |
| Academic Year | 113 (ROC calendar) |
| Semester | 2 |
| Year of Publication | 114 (ROC calendar; 2025 CE) |
| Author (Chinese) | 許淵智 |
| Author (English) | Yuan-Chih Hsu |
| Student ID | 613780039 |
| Degree | Master's |
| Language | English |
| Second Language | |
| Date of Oral Defense | 2025-06-26 |
| Number of Pages | 104 |
| Committee | Advisor: 張志勇 (cychang@mail.tku.edu.tw); Committee Member: 陳宗禧; Committee Member: 陳裕賢; Co-advisor: 郭經華 (chkuo@mail.tku.edu.tw) |
| Keywords (Chinese) | 教學影片理解、多模態分析、物件識別、語意槽位填充、RAG 檢索、CLIP對比學習、DAG語意結構、問答系統 |
| Keywords (English) | Instructional Video Understanding; Multimodal Analysis; Object Recognition; Semantic Slot Filling; Retrieval-Augmented Generation (RAG); CLIP-based Contrastive Learning; DAG-based Semantic Structure; Question Answering System |
| Keywords (Third Language) | |
| Subject Classification | |
| Abstract (Chinese) |
在數位學習快速發展的時代,教學影片已成為學生自學的重要資源,特別是在實驗操作與機台應用等實務導向領域。然而,使用者往往需要花費大量時間反覆觀看影片,只為尋找一個片段的操作說明,這樣的學習效率極低。本研究針對此痛點,提出一套「手動實作教學影片之問答系統」,讓使用者能以自然語言提問,並直接獲得影片中對應的操作片段,提升理解速度與學習成效。
本系統核心技術結合了多模態資料處理流程,包含語音轉文字、圖像分割、關鍵幀萃取、語意標註,以及大型語言模型(如ChatGPT)進行語意理解。透過 CLIP、SAM 與超分辨率模型增強畫面辨識能力,並構建有向無環圖(DAG)結合時間與步驟邏輯,最後藉由RAG技術找出TOP1片段,輸出對應且連貫的影片解答片段。
本系統可應用於線上實驗課程、產業技能訓練、手術操作教學等場景,大幅減少學習者尋找資訊的時間,並提高對動態操作的理解精準度。整合語音、文字與影像的檢索架構,是本研究的一大創新。實驗結果亦顯示,本系統在關鍵片段命中率、模型表現穩定度等指標上,均優於傳統文字型 RAG 系統,展現出高度的實用價值與未來發展潛力。
|
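Both abstracts describe building a directed acyclic graph (DAG) that combines the temporal order of video segments with step-level (procedural) dependencies. The snippet below is a minimal sketch of that idea under stated assumptions, not the thesis's implementation: it relies on the `networkx` library, and the step names, timestamps, and dependency edges are invented purely for illustration.

```python
# Minimal sketch: organize annotated video segments into a DAG that mixes
# temporal edges (playback order) with procedural edges (step dependencies).
# All step data below is hypothetical.
import networkx as nx

# (step_id, start_sec, end_sec, description) -- placeholder annotations
steps = [
    ("s1", 0,   35,  "power on the machine"),
    ("s2", 35,  80,  "calibrate the sensor"),
    ("s3", 80,  140, "load the sample"),
    ("s4", 140, 200, "start the measurement"),
]

dag = nx.DiGraph()
for step_id, start, end, desc in steps:
    dag.add_node(step_id, start=start, end=end, text=desc)

# Temporal edges link consecutive segments in playback order.
for (a, *_), (b, *_) in zip(steps, steps[1:]):
    dag.add_edge(a, b, relation="temporal")

# A procedural edge encodes a dependency beyond simple adjacency,
# e.g. calibration must happen before the measurement starts.
dag.add_edge("s2", "s4", relation="procedural")

assert nx.is_directed_acyclic_graph(dag)      # the graph must stay acyclic
print(list(nx.topological_sort(dag)))         # ['s1', 's2', 's3', 's4']
```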
| Abstract (English) |
In an era of rapidly developing digital learning, instructional videos have become an important resource for independent study, especially in practice-oriented fields such as laboratory experiments and machine operation. However, users often spend a great deal of time re-watching videos just to locate the instructions for a single operation, which makes learning highly inefficient. To address this pain point, this study proposes an intelligent question-answering system for hands-on instructional videos that allows users to ask questions in natural language and directly obtain the corresponding operation clips from the video, thereby improving comprehension speed and learning effectiveness. The core of the system is a multimodal data-processing pipeline that includes speech-to-text conversion, image segmentation, keyframe extraction, semantic annotation, and semantic understanding via large language models (such as ChatGPT). It leverages CLIP, SAM, and super-resolution models to enhance visual recognition, and constructs a directed acyclic graph (DAG) to capture temporal and procedural logic. Finally, Retrieval-Augmented Generation (RAG) is employed to identify the top-1 segment, producing a coherent and contextually relevant video answer clip. The system can be applied to scenarios such as online laboratory courses, industrial skills training, and surgical procedure instruction, greatly reducing the time learners spend searching for information and improving the accuracy of their understanding of dynamic operations. A retrieval architecture that integrates speech, text, and images is a major innovation of this study. Experimental results further show that the system outperforms traditional text-based RAG systems in metrics such as key-segment hit rate and temporal reasoning accuracy, demonstrating high practical value and strong potential for future development. |
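As a concrete illustration of the retrieval step described above (question in, top-1 segment out), here is a small sketch under stated assumptions. It uses an off-the-shelf CLIP text encoder from the `sentence-transformers` package as a stand-in for the contrastively fine-tuned encoder the thesis describes, and the segment descriptions, timestamps, and `answer` helper are hypothetical.

```python
# Minimal sketch: embed per-segment text descriptions and a user question
# with a CLIP text encoder, then return the top-1 segment by cosine similarity.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("clip-ViT-B-32")   # CLIP text tower via sentence-transformers

# Hypothetical fused descriptions (transcript + detected objects + slot values).
segments = [
    {"start": 0,   "end": 35,  "text": "press the red power button on the control panel"},
    {"start": 35,  "end": 80,  "text": "turn the calibration knob until the gauge reads zero"},
    {"start": 80,  "end": 140, "text": "place the sample tray into the loading slot"},
]

segment_embeddings = encoder.encode([s["text"] for s in segments], convert_to_tensor=True)

def answer(question: str) -> dict:
    """Return the top-1 matching segment for a natural-language question."""
    query_embedding = encoder.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, segment_embeddings)[0]
    return segments[int(scores.argmax())]

print(answer("How do I calibrate the machine?"))
# -> the 35–80 s segment, which the system would then play back as the answer clip
```

In the system described by the abstract, the encoder is additionally trained with contrastive learning over semantic slots and the candidate segments come from the DAG, so this stand-alone similarity search only approximates the retrieval behavior.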
| Abstract (Third Language) | |
| Table of Contents |
Table of Contents VI
List of Figures VIII
List of Tables IX
Chapter 1 Introduction 1
Chapter 2 Related Work 8
2.1 Unimodal and Early Multimodal Video Question Answering Methods 8
2.2 Models Supporting Object Semantics and Operational Reasoning 13
2.3 Temporal Modeling and Graph Reasoning Enhanced Systems 16
2.4 Overview and Comparative Analysis 20
Chapter 3 Background Knowledge 22
3.1 SAM 22
3.2 CLIP 24
3.3 JIEBA 26
3.4 RAG 27
3.5 DAG 30
3.6 GPT 31
Chapter 4 System Design 35
4.1 Overall System Architecture 35
4.2 Data Collection and Preprocessing 37
4.3 Multimodal Annotation 39
4.4 Multimodal Semantic Fusion 43
4.4.1 External Knowledge Extraction and Semantic Expansion 45
4.4.2 Definition of Semantic Slots and Filling Strategy 46
4.4.3 Vectorized Semantic Encoding and Contrastive Learning 47
4.4.4 Edge Relation Construction of the Semantic Graph 50
4.5 DAG Construction and Paragraph-Level Semantic Integration 52
4.6 User Question Answering and Retrieval Process 56
Chapter 5 Experimental Analysis 61
5.1 Datasets 61
5.2 Environment and System Configuration 63
5.3 Experimental Results 63
5.3.1 Temporal Segment Localization Accuracy Analysis (IoU Evaluation) 65
5.3.2 Model Stability Analysis (Multiple Runs on the Tutorial VQA Dataset) 68
5.3.3 Model Stability Analysis (Multiple Runs on the COIN Dataset) 70
5.3.4 Model Stability Analysis (Multiple Runs on the IOTQA) 71
5.3.5 Ablation Study Analysis 73
5.3.6 Semantic Slot Combination Analysis 77
5.3.7 Semantic Vector Visualization Analysis 79
5.3.8 Analysis of the Impact of Question-Answer Pair Length 81
5.3.9 Overall Trends in Accuracy and Stability 83
5.3.10 Core Module Contribution Analysis in Multi-Task Scenarios (COIN) 85
5.3.11 Comparative Analysis of Modules (Tutorial VQA) 88
5.3.12 Comparative Analysis of Modules (IOTQA) 91
Chapter 6 Conclusion 95
References 97
List of Figures
Figure 1. Architecture of the Segment Anything Model (SAM) 23
Figure 2. Architecture of CLIP 26
Figure 3. Architecture of RAG 30
Figure 4. Architecture of GPT 34
Figure 5. Overall System Architecture 37
Figure 6. Data Preprocessing 39
Figure 7. Multimodal Annotation 40
Figure 8. Component Label Database Construction 42
Figure 9. Object Occurrence Frequency 43
Figure 10. Multimodal Semantic Fusion 44
Figure 11. Contrastive Learning Training 48
Figure 12. Preliminary Construction of the Semantic Graph 51
Figure 13. Correction Phase 52
Figure 14. DAG Refinement 55
Figure 15. Paragraph-level Semantic Graph 56
Figure 16. Retrieval Phase 57
Figure 17. Retrieval Workflow 60
Figure 18. QA Performance for Different Slot Combinations 78
Figure 19. QA Performance for Different Slot Combinations (Alternative) 80
Figure 20. Model Accuracy for Short vs. Long QA Pairs 82
Figure 21. Cross-Dataset Trendlines Across Metrics 84
Figure 22. Radar Chart of Multimodal Module Contributions (COIN) 86
Figure 23. Radar Chart of Multimodal Module Contributions (Tutorial VQA) 89
Figure 24. Radar Chart of Multimodal Module Contributions (IOTQA) 92
List of Tables
Table 1. Comparative Summary of Related Work 21
Table 2. Experimental Environment of the Proposed System 63
Table 3. Confusion Matrix Table 64
Table 4. Segment Overlap Evaluation Results of Different Models on Datasets 65
Table 5. Performance Comparison of Different Models on the Tutorial VQA 68
Table 6. Performance Comparison of Different Models on the COIN Dataset 70
Table 7. Performance Comparison of Different Models on the IOTQA Dataset 72
Table 8. Ablation Study Results on the COIN Dataset 74
Table 9. Ablation Study Results on the Tutorial VQA Dataset 75
Table 10. Ablation Study Results on the IOTQA Dataset 76
|
| Full-Text Access Permissions | |