System ID | U0002-0508202509331800 |
---|---|
DOI | 10.6846/TKU_Electronic Theses & Dissertations Service202500383 |
Thesis Title (Chinese) | 基於 RAG 的影片理解與問答系統 (A RAG-Based Video Understanding and Question Answering System) |
Thesis Title (English) | From Video to Knowledge: A RAG-Based QA System |
Thesis Title (Third Language) | |
University | Tamkang University |
Department (Chinese) | 資訊工程學系碩士班 (Master's Program, Department of Computer Science and Information Engineering) |
Department (English) | Department of Computer Science and Information Engineering |
Foreign Degree University | |
Foreign Degree College | |
Foreign Degree Institute | |
Academic Year | 113 |
Semester | 2 |
Publication Year | 114 |
Author (Chinese) | 李炯豪 |
Author (English) | Chiung-Hao Lee |
Student ID | 612410778 |
Degree | Master |
Language | Traditional Chinese |
Second Language | |
Defense Date | 2025-06-06 |
Number of Pages | 73 |
Thesis Committee | Advisor - 張志勇 (cychang@mail.tku.edu.tw); Committee Member - 陳宗禧; Committee Member - 陳裕賢; Co-Advisor - 郭經華 (chkuo@mail.tku.edu.tw) |
Keywords (Chinese) | 影片問答 (Video Question Answering); 跨模態整合 (Cross-Modal Integration); 角色識別 (Character Recognition); RAG; 語意檢索 (Semantic Retrieval); 語音文字對齊 (Speech-to-Text Alignment); 影像語意理解 (Visual Semantic Understanding) |
Keywords (English) | Video Question Answering; Cross-Modal Integration; Character Recognition; Retrieval-Augmented Generation (RAG); Semantic Retrieval; Speech-to-Text Alignment; Visual Semantic Understanding |
Keywords (Third Language) | |
Subject Classification | |
Abstract (Chinese) |
With the growing popularity of multimedia platforms, viewers' demand for understanding and querying video content continues to rise, particularly in applications such as character identification and plot-related question answering. Although mainstream language models have strong generative capabilities, they still suffer from semantic errors, character confusion, and insufficient knowledge alignment when handling video data, which is both temporally ordered and multimodal. To address this, this study proposes a video question answering system that combines semantic retrieval with multimodal feature alignment, designed around a Retrieval-Augmented Generation (RAG) architecture. The system consolidates scripts, subtitles, character information, and scene descriptions into a unified text format, and combines BART and CLIP with ArcFace for character recognition and context modeling to improve answer quality and accuracy. It also accepts natural-language questions and returns the relevant answer together with the corresponding timestamps, achieving semantic video understanding and efficient question answering. Experimental results show that the system outperforms existing baseline models in both character-context understanding and answer accuracy, demonstrating its practicality and development potential. |
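To make the character-identification step described above concrete, the following is a minimal, hypothetical Python sketch, not the thesis implementation: it matches an ArcFace-style face embedding against a small gallery of known cast members by cosine similarity. The gallery contents, the 512-dimensional random embeddings, and the 0.45 threshold are assumptions for the example; in the actual system the embeddings would come from an ArcFace model and be fused with CLIP-based scene context.

```python
# Minimal sketch: match an ArcFace-style face embedding against a cast gallery
# by cosine similarity. Gallery entries and the 0.45 threshold are illustrative
# assumptions, not values taken from the thesis.
import numpy as np

def identify_character(face_embedding: np.ndarray,
                       gallery: dict[str, np.ndarray],
                       threshold: float = 0.45) -> str:
    """Return the best-matching character name, or 'unknown' below threshold."""
    query = face_embedding / np.linalg.norm(face_embedding)
    best_name, best_score = "unknown", threshold
    for name, ref in gallery.items():
        score = float(query @ (ref / np.linalg.norm(ref)))  # cosine similarity
        if score > best_score:
            best_name, best_score = name, score
    return best_name

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in 512-d embeddings; in practice these would come from an ArcFace model.
    gallery = {"Alice": rng.normal(size=512), "Bob": rng.normal(size=512)}
    probe = gallery["Alice"] + 0.1 * rng.normal(size=512)  # noisy view of Alice
    print(identify_character(probe, gallery))  # -> "Alice"
```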
Abstract (English) |
With the widespread popularity of multimedia platforms, audience demand for understanding and querying video content is on the rise, especially in applications such as character identification and storyline-based question answering. Although current large language models possess strong generative capabilities, they still face limitations when handling temporally structured and multimodal video data. Common issues include semantic inaccuracies, confusion between characters, and inadequate alignment between contextual knowledge and visual cues. To address these challenges, this thesis proposes a video-based question answering system that integrates semantic retrieval with multimodal feature alignment using a Retrieval-Augmented Generation (RAG) framework. The system consolidates scripts, subtitles, character metadata, and scene descriptions into a unified text representation. Leveraging CLIP and ArcFace models, combined with a fine-tuned BART architecture, the system performs character recognition and contextual modeling to improve response quality and accuracy. The system also supports natural language queries and can return relevant responses along with precise timestamps, achieving effective video comprehension and answering. Experimental results demonstrate that the proposed system outperforms baseline models in both character-context understanding and question answering accuracy, showcasing its practical potential and extensibility. |
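As a hedged illustration of the retrieve-then-generate flow described in the abstract, the sketch below ranks timestamped, speaker-resolved transcript segments against a question and assembles a prompt for a generator. The toy lexical-overlap scorer stands in for a dense retriever, and the segment texts, speaker names, and prompt wording are invented for the example; the real system additionally aligns visual and character features before generation.

```python
# Minimal sketch of the retrieve-then-generate step: rank timestamped segments
# against a question and build a prompt for an LLM. The word-overlap scorer is
# a stand-in for a real sentence encoder; all data below is made up.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float   # seconds from the start of the video
    end: float
    speaker: str   # resolved character name
    text: str      # aligned subtitle / transcript line

def score(question: str, segment: Segment) -> float:
    """Toy lexical-overlap score; a real system would use dense embeddings."""
    q, s = set(question.lower().split()), set(segment.text.lower().split())
    return len(q & s) / (len(q) or 1)

def retrieve(question: str, segments: list[Segment], k: int = 3) -> list[Segment]:
    """Return the top-k segments most relevant to the question."""
    return sorted(segments, key=lambda seg: score(question, seg), reverse=True)[:k]

def build_prompt(question: str, hits: list[Segment]) -> str:
    """Format retrieved segments, with timestamps and speakers, as LLM context."""
    context = "\n".join(
        f"[{seg.start:.0f}s-{seg.end:.0f}s] {seg.speaker}: {seg.text}" for seg in hits
    )
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer and cite the timestamp."

if __name__ == "__main__":
    segments = [
        Segment(12, 18, "Alice", "I hid the letter in the piano."),
        Segment(95, 101, "Bob", "Nobody knows where the letter went."),
    ]
    hits = retrieve("where is the letter hidden", segments)
    print(build_prompt("Where did Alice hide the letter?", hits))
```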
Abstract (Third Language) | |
Table of Contents |
Contents VI
List of Figures VIII
List of Tables IX
Chapter 1 Introduction 1
  1.1 Research Background and Motivation 1
  1.2 Research Objectives 2
Chapter 2 Related Work 5
  2.1 Multimodal Character Recognition and Feature Fusion 5
  2.2 Video QA Reasoning and Temporal Modeling 6
  2.3 Visual Trustworthiness and Multimodal Alignment 7
  2.4 Scene Expansion and Graph-Structured Event Modeling 7
  2.5 Summary of Related Work 8
Chapter 3 Preliminaries 11
  3.1 Speech Recognition and Speaker Diarization 12
  3.2 SAM (Segment Anything Model) 14
  3.3 CLIP (Contrastive Language-Image Pretraining) 16
  3.4 ArcFace 17
  3.5 TimeSformer 18
  3.6 Positional Encoding 19
  3.7 Temporal Attention 21
  3.8 Graph Neural Networks (GNN) 23
  3.9 Retrieval-Augmented Generation (RAG) 25
Chapter 4 System Architecture 26
  4.1 Data Preprocessing 27
  4.2 Character Matching 28
    4.2.1 Coreference Prediction 30
    4.2.2 Speaker Restoration 32
  4.3 Temporal Alignment 35
    4.3.1 TimeSformer 37
    4.3.2 Positional Encoding 39
    4.3.3 Temporal Attention 41
  4.4 Event Graph Construction Module 43
Chapter 5 Experiments 45
  5.1 Dataset 45
  5.2 Character Recognition 47
  5.3 Coreference Resolution 51
  5.4 Pronoun Model Training 53
  5.5 Temporal Reasoning Ability 55
  5.6 Answer Generation Quality 58
  5.7 Summary of Experiments 61
Chapter 6 Conclusions and Contributions 66
References 68
List of Figures
Figure 1 Speaker diarization method 13
Figure 2 How SAM works 14
Figure 3 CLIP contrastive learning 17
Figure 4 ArcFace method 18
Figure 5 TimeSformer illustration 19
Figure 6 Positional encoding method 21
Figure 7 Temporal attention method 23
Figure 8 GNN node updates 24
Figure 9 RAG method 25
Figure 10 System architecture 27
Figure 11 Data preprocessing 28
Figure 12 Character matching 29
Figure 13 Coreference prediction 31
Figure 14 Process of restoring anonymous speakers 33
Figure 15 Temporal alignment 35
Figure 16 TimeSformer 38
Figure 17 Periodic functions 40
Figure 18 Temporal attention 42
Figure 19 Graph construction 44
Figure 20 Dataset contents 46
Figure 21 Identifying a specific character 49
Figure 22 Character recognition wave chart 50
Figure 23 Model performance comparison 51
Figure 24 Coreference resolution 3D chart 52
Figure 25 Training loss 54
Figure 26 Temporal reasoning wave chart 56
Figure 27 Multi-metric comparison 60
Figure 28 System performance line chart 64
Figure 29 System performance 3D chart 65
List of Tables
Table 1 Multimodal character recognition and feature fusion 9
Table 2 Video QA reasoning and temporal modeling 10
Table 3 Visual trustworthiness and multimodal alignment 10
Table 4 Character recognition results 48
Table 5 Coreference resolution results 53
Table 6 Temporal reasoning results 57
Table 7 Answer generation quality results 61 |
References |
Full-Text Access Rights |