| System ID | U0002-2606202522081900 |
|---|---|
| DOI | 10.6846/tku202500412 |
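| Thesis title (Chinese) | 基於文本與流程圖的多模態 RAG 系統開發與研究 |
| Thesis title (English) | Development and Research of a Multi-Modal RAG System Based on Text and Flowcharts |
| Thesis title (third language) | |
| University | Tamkang University (淡江大學) |
| Department (Chinese) | 資訊工程學系碩士在職專班 (In-service Master's Program, Department of Computer Science and Information Engineering) |
| Department (English) | Department of Computer Science and Information Engineering |
| Foreign degree university | |
| Foreign degree college | |
| Foreign degree graduate institute | |
| Academic year | 113 (ROC calendar, 2024–2025) |
| Semester | 2 |
| Publication year | 114 (ROC calendar, 2025) |
| Author (Chinese) | 陳姵瑜 |
| Author (English) | Pei-Yu Chen |
| Student ID | 710410043 |
| Degree | Master's |
| Language | Traditional Chinese |
| Second language | |
| Oral defense date | 2025-06-06 |
| Number of pages | 55 |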
| Oral defense committee | Committee member: 陳裕賢 (yschen@mail.ntpu.edu.tw); Committee member: 陳宗禧 (chents@mail.nutn.edu.tw); Advisor: 張峯誠 (135170@mail.tku.edu.tw); Co-advisor: 張志勇 (cychang@mail.tku.edu.tw) |
| Keywords (Chinese) | 多模態問答系統; 檢索增強生成; 流程圖理解; YOLOv8物件偵測; YOLOv8實例分割; 語意檢索; RAGAS 評估指標 |
| Keywords (English) | Multi-modal Question Answering System; Retrieval-Augmented Generation (RAG); Flowchart Understanding; YOLOv8 Object Detection; YOLOv8 Instance Segmentation; Semantic Retrieval; RAGAS Evaluation Metrics |
| Keywords (third language) | |
| Subject classification | |
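| Abstract (translated from Chinese) | This study aims to develop a multi-modal semantic question answering system that combines textual and visual (flowchart) information. Built around the Retrieval-Augmented Generation (RAG) architecture, it integrates YOLOv8 Object Detection and Instance Segmentation models to support semantic retrieval and answer generation over multi-modal information in professional documents. Visual block recognition and semantic conversion are applied to the Ministry of Health and Welfare's PDF document "Standard Inspection Methods for Infectious Diseases" (《傳染病標準檢驗方法》), converting nodes, flows, and text descriptions into logically ordered natural-language narratives, from which a cross-modal semantic index is built. The system retrieves data with three strategies (vector retrieval, BM25 best-field retrieval, and vector best-field retrieval) and generates responses with the GPT-4o model. The experiments cover multiple combinations of data types and retrieval strategies and are evaluated quantitatively with RAGAS metrics. The results show that the system achieves good semantic understanding and question answering quality in a multi-modal setting, providing an empirical basis for applying structured visual information in question answering systems. |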
| Abstract (English) | This study develops a multi-modal semantic question answering system that integrates textual and visual (flowchart) information within the Retrieval-Augmented Generation (RAG) framework. The system uses YOLOv8 Object Detection and Instance Segmentation models to identify structural elements in professional documents, and semantic transformation modules to convert flowchart nodes and logic into natural language descriptions. Using the official PDF of "Standard Inspection Methods for Infectious Diseases" from Taiwan CDC as the data source, the system extracts multi-modal content and builds a unified semantic index. Three retrieval strategies (vector-based retrieval, BM25 best-field retrieval, and vector best-field retrieval) are implemented to retrieve relevant segments, which are then passed to the GPT-4o model for answer generation. Experiments cover various combinations of data types and retrieval strategies, and answer quality is evaluated quantitatively with RAGAS metrics. The results indicate that the proposed system performs well in multi-modal settings, offering a practical foundation for integrating structured visual information into question answering systems. |
| Abstract (third language) | |
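| Table of contents |
Contents
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background and Motivation
1.2 Research Objectives
1.3 Research Contributions
1.4 Thesis Organization
Chapter 2 Related Work
2.1 Flowchart Understanding and Image-to-Semantics Conversion Techniques
2.2 RAG Question Answering Systems
2.3 RAG Retrieval Strategies
2.4 Limitations and Comparison of Existing Methods
Chapter 3 Background
3.1 YOLOv8 Object Detection
3.2 YOLOv8 Instance Segmentation
3.3 RAG
3.4 RAG Retrieval Strategies
Chapter 4 System Design
4.1 System Application Architecture
4.2 Research Framework
4.3 Data Preparation and Vision Model Training
4.3.1 Data Sources and Question-Answer Pair Construction
4.3.2 Training the Page Block Recognition Model
4.3.3 Training the Flowchart Detail Recognition Model
4.4 RAG System Design and Evaluation
4.4.1 Data Conversion Module
4.4.2 Semantic Retrieval and Answer Generation
4.4.3 Question Answering Quality Evaluation
Chapter 5 Experimental Analysis
5.1 Performance Analysis of the YOLOv8 Object Detection Page Block Recognition Model
5.2 Performance Analysis of the YOLOv8 Instance Segmentation Flowchart Detail Recognition Model
5.3 Performance Evaluation of the Multi-Modal Question Answering System
5.3.1 Experimental Group Design and Variable Planning
5.3.2 Environment Setup
5.3.3 Experimental Results for Text Data
5.3.4 Experimental Results for Table Data
5.3.5 Experimental Results for Flowchart Data
Chapter 6 Conclusion
6.1 Research Summary
6.2 Research Limitations
6.3 Future Work
References
List of Figures
Figure 1. YOLOv8 Object Detection architecture (source: [11])
Figure 2. YOLOv8 Instance Segmentation architecture (source: [12])
Figure 3. Workflow of a typical RAG system
Figure 4. System application architecture diagram
Figure 5. Research framework
Figure 6. Question-answer pair data table
Figure 7. Question-answer pair data table
Figure 8. Drawing bounding boxes and labeling classes on the Roboflow platform
Figure 9. Configuring the data augmentation strategy on the Roboflow platform
Figure 10. Instance segmentation annotation of flowchart components on the Roboflow platform
Figure 11. Configuring the data augmentation strategy on the Roboflow platform
Figure 12. Example of the main data format of the text vectorization index (a1)
Figure 13. Flowchart identification and conversion
Figure 14. Example of the main data format of the multi-modal semantic conversion index (a2)
Figure 15. Structure of the request sent to the GPT-4o model
Figure 16. Table structure for stored records
Figure 17. Structure of one exported JSON record
Figure 18. Cloud architecture diagram
Figure 19. Loss and metric curves during training
Figure 20. Precision-Recall performance
Figure 21. Normalized confusion matrix results
Figure 22. Loss and metric curves during training of the YOLOv8 Instance Segmentation model
Figure 23. F1 score versus confidence threshold
Figure 24. Precision-Recall curves for each class
Figure 25. Normalized confusion matrix on the validation set
Figure 26. Docker Compose architecture
List of Tables
Table 1. Comparison of related work
Table 2. Experimental groups and variable configuration
Table 3. Experimental results for text data
Table 4. Experimental results for table data
Table 5. Experimental results for flowchart data |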
| References |
[1] M. Barochiya, P. Makhijani, H. N. Patel, P. Goel, and B. Patel, "Evaluating RAG Pipeline in Multimodal LLM-based Question Answering Systems," in Proc. 2024 3rd Int. Conf. on Automation, Computing and Renewable Systems (ICACRS), Pudukkottai, India, 2024, pp. 69–75, doi: 10.1109/ICACRS62842.2024.10841620.
[2] C. Su, J. Wen, J. Kang, Y. Wang, Y. Su, H. Pan, Z. Zhong, and M. S. Hossain, "Hybrid RAG-Empowered Multi-Modal LLM for Secure Data Management in Internet of Medical Things: A Diffusion-Based Contract Approach," IEEE Internet of Things Journal, early access, 2024, doi: 10.1109/JIOT.2024.3521425.
[3] L. Sun, H. Du, and T. Hou, "FR-DETR: End-to-End Flowchart Recognition With Precision and Robustness," IEEE Access, vol. 10, pp. 64292–64301, 2022, doi: 10.1109/ACCESS.2022.3183068.
[4] H. Pan, Q. Zhang, C. Caragea, E. Dragut, and L. J. Latecki, "FlowLearn: Evaluating Large Vision-Language Models on Flowchart Understanding," arXiv preprint arXiv:2407.05183, 2024. [Online]. Available: https://arxiv.org/abs/2407.05183
[5] A. Arbaz, H. Fan, J. Ding, M. Qiu, and Y. Feng, "GenFlowchart: Parsing and Understanding Flowchart Using Generative AI," in Proc. 17th Int. Conf. Knowledge Science, Engineering and Management (KSEM), Birmingham, UK, Aug. 16–18, 2024, pp. 99–111, Springer, doi: 10.1007/978-981-97-5492-2_8.
[6] B. Saha and U. Saha, "Enhancing international graduate student experience through AI-driven support systems: A LLM and RAG-based approach," in Proc. 2024 Int. Conf. Data Sci. Appl. (ICoDSA), Kuta, Bali, Indonesia, 2024, pp. 300–304, doi: 10.1109/ICoDSA62899.2024.10651944.
[7] S. Bag, A. Gupta, R. Kaushik, and C. Jain, "RAG Beyond Text: Enhancing Image Retrieval in RAG Systems," in Proc. 2024 Int. Conf. on Electrical, Computer and Energy Technologies (ICECET), Sydney, Australia, 2024, pp. 1–6, doi: 10.1109/ICECET61485.2024.10698598.
[8] S. P. Timsina, J. Lockart, S. Amar, M. I. Shortt, D. Deb, and E. R. Dunkel, "Extended Abstract: Simplifying Accessibility to NASA's Planetary Data System Using LLMs and Retrieval-Augmented Generation Techniques," in Proc. SoutheastCon 2025, 2025, pp. 1322–1323, doi: 10.1109/SoutheastCon56624.2025.10971470.
[9] K. Sawarkar, A. Mangal, and S. R. Solanki, "Blended RAG: Improving RAG (Retriever-Augmented Generation) accuracy with semantic search and hybrid query-based retrievers," in Proc. 2024 IEEE 7th Int. Conf. Multimedia Inf. Process. Retrieval (MIPR), San Jose, CA, USA, 2024, pp. 155–161, doi: 10.1109/MIPR62202.2024.00031.
[10] Elastic, "ELSER: Elastic Learned Sparse Encoder," Elastic Machine Learning Documentation, 2024. [Online]. Available: https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html. [Accessed: May 25, 2025].
[11] W. Wang, G. Hou, H. Zhou, Z. Zhang, W. Li, and P. Liu, "Defect detection method for power communication fibre optic cables based on improved YOLOv8," in Proc. 2024 IEEE 7th Int. Conf. Autom., Electron. Electr. Eng. (AUTEEE), Shenyang, China, 2024, pp. 77–81, doi: 10.1109/AUTEEE62881.2024.10869688.
[12] R. Bai, M. Wang, Z. Zhang, J. Lu, and F. Shen, "Automated Construction Site Monitoring Based on Improved YOLOv8-seg Instance Segmentation Algorithm," IEEE Access, vol. 11, pp. 139082–139096, 2023, doi: 10.1109/ACCESS.2023.3340895.
[13] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, "Retrieval-augmented generation for knowledge-intensive NLP tasks," arXiv preprint arXiv:2005.11401, 2020. [Online]. Available: https://arxiv.org/abs/2005.11401
[14] BME, "SAMU Diagrams Large Dataset," Roboflow. [Online]. Available: https://universe.roboflow.com/bme-fek2c/samu_diagrams_large. [Accessed: May 9, 2025].
[15] Microsoft, "OCR for images (v4.0) - Azure AI services," Microsoft Learn. [Online]. Available: https://learn.microsoft.com/en-us/azure/ai-services/computer-vision/concept-ocr. [Accessed: May 17, 2025].
[16] infinilabs, "analysis-ik: The IK Analysis plugin for Elasticsearch," GitHub repository, 2023. [Online]. Available: https://github.com/infinilabs/analysis-ik
[17] Exploding Gradients, "RAGAS: Evaluation framework for retrieval-augmented generation," GitHub. [Online]. Available: https://github.com/explodinggradients/ragas. [Accessed: May 20, 2025].
[18] Taiwan Centers for Disease Control, Ministry of Health and Welfare, "Manual of Standard Inspection Methods for Infectious Diseases" (《傳染病標準檢驗方法手冊》), ver. 1131113, Taipei: Taiwan Centers for Disease Control, 2024. [Online]. Available: https://www.cdc.gov.tw/Category/Page/WV_GRwCIYrWQsEVa8ctTWg |
| Full-text access rights |