| 系統識別號 | U0002-2907202517175100 |
|---|---|
| DOI | 10.6846/tku202500630 |
| 論文名稱(中文) | 我獨自蒸餾:結合 RAG 增強與 LoRA 微調開發的語言模型自監督訓練系統 |
| 論文名稱(英文) | Solo Distillation: Self-Supervised Language Model Training via RAG-Augmented Pseudo Labels and LoRA Fine-Tuning |
| 第三語言論文名稱 | |
| 校院名稱 | 淡江大學 |
| 系所名稱(中文) | 機械與機電工程學系碩士班 |
| 系所名稱(英文) | Department of Mechanical and Electro-Mechanical Engineering |
| 外國學位學校名稱 | |
| 外國學位學院名稱 | |
| 外國學位研究所名稱 | |
| 學年度 | 113 |
| 學期 | 2 |
| 出版年 | 114 |
| 研究生(中文) | 吳岱霖 |
| 研究生(英文) | TAI-LIN WU |
| 學號 | 612370139 |
| 學位類別 | 碩士 |
| 語言別 | 繁體中文 |
| 第二語言別 | |
| 口試日期 | 2025-07-04 |
| 論文頁數 | 61頁 |
| 口試委員 | 指導教授 - 王銀添(090488@o365.tku.edu.tw)；口試委員 - 許閔傑；口試委員 - 吳志清 |
| 關鍵字(中) | 語言模型、自蒸餾、參數高效微調、RAG、免標註訓練 |
| 關鍵字(英) | language model; self-distillation; parameter-efficient fine-tuning; retrieval-augmented generation; label-free training |
| 第三語言關鍵字 | |
| 學科別分類 | |
| 中文摘要 |
本論文提出「我獨自蒸餾 (Solo Distillation)」，一種結合檢索增強生成 (Retrieval-Augmented Generation, RAG)、自我蒸餾 (Self-Distillation) 與 LoRA (Low-Rank Adaptation) 微調技術的語言模型自監督訓練框架。此框架旨在解決大型語言模型面臨的知識過時、微調成本高昂以及缺乏有效無監督學習方式等挑戰。研究的核心方法是利用具備 RAG 能力的教師模型，從外部知識庫檢索資訊以生成高品質的偽標籤 (Pseudo-labels)。接著，學生模型在不依賴人工標註的前提下，透過監督式微調 (Supervised Fine-Tuning, SFT) 學習教師模型的輸出，並僅針對 LoRA 層進行參數更新，以實現參數高效的自我蒸餾過程。教師與學生模型共享相同的基礎模型架構，透過切換 LoRA 權重來區分其角色。本研究選用 QASC (Question Answering via Sentence Composition) 資料集進行實驗驗證。此研究證實了該自監督訓練框架的可行性與有效性：在無需人工標註的條件下，透過 RAG 增強的偽標籤與參數高效的 LoRA 微調，能夠成功地將知識從教師模型遷移至學生模型，為低資源條件下的語言模型優化提供了一種可擴展且高效的解決方案。 |
| 英文摘要 |
This thesis proposes “Solo Distillation,” a self-supervised training framework for language models that integrates Retrieval-Augmented Generation (RAG), self-distillation, and LoRA (Low-Rank Adaptation) fine-tuning. The framework addresses three key challenges faced by large language models (LLMs): knowledge obsolescence, high fine-tuning costs, and the lack of effective unsupervised learning methods. The core idea is to employ a teacher model equipped with RAG capabilities to retrieve information from an external knowledge base and generate high-quality pseudo-labels. Without any human annotations, a student model then learns from the teacher’s outputs through Supervised Fine-Tuning (SFT) while updating only the LoRA layers, enabling a parameter-efficient self-distillation process. The teacher and student share the same underlying model architecture and switch roles simply by loading different LoRA weights. Experiments are conducted on the QASC (Question Answering via Sentence Composition) dataset. Results confirm the feasibility and effectiveness of the proposed self-supervised framework: by combining RAG-enhanced pseudo-labels with parameter-efficient LoRA fine-tuning, knowledge can be transferred from the teacher to the student model without manual labeling. This provides a scalable and efficient solution for optimizing LLMs under low-resource conditions. |
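As a reading aid, the Python snippet below sketches the training loop the abstract describes, assuming Hugging Face transformers and peft. It is a minimal illustration, not the thesis implementation: the base-model identifier, LoRA hyperparameters, prompt format, and the `retrieve()` helper are placeholder assumptions; only the overall flow (a RAG-equipped teacher generates pseudo-labels, and the same base model learns them as a student by updating only its LoRA layers) follows the abstract.

```python
# Minimal sketch of the "Solo Distillation" loop described above: a RAG-equipped
# teacher writes pseudo-labels and the same base model learns them as a student,
# updating only its LoRA layers. All names, hyperparameters, and the retrieve()
# helper are illustrative assumptions, not the thesis's actual settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE_ID = "your-base-model"  # placeholder; teacher and student share this base
device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID).to(device)

# Student role: freeze the base weights and train only the injected LoRA layers.
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
student = get_peft_model(base, lora_cfg)
optimizer = torch.optim.AdamW(
    (p for p in student.parameters() if p.requires_grad), lr=2e-4)

def teacher_pseudo_label(question, retrieve, k=3):
    """Teacher role: retrieve supporting facts, then generate a pseudo-label.
    Here the teacher is simply the base model with adapters disabled; a separate
    teacher adapter could instead be loaded via load_adapter()/set_adapter()."""
    context = "\n".join(retrieve(question, k=k))  # retrieve() = hypothetical RAG lookup
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    inputs = tok(prompt, return_tensors="pt").to(device)
    with student.disable_adapter(), torch.no_grad():
        out = student.generate(**inputs, max_new_tokens=64)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def student_sft_step(question, pseudo_label):
    """Student role: ordinary SFT on a (question, pseudo-label) pair;
    gradients flow only into the LoRA parameters."""
    batch = tok(f"Question: {question}\nAnswer: {pseudo_label}",
                return_tensors="pt").to(device)
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Iterating `teacher_pseudo_label` over unlabeled questions and feeding each result to `student_sft_step` realizes the label-free teacher-to-student transfer described above; under this sketch, swapping in a different set of LoRA weights is all that is needed to change which role the shared base model plays.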
| 第三語言摘要 | |
| 論文目次 |
目錄
目錄 III
圖目錄 V
表目錄 VI
第1章 序論 1
1.1 研究動機 1
1.2 研究目的 2
1.3 研究範圍 3
1.4 論文貢獻 4
1.5 論文架構 5
第2章 文獻探討 6
2.1 自蒸餾 6
2.2 參數高效微調技術 LoRA (Low-Rank Adaptation) 7
2.3 RAG (Retrieval-Augmented Generation) 9
2.4 技術整合：自蒸餾、微調與 RAG 的融合策略 10
2.4.1 LoRA 與蒸餾方式的結合與比較 10
2.4.2 RAG 與知識蒸餾結合之應用現況 11
第3章 研究方法 13
3.1 研究架構 14
3.2 模型設定 15
3.3 訓練流程 16
第4章 資料集處理及應用 20
4.1 資料來源概述 20
4.2 原始資料預處理 21
4.3 向量資料庫建構 22
4.4 向量資料庫檢索流程與儲存偽標籤 24
4.5 資料切分與版本控制 25
4.6 品質檢查與過濾 26
第5章 實驗數據 27
5.1 實驗設定 27
5.1.1 資料集設定 27
5.1.2 模型設定 27
5.1.3 實作細節 28
5.2 評估方式 30
5.2.1 定量分析方法 31
5.2.2 定性分析方法 33
5.3 實驗結果與定量分析 34
5.4 定性分析與錯誤探討 41
5.5 綜合討論 50
第6章 結果與討論 53
6.1 研究成果 53
6.2 未來研究方向 53
參考文獻 56

圖目錄
圖1.1 能力提升示意圖 3
圖1.2 運用不同模組來建構區分教師-學生 5
圖2.1 SDFT流程圖 7
圖2.2 (a)、(b)、(c)分別為全參微調、LoRA、DoRA在注意力層中的架構圖 9
圖2.3 LLM-NEO做法 11
圖2.4 KARD 流程圖 12
圖3.1 各模型資料示意圖 13
圖3.2 流程圖 14
圖3.3 教師模型示意圖 15
圖3.4 學生模型示意圖 15
圖3.5 Pseudo label產生過程 17
圖3.6 LoRA訓練流程 18
圖4.1 向量資料庫製作流程圖 23
圖4.2 原資料集與新產生的資料集用UMAP降維圖 26
圖5.1 各方式訓練時使用之VRAM 30
圖5.2 資料集Train_data準確率 35
圖5.3 資料集Test_data準確率 37
圖5.4 資料集Train_data各模型推論顯存使用量 38
圖5.5 運行整體流程使用的VRAM 39
圖5.6 資料集Test_data各模型推論顯存使用量 40
圖5.7 錯誤分析流程圖 46
圖5.8 失誤類型占比 50
圖6.1 一段式訓練架構 54

表目錄
表3.1 Pseudocode 19
表4.1 資料集範例 21
表4.2 資料集使用方式 22
表4.3 節錄自長段落資料集 23
表4.4 實驗用資料集 25
表5.1 LoRA層參數 28
表5.2 模型介紹 29
表5.3 性能比較定量分析資料集Train_data 34
表5.4 性能比較定量分析資料集Test_data 36
表5.5 資源比較定量分析資料集Train_data 38
表5.6 資源比較定量分析資料集Test_data 40
表5.7 定性分析內容 42
表5.8 定性分析 44
表5.9 ID 3DL65MZB8DEXDSG44TVUAV620P2CED回答比較 47
表5.10 ID 3FK0YFF9PZFAEC8QQ0F90RIDKNWVV3回答比較 48
表5.11 ID 3QECW5O0KH0E3QPMFEXHVB0TAG8T5C回答比較 49
表5.12 失誤類型占比 49
表5.13 LoRA、DoRA橫向比對 51
表6.1 Solo Distillation相較於以往方式的優勢 53 |
| 參考文獻 |
[1] OpenAI, "ChatGPT: Optimizing language models for dialogue," 2022.
[2] H. Touvron et al., "LLaMA: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023. [Online]. Available: https://arxiv.org/abs/2302.13971
[3] A. Vaswani et al., "Attention is all you need," in Proc. Advances in Neural Information Processing Systems (NeurIPS), 2017, pp. 5998–6008.
[4] L. Ouyang et al., "Training language models to follow instructions with human feedback," in Proc. Advances in Neural Information Processing Systems (NeurIPS), 2022.
[5] N. Chan et al., "A theoretical and empirical exploration of reinforcement learning from AI feedback," arXiv preprint arXiv:2309.00668, 2023. [Online]. Available: https://arxiv.org/abs/2309.00668
[6] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "LoRA: Low-rank adaptation of large language models," in Proc. Int. Conf. on Learning Representations (ICLR), 2022. [Online]. Available: https://arxiv.org/abs/2106.09685
[7] P. Lewis et al., "Retrieval-augmented generation for knowledge-intensive NLP tasks," in Proc. Advances in Neural Information Processing Systems (NeurIPS), 2020, pp. 9459–9474.
[8] B. Heo, J. Kim, S. Yun, H. Park, N. Kwak, and J. Y. Choi, "Be your own teacher: Improve the performance of convolutional neural networks via self distillation," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2019, pp. 3713–3722. [Online]. Available: https://arxiv.org/abs/1905.08094
[9] W. A. Khot, A. Sabharwal, and P. Clark, "QASC: A dataset for question answering via sentence composition," in Proc. AAAI Conf. on Artificial Intelligence, 2020.
[10] Hugging Face, "Hugging Face – The AI community building the future." [Online]. Available: https://huggingface.co
[11] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015. [Online]. Available: https://arxiv.org/abs/1503.02531
[12] D.-H. Lee, "Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks," in Proc. ICML Workshop on Challenges in Representation Learning, 2013.
[13] J. Kirkpatrick et al., "Overcoming catastrophic forgetting in neural networks," Proc. Natl. Acad. Sci. U.S.A., vol. 114, no. 13, pp. 3521–3526, 2017. [Online]. Available: https://arxiv.org/abs/1612.00796
[14] Q. Wang, H. B. Sailor, T. Liu, and A. T. Aw, "Contextual paralinguistic data creation for multi-modal speech-LLM: Data condensation and spoken QA generation," arXiv preprint arXiv:2505.13338, May 2025. [Online]. Available: https://arxiv.org/abs/2505.13338
[15] T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar, "Born-again neural networks," in Proc. Int. Conf. on Machine Learning (ICML), 2018. [Online]. Available: https://arxiv.org/abs/1805.04770
[16] Y. Yang, L. Kong, S. Hou, Z. Zhou, L. Cheng, and J. Liu, "Self-distillation fine-tuning: Aligning pretrained language models without human labels," in Proc. Annual Meeting of the Association for Computational Linguistics (ACL), 2024. [Online]. Available: https://arxiv.org/abs/2402.03960
[17] Y. Wang et al., "Self-instruct: Aligning language models with self-generated instructions," in Proc. Annual Meeting of the Association for Computational Linguistics (ACL), 2023. [Online]. Available: https://arxiv.org/abs/2212.10560
[18] Y. Fu, H. Wang, Z. Zhang, L. Feng, and Y. Song, "A comprehensive survey on parameter-efficient fine-tuning of pre-trained language models," arXiv preprint arXiv:2303.15647, 2023. [Online]. Available: https://arxiv.org/abs/2303.15647
[19] Y. Gao, J. Lin, H. He, Z. Han, X. Ma, and J. Tang, "DoRA: Weight-decomposed low-rank adaptation for efficient fine-tuning of language models," arXiv preprint arXiv:2402.09353, 2024. [Online]. Available: https://arxiv.org/abs/2402.09353
[20] Q. Li, S. Geng, T. Sun, S. Liu, H. Chen, J. Yang, and Z. Wang, "LoftQ: LoRA-fine-tuning-aware quantization for large language models," arXiv preprint arXiv:2310.08659, 2023. [Online]. Available: https://arxiv.org/abs/2310.08659
[21] Y. Gao, X. Ma, J. Lin, P. He, and J. Tang, "Retrieval-augmented generation: A survey," arXiv preprint arXiv:2312.10997, 2023. [Online]. Available: https://arxiv.org/abs/2312.10997
[22] R. Azimi, A. Wang, K. Yang, S. Yuan, and A. Roy-Chowdhury, "KD-LoRA: A hybrid approach to efficient fine-tuning using knowledge distillation and low-rank adaptation," arXiv preprint arXiv:2410.20777, 2024. [Online]. Available: https://arxiv.org/abs/2410.20777
[23] Y. Yang, Y. Sun, R. Menon, A. Singh, and M. Kankanhalli, "LLM-NEO: Parameter efficient knowledge distillation for large language models," arXiv preprint arXiv:2403.06545, 2024. [Online]. Available: https://arxiv.org/abs/2403.06545
[24] S. Yang, L. Kong, S. Hou, L. Cheng, Z. Zhou, and J. Liu, "NutePrune: Efficient progressive pruning with numerous teachers for large language models," arXiv preprint arXiv:2402.09773, 2024. [Online]. Available: https://arxiv.org/abs/2402.09773
[25] Y. Kang, H. Hu, S. Liu, H. Liu, H. Chen, and J. Li, "KARD: Knowledge-augmented reasoning distillation," in Proc. Advances in Neural Information Processing Systems (NeurIPS), 2023. [Online]. Available: https://openreview.net/forum?id=8nWyS23A0Z
[26] Y. Zhang, S. Tang, D. Zhang, H. Sun, Z. Wang, and S. Li, "ReAugKD: Retrieval-augmented knowledge distillation for pretrained language models," in Proc. Annual Meeting of the Association for Computational Linguistics (ACL), 2023, pp. 3209–3222. [Online]. Available: https://aclanthology.org/2023.acl-long.210
[27] J. Wei et al., "Chain-of-thought prompting elicits reasoning in large language models," arXiv preprint arXiv:2201.11903, 2022. [Online]. Available: https://arxiv.org/abs/2201.11903
[28] Gemma Team et al., "Gemma 3 technical report," arXiv preprint arXiv:2503.19786, Mar. 2025. [Online]. Available: https://arxiv.org/abs/2503.19786
[29] W.-L. Chiang et al., "Chatbot arena: An open platform for evaluating LLMs by human preference," arXiv preprint arXiv:2403.04132, Mar. 2024. [Online]. Available: https://arxiv.org/abs/2403.04132
[30] LMArena, "Chatbot Arena Leaderboard." [Online]. Available: https://lmarena.ai/ (Accessed: May 27, 2025).
[31] Gemini Team et al., "Gemini: A family of highly capable multimodal models," arXiv preprint arXiv:2312.11805, 2023. [Online]. Available: https://arxiv.org/abs/2312.11805
[32] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using siamese BERT-networks," in Proc. Conf. on Empirical Methods in Natural Language Processing (EMNLP), 2019, pp. 3982–3992.
[33] J. Liu, "LlamaIndex," 2022. [Online]. Available: https://github.com/jerryjliu/llama_index
[34] L. McInnes, J. Healy, and J. Melville, "UMAP: Uniform manifold approximation and projection for dimension reduction," arXiv preprint arXiv:1802.03426, 2018. [Online]. Available: https://arxiv.org/abs/1802.03426
[35] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Text Summarization Branches Out, Barcelona, Spain, 2004, pp. 74–81.
[36] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, "BERTScore: Evaluating text generation with BERT," in Proc. Int. Conf. on Learning Representations (ICLR), 2020.
[37] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. Conf. of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019, pp. 4171–4186.
[38] X. Li, F. Yin, Z. Sun, X. Li, A. Wang, X. Li, and A. Chen, "Entity-relation extraction as multi-turn question answering," in Proc. Annual Meeting of the Association for Computational Linguistics (ACL), 2019.
[39] A. Asai et al., "One question answering model for many languages with cross-lingual dense passage retrieval," in Proc. Advances in Neural Information Processing Systems (NeurIPS), 2021.
[40] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan, "Tree of thoughts: Deliberate problem solving with large language models," arXiv preprint arXiv:2305.10601, 2023. [Online]. Available: https://arxiv.org/abs/2305.10601
[41] Y. Zuo, K. Zhang, L. Sheng, Y. Chen, Y. Liu, and J. Wang, "TTRL: Test-time reinforcement learning," arXiv preprint arXiv:2504.16084, 2025. [Online]. Available: https://arxiv.org/abs/2504.16084
[42] S. O'Brien and M. Lewis, "Decoding is an art: A survey of decoding methods for large language models," arXiv preprint arXiv:2402.06925, 2024. [Online]. Available: https://arxiv.org/abs/2402.06925
[43] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, "A learning algorithm for Boltzmann machines," Cognitive Science, vol. 9, no. 1, pp. 147–169, 1985.
[44] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, "The curious case of neural text degeneration," in Proc. Int. Conf. on Learning Representations (ICLR), 2020.
[45] A. Fan, M. Lewis, and Y. Dauphin, "Hierarchical neural story generation," in Proc. Annual Meeting of the Association for Computational Linguistics (ACL), 2018.
[46] J. Shlens, "Notes on Kullback-Leibler divergence and likelihood theory," arXiv preprint arXiv:1404.2000, 2014.
[47] 吳岱融, "餐廳顧客評論分析：以 LDA 和 k-mean 為基礎的主題與情感分析," 碩士論文, 資訊管理系研究所, 國立中正大學, 嘉義縣, 臺灣, 2023. [Online]. Available: https://hdl.handle.net/11296/acqs3p
[48] 周晁揚, "用於 LEO 之透射陣列單元與透射陣列設計," 碩士論文, 電信工程學研究所, 國立臺灣大學, 臺北市, 臺灣, 2022. [Online]. Available: https://hdl.handle.net/11296/q4wj7t
[49] 蕭兆翔, "應用 AI 視覺模型於監控系統之大型室內空間的行人定位," 碩士論文, 機械與機電工程學系碩士班, 淡江大學, 新北市, 臺灣, 2022. [Online]. Available: https://hdl.handle.net/11296/j8f825
[50] 李厚誼, "基於 FPGA 車牌辨識 AI 模型之開發," 碩士論文, 機械與機電工程學系碩士班, 淡江大學, 新北市, 臺灣, 2024. [Online]. Available: https://hdl.handle.net/11296/qb994k |
| 論文全文使用權限 | |