System ID | U0002-0508202509331400 |
---|---|
DOI | 10.6846/TKU_Electronic Theses & Dissertations Service202500382 |
Title (Chinese) | 多教師向量蒸餾與融合學習框架 |
Title (English) | Multi-Teacher Vector Distillation and Fusion Learning Framework |
Title (third language) | |
University | Tamkang University |
Department (Chinese) | 資訊工程學系碩士班 |
Department (English) | Department of Computer Science and Information Engineering |
Foreign degree school | |
Foreign degree college | |
Foreign degree institute | |
Academic year | 113 |
Semester | 2 |
Publication year | 114 |
Graduate student (Chinese) | 陳彥赫 |
Graduate student (English) | Yen-Ho Chen |
Student ID | 612410570 |
Degree | Master's |
Language | Traditional Chinese |
Second language | |
Oral defense date | 2025-06-06 |
Number of pages | 86 |
Defense committee |
Advisor - 張志勇 (cychang@mail.tku.edu.tw)
Co-advisor - 郭經華 (chkuo@mail.tku.edu.tw)
Committee member - 陳宗禧
Committee member - 陳裕賢 |
Keywords (Chinese) |
知識蒸餾、孿生編碼器、對比學習、多教師學習、半監督學習、跨模態表示、嵌入對齊、模型壓縮、零樣本泛化、教師-學生框架 |
Keywords (English) |
Knowledge Distillation; Twin Encoder Architecture; Contrastive Learning; Multi-Teacher Learning; Semi-Supervised Learning; Cross-Modal Representation; Embedding Alignment; Model Compression; Zero-Shot Generalization; Student-Teacher Framework |
Keywords (third language) | |
Subject classification | |
Abstract (Chinese) |
知識蒸餾在傳統上往往需要事先了解教師模型的權重,或取得對齊的訓練資料,這在使用特定模型或跨領域的現實場景中帶來很大的限制。在本論文中,我們提出 MultiVecGen,一個新穎的半監督學習框架,目標是從一組不同的教師模型(涵蓋各種模組與任務)中持續提取知識,匯聚成單一的學生模型。我們的方法引入雙編碼器架構,由資料集編碼器和教師模型編碼器組成。在第一階段,兩個編碼器以共享權重的方式訓練,將資料集標籤和教師輸出投射到共享的嵌入空間中;接著透過對比學習,系統學會對齊語意相似的標籤與輸出資料,同時推開不相關的資料。在最後階段,使用軟標籤(來自模型嵌入)和硬標籤(來自資料嵌入)對預訓練 Transformer 進行再訓練。MultiVecGen 無需存取內部權重,即可從非開源 API 實現可擴展且高效的知識傳輸,並支援跨任務泛化,非常適合企業使用。 |
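To make the two-stage encoder training described in the abstract more concrete, the following is a minimal sketch, not the thesis implementation, of a shared-weight projector and a CLIP-style symmetric contrastive loss that pulls matching (label, teacher-output) embedding pairs together and pushes mismatched pairs apart. The class name `SharedProjector`, the layer sizes, and the temperature value are illustrative assumptions.

```python
# Minimal sketch (not the thesis implementation) of shared-weight projection
# plus CLIP-style contrastive alignment between dataset-label embeddings and
# teacher-output embeddings. All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedProjector(nn.Module):
    """Both inputs pass through the same (shared-weight) MLP projector."""
    def __init__(self, in_dim: int = 768, emb_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, emb_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-length embeddings

def clip_contrastive_loss(data_emb: torch.Tensor,
                          teacher_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over the cosine-similarity matrix (CLIP-style)."""
    logits = data_emb @ teacher_emb.t() / temperature        # (B, B) similarities
    targets = torch.arange(data_emb.size(0), device=data_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    # Toy batch: pretend these are sentence embeddings of dataset labels and of
    # the corresponding teacher-model outputs (e.g., produced by an SBERT-style encoder).
    B, D = 8, 768
    labels_vec = torch.randn(B, D)
    teacher_vec = torch.randn(B, D)
    encoder = SharedProjector(D)                              # one encoder, shared for both sides
    loss = clip_contrastive_loss(encoder(labels_vec), encoder(teacher_vec))
    loss.backward()
    print(f"contrastive loss: {loss.item():.4f}")
```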
Abstract (English) |
Knowledge distillation traditionally requires prior knowledge of the teacher model's weights or access to aligned training data, which severely limits its use in real-world settings involving proprietary models or cross-domain transfer. In this paper, we propose MultiVecGen, a novel semi-supervised learning framework that continuously extracts knowledge from a set of heterogeneous teacher models, covering various modules and tasks, into a single student model. Our approach introduces a dual-encoder architecture consisting of a dataset encoder and a teacher-model encoder. In the first stage, the two encoders are trained with shared weights to project dataset labels and teacher outputs into a shared embedding space. In the second stage, contrastive learning teaches the system to align semantically similar label-output pairs while pushing apart unrelated ones. In the final stage, a pre-trained Transformer is retrained using soft labels (from the model embeddings) and hard labels (from the data embeddings). MultiVecGen enables scalable and efficient knowledge transfer from closed-source APIs without access to internal weights, and it supports cross-task generalization, making it well suited to enterprise use. |
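The final retraining stage combines soft and hard supervision. Below is a minimal sketch of a standard soft/hard knowledge-distillation objective of the kind outlined in Sections 4.5.1-4.5.3 of the table of contents; the assumption that the soft targets are similarity scores derived from teacher embeddings, as well as the weighting `alpha` and temperature `tau`, are illustrative and not taken from the thesis.

```python
# Minimal sketch of a combined soft/hard distillation loss (standard form, not
# the thesis code): soft targets come from assumed teacher-embedding similarity
# scores, hard targets from the dataset labels; alpha and tau are illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_scores: torch.Tensor,
                      hard_labels: torch.Tensor,
                      alpha: float = 0.5,
                      tau: float = 2.0) -> torch.Tensor:
    # Soft loss: KL divergence between temperature-softened distributions,
    # rescaled by tau**2 as in standard knowledge distillation.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_scores / tau, dim=-1),
        reduction="batchmean",
    ) * (tau ** 2)
    # Hard loss: ordinary cross-entropy against the dataset (hard) labels.
    hard_loss = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

if __name__ == "__main__":
    B, C = 4, 10                                   # toy batch of 4 samples, 10 classes
    student_logits = torch.randn(B, C, requires_grad=True)
    teacher_scores = torch.randn(B, C)             # e.g., similarities to class embeddings
    hard_labels = torch.randint(0, C, (B,))
    loss = distillation_loss(student_logits, teacher_scores, hard_labels)
    loss.backward()
    print(f"total KD loss: {loss.item():.4f}")
```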
Abstract (third language) | |
Table of Contents |
Table of Contents VI
List of Figures IX
List of Tables XI
Chapter 1  Introduction 1
Chapter 2  Related Work 9
2.1 Taxonomy of Knowledge Distillation 9
2.1.1 Individual Knowledge Distillation 9
2.1.2 Relational Knowledge Distillation 10
2.1.3 Vector-Based Distillation: MASCKD 11
2.2 Effects of Knowledge Distillation Design Choices 12
2.2.1 Effect of Temperature on Knowledge Distillation 12
2.2.2 Logit Normalization for Improved KD 13
2.3 Knowledge Distillation in Large Language Models 14
2.4 Positioning of MultiVecGen 14
Chapter 3  Background 17
3.1 BERT 17
3.2 Transformer 19
3.3 SBERT 21
3.4 Contrastive Learning (CLIP Architecture) 23
3.5 Knowledge Distillation 25
Chapter 4  System Design 28
4.1 General Model Workflow 28
4.2 Overall Architecture 34
4.3 Distilling Teacher Model Outputs and Converting Output Formats 35
4.4 Encoder Training 37
4.4.1 Two-Stage Training Process 39
4.4.2 Stage 1: Shared-Weight Encoder CS 39
4.4.3 Stage 2: Contrastive Learning with the CLIP Architecture 41
4.4.4 Training with Two Losses 43
4.5 Vector Knowledge Distillation 44
4.5.1 Soft Loss 45
4.5.2 Hard Loss 46
4.5.3 Total Knowledge Distillation Loss 47
4.6 Summary 47
Chapter 5  Experimental Analysis 50
5.1 Datasets 50
5.2 Model Sizes 53
5.3 Experimental Results 56
5.3.1 The ASR (Adversarial Success Rate) Metric 57
5.3.2 Distillation Performance on Image Recognition 58
5.3.3 Distillation Performance on Text Generation (GPT-2) 60
5.3.4 Adversarial Robustness across Model Parameter Scales (GPT-2) 62
5.3.5 OOD Analysis across Model Parameter Scales (GPT-2) 63
5.3.6 Distillation Performance on Text Generation (LLaMA, LLaMA 2) 64
5.3.7 Adversarial Robustness across Model Parameter Scales (LLaMA, LLaMA 2) 66
5.3.8 OOD Analysis across Model Parameter Scales (LLaMA, LLaMA 2) 68
5.3.9 Box Plots 69
5.3.10 Changes in Trainable Weights 71
5.3.11 GPT-2 Ablation Study 72
5.3.12 LLaMA Ablation Study 75
5.3.13 LLaMA 2 Ablation Study 76
Chapter 6  Conclusion 79
References 82

List of Figures
Figure 1. How enterprises have used models in the past 2
Figure 2. Usage scenario of this thesis 3
Figure 3. Handling different tasks with a single learning framework 5
Figure 4. Using a twin model to learn different answers to the same question 6
Figure 5. Different models produce different output formats that must be converted to a fixed format 7
Figure 6. BERT 18
Figure 7. Transformer 21
Figure 8. Sentence BERT 23
Figure 9. CLIP 25
Figure 10. Knowledge Distillation 27
Figure 11. How MultiVecGen is used 30
Figure 12. MultiVecGen aims to learn knowledge from vectors 31
Figure 13. How MultiVecGen extracts the correct knowledge 32
Figure 14. RGB spectrum 33
Figure 15. Vectors carry richer semantics 34
Figure 16. The MultiVecGen framework 35
Figure 17. Teacher-output distillation and format-conversion module 36
Figure 18. Encoder training module 38
Figure 19. Shared-weight encoder CS 40
Figure 20. Contrastive learning (CLIP architecture) 42
Figure 21. Vector distribution after the two losses 44
Figure 22. Vector knowledge distillation module 45
Figure 23. GPT-2 ASR trend 63
Figure 24. GPT-2 Flipkart OOD comparison 64
Figure 25. LLaMA and LLaMA 2 ASR trend 67
Figure 26. LLaMA and LLaMA 2 Flipkart OOD comparison 68
Figure 27. LLaMA and LLaMA 2 DDXPlus OOD comparison 69
Figure 28. Box plots 70
Figure 29. Hyperparameter discussion 72

List of Tables
Table 1. Comparison of related work 15
Table 2. Dataset information 53
Table 3. Parameter counts and estimated GPU consumption of each model 56
Table 4. Knowledge distillation performance on CIFAR-100 image recognition compared with other works 59
Table 5. Performance on text generation with GPT-2 1.5B as teacher, compared with other works 61
Table 6. Knowledge distillation performance with LLaMA 13B and LLaMA 2 13B as teachers 66
Table 7. Ablation study of student models at different parameter scales with GPT-2 1.5B as teacher 74
Table 8. Ablation study of the student model with LLaMA 13B as teacher 76
Table 9. Ablation study of the student model with LLaMA 2 13B as teacher 78 |
Full-text access permission |