§ Browse Thesis Bibliographic Record
  
System ID U0002-1207202523255800
DOI 10.6846/tku202500562
Title (Chinese) 基於大型語音模型的模型壓縮技術應用於邊緣運算裝置
Title (English) Model Compression Techniques for Large Speech Models Applied to Edge Computing Devices
Title (Third Language)
Institution Tamkang University
Department (Chinese) 機械與機電工程學系碩士班
Department (English) Department of Mechanical and Electro-Mechanical Engineering
Foreign Degree University
Foreign Degree College
Foreign Degree Institute
Academic Year 113
Semester 2
Publication Year 114 (ROC calendar, i.e., 2025)
Author (Chinese) 陳昇德
Author (English) Sheng-De Chen
Student ID 611370338
Degree Master's
Language Traditional Chinese
Second Language
Oral Defense Date 2025-07-04
Number of Pages 48
Committee Advisor - 王銀添 (ytwang@mail.tku.edu.tw)
Committee Member - 許閔傑
Committee Member - 吳志清
Keywords (Chinese) Speech Recognition
Edge Computing
Model Quantization
Knowledge Distillation
Model Fine-tuning
Whisper
LoRA
Keywords (English) Automatic Speech Recognition
Edge Computing
Model Compression
Knowledge Distillation
Model Finetune
Whisper
LoRA
Keywords (Third Language)
Subject Classification
Chinese Abstract
OpenAI's Whisper model for automatic speech recognition adds a speech input option to AI systems alongside text and images, meeting the need for multimodal input. Deploying an automatic speech recognition model on resource-constrained edge devices, however, remains a challenging research problem. This study investigates compressing the Whisper model with model compression methods such as distillation and quantization. In addition, pruning is used to reduce the number of model layers directly, and Low-Rank Adaptation (LoRA) is used to train on a task-specific speech dataset. The compressed Whisper model is then deployed on an edge computing device to build a lightweight, enterprise-specific automatic speech recognition system. This thesis integrates Adaptive LoRA (AdaLoRA) with Post-Training Quantization (PTQ) for Whisper compression; the resulting model is called AdaLoRA Distillation PTQ Whisper (ADPWhisper). The proposed ADPWhisper model is further deployed on edge computing devices such as the Jetson AGX Orin, so that it can run within limited memory and share the device with other sub-models, enabling multimodal AI applications that improve speech recognition accuracy while reducing the required memory. The research proceeds in two phases. In the first phase, LoRA is applied, using products of low-rank matrices to reduce the number of trainable parameters while preserving key speech recognition capability; PTQ then converts the model from FP32 to INT8, compressing it substantially and improving inference efficiency with only a slight impact on recognition accuracy. In the second phase, AdaLoRA performs adaptive low-rank fine-tuning, allowing the model to be optimized for a specific domain and, particularly on small datasets, to retain the key features of the data. Knowledge distillation is then applied, with a teacher model guiding a student model so that the student absorbs speech features more effectively while the model size is reduced. Finally, PTQ shrinks the model further, yielding the most highly compressed Whisper model, with an 87.3% size reduction. The results verify the feasibility of running Whisper on edge devices and provide a concrete technical path for developing speech recognition systems with low power consumption and low memory cost. Future work will further explore multimodal AI fusion and the effect of different compression strategies on Whisper's runtime performance on edge devices, with the aim of extending the system's adaptability and range of applications.
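As a rough illustration of the Phase 1 fine-tuning step described above, the sketch below wraps a Whisper checkpoint with a LoRA adapter using the Hugging Face transformers and peft libraries listed in the references; the rank, scaling factor, and target modules shown here are illustrative assumptions rather than the thesis's exact configuration.

    from transformers import WhisperForConditionalGeneration
    from peft import LoraConfig, get_peft_model

    # Load the full-precision Whisper checkpoint (openai/whisper-large-v3 is assumed here).
    base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

    # LoRA adds trainable low-rank matrices B (d x r) and A (r x k) alongside the frozen
    # attention projections, so only r*(d+k) parameters per adapted layer are updated.
    lora_cfg = LoraConfig(
        r=16,                                 # rank of the low-rank update (assumed)
        lora_alpha=32,                        # scaling applied to the update (assumed)
        target_modules=["q_proj", "v_proj"],  # attention projections to adapt (assumed)
        lora_dropout=0.05,
    )

    model = get_peft_model(base, lora_cfg)
    model.print_trainable_parameters()  # only the adapters train; the base model stays frozen

    # peft also provides AdaLoraConfig, which reallocates the rank budget during training
    # and corresponds to the Phase 2 AdaLoRA variant described above.

Training would then proceed with an ordinary sequence-to-sequence fine-tuning loop over the domain-specific speech dataset, with only the adapter weights being updated.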
English Abstract
OpenAI's Whisper model enables multimodal AI input by supporting speech in addition to text and images. However, deploying such models on resource-constrained edge devices remains challenging. This study proposes a lightweight ASR system based on Whisper compression, integrating distillation, quantization, pruning, and Low-Rank Adaptation (LoRA). We present ADPWhisper (AdaLoRA Distillation PTQ Whisper), a compressed Whisper model designed for edge deployment. In Phase 1, LoRA is applied to reduce the number of trainable parameters while preserving key recognition ability, and the model is then converted from FP32 to INT8 with Post-Training Quantization (PTQ) for a smaller size and faster inference. In Phase 2, AdaLoRA performs adaptive low-rank fine-tuning on small domain-specific datasets, knowledge distillation transfers features from a teacher model to a student model, and a further PTQ pass yields maximum compression. The final ADPWhisper model achieves an 87.3% size reduction while maintaining recognition performance. It is deployed on the Jetson AGX Orin, demonstrating the feasibility of Whisper-based ASR on edge devices. This approach not only reduces memory and compute requirements but also enables on-device ASR for enterprise applications. Experimental results show that ADPWhisper maintains a low WER while accelerating inference, making the proposed system a practical solution for deploying large-scale ASR models on low-power hardware. Future work will explore multimodal integration and the impact of different compression strategies on real-time edge performance.
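For the post-training quantization step, a minimal sketch is given below, assuming the CTranslate2 toolkit cited in the references; the output directory, audio file, and language tokens are placeholders, not the thesis's actual deployment script.

    # Offline INT8 conversion is done once on the command line, e.g.:
    #   ct2-transformers-converter --model openai/whisper-large-v3 \
    #       --output_dir whisper-ct2-int8 --quantization int8
    import ctranslate2
    import librosa
    import transformers

    processor = transformers.WhisperProcessor.from_pretrained("openai/whisper-large-v3")
    model = ctranslate2.models.Whisper("whisper-ct2-int8", device="cuda", compute_type="int8")

    # Compute log-mel features for a 16 kHz mono clip (sample.wav is a placeholder).
    audio, _ = librosa.load("sample.wav", sr=16000, mono=True)
    inputs = processor(audio, sampling_rate=16000, return_tensors="np")
    features = ctranslate2.StorageView.from_array(inputs.input_features)

    # Prompt the decoder for Mandarin transcription without timestamps.
    prompt = processor.tokenizer.convert_tokens_to_ids(
        ["<|startoftranscript|>", "<|zh|>", "<|transcribe|>", "<|notimestamps|>"]
    )
    result = model.generate(features, [prompt])
    print(processor.decode(result[0].sequences_ids[0]))

Storing weights in INT8 rather than FP32 reduces their footprint by roughly a factor of four on its own, which is why PTQ is combined with distillation and low-rank adaptation to reach the overall compression reported above.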
Third-Language Abstract
Table of Contents
Contents
Acknowledgments	I
Abstract	II
Contents	IV
List of Figures	VI
List of Tables	VII
Chapter 1 Introduction	1
1.1 Research Motivation	1
1.2 Research Objectives	2
1.3 System	2
1.4 Research Scope	3
1.5 Contributions	4
1.6 Literature Review	5
1.6.1 Speech Recognition Literature	5
1.6.2 Edge Computing Literature	6
1.6.3 Model Compression Literature	6
1.6.4 Knowledge Distillation Literature	6
1.7 Thesis Organization	8
Chapter 2 Model Compression	9
2.1 Construction of the Innodisk Dataset	9
2.2 Model Quantization and Compression	11
2.3 OpenAI Whisper	12
2.4 Whisper Model Pruning and LoRA	13
2.5 Whisper Model PTQ	14
2.6 Phase 1 Inference Results on Edge Computing Devices	15
Chapter 3 ADPWhisper	18
3.1 Research and Implementation Plan	18
3.2 Construction of the Innodisk Dataset on Hugging Face	18
3.3 Planned Architectural Changes to Whisper	20
3.4 Whisper Knowledge Distillation Training	20
3.5 Whisper AdaLoRA with Knowledge Distillation Architecture	22
3.5.1 Whisper AdaLoRA Architecture Diagram	24
3.6 ADPWhisper Architecture Diagram	24
3.7 Phase 2 Inference Results on Edge Devices and Servers	25
3.7.1 Server Inference Results for the LoRA Model Versions	27
3.7.2 Server Inference Results for the AdaLoRA Model Versions on Female Voices	27
3.7.3 Server Inference Results for Different Model Versions	28
3.7.4 Server Inference Results for Models with Different Layer Counts	29
3.7.5 Server Comparison of Different ADPWhisper Versions	30
3.7.6 Server Inference Results on Different Public Datasets	32
Chapter 4 System Integration and Testing	35
4.1 Deployment on the Jetson AGX Orin	35
4.1.1 Hardware and Software Environment	35
4.1.2 Whisper Deployment Procedure	35
Chapter 5 Discussion and Future Work	37
5.1 Discussion of Results	37
5.2 Future Research Plans	38
References	40
Appendix A Development Environment and Code Summary	46
A.1 Software Development Environment	46
A.2 Model Parameters	46
A.3 Code Summary	48

List of Figures
Figure 1.1 Goal of applying Whisper to edge devices	2
Figure 1.2 Innodisk Whisper compression system workflow	3
Figure 2.1 Speech portion of the overall system architecture	9
Figure 2.2 JSON format of the Innodisk dataset	10
Figure 2.3 Whisper model architecture	12
Figure 2.4 LoRA architecture	14
Figure 2.5 Whisper PTQ with CTranslate2	15
Figure 3.1 Research and implementation flowchart	18
Figure 3.2 Innodisk dataset files on Hugging Face	19
Figure 3.3 Building the Innodisk training dataset	19
Figure 3.4 Training and validation architecture plan	20
Figure 3.5 Distill Whisper model architecture	22
Figure 3.6 Whisper AdaLoRA with distillation model architecture	22
Figure 3.7 Whisper AdaLoRA model architecture	24
Figure 3.8 ADPWhisper model architecture	25
Figure 3.9 Comparison of results on the male and female test sets	25
Figure 3.10 Accuracy of all model versions	31
Figure 3.11 Token throughput vs. relative latency for models with different distillation depths	33
Figure 3.12 GPU and memory utilization of different models	34
Figure 5.1 Flowchart of the project phases	37

List of Tables
Table 2.1 Model accuracy	16
Table 3.1 Teacher accuracy with AdaLoRA on the RTX 4090 server	26
Table 3.2 Student accuracy with AdaLoRA on the RTX 4090 server	26
Table 3.3 Teacher accuracy with LoRA on the RTX 4090 server	27
Table 3.4 Student accuracy with LoRA on the RTX 4090 server	27
Table 3.5 Teacher accuracy with AdaLoRA on the female test set, RTX 4090 server	28
Table 3.6 Student accuracy with AdaLoRA on the female test set, RTX 4090 server	28
Table 3.7 Different model versions on the RTX 4090 server	29
Table 3.8 Model versions using AdaLoRA and knowledge distillation on the RTX 4090 server	29
Table 3.9 Models with different distillation depths on the RTX 4090 server	30
Table 3.10 Comparison of ADPWhisper models on the RTX 4090 server	31
Table 3.11 Teacher accuracy on Common Voice 16_1	32
Table 3.12 Student accuracy on Common Voice 16_1	32
Table 4.1 Teacher accuracy on the NVIDIA Jetson AGX Orin	36
Table 4.2 Student accuracy on the NVIDIA Jetson AGX Orin	36
References
[1]	Single Board Computer, https://blog.dfi.com/zh-tw/single-board-computers-raspberry-pi
[2]	Edge Computing Device, https://www.twtm.com.tw/backend/filedownload_news.ashx?fileid=ab8351ec-755f-41a7-95bb-2483b643d9dc&tnclass=K&tnid=14043&v=
[3]	T. Hoefler, D. Alistarh, T. Ben-Nun, N. Dryden, and A. Peste, "Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks," Journal of Machine Learning Research, vol. 22, no. 241, pp. 1-124, 2021.
[4]	W. Balzer, M. Takahashi, J. Ohta, and K. Kyuma, "Weight quantization in Boltzmann machines," Neural Networks, vol. 4, no. 3, pp. 405-409, 1991.
[5]	G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
[6]	Z. Han, C. Gao, J. Liu, J. Zhang, and S. Q. Zhang, "Parameter-efficient fine-tuning for large models: A comprehensive survey," arXiv preprint arXiv:2403.14608, 2024.
[7]	J. Pamina and B. Raja, "Survey on deep learning algorithms," International Journal of Emerging Technology and Innovative Engineering, vol. 5, no. 1, 2019.
[8]	A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in International Conference on Machine Learning, 2023: PMLR, pp. 28492-28518.
[9]	C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil, "Model compression," in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 2006, pp. 535-541.
[10]	Jetson AGX Orin, https://www.nvidia.com/zh-tw/autonomous-machines/embedded-systems/jetson-orin/
[11]	E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "Lora: Low-rank adaptation of large language models," arXiv preprint arXiv:2106.09685, 2021.
[12]	Z. Liu, Y. Wang, K. Han, W. Zhang, S. Ma, and W. Gao, "Post-training quantization for vision transformer," Advances in Neural Information Processing Systems, vol. 34, pp. 28092-28103, 2021.
[13]	Compute Unified Device Architecture (CUDA), https://zh.wikipedia.org/zh-tw/CUDA. (Accessed on June 1, 2023)
[14]	S. Gandhi, P. von Platen, and A. M. Rush, "Distil-whisper: Robust knowledge distillation via large-scale pseudo labelling," arXiv preprint arXiv:2311.00430, 2023.
[15]	Q. Zhang et al., "Adalora: Adaptive budget allocation for parameter-efficient fine-tuning," arXiv preprint arXiv:2303.10512, 2023.
[16]	H. Touvron et al., "Llama: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023.
[17]	S. Mirzaei, J. Arzate, and Y. Vijay, "Enhancing Aviation Communication Transcription: Fine-Tuning Distil-Whisper with LoRA," arXiv preprint arXiv:2503.22692, 2025.
[18]	R. Prabhavalkar, T. Hori, T. N. Sainath, R. Schlüter, and S. Watanabe, "End-to-end speech recognition: A survey," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 325-351, 2023.
[19]	L. Rabiner and B. Juang, "An introduction to hidden Markov models," IEEE ASSP Magazine, vol. 3, no. 1, pp. 4-16, 1986.
[20]	D. A. Reynolds, "Gaussian mixture models," Encyclopedia of biometrics, vol. 741, no. 659-663, p. 3, 2009.
[21]	A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd international conference on Machine learning, 2006, pp. 369-376.
[22]	I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," Advances in neural information processing systems, vol. 27, 2014
[23]	A. Vaswani et al., "Attention is all you need," Advances in neural information processing systems, vol. 30, 2017.
[24]	ASR performance, https://paperswithcode.com/sota/automatic-speech-recognition-on-lrs2
[25]	IoT, https://www.techtarget.com/iotagenda/definition/Internet-of-Things-IoT
[26]	NVIDIA Jetson, https://www.nvidia.com/zh-tw/autonomous-machines/embedded-systems/
[27]	Coral Edge TPU, https://coral.ai/products/
[28]	AMD Versal, https://www.amd.com/zh-tw/products/adaptive-socs-and-fpgas/versal/ai-edge-series.html
[29]	TensorRT, https://developer.nvidia.com/tensorrt
[30]	L. Li, Y. Lin, S. Ren, P. Li, J. Zhou, and X. Sun, "Dynamic knowledge distillation for pre-trained language models," arXiv preprint arXiv:2109.11295, 2021.
[31]	A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, "Fitnets: Hints for thin deep nets," arXiv preprint arXiv:1412.6550, 2014.
[32]	X. Jiao et al., "Tinybert: Distilling bert for natural language understanding," arXiv preprint arXiv:1909.10351, 2019.
[33]	X. Wang, R. Zhang, Y. Sun, and J. Qi, "Kdgan: Knowledge distillation with generative adversarial networks," Advances in neural information processing systems, vol. 31, 2018.
[34]	W. Xu et al., "Speculative knowledge distillation: Bridging the teacher-student gap through interleaved sampling," arXiv preprint arXiv:2410.11325, 2024.
[35]	Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu, "Deep mutual learning," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4320-4328.
[36]	B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, "Quantization and training of neural networks for efficient integer-arithmetic-only inference," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2704-2713.
[37]	S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, "Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients," arXiv preprint arXiv:1606.06160, 2016.
[38]	Z. Liu, B. Oguz, C. Zhao, E. Chang, P. Stock, Y. Mehdad, Y. Shi, R. Krishnamoorthi, and V. Chandra, "Llm-qat: Data-free quantization aware training for large language models," arXiv preprint arXiv:2305.17888, 2023.
[39]	H. S. Kim, C. H. Cho, H. Won, and K. H. Park, "Adapt and Prune Strategy for Multilingual Speech Foundational Model on Low-resourced Languages," in Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL), 2023, pp. 85-94.
[40]	D. Klakow and J. Peters, "Testing the correlation of word error rate and perplexity," Speech Communication, vol. 38, no. 1-2, pp. 19-28, 2002.
[41]	Whisper Finetune, https://github.com/yeyupiaoling/Whisper-Finetune
[42]	PEFT, https://github.com/huggingface/peft
[43]	M. A. Davenport and J. Romberg, "An overview of low-rank matrix recovery from incomplete observations," IEEE Journal of Selected Topics in Signal Processing, vol. 10, no. 4, pp. 608-622, 2016.
[44]	Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, and K. Macherey, "Google's neural machine translation system: Bridging the gap between human and machine translation," arXiv preprint arXiv:1609.08144, 2016.
[45]	M. U. Kraemer et al., "The effect of human mobility and control measures on the COVID-19 epidemic in China," Science, vol. 368, no. 6490, pp. 493-497, 2020.
[46]	CTranslate2, https://github.com/OpenNMT/CTranslate2?tab=readme-ov-file
[47]	J. James and D. P. Gopinath, "Advocating character error rate for multilingual asr evaluation," arXiv preprint arXiv:2410.07400, 2024.
[48]	Y.-Y. Wang, A. Acero, and C. Chelba, "Is word error rate a good indicator for spoken language understanding accuracy," in 2003 IEEE workshop on automatic speech recognition and understanding (IEEE Cat. No. 03EX721), 2003: IEEE, pp. 577-582.
[49]	J. L. Hieronymus, X. Liu, M. J. Gales, and P. C. Woodland, "Exploiting Chinese character models to improve speech recognition performance," in INTERSPEECH, 2009, pp. 364-367.
[50]	Distil-Whisper GitHub repository, https://github.com/huggingface/distil-whisper?tab=readme-ov-file
[51]	R. Ardila et al., "Common voice: A massively-multilingual speech corpus," arXiv preprint arXiv:1912.06670, 2019.
[52]	Whisper-Large-v3, https://huggingface.co/openai/whisper-large-v3.
[53]	M. Morise, F. Yokomori, and K. Ozawa, "WORLD: a vocoder-based high-quality speech synthesis system for real-time applications," IEICE TRANSACTIONS on Information and Systems, vol. 99, no. 7, pp. 1877-1884, 2016.
[54]	V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," Proceedings of the IEEE, vol. 105, no. 12, pp. 2295-2329, 2017.
[55]	TensorRT Developer Guide (TensorRT 8.6.1), https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-861/developer-guide/index.html
[56]	N. Benedek and L. Wolf, "PRILoRA: Pruned and rank-increasing low-rank adaptation," arXiv preprint arXiv:2401.11316, 2024.
[57]	Understanding LoRA Performance, Fireworks AI documentation, https://docs.fireworks.ai/guides/understanding_lora_performance
[58]	JetPack SDK, https://developer.nvidia.com/embedded/jetpack
[59]	Pytorch, https://pytorch.org/
[60]	Whisper Hugging Face API, https://huggingface.co/openai/whisper-large-v3
[61]	cuDNN, https://developer.nvidia.com/cudnn.
[62]	Dynamic Batching, https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/examples/jetson/concurrency_and_dynamic_batching/README.html
[63]	K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han, "Haq: Hardware-aware automated quantization with mixed precision," in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 8612-8620.
[64]	X. L. Li and P. Liang, "Prefix-tuning: Optimizing continuous prompts for generation," arXiv preprint arXiv:2101.00190, 2021.
[65]	R. He et al., "On the effectiveness of adapter-based tuning for pretrained language model adaptation," arXiv preprint arXiv:2106.03164, 2021.
[66]	E. L. Hill-Yardin, M. R. Hutchinson, R. Laycock, and S. J. Spencer, "A Chat (GPT) about the future of scientific publishing," Brain, behavior, and immunity, vol. 110, pp. 152-154, 2023.
[67]	P. Xu et al., "Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[68]	T. Chen et al., "{TVM}: An automated {End-to-End} optimizing compiler for deep learning," in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018, pp. 578-594.
Full-Text Access Authorization
National Central Library
Agrees to grant the National Central Library a royalty-free license to make the bibliographic record and full-text electronic file publicly available on the Internet immediately after the authorization form is submitted
On campus
Printed thesis available on campus immediately
Agrees to license the full electronic thesis for worldwide public access
Electronic thesis available on campus immediately
Off campus
Agrees to authorize database vendors
Electronic thesis available off campus immediately
