| System ID | U0002-2008202313351900 |
|---|---|
| DOI | 10.6846/tku202300585 |
| Thesis Title (Chinese) | 基於語音及角色辨識技術的法庭自動筆錄系統 |
| Thesis Title (English) | A Smart Court Recording System Based on Speech and Role Recognition Techniques |
| Thesis Title (Third Language) | |
| University | Tamkang University (淡江大學) |
| Department (Chinese) | 資訊工程學系全英語碩士班 |
| Department (English) | Master's Program, Department of Computer Science and Information Engineering (English-taught program) |
| Foreign Degree School | |
| Foreign Degree College | |
| Foreign Degree Institute | |
| Academic Year | 111 (ROC calendar) |
| Semester | 2 |
| Publication Year | 112 (ROC calendar; 2023) |
| Graduate Student (Chinese) | 羅建昇 |
| Graduate Student (English) | Jian-Sheng Luo |
| Student ID | 610780065 |
| Degree | Master's |
| Language | English |
| Second Language | |
| Defense Date | 2023-06-12 |
| Number of Pages | 75 |
| Oral Defense Committee | Advisor: 武士戎 (wushihjung@mail.tku.edu.tw); Co-advisor: 張志勇 (cychang@mail.tku.edu.tw); Committee Member: 蒯思齊 |
| Keywords (Chinese) | BERT; N-gram; ECAPA-TDNN; 深度學習; 自然語言處理 |
| Keywords (English) | BERT; N-gram; ECAPA-TDNN; Deep Learning; Natural Language Processing (NLP) |
| Keywords (Third Language) | |
| Subject Classification | |
| Chinese Abstract |
This thesis aims to effectively assist court clerks in their recording work by developing an automatic court transcription system. Commercial speech recognition software achieves low accuracy in courtroom settings and cannot accurately identify courtroom roles. To meet these challenges, this study combines speech recognition with role recognition, applying deep learning (Transformer models) and natural language processing to automatically convert court proceedings into written records. For content recognition, to ensure the correctness of legal terminology, the study designs an error correction module (ACTS) trained on courtroom terminology collected by web crawlers, which corrects errors remaining after speech-to-text conversion. For role recognition, the study considers the specific speaking procedures, common phrases, and dialogue context of each courtroom role, combined with voiceprint recognition, to determine roles precisely. The study innovatively integrates speech and role recognition, proposes four role identification strategies, and designs a coarse-to-fine ACTS error correction module, effectively improving the efficiency and accuracy of the automatic court transcription system. In the ACTS experiments, on 1,050 sentences mixing legal and general terms, Precision is 88.5, versus 84.1 for the compared work [12]. In the role recognition evaluation, with 1,050 sentences and 12 role types, Precision is 90.6, versus 86.1 for [22]. In the voiceprint recognition evaluation, at a threshold of 0.7 with 12 role types, Precision is 97.2, versus 91.9 for [19]. The system slightly outperforms each compared work in error correction, role recognition, and voiceprint recognition alike. |
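The coarse-to-fine design that the abstract attributes to ACTS can be illustrated with a small sketch: an N-gram pass first flags characters whose surrounding character statistics are unseen, and a BERT masked language model then re-predicts only the flagged positions. This is a minimal sketch of the idea, not the thesis's ACTS implementation; the bigram table is a toy stand-in for the statistics the thesis gathers from crawled legal text, and the Hugging Face `transformers` package with the public `bert-base-chinese` checkpoint is assumed.

```python
from transformers import pipeline

# Coarse stage: a toy character-bigram table standing in for the N-gram
# statistics the thesis collects from crawled legal text (illustrative only).
LEGAL_BIGRAMS = {"法官", "官請", "請證", "證人", "人作", "作證"}

def suspicious_positions(sentence: str) -> list[int]:
    """Flag characters for which no surrounding bigram is known (coarse pass)."""
    flagged = []
    for i in range(len(sentence)):
        left = sentence[i - 1:i + 1] if i > 0 else None
        right = sentence[i:i + 2] if i < len(sentence) - 1 else None
        if all(bg not in LEGAL_BIGRAMS for bg in (left, right) if bg):
            flagged.append(i)
    return flagged

# Fine stage: mask each flagged character and let a BERT masked LM re-predict it.
fill = pipeline("fill-mask", model="bert-base-chinese")

def correct(sentence: str) -> str:
    chars = list(sentence)
    for i in suspicious_positions(sentence):
        masked = "".join(chars[:i]) + fill.tokenizer.mask_token + "".join(chars[i + 1:])
        chars[i] = fill(masked, top_k=1)[0]["token_str"]
    return "".join(chars)

# Toy run: "正" (a homophone error for "證") is the only flagged character,
# so only that position is re-predicted; the output depends on the model.
print(correct("法官請正人作證"))
```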
| English Abstract |
The goal of this thesis is to effectively assist the court clerk's recording work by developing an automatic court transcription system. Speech recognition software on the market achieves low accuracy in courtroom scenes and cannot accurately identify courtroom roles. To address these challenges, this study combines speech recognition with role recognition, using deep learning (Transformer models) and natural language processing to automatically convert court proceedings into written records. For content recognition, in order to ensure the correctness of proper nouns, this study designs an error correction module (ACTS) that uses web crawlers to collect court terminology for training and corrects errors left by speech-to-text conversion. For role recognition, this study takes into account the specific speech procedures, common expressions, and dialogue context of courtroom roles, and combines them with voiceprint recognition to make accurate role determinations. The research integrates speech and role recognition, proposes four role identification strategies, and designs a coarse-to-fine ACTS error correction module, improving both the efficiency and the accuracy of the automatic court transcription system. In the ACTS experiment, on 1,050 sentences mixing legal and general terms, Precision is 88.5, versus 84.1 for the compared work [12]. In the evaluation of the role recognition module, with 1,050 sentences and 12 role types, Precision is 90.6, versus 86.1 for [22]. In the voiceprint recognition evaluation, at a threshold of 0.7 with 12 role types, Precision is 97.2, versus 91.9 for [19]. In error correction, role recognition, and voiceprint recognition alike, the system is slightly better than the compared works. |
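The voiceprint matching decision described above (accept the best-matching role only when similarity clears the 0.7 threshold) reduces to a nearest-neighbor test over speaker embeddings. Below is a minimal sketch under stated assumptions, not the thesis's implementation: it assumes fixed-length speaker embeddings (e.g., from an ECAPA-TDNN model, per the keywords) have already been extracted, and the enrollment table, helper names, and random stand-in vectors are hypothetical.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify_role(utterance_emb: np.ndarray,
                  enrolled: dict[str, np.ndarray],
                  threshold: float = 0.7) -> str | None:
    """Return the enrolled role whose voiceprint best matches the utterance,
    or None when no similarity reaches the threshold (0.7 in the abstract)."""
    best_role, best_score = None, threshold
    for role, emb in enrolled.items():
        score = cosine_similarity(utterance_emb, emb)
        if score >= best_score:
            best_role, best_score = role, score
    return best_role

# Toy usage: random vectors stand in for real ECAPA-TDNN embeddings.
rng = np.random.default_rng(0)
enrolled = {"judge": rng.normal(size=192), "clerk": rng.normal(size=192)}
query = enrolled["judge"] + 0.1 * rng.normal(size=192)  # near-duplicate voiceprint
print(identify_role(query, enrolled))  # -> "judge"
```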
| Third-Language Abstract | |
| Table of Contents |
LIST OF THE CONTENT III
LIST OF FIGURES V
LIST OF TABLES VII
CHAPTER 1 INTRODUCTION 1
1.1 Background 1
1.2 Motivation 2
1.3 Target 3
1.4 Contribution 5
CHAPTER 2 RELATED WORK 7
2.1 Domestic and International Research on Speech Content Recognition 7
2.2 Domestic and International Research on Content Correction Techniques 9
2.3 Domestic and International Research on Speaker Role Recognition 11
CHAPTER 3 BACKGROUND KNOWLEDGE 16
3.1 Spectral Subtraction 16
3.2 Voice Activity Detection (VAD) 16
3.3 Google Speech Recognition 17
3.4 BERT 18
3.5 BERT Error Correction Technique 21
3.6 N-gram Error Correction Technique 24
3.7 Fuzzy-wuzzy Algorithm 25
3.8 Voiceprint Matching Technique 27
CHAPTER 4 SYSTEM STRUCTURE 30
4.1 Environment and Problem Description 30
4.1.1 Problem to be Solved 30
4.1.2 Problem Statement 31
4.2 System Architecture 33
A. Audio Noise Reduction and Segmentation 35
B. Speech Recognition and Content Correction Module 36
C. Role Recognition Module 43
D. Automated Collection of Role Voiceprints 54
E. Voiceprint Matching for Role Identification 55
CHAPTER 5 EXPERIMENT ANALYSIS 57
CHAPTER 6 CONCLUSION 71
REFERENCES 72

LIST OF FIGURES
Figure 1. Comparison Between Existing Speech Recognition Technologies and This Research 4
Figure 2. Two Major Challenges 30
Figure 3. Challenge 1: Difficulty in Distinguishing Among Numerous Roles 31
Figure 4. Challenge 2: Content Intermingling and Confusion 31
Figure 5. Court Automatic Transcription System Architecture Diagram 35
Figure 6. Voice Activity Detection (VAD) for Audio File Segmentation 36
Figure 7. BERT Error Correction Model Training Period 37
Figure 8. Google Speech-to-Text Generation Error Training Sample 38
Figure 9. Speech-to-Text Error Sentence Training Samples 39
Figure 10. BERT Error Correction Model Usage Period 41
Figure 11. N-gram Model 42
Figure 12. Court Standard Process Flowchart 44
Figure 13. Collect Standard Process Phrases 44
Figure 14. Automatically Collect Common Words for Each Role 48
Figure 15. Automatically Collect BERT Slot Filling Training Samples 50
Figure 16. BERT Contextual Sentence Inference Slot Filling Training 53
Figure 17. BERT Contextual Sentence Inference Classification Training 53
Figure 18. Contextual Sentence Inference BERT Model Usage Period 54
Figure 19. Voiceprint Matching Model Training Period 56
Figure 20. Comparison of Precision, Recall, and F1-score for Two Approaches 60
Figure 21. Operational Time Required for Two Methods 61
Figure 22. Comparison of Precision, Recall, and F1-score for Two Approaches 63
Figure 23. Comparison of Precision, Recall, and F1-score for Two Approaches 66
Figure 24. Visualization of Courtroom Standard Process 70
Figure 25. Speech Proportions for Each Role in Court Hearings 70

LIST OF TABLES
Table 1. Comparison Table of Related Works 15
Table 2. Experimental Setup of the Content Correction Module 58
Table 3. Comparison between Ours + Google and Only Google in This Paper 68 |
| References |
[1] H. Miao, G. Cheng, P. Zhang, and Y. Yan, "Online Hybrid CTC/Attention End-to-End Automatic Speech Recognition Architecture," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1452-1465, 2020.
[2] H. Hu, R. Zhao, J. Li, L. Lu, and Y. Gong, "Exploring Pre-Training with Alignments for RNN Transducer Based End-to-End Speech Recognition," in Proc. IEEE ICASSP, 2020.
[3] R. Hsiao, D. Can, T. Ng, R. Travadi, and A. Ghoshal, "Online Automatic Speech Recognition with Listen, Attend and Spell Model," IEEE Signal Processing Letters, vol. 27, pp. 1889-1893, 2020.
[4] K. Hu, T. N. Sainath, R. Pang, and R. Prabhavalkar, "Deliberation Model Based Two-Pass End-to-End Speech Recognition," in Proc. IEEE ICASSP, pp. 7799-7803, 2020.
[5] I. Sklyar, A. Piunova, and Y. Liu, "Streaming Multi-Speaker ASR with RNN-T," in Proc. IEEE ICASSP, pp. 6903-6907, 2021.
[6] Z. Zhang, J. Geiger, J. Pohjalainen, A. E.-D. Mousa, W. Jin, and B. Schuller, "Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments," ACM Transactions on Intelligent Systems and Technology, vol. 9, no. 5, pp. 1-28, 2018.
[7] D. Amodei et al., "Deep Speech 2: End-to-End Speech Recognition in English and Mandarin," in Proc. International Conference on Machine Learning (ICML), pp. 173-182, 2016.
[8] X. Li, Y. Yang, Z. Pang, and X. Wu, "A Comparative Study on Selecting Acoustic Modeling Units in Deep Neural Networks Based Large Vocabulary Chinese Speech Recognition," Neurocomputing, vol. 170, pp. 251-256, 2015.
[9] S. Wang, P. Zhou, W. Chen, J. Jia, and L. Xie, "Exploring RNN-Transducer for Chinese Speech Recognition," in Proc. APSIPA ASC, pp. 1364-1369, 2019.
[10] S. Zhang, H. Huang, J. Liu, and H. Li, "Spelling Error Correction with Soft-Masked BERT," in Proc. 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020.
[11] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," in Proc. NAACL-HLT, 2019.
[12] X. Cheng, W. Xu, K. Chen, S. Jiang, F. Wang, T. Wang, W. Chu, and Y. Qi, "SpellGCN: Incorporating Phonological and Visual Similarities into Language Models for Chinese Spelling Check," in Proc. 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020.
[13] Z. Guo, Y. Ni, K. Wang, W. Zhu, and G. Xie, "Global Attention Decoder for Chinese Spelling Error Correction," in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 1419-1428, 2021.
[14] J. H. Lee, M. Kim, and H. C. Kwon, "Deep Learning-Based Context-Sensitive Spelling Typing Error Correction," IEEE Access, vol. 8, pp. 152565-152578, 2020.
[15] R. Zhang, C. Pang, C. Zhang, S. Wang, Z. He, Y. Sun, et al., "Correcting Chinese Spelling Errors with Phonetic Pre-training," in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 2250-2261, 2021.
[16] W. Gou and Z. Chen, "Think Twice: A Post-Processing Approach for the Chinese Spelling Error Correction," Applied Sciences, vol. 11, no. 13, p. 5832, 2021.
[17] J. Li, G. Wu, D. Yin, H. Wang, and Y. Wang, "DCSpell: A Detector-Corrector Framework for Chinese Spelling Error Correction," in Proc. 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1870-1874, 2021.
[18] Q. Wang, M. Liu, W. Zhang, Y. Guo, and T. Li, "Automatic Proofreading in Chinese: Detect and Correct Spelling Errors in Character-Level with Deep Neural Networks," in Proc. NLPCC 2019, Part II, pp. 349-359, Springer, 2019.
[19] D. Snyder et al., "X-Vectors: Robust DNN Embeddings for Speaker Recognition," in Proc. IEEE ICASSP, pp. 5329-5333, 2018.
[20] O. Ghahabi and J. Hernando, "Deep Learning Backend for Single and Multisession i-Vector Speaker Recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 807-817, 2017.
[21] N. Kanda et al., "Serialized Output Training for End-to-End Overlapped Speech Recognition," arXiv preprint arXiv:2003.12687, 2020.
[22] N. Kanda et al., "Streaming Multi-Talker ASR with Token-Level Serialized Output Training," arXiv preprint arXiv:2202.00842, 2022.
[23] A. Q. Ohi, M. F. Mridha, M. A. Hamid, and M. M. Monowar, "Deep Speaker Recognition: Process, Progress, and Challenges," IEEE Access, vol. 9, pp. 89619-89643, 2021, doi: 10.1109/ACCESS.2021.3090109.
[24] M. Ravanelli and Y. Bengio, "Speaker Recognition from Raw Waveform with SincNet," in Proc. IEEE Spoken Language Technology Workshop (SLT), pp. 1021-1028, 2018.
[25] S. Hourri, N. S. Nikolov, and J. Kharroubi, "A Deep Learning Approach to Integrate Convolutional Neural Networks in Speaker Recognition," International Journal of Speech Technology, vol. 23, pp. 615-623, 2020. |
| Full-Text Availability | |