§ Browse Thesis Bibliographic Record

System ID U0002-2008202313351900
DOI 10.6846/tku202300585
Title (Chinese) 基於語音及角色辨識技術的法庭自動筆錄系統
Title (English) A Smart Court Recording System Based on Speech and Role Recognition Techniques
Title (Third Language)
University 淡江大學 (Tamkang University)
Department (Chinese) 資訊工程學系全英語碩士班
Department (English) Master's Program, Department of Computer Science and Information Engineering (English-taught program)
Foreign Degree School Name
Foreign Degree College Name
Foreign Degree Institute Name
Academic Year 111 (2022-2023)
Semester 2
Publication Year 112 (2023)
Author (Chinese) 羅建昇
Author (English) Jian-Sheng Luo
Student ID 610780065
Degree Master's
Language English
Second Language
Oral Defense Date 2023-06-12
Number of Pages 75
Committee Advisor - 武士戎 (wushihjung@mail.tku.edu.tw)
Committee Member - 蒯思齊
Co-advisor - 張志勇 (cychang@mail.tku.edu.tw)
Keywords (Chinese) BERT
N-gram
ECAPA-TDNN
深度學習 (Deep Learning)
自然語言處理 (Natural Language Processing)
Keywords (English) BERT
N-gram
ECAPA-TDNN
Deep Learning
Natural Language Processing (NLP)
Keywords (Third Language)
Subject Classification
Chinese Abstract
The research goal of this thesis is to develop an "automatic court transcription system" that effectively assists court clerks in their recording work. Facing the challenges that commercial speech recognition software achieves low accuracy in courtroom settings and cannot precisely identify courtroom roles, this study combines speech recognition with role recognition, applying deep learning (Transformer models) and natural language processing to automatically convert court proceedings into written records. For content recognition, to ensure the correctness of proper nouns, this study designs an error correction module (ACTS), trained on courtroom terminology collected by a web crawler, to fix errors left after speech-to-text conversion. For role recognition, this study considers each courtroom role's fixed speaking procedures, common expressions, and dialogue context, and combines these cues with voiceprint recognition to determine roles precisely. The study integrates speech and role recognition, proposes four role identification strategies, and designs a coarse-to-fine ACTS error correction module, effectively improving the efficiency and accuracy of the automatic court transcription system. In the ACTS experiment, on 1,050 sentences interleaving legal and general expressions, Precision is 88.5 versus 84.1 for the compared method [12]. In the role recognition evaluation, on 1,050 sentences covering 12 roles, Precision is 90.6 versus 86.1 for the compared method [22]. In the voiceprint recognition evaluation, with a threshold of 0.7 and 12 roles, Precision is 97.2 versus 91.9 for the compared method [19]. In error correction, role recognition, and voiceprint recognition alike, the system slightly outperforms each compared method.
English Abstract
The research goal of this thesis is to effectively assist court clerks in their recording work by developing an "automatic court transcription system". Faced with the low accuracy of commercial speech recognition software in courtroom settings and its inability to accurately identify courtroom roles, this study combines speech recognition with role recognition technology, using deep learning (Transformer models) and natural language processing to automatically convert court proceedings into written records. For content recognition, to ensure the correctness of proper nouns, this study designs an error correction module (ACTS) that uses web crawlers to collect court terminology for training and corrects errors after speech-to-text conversion. For role recognition, this study takes into account the specific speaking procedures, common expressions, and dialogue context of courtroom roles, and combines these cues with voiceprint recognition technology to make accurate role determinations. This research innovatively integrates speech and role recognition technology, proposes four role recognition strategies, and designs a coarse-to-fine ACTS error correction module, effectively improving the efficiency and accuracy of the automatic court transcription system. In the ACTS experiment, on 1,050 sentences interleaving legal and general expressions, Precision is 88.5, compared with 84.1 for the baseline [12]. In the role recognition evaluation, on 1,050 sentences spanning 12 roles, Precision is 90.6, compared with 86.1 for the baseline [22]. In the voiceprint recognition evaluation, with a threshold of 0.7 and 12 roles, Precision is 97.2, compared with 91.9 for the baseline [19]. In error correction, role recognition, and voiceprint recognition alike, the proposed system slightly outperforms each baseline.
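To illustrate the voiceprint matching step described above, here is a minimal Python sketch of threshold-based role assignment: an utterance embedding is compared against enrolled role embeddings by cosine similarity, and a role is assigned only when the best score clears the 0.7 threshold reported in the abstract. The role names, the embedding dimension, and the use of random vectors in place of a speaker encoder such as ECAPA-TDNN are illustrative assumptions, not the thesis implementation.

```python
# Minimal sketch of threshold-based voiceprint role matching.
# Assumption: embeddings would come from a speaker encoder such as
# ECAPA-TDNN; random vectors stand in for them here.
import numpy as np

THRESHOLD = 0.7  # decision threshold reported in the abstract

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two voiceprint embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def assign_role(utterance_emb, role_embeddings):
    """Return the enrolled role whose voiceprint best matches the utterance,
    or None when no enrolled role clears the threshold."""
    best_role, best_score = None, -1.0
    for role, emb in role_embeddings.items():
        score = cosine_similarity(utterance_emb, emb)
        if score > best_score:
            best_role, best_score = role, score
    return best_role if best_score >= THRESHOLD else None

# Usage with stand-in embeddings for 3 of the 12 courtroom roles:
rng = np.random.default_rng(0)
roles = {r: rng.normal(size=192) for r in ("judge", "prosecutor", "defendant")}
utterance = roles["judge"] + 0.1 * rng.normal(size=192)  # noisy "judge" sample
print(assign_role(utterance, roles))  # -> judge
```

The lexicon-driven side of the content correction can be sketched in the same hedged spirit: snap a suspect token to the closest entry in a crawled legal-term list when the match is close enough. The `LEGAL_TERMS` list and the cutoff below are illustrative stand-ins, not the trained coarse-to-fine ACTS module.

```python
# Minimal sketch of dictionary-based term correction, a stand-in for the
# ACTS error correction module described in the abstract.
import difflib

LEGAL_TERMS = ["羈押", "傳喚", "交互詰問", "證據能力"]  # illustrative lexicon

def correct_term(token: str, lexicon=LEGAL_TERMS, cutoff=0.6) -> str:
    """Return the closest legal term if one is similar enough, else the token."""
    match = difflib.get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return match[0] if match else token

print(correct_term("交互結問"))  # 結/詰 are homophones (jié) -> 交互詰問
```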
Third Language Abstract
Table of Contents
TABLE OF CONTENTS
TABLE OF CONTENTS	III
LIST OF FIGURES	V
LIST OF TABLES	VII
CHAPTER 1 INTRODUCTION	1
1.1 BACKGROUND	1
1.2 MOTIVATION	2
1.3 TARGET	3
1.4 CONTRIBUTION	5
CHAPTER 2 RELATED WORK	7
2.1 Domestic and International Research on Speech Content Recognition	7
2.2 Domestic and International Research on Content Correction Techniques	9
2.3 Domestic and International Research on Speaker Role Recognition	11
CHAPTER 3 BACKGROUND KNOWLEDGE	16
3.1 Spectral Subtraction	16
3.2 Voice Activity Detection (VAD)	16
3.3 Google Speech Recognition	17
3.4 BERT	18
3.5 BERT Error Correction Technique	21
3.6 N-gram Error Correction Technique	24
3.7 FuzzyWuzzy Algorithm	25
3.8 Voiceprint Matching Technique	27
CHAPTER 4 SYSTEM STRUCTURE	30
4.1 Environment and Problem Description	30
4.1.1 Problem to be Solved	30
4.1.2 Problem Statement	31
4.2 System Architecture	33
A.	Audio Noise Reduction and Segmentation	35
B.	Speech Recognition and Content Correction Module	36
C.	Role Recognition Module	43
D.	Automated Collection of Role Voiceprints	54
E.	Voiceprint Matching for Role Identification	55
CHAPTER 5 EXPERIMENT ANALYSIS	57
CHAPTER 6 CONCLUSION	71
REFERENCES	72

LIST OF FIGURES
Figure 1. Comparison Between Existing Speech Recognition Technologies and This Research	4
Figure 2. Two Major Challenges	30
Figure 3. Challenge 1: Difficulty in Distinguishing Among Numerous Roles	31
Figure 4. Challenge 2: Content Intermingling and Confusion	31
Figure 5. Court Automatic Transcription System Architecture Diagram	35
Figure 6. Voice Activity Detection (VAD) for Audio File Segmentation	36
Figure 7. BERT Error Correction Model Training Period	37
Figure 8. Google Speech-to-Text Generation Error Training Sample	38
Figure 9. Speech-to-Text Error Sentence Training Samples	39
Figure 10. BERT Error Correction Model Usage Period	41
Figure 11. N-gram Model	42
Figure 12. Court Standard Process Flowchart	44
Figure 13. Collect Standard Process Phrases	44
Figure 14. Automatically Collect Common Words for Each Role	48
Figure 15. Automatically Collect BERT Slot Filling Training Samples	50
Figure 16. BERT Contextual Sentence Inference Slot Filling Training	53
Figure 17. BERT Contextual Sentence Inference Classification Training	53
Figure 18. Contextual Sentence Inference BERT Model Usage Period	54
Figure 19. Voiceprint Matching Model Training Period	56
Figure 20. Comparison of Precision, Recall, and F1-score for Two Approaches	60
Figure 21. Operational Time Required for Two Methods	61
Figure 22. Comparison of Precision, Recall, and F1-score for Two Approaches	63
Figure 23. Comparison of Precision, Recall, and F1-score for Two Approaches	66
Figure 24. Visualization of Courtroom Standard Process	70
Figure 25. Speech Proportions for Each Role in Court Hearings	70

LIST OF TABLES
Table 1. Comparison Table of Related Works	15
Table 2. Experimental Setup of the Content Correction Module	58
Table 3. Comparison between Ours + Google and Google Only in This Paper	68

References
[1]	H. Miao, G. Cheng, P. Zhang and Y. Yan, "Online Hybrid CTC/Attention End-to-End Automatic Speech Recognition Architecture," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1452-1465, 2020.
[2]	H. Hu, R. Zhao, J. Li, L. Lu and Y. Gong, "Exploring Pre-Training with Alignments for RNN Transducer Based End-to-End Speech Recognition," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
[3]	R. Hsiao, D. Can, T. Ng, R. Travadi and A. Ghoshal, "Online Automatic Speech Recognition with Listen, Attend and Spell Model," IEEE Signal Processing Letters, vol. 27, pp. 1889-1893, 2020.
[4]	K. Hu, T. N. Sainath, R. Pang and R. Prabhavalkar, "Deliberation Model Based Two-Pass End-to-End Speech Recognition," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7799-7803, 2020.
[5]	I. Sklyar, A. Piunova and Y. Liu, "Streaming Multi-Speaker ASR with RNN-T," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6903-6907, 2021.
[6]	Z. Zhang, J. Geiger, J. Pohjalainen, A. E.-D. Mousa, W. Jin and B. Schuller, "Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 9, no. 5, pp. 1-28, 2018.
[7]	D. Amodei et al., "Deep Speech 2: End-to-End Speech Recognition in English and Mandarin," International Conference on Machine Learning (ICML), pp. 173-182, PMLR, 2016.
[8]	X. Li, Y. Yang, Z. Pang and X. Wu, "A Comparative Study on Selecting Acoustic Modeling Units in Deep Neural Networks Based Large Vocabulary Chinese Speech Recognition," Neurocomputing, vol. 170, pp. 251-256, 2015.
[9]	S. Wang, P. Zhou, W. Chen, J. Jia and L. Xie, "Exploring RNN-Transducer for Chinese Speech Recognition," Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 1364-1369, 2019.
[10]	S. Zhang, H. Huang, J. Liu and H. Li, "Spelling Error Correction with Soft-Masked BERT," Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020.
[11]	J. Devlin, M.-W. Chang, K. Lee and K. Toutanova, "BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding," Proceedings of NAACL-HLT, 2019.
[12]	X. Cheng, W. Xu, K. Chen, S. Jiang, F. Wang, T. Wang, W. Chu and Y. Qi, "SpellGCN: Incorporating Phonological and Visual Similarities into Language Models for Chinese Spelling Check," Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020.
[13]	Z. Guo, Y. Ni, K. Wang, W. Zhu and G. Xie, "Global Attention Decoder for Chinese Spelling Error Correction," Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 1419-1428, 2021.
[14]	J. H. Lee, M. Kim and H. C. Kwon, "Deep Learning-Based Context-Sensitive Spelling Typing Error Correction," IEEE Access, vol. 8, pp. 152565-152578, 2020.
[15]	R. Zhang, C. Pang, C. Zhang, S. Wang, Z. He, Y. Sun, et al., "Correcting Chinese Spelling Errors with Phonetic Pre-Training," Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 2250-2261, 2021.
[16]	W. Gou and Z. Chen, "Think Twice: A Post-Processing Approach for the Chinese Spelling Error Correction," Applied Sciences, vol. 11, no. 13, 5832, 2021.
[17]	J. Li, G. Wu, D. Yin, H. Wang and Y. Wang, "DCSpell: A Detector-Corrector Framework for Chinese Spelling Error Correction," Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1870-1874, 2021.
[18]	Q. Wang, M. Liu, W. Zhang, Y. Guo and T. Li, "Automatic Proofreading in Chinese: Detect and Correct Spelling Errors in Character-Level with Deep Neural Networks," Natural Language Processing and Chinese Computing (NLPCC 2019), Part II, pp. 349-359, Springer, 2019.
[19]	D. Snyder et al., "X-Vectors: Robust DNN Embeddings for Speaker Recognition," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329-5333, 2018.
[20]	O. Ghahabi and J. Hernando, "Deep Learning Backend for Single and Multisession i-Vector Speaker Recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 807-817, 2017.
[21]	N. Kanda et al., "Serialized Output Training for End-to-End Overlapped Speech Recognition," arXiv preprint arXiv:2003.12687, 2020.
[22]	N. Kanda et al., "Streaming Multi-Talker ASR with Token-Level Serialized Output Training," arXiv preprint arXiv:2202.00842, 2022.
[23]	A. Q. Ohi, M. F. Mridha, M. A. Hamid and M. M. Monowar, "Deep Speaker Recognition: Process, Progress, and Challenges," IEEE Access, vol. 9, pp. 89619-89643, 2021, doi: 10.1109/ACCESS.2021.3090109.
[24]	M. Ravanelli and Y. Bengio, "Speaker Recognition from Raw Waveform with SincNet," IEEE Spoken Language Technology Workshop (SLT), pp. 1021-1028, 2018.
[25]	S. Hourri, N. S. Nikolov and J. Kharroubi, "A Deep Learning Approach to Integrate Convolutional Neural Networks in Speaker Recognition," International Journal of Speech Technology, vol. 23, pp. 615-623, 2020.
Full-Text Usage Authorization
National Central Library
Agrees to grant a royalty-free license to the National Central Library; the bibliographic record and electronic full text will be made publicly available on the Internet immediately after the authorization form is submitted.
On Campus
Print copy of the thesis available on campus immediately
Agrees to authorize worldwide public access to the electronic full text
Electronic thesis available on campus immediately
Off Campus
Agrees to license the thesis to database vendors
Electronic thesis available off campus immediately
