Tamkang University Chueh Sheng Memorial Library (TKU Library)


System ID  U0002-1602202012292000
Thesis Title (Chinese)  應用BERT語言模型於同音別字之訂正
Thesis Title (English)  Homophone correction using the BERT language model
University  Tamkang University
Department (Chinese)  資訊管理學系碩士班
Department (English)  Department of Information Management
Academic Year  108
Semester  1
Year of Publication  109
Author (Chinese)  蔣宜靜
Author (English)  Yi-Jing Chiang
Student ID  607630109
Degree  Master's
Language  Chinese
Oral Defense Date  2020-01-06
Number of Pages  38
Thesis Committee  Advisor: 魏世杰
Member: 魏世杰
Member: 戴敏育
Member: 陸承志
Keywords (Chinese)  BERT、同音別字、注意力機制、深度學習、自然語言處理
Keywords (English)  BERT; Typos; Attention Mechanism; Deep Learning; NLP
Subject Classification
Abstract (Chinese)  Writing is a tool for recording language, and every character carries its own meaning; typos, however, can cause the intended message to be misunderstood or make a text harder to read. With the spread of technology, most people now communicate by typing text messages; while malformed characters are no longer a concern, wrong-character (homophone) errors have become increasingly common.
  With the release of pre-trained models, the field of natural language processing, which previously relied on massive computation, has flourished, greatly lowering the resource barrier of training every application from scratch. This study fine-tunes the BERT pre-trained architecture to build a typo detection system that reaches a detection accuracy of 0.878. Building on the detection system, a typo correction system is then constructed on the BERT pre-trained model; for sentences containing typos, the correction accuracy reaches 0.747. Together, the two stages form a system that effectively identifies and corrects Chinese typos.
Abstract (English)  Text is a tool for recording language, and every word carries its own meaning. With the popularity of technology, most people now communicate by typing text messages; however, typos may cause the original meaning to be misunderstood or make the text harder to read. With the advent of pre-trained models, the field of natural language processing has seen significant progress, as each application is spared the cost of time-consuming training from scratch.
  This work constructs a typo detection system by fine-tuning the BERT model; the detection accuracy reaches 0.878. Building on the detection system, a typo correction system is then established on the BERT pre-trained model; for sentences containing typos, the correction accuracy reaches 0.747. Together, the two stages form a system that effectively identifies and corrects Chinese typos.
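
As a concrete reading of the two-stage pipeline the abstracts describe, the sketch below tags typo positions with a token-classification head on top of BERT and then fills each flagged position with a masked-language-model prediction. It is a minimal sketch, assuming the Hugging Face transformers library and the bert-base-chinese checkpoint, neither of which this record names; the detector would still need to be fine-tuned on the prepared training data before its tags mean anything, and the thesis restricts correction candidates to a homophone confusion set, whereas this sketch simply takes the highest-scoring character.

```python
# Illustrative only: toolkit, checkpoint, and tag set are assumptions, not the thesis's code.
import torch
from transformers import (
    BertTokenizerFast,
    BertForTokenClassification,  # detection stage: per-character "typo / not typo" tags
    BertForMaskedLM,             # correction stage: fill the masked typo position
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
# The classification head starts untrained; in the thesis it is fine-tuned on
# sentences whose characters are labeled typo / non-typo (the NER-style format).
detector = BertForTokenClassification.from_pretrained("bert-base-chinese", num_labels=2)
corrector = BertForMaskedLM.from_pretrained("bert-base-chinese")

def detect_typos(sentence: str):
    """Return character positions the detector tags as typos (label 1)."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = detector(**enc).logits            # (1, seq_len, 2)
    tags = logits.argmax(-1)[0].tolist()
    # Drop [CLS]/[SEP]; bert-base-chinese tokenizes Chinese text character by
    # character, so sentence position i maps to token i + 1.
    return [i for i, tag in enumerate(tags[1:-1]) if tag == 1]

def correct_typo(sentence: str, pos: int) -> str:
    """Mask the flagged character and let the masked LM propose a replacement."""
    masked = sentence[:pos] + tokenizer.mask_token + sentence[pos + 1:]
    enc = tokenizer(masked, return_tensors="pt")
    mask_idx = (enc.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = corrector(**enc).logits           # (1, seq_len, vocab_size)
    # The thesis limits candidates to homophones of the original character;
    # here we just take the single most probable character.
    best_id = logits[0, mask_idx].argmax(-1).item()
    return sentence[:pos] + tokenizer.decode([best_id]).strip() + sentence[pos + 1:]

sentence = "他在也不來了"        # hypothetical input with 在 misused for 再
corrected = sentence
for p in detect_typos(sentence):
    corrected = correct_typo(corrected, p)
```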
Table of Contents
Contents III
Chapter 1 Introduction 1
1.1 Research Background 1
1.2 Research Motivation 1
1.3 Research Objectives 2
1.4 Thesis Organization 2
Chapter 2 Literature Review 3
2.1 Typo Correction 3
2.2 Deep Learning 5
2.3 Natural Language Processing 5
2.4 The BERT Language Model 6
Chapter 3 Methodology 10
3.1 Overview 10
3.2 System Architecture and Workflow 10
3.3 Corpus Sources 10
3.4 System Architecture 12
3.5 Preprocessing 13
3.6 Fine-tuning 17
3.7 Core Modules 19
3.8 Evaluation Method 20
3.9 Three Evaluation Metrics 21
Chapter 4 Implementation and Experimental Results 24
4.1 Experimental Environment 24
4.2 Implementation Architecture 24
4.3 The BERT Pre-trained Model 24
4.4 Datasets 24
4.4.1 Confusion Set 24
4.4.2 Training Set 25
4.4.3 Test Set 26
4.5 Experimental Parameters 26
4.6 Experimental Results 27
4.6.1 Detection Stage Results 27
4.6.2 Correction Stage Results 30
4.6.3 Combined Score Results 30
4.6.4 Discussion 31
Chapter 5 Conclusions and Future Work 34
5.1 Conclusions 34
5.2 Research Limitations 34
5.3 Future Work 35
5.3.1 Relaxing the Size Limit of the Homophone Confusion Set 35
5.3.2 Adding Confusion Sets of Similar-Sounding Characters 35
5.3.3 Adding a Visually Similar Character Confusion Set 35
References 36

List of Figures
Figure 1 Architectural differences among the BERT, OpenAI GPT, and ELMo pre-trained models (Devlin et al., 2018) 6
Figure 2 System flowchart 10
Figure 3 System architecture 12
Figure 4 Preprocessing workflow for the detection dataset 13
Figure 5 Detection preprocessing step 1: extracting fields 14
Figure 6 Detection preprocessing step 2: converting simplified to traditional Chinese 14
Figure 7 Detection preprocessing step 3: sentence segmentation 14
Figure 8 Detection preprocessing step 4: selecting sentences (using 在 as an example) 15
Figure 9 Detection preprocessing step 5: substituting typos (在, 再) 15
Figure 10 Detection preprocessing step 6: shuffling 15
Figure 11 Detection preprocessing step 7: converting to the NER input format 15
Figure 12 Preprocessing workflow for the correction dataset 16
Figure 13 Correction preprocessing step 1: locating the typo position 16
Figure 14 Correction preprocessing step 2: masking the typo position 16
Figure 15 Fine-tuning scenarios (Devlin et al., 2018) 17
Figure 16 BERT input representation (Devlin et al., 2019) 18
Figure 17 Detection task setup 19
Figure 18 Correction task setup 19
Figure 19 Example sentence 23
Figure 20 Result statistics for the example sentence 23
Figure 21 Training-set example for the detection model 25
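
Figures 4 through 14 outline the data preparation: sentences containing a target character are paired with copies in which that character is swapped for a homophone, each character is labeled in an NER-style format for the detection model, and the typo position is masked for the correction model. The sketch below illustrates the two output formats with the 在/再 pair named in the captions; the tag names and exact file layout are assumptions, since the record itself does not show them.

```python
# Hypothetical preprocessing outputs; the tag names ("O"/"T") and the example
# sentence are illustrative, not taken from the thesis.

def to_detection_example(correct: str, typo: str):
    """Pair each character with a tag: 'T' where a typo was injected, 'O' elsewhere
    (a character-level, NER-style labeling)."""
    return [(c_t, "O" if c_t == c_c else "T") for c_c, c_t in zip(correct, typo)]

def to_correction_example(typo: str, positions):
    """Replace each flagged position with [MASK] for the masked-LM correction stage."""
    chars = list(typo)
    for p in positions:
        chars[p] = "[MASK]"
    return "".join(chars)

correct_sentence = "他再也不來了"   # 再 used correctly
typo_sentence    = "他在也不來了"   # 在 substituted for 再, as in Figure 9

tagged = to_detection_example(correct_sentence, typo_sentence)
# [('他', 'O'), ('在', 'T'), ('也', 'O'), ('不', 'O'), ('來', 'O'), ('了', 'O')]

flagged = [i for i, (_, tag) in enumerate(tagged) if tag == "T"]
masked = to_correction_example(typo_sentence, flagged)
# '他[MASK]也不來了'
```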

List of Tables
Table 1 Masking scheme of the MLM pre-training task 7
Table 2 Quartiles of the number of characters per pronunciation in the confusion character set 11
Table 3 Confusion matrix 20
Table 4 Sentence-level confusion matrix 21
Table 5 Combined score cases 22
Table 6 Confusion character set 25
Table 7 Correction model test-set input 26
Table 8 Experimental parameters for BERT fine-tuning 26
Table 9 Sentence-level detection result cases 27
Table 10 Example sentences for each case 27
Table 11 Sentence-level detection result calculation 28
Table 12 Character-level detection result cases 29
Table 13 Character-level detection result calculation 29
Table 14 Test-set example 30
Table 15 Combined score example 31
Table 16 Sentence-level confusion matrix 31
Table 17 Character-level confusion matrix 32
Table 18 Correction results 33
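
Tables 3 through 5 and 9 through 17 report detection results as confusion matrices at both the sentence level and the character level. As a reminder of how such counts reduce to summary figures like the accuracy quoted in the abstract, here is a generic confusion-matrix sketch; the counts are placeholders, and the thesis's own scoring rules, including the combined score of Tables 5 and 15, are not reproduced in this record.

```python
# Standard confusion-matrix summaries, applicable at either granularity
# (sentence level: was the whole sentence judged correctly? character level:
# was each character tagged correctly?). Counts below are placeholders.

def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0

tp, fp, fn, tn = 440, 35, 26, 499   # placeholder counts, not the thesis's results
print(f"accuracy={accuracy(tp, fp, fn, tn):.3f}  "
      f"precision={precision(tp, fp):.3f}  recall={recall(tp, fn):.3f}")
```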
References
[1] Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR 2015.
[2] Chang, C. H. (1995). A new approach for automatic Chinese spelling correction. In Proceedings of the Natural Language Processing Pacific Rim Symposium '95, Seoul, Korea, pp. 278-283.
[3] Chang, T. H., Chen, H. C., Tseng, Y. H., & Zheng, J. L. (2013). Automatic detection and correction for Chinese misspelled words using phonological and orthographic similarities. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing (SIGHAN-7), Nagoya, Japan, pp. 97-101.
[4] Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P., & Robinson, T. (2013). One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.
[5] Chiu, H. W., Wu, J. C., & Chang, J. S. (2013). Chinese spelling checker based on statistical machine translation. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing (SIGHAN-7), Nagoya, Japan, pp. 49-53.
[6] Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP 2014.
[7] Dai, A. M., & Le, Q. V. (2015). Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pp. 3079-3087.
[8] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
[9] Huang, C. M., Wu, M. C., & Chang, C. C. (2007). Error detection and correction based on Chinese phonemic alphabet in Chinese text. In Proceedings of the Fourth Conference on Modeling Decisions for Artificial Intelligence (MDAI IV), pp. 463-476.
[10] Kaiser, L., Gomez, A. N., & Chollet, F. (2017). Depthwise separable convolutions for neural machine translation. arXiv preprint arXiv:1706.03059.
[11] Kaiser, L., Gomez, A. N., Shazeer, N., Vaswani, A., Parmar, N., Jones, L., & Uszkoreit, J. (2017). One model to learn them all. arXiv preprint arXiv:1706.05137.
[12] Lee, C.-W. (2017). HMM-based Chinese spelling check. Master's thesis, Taiwan University, pp. 1-47.
[13] Liu, X., Cheng, F., Luo, Y., Duh, K., & Matsumoto, Y. (2013). A hybrid Chinese spelling correction using language model and statistical machine translation with reranking. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing (SIGHAN-7), Nagoya, Japan, pp. 54-58.
[14] Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP 2015.
[15] Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. In Proceedings of NAACL-HLT 2018, pp. 2227-2237.
[16] Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding with unsupervised learning. Technical report, OpenAI.
[17] Ren, F., Shi, H., & Zhou, Q. (2001). A hybrid approach to automatic Chinese text checking and error correction. In Proceedings of the 2001 IEEE International Conference on Systems, Man, and Cybernetics, pp. 1693-1698.
[18] Samanta, P., & Chaudhuri, B. B. (2013). A simple real-word error detection and correction using local word bigram and trigram. In Proceedings of ROCLING.
[19] Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104-3112.
[20] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), pp. 1929-1958.
[21] Taylor, W. L. (1953). Cloze procedure: A new tool for measuring readability. Journalism Quarterly, 30, pp. 415-433.
[22] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008.
[23] Wang, Y.-R., & Liao, Y.-F. (2014). NCTU and NTUT's entry to CLP-2014 Chinese spelling check evaluation. In Proceedings of the Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, Association for Computational Linguistics, pp. 216-219.
[24] Wang, Y.-R., & Liao, Y.-F. (2015). Word vector/conditional random field-based Chinese spelling error detection for SIGHAN-2015 evaluation. In Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing (SIGHAN-8), pp. 46-49.
[25] Wu, S.-H., Chen, Y.-Z., Yang, P.-C., Ku, T., & Liu, C.-L. (2010). Reducing the false alarm rate of Chinese character error detection and correction. In Proceedings of the CIPS-SIGHAN Joint Conference on Chinese Language Processing (CLP 2010), Beijing, 28-29 Aug. 2010, pp. 54-61.
[26] LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), pp. 436-444.
[27] Yeh, J. F., Li, S. F., Wu, M. R., Chen, W. Y., & Su, M. C. (2013). Chinese word spelling correction based on n-gram ranked inverted index list. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing (SIGHAN-7), Nagoya, Japan, pp. 43-48.
[28] Zhang, L., Huang, C., Zhou, M., & Pan, H. (2000). Automatic detecting/correcting errors in Chinese text by an approximate word-matching algorithm. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pp. 248-254.
[29] Zhang, S., Xiong, J., Hou, J., Zhang, Q., & Cheng, X. (2015). HANSpeller++: A unified framework for Chinese spelling correction. In Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing (SIGHAN-8), pp. 38-45.
Thesis Usage Permissions
  • Consent is given to license the print copy, free of charge, for in-library readers to reproduce for academic purposes, to be made available on 2025-02-21.
  • Consent is given to authorize the browse/print electronic full-text service, available from 2025-02-21.

