System ID | U0002-1602202012292000 |
---|---|
DOI | 10.6846/TKU.2020.00445 |
Title (Chinese) | 應用BERT語言模型於同音別字之訂正 |
Title (English) | Homophone correction using the BERT language model |
Title (third language) | |
University | 淡江大學 (Tamkang University) |
Department (Chinese) | 資訊管理學系碩士班 (Master's Program, Department of Information Management) |
Department (English) | Department of Information Management |
Foreign degree school name | |
Foreign degree college name | |
Foreign degree institute name | |
Academic year | 108 (2019-20) |
Semester | 1 |
Year of publication | 109 (2020) |
Author (Chinese) | 蔣宜靜 |
Author (English) | Yi-Jing Chiang |
Student ID | 607630109 |
Degree | Master's |
Language | Traditional Chinese |
Second language | |
Defense date | 2020-01-06 |
Number of pages | 38 |
Thesis committee | Advisor: 魏世杰; Committee members: 魏世杰, 戴敏育, 陸承志 |
Keywords (Chinese) | BERT、同音別字、注意力機制、深度學習、自然語言處理 |
Keywords (English) | BERT; typos; attention mechanism; deep learning; NLP |
Keywords (third language) | |
Subject classification | |
Abstract (Chinese) |
Writing is the tool we use to record language, and every character carries meaning of its own; a wrongly used character can cause the intended message to be misunderstood, or simply make a text harder to read. With the spread of technology, most people now communicate by typing text messages. Although typing rules out malformed characters, wrong-character (homophone) errors arise constantly. The release of pre-trained models has allowed the field of natural language processing, which used to depend on massive computation, to flourish, greatly lowering the resource barrier of training every application from scratch. This study fine-tunes the pre-trained BERT architecture to build a typo detection system, which reaches a detection accuracy of 0.878. Building on the detection system, a typo correction system is then constructed on the pre-trained BERT model, reaching a correction accuracy of 0.747 on sentences containing typos. The result is a system that effectively identifies and corrects Chinese typos. |
Abstract (English) |
Text is a tool for recording language, and every word carries its own meaning. With the popularity of technology, most people now communicate by typing text messages. However, typos may distort the intended meaning or make a text harder to read. With the advent of pre-trained models, the field of natural language processing has seen significant progress, as each application is spared the cost of time-consuming training from scratch. This work constructs an effective typo detection system by fine-tuning the BERT model; the accuracy of typo detection reaches 0.878. On top of the detection system, a typo correction system is built on the pre-trained BERT model; the accuracy of correcting sentences containing typos reaches 0.747. Together they form a system that effectively identifies and corrects Chinese typos. |
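As a rough illustration of the two-stage approach the abstract describes, the sketch below wires a detection stage and a confusion-set-constrained correction stage together. This is not the thesis code: both stages are fine-tuned BERT models in the thesis, whereas here a toy bigram lexicon (`KNOWN_BIGRAMS`) and a hypothetical homophone table (`CONFUSION`) stand in for the models so the control flow is runnable anywhere.

```python
# Hypothetical homophone confusion sets (characters sharing a sound, e.g. zai4).
CONFUSION = {"在": ["再"], "再": ["在"]}

# Hypothetical lexicon; a stand-in for BERT's masked-LM scoring.
KNOWN_BIGRAMS = {"我在", "在家", "再見"}

def score(sentence):
    """Count known bigrams (toy replacement for a language-model probability)."""
    return sum(sentence[i:i + 2] in KNOWN_BIGRAMS
               for i in range(len(sentence) - 1))

def detect(sentence):
    """Stage 1: flag positions where a homophone swap scores higher."""
    flagged = []
    for i, ch in enumerate(sentence):
        for alt in CONFUSION.get(ch, []):
            cand = sentence[:i] + alt + sentence[i + 1:]
            if score(cand) > score(sentence):
                flagged.append(i)
                break
    return flagged

def correct(sentence):
    """Stage 2: replace each flagged position with its best-scoring candidate."""
    for i in detect(sentence):
        cands = [sentence[i]] + CONFUSION[sentence[i]]
        best = max(cands, key=lambda c: score(sentence[:i] + c + sentence[i + 1:]))
        sentence = sentence[:i] + best + sentence[i + 1:]
    return sentence

print(correct("我再家"))  # → 我在家
```

Restricting the correction candidates to the flagged character's confusion set, rather than the whole vocabulary, mirrors the thesis's use of a homophone confusion set to keep corrections sound-preserving.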
Abstract (third language) | |
Table of contents |
Table of Contents III
Chapter 1 Introduction 1
 1.1 Research Background 1
 1.2 Research Motivation 1
 1.3 Research Objectives 2
 1.4 Thesis Organization 2
Chapter 2 Literature Review 3
 2.1 Typo Correction 3
 2.2 Deep Learning 5
 2.3 Natural Language Processing 5
 2.4 The BERT Language Model 6
Chapter 3 Method 10
 3.1 Overview 10
 3.2 System Architecture and Workflow 10
 3.3 Corpus Sources 10
 3.4 System Architecture 12
 3.5 Preprocessing 13
 3.6 Fine-Tuning 17
 3.7 Core Modules 19
 3.8 Evaluation Method 20
 3.9 Three Evaluation Metrics 21
Chapter 4 Implementation and Experimental Results 24
 4.1 Experimental Environment 24
 4.2 Implementation Architecture 24
 4.3 The Pre-Trained BERT Model 24
 4.4 Datasets 24
  4.4.1 Confusion Set 24
  4.4.2 Training Set 25
  4.4.3 Test Set 26
 4.5 Experimental Parameters 26
 4.6 Experimental Results 27
  4.6.1 Detection-Stage Results 27
  4.6.2 Correction-Stage Results 30
  4.6.3 Combined-Score Results 30
  4.6.4 Discussion 31
Chapter 5 Conclusion and Future Work 34
 5.1 Conclusion 34
 5.2 Limitations 34
 5.3 Future Work 35
  5.3.1 Relaxing the Size Limit of the Homophone Confusion Set 35
  5.3.2 Adding Confusion Sets of Similar-Sounding Characters 35
  5.3.3 Adding a Character-Shape Confusion Set 35
References 36
List of Figures
 Figure 1 Architectural differences among the BERT, OpenAI GPT, and ELMo pre-trained models (Devlin et al., 2018) 6
 Figure 2 System flowchart 10
 Figure 3 System architecture 12
 Figure 4 Detection dataset preprocessing flow 13
 Figure 5 Detection preprocessing step 1: extract fields 14
 Figure 6 Detection preprocessing step 2: Simplified-to-Traditional conversion 14
 Figure 7 Detection preprocessing step 3: sentence segmentation 14
 Figure 8 Detection preprocessing step 4: select sentences (using 在 as an example) 15
 Figure 9 Detection preprocessing step 5: substitute typos (在, 再) 15
 Figure 10 Detection preprocessing step 6: shuffle 15
 Figure 11 Detection preprocessing step 7: convert to NER input format 15
 Figure 12 Correction dataset preprocessing flow 16
 Figure 13 Correction preprocessing step 1: locate the typo position 16
 Figure 14 Correction preprocessing step 2: mask the typo position 16
 Figure 15 Fine-tuning scenarios (Devlin et al., 2018) 17
 Figure 16 BERT input representation (Devlin et al., 2019) 18
 Figure 17 Detection task setup 19
 Figure 18 Correction task setup 19
 Figure 19 Example sentence 23
 Figure 20 Example sentence result statistics 23
 Figure 21 Detection model training-set example 25
List of Tables
 Table 1 Masking scheme of the MLM pre-training task 7
 Table 2 Quartiles of character counts per pronunciation in the confusion set 11
 Table 3 Confusion matrix 20
 Table 4 Sentence-level confusion matrix 21
 Table 5 Combined scoring cases 22
 Table 6 Confusion character set 25
 Table 7 Correction model test-set input 26
 Table 8 BERT fine-tuning experimental parameters 26
 Table 9 Sentence-level detection result cases 27
 Table 10 Example sentences for each case 27
 Table 11 Sentence-level detection result computation 28
 Table 12 Character-level detection result cases 29
 Table 13 Character-level detection result computation 29
 Table 14 Test-set examples 30
 Table 15 Combined-score examples 31
 Table 16 Sentence-level confusion matrix 31
 Table 17 Character-level confusion matrix 32
 Table 18 Correction results 33 |
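The detection-preprocessing steps listed in the figures above (select sentences containing a target character, substitute its homophone into a fraction of them, shuffle, and convert to an NER-style tag sequence) can be sketched as follows. This is a minimal illustration, not the thesis code: the function name, the `O`/`T` tag scheme, and the single 在/再 pair from the figures' example are assumptions.

```python
import random

# Hypothetical homophone pair, following the 在/再 example in the figures.
TARGET, HOMOPHONE = "在", "再"

def make_detection_examples(sentences, error_rate=0.5, seed=0):
    """Steps 4-7: pick sentences containing the target character, substitute
    the homophone into a fraction of them, shuffle, and emit (characters,
    tags) pairs where T marks an injected typo and O a correct character."""
    rng = random.Random(seed)
    examples = []
    for sent in sentences:
        if TARGET not in sent:            # step 4: select sentences
            continue
        chars = list(sent)
        tags = ["O"] * len(chars)
        if rng.random() < error_rate:     # step 5: substitute the typo
            i = chars.index(TARGET)
            chars[i] = HOMOPHONE
            tags[i] = "T"
        examples.append((chars, tags))
    rng.shuffle(examples)                 # step 6: shuffle
    return examples                       # step 7: NER-style (token, tag) input
```

Keeping an `error_rate` below 1.0 leaves some sentences unaltered, so the detector also sees correct uses of the target character during fine-tuning.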
References |
[1] Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR 2015.
[2] Chang, C.-H. (1995). A new approach for automatic Chinese spelling correction. In Proceedings of the Natural Language Processing Pacific Rim Symposium '95, Seoul, Korea, pages 278-283.
[3] Chang, T.-H., Chen, H.-C., Tseng, Y.-H., & Zheng, J.-L. (2013). Automatic detection and correction for Chinese misspelled words using phonological and orthographic similarities. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing (SIGHAN-7), Nagoya, Japan, pages 97-101.
[4] Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P., & Robinson, T. (2013). One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.
[5] Chiu, H.-W., Wu, J.-C., & Chang, J. S. (2013). Chinese spelling checker based on statistical machine translation. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing (SIGHAN-7), Nagoya, Japan, pages 49-53.
[6] Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP 2014.
[7] Dai, A. M., & Le, Q. V. (2015). Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3079-3087.
[8] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[9] Huang, C.-M., Wu, M.-C., & Chang, C.-C. (2007). Error detection and correction based on Chinese phonemic alphabet in Chinese text. In Proceedings of the Fourth Conference on Modeling Decisions for Artificial Intelligence (MDAI IV), pages 463-476.
[10] Kaiser, L., Gomez, A. N., & Chollet, F. (2017). Depthwise separable convolutions for neural machine translation. arXiv preprint arXiv:1706.03059.
[11] Kaiser, L., Gomez, A. N., Shazeer, N., Vaswani, A., Parmar, N., Jones, L., & Uszkoreit, J. (2017). One model to learn them all. arXiv preprint arXiv:1706.05137.
[12] Lee, C.-W. (2017). HMM-based Chinese spelling check. Master's thesis, Taiwan University, pages 1-47.
[13] Liu, X., Cheng, F., Luo, Y., Duh, K., & Matsumoto, Y. (2013). A hybrid Chinese spelling correction using language model and statistical machine translation with reranking. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing (SIGHAN-7), Nagoya, Japan, pages 54-58.
[14] Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP 2015.
[15] Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. In Proceedings of NAACL-HLT 2018, pages 2227-2237.
[16] Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding with unsupervised learning. Technical report, OpenAI.
[17] Ren, F., Shi, H., & Zhou, Q. (2001). A hybrid approach to automatic Chinese text checking and error correction. In Proceedings of the 2001 IEEE International Conference on Systems, Man, and Cybernetics, pages 1693-1698.
[18] Samanta, P., & Chaudhuri, B. B. (2013). A simple real-word error detection and correction using local word bigram and trigram. In Proceedings of ROCLING.
[19] Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112.
[20] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), pages 1929-1958.
[21] Taylor, W. L. (1953). Cloze procedure: A new tool for measuring readability. Journalism Quarterly, 30, pages 415-433.
[22] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008.
[23] Wang, Y.-R., & Liao, Y.-F. (2014). NCTU and NTUT's entry to CLP-2014 Chinese spelling check evaluation. In Proceedings of the Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, pages 216-219.
[24] Wang, Y.-R., & Liao, Y.-F. (2015). Word vector/conditional random field-based Chinese spelling error detection for SIGHAN-2015 evaluation. In Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing (SIGHAN-8), pages 46-49.
[25] Wu, S.-H., Chen, Y.-Z., Yang, P.-C., Ku, T., & Liu, C.-L. (2010). Reducing the false alarm rate of Chinese character error detection and correction. In Proceedings of the CIPS-SIGHAN Joint Conference on Chinese Language Processing (CLP 2010), Beijing, 28-29 Aug. 2010, pages 54-61.
[26] LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), pages 436-444.
[27] Yeh, J.-F., Li, S.-F., Wu, M.-R., Chen, W.-Y., & Su, M.-C. (2013). Chinese word spelling correction based on n-gram ranked inverted index list. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing (SIGHAN-7), Nagoya, Japan, pages 43-48.
[28] Zhang, L., Huang, C., Zhou, M., & Pan, H. (2000). Automatic detecting/correcting errors in Chinese text by an approximate word-matching algorithm. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 248-254.
[29] Zhang, S., Xiong, J., Hou, J., Zhang, Q., & Cheng, X. (2015). HANSpeller++: A unified framework for Chinese spelling correction. In Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing (SIGHAN-8), pages 38-45. |
Full-text access rights | |