| System ID | U0002-2106202222415800 |
|---|---|
| DOI | 10.6846/TKU.2022.00564 |
| Title (Chinese) | 應用神經網路語言模型於同音別字之訂正 |
| Title (English) | Correction of homophones using the neural network language model |
| Title (third language) | |
| University | 淡江大學 (Tamkang University) |
| Department (Chinese) | 資訊管理學系碩士班 |
| Department (English) | Department of Information Management |
| Foreign degree university | |
| Foreign degree college | |
| Foreign degree institute | |
| Academic year | 110 |
| Semester | 2 |
| Publication year | 111 |
| Author (Chinese) | 廖俊凱 |
| Author (English) | Chun-Kai Liao |
| Student ID | 609630453 |
| Degree | Master's |
| Language | Traditional Chinese |
| Second language | |
| Defense date | 2022-05-28 |
| Pages | 33 |
| Committee | Advisor - 魏世杰 (sekewei@mail.tku.edu.tw); Member - 張昭憲; Member - 壽大衛 |
| Keywords (Chinese) | BERT、混淆集、同音字、深度學習、自然語言處理 |
| Keywords (English) | BERT; Confusion Set; Homophones; Deep Learning; NLP |
| Keywords (third language) | |
| Subject classification | |
| Abstract (Chinese) |
隨著科技的普及,人們越來越少提筆寫字,而在使用科技產品輸入文字進行溝通時,別字可能是現代人最大的困擾。別字訂正是一項重要且具有挑戰性的任務,令人滿意的解決方案常需要人類水平的語言理解能力。又因為正體中文世界人們較常使用注音輸入法,使得同音別字的偵測訂正需求較大,所以本研究將專注於同音別字之偵測訂正。

近年隨著眾多預訓練語言模型的釋出,先前常需倚靠大量運算的自然語言處理領域得以降低各種需要從頭訓練的資源門檻,其中BERT預訓練語言模型表現亮眼,獲得眾多青睞。本研究將使用BERT系列預訓練模型做別字訂正。資料集部分將針對教育部4808個常用字,依據全字庫文字屬性製作出每個常用字的同音字混淆集,再結合中文維基百科做成完整的同音別字資料集。經過評估,其中Soft-Masked BERT模型在經過本同音別字資料集訓練後,其句子層級的別字偵測F1指標達到0.885,別字訂正F1指標為0.806,已逐漸達到可輔助人類訂正同音別字的效果。 |
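The confusion-set construction and typo-injection steps summarized in the abstract can be sketched as follows. This is a minimal illustration, not the thesis code: the `READINGS` table stands in for the per-character phonetic attributes taken from the 全字庫 database for the 4,808 common characters, and `inject_homophone_typo` shows one simple way to turn error-free Wikipedia sentences into (typo, correct) training pairs.

```python
import random
from collections import defaultdict

# Hypothetical reading table: character -> Zhuyin syllable. The thesis
# derives these readings from the 全字庫 character attribute data; the
# entries below are illustrative only.
READINGS = {
    "他": "ㄊㄚ", "她": "ㄊㄚ", "它": "ㄊㄚ",
    "在": "ㄗㄞˋ", "再": "ㄗㄞˋ",
}

def build_confusion_sets(readings):
    """Group characters that share a reading; a character's confusion set
    is its homophones, excluding the character itself."""
    by_reading = defaultdict(set)
    for ch, r in readings.items():
        by_reading[r].add(ch)
    return {ch: by_reading[r] - {ch} for ch, r in readings.items()}

def inject_homophone_typo(sentence, confusion, rng=None):
    """Replace one character that has homophones with a random member of
    its confusion set, yielding a (typo, correct) training pair."""
    rng = rng or random.Random(0)
    positions = [i for i, ch in enumerate(sentence) if confusion.get(ch)]
    if not positions:
        return sentence, sentence
    i = rng.choice(positions)
    wrong = rng.choice(sorted(confusion[sentence[i]]))
    return sentence[:i] + wrong + sentence[i + 1:], sentence

confusion = build_confusion_sets(READINGS)
typo, gold = inject_homophone_typo("他在這裡", confusion)
```

Because a character is excluded from its own confusion set, the injected pair always differs in exactly one position, which gives a position-level label for training the detector.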
| Abstract (English) |
As technology advances, people write by hand less and less, and typos are among the biggest problems modern users face when entering text on electronic devices. Typo correction is an important and challenging task, and a satisfactory solution often requires human-level language comprehension. Because the phonetic (Zhuyin) input method is popular among traditional Chinese users, homophone typos are especially common, so this study focuses on the detection and correction of homophone characters.

In recent years, many pre-trained language models have been released for natural language processing (NLP), greatly reducing the computing resources needed to train models from scratch. Among these models, the Bidirectional Encoder Representations from Transformers (BERT) pre-trained language models have gained much attention. In this study, we use variants of the pre-trained BERT model for typo correction. Based on the set of 4,808 commonly used characters compiled by the Ministry of Education, and the phonetic information from the Master Ideographs Seeker (全字庫) character database, a confusion set of homophones is constructed for each commonly used character. Combined with sentences from Chinese Wikipedia, a complete homophone data set for training and testing typo correction models is built.

In the experiments, at the sentence level, the Soft-Masked BERT model achieves an F1-score of 0.885 for homophone typo detection and 0.806 for homophone typo correction, showing that the tested model is approaching a level at which it can assist humans in correcting homophone typos. |
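As a note on the reported scores, sentence-level evaluation is strict: a sentence counts as correctly detected only when the predicted typo positions exactly match the gold positions, and as corrected only when the entire output sentence matches the gold sentence. A minimal sketch of this convention (our reading of common Chinese spelling-check evaluation practice, not the thesis's exact evaluation script):

```python
def _f1(p, r):
    # Harmonic mean of precision and recall; 0 when both are 0.
    return 2 * p * r / (p + r) if p + r else 0.0

def sentence_level_metrics(triples):
    """triples: (source, gold, predicted) sentences of equal length.
    Returns (detection_f1, correction_f1) at the sentence level."""
    n_pred = n_gold = det_tp = cor_tp = 0
    for src, gold, pred in triples:
        gold_pos = {i for i, (a, b) in enumerate(zip(src, gold)) if a != b}
        pred_pos = {i for i, (a, b) in enumerate(zip(src, pred)) if a != b}
        n_pred += bool(pred_pos)            # sentences the model changed
        n_gold += bool(gold_pos)            # sentences with real typos
        if pred_pos and pred_pos == gold_pos:
            det_tp += 1                     # all typo positions found, no extras
            cor_tp += pred == gold          # and every replacement is right
    det = _f1(det_tp / n_pred if n_pred else 0.0,
              det_tp / n_gold if n_gold else 0.0)
    cor = _f1(cor_tp / n_pred if n_pred else 0.0,
              cor_tp / n_gold if n_gold else 0.0)
    return det, cor

det_f1, cor_f1 = sentence_level_metrics([
    ("他再這裡", "他在這裡", "他在這裡"),  # detected and corrected
    ("天器很好", "天氣很好", "天汽很好"),  # detected, wrong replacement
    ("沒有別字", "沒有別字", "沒有別字"),  # no typo, left unchanged
])
```

In this toy example both erroneous sentences are detected (detection F1 of 1.0) but only one is fully corrected (correction F1 of 0.5), which is why correction F1 can never exceed detection F1 under this convention.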
| Abstract (third language) | |
| Table of Contents |
Table of Contents III
Chapter 1 Introduction 1
1.1 Research background 1
1.2 Research motivation 1
1.3 Research objectives 2
1.4 Thesis organization 2
Chapter 2 Literature Review 3
2.1 Natural language processing 3
2.2 Typo correction 4
2.3 The BERT language model 4
2.4 The MacBERT model 5
2.5 The Soft-Masked BERT model 7
2.6 Confusion sets 8
Chapter 3 Methodology 10
3.1 Overview 10
3.2 System workflow 10
3.3 System architecture 11
3.4 Preprocessing 11
3.5 Data sets 12
3.5.1 Confusion set 12
3.5.2 Data set 13
3.6 Evaluation methods 14
Chapter 4 Implementation and Experimental Results 16
4.1 Experimental environment 16
4.2 Implementation architecture 16
4.2.1 BERT pre-trained models 16
4.2.2 The Soft-Masked BERT model 16
4.2.3 The MacBERT model 16
4.3 Experiment parameters and training time 17
4.4 Experimental results 17
4.4.1 Performance of different models and training-set sizes on the homophone test set 17
4.4.2 Performance of the best model above on the SIGHAN test set 25
4.4.3 Effect of random data-set splitting 26
4.4.4 Performance on the homophone-word set for multi-character errors 27
4.5 System interface implementation 29
Chapter 5 Conclusion and Future Work 30
5.1 Conclusion 30
5.2 Research limitations 31
5.3 Future work 31
Handling proper nouns 31
Exempting intentional homophone puns 31
Incorporating confusion sets into pre-training 31
Adding combined single-character and multi-character confusion sets 31
Testing the ability to generalize corrections 31
References 32
List of Figures
Figure 1 Architectural differences among the BERT, OpenAI GPT, and ELMo pre-trained models (Devlin et al., 2018) 4
Figure 2 Masked language model (Devlin et al., 2018) 5
Figure 3 Examples of different masking strategies (Cui et al., 2020) 7
Figure 4 Soft-Masked BERT (Zhang et al., 2020) 8
Figure 5 System flowchart 10
Figure 6 System architecture diagram 11
Figure 7 Preprocessing workflow for the detection data set 12
Figure 8 Demonstration of the homophone correction model interface 29
List of Tables
Table 1 Quartiles of the number of characters per pronunciation in the confusion set 13
Table 2 Confusion set 13
Table 3 Data set statistics after splitting, for different values of N 14
Table 4 Confusion matrix 14
Table 5 Differences between sentence level and character level 15
Table 6 Model training parameters and time 17
Table 7 Sentence-level detection raw counts 18
Table 8 Sentence-level detection evaluation metrics 19
Table 9 Character-level detection raw counts 19
Table 10 Character-level detection evaluation metrics 20
Table 11 Sentence-level correction raw counts 21
Table 12 Sentence-level correction evaluation metrics 21
Table 13 Character-level correction raw counts 22
Table 14 Character-level correction evaluation metrics 22
Table 15 Sentence-level detection results 25
Table 16 Sentence-level correction results 25
Table 17 Character-level detection results 25
Table 18 Character-level correction results 26
Table 19 Performance differences under random data-set splitting - sentence-level detection 26
Table 20 Performance differences under random data-set splitting - sentence-level correction 26
Table 21 Homophone-word confusion set 27
Table 22 Number of complete sentences in DATAH 28
Table 23 Sentence-level detection results of the homophone-character and homophone-word models 28
Table 24 Sentence-level detection evaluation metrics of the homophone-character and homophone-word models 28
Table 25 Sentence-level correction results of the homophone-character and homophone-word models 28
Table 26 Sentence-level correction evaluation metrics of the homophone-character and homophone-word models 29
|
| References |
Chinese references
[1]. 蔣宜靜 (2020). 應用BERT語言模型於同音別字之訂正 [Unpublished master's thesis]. Department of Information Management, Tamkang University.
English references
[2]. Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In ICLR 2015.
[3]. Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP 2014.
[4]. Cui, Y., Che, W., Liu, T., Qin, B., Wang, S., & Hu, G. (2020). Revisiting pre-trained models for Chinese natural language processing. arXiv preprint arXiv:2004.13922.
[5]. Dai, A. M., & Le, Q. V. (2015). Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pp. 3079-3087.
[6]. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[7]. Faili, H. (2010). Detection and correction of real-word spelling errors in Persian language. In Proceedings of the International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE).
[8]. Huang, C. M., Wu, M. C., & Chang, C. C. (2007). Error detection and correction based on Chinese phonemic alphabet in Chinese text. In Proceedings of the Fourth Conference on Modeling Decisions for Artificial Intelligence (MDAI IV), pp. 463-476.
[9]. Kaiser, L., Gomez, A. N., & Chollet, F. (2017). Depthwise separable convolutions for neural machine translation. arXiv:1706.03059.
[10]. Kaiser, L., Gomez, A. N., Shazeer, N., Vaswani, A., Parmar, N., Jones, L., & Uszkoreit, J. (2017). One model to learn them all. arXiv:1706.05137.
[11]. Lee, C.-W. (2017). HMM-based Chinese spelling check [Master's thesis]. Taiwan University, pp. 1-47.
[12]. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. In Proceedings of NAACL-HLT 2018, pp. 2227-2237.
[13]. Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding with unsupervised learning. Technical report, OpenAI.
[14]. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929-1958.
[15]. Tseng, Y. H., Lee, L. H., Chang, L. P., & Chen, H. H. (2015, July). Introduction to SIGHAN 2015 bake-off for Chinese spelling check. In Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing, pp. 32-37.
[16]. Stafford, T., & Webb, M. (2004). Mind Hacks: Tips & Tools for Using Your Brain. O'Reilly Media, Inc.
[17]. Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104-3112.
[18]. Taylor, W. L. (1953). "Cloze procedure": A new tool for measuring readability. Journalism Quarterly, 30, 415-433.
[19]. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008.
[20]. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.
[21]. Yu, J., & Li, Z. (2014, October). Chinese spelling error detection and correction based on language model, pronunciation, and shape. In Proceedings of the Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, pp. 220-223.
[22]. Zhang, S., Huang, H., Liu, J., & Li, H. (2020). Spelling error correction with soft-masked BERT. arXiv preprint arXiv:2005.07421.
[23]. Zhang, L., Huang, C., Zhou, M., & Pan, H. (2000). Automatic detecting/correcting errors in Chinese text by an approximate word-matching algorithm. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pp. 248-254.
[24]. Zhang, S., Xiong, J., Hou, J., Zhang, Q., & Cheng, X. (2015). HANSpeller++: A unified framework for Chinese spelling correction. In Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing (SIGHAN-8), pp. 38-45. |
| Full-text access rights |