System ID | U0002-1807202201023300 |
---|---|
DOI | 10.6846/TKU.2022.00444 |
Title (Chinese) | 基於BERT之情感分析數據標記系統 |
Title (English) | Sentiment Analysis Data Labeling System Based On BERT |
Title (third language) | |
University | Tamkang University (淡江大學) |
Department (Chinese) | 電機工程學系碩士班 |
Department (English) | Department of Electrical and Computer Engineering |
Foreign degree university | |
Foreign degree college | |
Foreign degree institute | |
Academic year | 110 |
Semester | 2 |
Publication year | 111 (ROC calendar; 2022 CE) |
Graduate student (Chinese) | 徐楷崴 |
Graduate student (English) | KAI WEI HSU |
Student ID | 609450191 |
Degree | Master's |
Language | Traditional Chinese |
Second language | |
Oral defense date | 2022-06-30 |
Number of pages | 40 |
Committee | Advisor: 衛信文 (hwwei@mail.tku.edu.tw); Committee member: 李維聰; Committee member: 朱國志 |
Keywords (Chinese) | 情感分析; 自然語言處理; 標記系統; BERT |
Keywords (English) | Sentiment Analysis; NLP; Labeling System; BERT |
Keywords (third language) | |
Subject classification | |
Abstract (Chinese) |
Artificial intelligence has grown rapidly in recent years, and natural language processing (NLP) is a core focus of AI and computer science today. Thanks to the growth of hardware computing power, deep learning networks can analyze human language data at scale, making NLP techniques and methods increasingly mature. Common NLP applications include machine translation, text generation, text summarization, speech recognition, and sentiment analysis.

Most current sentiment-analysis applications classify text only by polarity (positive/negative); two-dimensional data is used far less often. Studies have pointed out that coordinate-based (two-dimensional) sentiment datasets better match human emotional psychology: adding an arousal dimension makes emotion labels finer-grained and more precise, so a well-trained model with a labeling function would greatly benefit NLP development. Moreover, most coordinate-based sentiment datasets are in English, and very few exist in Traditional Chinese. The main goal of this thesis is therefore to design a labeling system for Chinese sentiment-analysis datasets based on the BERT model, generating coordinate-based V-A (Valence and Arousal) sentiment datasets. The proposed architecture lets NLP practitioners who want a coordinate-based sentiment dataset label it with this system first, greatly reducing the time and labor cost of manual labeling.

Two approaches are currently available for building a Chinese coordinate-based sentiment labeling system with BERT. The first uses BERT's downstream Sequence Classification task for training. Its drawback is that Sequence Classification, although common in sentiment analysis, is mostly applied to valence-only (positive/negative) classification and lacks arousal analysis, so the trained model is inaccurate and its MAE score is unsatisfactory. The second uses a BERT regression model to predict a coordinate pair (valence, arousal) for each sentence in the data. Because a regression model is used, the predicted coordinates are closer to the labeled values than those of the Sequence Classification model, and the MAE score is better.

The two-dimensional coordinate sentiment labeling architecture designed in this thesis therefore first trains with CKIP BERT-base-Chinese, a pre-trained model that handles Chinese and can be trained on Chinese datasets. For the downstream task, a regression model predicts the emotion each data item represents and generates a two-dimensional coordinate. Unlike the Sequence Classification model, whose classification task focuses on sorting data into positive or negative, the regression model predicts each sentence's valence and arousal (a coordinate pair) from the keywords appearing in the data. This is more accurate than ordinary classification, and the added arousal dimension resolves the problem that plain BERT easily confuses similar emotions during classification. For scoring, valence and arousal are evaluated separately with MAE to ensure the correctness of both numbers. Finally, the generated coordinates are stored in a list and then written, via a Pandas DataFrame, into the dataset alongside the corresponding Text, completing the data-labeling function.

In summary, the contributions of this thesis are: (1) reducing the time and labor cost of manual labeling; (2) predicting emotions that match the sentiment of the sentences; (3) completing a two-dimensional coordinate labeling function. |
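The regression approach the abstract describes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the thesis's code: the class name `VARegressionHead` and the 10% dropout are my own, and a random tensor stands in for the pooled output that CKIP's Chinese BERT (e.g. `ckiplab/bert-base-chinese` on Hugging Face) would produce, so the sketch runs without downloading the model.

```python
import torch
import torch.nn as nn

class VARegressionHead(nn.Module):
    """Maps BERT's pooled sentence vector (768-dim for BERT-base) to two
    real values: (valence, arousal)."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.dropout = nn.Dropout(0.1)
        self.regressor = nn.Linear(hidden_size, 2)  # -> (valence, arousal)

    def forward(self, pooled_output):
        return self.regressor(self.dropout(pooled_output))

head = VARegressionHead().eval()
fake_pooled = torch.randn(4, 768)      # stands in for 4 encoded sentences
with torch.no_grad():
    coords = head(fake_pooled)         # shape (4, 2): one V-A pair per sentence
```

In the full system, `pooled_output` would come from the pre-trained Chinese BERT encoder, and the head would be fine-tuned end-to-end with a regression loss on V-A labeled data.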
Abstract (English) |
Artificial intelligence has grown rapidly in recent years, and natural language processing (NLP) is a focus of artificial intelligence and computer science today. Due to the growth of hardware computing power, deep learning networks can analyze large amounts of human language data, making NLP techniques and methods more mature. Common NLP applications include machine translation, text generation, text summarization, speech recognition, and sentiment analysis. Among sentiment-analysis work in NLP, most applications classify data only by positive/negative polarity; few use two-dimensional sentiment data. However, studies have pointed out that two-dimensional sentiment datasets are more in line with the emotions of human psychology. A well-trained sentiment-labeling model would therefore greatly benefit the development of sentiment analysis in NLP. In addition, most coordinate-based sentiment datasets currently in use are in English, and very few exist in Traditional Chinese. The main purpose of this thesis is thus to design a labeling system for Chinese sentiment datasets based on the BERT model, generating V-A (Valence and Arousal) coordinate values for given sentiment data. The proposed labeling architecture allows developers of sentiment analysis who want a V-A-coordinate sentiment dataset to obtain one without the heavy time and labor cost of manually labeling a dataset.

To label Traditional Chinese sentiment data with V-A coordinates using a BERT model, two methods can be utilized. The first uses the original BERT model for the pre-training phase and Sequence Classification for the fine-tuning phase. This method has a disadvantage: Sequence Classification is mostly used for sentiment analysis of valence (positive/negative degree) only and lacks the arousal dimension. Using only one dimension makes the predictions inaccurate, and the MAE score on the test data is not ideal. The second method uses BERT with a regression model for training, predicting a coordinate pair (valence, arousal) for each sentence in the data. Because a regression model is used, the predicted coordinates are closer to the true values than those of the Sequence Classification model, and the MAE score on the test data is better.

Therefore, the Sentiment Data Labeling System based on BERT proposed in this thesis is first trained with CKIP BERT-base-Chinese, a pre-trained model that recognizes Traditional Chinese, so that the model can be trained on a Chinese dataset. Second, in the downstream task, a regression model predicts the emotion each data item represents and generates two-dimensional coordinates. Unlike the Sequence Classification model, which focuses on classifying emotions as positive or negative, the regression model follows the keywords in each sentence to predict its valence and arousal (i.e., a two-dimensional coordinate). As a result, the regression model is more accurate than a general classification model because it provides more emotional information for sentiment analysis; combining BERT with a regression model also solves the problem that a general BERT classifier is easily confused by data representing similar emotions. Third, for scoring, valence and arousal are scored separately by MAE to ensure the correctness of both numbers. Finally, the generated two-dimensional coordinates are stored in a list and then written, via a Pandas DataFrame, into the dataset together with the corresponding text, completing the data-labeling function.

To sum up, this thesis makes three major contributions: (1) it reduces the time and labor cost of manual labeling; (2) the predicted emotion matches the emotion of the given sentences; (3) it completes the two-dimensional coordinate labeling function for sentiment data. |
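The final labeling step described above (collect the predicted coordinates in a list, then store them in a Pandas DataFrame next to each Text) can be sketched as follows. `predict_va` is a hypothetical stand-in for the trained BERT regression model, returning fixed neutral coordinates so the example is self-contained.

```python
import pandas as pd

def predict_va(text: str) -> tuple[float, float]:
    # Hypothetical placeholder for the trained model's (valence, arousal)
    # prediction; the real system would run the sentence through BERT.
    return (5.0, 5.0)

def label_dataset(texts: list[str]) -> pd.DataFrame:
    coords = [predict_va(t) for t in texts]   # list of 2-D coordinates
    # Store each coordinate pair alongside its corresponding text
    return pd.DataFrame({
        "Text": texts,
        "Valence": [v for v, _ in coords],
        "Arousal": [a for _, a in coords],
    })

df = label_dataset(["今天天氣真好", "我非常難過"])
```

The resulting DataFrame can then be saved (e.g. to CSV) as the labeled V-A dataset.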
Abstract (third language) | |
Table of Contents |
Acknowledgments; Chinese Abstract; English Abstract; Contents; List of Figures; List of Tables; List of Formulas
Chapter 1 Introduction: 1.1 Preface; 1.2 Motivation and Purpose; 1.3 Thesis Organization
Chapter 2 Background and Related Work: 2.1 Natural Language Processing; 2.2 Comparison of Sentiment-Analysis Datasets for Natural Language; 2.3 BERT (2.3.1 Introduction to BERT; 2.3.2 Attention Mechanism); 2.4 Regression Model
Chapter 3 Sentiment Analysis Data Labeling System Based on BERT: 3.1 BERT Fine-Tuning; 3.2 BERT + Regression Model; 3.3 MAE Scoring; 3.4 SmoothL1Loss; 3.5 Labeling Function; 3.6 System Flowchart
Chapter 4 Training and Results: 4.1 Dataset; 4.2 Experimental Environment; 4.3 Experimental Results
Chapter 5 Contributions and Future Work: 5.1 Main Contributions; 5.2 Future Work
References
List of Figures: Fig. 2.1 Abstractive summarization model; Fig. 2.2 Question-answering system model; Fig. 2.3 seq2seq task; Fig. 2.4 BERT sentiment analysis; Fig. 2.5 Emotion coordinate plot; Fig. 2.6 Two-stage learning; Fig. 2.7 BERT input representation; Fig. 2.8 Attention mechanism; Fig. 2.9 Geometric interpretation of the regression model; Fig. 3.1 QA-task fine-tuning; Fig. 3.2 Sentence-pair classification task; Fig. 3.3 Single-sentence tagging task; Fig. 3.4 Single-sentence classification task; Fig. 3.5 System architecture; Fig. 3.6 Labeling-system flowchart; Fig. 3.7 Complete system flowchart; Fig. 4.1 CVAT dataset; Fig. 4.2 Results; Fig. 4.3 Extracted results
List of Tables: Table 4.1 Parameter settings; Table 4.2 MAE scoring results
List of Formulas: Eq. 3.1 MAE; Eq. 3.2 L1 loss; Eq. 3.3 Derivative of L1 loss w.r.t. x; Eq. 3.4 L2 loss; Eq. 3.5 Derivative of L2 loss w.r.t. x; Eq. 3.6 SmoothL1 loss; Eq. 3.7 Derivative of SmoothL1 loss w.r.t. x; Eq. 4.1 Lasso regression cost function; Eq. 4.2 Lasso regression objective function |
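The formula index above lists the MAE metric and the SmoothL1 loss with its derivative. A minimal NumPy sketch under the standard definitions (the thesis's own equations are not reproduced in this record; `beta` is the usual quadratic-to-linear transition point, 1.0 by default as in PyTorch's `SmoothL1Loss`):

```python
import numpy as np

def mae(pred, true):
    """Mean absolute error, computed over one V-A dimension at a time."""
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    return float(np.abs(pred - true).mean())

def smooth_l1(x, beta=1.0):
    """SmoothL1 loss: quadratic near zero (like L2, so it is differentiable
    at 0), linear in the tails (like L1, so outliers do not dominate)."""
    a = np.abs(np.asarray(x, float))
    return np.where(a < beta, 0.5 * a ** 2 / beta, a - 0.5 * beta)

def smooth_l1_grad(x, beta=1.0):
    """Derivative of SmoothL1 w.r.t. x: x/beta in the quadratic zone,
    sign(x) outside it."""
    x = np.asarray(x, float)
    return np.where(np.abs(x) < beta, x / beta, np.sign(x))

# Hypothetical valence predictions vs. gold labels on a 1-9 scale
valence_mae = mae([6.2, 3.0], [6.0, 3.5])  # (0.2 + 0.5) / 2 = 0.35
loss_small = float(smooth_l1(0.5))          # 0.5 * 0.5**2 = 0.125
loss_large = float(smooth_l1(2.0))          # 2.0 - 0.5 = 1.5
```

Scoring valence and arousal separately with `mae`, as the abstract describes, keeps each coordinate's error visible instead of averaging the two dimensions together.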
Full-Text Access Rights | |