System ID | U0002-2309202021585200 |
---|---|
DOI | 10.6846/TKU.2020.00687 |
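Thesis title (Chinese) | 利用深度學習建立中文新聞情緒分類器 |
Thesis title (English) | Using deep learning to build a Chinese news sentiment classifier |
Thesis title (third language) | |
University | Tamkang University (淡江大學) |
Department (Chinese) | 資訊工程學系碩士在職專班 |
Department (English) | Department of Computer Science and Information Engineering |
Foreign degree university | |
Foreign degree college | |
Foreign degree graduate institute | |
Academic year | 108 |
Semester | 2 |
Publication year | 109 |
Graduate student (Chinese) | 陳炳誠 |
Graduate student (English) | Ping-Cheng Chen |
Student ID | 707410113 |
Degree | Master's |
Language | Traditional Chinese |
Second language | English |
Oral defense date | 2020-07-16 |
Number of pages | 50 |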
Oral defense committee |
Advisor - 鄭建富; Co-advisor - 陳俊豪; Committee member - 黃科瑋; Committee member - 陳洳瑾; Committee member - 鄭建富 |
Keywords (Chinese) |
中文文字探勘 長短期記憶 文本情緒分析 |
Keywords (English) |
Chinese news data mining LSTM GRU classification sentiment analysis |
Keywords (third language) | |
Subject classification | |
Abstract (Chinese) |
相關研究指出,新聞文字的情緒一直以來是扮演可能影響金融市場波動的角色之一。所以在每天千以萬計的新聞訊息之中,相較於傳統手工的方式,能夠有效快速地處理文字內容並且進行輿論導向分析,對市場交易走勢的後續預測與判斷是有幫助的。使用人工分類其缺點為成本高。故近年來,許多學者提出各種不同的新聞情緒分類器。在這些研究當中,大多利用新聞全文進行關鍵字的擷取並建立分類模型。因利用新聞全文之關鍵字進行新聞情緒判定會降低準確度,故本論文提出一個以句子為基礎的中文新聞情緒分類演算法。所提方法首先將收集的新聞進行斷句。接著,使用 TextRank 與 Word2Vec 進行關鍵句子的判定。所找出的關鍵句子進一步先透過財金新聞詞典進行情緒分數的計算用以判定每篇新聞的正負面情緒標籤,再用於產生句子的關鍵字,進而形成可用之訓練資料集。最後,透過長短期記憶深度學習與遞歸神經網路模型建立分類器。實驗數據部份,透過選用台灣二家股票公司近五年資料進行評估,結果顯示所提的方法是有效的。 |
Abstract (English) |
The literature indicates that news sentiment can influence financial market volatility. If news can be analyzed quickly and effectively, the results can therefore help in predicting subsequent trading trends. Traditionally, news sentiment is annotated by humans, which is costly. Many researchers have therefore proposed approaches for building sentiment classifiers. Most of them extract keywords from the full text of a news article and use the extracted keywords to construct the classifier. However, because keywords extracted from the full text may reduce the model's accuracy, this thesis proposes a sentence-based Chinese news sentiment classification algorithm. It first divides each news article into sentences. Key sentences are then identified using TextRank and Word2Vec. The sentiment scores of the key sentences are calculated against an existing financial lexicon to determine whether each news article is positive or negative, and keywords are generated from the key sentences to form the training data set. Finally, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks are used to construct the news sentiment classification models. The proposed models were evaluated in experiments on five years of real data for two companies in Taiwan. The results indicate that the proposed approaches are effective. |
Abstract (third language) | |
Table of contents |
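Chapter 1 Introduction
1.1 Research Motivation
1.2 Research Objectives
1.3 Reader's Guide
Chapter 2 Literature Review and Background
2.1 Sentiment Classification Techniques for English News
2.2 Sentiment Classification Techniques for Chinese News
2.3 Classification Models
2.3.1 SVM
2.3.2 CNN
2.3.3 RNN
2.3.4 LSTM
2.3.5 GRU
2.4 Text Preprocessing Techniques
2.4.1 Word2Vec Model
2.4.2 TextRank Algorithm
Chapter 3 Deep Sentence-Based Chinese Sentiment Classification
3.1 System Architecture Diagram
3.2 Data Preprocessing
3.2.1 Extracting Key Sentences and Words
3.2.2 Text Sentiment Analysis
3.2.3 Converting Keywords into a Feature-Value List
3.3 Building the Classification Model
3.3.1 Training the Model
3.3.2 Testing the Model
3.4 Pseudocode
3.5 Worked Example of the Process
Chapter 4 Experimental Analysis and Results
4.1 Experimental Data Collection
4.1.1 Online News Data Collection
4.1.2 Positive and Negative Lexicon Collection
4.1.3 Data Collection for Initializing the Word to Vector Model
4.2 Experimental Environment Setup
4.3 Effect of Different Data Time Intervals on Model Accuracy
4.4 Effect of the Number of Key Sentences on Model Accuracy
4.5 Effect of the Number of Keywords on Positive/Negative News Classification and Model Accuracy
Chapter 5 Conclusions and Improvements
References
Appendix: English Paper
List of Figures
Figure 1 Organization chart of sentiment analysis techniques
Figure 2 Architecture of LeNet-5, a mainstream CNN
Figure 3 Structure of a recurrent neural network
Figure 4 Long short-term memory model
Figure 5 Long-term memory
Figure 6 Forget gate
Figure 7 Input gate
Figure 8 Updating the old state to the new state
Figure 9 Output gate
Figure 10 Gated Recurrent Unit (GRU)
Figure 11 Architectures of the CBOW and Skip-gram models
Figure 12 Flowchart of the sentence-based Chinese sentiment classifier
Figure 13 Flowchart of news key-sentence extraction
Figure 14 Flowchart for determining the positive or negative sentiment of a news article
Figure 15 Process of converting keywords into a feature-value list
Figure 16 Finding keywords with the TextRank algorithm
Figure 17 Splitting an article on specific punctuation marks to obtain the sentence set
Figure 18 Segmenting sentences into words with Jieba
Figure 19 Converting the keywords of the first article into a fixed-length feature-value array
Figure 20 Overview of model initialization
Figure 21 Training and validation accuracy of the model
Figure 22 Training and validation loss of the model
Table 1 Parameters of each neural network layer in the model
Table 2 How the model is trained
Table 3 Parameters of the training steps
Table 4 Pseudocode for key-sentence extraction
Table 5 Pseudocode for sentiment analysis
Table 6 Pseudocode for converting keywords into a feature-value list
Table 7 Similarity comparison between TextRank keywords and ordinary words from Jieba segmentation
Table 8 Importance score of each sentence in the article
Table 9 Key sentences of the article obtained by selecting the top-k sentences and the title
Table 10 Key sentences of the news article
Table 11 KSWi obtained by segmenting the key sentences with Jieba
Table 12 Example entries from the positive/negative sentiment dictionary
Table 13 Word orientation scores obtained by comparing KSWi with the positive/negative sentiment dictionary
Table 14 Summing WOSi to obtain the sentiment score SOS of each key sentence
Table 15 List of software and hardware tools in the experimental environment
Table 16 Record counts for different training and test data time intervals
Table 17 Experimental results for different training and test data time intervals
Table 18 Effect of the number of key sentences on model accuracy
Table 19 Effect of the number of keywords on positive/negative news classification
Table 20 Effect of the number of keywords on model accuracy |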
Full-text access rights |
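
As a rough illustration of the lexicon-scoring step described in the abstract, the following Python sketch splits an article into sentences, scores each sentence against a tiny sentiment lexicon, and labels the article as positive or negative. Everything here is assumed for illustration: the lexicon entries, the function names, the use of substring matching in place of Jieba word segmentation, and the tie-breaking rule are not taken from the thesis. The TextRank and Word2Vec key-sentence selection and the LSTM/GRU training are omitted.

```python
import re

# Hypothetical toy lexicon. The thesis uses a financial news lexicon
# that is not reproduced in this record.
POSITIVE = {"成長", "獲利", "上漲"}
NEGATIVE = {"虧損", "下跌", "衰退"}


def split_sentences(text):
    """Split text on common sentence-ending punctuation (full- and half-width)."""
    parts = re.split(r"[。!?;!?;]", text)
    return [p.strip() for p in parts if p.strip()]


def sentence_score(sentence):
    """Score a sentence: +1 per positive lexicon hit, -1 per negative hit.

    Assumption: plain substring counting stands in for word segmentation.
    """
    pos = sum(sentence.count(word) for word in POSITIVE)
    neg = sum(sentence.count(word) for word in NEGATIVE)
    return pos - neg


def label_article(text):
    """Label an article by the sum of its sentence scores.

    Assumption: a total of zero is treated as positive.
    """
    total = sum(sentence_score(s) for s in split_sentences(text))
    return "positive" if total >= 0 else "negative"


if __name__ == "__main__":
    article = "公司營收成長。獲利上漲!但海外市場衰退。"
    for s in split_sentences(article):
        print(s, sentence_score(s))
    print(label_article(article))
```

In a fuller pipeline, labels produced this way would serve as training targets for the LSTM or GRU classifier, with the key sentences chosen beforehand rather than scoring every sentence.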