Electronic Theses and Dissertations Service

§ Browse Thesis Bibliographic Record

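The electronic full text of this thesis has been publicly available off campus since 2019-07-30.
The print copy of this thesis has been publicly available since 2019-07-30.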

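System ID	U0002-2907201913280800
DOI	10.6846/TKU.2019.00983
Title (Chinese)	中文降維正負評情感分析方法應用於PTT資料
Title (English)	Chinese dimension-reduction based sentiment analysis method applied to PTT data
Title (third language)
Institution	Tamkang University (淡江大學)
Department (Chinese)	統計學系應用統計學碩士班
Department (English)	Department of Statistics
Foreign degree: university
Foreign degree: college
Foreign degree: graduate institute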
Academic year	107 (ROC calendar; 2018-2019)
Semester	2
Publication year	108 (ROC calendar; 2019)
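Graduate student (Chinese)	楊士白
Graduate student (English)	Shih-Bai Yang
Student ID	607650024
Degree	Master's
Language	Traditional Chinese
Second language
Oral defense date	2019-07-03
Number of pages	49
Oral defense committee	Advisor - 陳景祥 (steve@stat.tku.edu.tw); Member - 李百靈; Member - 何宗武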
Keywords (Chinese)	文本分類正負文分析降維分類集成學習
Keywords (English)	Text classification; sentiment analysis; dimension reduction; ensemble learning
Keywords (third language)
Subject classification
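Abstract (Chinese, translated)	More and more online platforms now offer users a place for discussion. Articles on these platforms often concern recent events, and each tends to be either positive or negative. We hope to use multi-classifier techniques to classify these articles as positive or negative, so that the content of the negative articles can be used to identify and discuss the events users are dissatisfied with. PTT (批踢踢實業坊) is a very large discussion platform in Taiwan, so this thesis uses the Combining Dissimilarity Spaces (CoDiS) method to classify articles from the PTT Gossiping board; before the data are fed to the classifiers, a dissimilarity transformation is applied for dimension reduction. This study proposes three new methods for selecting the representation set and compares the performance of random forests and support vector machines in the multi-classifier system.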
Abstract (English)	A growing number of online platforms let users share their opinions. Articles on these platforms often concern recent events and tend to be either positive or negative. We use a multi-classifier system to classify these articles as positive or negative; the classification results can then be used to identify and discuss the events that users are dissatisfied with, based on the content of the negative articles. PTT is one of the largest discussion platforms in Taiwan. We use the multi-classifier CoDiS (Combining Dissimilarity Spaces) method to classify articles from the PTT Gossiping board. Before the data are fed to the classifiers, a dissimilarity transformation is applied to reduce dimensionality. In this thesis, we propose three new methods for selecting the representation set and compare the performance of random forests and support vector machines in multi-classifier systems.
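Illustrative note (not part of the thesis record): the dissimilarity transformation mentioned in the abstract can be sketched roughly as follows. Each document's high-dimensional term vector is replaced by its distances to a small representation set, so the feature dimension drops from the vocabulary size to the size of that set. The Euclidean distance, the random choice of representation set, the toy data, and the name `dissimilarity_transform` are all assumptions made for this sketch; the thesis's own selection methods and distance measure are not described in this record.

```python
import numpy as np

def dissimilarity_transform(X, R):
    """Map each row of X to its Euclidean distances from the rows of R.

    X: (n_docs, n_terms) term-count matrix.
    R: (n_prototypes, n_terms) representation set.
    Returns an (n_docs, n_prototypes) matrix.
    Euclidean distance is an assumption of this sketch.
    """
    diff = X[:, None, :] - R[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

# Toy data standing in for a document-term matrix (hypothetical values).
rng = np.random.default_rng(0)
X = rng.poisson(0.3, size=(20, 500)).astype(float)

# Assumed selection rule: pick 5 documents at random as the representation set.
idx = rng.choice(len(X), size=5, replace=False)
D = dissimilarity_transform(X, X[idx])

print(D.shape)  # (20, 5): 500 term features reduced to 5 distances
```

In a multi-classifier setting, one such matrix would be built per representation set and a separate classifier trained on each, with the predictions then combined.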
Abstract (third language)
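Table of contents	
Contents
  Contents
  List of Figures
  List of Tables
  Chapter 1 Introduction
    Section 1 Research Background
    Section 2 Research Motivation and Objectives
    Section 3 Thesis Structure
  Chapter 2 Literature Review
    Section 1 Chinese Word Segmentation Tools
      2.1.1 CKIP
      2.1.2 jiebaR
    Section 2 Combining Dissimilarity Spaces Multi-Classifier System (CoDiS)
      2.2.1 Dissimilarity Transformation
      2.2.2 Combining Dissimilarity Spaces: Training Phase
      2.2.3 Combining Dissimilarity Spaces: Testing Phase
    Section 3 Classifiers
      2.3.1 Support Vector Machine (SVM)
      2.3.2 Random Forest
    Section 4 N-gram Model
  Chapter 3 Research Methods
    Section 1 Method and Framework
    Section 2 Data Processing
    Section 3 Text Dissimilarity Transformation
    Section 4 Comparison of Representation Sets
    Section 5 Comparison of Multi-Classifier System Hyperparameters
      3.5.1 Comparison of Hyperparameter L
      3.5.2 Comparison of Hyperparameter Q
    Section 6 Improvement Using N-Gram Weighting
  Chapter 4 Experimental Data and Evaluation
    Section 1 Analysis Environment
    Section 2 Introduction to the Research Data
    Section 3 Evaluation Metrics
      4.3.1 Accuracy
      4.3.2 F-measure
    Section 4 Results and Evaluation
  Chapter 5 Conclusion
    Section 1 Summary
    Section 2 Future Research
  References
List of Figures
  Figure 1 Media chosen by the public to obtain news
  Figure 2 CKIP Chinese word segmentation system results
  Figure 3 jiebaR word segmentation results
  Figure 4 Dissimilarity transformation workflow
  Figure 5 CoDiS training phase
  Figure 6 Bootstrap aggregating workflow
  Figure 7 CoDiS testing phase
  Figure 8 Margin of the support vector machine
  Figure 9 Random forest architecture
  Figure 10 Flowchart of the research method
  Figure 11 Example article from the PTT Gossiping board
  Figure 12 Example of the data structure
  Figure 13 Box plots for SVM under different selection methods
  Figure 14 Box plots for Random Forest under different selection methods
  Figure 15 Comparison of SVM and Random Forest performance at L=3
  Figure 16 Comparison of SVM and Random Forest performance at L=5
  Figure 17 Comparison of SVM and Random Forest performance at L=10
  Figure 18 Comparison of SVM and Random Forest performance at L=15
  Figure 19 Comparison of SVM and Random Forest performance at L=20
  Figure 20 Comparison of SVM and Random Forest performance at L=30
  Figure 21 Comparison of SVM and Random Forest performance at Q=0.1
  Figure 22 Comparison of SVM and Random Forest performance at Q=0.2
  Figure 23 Comparison of SVM and Random Forest performance at Q=0.3
  Figure 24 dtm matrix (2-Gram)
  Figure 25 Performance of individual classifiers in the multi-classifier system: accuracy
  Figure 26 Performance of individual classifiers in the multi-classifier system: precision
  Figure 27 Performance of individual classifiers in the multi-classifier system: recall
  Figure 28 Performance of individual classifiers in the multi-classifier system: F score
List of Tables
  Table 1 Performance of SVM and Random Forest under different selection methods
  Table 2 Comparison of hyperparameter L with 600 records
  Table 3 Comparison of hyperparameter L with 800 records
  Table 4 Comparison of hyperparameter L with 1000 records
  Table 5 Comparison of hyperparameter Q with 600 records
  Table 6 Comparison of hyperparameter Q with 800 records
  Table 7 Comparison of hyperparameter Q with 1000 records
  Table 8 Number of articles by type in the research data
  Table 9 TP, FN, FP, and TN indicators
  Table 10 Metrics of the multi-classifier system under different combinations
  Table 11 Comparison of classification time before and after dimension reduction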
References	
[1] Ghiassi, M., Lee, S. (2018), A domain transferable lexicon set for Twitter sentiment analysis using a supervised machine learning approach. Expert Systems with Applications, 106(15), pp. 197-216.
[2] Zhou, S., Chen, Q., Wang, X. (2013), Active deep learning method for semi-supervised sentiment classification. Neurocomputing, 120(23), pp. 536-546.
[3] Weidi Xu, Ying Tan. (2019), Semi-supervised Target-oriented Sentiment Classification. Neurocomputing, 337(14), pp. 120-128.
[4] Venkateswarlu Naik, M., Vasumathi, D., Siva Kumar, A.P. (2018), An enhanced unsupervised learning approach for sentiment analysis using extraction of tri-co-occurrence words phrases. 712, pp. 17-26.
[5] Jochen Hartmann, Juliana Huppertz, Christina Schamp, Mark Heitmann. (2018), Comparing automated text classification methods. International Journal of Research in Marketing, 36(1), pp. 20-38.
[6] Xingtong Ge, Xiaofang Jin, Bo Miao, Chenming Liu, & Xinyi Wu. (2018), Research on the Key Technology of Chinese Text Sentiment Analysis. Proceedings of the IEEE International Conference on Software Engineering and Service Sciences, pp. 395-398.
[7] Pinheiro, R. H. W., Cavalcanti, G. D. C., & Tsang, I. R. (2017), Combining dissimilarity spaces for text categorization. Information Sciences, 406-407, pp. 87-101.
[8] E. Pekalska and R.P.W. Duin. (2005), The Dissimilarity Representation for Pattern Recognition, Foundations and Applications, World Scientific, Singapore, ISBN 981-256-530-2.
[9] Dey, A., Jenamani, M., Thakkar, J.J. (2018), Senti-N-Gram: An n-gram lexicon for sentiment analysis. Expert Systems with Applications, 103(1), pp. 92-105.
[10] Jankowska, M., Conrad, C., Harris, J., Kešelj, V. (2018), N-gram based approach for automatic prediction of essay rubric. Lecture Notes in Computer Science, 10832, pp. 298-303.
[11] Roul, R. K., Sahoo, J. K., & Arora, K. (2017), Modified TF-IDF Term Weighting Strategies for Text Categorization. 2017 14th IEEE India Council International Conference.
[12] V. Mary Amala Bai, D. Manimegalai. (2017), Analysis of feature selection measures for text categorization. Int. J. Enterprise Network Management, 8(1).
[13] Ying-Tse Sun, Chien-Liang Chen, Chun-Chieh Liu, Chao-Lin Liu, & Von-Wun Soo. (2010), Sentiment Classification of Short Chinese Sentences. Proceedings of the 22nd Conference on Computational Linguistics and Speech Processing (ROCLING 2010), pp. 184-198.
[14] Rameshbhai, C.J., Paulose, J.RF. (2019), Opinion mining on newspaper headlines using SVM and NLP. International Journal of Electrical and Computer Engineering, 9(3), pp. 2152-2163.
[15] Usaphapanus, P., Piromsopa, K. (2017), Performance analysis of computer virus detection from binary code using ensemble classifier. ACM International Conference Proceeding Series, pp. 8-12.
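Full-text access rights	On campus: print thesis available immediately on campus; consent to release of the electronic full text on campus; electronic thesis available immediately on campus. Off campus: consent to license to database vendors; electronic thesis available immediately off campus.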
