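§ Browse Thesis Bibliographic Record
System ID U0002-1007201615292100
DOI 10.6846/TKU.2016.00275
Thesis title (Chinese) 網路論壇之輿情探勘
Thesis title (English) Opinion Mining on Internet Forums
University Tamkang University (淡江大學)
Department (Chinese) 資訊工程學系碩士班
Department (English) Department of Computer Science and Information Engineering
Academic year 104 (ROC calendar)
Semester 2
Publication year 105 (ROC calendar; 2016)
Author (Chinese) 許維勻
Author (English) Wei-Yun Hsu
Student ID 603410126
Degree Master's
Language Traditional Chinese
Oral defense date 2016-06-01
Pages 70
Committee Advisor - 許輝煌 (h_hsu@mail.tku.edu.tw)
Member - 曾新穆 (tsengsm@mail.ncku.edu.tw)
Member - 陳俊豪 (chchen@mail.tku.edu.tw)
Keywords (Chinese) 輿情探勘 (opinion mining)
Keywords (English) Opinion Mining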
Support Vector Machine
Natural Language Processing
Text Mining
Web Crawler
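According to the experimental results, when the negative class contains roughly half as much data as the non-negative class, classification performance on the negative class is poor. This thesis therefore applies SMOTE (Synthetic Minority Over-sampling Technique) to address the data imbalance and then performs sentiment classification with the proposed feature vector transformation. After SMOTE, Recall improves by about 8% to 9%, which in turn raises the F1-score somewhat, and overall Accuracy remains good. The experiments thus show that, although SMOTE slightly lowers classification performance on non-negative comments, it finds negative comments on Internet forums more effectively.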
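The abstract reports Recall, F1-score, and Accuracy for the negative class. As a minimal sketch of how these measures follow from a confusion matrix, the function below treats "negative comment" as the target class. The function name and the example counts are illustrative assumptions, not figures from the thesis.

```python
def negative_class_metrics(tp, fp, fn, tn):
    """Compute metrics with 'negative comment' as the target class.

    tp: negative comments classified as negative
    fp: non-negative comments classified as negative
    fn: negative comments classified as non-negative
    tn: non-negative comments classified as non-negative
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts, chosen only to illustrate the arithmetic.
metrics = negative_class_metrics(tp=60, fp=20, fn=40, tn=180)
print({name: round(value, 3) for name, value in metrics.items()})
# precision 0.75, recall 0.6, F1 about 0.667, accuracy 0.8
```

Because F1 is the harmonic mean of Precision and Recall, a gain in Recall raises F1 only as far as Precision holds up, which is consistent with the modest F1 improvement described above.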
People commonly exchange opinions on the Internet. If an enterprise can collect the opinions expressed about its products or services, it can find negative evaluations among them and then improve its offerings or learn what the market demands. In this paper, we propose a method for detecting negative comments that uses a Web crawler to collect training and prediction data from Internet forums. In the training process, the data are pre-processed and labeled as negative or non-negative. Next, the proposed method transforms the training data into 2-dimensional vectors, which are the input for training a Support Vector Machine (SVM) classifier. After training, the classifier is applied to the prediction data, which are pre-processed and transformed into 2-dimensional vectors in the same way. If a comment is classified as negative, its author, content, date, and title are reported. The key point of this paper is how to recognize negative comments effectively. The proposed vector transformation is negative-oriented in its calculation. Most research on sentiment classification focuses on balancing the positive and negative classes; compared with that work, the negative-oriented calculation can be more effective at identifying negative sentiment. According to the experimental results, classification of the negative class is poor when the data are imbalanced, so we use SMOTE (Synthetic Minority Over-sampling Technique) to address the imbalance. The experimental results show that, after SMOTE, Recall rises by 8% to 9%, and the negative class's F1-score and the overall Accuracy are also good.
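SMOTE creates synthetic minority-class samples by interpolating between an existing minority sample and one of its nearest minority-class neighbours. The following is a simplified standard-library sketch of that idea on 2-dimensional vectors, not the implementation used in the thesis. The function name `smote_sketch`, the sample points, and the choice to pick base samples at random (rather than iterating over every minority sample, as the original algorithm does) are assumptions made for illustration.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    minority: list of equal-length numeric tuples (the minority class)
    n_new:    number of synthetic samples to generate
    k:        number of nearest minority neighbours to choose from
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base sample (Euclidean distance),
        # excluding the base sample itself.
        neighbours = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical 2-dimensional vectors for the negative (minority) class.
negative = [(0.1, 0.9), (0.2, 0.8), (0.3, 0.7), (0.4, 0.9)]
# Increase the minority class by 100%: one synthetic point per original point.
new_points = smote_sketch(negative, n_new=len(negative), k=2)
augmented = negative + new_points
print(len(negative), "->", len(augmented))
```

Every synthetic point lies on a line segment between two real minority samples, so the oversampled class stays inside the region the original negative comments occupy. Reading "increase by 100%" as one synthetic point per original point is an assumption about how the thesis defines its increase ratios.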
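Chapter 1 Introduction 1
1.1 Research Background and Motivation 1
1.2 Research Objectives 2
1.3 Thesis Organization and Structure 5
Chapter 2 Literature Review 6
2.1 Web Crawlers 6
2.2 Machine Learning 7
2.3 Opinion Mining 10
Chapter 3 Negative Comment Detection on Internet Forums 21
3.1 Negative Comment Detection Workflow and Development Environment 21
3.2 Detection of Newly Posted Comments 23
3.3 Text Preprocessing 25
3.4 Feature Vector Transformation 28
3.5 Building the Sentiment Classification Model 31
3.6 Negative Comment Reporting 35
Chapter 4 Experimental Results 36
4.1 Experiment Description 36
4.2 Experimental Procedure and Results 40
4.3 Discussion 50
Chapter 5 Conclusion and Future Work 54
5.1 Conclusion 54
5.2 Open Issues and Future Work 55
References 57
Appendix 1 Ten-Fold Cross-Validation Results on Imbalanced Data 59
Appendix 2 Ten-Fold Cross-Validation Results with Increased Data 60
Appendix 3 English Paper 66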

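Figure 1 A hyperplane separating different classes in two-dimensional space (from reference [6]) 9
Figure 2 Negative comment detection workflow 23
Figure 3 New-comment detection function 23
Figure 4 The smaller γ is, the closer the hyperplane is to linear 33
Figure 5 The larger γ is, the more curved the hyperplane 33
Figure 6 The smaller the cost, the more likely errors become 34
Figure 7 The larger the cost, the smaller the error 34
Figure 8 Confusion matrix 41
Figure 9 Experiment flowchart 43
Figure 10 Relationship between the data increase ratio and Recall and Precision 48
Figure 11 RNWC and the data distribution of the different classes 51
Figure 12 RNWN and the data distribution of the different classes 52
Figure 13 Illustration of keyword extraction 53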

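Table 1 Classification results of the feature vector transformation using Score representation 17
Table 2 Average ten-fold cross-validation results of sentiment classification on imbalanced data 44
Table 3 Sentiment classification results with the negative-class data increased by 25% 45
Table 4 Sentiment classification results with the negative-class data increased by 40% 46
Table 5 Sentiment classification results with the negative-class data increased by 55% 46
Table 6 Sentiment classification results with the negative-class data increased by 70% 47
Table 7 Sentiment classification results with the negative-class data increased by 85% 47
Table 8 Sentiment classification results with the negative-class data increased by 100% 47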
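[1] 孫瑛澤, 陳建良, 劉峻杰, 劉昭麟 & 蘇豐文, “中文短句之情緒分類” (Sentiment classification of short Chinese sentences), Conference on Computational Linguistics and Speech Processing, National Chi Nan University, Nantou, pp. 184-198, 2010.
[2] Academia Sinica Chinese Word Segmentation System, http://ckipsvr.iis.sinica.edu.tw/, last accessed Apr 2016.
[3] 楊昌樺 & 陳信希, “以部落格文本進行情緒分類之研究” (A study of emotion classification using blog texts), Conference on Computational Linguistics and Speech Processing, National Chiao Tung University, Hsinchu, 2006.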
[4] P. Gupta, & K. Johari, “Implementation of Web crawler,” In 2009 2nd International Conference on Emerging Trends in Engineering and Technology (ICETET), pp. 838-843, 2009.
[5] V. N. Vapnik, “An overview of statistical learning theory,” IEEE Transactions on Neural Networks, 10(5), pp. 988-999, 1999.
[6] Support Vector Machines簡介 (Introduction to Support Vector Machines), http://www.cmlab.csie.ntu.edu.tw/~cyy/learning/tutorials/SVM3.pdf, last accessed Apr 2016.
[7] P. Haseena Rahmath, “Fuzzy based Sentiment Analysis of Online Product Reviews using Machine Learning Techniques,” International Journal of Computer Applications (0975-8887), vol. 99, no. 17, 2014.
[8] S. Zhu, Y. Liu, M. Liu, & P. Tian, “Research on Feature Extraction from Chinese Text for Opinion Mining,” In 2009 IEEE International Conference on Asian Language Processing, IALP'09, pp. 7-10, 2009.
[9] L.W. Ku & H.H. Chen, “Mining Opinions from the Web: Beyond Relevance Retrieval,” Journal of American Society for Information Science and Technology, Special Issue on Mining Web Resources for Enhancing Information Retrieval, 58(12), pp. 1838-1850, 2007. Software available at http://nlg18.csie.ntu.edu.tw:8080/opinion/index.html
[10] M. Farhadloo & E. Rolland, “Multi-class Sentiment analysis with clustering and score representation,” In 2013 IEEE 13th International Conference on Data Mining Workshops (ICDMW), pp. 904-912, 2013.
[11] F. Pedregosa, G. Varoquaux, A. Gramfort, et al., “Scikit-learn: Machine Learning in Python,” Journal of Machine Learning Research (JMLR) 12, pp. 2825-2830, 2011.
[12] mobile01.com - Alexa, 2016/01/20, http://www.alexa.com/siteinfo/mobile01.com
[13] Beautiful Soup, https://www.crummy.com/software/BeautifulSoup/, last accessed Mar 2016
[14] Jieba Chinese text segmentation, https://github.com/fxsjy/jieba, last accessed Jan 2016
[15] R. Mihalcea & P. Tarau, “TextRank: Bringing order into texts,” Association for Computational Linguistics, 2004.
[16] C.C. Chang & C.J. Lin, “LIBSVM : a library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[17] A. Ben-Hur & J. Weston, “A user’s guide to support vector machines,” Data Mining Techniques for The Life Sciences, pp. 223-239, 2010.
[18] N.V. Chawla, K.W. Bowyer, L.O. Hall & W.P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of artificial intelligence research, pp. 321-357, 2002.
[19] H. Isah, P. Trundle & D. Neagu, “Social media analysis for product safety using text mining and sentiment analysis,” In 2014 14th UK Workshop on Computational Intelligence (UKCI), pp. 1-7, 2014.
[20] A. Agarwal, B. Xie, I. Vovsha, O. Rambow & R. Passonneau, “Sentiment analysis of twitter data,” In Proceedings of the workshop on languages in social media, pp. 30-38, 2011.
[21] Z. Zhai, B. Liu, L. Zhang, H. Xu & P. Jia, “Identifying Evaluative Sentences in Online Discussions,” In Association for the Advancement of Artificial Intelligence (AAAI), 2011.
[22] D. Tang, F. Wei, B. Qin, L. Dong, T. Liu & M. Zhou, “A Joint Segmentation and Classification Framework for Sentiment Analysis,” In Empirical Methods in Natural Language Processing (EMNLP), pp. 477-487, 2014.
[23] What are N-Grams?, http://www.text-analytics101.com/2014/11/what-are-n-grams.html, last accessed Apr 2016.
[24] Introducing JSON, http://www.json.org/, last accessed Apr 2016
