Tamkang University Chueh Sheng Memorial Library (TKU Library)
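System ID U0002-1701201404112100
Thesis title (Chinese) 應用機器學習與多辭典的中英雙語意見分析之研究
Thesis title (English) A Study of Applying Machine Learning with Multi-dictionary for Bilingual Opinion Analysis
Institution Tamkang University
Department (Chinese) 資訊管理學系碩士在職專班
Department (English) On-the-Job Graduate Program in Advanced Information Management
Academic year 102 (ROC calendar)
Semester 1
Year of publication 103 (ROC calendar; 2014)
Author (Chinese name) 謝衫蒂
Author (English name) Shan-Ti Hsieh
Student ID 700630287
Degree Master's
Language Chinese
Second language English
Oral defense date 2014-01-13
Number of pages 89
Oral defense committee Advisor: 戴敏育
Member: 蕭瑞祥
Member: 翁頌舜
Keywords (Chinese) 意見分析  意見探勘  情感分析  情緒辭典  機器學習 
Keywords (English) Opinion Analysis  Opinion Mining  Sentiment Analysis  Sentiment Dictionary  Machine Learning 
Subject classification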
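Abstract (translated from Chinese) Opinion analysis uses the theory and computational techniques of natural language processing to identify the subjective orientation expressed in opinionated texts and sentences on the Internet. Chinese reviews occasionally mix Chinese and English. However, few prior studies have proposed an integrated method for handling the coexistence of different languages.

This study proposes a method for Chinese-English bilingual opinion analysis. It designs a Chinese-English bilingual opinion dictionary, evaluates the individual opinion dictionaries together with syntactic features, classifies reviews with machine learning, and finally applies feature selection to obtain an optimized feature set.

The experimental results show that the choice and combination of opinion dictionaries affects classification performance. When the bilingual opinion analysis method is applied to a Chinese corpus, the optimized set of 21 features yields an overall machine learning accuracy of 74.98% in cross-validation and 77.10% in open testing. The thesis also examines the proportion of English data in the Chinese corpus; the results show that the higher the proportion of English data, the greater the effect of the bilingual opinion analysis method.

The main contributions of this thesis are proposing domain-specific opinion words for cosmetics and skin care, comparing the effects of different combinations of opinion dictionaries, and confirming that evaluating bilingual opinion orientation helps machine learning.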
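The dictionary-based feature step described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the four toy lexicons, the function name `bilingual_features`, and the use of plain substring matching for Chinese in place of the word segmentation and part-of-speech tagging that the thesis lists in its table of contents. It only shows the general idea of counting positive and negative dictionary hits separately for the Chinese and English parts of a mixed-language review.

```python
import re

# Hypothetical toy lexicons for illustration only; they are NOT the
# dictionaries used in the thesis.
ZH_POS = {"好用", "喜歡", "保濕"}
ZH_NEG = {"過敏", "失望", "油膩"}
EN_POS = {"good", "love", "nice"}
EN_NEG = {"bad", "oily", "sticky"}


def bilingual_features(text):
    """Return [zh_pos, zh_neg, en_pos, en_neg] hit counts for one review."""
    # English tokens: runs of ASCII letters, lower-cased before lookup.
    en_tokens = re.findall(r"[A-Za-z]+", text.lower())
    en_pos = sum(tok in EN_POS for tok in en_tokens)
    en_neg = sum(tok in EN_NEG for tok in en_tokens)
    # Chinese: count substring occurrences of each lexicon entry.
    # (Assumption: no word segmentation is performed in this sketch.)
    zh_pos = sum(text.count(word) for word in ZH_POS)
    zh_neg = sum(text.count(word) for word in ZH_NEG)
    return [zh_pos, zh_neg, en_pos, en_neg]


if __name__ == "__main__":
    review = "這瓶乳液很好用，但是有點oily，整體還是 good"
    print(bilingual_features(review))
```

A vector of this kind could then be passed to a classifier as part of a larger feature set; how the thesis actually weights or combines dictionary scores is not stated in this record.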
Abstract (English) Opinion analysis aims to determine the subjective orientation of opinionated text on the Internet using computational techniques from natural language processing. Chinese reviews occasionally mix Chinese and English. However, prior research has paid very little attention to opinion analysis of such bilingual expressions.

This thesis proposes an approach to bilingual opinion analysis of Chinese reviews that combines multiple dictionaries, machine learning, and feature selection.

We found that accuracy is strongly affected by the choice of general sentiment dictionaries. With the optimal set of 21 features, the proposed system achieved an accuracy of 74.98% in cross-validation and 77.10% in open testing. We also examined how the proportion of English data in Chinese reviews influences the system.

The contributions of this thesis are threefold: (1) extracting a new Chinese sentiment dictionary for the domain of cosmetics reviews, (2) comparing the effects of different sentiment dictionaries, and (3) showing that bilingual opinion analysis can improve the performance of machine learning on Chinese reviews.
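The list of tables in this record mentions Spearman's rank correlation coefficient in connection with feature selection. How the thesis applies it is not described here, so the following is only a hedged sketch of one common filter-style use: rank each feature by the absolute Spearman correlation between its values and the class label. The function names `average_ranks`, `spearman`, and `rank_features` are hypothetical, and the data in the usage example is made up.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no rank information
    return cov / (vx * vy) ** 0.5


def rank_features(rows, labels):
    """Return feature indices sorted by descending |rho| with the labels."""
    scores = []
    for f in range(len(rows[0])):
        column = [row[f] for row in rows]
        scores.append((abs(spearman(column, labels)), f))
    return [f for _, f in sorted(scores, key=lambda s: (-s[0], s[1]))]


if __name__ == "__main__":
    rows = [[1, 5, 0], [2, 5, 1], [3, 5, 0], [4, 5, 1]]  # made-up data
    labels = [0, 0, 1, 1]
    print(rank_features(rows, labels))
```

Keeping the top-ranked features and re-evaluating the classifier on each candidate subset is one way to arrive at a reduced feature set; the record does not say whether the 21 features reported in the abstract were chosen this way.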
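Table of contents Chapter 1. Introduction
1.1 Research background
1.2 Research motivation
1.3 Problem definition
1.4 Research objectives
1.5 Thesis structure
Chapter 2. Literature review
2.1 Opinion analysis
2.2 Opinion analysis dictionaries
2.2.1 Chinese opinion analysis dictionaries
2.2.2 English opinion analysis dictionaries
2.3 Machine learning
2.3.1 Supervised machine learning
2.3.2 Unsupervised machine learning
2.4 Feature dimensionality reduction
2.4.1 Feature selection
2.4.2 Feature extraction
2.5 Chapter summary
Chapter 3. Research method and system architecture
3.1 Research method
3.2 System architecture
3.3 Collection of the opinion review corpus
3.3.1 Corpus collection
3.3.2 Web mining
3.4 Data preprocessing
3.4.1 Structured corpus
3.4.2 Word segmentation, sentence splitting, and part-of-speech tagging
3.5 Integration of the Chinese-English bilingual opinion dictionary
3.5.1 Chinese sentiment dictionaries for opinion analysis
3.5.2 English sentiment dictionaries for opinion analysis
3.5.3 New word extraction and expansion
3.6 Feature value generation
3.7 Machine learning
3.8 Feature selection
Chapter 4. Results and discussion
4.1 Experimental data allocation and evaluation method
4.1.1 Experimental corpus
4.1.2 Opinion dictionaries used in the experiments
4.1.3 Classifier module selection
4.1.4 Evaluation method
4.2 Comparison of feature value effects
4.3 Evaluation of opinion dictionary effects
4.4 Evaluation of effects with semantic feature values
4.5 Effect of the proportion of English data
4.6 Feature selection analysis
Chapter 5. Conclusions and implications
5.1 Conclusions
5.2 Research contributions
5.3 Managerial implications
5.4 Future work
References
Appendix 1. Academia Sinica Balanced Corpus part-of-speech tag set
Appendix 2. Academia Sinica Balanced Corpus punctuation tag set
Appendix 3. PART OF SPEECH TAGS tag set
Appendix 4. ICOSMESD positive opinion word set
Appendix 5. ICOSMESD negative opinion word set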

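List of figures
=====================================
Figure 1 Research framework flow of Chinese-English bilingual opinion analysis with multiple dictionaries
Figure 2 Types of technical approaches to opinion analysis
Figure 3 Example of sememe definitions in HowNet (《知網》)
Figure 4 Organizational structure of word entries in 《哈工大同義詞詞林擴充版》 (HIT Tongyici Cilin, extended edition)
Figure 5 SENTIWORDNET word lookup example: MARVEL
Figure 6 Machine learning methods and processing flow
Figure 7 Types of feature dimensionality reduction methods
Figure 8 Research life cycle of the systems development method
Figure 9 Flowchart of the systems development research methodology
Figure 10 System architecture of the Chinese-English bilingual opinion analysis method applying machine learning and multiple dictionaries
Figure 11 URCOSME product review search interface
Figure 12 Analysis of product reviews on the URCOSME website
Figure 13 Web page extraction flowchart
Figure 14 Distribution of positive and negative reviews in the original corpus
Figure 15 Distribution of positive and negative reviews in the experimental corpus
Figure 16 Data preprocessing architecture
Figure 17 Product information web page parsed by the PARSER program
Figure 18 Structured product information after parsing by the PARSER program
Figure 19 Product review web page parsed by the PARSER program
Figure 20 Structured product review information after parsing by the PARSER program
Figure 21 Integration flow of the Chinese-English bilingual opinion analysis word sets
Figure 22 New word extraction flow
Figure 23 YSYA execution result
Figure 24 Labeling the positive and negative polarity of the corpus
Figure 25 LIBSVM method for training data
Figure 26 LIBSVM parameter optimization plot
Figure 27 Distribution of the experimental corpus
Figure 28 Illustration of the data partitions used in accuracy calculation
Figure 29 Comparison of the accuracy of individual dictionaries
Figure 30 Performance statistics of each dictionary based on dictionary opinion words
Figure 31 Performance comparison of dictionary combinations
Figure 32 Classification performance statistics of each dictionary with semantic features
Figure 33 Classification effect of adjusting the proportion of English data
Figure 34 Distribution of feature data values
Figure 35 Analysis of the effect of feature selection combinations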

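List of tables
=====================================
Table 1 Opinion mining tasks and their hierarchical classification
Table 2 Word counts of the HowNet (《知網》) Chinese sentiment word sets
Table 3 Examples from each subset of the HowNet (《知網》) Chinese sentiment word sets
Table 4 NTUSD example words
Table 5 Examples of word groups at the fourth and fifth levels
Table 6 Definition examples from 《廣義繁體知網》 (extended Traditional Chinese HowNet)
Table 7 SENTIWORDNET synset examples
Table 8 WORDNET-AFFECT A-LABELS and related synset examples
Table 9 ENGLISH OPINION LEXICON FROM HU & LIU example words
Table 10 Comparison of the characteristics of supervised and unsupervised machine learning
Table 11 Comparison of the strengths and weaknesses of feature selection methods
Table 12 CKIP Chinese word segmentation, sentence splitting, and part-of-speech tagging example
Table 13 Part-of-speech tag mapping for the CKIP word segmentation system
Table 14 Processing English words with the POS TAGGER
Table 15 Part-of-speech tag mapping for the POS TAGGER
Table 16 Word counts of the HowNet Chinese sentiment word sets
Table 17 Examples of sentiment intensity labels
Table 18 SENTIWORDNET examples
Table 19 Examples of English sentiment intensity value labels
Table 20 ICOSMESD sentiment intensity value scoring examples
Table 21 Semantic feature word examples
Table 22 Overview of feature values
Table 23 Counts of positive and negative opinion words in each dictionary
Table 24 Analysis of 10-FOLD cross-validation results
Table 25 Cross-validation and open test data for feature values
Table 26 Accuracy data for individual dictionaries
Table 27 Performance comparison of each dictionary based on opinion words
Table 28 Definitions of dictionary combinations
Table 29 Performance comparison of dictionary combinations
Table 30 Classification performance comparison with semantic features
Table 31 Classification effect comparison for adjusted proportions of English data
Table 32 SPEARMAN’S RANK CORRELATION COEFFICIENT feature selection
Table 33 Feature selection combinations
Table 34 Accuracy data for feature selection combinations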
References [Chinese-language references]

梅家駒, 竺一鳴, 高蘊琦, & 殷鴻翔. (1983). 同義詞詞林: 上海辭書出版社;上海.
陳克健, 黃淑齡, 施悅音, & 陳怡君. (2005). 多層次概念定義與複雜關係表達-繁體字知網的新增架構. Paper presented at the 漢語詞彙語義研究的現狀與發展趨勢國際學術研討會,北京大學.
黃慧庭. (2008). 中文篇章中時間關係的辨識研究. 國立交通大學, 資訊科學與工程研究所.
Dong, Z., & Dong, Q. (2001). 基於《 知網》 的中文信息结構抽取.

[English-language references]

Abbasi, A. (2007). Affect intensity analysis of dark web forums. Paper presented at the Intelligence and Security Informatics, 2007 IEEE.
Abbasi, A., Chen, H., & Salem, A. (2008). Sentiment analysis in multiple languages: Feature selection for opinion classification in Web forums. ACM Transactions on Information Systems (TOIS), 26(3), 12.
Aciar, S., Zhang, D., Simoff, S., & Debenham, J. (2007). Informed recommender: Basing recommendations on consumer product reviews. Intelligent Systems, IEEE, 22(3), 39-47.
Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. Paper presented at the Proc. 20th Int. Conf. Very Large Data Bases, VLDB.
Baccianella, S., Esuli, A., & Sebastiani, F. (2010). Sentiwordnet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. Paper presented at the Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC’10), Valletta, Malta, May. European Language Resources Association (ELRA).
Berger, A. L., Pietra, V. J. D., & Pietra, S. A. D. (1996). A maximum entropy approach to natural language processing. Computational linguistics, 22(1), 39-71.
Cha, M., Haddadi, H., Benevenuto, F., & Gummadi, P. K. (2010). Measuring User Influence in Twitter: The Million Follower Fallacy. ICWSM, 10, 10-17.
Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), 27.
Chien, L.-F. (1997). PAT-tree-based keyword extraction for Chinese information retrieval. Paper presented at the ACM SIGIR Forum.
Choi, E., & Lee, C. (2003). Feature extraction based on the Bhattacharyya distance. Pattern Recognition, 36(8), 1703-1709.
Cunningham, P. (2008). Dimension reduction. In Machine learning techniques for multimedia (pp. 91-112): Springer.
Dash, M., & Liu, H. (1997). Feature selection for classification. Intelligent data analysis, 1(3), 131-156.
Devaney, M., & Ram, A. (1997). Efficient feature selection in conceptual clustering. Paper presented at the ICML.
Dong, Z., & Dong, Q. (2006). HowNet and the Computation of Meaning: World Scientific.
Dy, J. G., & Brodley, C. E. (2004). Feature selection for unsupervised learning. The Journal of machine learning research, 5, 845-889.
Ekman, P. (1992). An argument for basic emotions. Cognition & Emotion, 6(3-4), 169-200.
Esuli, A., & Sebastiani, F. (2005). Determining the semantic orientation of terms through gloss classification. Paper presented at the Proceedings of the 14th ACM international conference on Information and knowledge management.
Fellbaum, C. (2010). WordNet. Theory and Applications of Ontology: Computer Applications, 231-243.
Fodor, I. K. (2002). A survey of dimension reduction techniques: Technical Report UCRL-ID-148494, Lawrence Livermore National Laboratory.
Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. The Journal of machine learning research, 3, 1289-1305.
Greene, S., & Resnik, P. (2009). More than words: Syntactic packaging and implicit sentiment. Paper presented at the Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of machine learning research, 3, 1157-1182.
Guyon, I., & Elisseeff, A. (2006). An introduction to feature extraction. In Feature Extraction (pp. 1-25): Springer.
Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine learning, 46(1-3), 389-422.
Han, J., & Kamber, M. (2006). Data mining: concepts and techniques: Morgan Kaufmann.
He, X., Cai, D., & Niyogi, P. (2005). Laplacian score for feature selection. Paper presented at the Advances in neural information processing systems.
Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of educational psychology, 24(6), 417.
Hu, M., & Liu, B. (2004). Mining and summarizing customer reviews. Paper presented at the Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining.
Hyvarinen, A., & Oja, E. (2000). Independent component analysis: algorithms and applications. Neural networks, 13(4), 411-430.
Jindal, N., & Liu, B. (2006a). Identifying comparative sentences in text documents. Paper presented at the Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval.
Jindal, N., & Liu, B. (2006b). Mining comparative sentences and relations. Paper presented at the AAAI.
Jindal, N., & Liu, B. (2008). Opinion spam and analysis. Paper presented at the Proceedings of the international conference on Web search and web data mining.
Kim, S.-M., & Hovy, E. (2004). Determining the sentiment of opinions. Paper presented at the Proceedings of the 20th international conference on Computational Linguistics.
Kohavi, R. (1995). Wrappers for performance enhancement and oblivious decision graphs. Citeseer.
Ku, L.-W., Liang, Y.-T., & Chen, H.-H. (2006). Opinion Extraction, Summarization and Tracking in News and Blog Corpora. Paper presented at the AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs.
Ku, L. W., & Chen, H. H. (2007). Mining opinions from the Web: Beyond relevance retrieval. Journal of the American Society for Information Science and Technology, 58(12), 1838-1850.
Lafferty, J., McCallum, A., & Pereira, F. C. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data.
Liu, B. (2010). Sentiment analysis: A multi-faceted problem. the IEEE Intelligent Systems.
Liu, B. (2011). Web Data Mining - Exploring Hyperlinks, Contents, and Usage Data (2nd ed.): Springer.
Liu, B. (2012). Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1), 1-167.
Liu, B., Hu, M., & Cheng, J. (2005). Opinion observer: analyzing and comparing opinions on the web. Paper presented at the Proceedings of the 14th international conference on World Wide Web.
Loughrey, J., & Cunningham, P. (2005a). Overfitting in wrapper-based feature subset selection: The harder you try the worse it gets. In Research and Development in Intelligent Systems XXI (pp. 33-43): Springer.
Loughrey, J., & Cunningham, P. (2005b). Using early-stopping to avoid overfitting in wrapper-based feature selection employing stochastic search.
Magnini, B., & Cavaglia, G. (2000). Integrating Subject Field Codes into WordNet. Paper presented at the LREC.
Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a Large Annotated Corpus of English: The Penn Treebank.
Martineau, J., & Finin, T. (2009). Delta TFIDF: An Improved Feature Space for Sentiment Analysis. Paper presented at the ICWSM.
Mihalcea, R., & Liu, H. (2006). A corpus-based approach to finding happiness. Paper presented at the Proceedings of the AAAI Spring Symposium on Computational Approaches to Weblogs.
Mihalcea, R., & Strapparava, C. (2005). Making computers laugh: Investigations in automatic humor recognition. Paper presented at the Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing.
Mladenić, D. (1998). Feature subset selection in text-learning: Springer.
Money, R. B. (2004). Word-of-mouth promotion and switching behavior in Japanese and American business-to-business service clients. Journal of Business Research, 57(3), 297-305.
Nunamaker Jr, J. F., Chen, M., & Purdin, T. D. M. (1990). Systems development in information systems research. Paper presented at the System Sciences, 1990., Proceedings of the Twenty-Third Annual Hawaii International Conference on.
Pang, B., & Lee, L. (2004). A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. Paper presented at the Proceedings of the 42nd annual meeting on Association for Computational Linguistics.
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up?: sentiment classification using machine learning techniques. Paper presented at the Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10.
Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(8), 1226-1238.
Quinlan, J. R. (1993). C4.5: programs for machine learning (Vol. 1): Morgan Kaufmann.
Saeys, Y., Inza, I., & Larranaga, P. (2007). A review of feature selection techniques in bioinformatics. bioinformatics, 23(19), 2507-2517.
Schapire, R. E., & Singer, Y. (2000). BoosTexter: A boosting-based system for text categorization. Machine learning, 39(2-3), 135-168.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM computing surveys (CSUR), 34(1), 1-47.
Shelke, N. M., Deshpande, S., & Thakre, V. (2012). Survey of Techniques for Opinion Mining. International Journal of Computer Applications, 57(13).
Sokolova, M., Japkowicz, N., & Szpakowicz, S. (2006). Beyond accuracy, F-score and ROC: a family of discriminant measures for performance evaluation. In AI 2006: Advances in Artificial Intelligence (pp. 1015-1021): Springer.
Sood, S., Owsley, S., Hammond, K., & Birnbaum, L. (2007). Reasoning through search: A novel approach to sentiment classification. WWW, Banff.
Spertus, E. (1997). Smokey: Automatic recognition of hostile messages. Paper presented at the AAAI/IAAI.
Stone, P. J., Dunphy, D. C., & Smith, M. S. (1966). The General Inquirer: A Computer Approach to Content Analysis.
Strapparava, C., & Valitutti, A. (2004a). WordNet-Affect: an affective extension of WordNet. Paper presented at the Proceedings of LREC.
Strapparava, C., & Valitutti, A. (2004b). WordNet Affect: an Affective Extension of WordNet. Paper presented at the LREC.
Su, Q., Xu, X., Guo, H., Guo, Z., Wu, X., Zhang, X., . . . Su, Z. (2008). Hidden sentiment association in chinese web opinion mining. Paper presented at the Proceedings of the 17th international conference on World Wide Web.
Thomas, M., Pang, B., & Lee, L. (2006). Get out the vote: Determining support or opposition from Congressional floor-debate transcripts. Paper presented at the Proceedings of the 2006 conference on empirical methods in natural language processing.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267-288.
Torkkola, K. (2003). Feature extraction by non parametric mutual information maximization. The Journal of machine learning research, 3, 1415-1438.
Toutanova, K., Klein, D., Manning, C. D., & Singer, Y. (2003). Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network.
Trivedi, S. K., & Dey, S. (2013). Effect of Various Kernels and Feature Selection Methods on SVM Performance for Detecting Email Spams. International Journal of Computer Applications, 66(21).
Turban, E., Sharda, R., Delen, D., & Efraim, T. (2007). Decision support and business intelligence systems: Pearson Education India.
Turney, P. D. (2002). Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews. Paper presented at the Proceedings of the 40th annual meeting on association for computational linguistics.
Wiebe, J., Wilson, T., Bruce, R., Bell, M., & Martin, M. (2004). Learning subjective language. Computational linguistics, 30(3), 277-308.
Wu, S., & Flach, P. A. (2002). Feature selection with labelled and unlabelled data. Paper presented at the ECML/PKDD’02 workshop on Integration and Collaboration Aspects of Data Mining, Decision Support and Meta-Learning.
Xu, K., Liao, S. S., Li, J., & Song, Y. (2011). Mining comparative opinions from customer reviews for Competitive Intelligence. Decision support systems, 50(4), 743-754.
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. Paper presented at the ICML.
Yu, H., & Hatzivassiloglou, V. (2003). Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences. Paper presented at the Proceedings of the 2003 conference on Empirical methods in natural language processing.
Zhu, Z., Ong, Y.-S., & Dash, M. (2007). Wrapper–filter feature selection algorithm using a memetic framework. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 37(1), 70-76.
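Thesis usage permissions
  • The print copy is licensed free of charge for reproduction by library patrons for academic purposes; public as of 2014-01-23.
  • Browsing and printing of the electronic full text is authorized; public from 2014-01-23.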

