Tamkang University Chueh Sheng Memorial Library (TKU Library)
(Electronic full text can be downloaded only from Tamkang University IP addresses)
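System ID: U0002-2807201218571900
Title (Chinese): 中文意見探勘系統設計
Title (English): Design of a Chinese Opinion Mining System
Institution: Tamkang University
Department (Chinese): 資訊工程學系碩士班
Department (English): Department of Computer Science and Information Engineering
Academic year: 100 (ROC calendar)
Semester: 2
Publication year: 101 (ROC calendar, i.e. 2012)
Author (Chinese name): 簡立
Author (English name): Lee Chien
Student ID: 699410576
Degree: Master's
Language: Chinese
Second language: English
Defense date: 2012-07-05
Pages: 71
Committee: Advisor - 蔣璿東
Member - 王鄭慈
Member - 葛煥昭
Member - 蔣璿東
Keywords (Chinese): 意見探勘 (opinion mining); 冷門詞 (unpopular words); 熱門詞 (popular words); 排除字 (exclusion words); 詞庫穩定 (lexicon stability)
Keywords (English): Opinion Mining; unpopular words; popular words; exclude words; lexicon stable
Subject classification: Applied Sciences > Information Engineering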
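Abstract (Chinese, translated): Because Chinese grammar differs from English and Chinese text has no spaces between words, finding opinion words with a POS tagger or parser is error-prone. This thesis therefore extracts opinion words with a lexicon and pairs that approach with our proposed exclusion-word method to improve extraction precision.
Each domain has its own habitual expressions (opinion words and exclusion words), so a general dictionary can hardly cover all the opinion words of a specific domain. We argue, however, that for a given domain, sufficient training data will capture most out-of-dictionary opinion words and exclusion words. Those that never appear in the training set are few, their number stays stable, and they tend to be unpopular, rarely used words. This thesis supports the claim with experimental data from two different but similar Mobile01 domains: telecommunications and broadband Internet.
Because this thesis extracts opinion words and exclusion words with a lexicon, every opinion word and exclusion word already in the dictionary can be extracted, whereas out-of-dictionary ones can be found only by manual tagging, which costs a great deal of time and labor. Building on the stability of newly added out-of-dictionary opinion words and exclusion words, we design a two-stage lexicon training method to reduce that cost. In the first stage, semi-automated tagging with human assistance extracts the opinion words or exclusion words from the training data. In the second stage, once the system is online, the dictionary is used directly to extract opinion words or exclusion words from articles, and people then check whether the extracted words are correct. The experimental data show that, compared with the first month of training, the second-stage training procedure saves a great deal of manual tagging and checking time while sacrificing very little precision and recall.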
Abstract (English): Because Chinese grammatical structure differs from English and Chinese words are not separated by spaces, using a POS tagger or parser to find opinion words easily leads to errors. This study therefore captures opinion words with a lexicon-based approach and applies the proposed exclusion-word method to improve the precision of opinion word extraction.
Because each field has its own terminology and idioms (opinion words and exclusion words), ordinary dictionaries can hardly cover all the opinion words of a specific field. For a specific field, however, most out-of-dictionary opinion words and exclusion words can be captured as long as the training data are sufficient. The out-of-dictionary opinion words and exclusion words that do not appear in the training set are few and stable in number, and they are usually infrequently used. This thesis uses experimental data from two different but similar Mobile01 fields, telecommunications and broadband Internet, to support this view.
Because this thesis uses a lexicon to capture opinion words and exclusion words, all opinion words and exclusion words in the dictionaries can be captured, whereas those outside the dictionaries can be found only by manual tagging, which is time-consuming and labor-intensive. Relying on the stability of new out-of-dictionary opinion words and exclusion words, this study designs a two-stage lexicon training method to address this problem. In the first stage, opinion words and exclusion words are captured from the training data by semi-automated tagging with human assistance. In the second stage, when the system is online, the dictionaries are used directly to capture opinion words and exclusion words from articles, and the captured words are then manually inspected for correctness. According to the experimental data, the second-stage training procedure saves a great deal of manual tagging time.
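The lexicon-plus-exclusion-word idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's algorithm, which this record does not reproduce. The function names, the sample words, and the rule that a lexicon hit is dropped when it lies inside an occurrence of an exclusion word are all assumptions made for illustration. The second helper counts how many previously unseen words each month contributes, which is one simple way to look at the stability claim.

```python
def find_opinion_words(text, opinion_lexicon, exclusion_words):
    """Return sorted (start, word) pairs for lexicon hits in unsegmented text.

    Assumed rule (hypothetical): a hit is discarded when it lies entirely
    inside an occurrence of an exclusion word.
    """
    # Collect the character spans covered by every exclusion word occurrence.
    blocked = []
    for ex in exclusion_words:
        start = text.find(ex)
        while start != -1:
            blocked.append((start, start + len(ex)))
            start = text.find(ex, start + 1)

    hits = []
    for word in opinion_lexicon:
        start = text.find(word)
        while start != -1:
            end = start + len(word)
            # Keep the hit only if no exclusion span fully contains it.
            if not any(b <= start and end <= e for b, e in blocked):
                hits.append((start, word))
            start = text.find(word, start + 1)
    return sorted(hits)


def new_words_per_month(monthly_words, lexicon):
    """Count words per month that are neither in the lexicon nor seen earlier."""
    known = set(lexicon)
    counts = []
    for words in monthly_words:
        new = set(words) - known
        counts.append(len(new))
        known |= new  # newly found words join the lexicon for later months
    return counts


# Hypothetical sample: "快" (fast) and "貴" (expensive) are opinion words,
# while "快遞" (courier) is an exclusion word that contains "快".
sample = "上網速度很快,可是月租費好貴,快遞也還沒到"
print(find_opinion_words(sample, {"快", "貴"}, {"快遞"}))
# → [(5, '快'), (13, '貴')]
```

In this sketch the second "快" is rejected because it falls inside "快遞". A declining sequence from `new_words_per_month` would correspond to the stable lexicon the abstract describes.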
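Table of contents: Chapter 1 Introduction
1.1 Research Motivation and Objectives
1.2 Thesis Organization
Chapter 2 Literature Review
2.1 Definition of Opinion Unit
2.2 Semi-Automated Tagging and Lexicon-Based Opinion Extraction
2.3 Comparison of Chinese Opinion Mining Systems
Chapter 3 Lexicon Stability
3.1 Exclusion Words
3.2 Experimental Data for the Telecommunications Domain
3.3 Experimental Data for the Broadband Internet Domain
3.4 Conclusion
Chapter 4 System Architecture
4.1 System Architecture: Stage One
4.2 System Architecture: Stage Two
4.3 Overall System Precision
Chapter 5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
References
Appendix: English Paper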

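List of Figures
Figure 1 CKIP POS output
Figure 2 Kimo (奇摩) POS output
Figure 3 Semi-Automated Tagging
Figure 4 Out-of-domain opinion words
Figure 5 Opinion words and numeric descriptions
Figure 6 Opinion word contained in a noun phrase
Figure 7 Opinion word contained in a noun phrase
Figure 8 Opinion word contained in a noun phrase
Figure 9 Exclusion-word algorithm
Figure 10 Example of telecommunications opinion elements
Figure 11 Subjective sentences with telecommunications elements
Figure 12 System training, step one
Figure 13 Automatic tagging system interface
Figure 14 System training, step two
Figure 15 OP+1 and OP-1 exclusion words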

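List of Tables
Table 1 Opinion elements of Kobayashi et al. (2007)
Table 2 Opinion elements
Table 3 Comparison of Chinese opinion mining studies
Table 4 Telecommunications element output
Table 5 Number of opinion words in the Mobile01 telecommunications domain
Table 6 Number of exclusion words in the Mobile01 telecommunications domain
Table 7 Number of opinion words in the Mobile01 broadband domain
Table 8 Number of exclusion words in the Mobile01 broadband domain
Table 9 Correspondences
Table 10 Monthly impact counts of out-of-dictionary opinion words, telecommunications
Table 11 Monthly impact counts of out-of-dictionary exclusion words, telecommunications
Table 12 Precision and recall for complete sentences and lexicon maintenance, telecommunications
Table 13 Monthly impact counts of out-of-dictionary opinion words, broadband
Table 14 Monthly impact counts of out-of-dictionary exclusion words, broadband
Table 15 Precision and recall for complete sentences and lexicon maintenance, broadband
Table 16 Grammar-matching output, telecommunications
Table 17 Actual system output, telecommunications
Table 18 Grammar-matching output, broadband
Table 19 Actual system output, broadband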
References: [1] 批踢踢 (Ptt). Available: http://www.ptt.cc/index.html
[2] Mobile01. Available: http://www.mobile01.com/
[3] B. Liu, M. Hu, and J. Cheng, "Opinion observer: analyzing and comparing opinions on the Web," 2005, pp. 342-351.
[4] E. Brill, "A simple rule-based part of speech tagger," 1992, pp. 112-116.
[5] T. Brants, "TnT: a statistical part-of-speech tagger," 2000, pp. 224-231.
[6] E. Brill, "Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging," Computational linguistics, vol. 21, pp. 543-565, 1995.
[7] A. Ratnaparkhi, "A maximum entropy model for part-of-speech tagging," 1996, pp. 133-142.
[8] 邱兆揚, "利用 Google 互聯網分類新聞語料之新詞自動擷取技術支援詞庫式中文斷詞系統," 臺灣師範大學應用電子科技研究所學位論文, 2005.
[9] 許中川 and 陳景揆, "探勘中文新聞文件," 資訊管理學報, vol. 7, pp. 103-122, 2001.
[10] L. Singh, P. Scheuermann, and B. Chen, "Generating association rules from semi-structured documents using an extended concept hierarchy," 1997, pp. 193-200.
[11] M. C. De Marneffe, B. MacCartney, and C. D. Manning, "Generating typed dependency parses from phrase structure parses," 2006, pp. 449-454.
[12] D. McClosky, W. Che, M. Recasens, M. Wang, R. Socher, and C. D. Manning, "Stanford’s System for Parsing the English Web," 2012.
[13] M. Hu and B. Liu, "Mining and summarizing customer reviews," presented at the Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, Seattle, WA, USA, 2004.
[14] B. Liu, M. Hu, and J. Cheng, "Opinion Observer: Analyzing and Comparing Opinions," 2005.
[15] 屈剛 and 陸汝占, "一個改進的漢語詞性標注系統," 上海交通大學學報, vol. 37, pp. 897-900, 2003.
[16] 張孝飛, 陳肇雄, 黃河燕, and 蔡智, "詞性標注中生詞處理算法研究," 中文信息學報, vol. 17, pp. 1-5, 2003.
[17] 林士能, "專利文件語意之擷取與比對," 2005.
[18] 許孟淵 and 黃純敏, "以本體論為基礎之新聞事件檢索與瀏覽," ed: 國立雲林科技大學資訊管理系碩士論文, 2006.
[19] W. T. Lin, "An Online English Learning Environment," 2003.
[20] K. C. Hung, "A Corpus-Based Cloze-Test Problem Solving System," 2003.
[21] M. Diab, K. Hacioglu, and D. Jurafsky, "Automatic tagging of Arabic text: From raw text to base phrase chunks," 2004, pp. 149-152.
[22] Y. Zhang and S. Clark, "Joint word segmentation and POS tagging using a single perceptron," 2008, pp. 888-896.
[23] J. Ma, D. Huang, H. Liu, and W. Sheng, "An English part-of-speech tagger for machine translation in business domain," 2011, pp. 183-189.
[24] S. Petrov, D. Das, and R. McDonald, "A universal part-of-speech tagset," arXiv preprint arXiv:1104.2086, 2011.
[25] 陳永德, "中文斷詞中「長詞優先」、「詞頻對比」、「前詞優先」規則之使用," 國立台灣大學心理學研究所博士論文, 1997.
[26] 陳奕璁 and 林柏慎, "應用直方圖均化於統計式未知詞萃取之研究," Computational Linguistics and Processing (ROCLING), pp. 364-378, 2010.
[27] 邱鴻達, "意見探勘在中文電影評論之應用," 國立交通大學 資訊科學與工程研究所, 2011.
[28] 侯锋, 王传廷, and 李国辉, "网络意见挖掘, 摘要与检索研究综述," 计算机科学, vol. 36, pp. 15-19, 2009.
[29] N. Kobayashi, K. Inui, and Y. Matsumoto, "Extracting aspect-evaluation and aspect-of relations in opinion mining," 2007, pp. 1065-1074.
[30] J. G. Conrad and F. Schilder, "Opinion mining in legal blogs," Artificial intelligence and law, 2007.
[31] 陳立, "中文情感語意自動分類之研究," 2010.
[32] G. Mishne, "Experiments with Mood Classification in Blog Post," Proceedings of the 1st Workshop on Stylistic Analysis, 2005.
[33] 段秀婷, 何婷婷, and 宋乐, "基于 PMI-IR 算法的 Blog 情感分类研究," 第五届全国青年计算语言学研讨会论文集, 2010.
[34] 郭伟, "网络电影评论的情感挖掘分析," 吉林大学, 2010.
[35] 赵妍妍, 秦兵, and 刘挺, "文本情感分析综述."
[36] 张清亮 and 徐健, "网络情感词自动识别方法研究," 现代图书情报技术, pp. 24-28, 2011.
[37] M. Vargas-Vera, E. Motta, J. Domingue, M. Lanzoni, A. Stutt, and F. Ciravegna, "MnM: Ontology driven semi-automatic and automatic support for semantic markup," Knowledge Engineering and Knowledge Management: Ontologies and the Semantic Web, pp. 213-221, 2002.
[38] K. Winkler and M. Spiliopoulou, "Semi-automated XML tagging of public text archives: A case study," 2001, pp. 271-285.
[39] 楊盛帆, "以整合式規則來做網路論壇上的3C產品口碑分析," 元智大學資訊管理學系, 2009.
[40] C.-H. Tsai, "Tsai's List of Chinese Words," University of Illinois at Urbana-Champaign, 1996.
[41] 林偉揚, "應用種子詞彙延伸方式於BBS電影評論之口碑分析," 元智大學資訊管理學系, 2011.
[42] S. Maosong, S. Dayang, and H. Changning, "CSeg&Tag1.0: a practical word segmenter and POS tagger for Chinese texts," 1997, pp. 119-126.
[43] "CKIP AutoTag," Academia Sinica. Available: http://ckipsvr.iis.sinica.edu.tw/
[44] 平震宇, "一個適用於行動裝置的網頁搜尋結果分群系統之研究," 元智大學資訊管理研究所碩士論文, 2007.
[45] 陳子龍, "中文意見探勘系統之句法分析," 淡江大學資訊工程學系資訊網路與通訊研究所碩士論文, 2012.
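Thesis usage rights
  • The author grants a royalty-free license for library readers to reproduce the print copy for academic purposes; available from 2017-07-30.
  • The author authorizes browsing and printing of the electronic full text; available from 2017-07-30.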

