System ID | U0002-1708200900225300 |
---|---|
DOI | 10.6846/TKU.2009.00601 |
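Title (Chinese) | 利用關聯式法則改善文件分類準確度-結合其他分類器 |
Title (English) | Using Association Classification Rules to Improve the Accuracy of Text Categorization with Different Classifiers |
Title (third language) | |
University | Tamkang University (淡江大學) |
Department (Chinese) | 資訊工程學系碩士在職專班 |
Department (English) | Department of Computer Science and Information Engineering |
Foreign degree: university | |
Foreign degree: college | |
Foreign degree: graduate institute | |
Academic year (ROC calendar) | 97 |
Semester | 2 |
Year of publication (ROC calendar) | 98 |
Author (Chinese) | 陳育民 |
Author (English) | Yu-Min Chen |
Student ID | 796410065 |
Degree | Master's |
Language | Traditional Chinese |
Second language | |
Oral defense date | 2008-06-17 |
Number of pages | 68 |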
Oral defense committee | Advisor - 黃連進 (micro@mail.tku.edu.tw); Member - 蔣定安 (chiang@cs.tku.edu.tw); Member - 王鄭慈 (ctwang@mail.fgu.edu.tw); Member - 黃連進 (micro@mail.tku.edu.tw) |
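Keywords (Chinese) | 關聯式 (associative); 分類器 (classifier); 中文 (Chinese); 文字 (text) |
Keywords (English) | Association Classification Chinese Text |
Third-language keywords | |
Subject classification | |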
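Abstract (translated from Chinese) | When Associative Classification (AC) is used for classification, data that AC cannot classify is usually assigned directly to a predefined default class so that no data is left unclassified. When building an AC classifier, however, the most common difficulty is choosing threshold values once the rules have been generated: a threshold set too high removes many potentially useful rules and leaves many test cases unclassifiable, while one set too low readily produces misclassifications. Both situations reduce classification accuracy. To address these problems and improve accuracy, we propose using two different classifiers together, each doing different work at a different stage according to its characteristics. This thesis first uses a KNN or Naive Bayes classifier to classify the documents preliminarily, then uses the results of that classification to set thresholds and select the associative classification rules (ACRs) that satisfy them. Because the accuracy of these ACRs is higher than that of the preliminary classification, the selected ACRs can be used to further improve the classification results. Documents that the ACRs cannot classify are classified by the KNN or Naive Bayes classifier using term weights, which reduces both the time spent generating rules and the number of rules, and so speeds up classification. Combining the strengths of different classifiers can therefore effectively improve document classification performance. Experiments show that combining two different classifiers as proposed in this thesis does achieve better classification performance than a single classifier. |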
English abstract |
In recent years, many wireless broadband networking technologies have been proposed and discussed. IEEE 802.16j is one of the most notable, with its MMR network structure. Through the relay station mechanism, high-cost base stations can be replaced to provide broader network coverage and greater bandwidth. The choice of location for these relay stations has therefore become a topic worth discussing. Starting from the present IEEE 802.16 network, this dissertation works out a relay station placement mechanism: within the acceptable range of the base station, it finds the best and most efficient sub-area for a relay station by considering differences in data flow and allocated bandwidth. The findings are verified through a series of experiments, leading to the most suitable rule for the placement of relay stations. |
Third-language abstract | |
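Table of contents |
Front matter: Table of Contents; List of Figures; List of Tables
Chapter 1 Introduction: 1.1 Research motivation and objectives; 1.2 Thesis organization
Chapter 2 Related literature and research: 2.1 Associative classification (2.1.1 Rule Generation; 2.1.2 Ranking; 2.1.3 Pruning; 2.1.4 Associative classifiers); 2.2 Document classification (2.2.1 Document classification workflow; 2.2.2 Feature extraction); 2.3 Other classifier algorithms (2.3.1 KNN nearest-neighbor rule [19]; 2.3.2 Naïve Bayesian Classifier [21]); 2.4 Evaluation measures
Chapter 3 Research method and procedure: 3.1 Research method (3.1.1 Lazy AC classifier and multiple classifiers); 3.2 Classification system workflow; 3.3 KNN classifier (3.3.1 Building the vector space; 3.3.2 KNN classification implementation); 3.4 Bayes classifier; 3.5 Initial classification; 3.6 Threshold setting; 3.7 Execution strategy
Chapter 4 Analysis of experimental results: 4.1 Data source; 4.2 Data preprocessing results; 4.3 Lazy classification results; 4.4 KNN and Bayes classification results; 4.5 Threshold setting; 4.6 Precision-based thresholds; 4.7 F1-based thresholds
Chapter 5 Conclusions and future work: 5.1 Conclusions; 5.2 Future work
References
Appendix: English paper
List of figures: Figure 2.1.2-1 Rule ranking method presented in Thabtah (2006); Figure 2.1.4-5 L3 algorithm; Figure 2.1.4-6 Schematic of the associative classifier's classification flow; Figure 2.2-1 System workflow for document classification; Figure 2.3.1-1 Schematic of the KNN classification method; Figure 2.4-1 Distribution of document counts; Figure 3.2-1 Classification flowchart
List of tables: Table 2.1-1 Comparison of associative methods; Table 2.1.3-1 Pruning algorithm; Table 4.1-1 Number of articles selected from each department; Table 4.1-2 Format of the document descriptions; Table 4.2-1 Documents after word segmentation; Table 4.2-2 Statistics and weights of the feature terms obtained with TFIDF; Table 4.3-1 Results of the Lazy method; Table 4.4-1 KNN classification results; Table 4.4-2 Bayes classifier precision and F1; Table 4.5-1 Single threshold; Table 4.5-2 Multiple thresholds; Table 4.6-1 AC classification results with KNN and Bayes precision as thresholds; Table 4.7-1 AC classification results with KNN and Bayes F1 as thresholds; Table 4.8-1 Comparison of the best experimental results |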
References |
[1] F. Thabtah, “A review of associative classification mining,” Knowl. Eng. Rev., vol. 22, 2007, pp. 37-65.
[2] P.G. Elena Baralis, “A Lazy Approach to Pruning Classification Rules,” Dec. 2002.
[3] K. Wang, S. Zhou, and Y. He, “Growing decision trees on support-less association rules,” Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, Boston, Massachusetts, United States: ACM, 2000, pp. 265-269.
[4] K. Wang, Y. He, and D.W. Cheung, “Mining confident rules without support requirement,” Proceedings of the tenth international conference on Information and knowledge management, Atlanta, Georgia, USA: ACM, 2001, pp. 89-96.
[5] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds., Advances in knowledge discovery and data mining, American Association for Artificial Intelligence, 1996.
[6] J.R. Quinlan and R.M. Cameron-Jones, “FOIL: A Midterm Report,” Proceedings of the European Conference on Machine Learning, vol. 667, 1993, pp. 3-20.
[7] B. Liu, W. Hsu, and Y. Ma, “Integrating Classification and Association Rule Mining,” Knowledge Discovery and Data Mining, 1998, pp. 80-86.
[8] E. Baralis, S. Chiusano, and P. Garza, “On support thresholds in associative classification,” Proceedings of the 2004 ACM symposium on Applied computing, Nicosia, Cyprus: ACM, 2004, pp. 553-558.
[9] W. Li, J. Han, and J. Pei, “CMAR: accurate and efficient classification based on multiple class-association rules,” Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on, 2001, pp. 369-376.
[10] X. Jiawei, “CPAR: Classification based on Predictive Association Rules.”
[11] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules,” Proc. 20th Int. Conf. Very Large Data Bases, VLDB, J.B. Bocca, M. Jarke, and C. Zaniolo, eds., Morgan Kaufmann, 1994, pp. 487-499.
[12] R. Schapire, Y. Freund, P. Bartlett, and W. Lee, “Boosting the margin: a new explanation for the effectiveness of voting methods,” The Annals of Statistics, vol. 26, 1998, pp. 1651-1686.
[13] Xiao-Yun Chen, Yi Chen, Lei Wang, and Yun-Fa Hu, “Text categorization based on frequent patterns with term frequency,” Machine Learning and Cybernetics, 2004. Proceedings of 2004 International Conference on, 2004, pp. 1610-1615 vol.3.
[14] Yongwook Yoon and G. Lee, “Text Categorization Based on Boosting Association Rules,” Semantic Computing, 2008 IEEE International Conference on, 2008, pp. 136-143.
[15] Y. Yang, “A study on thresholding strategies for text categorization,” Proceedings of SIGIR-01, 24th ACM International Conference on Research and Development in Information Retrieval, 2001, pp. 137-145.
[16] K. Aas and L. Eikvil, “Text categorization: A survey,” Technical report, Norwegian Computing Center, 1999.
[17] G. Salton and C. Buckley, Term Weighting Approaches in Automatic Text Retrieval, Cornell University, 1987.
[18] C.M. Lee, “Classifying Chinese Text Documents by Association rule.”
[19] P. Soucy and G. Mineau, “A simple KNN algorithm for text categorization,” Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on, 2001, pp. 647-648.
[20] W. Cheng, “The Research on Signal Classifier in Text Classification of Multi-Class.”
[21] T.M. Mitchell, Machine Learning, McGraw-Hill Science/Engineering/Math, 1997.
[22] P. Bickel and E. Levina, “Some theory for Fisher's linear discriminant function, `naive Bayes', and some alternatives when there are many more variables than observations,” Bernoulli, vol. 10, 2004, pp. 989-1010.
[23] Tseng, Yuen-Hsien, “Effectiveness Issues in Automatic Text Categorization,” Bulletin of the Library Association of China, vol. 68, Jun. 2002, pp. 62-83.
[24] 國家圖書館 (National Central Library), “全國博碩士論文資訊網” (National Digital Library of Theses and Dissertations in Taiwan).
[25] 中央研究院 (Academia Sinica), “中文斷詞系統” (Chinese Word Segmentation System). |
Full-text access rights |
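
The two-stage idea described in the Chinese abstract (a preliminary KNN or Naive Bayes pass whose per-class results set the thresholds for selecting associative classification rules, with the same classifier handling documents that no rule covers) might be sketched roughly as follows. This is an illustrative sketch and not the thesis's implementation: the `Rule` type, the function names, the use of per-class precision as the threshold, and the choice of the single most confident matching rule are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import AbstractSet, Callable, Dict, FrozenSet, List, Sequence


@dataclass(frozen=True)
class Rule:
    """Hypothetical associative classification rule (ACR)."""
    terms: FrozenSet[str]  # antecedent: terms that must all occur in the document
    label: str             # consequent: the class the rule predicts
    confidence: float      # rule confidence measured on the training data


def per_class_precision(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Precision of the preliminary classifier for each class it predicted."""
    predicted = Counter(y_pred)
    correct = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / n for label, n in predicted.items()}


def select_rules(rules: Sequence[Rule], thresholds: Dict[str, float]) -> List[Rule]:
    """Keep only rules whose confidence is strictly above their class threshold."""
    # Assumption: a class the preliminary classifier never predicted gets
    # threshold 1.0, so no rule for that class is kept.
    return [r for r in rules if r.confidence > thresholds.get(r.label, 1.0)]


def classify(doc_terms: AbstractSet[str],
             rules: Sequence[Rule],
             fallback: Callable[[AbstractSet[str]], str]) -> str:
    """Apply the most confident matching rule, else defer to the fallback."""
    matching = [r for r in rules if r.terms <= doc_terms]
    if matching:
        return max(matching, key=lambda r: r.confidence).label
    return fallback(doc_terms)  # e.g. a KNN or Naive Bayes prediction
```

In use, `per_class_precision` would be computed from the preliminary classifier's predictions on labeled data, passed to `select_rules` as the threshold table, and `fallback` would wrap the same KNN or Naive Bayes model. Replacing precision with F1 in the threshold table would correspond to the F1-based variant listed in the table of contents.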