系統識別號 | U0002-1502201221323800 |
---|---|
DOI | 10.6846/TKU.2012.00572 |
論文名稱(中文) | 使用降低規則相依問題影響來改善關聯式分類效能 |
論文名稱(英文) | Improving the performance of association classifiers by reducing the impact of rule dependency problem |
第三語言論文名稱 | |
校院名稱 | 淡江大學 |
系所名稱(中文) | 資訊工程學系博士班 |
系所名稱(英文) | Department of Computer Science and Information Engineering |
外國學位學校名稱 | |
外國學位學院名稱 | |
外國學位研究所名稱 | |
學年度 | 100 |
學期 | 1 |
出版年 | 101 |
研究生(中文) | 陳智揚 |
研究生(英文) | Chih-Yang Chen |
學號 | 895410263 |
學位類別 | 博士 |
語言別 | 繁體中文 |
第二語言別 | |
口試日期 | 2012-01-03 |
論文頁數 | 121頁 |
口試委員 |
指導教授
-
蔣璿東(chiang@cs.tku.edu.tw)
委員 - 謝楠楨 委員 - 陳俊豪 委員 - 王亦凡 委員 - 葛煥昭 委員 - 蔣璿東 |
關鍵字(中) |
關聯式分類 文件分類 規則相依 |
關鍵字(英) |
Association Classification Text Classification Rule Dependency |
第三語言關鍵字 | |
學科別分類 | |
中文摘要 |
關聯式分類法(Associative Classification, AC)在規則排序(Ranking)時會因為分類的方式不同而有所差異,但基本上,都是依照信賴度由高至低排序,所以在進行分類時,也都是先利用信賴度高的規則做分類。由於某些關聯式規則之間會有規則相依問題(rule dependency problem),對這些規則而言,執行的先後順序會影響到這些規則對尚未被分類資料的信賴度;由於rule dependency problem會造成規則信賴度的改變,所以對尚未被分類資料而言,目前關聯式分類法都實際上並未完全信賴度的由高低來進行分類,進而影響最後分類的結果。現有的關聯式分類法在進行分類時,都沒有考慮 Rule dependency problem。主要是因為訓練文件有可能會產生不同類別的規則,在分類時,也可能會因為不同的規則被分類至不同的類別,因此哪一條規則先執行的確會產生不同的分類結果。但對有 n 條規則的 AC 而言,規則有 n! 種執行順序,所以要解決rule dependency problem (找尋最佳規則執行順序) 將是一個非常耗時的工作。所以本研究主要探討關聯式分類器中Rule Dependency Problem的問題,並提出不同的多項式時間(polynomial time)排序演算法來設定分類器中規則的執行順序,以降低規則相依問題對分類結果的影響,進而提昇關聯式分類器的分類準確度。 |
英文摘要 |
Since the dependence of rules may affect the confidences of rules, the execution order of the remaining rules is not ranked by the actual confidence of rules to the unclassified data, which will directly influence the classification accuracy of the associative classifier. However, finding the optimal execution order of CARs is a combinational problem, it is a very time consuming process. In this project, instead of finding the optimal execution order of CARs, we plan to propose different algorithms to re-rank the execution order of CARs to reduce the influence of the rule dependency problem and improve the classification accuracy of the associative classifier. For resolving the rule dependency problem, the number of executing ranking for N rules should be N!. As a result, finding out the optimal rule-executing ranking is a time consuming task. Therefore, instead of finding the optimal execution order of CARs, in this paper, we propose a polynomial time algorithm to re-rank the execution order of CARs by rules’ priority to reduce the influence of the rule dependency problem. Consequently, the performance (the classification accuracy and recall rate) of the associative classification algorithm can be improved. |
第三語言摘要 | |
論文目次 |
目錄 第1章 緒論 1 1.1 研究動機與目的 1 1.2 研究架構 3 第2章 相關文獻與研究探討 5 2.1 Apriori演算法 5 2.2 關聯式分類 (Associative Classification) 9 2.2.1 Lazy 12 2.2.2 CBA 15 2.2.3 CMAR 19 2.2.4 CPAR與PRM 20 2.3 文件分類(Document Classification Or Text Categorization) 23 2.4 評估方式 27 第3章 問題描述 29 第4章 研究方法 33 4.1 The Multi Levels Rule Priority Algorithm 33 4.2 The Multi Levels Class Priority Algorithm 39 第5章 實驗結果 46 5.1 實驗結果(1)-使用Reuters-21578 46 5.1.1 資料來源 46 5.1.2 The Multi Levels Rule Priority Algorithm實驗結果 53 5.1.3 The Multi Levels Class Priority Algorithm實驗結果 57 5.2 實驗結果(2)-使用UCI資料庫 58 5.2.1 資料來源 58 5.2.2 The Multi Levels Rule Priority Algorithm實驗結果 60 5.2.3 The Multi Levels Class Priority Algorithm實驗結果 65 5.3 綜合比較 68 第6章 結論 75 Bibliography 76 VITA 83 附錄A 84 附錄B 91 附錄C 98 附錄D 104 附錄E 110 附錄F 116 圖目錄 圖 2-1 產生候選項目 7 圖 2-2 計算候選項目次數 7 圖 2-3 Apriori演算法過程 9 圖 2-4 database coverage 演算法 11 圖 2-5 L3 規則修剪演算法 15 圖 2-6 CBA-RG 演算法 16 圖 2-7 CBA-CB Naïve(called M1) Algorithm 17 圖 2-8 Selecting rules based on database coverage 20 圖 2-9 PRM演算法 22 圖 2-10 文件分類流程圖 24 圖 2-11 特徵詞刪除成效比較 25 圖 2-12 關聯式分類器分類流程示意圖 26 圖 2-13 文件數量分佈表 27 圖 4-1 MLRP 演算法流程圖 34 圖 4-2 MLRP演算法 36 圖 4-3 Rerank_Priority對最高階層的規則重新排序規則演算法 38 圖 4-4 選擇規則並回傳Improve value最大值 39 圖 4-5 MLCP 演算法流程圖 41 圖 4-6 MLCP演算法 42 圖 4-7 Rerank_Priority對最高階層的規則重新排序規則演算法 44 圖 4-8 選擇規則並回傳Improve value最大值 45 圖 5-1 Reuters文件標籤 46 圖 5-2 Reuters文件範例 48 圖 5-3 L3與L3 with MLRP五次實驗正確率比較 56 圖 5-4 L3與L3 with MLCP五次實驗正確率比較 58 圖 5-5 UCI_Balance資料庫五次MLRP實驗正確率 64 圖 5-6 UCI_Vehicle資料庫五次MLRP實驗正確率 65 圖 5-7 UCI_Balance資料庫五次MLCP實驗正確率 67 圖 5-8 UCI_Vehicle資料庫五次MLCP實驗正確率 68 圖 5-9 Reuters資料庫L3 with MLRP與L3 with MLCP五次實驗正確率比較 69 圖 5-10 Balance資料庫L3 with MLRP與L3 with MLCP五次實驗正確率比較 69 圖 5-11 Vehicle資料庫L3 with MLRP與L3 with MLCP五次實驗正確率比較 69 圖 5-12 Reuters資料庫L3 with MLRP與L3 with MLCP五次實驗執行時間比較 70 圖 5-13 Balance資料庫L3 with MLRP與L3 with MLCP五次實驗執行時間比較 70 圖 5-14 Vehicle資料庫L3 with MLRP與L3 with MLCP五次實驗執行時間比較 71 圖 5-15 Reuters資料庫L3 with MLRP與L3 with MLCP準確率比較圖 72 圖 5-16 Balance資料庫L3 with MLRP與L3 with MLCP準確率比較圖 72 圖 5-17 Vehicle資料庫L3 with MLRP與L3 with MLCP準確率比較圖 72 圖 5-18 Reuters資料庫L3 with MLRP與L3 with MLCP執行時間比較圖 73 圖 5-19 Balance資料庫L3 with MLRP與L3 with MLCP執行時間比較圖 74 圖 5-20 Vehicle資料庫L3 with MLRP與L3 with MLCP執行時間比較圖 74 表目錄 表 2-1 關聯式規則搜索與關聯式分類差異表 10 表 3-1 範例規則1 30 表 3-2 範例規則2 31 表 3-3 範例規則3 31 表 3-4 範例規則4 31 表 5-1 不同分類的文件數 49 表 5-2 訓練及測試文件數 49 表 5-3 stop word list的部分範例 51 表 5-4 DB2 Intelligent Miner關聯式規則範例 52 表 5-5 Classification results from using L3 without Rule Priority(Reuters-21578) 54 表 5-6 Classification results from using L3 with MLRP Algorithm(Reuters-21578) 55 表 5-7 Classification results from using L3 with MLCP Algorithm(Reuters-21578) 57 表 5-8 Balance資料庫範例 59 表 5-9 Vehicle資料庫範例 59 表 5-10 UCI_Balance資料庫訓練及測試資料筆數 60 表 5-11 UCI_Vehicle資料庫訓練及測試資料筆數 60 表 5-12 Classification results from using L3 without Rule Priority(Balance) 61 表 5-13 Classification results from using L3 with MLRP Algorithm(Balance) 61 表 5-14 Classification results from using L3 without Rule Priority(Vehicle) 62 表 5-15 Classification results from using L3 with MLRP Algorithm(Vehicle) 63 表 5-16 Classification results from using L3 with MLCP Algorithm(Balance) 65 表 5-17 Classification results from using L3 with MLCP Algorithm(Vehicle) 66 |
參考文獻 |
[1] F. THABTAH, “A review of associative classification mining,” Knowl. Eng. Rev., vol. 22, 2007, pp. 37-65. [2] J.-Y. Jiang, R.-J. Liou, and S.-J. Lee, "A Fuzzy Self-Constructing Feature Clustering Algorithm for Text Classification,"accepted by IEEE Transactions on Knowledge and Data Engineering, Nov. 2009. [3] B. Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009. [4] P. Soucy and G. Mineau, “A simple KNN algorithm for text categorization,” Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on, 2001, pp. 647-648. [5] W. Li, J. Han, and J. Pei, “CMAR: accurate and efficient classification based on multiple class-association rules,” Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on, 2001, pp. 376, 369. [6] P.G. Elena Baralis, “A Lazy Approach to Pruning Classification Rules,” Dec. 2002. [7] B. Liu, W. Hsu, and Y. Ma, “Integrating Classification and Association Rule Mining,” Knowledge Discovery and Data Mining, 1998, pp. 86, 80. [8] Yin, X. & Han, J. 2003 CPAR: Classification based on predictive association rule. In Proceedings of the SIAM International Conference on Data Mining. San Francisco, CA: SIAM Press, pp. 369–376. [9] R. Agrawal, T. Imielinski, A. Swami, “Mining Association Rules between Sets of Items in Large Databases,” Proceedings of the 1993 ACM SIGMOD Conferencen on Management of Data, Washington D.C., pp. 207-216, May 1993. [10] K. Wang, S. Zhou, and Y. He, “Growing decision trees on support-less association rules,” Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, Boston, Massachusetts, United States: ACM, 2000, pp. 265-269. [11] K. Wang, Y. He, and D.W. Cheung, “Mining confident rules without support requirement,” Proceedings of the tenth international conference on Information and knowledge management, Atlanta, Georgia, USA: ACM, 2001, pp. 89-96. [12] K. Aas and L. Eikvil. “Text categorisation: A survey,” Technical report, Norwegian Computing Center, 1999. [13] M.F. Porter, “An algorithm for suffix stripping,” Readings in information retrieval, Morgan Kaufmann Publishers Inc., 1997 , pp. 313-316. [14] M.E. Maron, “Automatic Indexing : an Experimental Inquiry,” J. of the ACM, Vol. 8, pp. 404-417, 1961 [15] Mingyu Lu, Keyun Hu, Yi Wu, Yuchang Lu, and Lizhu Zho, “SECTCS: towards improving VSM and Naive Bayesian classifier:Systems, Man and Cybernetics,” IEEE International Conference on 2002 , Vol. 5, pp. 6-9, Oct. 2002. [16] Ng, H. T., Goh, W. B., & Low, K. L. (1997). Feature selection, perceptron learning, and a usability case study for text categorization. [17] Yongwook yoon, Gary G. Lee, Tseng, “Text Categorization Based on Boosting Association Rules,” Semantic Computing 2008 IEEE International Conference on, 2008, pp. 136-143. [18] Merz, C. & Murphy, P. 1996 UCI repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science. [19] SQL server integration services "英文斷詞系統, http://msdn.microsoft.com/zh-tw/library/ms141026.aspx. [20] Hsin Yuan Chiou, “Improving the performance of Associative Classification by using the Multi-level Class Priority of Rule Ranking,” Master thesis of Tamkang University, Jun. 2010, pp. 1-52. [21] Mao-Sheng Hung, “Improve Document Classify Accuracy by Rule – Static threshold and Dynamic threshold Research,” Master thesis of Tamkang University, Jun. 2009, pp. 1-49. [22] Y.M. Chen, “Using Association Rule to Improve The Accuracy of Text Categorization - The Combination with other Classifiers,” Master thesis of Tamkang University, Jun. 2009, pp. 1-57. [23] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds., Advances in knowledge discovery and data mining, American Association for Artificial Intelligence, 1996. [24] Jing Chen, Zhigang Zhang, Qing Li and Xiaoming Li, 2005, “A Pattern-Based Voting Approach for Concept Discovery on the Web,” Web Technologies Research and Development-APWeb 2005, Volume 3399/2005 [25] Karras, DA, 2006, “An Improved Text Categorization Methodology Based on Second and Third Order Probabilistic Feature Extraction and Neural Network Classifiers,” Lecture Notes in Computer Science, 2006, pp. 9-20. [26] J.R. Quinlan and R.M. Cameron-jones, “FOIL: A Midterm Report,” IN PROCEEDINGS OF THE EUROPEAN CONFERENCE ON MACHINE LEARNING, vol. 667, 1993, pp. 3--20. [27] E. Baralis, S. Chiusano, and P. Garza, “On support thresholds in associative classification,” Proceedings of the 2004 ACM symposium on Applied computing, Nicosia, Cyprus: ACM, 2004, pp. 553-558. [28] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules,” Proc. 20th Int. Conf. Very Large Data Bases, VLDB, J.B. Bocca, M. Jarke, and C. Zaniolo, eds., Morgan Kaufmann, 1994, pp. 487–499. [29] Y. Yang and X. Liu, “A re-examination of text categorization methods,” Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, Berkeley, California, United States: ACM, 1999, pp. 42-49. [30] T. Joachims, “A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization,” Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., 1997, pp. 143-151. [31] P. Bickel and E. Levina, “Some theory for Fisher's linear discriminant function, `naive Bayes', and some alternatives when there are many more variables than observations,” Bernoulli, vol. 10, 2004, pp. 1010, 989. [32] Tseng, Yuen-Hsien, “Effectiveness Issues in Automatic Text Categorization,” Bulletin of the Library Association of China, vol. 68, Jun. 2002, pp. 62-83. [33] Cho-Ming Lee, “Classifying Chinese Text Documents by Association Rule,” Master thesis of Tamkang University, Jun. 2006, pp. 1-66. [34] Lee, L., Isa, D., Choo, W., & Chue, W. (2010). Tournament structure ranking techniques for Bayesian text classification with highly similar categories. Journal of Applied Sciences, 10, 1243-1254. [35] Sriram, B., Fuhry, D., Demir, E., Ferhatosmanoglu, H., & Demirbas, M. (2010). Short text classification in twitter to improve information filtering. [36] Botsis, T., Nguyen, M. D., Woo, E. J., Markatou, M., & Ball, R. (2011). Text mining for the Vaccine Adverse Event Reporting System: medical text classification using informative feature selection. Journal of the American Medical Informatics Association, 18(5), 631-638. [37] Crossley, S. A., & McNamara, D. S. (2011). Shared features of L2 writing: Intergroup homogeneity and text classification. Journal of Second Language Writing. [38] Ganiz, M., George, C., & Pottenger, W. (2011). Higher Order Naive Bayes: A Novel Non-IID Approach to Text Classification. Knowledge and Data Engineering, IEEE Transactions on(99), 1-1. [39] Shi, K., Li, L., Liu, H., He, J., Zhang, N., & Song, W. (2011). An improved KNN text classification algorithm based on density. [40] Zhang, W., Yoshida, T., & Tang, X. (2011). A comparative study of TF* IDF, LSI and multi-words for text classification. Expert Systems with Applications, 38(3), 2758-2765. [41] Oh, H. S., Choi, Y., & Myaeng, S. H. (2011). Text classification for a large-scale taxonomy using dynamically mixed local and global models for a node. Advances in Information Retrieval, 7-18. [42] Lin, W. R., Wu, Z. H., Feng, L. C., & Huang, W. B. (2011). Chinese Text Classification with a KNN Classifier Using an Adjusted Feature Weighting Method. Applied Mechanics and Materials, 50, 700-703. [43] Ota, S., & Mima, H. (2011). Machine Learning-based Syllabus Classification toward Automatic Organization of Issue-oriented Interdisciplinary Curricula. Procedia-Social and Behavioral Sciences, 27, 241-247. [44] XU, H. R., ZHANG, W. S., & TUN, S. (2011). Research of Web Text Classification Method Based on Neighborhood Preserving Embedding [J]. Computer Engineering, 37(17), 133-135. [45] SHI, K., HE, J., LIU, H., ZHANG, N., & SONG, W. (2011). Efficient text classification method based on improved term reduction and term weighting. The Journal of China Universities of Posts and Telecommunications, 18, 131-135. [46] Botsis, T., Nguyen, M. D., Woo, E. J., Markatou, M., & Ball, R. (2011). Text mining for the Vaccine Adverse Event Reporting System: medical text classification using informative feature selection. Journal of the American Medical Informatics Association, 18(5), 631-638. [47] Zhai, Z., Xu, H., Kang, B., & Jia, P. (2011). Exploiting effective features for chinese sentiment classification. Expert Systems with Applications. [48] Jiang, H., & Li, W. Q. (2012). Improved Algorithm Based on TFIDF in Text Classification. Advanced Materials Research, 403, 1791-1794. [49] Ma, X., Shen, Y., Chen, J., & Xue, G. (2012). Combining Naive Bayes and Tri-gram Language Model for Spam Filtering. Knowledge Engineering and Management, 509-520. [50] Nuipian, V., Meesad, P., & Boonrawd, P. (2012). Improve Abstract Data with Feature Selection for Classification Techniques. Advanced Materials Research, 403, 3699-3703. [51] Song, Y. (2012). Machine learning for text mining: Classification, retrieval and recommendation. THE PENNSYLVANIA STATE UNIVERSITY. [52] http://www.daviddlewis.com/resources/testcollections/reuters21578 [53] http://archive.ics.uci.edu/ml/index.html [54] Francesco Ricci, Lior Rokach, Bracha Shapira, Paul B. Kantor (Eds.): Recommender Systems Handbook. Springer 2011, ISBN 978-0-387-85819-7 [55] J. Hipp, U. Güntzer, and G. Nakhaeizadeh, “Algorithms for Association Rule Mining – A General Survey and Comparison,” SIGKDD Explorations, Vol. 2, Issue 1, pp. 58-64, 2000. [56] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo, “Fast Discovery of Association Rules,” Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds., AAAI/MIT Press, pp. 307-328, 1996. [57] J. Han and Y. Fu, “Mining Multiple-level Association Rules in Large Databases,” IEEE Trans. on Knowledge and Data Eng., Vol. 11. No. 5, pp. 798-805, 1999. [58] A. Savasere, E. Omiecinski, S. Navathe, “An Efficient Algorithm for Mining Association Rules in Large Database,” Proc. 21th VLDB, Zurich, Switzerland, pp. 432-444, 1995. [59] H. Toivonen, “Sampling large databases for association rules,” In Proc. 22nd VLDB Conference, Bombay, India, pp. 134-145,Sept. 1996. [60] J.S. Park, M.S. Chen, and P.S. Yu, “Using a Hash-Based Method with Transaction Trimming for Mining Association Rules,” IEEE Trans. on Knowledge and Data Eng., vol. 9, no. 5, pp. 813-825, Sept./Oct. 1997. [61] Hung-Ju Huang; Chun-Nan Hsu;, “Bayesian classification for data from the same unknown class: Systems, Man and Cybernetics”, Part B, IEEE Transactions on , Volume: 32 , Issue: 2 , April 2002,pp137-145 [62] H D Navone, D Cook ,T Downs and D Chen, “Boosting Naive-Bayes classifiers to predict outcomes for hip prostheses, Neural Networks”, 1999. IJCNN '99. International Joint Conference on , Volume: 5 , 10-16 July 1999,pp3622 – 3626 [63] Weiguo Fan, Michael D. Gordon, and Praveen Pathak,2004,” Discovery of Context-Specific Ranking Functions for Effective Information Retrieval Using Genetic Programming”, IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 4, pp. 523-527 |
論文全文使用權限 |
如有問題,歡迎洽詢!
圖書館數位資訊組 (02)2621-5656 轉 2487 或 來信