System ID | U0002-0706201021042200 |
---|---|
DOI | 10.6846/TKU.2010.00196 |
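Thesis title (Chinese) | 利用不同分類器與多重靜態門檻值來改善文件分類的準確度 |
Thesis title (English) | Improving the Accuracy of Text Classification by the Different Classifier with Multiple Confidence Threshold Values |
Thesis title (third language) | |
University | Tamkang University (淡江大學) |
Department (Chinese) | 資訊工程學系博士班 |
Department (English) | Department of Computer Science and Information Engineering |
Foreign degree: university | |
Foreign degree: college | |
Foreign degree: graduate institute | |
Academic year (ROC calendar) | 98 |
Semester | 2 |
Publication year (ROC calendar) | 99 |
Graduate student (Chinese) | 黃蕙華 |
Graduate student (English) | Hui-Hua Huang |
Student ID | 893190040 |
Degree | Doctoral |
Language | English |
Second language | |
Oral defense date | 2010-05-28 |
Number of pages | 91 |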
Oral defense committee | Advisor - 葛煥昭; Member - 郭經華; Member - 謝楠楨; Member - 王亦凡; Member - 蔣定安; Member - 葛煥昭 |
Keywords (Chinese) | 關聯式分類; 文件分類; 文字採擷 |
Keywords (English) | Association Classification; Text Classification; Text Mining |
Keywords (third language) | |
Subject classification | |
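Chinese abstract (translated into English) |
When classifying with Associative Classification (AC), data that cannot be classified by any Class Association Rule (CAR) are usually assigned directly to a predefined default class, so that no data are left unclassified. When an AC classifier is built from CARs, however, the rule confidence threshold is hard to set. If it is too high, many potentially useful rules are discarded and much of the data cannot be classified by CARs; if it is too low, misclassification becomes likely. Both cases reduce classification accuracy. To address the errors caused by the default class and by low-confidence rules, we propose using two different classifiers together, each doing different work at a different stage according to its characteristics. This dissertation classifies the training documents with a Naive Bayes classifier, uses the resulting average accuracy to set the threshold value, and selects the CARs that satisfy the threshold condition. Because the accuracy of each selected CAR is higher than that of the Naive Bayes classifier, the selected CARs can further improve the classification result. Documents that the CARs cannot classify are classified by the Naive Bayes classifier. Experiments show that combining the strengths of the two classifiers in this way yields better document classification performance than using a single classifier. |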
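The single-threshold selection described in this abstract can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the `Rule` class, the toy documents, the value `nb_average_accuracy = 0.75`, and the strict `>` comparison are all assumptions made for the example. Rule confidence is computed in the usual way, as the fraction of training documents matching a rule's terms that also carry the rule's class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical class association rule: terms -> label."""
    terms: frozenset  # antecedent: terms that must all occur in a document
    label: str        # consequent: the predicted class


def confidence(rule, docs):
    """Fraction of documents containing rule.terms that carry rule.label."""
    covered = [label for terms, label in docs if rule.terms <= terms]
    if not covered:
        return 0.0
    return sum(1 for label in covered if label == rule.label) / len(covered)


def filter_rules(rules, docs, threshold):
    """Keep rules whose confidence is strictly above the threshold (assumed)."""
    return [r for r in rules if confidence(r, docs) > threshold]


# Toy training set: (set of terms, class label). Illustrative data only.
docs = [
    (frozenset({"stock", "market"}), "finance"),
    (frozenset({"stock", "price"}), "finance"),
    (frozenset({"stock", "soup"}), "cooking"),
    (frozenset({"soup", "recipe"}), "cooking"),
]
rules = [
    Rule(frozenset({"stock"}), "finance"),   # confidence 2/3
    Rule(frozenset({"soup"}), "cooking"),    # confidence 2/2
    Rule(frozenset({"recipe"}), "cooking"),  # confidence 1/1
]

# Assumed average accuracy of a Naive Bayes classifier on the training set.
nb_average_accuracy = 0.75
kept = filter_rules(rules, docs, nb_average_accuracy)
print([sorted(r.terms) for r in kept])  # [['soup'], ['recipe']]
```

In this toy run the `stock -> finance` rule is discarded because its confidence (2/3) falls below the assumed Naive Bayes accuracy, so documents matched only by that rule would be left to the Naive Bayes classifier.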
English abstract |
Each type of classifier has its own advantages as well as certain shortcomings. In this paper, we combine the Associative classifier and the Naive Bayes classifier so that the strengths of each make up for the shortcomings of the other, thereby improving the accuracy of text classification. We classify the training cases with the Naive Bayes classifier and, for each class, set the confidence threshold value for that class's class association rules (CARs) from the classification accuracy that the Naive Bayes classifier obtains on that class. Since the accuracy of every selected CAR of a class is higher than that obtained by the Naive Bayes classifier for the class, these selected CARs can further improve the classification result. Cases that the selected CARs leave unclassified are classified with the Naive Bayes classifier. The experimental results show that combining the advantages of these two classifiers yields better classification results than a single classifier. |
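The two-stage scheme in this abstract (per-class thresholds, then a Naive Bayes fallback) can be sketched as below. This is a hedged illustration under stated assumptions: the `CAR` tuple, the accuracy figures, the class names, the rule that the highest-confidence matching CAR wins, and the stand-in `toy_nb_predict` function are all hypothetical and are not taken from the dissertation.

```python
from typing import Callable, Dict, FrozenSet, List, NamedTuple, Tuple


class CAR(NamedTuple):
    """Hypothetical class association rule with a precomputed confidence."""
    terms: FrozenSet[str]
    label: str
    confidence: float


def select_rules(rules: List[CAR], nb_accuracy: Dict[str, float]) -> List[CAR]:
    """Keep a CAR only if its confidence exceeds the Naive Bayes accuracy
    measured for the CAR's own class (one threshold per class)."""
    return [r for r in rules if r.confidence > nb_accuracy[r.label]]


def classify(doc: FrozenSet[str], selected: List[CAR],
             nb_predict: Callable[[FrozenSet[str]], str]) -> Tuple[str, str]:
    """Use a matching selected CAR if there is one, else fall back to
    Naive Bayes. Returns (label, source)."""
    matching = [r for r in selected if r.terms <= doc]
    if matching:
        # Assumption: the highest-confidence matching rule decides the class.
        return max(matching, key=lambda r: r.confidence).label, "CAR"
    return nb_predict(doc), "NB"


# Assumed per-class Naive Bayes accuracy on the training cases.
nb_accuracy = {"grain": 0.90, "trade": 0.70}
rules = [
    CAR(frozenset({"wheat"}), "grain", 0.95),   # above 0.90: kept
    CAR(frozenset({"corn"}), "grain", 0.85),    # below 0.90: dropped
    CAR(frozenset({"tariff"}), "trade", 0.80),  # above 0.70: kept
]
selected = select_rules(rules, nb_accuracy)


def toy_nb_predict(doc: FrozenSet[str]) -> str:
    """Stand-in for a trained Naive Bayes classifier (always says 'trade')."""
    return "trade"


print(classify(frozenset({"wheat", "export"}), selected, toy_nb_predict))
# ('grain', 'CAR')
print(classify(frozenset({"corn", "export"}), selected, toy_nb_predict))
# ('trade', 'NB')
```

Note that a single threshold of 0.85 or higher would have discarded the `tariff` rule as well; the per-class thresholds keep it because the assumed Naive Bayes accuracy on `trade` is lower than on `grain`.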
Third-language abstract | |
Table of contents |
Contents
Chapter 1 Introduction
1.1 Research Motivation of this Dissertation
1.2 Research Objectives of this Dissertation
1.3 Organization of this Dissertation
Chapter 2 Background Knowledge
2.1 TFIDF
2.2 Classifiers
2.2.1 Associative Classifier of Lazy
2.2.2 Associative Classifier of CBA and CMAR
2.2.3 Naive Bayes Classifier
Chapter 3 Classification Process
3.1 How to set threshold values
3.2 The process of classification
Chapter 4 Experimental Results
4.1 Text Corpora
4.1.1 Chinese documents
4.1.2 Reuters 21578
4.2 The method of evaluation
4.3 The Classification Results of the Lazy classifier and the Naive Bayes classifier
4.3.1 The results of Chinese documents
4.3.2 The results of Reuters 21578
4.4 Results of using Single Confidence Threshold Values
4.4.1 Results of Chinese documents
4.4.2 Results of Reuters 21578
4.5 Results of using Multiple Confidence Threshold Values
4.5.1 Results of Chinese documents
4.5.2 Results of Reuters 21578
4.6 Analysis of the Experimental Results
4.6.1 Analysis of Chinese documents
4.6.2 Analysis of Reuters 21578
Chapter 5 Conclusion and Future Directions
5.1 Conclusion
5.2 Future Directions
Bibliography
VITA
Appendix A
Appendix B
List of Figures
Figure 2.1 L3 algorithm
Figure 2.2 Selecting rules based on database coverage
Figure 2.3 The process of Naive Bayes classification
Figure 3.1 The training process of classification
Figure 3.2 The testing process of classification
List of Tables
Table 3.1 The confidence threshold value is 91.58%
Table 3.2 The confidence threshold value is 75.21%
Table 4.1 Number of documents selected from each department of Chinese documents
Table 4.2 Number of documents selected from each class of Reuters 21578
Table 4.3 Classification results of the Lazy classifier based on Chinese documents
Table 4.4 Classification results of the Naive Bayes classifier based on Chinese documents
Table 4.5 Classification results of the Lazy classifier based on Reuters 21578
Table 4.6 Classification results of the Naive Bayes classifier based on Reuters 21578
Table 4.7 Classification results of using single threshold values based on Chinese documents
Table 4.8 Classification results of using single threshold values based on Reuters 21578
Table 4.9 Classification results of using multiple threshold values based on Chinese documents
Table 4.10 Classification results of using multiple threshold values based on Reuters 21578
Table 4.11 The comparison of classification results to different classifiers based on Chinese documents
Table 4.12 The comparison of classification results to different classifiers based on Reuters 21578 |
Full-text access rights |