Tamkang University Chueh Sheng Memorial Library (TKU Library)


Electronic full-text download (restricted to Tamkang University IP addresses)
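System ID: U0002-0706201021042200
Title (Chinese): 利用不同分類器與多重靜態門檻值來改善文件分類的準確度
Title (English): Improving the Accuracy of Text Classification by the Different Classifier with Multiple Confidence Threshold Values
Institution: Tamkang University
Department (Chinese): 資訊工程學系博士班
Department (English): Department of Computer Science and Information Engineering
Academic year: 98 (ROC calendar, 2009-2010)
Semester: 2
Year of publication: 99 (ROC calendar, 2010)
Author (Chinese name): 黃蕙華
Author (English name): Hui-Hua Huang
Student ID: 893190040
Degree: Doctoral
Language: English
Oral defense date: 2010-05-28
Number of pages: 91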
Oral defense committee: Advisor - 葛煥昭
Member - 郭經華
Member - 謝楠楨
Member - 王亦凡
Member - 蔣定安
Member - 葛煥昭
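Keywords (Chinese): 關聯式分類 (associative classification); 文件分類 (text classification); 文字採擷 (text mining)
Keywords (English): Association Classification; Text Classification; Text Mining
Subject classification: Applied Sciences, Information Engineering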
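Abstract (translated from Chinese): When classifying with Associative Classification (AC), data that cannot be classified by Class Association Rules (CARs) are usually assigned directly to a predefined default class so that no data is left unclassified. However, when an AC classifier is built from CARs, the rule confidence threshold is hard to set. If it is set too high, many potentially useful rules are discarded and much of the data cannot be classified by CARs; if it is set too low, classification errors become likely. Both situations reduce classification accuracy. To address the errors caused by the default class and by low-confidence rules, and to improve the accuracy of the classification results, we propose using two different classifiers together, each doing a different job at a different stage according to its characteristics. This dissertation classifies the training documents with a Bayes classifier and then uses the resulting average accuracy to set the threshold, selecting the CARs that satisfy it. Because the accuracy of these CARs is higher than that of the Bayes classifier, the selected CARs can be used to further improve the classification results. Documents that the CARs cannot classify are classified by the Bayes classifier. Experiments show that combining the strengths of different classifiers in this way yields better document classification performance than using a single classifier alone.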
Abstract (English): Each type of classifier has its own advantages as well as certain shortcomings. In this paper, we combine the Associative classifier and the Naive Bayes classifier so that the strengths of each offset the shortcomings of the other, thereby improving the accuracy of text classification. We classify the training cases with the Naive Bayes classifier and, for each class, set a separate confidence threshold for that class's class association rules (CARs) based on the classification accuracy that the Naive Bayes classifier obtains for the class. Since every selected CAR of a class has a higher accuracy than the Naive Bayes classifier achieves on that class, the selected CARs can further improve the classification result. Cases that the selected CARs leave unclassified are classified with the Naive Bayes classifier. The experimental results show that combining the advantages of these two classifiers yields better classification results than a single classifier.
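The two-stage idea in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the `Rule` structure, the function names, the use of per-class accuracy directly as the threshold, and the tie-breaking by highest confidence are all assumptions made for illustration, and the Naive Bayes classifier is represented only by a caller-supplied `fallback` function.

```python
from collections import namedtuple

# Hypothetical rule representation (an assumption, not from the thesis):
# if every term in `terms` occurs in a document, predict `label`.
Rule = namedtuple("Rule", ["terms", "label", "confidence"])


def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of documents of each true class that were predicted correctly.

    Assumption: this is how the per-class accuracy of the Naive Bayes
    classifier on the training set is measured.
    """
    totals, correct = {}, {}
    for truth, guess in zip(true_labels, predicted_labels):
        totals[truth] = totals.get(truth, 0) + 1
        if truth == guess:
            correct[truth] = correct.get(truth, 0) + 1
    return {label: correct.get(label, 0) / totals[label] for label in totals}


def select_rules(rules, thresholds):
    """Keep rules whose confidence exceeds the threshold of their own class.

    Rules for a class without a threshold are dropped, because the default
    threshold of 1.0 can never be exceeded.
    """
    return [r for r in rules if r.confidence > thresholds.get(r.label, 1.0)]


def classify(doc_terms, selected_rules, fallback):
    """Use the highest-confidence matching rule, else the fallback classifier."""
    matches = [r for r in selected_rules if r.terms <= doc_terms]
    if matches:
        return max(matches, key=lambda r: r.confidence).label
    return fallback(doc_terms)
```

In use, `thresholds` would come from running a Naive Bayes classifier over the training documents and passing its predictions to `per_class_accuracy`; the same classifier would then be supplied as `fallback`, so that documents matched by no selected rule are still classified instead of falling into a default class.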
Table of Contents
Chapter 1 Introduction
1.1 Research Motivation of this Dissertation
1.2 Research Objectives of this Dissertation
1.3 Organization of this Dissertation
Chapter 2 Background Knowledge
2.1 TFIDF
2.2 Classifiers
2.2.1 Associative Classifier of Lazy
2.2.2 Associative Classifier of CBA and CMAR
2.2.3 Naive Bayes Classifier
Chapter 3 Classification Process
3.1 How to set threshold values
3.2 The process of classification
Chapter 4 Experimental Results
4.1 Text Corpora
4.1.1 Chinese documents
4.1.2 Reuters 21578
4.2 The method of evaluation
4.3 The Classification Results of the Lazy classifier and the Naive Bayes classifier
4.3.1 The results of Chinese documents
4.3.2 The results of Reuters 21578
4.4 Results of using Single Confidence Threshold Values
4.4.1 Results of Chinese documents
4.4.2 Results of Reuters 21578
4.5 Results of using Multiple Confidence Threshold Values
4.5.1 Results of Chinese documents
4.5.2 Results of Reuters 21578
4.6 Analysis of the Experimental Results
4.6.1 Analysis of Chinese documents
4.6.2 Analysis of Reuters 21578
Chapter 5 Conclusion and Future Directions
5.1 Conclusion
5.2 Future Directions
Bibliography
VITA
Appendix A
Appendix B



List of Figures
Figure 2.1 L3 algorithm
Figure 2.2 Selecting rules based on database coverage
Figure 2.3 The process of Naive Bayes classification
Figure 3.1 The training process of classification
Figure 3.2 The testing process of classification



List of Tables
Table 3.1 The confidence threshold value is 91.58%
Table 3.2 The confidence threshold value is 75.21%
Table 4.1 Number of documents selected from each department of Chinese documents
Table 4.2 Number of documents selected from each class of Reuters 21578
Table 4.3 Classification results of the Lazy classifier based on Chinese documents
Table 4.4 Classification results of the Naive Bayes classifier based on Chinese documents
Table 4.5 Classification results of the Lazy classifier based on Reuters 21578
Table 4.6 Classification results of the Naive Bayes classifier based on Reuters 21578
Table 4.7 Classification results of using single threshold values based on Chinese documents
Table 4.8 Classification results of using single threshold values based on Reuters 21578
Table 4.9 Classification results of using multiple threshold values based on Chinese documents
Table 4.10 Classification results of using multiple threshold values based on Reuters 21578
Table 4.11 The comparison of classification results to different classifiers based on Chinese documents
Table 4.12 The comparison of classification results to different classifiers based on Reuters 21578
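Thesis usage rights
  • The author grants a royalty-free license for library patrons to reproduce the print copy for academic purposes; available from 2010-06-11.
  • The author authorizes browsing and printing of the electronic full text; available from 2010-06-11.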

