Tamkang University Chueh Sheng Memorial Library (TKU Library)


System ID: U0002-1707200515384500
Chinese Title: 決策樹二階段局部特徵分類
English Title: The Decision Tree of the Construct of the Two-phase Document Classification System in Local Feature
University: Tamkang University (淡江大學)
Department (Chinese): 資訊工程學系碩士班
Department (English): Department of Computer Science and Information Engineering
Academic Year: 93 (ROC calendar, 2004-2005)
Semester: 2
Year of Publication: 94 (ROC calendar, 2005)
Student Name (Chinese): 廖冠登
Student Name (English): Kuan-Teng Liao
Student ID: 692192072
Degree: Master's
Language: Chinese
Oral Defense Date: 2005-06-17
Number of Pages: 57
Examination Committee:
  Advisor: 蔣定安
  Member: 葛煥昭
  Member: 王鄭慈
Chinese Keywords: 文件分類, 決策樹, 文字探勘
English Keywords: Document Classification, Decision Tree, Text Mining
Subject Classification: Applied Sciences / Information Engineering
Chinese Abstract: Using the concept of keywords, we can derive appropriate classification rules, namely category keywords, from a set of documents whose categories have already been labeled, and then use those rules to classify documents whose categories are not yet labeled.
The training stage of document classification starts from the sample documents: the occurrence and distribution of each feature term in the samples are computed, and the statistics are used to decide whether the term is representative of a category; if it is, the term becomes a classification rule. However, the feature terms of a document often exhibit dependencies among words, and a document may also carry a large amount of noise. To handle these dependencies effectively and to filter out unnecessary noise, this thesis proposes a decision-tree method to resolve the correlation between terms, combined with local feature weighting that weakens unimportant keywords so that the important ones stand out. The results of this thesis show that, even with a small number of training samples, the combination of the decision tree and two-phase feature weighting achieves good precision and recall in document classification.
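As a concrete illustration of the keyword-based rule idea described above, the following minimal Python sketch builds weighted category keywords from a few labeled documents and applies them to an unlabeled one. It is not the thesis implementation; the TF-IDF weighting, the top-k rule selection, and every function name here are illustrative assumptions.

import math
from collections import Counter, defaultdict

def tfidf_weights(docs):
    # docs: list of token lists; returns one {term: tf*idf} dict per document.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    n = len(docs)
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

def category_keywords(labeled_docs, top_k=5):
    # labeled_docs: list of (label, tokens); returns {label: {keyword: weight}}.
    weights = tfidf_weights([tokens for _, tokens in labeled_docs])
    rules = defaultdict(Counter)
    for (label, _), w in zip(labeled_docs, weights):
        rules[label].update(w)
    return {label: dict(c.most_common(top_k)) for label, c in rules.items()}

def classify(tokens, rules):
    # Assign the category whose keywords overlap most (by weight) with the document.
    present = set(tokens)
    scores = {label: sum(w for t, w in kw.items() if t in present)
              for label, kw in rules.items()}
    return max(scores, key=scores.get)

# Hypothetical usage with two tiny labeled documents:
rules = category_keywords([("sports", ["game", "score", "team"]),
                           ("finance", ["stock", "market", "price"])])
print(classify(["team", "score", "tonight"], rules))  # -> "sports"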
English Abstract: By using feature keywords, we can derive appropriate classification rules from a group of labeled documents and then use those rules to classify documents that have not yet been labeled. This thesis discusses how to select training data as a basis, how to compute the weight of every keyword, how to judge each keyword's importance from its distribution, and how to resolve the problem of keyword correlation.

We aim to handle the correlation between keywords efficiently and to filter out noise. First, a decision tree is used to address the correlation problem, because it can disregard word-to-word dependencies in its initial step. Second, a two-phase local feature scheme is used to reduce the amount of noise. The experiments in Chapter 4 show improved performance over the unmodified approach.
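A rough sketch of this two-phase idea follows. It is only my own illustration under stated assumptions, not the thesis code: scikit-learn, the entropy-based tree, and the 0.1 damping factor for terms the tree ignores are assumptions, standing in for the thesis's own weighting and the Bayesian classification listed in Section 4.3.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB

def two_phase_fit(train_texts, train_labels, damp=0.1):
    # Phase 1: a decision tree (entropy splits) marks the discriminative terms,
    # sidestepping term-to-term correlation when it chooses split features.
    vec = TfidfVectorizer()
    X = vec.fit_transform(train_texts).toarray()
    tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
    tree.fit(X, train_labels)
    keep = tree.feature_importances_ > 0
    # Phase 2: "localize" the features by damping every term the tree ignored,
    # so noisy terms are weakened before the Bayesian classifier is trained.
    scale = np.where(keep, 1.0, damp)  # damp=0.1 is an assumed value
    nb = MultinomialNB()
    nb.fit(X * scale, train_labels)
    return vec, scale, nb

def two_phase_predict(texts, vec, scale, nb):
    return nb.predict(vec.transform(texts).toarray() * scale)

Damping rather than deleting the unselected terms keeps them available to the classifier while letting the tree-selected local features dominate, which matches the abstract's goal of weakening, not removing, unimportant keywords.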
Table of Contents

Chapter 1  Introduction....................................1
1.1 Preface................................................1
1.2 Research Motivation and Objectives.....................2
1.3 Thesis Organization....................................4

Chapter 2  Related Work....................................5
2.1 Document Classification Workflow.......................6
2.2.1 Feature Extraction...................................8
2.2.2 Document Preprocessing..............................11
2.2.3 Document Representation and Restoration.............12
2.2.4 Feature Selection...................................14
2.3 Document Classification...............................17
2.4 Document Learning.....................................18
2.5 Overview of Classification Algorithms.................20
2.5.1 Rocchio Classifier..................................21
2.5.2 Widrow-Hoff Classifier..............................22
2.5.3 Decision Tree Classifier............................23
2.5.4 Support Vector Machine (SVM)........................24
2.5.5 k-Nearest Neighbors (KNN)...........................25
2.5.6 Naïve Bayes Classifier..............................27

Chapter 3  Methodology....................................28
3.1 Discussion of the Classification Problem..............28
3.2 Improving the TFIDF Weighting.........................31
3.3 Adjusting Weights with a Decision Tree................34
3.4 Two-phase Local Learning..............................38
    Example 1.............................................41

Chapter 4  Experimental Method and Procedure..............43
4.1 Data Source...........................................43
4.2 Preprocessing Results.................................46
4.3 Bayesian Probability Classification...................49
4.4 Experimental Results..................................50

Chapter 5  Conclusions and Future Work....................54
5.1 Conclusions...........................................54
5.2 Future Work...........................................54

References................................................55
English Manuscript........................................58
References:

[1] Vapnik V., Golowich S., and Smola A., 1997, "Support vector method for function approximation, regression estimation, and signal processing", Neural Information Processing Systems 9, pp. 281-287.

[2] Müller K.-R., Smola A. J., Rätsch G., et al., 1997, "Predicting time series with support vector machines", Proc. of ICANN'97, Springer Lecture Notes in Computer Science, pp. 999-1005.

[3] Maron M. E., 1961, "Automatic indexing: an experimental inquiry", Journal of the ACM, Vol. 8, pp. 404-417.

[4] Kwok K. L., 1975, "The use of title and cited titles as document representation for automatic classification", Information Processing and Management, Vol. 11, pp. 201-206.

[5] Hamill K. A. and Zamora A., 1980, "The use of titles for automatic document classification", JASIS, Vol. 31, No. 6, pp. 396-402.

[6] Larson R. R., 1992, "Experiments in automatic Library of Congress classification", JASIS, Vol. 43, No. 2, pp. 130-148.

[7] Blosseville M. J., Hebrail G., Monteil M. G., and Penot N., 1992, "Automatic document classification: natural language processing, statistical analysis, and expert system techniques used together", SIGIR'92: Proc. of the 15th Ann. International ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 51-57.

[8] Lewis D. D., 1992, "An evaluation of phrasal and clustered representations on a text categorization task", SIGIR'92: Proc. of the 15th Ann. International ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 37-50.

[9] Joachims T., 1997, "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization", Proceedings of ICML-97, 14th International Conference on Machine Learning, pp. 143-151.

[10] Jing L.-P., Huang H.-K., and Shi H.-B., 2002, "Improved feature selection approach TFIDF in text mining", Proceedings of the 2002 International Conference on Machine Learning and Cybernetics, Vol. 2, pp. 944-946.

[11] http://rocling.iis.sinica.edu.tw/CKIP/

[12] http://godel.iis.sinica.edu.tw/ROCLING/CNS98.DOC

[13] Soucy P. and Mineau G. W., 2001, "A simple KNN algorithm for text categorization", Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM 2001), pp. 64-68.

[14] Hwang W.-J. and Wen K.-W., 1998, "Fast kNN classification algorithm based on partial distance search", Electronics Letters, Vol. 34, No. 21, pp. 2062-2063.

[15] Parthasarathy G. and Chatterji B. N., 1990, "A class of new KNN methods for low sample problems", IEEE Transactions on Systems, Man and Cybernetics, Vol. 20, No. 3, pp. 715-718.

[16] Lu M., Hu K., Wu Y., Lu Y., and Zhou L., 2002, "SECTCS: towards improving VSM and Naive Bayesian classifier", 2002 IEEE International Conference on Systems, Man and Cybernetics, Vol. 5.

[17] Huang H.-J. and Hsu C.-N., 2002, "Bayesian classification for data from the same unknown class", IEEE Transactions on Systems, Man and Cybernetics, Part B, Vol. 32, No. 2, pp. 137-145.

[18] Navone H. D., Cook D., Downs T., and Chen D., 1999, "Boosting Naive-Bayes classifiers to predict outcomes for hip prostheses", International Joint Conference on Neural Networks (IJCNN '99), Vol. 5, pp. 3622-3626.

[19] Lewis D. D., Schapire R. E., Callan J. P., and Papka R., 1996, "Training algorithms for linear text classifiers", Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval, pp. 298-306.

[20] Quinlan J. R., 1986, "Induction of decision trees", Machine Learning, Vol. 1, pp. 81-106.

[21] Mitchell T. M., 1997, "Machine Learning", The McGraw-Hill Companies, Inc.

[22] Salton G. and Buckley C., 1988, "Term weighting approaches in automatic text retrieval", Information Processing and Management, Vol. 24, No. 5, pp. 513-523.

[23] Aas K. and Eikvil L., 1999, "Text categorization: a survey", Technical Report, Norwegian Computing Center.

[24] Fan W., Gordon M. D., and Pathak P., 2004, "Discovery of context-specific ranking functions for effective information retrieval using genetic programming", IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 4, pp. 523-527.
Thesis Use Permissions:
  • The author does not agree to grant a royalty-free license for in-library readers to reproduce the printed thesis for academic purposes.
  • The author agrees to authorize browsing/printing of the electronic full text, publicly available from 2005-07-21.

