§ Browse thesis bibliographic record
  
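System ID U0002-1707200515384500
DOI 10.6846/TKU.2005.00358
Title (Chinese) 決策樹二階段局部特徵分類
Title (English) The Decision Tree of the Construct of the Two-phase Document Classification System in Local Feature
Title (third language)
University Tamkang University (淡江大學)
Department (Chinese) 資訊工程學系碩士班
Department (English) Department of Computer Science and Information Engineering
Foreign degree: university
Foreign degree: college
Foreign degree: graduate institute
Academic year 93 (ROC calendar)
Semester 2
Publication year 94 (ROC calendar; 2005)
Author (Chinese) 廖冠登
Author (English) Kuan-Teng Liao
Student ID 692192072
Degree Master's
Language Traditional Chinese
Second language
Oral defense date 2005-06-17
Pages 57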
Oral defense committee Advisor - 蔣定安
Member - 葛煥昭
Member - 王鄭慈
Keywords (Chinese) 文件分類
決策樹
文字探勘
Keywords (English) Document Classification
Decision Tree
Text Mining
Keywords (third language)
Subject classification
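Abstract (Chinese)
Using the notion of keywords, we can derive suitable classification rules, namely category keywords, from a set of documents that have already been labeled, and then use these rules to classify unlabeled documents.
The training process for document classification starts from sample documents. It computes how often each feature term occurs in the samples and how it is distributed, and uses these statistics to judge whether the term is representative of a category; if so, the term becomes a classification rule. However, the feature terms of a document are often related to one another, and a document may also carry a large amount of noise. To handle these relations effectively and filter out unnecessary noise, this thesis proposes a decision tree method to address the correlation between terms, combined with local featurization, which weakens unimportant keywords so that the important ones stand out. The results show that, with a small number of samples, combining the decision tree with two-phase featurization also achieves good accuracy and recall in document classification.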
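The term statistics described above (how often a term occurs in a document and how it is spread across documents) can be illustrated with a minimal sketch of standard TF-IDF weighting. This is not the thesis's improved weighting, which this record does not specify; the function name `tfidf_weights` and the sample documents are assumptions made for illustration only.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per tokenized document (plain TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Hypothetical, already tokenized sample documents.
docs = [
    ["stock", "market", "price", "stock"],
    ["team", "match", "score", "price"],
    ["stock", "fund", "market"],
]
weights = tfidf_weights(docs)
```

In this sketch a term that appears in every document gets weight zero, and a term concentrated in few documents gets a higher weight, which is the sense in which the distribution of a term indicates whether it is representative.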
Abstract (English)
By using feature keywords, we can obtain appropriate rules from a group of labeled documents and use these rules to classify documents that have not been labeled. In this paper, we discuss how to choose training data as a basis, how to calculate the weights of all keywords, how to judge the importance of keywords by their distribution, and how to solve the problem of correlation between keywords.

We try to handle the relations between keywords efficiently and to filter out noise. First, we use a decision tree to address the correlation problem, because it can ignore the relations between words. Second, we use the two-phase local feature to reduce the amount of noise. The results in Chapter 4 show that this approach is more effective than before.
Abstract (third language)
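One common way a decision tree chooses which keyword to split on is information gain, as in ID3-style trees. The sketch below assumes that criterion; the record does not state which splitting rule the thesis uses, and the names `entropy` and `information_gain` and the sample data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(docs, labels, term):
    """Entropy reduction from splitting documents on presence of `term`."""
    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_term = [lab for doc, lab in zip(docs, labels) if term not in doc]
    # Weighted entropy of the two partitions; empty partitions are skipped.
    remainder = sum(
        (len(part) / len(labels)) * entropy(part)
        for part in (with_term, without_term) if part
    )
    return entropy(labels) - remainder

# Hypothetical labeled documents, each represented as a set of keywords.
docs = [
    {"stock", "market"},
    {"stock", "fund"},
    {"team", "match"},
    {"team", "market"},
]
labels = ["finance", "finance", "sports", "sports"]

gain_stock = information_gain(docs, labels, "stock")
gain_market = information_gain(docs, labels, "market")
```

Here "stock" separates the two classes completely, while "market" appears equally in both classes and tells the tree nothing, so a tree built this way would split on "stock" first.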
Table of contents
Contents

Chapter 1 Introduction
1.1 Preface
1.2 Research motivation and objectives
1.3 Thesis organization

Chapter 2 Related literature and research
2.1 Document classification workflow
2.2.1 Feature extraction
2.2.2 Document preprocessing
2.2.3 Document representation and restoration
2.2.4 Feature selection
2.3 Document classification
2.4 Document learning
2.5 Introduction to classification algorithms
2.5.1 Rocchio classifier
2.5.2 Widrow-Hoff classifier
2.5.3 Decision tree classifier
2.5.4 SVM (support vector machine)
2.5.5 KNN (k-nearest neighbors)
2.5.6 Naïve Bayes classifier

Chapter 3 Research method
3.1 Discussion of the classification problem
3.2 Improving the TFIDF weights
3.3 Correcting the weights with a decision tree
3.4 Two-phase local learning
    Example 1

Chapter 4 Experimental method and procedure
4.1 Data sources
4.2 Data preprocessing results
4.3 Bayesian probability classification
4.4 Experimental results

Chapter 5 Conclusions and future work
5.1 Conclusions
5.2 Future work

References
English manuscript
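Full-text access rights
On campus
Authorizes on-campus public access to the full text of the electronic thesis
On-campus electronic thesis available immediately
Off campus
Authorization granted
Off-campus electronic thesis available immediately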
