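Electronic Theses and Dissertations Service

§ Browse Thesis Bibliographic Record

The electronic full text of this thesis has been publicly available off campus since 2006-06-29.
The print copy of this thesis has been publicly available since 2006-06-29.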

System ID	U0002-2906200609220700
DOI	10.6846/TKU.2006.00916
Title (Chinese)	局部特徵強化結合關聯式法則與特殊類別優先權分類系統建置
Title (English)	The construct of document classification system in strengthening local feature with association rule and special priority of classification
Title (third language)
Institution	Tamkang University
Department (Chinese)	資訊工程學系碩士班
Department (English)	Department of Computer Science and Information Engineering
Foreign degree: university name
Foreign degree: college name
Foreign degree: graduate institute name
Academic year	94
Semester	2
Year of publication	95
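Author (Chinese)	廖英凱
Author (English)	Ying-Kai Liao
Student ID	693192253
Degree	Master's
Language	Traditional Chinese
Second language
Oral defense date	2006-06-13
Number of pages	67
Committee	Advisor - 蔣定安; Member - 王鄭慈; Member - 葛煥昭
Keywords (Chinese)	文件分類; 關聯式法則; 文字探勘
Keywords (English)	document classification; association rule; text mining
Keywords (third language)
Subject classification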
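Abstract (Chinese)	Using the notion of keywords, we can derive suitable classification rules from a set of labeled documents; that is, category keywords serve as the basis for classifying unlabeled documents. Training starts from the sample documents: the occurrence and distribution of each feature term in the samples are counted, and these statistics determine whether the term is representative of a category. If it is, the term becomes a classification rule. A document may also carry a large amount of noise. To filter out unnecessary noise effectively, this thesis proposes an improved TFIDF method for computing keyword weights, then uses association rules to find compound keywords that help classification and applies them to adjust document weights, and finally assigns different priorities to different categories according to the characteristics of the document data. The experimental results show that the corrections proposed in this thesis substantially improve the efficiency of document classification.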
Abstract (English)	By using feature keywords, we can obtain appropriate rules from a group of labeled documents and use these rules to classify documents that have not been labeled. In this thesis, we discuss how to choose training data as a basis, how to calculate the weights of all keywords, and how to judge the importance of keywords by their distribution. First, we use an improved method to calculate keyword weights. Then we combine two words into a new word by association rules, which increases the number of keywords. Finally, according to the characteristics of the data, we give different categories different priorities. This makes the classification more efficient.
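The abstract refers to a TFIDF-based keyword weight, but this record does not reproduce the improved formula (listed as Equation 3.3-1 in the thesis). As a rough point of reference only, the sketch below computes the conventional TF-IDF weight, tf(t, d) × log(N / df(t)). The function name `tfidf_weights` and the toy documents are hypothetical and do not come from the thesis.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Conventional TF-IDF per document; NOT the thesis's improved variant.

    docs: list of token lists (already segmented into terms).
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        total = len(tokens)
        # Relative term frequency times inverse document frequency.
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


# Hypothetical toy corpus, for illustration only.
docs = [
    ["enzyme", "protein", "cell"],
    ["protein", "reaction", "acid"],
    ["cell", "protein", "gene"],
]
w = tfidf_weights(docs)
```

Under this conventional weighting, a term that occurs in every document (here `protein`) receives weight 0, while a term confined to one document (here `enzyme`) receives the largest weight in that document.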
第三語言摘要
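Table of contents
Chapter 1 Introduction
1.1 Preface
1.2 Research Motivation and Objectives
1.3 Thesis Organization
Chapter 2 Related Literature and Research Review
2.1 Document Classification Workflow
2.2.1 Feature Extraction
2.2.2 Document Preprocessing
2.2.3 Document Representation and Restoration
2.2.4 Feature Selection
2.3 Document Classification
2.4 Machine Learning
2.5 Classification Algorithms
2.5.1 Rocchio Classifier
2.5.2 Widrow-Hoff Classifier
2.5.3 Decision Tree Classifier
2.5.4 SVM (Support Vector Machine)
2.5.5 KNN (k-Nearest Neighbors)
2.5.6 Naïve Bayes Classifier
2.6 Association Rule Analysis
Chapter 3 Research Method
3.1 Classification System Workflow
3.2 Keyword Selection
3.3 Improving the Traditional TFIDF Weight
3.4 Combining Multiple Terms into Keywords with Association Rules
3.5 Correction for Cross-Domain Document Classification
Chapter 4 Experimental Method and Procedure
4.1 Data Sources
4.2 Data Preprocessing Results
4.3 Bayesian Probabilistic Classification
4.4 Experimental Results
Chapter 5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
References
English Paper

List of Figures
Figure 2.1-1 System workflow of document classification
Figure 2.2.1-1 Representing a document as an array of terms and frequencies
Figure 2.2.2-1 Relationship between feature-term frequency and importance
Figure 2.4-1 Learning process of document classification
Figure 2.4-2 Machine learning workflow
Figure 2.5.4-1 SVM
Figure 3.1-1 Classification system workflow
Figure 3.4-1 Association rule mining results
Figure 3.4-2 Distribution of compound keywords across categories in the Training Data
Figure 3.4-3 Statistics of compound keywords in the Training Data

List of Tables
Table 3.3-1 Weakening of useless keyword weights after the improvement
Table 3.3-2 Strengthening of keyword differences across categories after the improvement
Table 3.4-1 Classification corrections on the Testing Data after combining improved TFIDF with association rules
Table 3.5-1 Correction of chemistry documents misclassified as biology
Table 3.5-2 Cross-domain document classification before and after correction
Table 4.1-1 Number of articles selected from each department
Table 4.1-2 Format of the document descriptions
Table 4.2-1 Documents after word segmentation
Table 4.2-2(a) Keyword statistics and weights obtained with traditional TFIDF
Table 4.2-2(b) Keyword statistics and weights obtained with improved TFIDF
Table 4.4-1 Number of correctly classified documents: traditional method versus each stage of the improvement
Table 4.4-2 Classification recall: traditional method versus each stage of the improvement
Table 4.4-3 Classification accuracy: traditional method versus each stage of the improvement
Table 4.4-4 Chemistry department documents containing many biology-leaning terms

List of Equations
Equation 3.3-1 Improved TFIDF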
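Full-text access rights	On campus: print thesis available immediately; electronic full text authorized for on-campus access; electronic thesis available on campus immediately. Off campus: electronic thesis authorized for off-campus access, available immediately.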

