Tamkang University Chueh Sheng Memorial Library (TKU Library)

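System ID: U0002-2906200609220700
Thesis title (Chinese): 局部特徵強化結合關聯式法則與特殊類別優先權分類系統建置
Thesis title (English): The construct of document classification system in strengthening local feature with association rule and special priority of classification
University: Tamkang University
Department (Chinese): 資訊工程學系碩士班
Department (English): Department of Computer Science and Information Engineering
Academic year: 94 (ROC calendar; 2005-2006)
Semester: 2
Year of publication: 95 (ROC calendar; 2006)
Author (Chinese name): 廖英凱
Author (English name): Ying-Kai Liao
E-mail: 693192253@s93.tku.edu.tw
Student ID: 693192253
Degree: Master's
Language: Chinese
Oral defense date: 2006-06-13
Number of pages: 67
Oral defense committee: Advisor - 蔣定安
Keywords (Chinese): 文件分類; 關聯式法則; 文字探勘
Keywords (English): document classification; association rule; text mining
Subject classification: Applied Sciences > Information Engineering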
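Abstract (translated from Chinese): Using the notion of keywords, we can derive suitable classification rules from a set of documents that have already been labeled with categories. That is, we extract category keywords and use them as the basis for classifying documents whose category is not yet labeled.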
Abstract (English): By using feature keywords, we can derive appropriate classification rules from a set of labeled documents and then use those rules to classify documents that have not been labeled. This thesis discusses how to select training data as a basis, how to calculate the weight of every keyword, and how to judge each keyword's importance from its distribution. First, we use an improved method to calculate keyword weights. Next, we use association rules to combine two words into a new compound keyword, which enlarges the keyword set. Finally, according to the characteristics of the data, we assign different priorities to different categories, which makes the classification more efficient.
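The two core steps named in the abstract, keyword weighting and combining co-occurring words into compound keywords, can be illustrated with a minimal Python sketch. This record does not reproduce the thesis's improved TFIDF (Formula 3.3-1) or its rule-mining settings, so the sketch assumes the classic TF-IDF weight and a simple support threshold on word pairs. The function names `tfidf_weights` and `frequent_pairs`, the toy documents, and the threshold value are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_weights(docs):
    """Classic TF-IDF (an assumption, not the thesis's improved formula).

    weight(t, d) = (count of t in d / length of d) * log(N / df(t))
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


def frequent_pairs(docs, min_support):
    """Word pairs co-occurring in at least `min_support` (a fraction) of docs.

    Each surviving pair is a candidate compound keyword.
    """
    n_docs = len(docs)
    pair_counts = Counter()
    for doc in docs:
        # Sort so that each unordered pair has one canonical form.
        pair_counts.update(combinations(sorted(set(doc)), 2))
    return {
        pair: count / n_docs
        for pair, count in pair_counts.items()
        if count / n_docs >= min_support
    }


# Hypothetical, already-segmented toy documents.
docs = [
    ["protein", "enzyme", "reaction"],
    ["protein", "enzyme", "cell"],
    ["reaction", "catalyst", "solvent"],
]
weights = tfidf_weights(docs)
pairs = frequent_pairs(docs, min_support=0.6)
```

With these toy documents, only the pair ("enzyme", "protein") reaches the 0.6 support threshold, so it would be the single compound keyword added to the vocabulary. A term that appears in one document only, such as "catalyst", receives a higher weight than one shared by two documents, such as "protein".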
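Thesis table of contents: Contents
Chapter 1 Introduction 1
1.1 Preface 1
1.2 Research motivation and objectives 2
1.3 Thesis organization 5
Chapter 2 Related literature and research 6
2.1 Document classification process 7
2.2.1 Feature extraction 9
2.2.2 Document data preprocessing 11
2.2.3 Document representation and restoration 13
2.2.4 Feature selection 16
2.3 Document classification 19
2.4 Machine learning 21
2.5 Introduction to classification algorithms 23
2.5.1 Rocchio classifier 23
2.5.2 Widrow-Hoff classifier 24
2.5.3 Decision tree classifier 25
2.5.4 SVM (support vector machine) 26
2.5.5 KNN (k-nearest neighbors) 27
2.5.6 Naïve Bayes classifier 28
2.6 Association rule analysis 29
Chapter 3 Research method 30
3.1 Classification system workflow 30
3.2 Keyword selection 32
3.3 Improving the traditional TFIDF weights 34
3.4 Using association rules to combine multiple terms into keywords 37
3.5 Classification correction for cross-domain documents 43
Chapter 4 Experimental method and procedure 46
4.1 Data source 47
4.2 Data preprocessing results 49
4.3 Bayesian probability classification 51
4.4 Experimental results 52
Chapter 5 Conclusions and future work 56
5.1 Conclusions 56
5.2 Future work 57
References 58
English paper 60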
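Figure 2.1-1 System workflow of document classification 8
Figure 2.2.1-1 Representing a document as an array of terms and frequencies 10
Figure 2.2.2-1 Relationship between feature-term frequency and importance 12
Figure 2.4-1 Learning process for document classification 21
Figure 2.4-2 Machine learning workflow 22
Figure 2.5.4-1 SVM 26
Figure 3.1-1 Classification system workflow 30
Figure 3.4-1 Association rule mining results 40
Figure 3.4-2 Distribution of compound keywords across categories in the Training Data 41
Figure 3.4-3 Statistics of compound keywords in the Training Data 41
Table 3.3-1 Weakening of useless keywords' weights after the improvement 35
Table 3.3-2 Strengthened differences of keywords across categories after the improvement 36
Table 3.4-1 Classification corrections on the Testing Data after combining improved TFIDF with association rules 42
Table 3.5-1 Correction of chemistry documents misclassified as biology 44
Table 3.5-2 Cross-domain document classification before and after correction 45
Table 4.1-1 Number of articles selected from each department 47
Table 4.1-2 Format of the document description 48
Table 4.2-1 Result of a document after word segmentation 49
Table 4.2-2(a) Keyword statistics and weights obtained with traditional TFIDF 50
Table 4.2-2(b) Keyword statistics and weights obtained with improved TFIDF 50
Table 4.4-1 Number of correctly classified documents: traditional method versus each stage of the improvement 53
Table 4.4-2 Classification recall: traditional method versus each stage of the improvement 53
Table 4.4-3 Classification precision: traditional method versus each stage of the improvement 54
Table 4.4-4 Large number of biology-leaning terms appearing in Department of Chemistry documents 55
Formula 3.3-1 Improved TFIDF 34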
References: [1]Vapnik V., Golowich S., Smola A.,1997, “Support vector method for function approximation, regression estimation, and signal processing”,Neural Information Processing Systems 9, pp. 281--287
[2]Müller K-R, Smola A J, Rätsch G, et al.,1997, “Predicting time series with support vector machines”,In: Proc. of ICANN'97, Springer Lecture Notes in Computer Science, pp999-1005
[3]Maron M.E.,1961, “Automatic Indexing : an Experimental Inquiry”,J. of the ACM,V8,pp404-417
[4]Kwok K.L.,1975, “The Use of Title and Cited Titles as Document Representation for Automatic Classification”,Inform Proc. and Manag,V11,pp201-206
[5]Hamill Karen A. and Zamora Antonio,1980, “The Use of Titles for Automatic Document Classification”,JASIS,V31,n6,pp396-402
[6]Larson Ray R.,1992, “Experiments in Automatic Library of Congress Classification”,JASIS,V43,n2,pp130-148
[7]Blosseville M.J., and Hebrail G., and Monteil M.G. and Penot N.,1992, “Automatic Document Classification : Natural Language Processing, Statistical Analysis, and Expert System Techniques Used Together”,SIGIR'92 : Proc. of 15th Ann. International ACM SIGIR Conf. on R. and D. in Inform. Retr,pp51-57
[8]Lewis David D.,1992, “An Evaluation of Phrasal and Clustered Representation on a Text Categorization Task”,SIGIR'92:Proc. of 15th Ann. International ACM SIGIR Conf. on R. and D. in Inform. Retr.,pp37-50
[9]Thorsten Joachims, “A probabilistic analysis of the Rocchio Algorithm with TFIDF for text categorization”,Proceedings of ICML-97, 14th International Conference on Machine Learning, pp. 143--151, 1997.
[10]Li-Ping Jing; Hou-Kuan Huang; Hong-Bo Shi, “Improved feature selection approach TFIDF in text mining”, Proceedings of the 2002 International Conference on Machine Learning and Cybernetics, Volume: 2, 4-5 Nov. 2002,pp944-946
[13]Soucy, P.; Mineau, G.W., “A simple KNN algorithm for text categorization”, Proceedings of the IEEE International Conference on Data Mining (ICDM 2001), 29 Nov.-2 Dec. 2001,pp64-68
[14]Wen-Jyi Hwang; Kuo-Wei Wen, “Fast kNN classification algorithm based on partial distance search”, Electronics Letters, Volume: 34, Issue: 21, 15 Oct. 1998,pp2062-2063
[15]Parthasarathy, G.; Chatterji, B.N., “A class of new KNN methods for low sample problems”, IEEE Transactions on Systems, Man and Cybernetics, Volume: 20, Issue: 3, May-June 1990,pp715-718
[16]Mingyu Lu; Keyun Hu; Yi Wu; Yuchang Lu; Lizhu Zhou, “SECTCS: towards improving VSM and Naive Bayesian classifier”, 2002 IEEE International Conference on Systems, Man and Cybernetics, Volume: 5, 6-9 Oct. 2002,pp5
[17]Hung-Ju Huang; Chun-Nan Hsu, “Bayesian classification for data from the same unknown class”, IEEE Transactions on Systems, Man and Cybernetics, Part B, Volume: 32, Issue: 2, April 2002,pp137-145
[18]H D Navone, D Cook, T Downs and D Chen, “Boosting Naive-Bayes classifiers to predict outcomes for hip prostheses”, IJCNN '99, International Joint Conference on Neural Networks, Volume: 5, 10-16 July 1999,pp3622-3626
[19]D. D. Lewis, R. E. Schapire, J. P. Callan, and R. Papka, 1996, “Training algorithms for linear text classifiers”, Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval,pp298--306
[20]J. R. Quinlan, 1986, “Induction of decision trees”, Machine Learning, vol. 1, pp81-106
[21]Tom M. Mitchell, 1997, “Machine Learning”, The McGraw-Hill Companies, Inc.
[22]G. Salton and C. Buckley, 1988, “Term weighting approaches in automatic text retrieval”, Information Processing and Management, vol. 24, No. 5, pp. 513-523.
[23]K. Aas and L. Eikvil, 1999, “Text categorization: A survey”, Technical report, Norwegian Computing Center
[24]Weiguo Fan, Michael D. Gordon, and Praveen Pathak, 2004, “Discovery of Context-Specific Ranking Functions for Effective Information Retrieval Using Genetic Programming”, IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 4, pp. 523-527
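  • The author grants a royalty-free license for the print copy to be reproduced by in-library readers for academic purposes; available from 2006-06-29.
  • The author authorizes the browsing and printing service for the electronic full text; available from 2006-06-29.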
