System ID | U0002-1807200700243400 |
---|---|
DOI | 10.6846/TKU.2007.00524 |
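Thesis title (Chinese) | 利用關聯式法則將中文文件分類 |
Thesis title (English) | Classifying Chinese Text Documents by Association rule |
Thesis title (third language) | |
Institution | Tamkang University (淡江大學) |
Department (Chinese) | 資訊工程學系碩士在職專班 |
Department (English) | Department of Computer Science and Information Engineering |
Foreign degree: university | |
Foreign degree: college | |
Foreign degree: graduate institute | |
Academic year | 95 (ROC calendar; 2006-07) |
Semester | 2 |
Year of publication | 96 (ROC calendar; 2007) |
Graduate student (Chinese) | 李卓銘 |
Graduate student (English) | Cho-Ming Lee |
Student ID | 794190081 |
Degree | Master's |
Language | Traditional Chinese |
Second language | |
Oral defense date | 2007-06-14 |
Number of pages | 66 |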
Oral defense committee | Advisor - 黃連進; Committee member - 王鄭慈; Committee member - 蔣定安 |
Keywords (Chinese) |
文件分類 關聯式法則 文字探勘 |
Keywords (English) |
document classification; association rule; text mining |
Keywords (third language) | |
Subject classification | |
Abstract (translated from Chinese) |
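An improved TFIDF formula is used to compute the weight of each feature term. From the resulting weight table, the sum of weights of each document with respect to each category can be computed. At the same time, association rule mining is used to find feature terms that occur together in a single document, and these co-occurring terms are treated as new rules. The occurrences of each new rule across the categories of the training documents are tallied, and the rules that can help classification are selected according to each rule's confidence and support. The selected rules are then used to correct misclassified documents and so raise classification accuracy. Besides using the improved TFIDF to weaken the weights of noise terms that are distributed too widely, which reduces the impact of terms that preprocessing failed to remove, this thesis relies mainly on the new rules produced by association rule mining. Redundant rules are filtered for each of the possible cases, rules are applied in order of decreasing confidence and then decreasing rule length to correct classification errors, and the order in which categories are considered is adjusted, so that classification accuracy improves. The experimental results show that, after correction by the proposed method, the efficiency of document classification improves substantially. |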
Abstract (English) |
An improved TFIDF formula is used to build a weight table, from which the system computes the sum of weights of each document relative to each category. In this way, documents that have not yet been labeled can be classified. In this paper, we use the improved TFIDF to calculate keyword weights and then use association rules to combine two words into a new term, which enlarges the keyword set. The features in the weight table are fed into an association rule miner, and the resulting rules are ranked by confidence, support, and rule length to decide which are saved in the rule base. This makes classification more efficient. |
Abstract (third language) | |
Table of contents |
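Chapter 1 Introduction
1.1 Preface
1.2 Research Motivation and Objectives
1.3 Thesis Organization
Chapter 2 Related Literature and Research
2.1 Document Classification Process
2.2 Feature Extraction
2.2.1 Document Preprocessing
2.2.2 Document Representation and Restoration
2.2.3 Feature Selection
2.2.4 Association Rule Analysis
2.3 Document Classification
Chapter 3 Research Method
3.1 Classification System Workflow
3.2 Keyword Selection
3.3 Combining Multiple Terms into Keywords Using Association Rules
Chapter 4 Experimental Method and Procedure
4.1 Data Sources
4.2 Data Preprocessing Results
4.3 Experimental Results
Chapter 5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
References
English Paper
List of Figures
Figure 2-1 System workflow for document classification
Figure 2-2 Relationship between feature-term frequency and importance
Figure 2-3 Representing a document as an array of terms and frequencies
Figure 2-4 Illustration of the SVM vector space
Figure 3-1 Classification system workflow
Figure 3-2 Association rule mining results
Figure 3-3 Distribution of new rules across categories in the training documents
Figure 3-4 Statistics of new rules in the training documents
List of Tables
Table 2-1 Weights of useless feature terms are weakened after the improvement
Table 2-2 Differences of feature terms across categories are strengthened after the improvement
Table 3-1 Retaining rules with high confidence
Table 3-2 Retaining short rules and deleting long rules
Table 3-3 Retaining rules whose confidence differs across categories
Table 3-4 Priority order of the same term in different categories
Table 3-5 Classification corrections from combining improved TFIDF with association rules
Table 4-1 Number of articles selected from each department
Table 4-2 Document description format
Table 4-3 Documents after word segmentation
Table 4-4(a) Statistics and weights of feature terms under traditional TFIDF
Table 4-4(b) Statistics and weights of feature terms under improved TFIDF
Table 4-5 Correctly classified counts: traditional method versus each stage of the improved method
Table 4-6 Predicted classification counts: traditional method versus each stage of the improved method
Table 4-7 Classification recall: traditional method versus each stage of the improved method
Table 4-8 Classification precision: traditional method versus each stage of the improved method
Table 4-9 Recall, precision, and F1 after correction by the new rules |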
Full-text access rights |
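
The rule-selection step described in the abstract (tally co-occurring feature terms per category, keep rules that pass support and confidence thresholds, apply them in order of decreasing confidence and then decreasing rule length) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names `mine_rules` and `correct_category`, the threshold values, the restriction to term pairs, and the toy English training data are all assumptions made for illustration, and the improved TFIDF weighting is not reproduced because its formula does not appear in this record.

```python
from collections import Counter
from itertools import combinations

def mine_rules(docs, min_support=0.5, min_confidence=1.0, max_len=2):
    """docs: list of (set_of_terms, category) pairs (hypothetical input format).

    Returns rules (itemset, category, support, confidence), ordered by
    decreasing confidence and then decreasing rule length.
    """
    n = len(docs)
    itemset_count = Counter()       # documents containing the term set
    itemset_cat_count = Counter()   # ... that also carry a given category
    for terms, category in docs:
        for size in range(2, max_len + 1):
            for itemset in combinations(sorted(terms), size):
                itemset_count[itemset] += 1
                itemset_cat_count[(itemset, category)] += 1

    rules = []
    for (itemset, category), hits in itemset_cat_count.items():
        support = hits / n                          # share of all documents
        confidence = hits / itemset_count[itemset]  # share of matching documents
        if support >= min_support and confidence >= min_confidence:
            rules.append((itemset, category, support, confidence))

    # Ordering taken from the abstract; the last two keys only break ties.
    rules.sort(key=lambda r: (-r[3], -len(r[0]), r[0], r[1]))
    return rules

def correct_category(terms, predicted, rules):
    """Return the category of the first matching rule, else keep the prediction."""
    for itemset, category, _support, _confidence in rules:
        if set(itemset) <= terms:
            return category
    return predicted

# Toy training set (assumed data, not from the thesis).
training_docs = [
    ({"network", "router", "packet"}, "CS"),
    ({"network", "router", "protocol"}, "CS"),
    ({"market", "price", "network"}, "Business"),
    ({"market", "price", "demand"}, "Business"),
]
rules = mine_rules(training_docs)
corrected = correct_category({"network", "router", "price"}, "Business", rules)
```

With this toy data only the pairs (`market`, `price`) and (`network`, `router`) pass both thresholds, so a document containing `network` and `router` that a weight-sum classifier had labeled `Business` is relabeled `CS`.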