System ID | U0002-1608200923555900 |
---|---|
DOI | 10.6846/TKU.2009.00571 |
Title (Chinese) | 使用關聯式法則解決決策樹不相關值問題 |
Title (English) | Remove The Irrelevant Values Problem in the Decision Tree by Using Association Rule |
Title (third language) | |
University | 淡江大學 (Tamkang University) |
Department (Chinese) | 資訊工程學系碩士在職專班 |
Department (English) | Department of Computer Science and Information Engineering |
Foreign-degree university | |
Foreign-degree college | |
Foreign-degree institute | |
Academic year | 97 (ROC calendar; AY 2008-2009) |
Semester | 2 |
Year of publication | 98 (ROC calendar; 2009) |
Author (Chinese) | 林靜宜 |
Author (English) | Ching-Yi Lin |
Student ID | 796410032 |
Degree | Master's |
Language | Traditional Chinese |
Second language | |
Date of oral defense | 2009-06-17 |
Number of pages | 83 |
Committee | Advisor: 蔣定安 (Ding-An Chiang); Members: 葛煥昭, 王鄭慈 (Cheng-Tzu Wang), 蔣定安 (Ding-An Chiang) |
Keywords (Chinese) | 決策樹; 不相關值問題; 離散化; 關聯式法則 |
Keywords (English) | Decision Tree; Irrelevant Values Problem; Discretization; Association Rule |
Keywords (third language) | |
Subject classification | |
Abstract (Chinese) |
A decision tree is built recursively with a top-down, divide-and-conquer strategy, but its construction may give rise to the "irrelevant values problem"; moreover, because of the tree structure, the choice of cut points changes once irrelevant values are removed. In addition, decision trees discretize continuous numeric attributes with a local method, and local discretization presumes a dependency relationship among attributes; when no such dependency exists, global discretization outperforms local discretization. This thesis therefore proposes a method for removing the irrelevant values problem and uses association rules to integrate the classification-rule templates with the attribute cut points obtained by global discretization; it also proposes a classifier-construction method that combines all of the classification rules. The experimental results show that the proposed method not only removes irrelevant values but also becomes more accurate once the global cut points are integrated through association rules. |
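The abstract contrasts global with local discretization of continuous attributes. As a hedged illustration (not code from the thesis; the attribute column and bin count are invented), global discretization computes cut points once over the whole column, whereas a tree's local method would recompute them from the subset of rows reaching each node:

```python
def global_cut_points(values, k=3):
    """Equal-frequency cut points computed once over the whole column."""
    ordered = sorted(values)
    n = len(ordered)
    # one cut point between each of the k equal-frequency bins
    return [ordered[(i * n) // k] for i in range(1, k)]

def discretize(value, cuts):
    """Map a continuous value to the index of its interval."""
    for i, c in enumerate(cuts):
        if value <= c:
            return i
    return len(cuts)

alcohol = [5, 12, 3, 40, 25, 18, 7, 30, 22]   # hypothetical attribute column
cuts = global_cut_points(alcohol)              # cut points over all rows
bins = [discretize(v, cuts) for v in alcohol]  # interval index per row
```

A local variant would call `global_cut_points` on each node's subset instead, which is where the dependency assumption discussed in the abstract comes in.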
Abstract (English) |
A decision tree is built recursively, following a top-down, divide-and-conquer strategy. However, tree construction may produce the irrelevant values problem, and because of the tree structure, removing irrelevant values also changes the choice of cut points. Furthermore, decision trees discretize continuous attributes with a local method, which assumes a dependency relationship among attributes; when no such dependency exists, global discretization methods outperform local ones. This thesis proposes a method to remove the irrelevant values problem and uses association rules to integrate the classification-rule templates with the attribute cut points produced by global discretization; it also proposes a classifier-construction method that combines all of the classification rules. The experimental results show that the proposed method not only removes irrelevant values but also becomes more accurate once the global cut points are integrated through association rules. |
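The irrelevant values problem above can be sketched concretely (a simplified stand-in, not the thesis's DTMGC procedure; the rules are invented): once a tree is converted to rules, a condition is irrelevant when every rule agreeing on the remaining conditions predicts the same class, so it can be dropped without changing any decision:

```python
def remove_irrelevant(rules):
    """rules: list of (conditions_dict, class_label) pairs."""
    simplified = []
    for conds, label in rules:
        kept = dict(conds)
        for attr in list(kept):
            rest = {a: v for a, v in kept.items() if a != attr}
            # all rules that agree with this one on the remaining conditions
            matches = [c for other, c in rules
                       if all(other.get(a) == v for a, v in rest.items())]
            if matches and all(c == label for c in matches):
                del kept[attr]   # attr's value never changes the outcome
        simplified.append((kept, label))
    return simplified

rules = [({"A": "a1", "B": "b1"}, "yes"),
         ({"A": "a1", "B": "b2"}, "yes"),
         ({"A": "a2", "B": "b1"}, "no")]
simplified = remove_irrelevant(rules)
```

Note that this greedy sweep is order-dependent; the thesis additionally re-selects cut points after removal, which this sketch does not attempt.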
Abstract (third language) | |
Table of contents |
Chapter 1 Introduction
  1.1 Research background
  1.2 Research motivation and objectives
  1.3 Thesis organization
Chapter 2 Literature review
  2.1 Discretization of continuous attributes
    2.1.1 Global and local discretization
    2.1.2 Supervised and unsupervised discretization
    2.1.3 Static and dynamic discretization
    2.1.4 Direct and incremental discretization
    2.1.5 Top-down and bottom-up discretization
  2.2 Classification methods
    2.2.1 Decision-tree classification algorithms
      2.2.1.1 The C4.5 algorithm
      2.2.1.2 Chi-Square Automatic Interaction Detector (CHAID)
      2.2.1.3 Classification and Regression Tree (CART)
  2.3 Problems arising in decision-tree algorithms
    2.3.1 The irrelevant values problem
    2.3.2 The missing branches problem
  2.4 Algorithms for the irrelevant values problem
    2.4.1 Chiang's algorithm
  2.5 Association rules
    2.5.1 The Apriori algorithm
  2.6 Associative classifiers
    2.6.1 Classification Based on Association (CBA)
Chapter 3 Methodology
  3.1 Discussion of decision-tree classification problems
  3.2 Flowchart of the DTMGC (Decision Tree Merge Global Classifier) algorithm
  3.3 Removing irrelevant values from classification rules
  3.4 Merging discretized attributes with classification rules
    3.4.1 The DTMGC_MC (Merge Condition) algorithm
    3.4.2 The DTMGC_STCR (Select Top Confidence Rule) algorithm
  3.5 Example
Chapter 4 Experiments
  4.1 Data sources and experimental procedure
  4.2 Experimental results and validation
Chapter 5 Conclusions and future work
References
Appendix: English version of the thesis
List of figures
  Fig. 2.1 Decision tree diagram
  Fig. 2.2 Absolute partition diagram
  Fig. 2.3 A decision tree containing irrelevant values
  Fig. 2.4 A decision tree containing missing branches
  Fig. 2.5 The decision tree after removing irrelevant nodes
  Fig. 2.6 Itemset derivation by the Apriori algorithm on a product dataset
  Fig. 3.1 Decision tree built from Table 3.1
  Fig. 3.2 Decision tree built from Table 3.2
  Fig. 3.3 Flowchart of the DTMGC algorithm
  Fig. 3.4 The DTMGC_RIR algorithm
  Fig. 3.5 The DTMGC_RRIR algorithm
  Fig. 3.6 The BuilderSubRule algorithm
  Fig. 3.7 Internal operation of Merge Condition
  Fig. 3.8 A>a1 and A<=a3 and B>b1 and B<=b3 → c1, confidence = 80%
  Fig. 3.9 A<a2 and B>b1 and B<=b3 → c2, confidence = 85%
  Fig. 3.10 A>a1 and A<=a2 and B>b1 and B<=b2 → c1, confidence = 90%
  Fig. 3.11 Decision tree of the Blood dataset
List of tables
  Table 2.1 Comparison of decision-tree algorithms
  Table 2.2 Rules converted from the decision tree of Fig. 2.3
  Table 2.3 The rules of Table 2.2 after removing irrelevant values
  Table 3.1 Hypertension control data
  Table 3.2 Adult height and weight data
  Table 3.3 Global discretization cut points for the alcohol-intake attribute
  Table 3.4 Global discretization cut points for the sodium-intake attribute
  Table 3.5 Input transactions for association analysis
  Table 3.6 Candidate set of classification rules
  Table 3.7 Execution order of the classification rules
  Table 3.8 Global discretization cut points for the alcohol-intake attribute
  Table 3.9 Global discretization cut points for the sodium-intake attribute
  Table 3.10 Input transactions for association analysis
  Table 3.11 Rules converted from the Blood decision tree
  Table 3.12 The rules of Table 3.11 after removing irrelevant values
  Table 3.13 Global discretization cut points for the continuous attributes of Table 3.11
  Table 3.14 Global discretization cut points of rule R3 under DTMGC_MR
  Table 3.15 Global discretization cut points of rule R2 under DTMGC_STCR
  Table 3.16 Candidate classification rules of rule R2 under DTMGC_STCR
  Table 3.17 Classifier validity table
  Table 3.18 Classifier built for Table 3.11
  Table 4.1 Average accuracy
  Table 4.2 Cross-validation accuracy
  Table 4.3 Data for rule 3 under methods (II) and (V)
  Table 4.4 Average number of items |
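Section 2.5.1 of the outline covers the Apriori algorithm used later to integrate the rules. A minimal sketch (illustrative transactions, not the thesis's example): frequent k-itemsets are grown from frequent (k-1)-itemsets and pruned by minimum support:

```python
def apriori(transactions, min_support):
    """Return all itemsets whose support meets min_support."""
    items = sorted({i for t in transactions for i in t})
    frequent, k_sets = [], [frozenset([i]) for i in items]
    while k_sets:
        # support counting: how many transactions contain each candidate
        counts = {c: sum(c <= t for t in transactions) for c in k_sets}
        survivors = [c for c, n in counts.items()
                     if n / len(transactions) >= min_support]
        frequent.extend(survivors)
        # candidate generation: unions of survivors one item larger
        size = len(survivors[0]) + 1 if survivors else 0
        k_sets = list({a | b for a in survivors for b in survivors
                       if len(a | b) == size})
    return frequent

txns = [frozenset(t) for t in (["milk", "bread"], ["milk", "beer"],
                               ["milk", "bread", "beer"], ["bread"])]
freq = apriori(txns, min_support=0.5)
```

Association rules and their confidences are then read off the frequent itemsets; the thesis's DTMGC variants rank such rules by confidence when merging them with the global cut points.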
References |
[1] J.R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, pp. 81-106, 1986.
[2] G.V. Kass, "An exploratory technique for investigating large quantities of categorical data," Applied Statistics, vol. 29, pp. 119-127, 1980.
[3] J. Cheng, U.M. Fayyad, K.B. Irani, and Z. Qian, "Improved decision trees: a generalized version of ID3," in Proc. Fifth Int'l Conf. on Machine Learning, pp. 100-108, 1988.
[4] U.M. Fayyad and K.B. Irani, "A machine learning algorithm (GID3*) for automated knowledge acquisition: improvements and extensions," General Motors Research Report CS-634, Warren, MI: GM Research Labs, 1991.
[5] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, Classification and Regression Trees, Wadsworth, Belmont, CA, 1984.
[6] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[7] B. Liu, W. Hsu, and Y. Ma, "Integrating classification and association rule mining," in Proc. KDD, 1998.
[8] K. Wang, S.Q. Zhou, and Y. He, "Growing decision trees on support-less association rules," ACM, 2000.
[9] G. Kundu, M.M. Islam, S. Munir, and M.F. Bari, "ACN: an associative classifier with negative rules," IEEE, 2008.
[10] U.M. Fayyad and K.B. Irani, "Multi-interval discretization of continuous-valued attributes for classification learning," in Proc. Thirteenth International Joint Conference on Artificial Intelligence, 1993.
[11] S. Kotsiantis and D. Kanellopoulos, "Discretization techniques: a recent survey," GESTS International Transactions on Computer Science and Engineering, vol. 32, no. 1, pp. 47-58, 2006.
[12] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics, 2001.
[13] T. Elomaa and J. Rousu, "Efficient multisplitting revisited: optima-preserving elimination of partition candidates," Data Mining and Knowledge Discovery, vol. 8, no. 2, pp. 131-138, 2002.
[14] J. Cerquides and R. Lopez, "Proposal and empirical comparison of a parallelizable distance-based discretization method," in Proc. Third International Conference on Knowledge Discovery and Data Mining (KDD-97), Newport Beach, pp. 139-142, 1997.
[15] R. Kerber, "ChiMerge: discretization of numeric attributes," in Proc. Tenth National Conference on Artificial Intelligence (AAAI-92), pp. 123-128, 1992.
[16] H. Steck and R. Setiono, "Feature selection and discretization," IEEE Transactions on Knowledge and Data Engineering, vol. 9, pp. 1-4, 1997.
[17] R.C. Holte, "Very simple classification rules perform well on most commonly used datasets," Machine Learning, vol. 11, pp. 63-91, 1993.
[18] J. Dougherty, R. Kohavi, and M. Sahami, "Supervised and unsupervised discretization of continuous features," in Proc. 12th International Conference on Machine Learning, pp. 194-202, 1995.
[19] B. Liu, W. Hsu, and Y. Ma, "Integrating classification and association rule mining," in Proc. 1998 International Conference on Knowledge Discovery and Data Mining, pp. 80-86, New York, NY, 1998.
[20] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
[21] J.R. Quinlan, "Learning efficient classification procedures and their application to chess end games," in Machine Learning: An Artificial Intelligence Approach, R. Michalski, J. Carbonell, and T. Mitchell, Eds., Morgan Kaufmann, pp. 463-482, 1983.
[22] J.R. Quinlan, "Simplifying decision trees," Int. J. Man-Machine Studies, vol. 27, pp. 221-234, 1987.
[23] K.A. Hamill and A. Zamora, "The use of titles for automatic document classification," JASIS, vol. 31, no. 6, pp. 396-402, 1980.
[24] K.L. Kwok, "The use of title and cited titles as document representation for automatic classification," Information Processing and Management, vol. 11, pp. 201-206, 1975.
[25] R.R. Larson, "Experiments in automatic Library of Congress classification," JASIS, vol. 43, no. 2, pp. 130-148, 1992.
[26] M.E. Maron, "Automatic indexing: an experimental inquiry," Journal of the ACM, vol. 8, pp. 404-417, 1961.
[27] K.-R. Müller, A.J. Smola, G. Rätsch, et al., "Predicting time series with support vector machines," in Proc. ICANN'97, Springer Lecture Notes in Computer Science, 1997.
[28] V. Vapnik, S. Golowich, and A. Smola, "Support vector method for function approximation, regression estimation, and signal processing," in Neural Information Processing Systems 9, pp. 281-287, 1997.
[29] R. Srikant and R. Agrawal, "Mining quantitative association rules in large relational tables," ACM, pp. 4-6, 1996.
[30] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," in Proc. ACM SIGMOD Conference on Management of Data, pp. 1-12, 2000.
[31] F. Bodon and L. Rónyai, "TRIE: an alternative data structure for data mining algorithms," Mathematical and Computer Modelling, vol. 38, pp. 739-751, 2003.
[32] 吳育儒, "A Study of Removing the Irrelevant Values Problem in Decision Trees" (in Chinese), Master's thesis, Tamkang University, 1998.
[33] 葉介山, 廖偉敦, and 卓政儒, "Automatic Classification of Official Electronic Documents Using Decision Tree Algorithms" (in Chinese), Master's thesis, Providence University, 2006.
[34] 林文燦, 柯美珠, and 王文杉, "Applying Data Mining Techniques to the Planning of Education and Training Courses: A Case Study of Automobile Company A" (in Chinese), in Proc. 43rd Annual Conference of the Chinese Society for Quality and 13th National Quality Management Symposium, 2007.
[35] http://lovehome.kcg.gov.tw/service/word/110.pdf
[36] 顏博文, "Applying Data Mining Techniques to Analyze Students' Course-Selection Characteristics and Academic Performance" (in Chinese), Master's thesis, Chung Yuan Christian University, 2003.
[37] D.-A. Chiang, C.-T. Wang, Y.-H. Wang, and C.-C. Chen, "The irrelevant values problem of decision tree for improving a glass sputtering process," 2008.
[38] A. Veloso, W. Meira Jr., and M.J. Zaki, "Lazy associative classification," in Proc. Sixth IEEE International Conference on Data Mining, 2006.
[39] 王瀞誼, "A New Method for Measuring Association Rules" (in Chinese), Master's thesis, National University of Kaohsiung, 2007.
Full-text use permission | |