淡江大學覺生紀念圖書館 (TKU Library)

系統識別號 U0002-0407201311154100
中文論文名稱 利用關聯演算法重現決策樹分類結果
英文論文名稱 Using Association Classification to Represent a Decision Tree
校院名稱 淡江大學
系所名稱(中) 資訊工程學系博士班
系所名稱(英) Department of Computer Science and Information Engineering
學年度 101
學期 2
出版年 102
研究生中文姓名 陳俊旗
研究生英文姓名 Chun-Chi Chen
學號 894190049
學位類別 博士
語文別 中文
口試日期 2013-06-28
論文頁數 83頁
口試委員 指導教授-蔣璿東
中文關鍵字 決策樹  關聯分類  不相關決策條件  全域決策規則 
英文關鍵字 Decision tree  Association Classification  Irrelevant Note  Global rule 
學科別分類 學科別應用科學資訊工程
中文摘要 分類(Classification)是資料探勘(Data Mining)常用策略之一,而關聯演算法(Association Classification)與決策樹(Decision tree)更是分類上經常使用的方法。雖然關聯演算法的主要優點在提供全域決策規則(Global Decision Rules),但關聯演算法無法直接處理連續型數值資料,先行對連續型數值進行離散化處理,而要找最佳數值切點是一個NP-hard問題;且關聯演算法雖可以找出所有決策規則,但由於規則數量過多,較難建構一個完整知識解釋結構。另一方面,雖然決策樹能夠直接處理連續型數值與非連續型數值欄位及易於產生明確的知識結構,但決策樹因樹狀結構與演算法的限制,因此在由決策樹轉換後的決策規則是屬於區域決策規則(Local Decision Rules)且決策規則當中可能存在不相關決策條件(Irrelevant Classification Condition)。因此,本論文將針對此問題,提出解決方法。本研究首先利用決策樹在連續數值屬性以及快速處理的特性,找出隱藏在資料集內的知識與決策規則,同時藉由決策樹演算法的區域性(local)特性快速度出所有可能連續數值屬性的離散化切點集。而後再將決策樹轉換後決策規則與離散化切點集重新利用關聯演算法整理,將決策樹決策規則重新轉換成為以全域性、移除不相關決策條件及條件更簡單的關聯決策規則。最後,在本研究中利用卵巢子宮內膜異位症臨床資料集進行實驗,實驗结果表明,對比CART決策樹生成的決策規則在原始決策規則之下,提出分類精度較高、條件更簡單且可理解性強的關聯決策規則。
英文摘要 Since the derived rules of decision trees are local, the association classifier has higher accuracy than decision tree classifier and many useful rules are left undiscovered by the decision tree techniques.However, goal of the classification rule mining is to discover a small set of rules in the database, the association rule technique will capture all possible rules in the database and generate too many rules; one the other hand, many useful rules are left undiscovered by the decision tree techniques. Medical data always contains numeric (continuous values) attributes; however, the association rule technique can not deal with numeric data directly and it is not an easy task to find out the appropriate way to discrete numeric attributes. Moreover, in order to neutralize drawbacks of these two mining techniques and use current commercial mining tools to analyze postoperative status of ovarian endometriosis patients to discover rules, we propose a concept to take the advantages of decision tree and association rule techniques to mine the data. In this paper, our goal is to investigate the efficacy of transvaginal aspiration and sclerotherapy with 95% ethanol retained in situ for the treatment of recurrent ovarian endometriomas. Moreover, although several researchers have performed statistical method to prove that aspiration followed by injection 95% ethanol left in situ (retention) is an effective treatment of ovarian endometriomas, very few of them discuss about the conditions that could generate better recovery rate for the patients. Therefore, this study adopts the statistical method and data mining techniques together to analyze postoperative status of ovarian endometriosis patients to discover such conditions.
論文目次 目錄
第一章緒論 1
1-1 研究背景與動機 1
1-2 研究目的 4
1-3 研究方法 8
1-4 論文架構 9
第二章背景知識 11
2-1 決策樹 11
2-2 數量關聯分類演算法 15
2-2-1 關聯分類演算法 15
2-2-2 數量關聯規則 18
2-2-3 數量關聯演算法 19
2-2-4 數量關聯演算法合併 23
2-3 決策樹用於分類建構上的問題 28
2-3-1 連續數值屬性離散化 29
2-3-2 全域分類與區域分類 30
2-3-3 不相關決策條件問題 34
2-3-4 決策樹單一分支問題 38
第三章研究方法 40
3-1 使用CART演算法取得規則集與連續數值屬性切點集 42
3-2 全域規則演算法 45
第四章實驗結果 59
4-1 研究資料集來源 59
4-2 實驗結果 60
4-3 實驗討論 67
4-3-1 各別規則實驗結果討論 67
4-3-2 整體性實驗結果分析與討論 69
第五章結論 72
參考文獻 77

圖 2-1決策樹樹狀圖 12
圖 2-2 Cyst_size離散化結果 30
圖 2-3原始決策樹 32
圖 2-4裁減後決策樹 32
圖 2-5決策規則與資料分佈圖 33
圖 3-1 IRDTAC演算法 41
圖 3-2根據表3-1建構高血壓控制決策樹(未執行分支裁剪設定) 44
圖 3-3根據表3-1建構高血壓控制決策樹(已執行分支剪裁設定) 44
圖 3-4區域性之決策圖 46
圖 3-5全域性之決策圖 47
圖 3-6規則選擇有用切點演算法 51
圖 3-7選擇最佳規則演算法 58
圖 4-1臨床資料集CART原始決策樹 61
圖 4-2手動裁剪後臨床資料集決策樹結果 63

表 2-1決策樹轉換後決策規則 13
表 2-2移除不相關決策條件的全域決策規則 35
表 2-3轉換後決策規則 39
表 2-4簡化後決策規則 39
表 3-1高血壓控制狀況表 43
表 3-2屬性離散化切點表 45
表 3-3 x屬性切點集 49
表 3-4 x屬性切點集 52
表 3-5表3-4可用切點之關聯規則組合 53
表 3-6候選關聯規則集 56
表 4-1屬性描述 61
表 4-2 Cyst_size切點集 62
表 4-3 CA_125切點集 62
表 4-4 BMI切點集 62
表 4-5離散化切點數比較 63
表 4-6臨床資料集決策樹轉換後關聯規則 64
表 4-7以信賴度、支持度,規則長度與規則條件涵蓋度篩選後臨床資料集關聯規則 66
表 4-8卵巢子宮內膜異位症臨床資料集分類成果比較表 70
參考文獻 [1] B. Van Calster, G. Condous, E. Kirk et al., “An application of methods for the probabilistic three-class classification of pregnancies of unknown location,” Artificial Intelligence in Medicine, vol. 46, no. 2, pp. 139-154, 2009.
[2] O. Uzuner, X. Zhang, and T. Sibanda, “Machine Learning and Rule-based Approaches to Assertion Classification,” Journal of the American Medical Informatics Association, vol. 16, no. 1, pp. 109-115, 2009.
[3] I. ?tajduhar, B. Dalbelo-Ba?i?, and N. Bogunovi?, “Impact of censoring on learning Bayesian networks in survival modelling,” Artificial Intelligence in Medicine, vol. 47, no. 3, pp. 199-217, 2009.
[4] S. H. Huang, L. R. Wulsin, H. Li et al., “Dimensionality reduction for knowledge discovery in medical claims database: Application to antidepressant medication utilization study,” Computer Methods and Programs in Biomedicine, vol. 93, no. 2, pp. 115-123, 2009.
[5] L. J. Pino, D. W. Stashuk, S. G. Boe et al., “Motor unit potential characterization using "pattern discovery",” Medical Engineering and Physics, vol. 30, no. 5, pp. 563-573, 2008.
[6] W. H. Wu, A. A. T. Bui, M. A. Batalin et al., “Incremental diagnosis method for intelligent wearable sensor systems,” IEEE Transactions on Information Technology in Biomedicine, vol. 11, no. 5, pp. 553-562, 2007.
[7] N. Theera-Umpon, and S. Dhompongsa, “Morphological granulometric features of nucleus in automatic bone marrow white blood cell classification,” IEEE Transactions on Information Technology in Biomedicine, vol. 11, no. 3, pp. 353-359, 2007.
[8] J. R. Quinlan, “Induction of decision trees,” Machine Learning, vol. 1, no. 1, pp. 81-106, 1986.
[9] G. V. Kass, “An exploratory technique for investigating large quantities of categorical data,” Applied statistics, pp. 119-127, 1980.
[10] L. Breiman, Classification and regression trees: Chapman & Hall/CRC, 1984.
[11] J. R. Quinlan, C4. 5: programs for machine learning: Morgan kaufmann, 1993.
[12] G. F. Cooper, and E. Herskovits, “A Bayesian method for the induction of probabilistic networks from data,” Machine Learning, vol. 9, no. 4, pp. 309-347, 1992.
[13] C. M. Bishop, “Neural networks for pattern recognition,” 1995.
[14] T. Cover, and P. Hart, “Nearest neighbor pattern classification,” Information Theory, IEEE Transactions on, vol. 13, no. 1, pp. 21-27, 1967.
[15] S. Cost, and S. Salzberg, “A weighted nearest neighbor algorithm for learning with symbolic features,” Machine Learning, vol. 10, no. 1, pp. 57-78, 1993.
[16] C. Cortes, and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[17] P. Clark, and R. Boswell, "Rule induction with CN2: Some recent improvements." pp. 151-163.
[18] R. C. Holte, “Very simple classification rules perform well on most commonly used datasets,” Machine Learning, vol. 11, no. 1, pp. 63-90, 1993.
[19] J. Furnkranz, and G. Widmer, "Incremental reduced error pruning." pp. 70-77.
[20] P. Clark, and T. Niblett, “The CN2 induction algorithm,” Machine Learning, vol. 3, no. 4, pp. 261-283, 1989.
[21] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[22] Y. Freund, and R. E. Schapire, “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting,” Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.
[23] J. Han, M. Kamber, and J. Pei, Data mining: concepts and techniques: Morgan Kaufmann, 2011.
[24] H. Trevor, T. Robert, and J. H. Friedman, The elements of statistical learning: Springer, 2009.
[25] K. Wang, S. Zhou, and Y. He, "Growing decision trees on support-less association rules." pp. 265-269.
[26] F. Thabtah, “A review of associative classification mining,” Knowledge Engineering Review, vol. 22, no. 1, pp. 37-65, 2007.
[27] J. Dougherty, R. Kohavi, and M. Sahami, "Supervised and unsupervised discretization of continuous features." pp. 194-202.
[28] S. Mehta, S. Parthasarathy, and H. Yang, “Toward unsupervised correlation preserving discretization,” Knowledge and Data Engineering, IEEE Transactions on, vol. 17, no. 9, pp. 1174-1185, 2005.
[29] R. Vilalta, G. Blix, and L. Rendell, “Global data analysis and the fragmentation problem in decision tree induction,” Machine Learning: ECML-97, pp. 312-326, 1997.
[30] D. A. Chiang, W. Chen, Y. F. Wang et al., “The irrelevant values problem in the ID3 tree,” Computers and artificial intelligence, vol. 19, no. 2, pp. 169-182, 2000.
[31] D. A. Chiang, C. T. Wang, Y. H. Wang et al., “The Irrelevant Values Problem of Decision Tree for Improving a Glass Sputtering Process,” Journal of Applied Science and Engineering, vol. 13, no. 4, pp. 413-422, 2010.
[32] K.-Y. Chou, H.-C. Keh, N.-C. Huang et al., “The Irrelevant Values Problem of Decision Tree in Medical Examination,” JOURNAL OF APPLIED SCIENCE AND ENGINEERING, vol. 15, no. 1, pp. 89?96, 2012.
[33] J. R. Quinlan, "Generating production rules from decision trees." pp. 304-307.
[34] C.-Y. Lin, “Remove The Irrelevant Values Problem in the Decision Tree by Using Association Rule,” Department of Computer Science and Information Engineering, Tamkang University, 2009.
[35] R. Srikant, and R. Agrawal, "Mining quantitative association rules in large relational tables." pp. 1-12.
[36] T. Hastie, R. Tibshirani, J. Friedman et al., “The elements of statistical learning: data mining, inference and prediction,” The Mathematical Intelligencer, vol. 27, no. 2, pp. 83-85, 2005.
[37] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth et al., Advances in Knowledge Discovery and Data Mining: American Association for Artificial Intelligence Menlo Park, CA, USA, 1996.
[38] G. Dong, X. Zhang, L. Wong et al., "CAEP: Classification by Aggregating Emerging Patterns
Discovery Science," Lecture Notes in Computer Science, pp. 737-737: Springer Berlin / Heidelberg, 1999.
[39] R. J. Miller, and Y. Yang, "Association rules over interval data." pp. 452-461.
[40] J.-Y. Yeh, and G.-J. Liao, “Evaluation of Item Discretization Methods for Quantitative Association Rule Mining ” Journal of Quality Vol, vol. 18, no. 6, pp. 489, 2011.
[41] B. Lent, A. Swami, and J. Widom, "Clustering association rules." pp. 220-231.
[42] P. S. M. Tsai, and C. M. Chen, “Mining quantitative association rules in a large database of sales transactions,” 2000.
[43] G. Schmidberger, and E. Frank, “Unsupervised discretization using tree-based density estimation,” Knowledge Discovery in Databases: PKDD 2005, pp. 240-251, 2005.
[44] D. Taoa, Q. Shou-ningb, and C. Guang-qianga, “Study of method of coordinating data and mining association rules in relation of quantitative attribute [J],” Application Research of Computers, vol. 8, pp. 032, 2009.
[45] N.-C. Hsieh, “Finding Relevant Fuzzy Association Rules from Medical Databases,” Jouranl of Information Management, vol. 12, no. 2, pp. 25-51, 2005.
[46] N. C. Hsieh, “Handling indefinite and maybe information in logical fuzzy relational databases,” International journal of intelligent systems, vol. 19, no. 3, pp. 257-276, 2004.
[47] K. Wang, S. H. W. Tay, and B. Liu, "Interestingness-based interval merger for numeric association rules." pp. 121-127.
[48] O. R. Zaiane, and M.-L. Antonie, "Classifying text documents by associating terms with text categories." pp. 215-222.
[49] C. Fu Zan, J. S. KOU, and M. Q. LI, “A Lattice-based Mining Alogorithm for Quantitative Association Rules,” Systems Engineering - Theory & Practice, 2002.
[50] S. Kotsiantis, and D. Kanellopoulos, “Discretization Techniques: A recent survey,” GESTS International Transactions on Computer Science and Engineering, vol. 32, no. 1, pp. 47!V58, 2006.
[51] W. Jin-long, X. Cong-fu, and G. Xue-yu, “Study on global data mining based on local information,” Application Research of Computers, vol. 25, no. 7, pp. 1936-1939, 2008.
[52] M. Bramer, Principles of data mining: Springer, 2007.
[53] Y. L. Chen, and L. T. H. Hung, “Using decision trees to summarize associative classification rules,” Expert Systems with Applications, vol. 36, no. 2, pp. 2338-2351, 2009.
[54] J. L. Wang, C. F. Xu, and X. Y. Geng, “Study on global data mining based on local information,” Application Research of Computers, vol. 25, no. 7, pp. 1936-1939, 2008.
[55] J. Hollmen, J. K. Seppanen, and H. Mannila, "Mixture models and frequent sets: combining global and local methods for 0-1 data." pp. 289-293.
[56] H. Mannila, “Local and global methods in data mining: Basic techniques and open problems,” Automata, Languages and Programming, pp. 778-778, 2002.
[57] D. J. Hand, H. Mannila, and P. Smyth, Principles of data mining: The MIT press, 2001.
[58] C. L. Hsieh, C. S. Shiau, L. M. Lo et al., “Effectiveness of ultrasound-guided aspiration and sclerotherapy with 95% ethanol for treatment of recurrent ovarian endometriomas,” Fertility and Sterility, vol. 91, no. 6, pp. 2709-2713, 2009.
  • 同意紙本無償授權給館內讀者為學術之目的重製使用,於2018-08-01公開。
  • 同意授權瀏覽/列印電子全文服務,於2018-08-01起公開。

  • 若您有任何疑問,請與我們聯絡!
    圖書館: 請來電 (02)2621-5656 轉 2281 或 來信