§ Browse Thesis Bibliographic Record
  
System ID U0002-2806201202265800
DOI 10.6846/TKU.2012.01210
Title (Chinese) 吉尼係數離散化演算法 (Gini Index Discretization Algorithm)
Title (English) A discretization algorithm based on Class-Attribute Gini Index
Title (Third Language)
Institution Tamkang University
Department (Chinese) Master's Program, Department of Mathematics
Department (English) Department of Mathematics
Foreign Degree School Name
Foreign Degree College Name
Foreign Degree Institute Name
Academic Year 100 (ROC calendar; 2011-2012)
Semester 2
Year of Publication 101 (ROC calendar; 2012)
Graduate Student (Chinese) 黃鈺傑
Graduate Student (English) Yu-Chieh Huang
Student ID 698190260
Degree Master's
Language Traditional Chinese
Second Language
Oral Defense Date 2012-06-12
Number of Pages 22
Oral Defense Committee Advisor - 伍志祥
Member - 許志華
Member - 張三奇
Keywords (Chinese) Classification (分類)
Decision tree (決策樹)
Discretization (離散化)
Gini index (吉尼係數)
Keywords (English) Classification
Decision tree
Discretization
Gini index
Keywords (Third Language)
Subject Classification
Chinese Abstract
With the arrival of the information age and the rapid growth of the Internet, the amount of data is increasing at an astonishing rate. Current research therefore focuses on how to efficiently extract hidden, valuable information from large volumes of data. Many machine learning algorithms handle only discrete numerical and nominal data, yet continuous attributes are a common form of data. A discretization algorithm partitions the values of a continuous attribute into a finite number of discrete intervals, reducing the complexity of the data; this not only makes the distribution and characteristics of the data easier to understand, but also removes the difficulty machine learning algorithms have in handling continuous attributes. In this study we propose the CAGI (Class-Attribute Gini Index) discretization algorithm and compare it with the CAIM (Class-Attribute Interdependence Maximization) and CACC (Class-Attribute Contingency Coefficient) discretization algorithms. Experiments show that on some datasets the CAGI algorithm not only discretizes continuous attributes more accurately but also improves the predictive accuracy of the classifier.
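For reference, the quantity named in the title is the textbook Gini impurity used by CART [13] (the exact class-attribute form used by CAGI is defined in Chapter 3 of the thesis). For a set S whose k classes occur with proportions p_1, ..., p_k:

    Gini(S) = 1 - \sum_{i=1}^{k} p_i^2

A pure interval (a single class) has Gini(S) = 0, while a maximally mixed interval approaches 1 - 1/k; a discretization scheme that lowers the within-interval Gini index therefore groups values whose class labels agree.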
English Abstract
Due to the advent of the information age and the rapid development of the Internet, the amount of data is growing rapidly. Research now focuses on how to efficiently extract valuable information from large amounts of data. The majority of machine learning algorithms can be applied only to data described by discrete numerical or nominal attributes, yet continuous attributes are among the most common forms of data. A discretization algorithm divides a continuous attribute's values into a finite number of intervals, simplifying the data. This not only makes the distribution and characteristics of the data easier to understand, but also removes the restriction that such machine learning algorithms cannot handle continuous attributes directly. In this paper, we propose the CAGI (Class-Attribute Gini Index) discretization algorithm and compare it with the CAIM (Class-Attribute Interdependence Maximization) and CACC (Class-Attribute Contingency Coefficient) discretization algorithms. Experimental results show that on some datasets the CAGI discretization algorithm not only discretizes the data more accurately but also improves classification accuracy.
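To make the idea concrete, the following is a minimal Python sketch of greedy, top-down supervised discretization driven by the Gini impurity, in the spirit described above. It is not the thesis's CAGI procedure (which is defined in Chapter 3 and is based on a class-attribute contingency table); the function names, the max_intervals cap, and the stopping rule are illustrative assumptions.

# Illustrative sketch only: greedily insert the cut point that most reduces
# the size-weighted class Gini impurity of the resulting intervals.
# Names (gini, discretize, max_intervals) are hypothetical, not from the thesis.
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_c^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(values, labels, cuts):
    """Gini impurity averaged over the intervals induced by `cuts`, weighted by interval size."""
    bounds = [float("-inf")] + sorted(cuts) + [float("inf")]
    total, n = 0.0, len(values)
    for lo, hi in zip(bounds, bounds[1:]):
        interval = [y for x, y in zip(values, labels) if lo < x <= hi]
        total += len(interval) / n * gini(interval)
    return total

def discretize(values, labels, max_intervals=4):
    """Top-down greedy search: candidate cuts are midpoints between distinct values."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    cuts = []
    while len(cuts) + 1 < max_intervals and candidates:
        best = min(candidates, key=lambda c: weighted_gini(values, labels, cuts + [c]))
        if weighted_gini(values, labels, cuts + [best]) >= weighted_gini(values, labels, cuts):
            break  # no remaining candidate lowers the impurity; stop early
        cuts.append(best)
        candidates.remove(best)
    return sorted(cuts)

# Toy example: low values are class 0, high values class 1.
values = [1.0, 1.2, 1.5, 2.0, 5.0, 5.5, 6.0, 6.3]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(discretize(values, labels))  # -> [3.5], a single cut yielding two pure intervals

Real CAIM-family algorithms score candidate discretization schemes against the class attribute rather than raw impurity alone, but the greedy add-one-boundary loop above mirrors their overall top-down structure.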
Third Language Abstract
Table of Contents
Contents
Contents ·········· I
List of Tables ·········· II
Chapter 1  Introduction ·········· 1
Chapter 2  Literature Review ·········· 4
	2.1 Overview of Decision Trees ·········· 4
	2.1.1 ID3, C4.5, C5.0 ·········· 5
	2.1.2 CART ·········· 7
	2.1.3 CHAID ·········· 9
	2.2 Discretization Algorithms ·········· 9
	2.2.1 The CAIM Discretization Algorithm and the CAIR Criterion ·········· 10
	2.2.2 The CACC Discretization Algorithm ·········· 12
Chapter 3  The CAGI Discretization Algorithm ·········· 14
Chapter 4  Performance Analysis ·········· 17
	4.1 Comparison of Discretization Results ·········· 17
	4.2 Evaluation of Classification Performance ·········· 18
Chapter 5  Conclusion ·········· 20
References ·········· 21



List of Tables
Table 1  Comparison of decision tree algorithms ·········· 9
Table 2  Contingency table of the target class attribute and the discretization scheme ·········· 11
Table 3  Computation process of the CAGI discretization algorithm ·········· 16
Table 4  UCI-ML datasets ·········· 17
Table 5  Comparison of discretization schemes on the UCI-ML datasets ·········· 18
Table 6  Comparison of CHAID decision trees run on the UCI-ML datasets ·········· 18
References
[1]	D. Chiu, A. Wong, and B. Cheung, "Information Discovery through Hierarchical Maximum Entropy Discretization and Synthesis," Knowledge Discovery in Databases, G. Piatetsky-Shapiro and W.J. Frawley, eds., MIT Press, 1991.
[2]	K.J. Cios and L. Kurgan, "Hybrid Inductive Machine Learning Algorithm That Generates Inequality Rules," Information Sciences, special issue on soft computing data mining, 2002.
[3]	K.J. Cios and L. Kurgan, "Hybrid Inductive Machine Learning: An Overview of CLIP Algorithms," New Learning Paradigms in Soft Computing, L.C. Jain and J. Kacprzyk, eds., pp. 276-322, Physica-Verlag (Springer), 2001.
[4]	P. Clark and T. Niblett, "The CN2 Algorithm," Machine Learning, vol. 3, pp. 261-283, 1989.
[5]	P. Clark and R. Boswell, "Rule Induction with CN2: Some Recent Improvements," Proc. European Working Session on Learning, 1991.
[6]	U.M. Fayyad and K.B. Irani, "On the Handling of Continuous-Valued Attributes in Decision Tree Generation," Machine Learning, vol. 8, pp. 87-102, 1992.
[7]	B. Hiram and P. William, "A Model-Free Approach for Analysis of Complex Contingency Data in Survey Research," Journal of Marketing Research, vol. 17, pp. 503-515, 1980.
[8]	J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[9]	C.-J. Tsai, C.-I. Lee, and W.-P. Yang, "A Discretization Algorithm Based on Class-Attribute Contingency Coefficient," Information Sciences, vol. 178, pp. 714-731, 2008.
[10]	K.A. Kaufman and R.S. Michalski, "Learning from Inconsistent and Noisy Data: The AQ18 Approach," Proc. 11th Int'l Symp. Methodologies for Intelligent Systems, 1999.
[11]	L. Kurgan and K.J. Cios, "CAIM Discretization Algorithm," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 2, pp. 145-153, 2004.
[12]	C.-I. Lee, C.-J. Tsai, Y.-R. Yang, and W.-P. Yang, "A Top-Down and Greedy Method for Discretization of Continuous Attributes," Proc. Fourth International Conference on Fuzzy Systems and Knowledge Discovery, Haikou, China, 2007.
[13]	L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, Classification and Regression Trees, Wadsworth, Pacific Grove, 1984.
[14]	H. Liu and R. Setiono, "Feature Selection via Discretization," IEEE Trans. Knowledge and Data Eng., vol. 9, no. 4, pp. 642-645, 1997.
[15]	R.S. Michalski, I. Mozetic, J. Hong, and N. Lavrac, "The Multipurpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains," Proc. Fifth Nat'l Conf. Artificial Intelligence, pp. 1041-1045, 1986.
Full-Text Use Authorization
On campus
Print copy released immediately on campus
Electronic full text authorized for on-campus release
Electronic thesis released immediately on campus
Off campus
Authorization granted
Electronic thesis released immediately off campus
