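Electronic Theses and Dissertations Service

§ Browse Thesis Bibliographic Record

The electronic full text of this thesis has been publicly available off campus since 2014-07-29.
The print copy of this thesis has been publicly available since 2014-07-29.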

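System ID	U0002-2907201409444300
DOI	10.6846/TKU.2014.01202
Title (Chinese)	可處理巨量資料的平行化CHAID決策樹
Title (English)	Paralleled CHAID Decision Tree Algorithm with Big-Data Capability
Title (third language)
Institution	Tamkang University
Department (Chinese)	統計學系應用統計學碩士班 (Department of Statistics, Master's Program in Applied Statistics)
Department (English)	Department of Statistics
Foreign degree: university name
Foreign degree: college name
Foreign degree: graduate institute name
Academic year	102 (ROC calendar; academic year 2013-2014)
Semester	2
Publication year	103 (ROC calendar; 2014)
Author (Chinese)	蔡育儒
Author (English)	Yu-Ju Tsai
Student ID	602650052
Degree	Master's
Language	Traditional Chinese
Second language
Oral defense date	2014-07-10
Pages	45
Oral defense committee	Advisor: 陳景祥; Members: 何宗武, 李百靈, 陳景祥
Keywords (Chinese)	資料探勘 (data mining); 分類器 (classifier); CHAID; 決策樹 (decision tree); 平行化 (parallelization)
Keywords (English)	data mining; classifiers; parallel; CHAID
Keywords (third language)
Subject classification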
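Abstract (translated from Chinese)	With advances in technology, the era of Big-Data has arrived. As data volumes grow rapidly, improving computer processing speed has become an important area of development. Shortening data processing and analysis time allows predictions and decisions to be made earlier, and parallel processing is one way to reduce analysis time. This study combines the decision tree method, widely used in data mining, with parallel computation. We rewrote the category-merging and variable-selection procedures of the CHAID decision tree to use multi-core computation, which shortens tree construction time. The simulation results show that when more than one CPU core is used, the computation time of the CHAID decision tree is markedly shorter than in the single-core case. The time saved becomes more pronounced as the amount of data grows.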
Abstract (English)	As technology advances, the era of Big-Data has arrived. As the amount of data increases, improving computing speed becomes an important area of development. If data training and analysis time is reduced, predictions and decisions can be made much earlier than expected. Parallel computation is one method of reducing analysis time. In this paper, we rewrite the CHAID decision tree algorithm for parallel computation and Big-Data capability. Our simulation results show that when the CPU has more than one core, the computation time of our improved CHAID tree is significantly reduced. With larger amounts of data, the difference in computation time is even more significant.
Abstract (third language)
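Table of contents
	Table of Contents
	List of Tables
	List of Figures
	Chapter 1: Introduction
		Section 1: Research Background
		Section 2: Research Motivation and Objectives
		Section 3: Thesis Structure
	Chapter 2: Literature Review
		Section 1: Decision Trees
		Section 2: CHAID Decision Tree
		Section 3: Parallelization
		Section 4: Parallelization of Decision Trees
		Section 5: Hardware Limitations in Large-Scale Data Analysis
	Chapter 3: Research Methods
		Section 1: Use of R Packages
		Section 2: Computation Flow of the CHAID Package
		Section 3: Parallelization of the Decision Tree
			Subsection 1: CHAID Decision Tree
			Subsection 2: Parallelization
		Section 4: Confusion Matrix and Classification Accuracy
	Chapter 4: Simulation Results
		Section 1: Data Description
		Section 2: Time Comparison
		Section 3: Comparison of Prediction Accuracy
	Chapter 5: Conclusions and Recommendations
		Section 1: Conclusions
		Section 2: Recommendations
	References
	List of Tables
		Table 1: Simulation codes
		Table 2: Computation times on IRIS
		Table 3: Computation times on Adult (1)
		Table 4: Computation times on Adult (2)
	List of Figures
		Figure 1: Research flowchart
		Figure 2: Decision tree structure
		Figure 3: R integer limit
		Figure 4: Parallelization example
		Figure 5: Comparison of computation times on IRIS
		Figure 6: Comparison of computation times on Adult (1)
		Figure 7: Comparison of computation times on Adult (2)
		Figure 8: Comparison of computation times on Adult with different numbers of cores
		Figure 9: Box plot of prediction accuracy for different methods on IRIS
		Figure 10: Box plot of prediction accuracy for different methods on Adult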
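Full-text access rights	On campus: print thesis available immediately; electronic full text authorized for on-campus access; on-campus electronic thesis available immediately. Off campus: electronic thesis authorized for immediate off-campus access.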
