電子學位論文服務

§ 瀏覽學位論文書目資料

本論文電子全文於2013-07-04起於校外公開使用
本論文紙本於2013-07-04起公開使用

系統識別號	U0002-0207201303332600
DOI	10.6846/TKU.2013.00040
論文名稱(中文)	平行化處理在決策樹演算法之應用
論文名稱(英文)	Application of Parallel Computation on the Decision Tree Algorithms
第三語言論文名稱
校院名稱	淡江大學
系所名稱(中文)	統計學系應用統計學碩士班
系所名稱(英文)	Department of Statistics
外國學位學校名稱
外國學位學院名稱
外國學位研究所名稱
學年度	101
學期	2
出版年	102
研究生(中文)	李智慎
研究生(英文)	Chih-Shen Li
學號	600650302
學位類別	碩士
語言別	繁體中文
第二語言別
口試日期	2013-06-18
論文頁數	34頁
口試委員	指導教授 - 陳景祥委員 - 歐士田委員 - 王藝華
關鍵字(中)	資料探勘分類器平行化迴歸樹
關鍵字(英)	data mining classifiers parallel CART
第三語言關鍵字
學科別分類
中文摘要	在資料量急劇增加之下，電腦資料的處理速度成了一項重要的技術，如果能夠將資料處理分析的時間節省下來則可提早一步作預測或判斷，而平行化處理就是一個可以使計算時間縮短的一個方法。而在本篇研究裡就是將資料探勘上常用到的決策樹與平行化作結合，本研究用的是以CART演算法配合兩種平行處理方式，第一個方法為資料平行化，是將資料分作多個等分且以平行化的方式建構多顆決策樹；第二個方法為預測變數平行化，其方法是在決策樹計算最佳變數和最佳切割點時將資料以變數個數作區分，分作多個部份來作平行運算，而在最後模擬結果可以得知速度方面CART使用預測變數平行化方法節省的時間比資料平行化還要來得多，而資料平行化方法則必須要在資料量大於大概6萬以後才會在計算速方面比原始無平行化的CART方法快，且在CPU從2核心上升到4時，時間上面的變化都有很明顯的縮減，兩方法在大型資料下都能縮減決策樹在計算上的時間。
英文摘要	As the amount of data becomes increasing more and more today, the computation speed of the computer becomes a very important factor. If we can decrease the computing time, we could make the prediction or decision much early than expected. Parallel computation is one of the methods which can decrease computing time. In this paper, we will combine the parallel computation with decision tree technique, and use the CART algorithm as our decision tree model. There are two types of parallel algorithms in this research, 1. Data parallelization: split the data into several parts to build the decision tree in parallel, 2. Variable parallelization: split the predictive variables to do parallelization. In our simulation, we will show that method 2 together with the CART tree algorithm has better performance than method 1. On the other hand, CART with method 1 will be faster than non-parallel CART tree when the size of data amount is over 60,000. Also, when the number of CPU kernels are added from two to four, there is a significant reduction in computing time in both algorithms.
第三語言摘要
論文目次	目錄 I 表目錄 III 圖目錄 IV 第一章緒論 1 1.1 研究背景 1 1.2 研究動機與目的 2 1.3 論文結構 3 第二章文獻探討 5 2.1決策樹簡介 5 2.2 CART決策樹 7 2.3平行化 (Parallization) 8 2.4決策樹平行化 8 第三章研究方法 13 3.1 R軟體套件運用 13 3.1 資料平行化 15 3.2 預測變數平行化 16 第四章模擬結果 19 4.1模擬資料介紹 19 4.2時間比較 20 4.3預測正確度比較 27 第五章研究結論與建議 29 5.1結論 29 5.2建議 30 附表 31 參考文獻 34 中文文獻: 34 英文文獻 34 表 1 預測變數平行化演算法預測變數平行化演算法預測變數平行化演算法 10 表 2 節點平行化演算法節點平行化演算法節點平行化演算法 11 表 3 模擬代號介紹模擬代號介紹 20 表 4 M04M04M0和 M1 法在 1~61~6 萬筆資料下運算時間萬筆資料下運算時間萬筆資料下運算時間 21 表 5 M 0和 M2 法在 1~61~6 萬筆資料下運算時間萬筆資料下運算時間萬筆資料下運算時間 23 表 6 M06M06M0、M1 和 M2 法在 90~10090~10090~100 90~10090~100萬筆資料下運算時間萬筆資料下運算時間萬筆資料下運算時間 26 圖 1 研究流程圖研究流程圖 3 圖 2 決策樹結構 6 圖 3 資料平行化分割示意圖資料平行化分割示意圖資料平行化分割示意圖資料平行化分割示意圖 15 圖 4 M04M04M0和 M1 法在 1~61~6 萬筆資料下運算時間萬筆資料下運算時間萬筆資料下運算時間 21 圖 5 M05M05M0和 M2 法在 1~61~6 萬筆資料下運算時間萬筆資料下運算時間萬筆資料下運算時間 23 圖 6 M06M06M0、M1 和 M2 法在 1~1001~1001~1001~1001~100萬筆資料下運算時間萬筆資料下運算時間萬筆資料下運算時間 24 圖 7 M0 、M1 和 M2 法在 90~10090~10090~100 90~10090~100萬筆資料下運算時間萬筆資料下運算時間萬筆資料下運算時間 25 圖 8 M0 、M1 和 M2 標準差 (紫色為 M0 和 M2 法，綠色為法，綠色為 M1 法) 27
參考文獻	王派洲譯；Han, J., Kamber, M.著(2008)，資料探勘：概念與方法，臺中市：滄海。陳景祥(2010)，R軟體：應用統計方法，臺北市：台灣東華。 Breiman, L. ,(2001), Random Forests, Machine Learning, 45, 5-32. Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J., (1984).Classification and regression trees. Wadsworth. Yildiz, O.T., Dikmen, O.,(2006),Parallel univariate decision trees,Pattern Recognition Letters 28 (2007) 825–832 Shafer, J., Agrawal, R., Mehta, M.,(1996), SPRINT: A Scalable Parallel Classifier for Data Mining. Jin, R., Agrawal, G.,(2003), Communication and Memory Efﬁcient Parallel DecisionTree Construction. In: Proceedings of Third SIAM Conferenceon Data Mining. Jochen,K., Christine, P., Harald, B., Guido, S.,(2009), Easier Parallel Computing in R withsnowfall and sfCluster, The R Journal Vol. 1/1. Weston, S., (2013), Getting Started with doMC and foreach. Hothorn, T., Zeileis, A., partykit: A Toolkit for Recursive Partytioning
論文全文使用權限	校內：校內紙本論文立即公開同意電子論文全文授權校園內公開校內電子論文立即公開校外：同意授權校外電子論文立即公開

返回頁首

如有問題，歡迎洽詢！
圖書館數位資訊組　(02)2621-5656 轉 2487 或來信