§ 瀏覽學位論文書目資料
  
系統識別號 U0002-0702200621575400
DOI 10.6846/TKU.2006.00106
論文名稱(中文) 有效率的分散式循序樣本探勘系統
論文名稱(英文) An efficient distributed system for mining sequential patterns
第三語言論文名稱
校院名稱 淡江大學
系所名稱(中文) 資訊管理學系碩士班
系所名稱(英文) Department of Information Management
外國學位學校名稱
外國學位學院名稱
外國學位研究所名稱
學年度 94
學期 1
出版年 95
研究生(中文) 黃揚智
研究生(英文) Yang-Chih Huang
學號 692520132
學位類別 碩士
語言別 繁體中文
第二語言別
口試日期 2006-01-07
論文頁數 57頁
口試委員 指導教授 - 張昭憲
委員 - 廖賀田
委員 - 莊庭瑞
委員 - 高明達
關鍵字(中) 分散式探勘
平行探勘
循序樣本探勘
資料探勘
關鍵字(英) distributed mining
parallel mining
sequential pattern mining
data mining
第三語言關鍵字
學科別分類
中文摘要
本研究以提昇循序樣本探勘(sequential pattern mining)之效率為目標,發展以多部電腦為基礎的分散式探勘系統。分散式探勘首重節點間的負載平衡,同時需降低節點間通訊負擔。負載平衡與通訊負擔的降低均有賴於準確的任務工作量預測,此外,也應避免透過繁複的任務重分配來解決負載失衡的狀況。
    針對上述重點,本研究首先針對相關文獻中常利用之工作量預測方法進行分析,發現以序列(sequence)支持度(support)為基礎之預測方式與實際狀況差異甚大,將嚴重影響負載平衡。此外,我們也發現不同特性之資料庫可能適用於不同之探勘演算法,此發現可導致全新的分散式演算法設計。根據上述結果,我們發展了一套新的分散式探勘演算法,Segmented Dynamic Load Balance algorithm (SDLB)。演算法中對於任務工作分配採取分段實施,並依狀況進行靜態與動態分配,期能在負載平衡與降低通訊負擔間取得協調。此外,本研究亦導入全新的演算法切換概念,能讓各節點依照分配任務之特性,選用合適之探勘演算法,以進一步提升總體效率。由實驗結果顯示,與前人研究相較,本系統能有效縮短執行時間,並取得更良好之加速比,此結果也顯示本系統具有處理超大型資料庫之潛力。
英文摘要
In this thesis, an effective distributed mining system is developed to increase the efficiency of sequential pattern mining. Two important concerns for distributed mining, load balance and communication overhead reduction, are carefully examined in our works. To achieve a good load balance, accurate predictions on work load of subtasks are indispensable. In addition, the independency of subtask should be effectively controlled to decrease the communication overhead among nodes.
  According to above objectives, the work load prediction methods used in the previous works are analyzed at first. We finds that the commonly-used prediction method does not show a well metrics for work load. Eventually, a skew presented in task partition occurs frequently due to the inaccurate prediction. Besides, we also found that, the efficiency of performing a sequence mining algorithm on a transaction database is strongly related to the distribution of sequential patterns in this database. To increase the mining efficiency, we should choose a proper mining algorithm for each database but not apply a single algorithm for all databases with different features. Base on the above observations, a novel distributed data mining algorithm, Segmented Dynamic Load Balance algorithm (SDLB), is developed. Different previous works, SDLB divides subtask dispatches into several stages. According to different situations, the static and dynamic load balance method are applied adeptly to prevent the task partition from skew and reduce the communication overhead simultaneously. Furthermore, SDLB inducts a brand-new concept that the nodes may adopt different basic mining algorithm for subtasks with different characteristics to promote the mining efficiency. In comparison with the previous works, the experimental results shows SDLB can effectively reduce the runtime and obtain a better speed-up ratio. This result demonstrates the potentials of SDLB for mining sequential pattern in Very Large Data Bases (VLDBs).
第三語言摘要
論文目次
目錄
1.	緒論	1
2.	循序樣本探勘簡介	7
2.1.	循序樣本探勘演算法	8
2.2.	循序樣本探勘之平行演算法	10
3.	子任務工作量預測	15
3.1.	支持度與工作量	15
3.2.	資料庫特性與演算法之關係	17
3.2.1.	資料庫特性與樣本分佈	17
3.2.2.	演算法特性與工作量	19
4.	分散式探勘演算法	21
4.1.	分散式探勘環境	21
4.2.	分散式探勘演算法	22
4.2.1.	任務分配架構	22
4.2.2.	分散式循序樣本探勘演算法	23
4.2.3.	任務分配法之平衡度比較	26
4.3.	節點探勘演算法切換	28
4.3.1.	切換架構	28
4.3.2.	演算法資料結構之探討	30
4.3.3.	演算法之執行時間相較	31
5.	實驗結果	32
5.1.	模擬資料庫	32
5.2.	分散式探勘演算法執行時間	33
5.2.1.	鞍型結構測試	34
5.2.2.	瀑布型結構測試	36
5.3.	各階層探勘時間	38
6.	結論	39
參考文獻	40
附錄一:	42
附錄二:	47
附錄三:	51
附錄四:	53
 
表目錄
表 2.1水平資料庫範例	8
表 2.2垂直資料庫範例	8
表 2.3 ID-list時序結合(Temporal Join)	9
表 2.4 PrefixSpan範例資料	10
表 2.5 字首<a>映射後的資料及sequential patterns	10
表 3.1 Large 1-sequence支持度與總時間	17
表 4.1分散式探勘演算法	25
表 4.2演算法切換邏輯	29
表 4.3不同節點之探勘時間	31
表 5.1實驗用資料庫及其參數	32
表 5.2鞍型結構執行時間與加速比(單位:秒)	34
表 5.3一般及微偏結構執行時間與加速比(單位:秒)	36

圖目錄
圖 2.1 pSPADE產生之循序樣本	11
圖 2.2 PrefixSpan演算法產生之樹狀結構	14
圖 3.1資料庫參數對於循序樣本特性之影響	18
圖 3.2不同特性之探勘結果與演算法執行時間之相關性	20
圖 4.1分散式探勘之執行環境	21
圖 4.2任務分配架構圖	22
圖 4.3不同特性之分散式演算法與負載平衡	27
圖 4.4新舊架構之加速比	28
圖 4.5分散式演算法切換架構	29
圖 5.1實驗資料庫及其循序樣本特性	33
圖 5.2鞍型結構執行時間與加速比	35
圖 5.3一般及微偏結構執行時間與加速比	37
圖 5.4演算法與其各階層探勘時間加總	38
參考文獻
[1] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB’94), pages 487–499, Santiago, Chile, Sept. 1994.

[2] R. Agrawal and R. Srikant. Mining Sequential Patterns. In Proc. 1995 Int. Conf. Data Engineering (ICDE’95), pages 3–14, Taipei, Taiwan, Mar. 1995.

[3] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In 5th Intl. Conf. Extending Database Technology, March 1996.

[4] M. J. Zaki, “Parallel Sequence Mining on Shared-Memory Machines,” Journal of Parallel and Distributed Computing 61, 401-426 (2001).

[5] V. Guralnik and G. Karypis, “Parallel tree-projection-based sequence mining algorithms,” Parallel Computing 30 (2004) 443-472. 

[6] J. Pei, et al., “Freespan: Frequent pattern-projected sequential pattern mining”, In Proc. Int. Conf. Knowledge Discovery and Data Mining (KDD’00), Boston, MA, August 2000, pp. 355-359.

[7] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, “PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth,” Proc. 2001 Int'l Conf. Data Eng. (ICDE '01), pp. 215-226, 2001.

[8] J. Pei, J. Han, and W. Wang, “Mining Sequential Patterns with Constraints in Large Databases”, Proc. 2002 Int. Conf. on Information and Knowledge Management (CIKM'02), Washington, D.C., Nov. 2001.

[9] J. Pei, J. Han, and W. Wang, Constraint-Based Sequential Pattern Mining in Large Databases Proc. 2002 Int'l Conf. Information and Knowledge Management (CIKM '02), pp. 18-25, Nov. 2002.

[10] J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. "Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach," IEEE Transactions on Knowledge and Data Engineering, vol.16, no.11, pp.1424-1440, November 2004.

[11] M. J. Zaki, “Sequence mining in categorical domains: Incorporating constraints,” In CIKM, pages 422-429,2000.

[12] M. J. Zaki. “SPADE: An efficient algorithm for mining frequent sequences,” Machine Learning, 42(1/2):31-60, 2001.

[13] S. Orlando, R. Perego, C. Silvestri: A new algorithm for gap constrained sequence mining. Symposim on Applied Computing (SAC'04), pp 540-547, 2004

[14] IBM. http://www.almaden.ibm.com/software/quest/Resources/index.shtml. Quest Data Mining Project, IBM Almaden Research Center, San Jose, CA 95120.

[15] M. Garofalakis, R. Rastogi, and K. Shim, “Mining Sequential Patterns with Regular Expression Constrains,” IEEE Trans. on Knowledge and Data Eng., Vol. 14, No. 3, pp. 530-552, May/June 2002.

[16]	Y. L. Chen, S. S. Chen, and P. Y. Hsu, “Mining hybrid sequential patterns and sequential rules, Information Systems,” Vol. 27, No. 5, 2004, pp. 345-362.

[17] J. S. Park; Ming-S yan Chen and Philip S. Yu , “An Effective Hash-Based Algorithm for Mining Association Rules,” Proceedins Management of Data ,May 1995 ACM SIGMOD international conference on Management of Data , May 1995,pp.175-186.

[18] J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation”, In Proc. ACM-SIGMOD Int. Conf. Management of Data (SIGMOD’00), Dallas, TX, May 2000, pp.1-12.

[19] 張昭憲、周定賢, “以動態任務分配為基礎之分散式循序樣本探勘系統,”第十六屆國際資訊管理學術研討會,台北(輔仁大學),2005年五月。
論文全文使用權限
校內
校內紙本論文立即公開
同意電子論文全文授權校園內公開
校內電子論文立即公開
校外
同意授權
校外電子論文立即公開

如有問題,歡迎洽詢!
圖書館數位資訊組 (02)2621-5656 轉 2487 或 來信