淡江大學覺生紀念圖書館 (TKU Library)
進階搜尋


下載電子全文限經由淡江IP使用) 
系統識別號 U0002-0702200621575400
中文論文名稱 有效率的分散式循序樣本探勘系統
英文論文名稱 An efficient distributed system for mining sequential patterns
校院名稱 淡江大學
系所名稱(中) 資訊管理學系碩士班
系所名稱(英) Department of Information Management
學年度 94
學期 1
出版年 95
研究生中文姓名 黃揚智
研究生英文姓名 Yang-Chih Huang
學號 692520132
學位類別 碩士
語文別 中文
口試日期 2006-01-07
論文頁數 57頁
口試委員 指導教授-張昭憲
委員-廖賀田
委員-莊庭瑞
委員-高明達
中文關鍵字 分散式探勘  平行探勘  循序樣本探勘  資料探勘 
英文關鍵字 distributed mining  parallel mining  sequential pattern mining  data mining 
學科別分類 學科別社會科學管理學
學科別社會科學資訊科學
中文摘要 本研究以提昇循序樣本探勘(sequential pattern mining)之效率為目標,發展以多部電腦為基礎的分散式探勘系統。分散式探勘首重節點間的負載平衡,同時需降低節點間通訊負擔。負載平衡與通訊負擔的降低均有賴於準確的任務工作量預測,此外,也應避免透過繁複的任務重分配來解決負載失衡的狀況。
針對上述重點,本研究首先針對相關文獻中常利用之工作量預測方法進行分析,發現以序列(sequence)支持度(support)為基礎之預測方式與實際狀況差異甚大,將嚴重影響負載平衡。此外,我們也發現不同特性之資料庫可能適用於不同之探勘演算法,此發現可導致全新的分散式演算法設計。根據上述結果,我們發展了一套新的分散式探勘演算法,Segmented Dynamic Load Balance algorithm (SDLB)。演算法中對於任務工作分配採取分段實施,並依狀況進行靜態與動態分配,期能在負載平衡與降低通訊負擔間取得協調。此外,本研究亦導入全新的演算法切換概念,能讓各節點依照分配任務之特性,選用合適之探勘演算法,以進一步提升總體效率。由實驗結果顯示,與前人研究相較,本系統能有效縮短執行時間,並取得更良好之加速比,此結果也顯示本系統具有處理超大型資料庫之潛力。
英文摘要 In this thesis, an effective distributed mining system is developed to increase the efficiency of sequential pattern mining. Two important concerns for distributed mining, load balance and communication overhead reduction, are carefully examined in our works. To achieve a good load balance, accurate predictions on work load of subtasks are indispensable. In addition, the independency of subtask should be effectively controlled to decrease the communication overhead among nodes.
According to above objectives, the work load prediction methods used in the previous works are analyzed at first. We finds that the commonly-used prediction method does not show a well metrics for work load. Eventually, a skew presented in task partition occurs frequently due to the inaccurate prediction. Besides, we also found that, the efficiency of performing a sequence mining algorithm on a transaction database is strongly related to the distribution of sequential patterns in this database. To increase the mining efficiency, we should choose a proper mining algorithm for each database but not apply a single algorithm for all databases with different features. Base on the above observations, a novel distributed data mining algorithm, Segmented Dynamic Load Balance algorithm (SDLB), is developed. Different previous works, SDLB divides subtask dispatches into several stages. According to different situations, the static and dynamic load balance method are applied adeptly to prevent the task partition from skew and reduce the communication overhead simultaneously. Furthermore, SDLB inducts a brand-new concept that the nodes may adopt different basic mining algorithm for subtasks with different characteristics to promote the mining efficiency. In comparison with the previous works, the experimental results shows SDLB can effectively reduce the runtime and obtain a better speed-up ratio. This result demonstrates the potentials of SDLB for mining sequential pattern in Very Large Data Bases (VLDBs).
論文目次 目錄
1. 緒論 1
2. 循序樣本探勘簡介 7
2.1. 循序樣本探勘演算法 8
2.2. 循序樣本探勘之平行演算法 10
3. 子任務工作量預測 15
3.1. 支持度與工作量 15
3.2. 資料庫特性與演算法之關係 17
3.2.1. 資料庫特性與樣本分佈 17
3.2.2. 演算法特性與工作量 19
4. 分散式探勘演算法 21
4.1. 分散式探勘環境 21
4.2. 分散式探勘演算法 22
4.2.1. 任務分配架構 22
4.2.2. 分散式循序樣本探勘演算法 23
4.2.3. 任務分配法之平衡度比較 26
4.3. 節點探勘演算法切換 28
4.3.1. 切換架構 28
4.3.2. 演算法資料結構之探討 30
4.3.3. 演算法之執行時間相較 31
5. 實驗結果 32
5.1. 模擬資料庫 32
5.2. 分散式探勘演算法執行時間 33
5.2.1. 鞍型結構測試 34
5.2.2. 瀑布型結構測試 36
5.3. 各階層探勘時間 38
6. 結論 39
參考文獻 40
附錄一: 42
附錄二: 47
附錄三: 51
附錄四: 53

表目錄
表 2.1水平資料庫範例 8
表 2.2垂直資料庫範例 8
表 2.3 ID-list時序結合(Temporal Join) 9
表 2.4 PrefixSpan範例資料 10
表 2.5 字首映射後的資料及sequential patterns 10
表 3.1 Large 1-sequence支持度與總時間 17
表 4.1分散式探勘演算法 25
表 4.2演算法切換邏輯 29
表 4.3不同節點之探勘時間 31
表 5.1實驗用資料庫及其參數 32
表 5.2鞍型結構執行時間與加速比(單位:秒) 34
表 5.3一般及微偏結構執行時間與加速比(單位:秒) 36

圖目錄
圖 2.1 pSPADE產生之循序樣本 11
圖 2.2 PrefixSpan演算法產生之樹狀結構 14
圖 3.1資料庫參數對於循序樣本特性之影響 18
圖 3.2不同特性之探勘結果與演算法執行時間之相關性 20
圖 4.1分散式探勘之執行環境 21
圖 4.2任務分配架構圖 22
圖 4.3不同特性之分散式演算法與負載平衡 27
圖 4.4新舊架構之加速比 28
圖 4.5分散式演算法切換架構 29
圖 5.1實驗資料庫及其循序樣本特性 33
圖 5.2鞍型結構執行時間與加速比 35
圖 5.3一般及微偏結構執行時間與加速比 37
圖 5.4演算法與其各階層探勘時間加總 38
參考文獻 [1] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB’94), pages 487–499, Santiago, Chile, Sept. 1994.

[2] R. Agrawal and R. Srikant. Mining Sequential Patterns. In Proc. 1995 Int. Conf. Data Engineering (ICDE’95), pages 3–14, Taipei, Taiwan, Mar. 1995.

[3] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In 5th Intl. Conf. Extending Database Technology, March 1996.

[4] M. J. Zaki, “Parallel Sequence Mining on Shared-Memory Machines,” Journal of Parallel and Distributed Computing 61, 401-426 (2001).

[5] V. Guralnik and G. Karypis, “Parallel tree-projection-based sequence mining algorithms,” Parallel Computing 30 (2004) 443-472.

[6] J. Pei, et al., “Freespan: Frequent pattern-projected sequential pattern mining”, In Proc. Int. Conf. Knowledge Discovery and Data Mining (KDD’00), Boston, MA, August 2000, pp. 355-359.

[7] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, “PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth,” Proc. 2001 Int'l Conf. Data Eng. (ICDE '01), pp. 215-226, 2001.

[8] J. Pei, J. Han, and W. Wang, “Mining Sequential Patterns with Constraints in Large Databases”, Proc. 2002 Int. Conf. on Information and Knowledge Management (CIKM'02), Washington, D.C., Nov. 2001.

[9] J. Pei, J. Han, and W. Wang, Constraint-Based Sequential Pattern Mining in Large Databases Proc. 2002 Int'l Conf. Information and Knowledge Management (CIKM '02), pp. 18-25, Nov. 2002.

[10] J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. "Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach," IEEE Transactions on Knowledge and Data Engineering, vol.16, no.11, pp.1424-1440, November 2004.

[11] M. J. Zaki, “Sequence mining in categorical domains: Incorporating constraints,” In CIKM, pages 422-429,2000.

[12] M. J. Zaki. “SPADE: An efficient algorithm for mining frequent sequences,” Machine Learning, 42(1/2):31-60, 2001.

[13] S. Orlando, R. Perego, C. Silvestri: A new algorithm for gap constrained sequence mining. Symposim on Applied Computing (SAC'04), pp 540-547, 2004

[14] IBM. http://www.almaden.ibm.com/software/quest/Resources/index.shtml. Quest Data Mining Project, IBM Almaden Research Center, San Jose, CA 95120.

[15] M. Garofalakis, R. Rastogi, and K. Shim, “Mining Sequential Patterns with Regular Expression Constrains,” IEEE Trans. on Knowledge and Data Eng., Vol. 14, No. 3, pp. 530-552, May/June 2002.

[16] Y. L. Chen, S. S. Chen, and P. Y. Hsu, “Mining hybrid sequential patterns and sequential rules, Information Systems,” Vol. 27, No. 5, 2004, pp. 345-362.

[17] J. S. Park; Ming-S yan Chen and Philip S. Yu , “An Effective Hash-Based Algorithm for Mining Association Rules,” Proceedins Management of Data ,May 1995 ACM SIGMOD international conference on Management of Data , May 1995,pp.175-186.

[18] J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation”, In Proc. ACM-SIGMOD Int. Conf. Management of Data (SIGMOD’00), Dallas, TX, May 2000, pp.1-12.

[19] 張昭憲、周定賢, “以動態任務分配為基礎之分散式循序樣本探勘系統,”第十六屆國際資訊管理學術研討會,台北(輔仁大學),2005年五月。
論文使用權限
  • 同意紙本無償授權給館內讀者為學術之目的重製使用,於2006-02-09公開。
  • 同意授權瀏覽/列印電子全文服務,於2006-02-09起公開。


  • 若您有任何疑問,請與我們聯絡!
    圖書館: 請來電 (02)2621-5656 轉 2281 或 來信