§ 瀏覽學位論文書目資料
  
系統識別號 U0002-2806201602363800
DOI 10.6846/TKU.2016.00967
論文名稱(中文) 基於MapReduce程式架構下的分散式循序樣式探勘方法之研究
論文名稱(英文) A study of distributed sequential pattern mining method based on MapReduce programming model.
第三語言論文名稱
校院名稱 淡江大學
系所名稱(中文) 資訊管理學系碩士班
系所名稱(英文) Department of Information Management
外國學位學校名稱
外國學位學院名稱
外國學位研究所名稱
學年度 104
學期 2
出版年 105
研究生(中文) 陳智翔
研究生(英文) Jhih-Siang Chen
學號 603630491
學位類別 碩士
語言別 繁體中文
第二語言別
口試日期 2016-05-29
論文頁數 57頁
口試委員 指導教授 - 徐煥智
委員 - 趙景明
委員 - 魏世杰
關鍵字(中) Hadoop
MapReduce
循序樣式
資料探勘
關鍵字(英) Hadoop
MapReduce
Sequential Pattern
Data Mining
第三語言關鍵字
學科別分類
中文摘要
循序樣式探勘是在巨量循序資料庫中用來取得頻繁循序樣式的一種資料探勘方法,常見的循序資料探勘方法可以分為兩大類,候選樣式產生與樣式成長方法,這些演算法主要執行於單機的環境,便會造成一些缺點,像是對於巨量資料的掃描時間、可擴展性的問題、對於巨量資料及的效率較低。為了增進循序資料探勘的性能,並且改善可擴展性的問題,本研究提出了以Hadoop平台與MapReduce軟體架構為基礎的循序資料探勘方法。
探勘任務被分解為許多分散式任務,Map方法用來挖掘資料集中的所有循序樣式,然後Reduce方法合併所有被找出來的樣式。簡化了搜尋的空間以及獲得了更高的探勘效能。
在這次研究當中,我們對於用戶所設定最小支持度的影響有更進一步的討論,根據我們的實驗,我們發現在探勘過程中的Map與Reduce階段對於最小支持度的設定應該不同,否則會產生頻繁樣式流失的可能。
英文摘要
Sequential pattern mining is a data mining method for obtaining frequent sequential patterns in a large sequential database. Conventional sequence data mining methods could be divided into two categories: Apriori-like methods and pattern growth methods. These algorithms are mainly executed on standalone environment. There are some disadvantages like large database scanning time, scalability problem, less efficient for massive dataset. To improve the performance of sequential pattern mining and to improve the scalability issues, this study presents a distributed sequential pattern mining method based on Hadoop platform and Map Reduce programming model. Mining tasks are decomposed to many distributed tasks, the Map function is used to mine each sequential pattern in a subset of database. Then the Reduce function merges together all these identified patterns. It simplifies the search space and acquires a higher mining efficiency. In this study, we have further discussion on the influence of the setting of user-specified minimum support threshold on the distributed mining process. According to our experiments, it has been found that the threshold setting should be different in Map and Reduce mining process to prevent loss of some frequent patterns.
第三語言摘要
論文目次
第一章 緒論 1
第一節 研究背景與動機 1
第二節 研究目的 2
第三節 研究架構 2
第二章 文獻探討 4
第一節 循序樣式探勘 4
第二節 PrefixSpan演算法 7
第三節 Hadoop框架 9
第四節 MapReduce架構 10
第五節 HDFS 12
第六節 YARN 13
第三章 在MapReduce架構下的循序樣式探勘方法 15
第一節 MapReduce架構的PrefixSpan演算法 15
第二節 範例說明 19
第四章 實證和參數的影響 22
第一節 實作系統架構 22
第二節 Support值在MapReduce架構下對PrefixSpan演算法的影響 23
第三節 PrefixSpan在單機與MapReduce架構下的效能比較 30
第五章 結論 33
參考文獻 35
附錄一-PrefixSpan程式 38
附件二-MapReduce下的PrefixSpan程式 46
附件三-測試資料產生程式 56

表目錄
===================================
表2-1 PrefixSpan 範例資料庫 8
表2-2 一階頻繁項目集的映射資料庫 9
表3-1 四個Mapper輸出的結果 19
表3-2 Reducer輸出結果 21
表4-1 硬體比較表 22
表4-2 100000筆的交易資料庫在兩階段不同Support值探勘的序列長度 30
表4-3 MapReduce架構與單機架構的PrefixSpan演算法執行時間比較 31

圖目錄
===================================
圖2-1 PrefixSpan演算法流程步驟 8
圖2-2 YARN架構 14
圖3-1 main方法 16
圖3-2 Mapper虛擬碼 17
圖3-3 PSItem類別 18
圖3-4 Reducer的虛擬碼 18
圖3-5 總筆數為20筆的範例資料庫 19
圖4-1 硬體架構圖 23
圖4-2 單機與Reduce-Support值均為4%的探勘結果 26
圖4-3 單機與Reduce-Support值均為4%的探勘結果折線圖 26
圖4-4 單機與Reduce-Support值均為6%的探勘結果 27
圖4-5 單機與Reduce-Support值均為6%的探勘結果折線圖 27
圖4-6 測試資料『10-5』的部分探勘結果 28
圖4-7 測試資料『10-5』 28
圖4-8 測試資料『10-5』包含樣式『bf』的序列 28
圖4-9 測試資料『10-5』包含樣式『dd』的序列 29
圖4-10 MapReduce架構與單機架構的PrefixSpan演算法執行時間折線圖 32
參考文獻
[1].R. Agrawal and R. Srikant. Mining sequential patterns. Presented at Data Engineering, 1995. Proceedings of the Eleventh International Conference On. 1995, . DOI: 10.1109/ICDE.1995.380415.
[2].J. H. Chang and N. H. Park. Comparative analysis of sequence weighting approaches for mining time-interval weighted sequential patterns. Expert Syst. Appl. 39(3), pp. 3867-3873. 2012. 
[3].Y. Chen, M. Chiang and M. Ko. Discovering time-interval sequential patterns in sequence databases. Expert Syst. Appl. 25(3), pp. 343-354. 2003. 
[4].Y. Chen and Y. Hu. Constraint-based sequential pattern mining: The consideration of recency and compactness. Decis. Support Syst. 42(2), pp. 1203-1215. 2006. 
[5].J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal and M. Hsu. FreeSpan: Frequent pattern-projected sequential pattern mining. Presented at Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2000, .
[6].J. Han, J. Pei, Y. Yin and R. Mao. Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery 8(1), pp. 53-87. Available: http://dx.doi.org/10.1023/B:DAMI.0000005258.31418.83. DOI: 10.1023/B:DAMI.0000005258.31418.83".
[7].Q. He, Q. Wang, F. Zhuang, Q. Tan and Z. Shi. Parallel CLARANS clustering based on MapReduce. Energy Procedia 13pp. 3269-3279. 2011. Available: /cgi-bin/sciserv.pl?collection=journals&journal=18766102&issue=v13inone_c&article=3269_pccbom.
[8].Jian Pei, Jiawei Han, B. Mortazavi-Asl, H. Pinto, Qiming Chen, U. Dayal and Mei-Chun Hsu. PrefixSpan,: Mining sequential patterns efficiently by prefix-projected pattern growth. Presented at Data Engineering, 2001. Proceedings. 17th International Conference On. 2001, . DOI: 10.1109/ICDE.2001.914830.
[9].C. Kim, J. Lim, R. T. Ng and K. Shim. SQUIRE: Sequential pattern mining with quantities. J. Syst. Software 80(10), pp. 1726-1745. 2007. 
[10].W. Peng and Z. Liao. Mining sequential patterns across multiple sequence databases. Data Knowl. Eng. 68(10), pp. 1014-1033. 2009. 
[11].R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. Presented at Proceedings of the 5th International Conference on Extending Database Technology: Advances in Database Technology. 1996, Available: http://dl.acm.org/citation.cfm?id=645337.650382.
[12].M. Svendsen, A. P. Mukherjee and S. Tirthapura. Mining maximal cliques from a large graph using MapReduce: Tackling highly uneven subproblem sizes. Journal of Parallel and Distributed Computing 79-80pp. 104-114. 2015. Available: /cgi-bin/sciserv.pl?collection=journals&journal=07437315&issue=v79-80inone_c&article=104_mmcfalmthuss.
[13].Yen-Liang Chen and T. C. -. Huang. Discovering fuzzy time-interval sequential patterns in sequence databases. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions On 35(5), pp. 959-972. 2005. . DOI: 10.1109/TSMCB.2005.847741.
[14].U. Yun. A new framework for detecting weighted sequential patterns in large sequence databases. Knowledge-Based Syst. 21(2), pp. 110-122. 2008. 
[15].M. J. Zaki. SPADE: An efficient algorithm for mining frequent sequences. Mach. Learning 42(1), pp. 31-60. Available: http://dx.doi.org/10.1023/A:1007652502315. DOI: 10.1023/A:1007652502315".
[16].J. Zhang, T. Li, D. Ruan, Z. Gao and C. Zhao. A parallel method for computing rough set approximations. Inf. Sci. 194pp. 209-223. 2012. Available: /cgi-bin/sciserv.pl?collection=journals&journal=00200255&issue=v194inone_c&article=209_apmfcrsa.
[17].刘栋, 尉永清 and 薛文娟. 基于 map reduce 的序列模式挖掘算法. 计算机工程 38(15), pp. 43-45. 2012.
論文全文使用權限
校內
校內紙本論文立即公開
同意電子論文全文授權校園內公開
校內電子論文立即公開
校外
同意授權
校外電子論文立即公開

如有問題,歡迎洽詢!
圖書館數位資訊組 (02)2621-5656 轉 2487 或 來信