§ Browse Thesis Bibliographic Record
  
System ID U0002-1408201621154300
DOI 10.6846/TKU.2016.00374
Title (Chinese) 尋求有效率之資料串流頻繁樣式探勘
Title (English) In Pursuit of Efficient Frequent Pattern Mining in Data Streams
Title (third language)
University 淡江大學 (Tamkang University)
Department (Chinese) 電機工程學系碩士班
Department (English) Department of Electrical and Computer Engineering
Foreign degree university
Foreign degree college
Foreign degree institute
Academic year 104
Semester 2
Year of publication 105 (2016)
Author (Chinese) 凃耘昇
Author (English) Yun-Sheng Tu
Student ID 603450023
Degree Master's
Language Traditional Chinese
Second language English
Defense date 2016-06-17
Pages 72
Committee Advisor - 莊博任 (pjchuang@ee.tku.edu.tw)
Member - 陳省隆 (hlchen@mail.ntust.edu.tw)
Member - 許獻聰 (stsheu@ce.ncu.edu.tw)
Keywords (Chinese) 資料串流
頻繁樣式探勘
Keywords (English) Data stream
Frequent pattern mining
Keywords (third language)
Subject classification
Chinese Abstract
Data stream mining is one of the current research topics in the big data field. A data stream is a large volume of data generated continuously over time in daily life, such as users' click streams, sensor readings, purchase records, and network traffic. These streams carry much valuable information; with data mining techniques, the results needed by different applications can be extracted, leading to better decisions. One topic within data mining is frequent pattern mining, whose goal is to find the patterns that occur most frequently in the data so that they can be put to further use.
In the research area of finding frequent patterns in data streams, the LCSS algorithm was proposed in recent years to meet the need for real-time analysis and to cope with bursty data. It can process a data stream in bounded time and space, but because it does not use a well-suited data structure, its computation time is long. To address this problem, we store and access patterns with a better data structure, the Trie, a lexicographically ordered tree. Searching for a pattern therefore takes fewer steps, which raises processing speed substantially.
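The lexicographic Trie described above can be sketched as follows; the class and method names are illustrative and not taken from the thesis's implementation.

```python
# Minimal sketch of a lexicographic Trie storing patterns with support
# counts. Names (TrieNode, PatternTrie) are illustrative assumptions.
class TrieNode:
    def __init__(self):
        self.children = {}      # item -> child TrieNode
        self.count = 0          # support count of the pattern ending here
        self.is_pattern = False

class PatternTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, pattern, count=1):
        """Insert an itemset; sorting gives every pattern one canonical path."""
        node = self.root
        for item in sorted(pattern):
            node = node.children.setdefault(item, TrieNode())
        node.is_pattern = True
        node.count += count

    def lookup(self, pattern):
        """Return the stored count in O(|pattern|) steps, or None if absent."""
        node = self.root
        for item in sorted(pattern):
            node = node.children.get(item)
            if node is None:
                return None
        return node.count if node.is_pattern else None
```

Because patterns that share a prefix share nodes, a lookup walks at most |pattern| edges instead of scanning every stored pattern, which is where the speedup over a flat list comes from.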
The LCSS algorithm has another problem: when the minimum support threshold is low, it cannot find frequent patterns correctly, and accuracy approaches zero. Because LCSS stores patterns in bounded space, there is not enough buffer space to hold them when the data are large and complex. To remedy this low accuracy we propose Length Skip, which sacrifices long frequent patterns and concentrates computing resources on short frequent patterns. While processing the stream, Length Skip uses a sliding window to keep the most recent 10K transactions. At each checkpoint, the data in the sliding window are used to estimate accuracy; if accuracy is low, an appropriate limit X on pattern length is computed, so that patterns of length 1 to X have more room in the summary, which raises accuracy effectively.
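A minimal sketch of the Length Skip bookkeeping: the window size follows the abstract (latest 10K transactions), while the checkpoint interval, the summary budget, and the rule for choosing X are simplified stand-ins for the thesis's exact procedure.

```python
from collections import deque

# Illustrative constants; only WINDOW_SIZE comes from the abstract.
WINDOW_SIZE = 10_000     # keep the latest 10K transactions
CHECK_INTERVAL = 1_000   # checkpoint spacing (assumption)

# deque(maxlen=...) drops the oldest transaction automatically,
# giving a fixed-size sliding window over the stream.
window = deque(maxlen=WINDOW_SIZE)

def choose_length_cap(length_counts, budget):
    """Largest X such that all patterns of length 1..X fit the summary budget.

    length_counts maps pattern length -> number of candidate patterns of
    that length seen in the window; budget is the summary capacity.
    """
    total, cap = 0, 0
    for length in sorted(length_counts):
        total += length_counts[length]
        if total > budget:
            break
        cap = length
    return cap
```

The design choice mirrors the abstract: short patterns are admitted first, and the cap X is the point where admitting the next length would exhaust the summary, so lengths 1..X keep enough space.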
Experimental results confirm that storing patterns in a Trie completes processing in fewer steps and greatly reduces execution time. For accuracy, the sliding window is used effectively to estimate accuracy and to compute an appropriate length limit X, preventing excess long patterns from flooding the summary and thereby raising accuracy substantially.
English Abstract
Data stream mining is a research topic in the big data field. Data streams contain large volumes of data generated continuously in daily life: click data, sensor data, transaction data, network traffic data, and so on. They hold much valuable information, and data mining techniques are used to analyze them in different applications so that better decisions can be made. Within data mining, frequent pattern mining is an important topic: it finds patterns that appear frequently, which can then be used in various applications and further analysis.
  In recent years there has been much research on frequent pattern mining in data streams. One method, the LCSS algorithm, can respond in real time and handle bursty data. It processes data streams in bounded run time and memory, but its unsuitable data structure costs considerable time. To solve this problem, we use a better data structure, a Trie, which is a lexicographic tree. As a result, run time decreases because fewer steps are needed to access patterns in the Trie.
  The LCSS algorithm has a further problem: when the minimum support threshold is low, frequent patterns cannot be found accurately, and accuracy approaches zero. This is because LCSS stores patterns in bounded memory, and when the stream is large and complex there is not enough space to hold them. To raise accuracy, we propose Length Skip, which sacrifices long frequent patterns and focuses on short frequent patterns. While processing the stream, it keeps the last ten thousand transactions in a fixed-size sliding window. At each checkpoint, the data in the window are used to estimate accuracy; if accuracy is low, an appropriate X is computed to limit the length of patterns.
  Experimental results show that accessing patterns through a Trie reduces the number of processing steps, so the stream is processed in less time. In addition, the sliding window is used to estimate accuracy and an appropriate length limit X, so that the summary is not filled with long frequent patterns; accuracy therefore rises markedly.
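The accuracy measure evaluated in Chapter 4 is the F-score over recall and precision (cf. Fig. 4.2). A standard computation over mined versus true frequent-pattern sets looks like this; representing patterns as tuples of items is an assumption made for the sketch.

```python
# Standard F-score of a mined frequent-pattern set against the true set.
def f_score(mined, true):
    mined, true = set(mined), set(true)
    if not mined or not true:
        return 0.0
    tp = len(mined & true)          # correctly mined patterns
    precision = tp / len(mined)     # fraction of mined patterns that are true
    recall = tp / len(true)         # fraction of true patterns that were mined
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```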
Third-Language Abstract
Table of Contents
Chapter 1. Introduction	1
1.1 Thesis Overview	1
1.2 Thesis Organization	4
Chapter 2. Related Background	5
2.1 Differences between Database Mining and Data Stream Mining	5
2.2 Data Stream Mining Models	6
2.2.1 Landmark	6
2.2.2 Sliding window	7
2.2.3 Time fading	8
2.2.4 Comparison of the Models	9
2.3 Landmark Model Processing Modes	9
2.3.1 Batch by Batch	9
2.3.2 One by One	10
2.4 LCSS Algorithm	11
2.4.1 Features of LCSS	11
2.4.2 Skip LCSS algorithm	13
2.4.3 Stream Reduction	20
2.4.4 Skip LCSS Example	23
2.4.5 R-skip	26
Chapter 3. Our Approach	27
3.1 Problem Description of LCSS	27
3.1.1 LCSS Execution Time	27
3.1.2 LCSS Accuracy	29
3.2 Reducing Execution Time	32
3.2.1 Using the Trie Data Structure	32
3.2.2 Big-O Analysis	34
3.2.3 Cost of Building the cmin Queue	36
3.3 Improving Accuracy	37
3.3.1 Summary size	37
3.3.2 Length Skip	38
Chapter 4. Simulation and Evaluation	48
4.1 Experimental Environment and Test Data	48
4.2 Input Parameters and Evaluation Metrics	49
4.2.1 Evaluating Execution Time	50
4.2.2 Evaluating Accuracy (F-score)	50
4.3 Experimental Results	52
4.3.1 Reduced Execution Time	52
4.3.2 Improved Accuracy	53
Chapter 5. Conclusions	66
Chapter 6. References	69

List of Figures
Figure 2.1 Landmark	7
Figure 2.2 Sliding window	8
Figure 2.3 Time fading	8
Figure 2.4 Batch by Batch	10
Figure 2.5 One by One	11
Figure 2.6 Entry table	12
Figure 2.7 Skip LCSS flowchart	14
Figure 2.8 Skip LCSS pseudocode	18
Figure 2.9 Case A: cs < me	19
Figure 2.10 Case B: cs = me	19
Figure 2.11 Case C: cs > me	19
Figure 2.12 Stream Reduction illustration	21
Figure 2.13 Stream Reduction pseudocode	22
Figure 2.14 Item table format	22
Figure 2.15 Skip LCSS initialization	23
Figure 2.16 Skip LCSS example	24
Figure 3.1 LCSS algorithm execution time	28
Figure 3.2 Accuracy comparison of Skip LCSS and R-skip	29
Figure 3.3 Accessing patterns with a Trie	33
Figure 3.4 cmin Queue	34
Figure 3.5 Comparison of List and Trie	35
Figure 3.6 Relation between summary size and F-score	38
Figure 3.7 Relation between length limit and accuracy for different datasets	42
Figure 3.8 Computing the length limit X with the sliding window	43
Figure 3.9 LCSS + Length Skip flowchart	45
Figure 3.10 Length Skip flowchart	47
Figure 4.1 First 10 records of the Retail data	49
Figure 4.2 Recall & Precision	51
Figure 4.3 Execution time reduction with the Trie	53
Figure 4.4 Sliding-window accuracy trend in Retail data	54
Figure 4.5 Sliding-window accuracy trend in T10I4D100K data	55
Figure 4.6 Sliding-window accuracy trend in BMS2 data	55
Figure 4.7 Trend of the pattern length limit X in Retail data	57
Figure 4.8 Trend of the pattern length limit X in T10I4D100K data	57
Figure 4.9 Trend of the pattern length limit X in BMS2 data	58
Figure 4.10 X value at each checkpoint	59
Figure 4.11 Execution time reduction with Length Skip	60
Figure 4.12 Accuracy improvement with Length Skip in Retail data	62
Figure 4.13 Accuracy improvement with Length Skip in T10I4D100K data	62
Figure 4.14 Accuracy improvement with Length Skip in BMS2 data	63
Figure 4.15 Distribution of frequent patterns	64


List of Tables
Table 2.1 Skip LCSS input parameters	17
Table 2.2 Skip LCSS variables	17
Table 2.3 Update methods for the different cases	20
Table 3.1 Differences between Skip LCSS and R-skip	27
Table 3.2 Skip LCSS accuracy	30
Table 3.3 Threshold relation in Retail data	31
Table 3.4 R-skip accuracy	31
Table 3.5 List and Trie parameters	35
Table 3.6 Relation between pattern length and count in Retail data	39
Table 3.7 Relation between length limit and accuracy in Retail data	42
Table 4.1 Dataset characteristics	48
Table 4.2 Accuracy for patterns of length 1–4 in Retail data	61
Table 4.3 Accuracy with Length Skip in Retail data	61
Table 4.4 Pattern loss rate	64
References
[1] J. Gama, “Data Stream Mining: the Bounded Rationality,” Informatica, vol. 37, 2013, pp. 21–25.
[2] C. C. Aggarwal, “Applications of Frequent Pattern Mining,” in Frequent Pattern Mining, Cham: Springer International Publishing, 2014, pp. 443–467.
[3] C.-W. Tsai, C.-F. Lai, M.-C. Chiang, and L. T. Yang, “Data Mining for Internet of Things: A Survey,” IEEE Commun. Surv. Tutorials, vol. 16, no. 1, 2014, pp. 77–97.
[4] V. E. Lee, R. Jin, and G. Agrawal, “Frequent Pattern Mining in Data Streams,” in Frequent Pattern Mining, Cham: Springer International Publishing, 2014, pp. 199–224.
[5] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.
[6] V. Rao, “Numeric Measures for Association Rules.” [Online]. Available: http://www.datasciencecentral.com/profiles/blogs/numeric-measures-for-association-rules.
[7] K. J. Kang, B. Ka, and S. J. Kim, “A service scenario generation scheme based on association rule mining for elderly surveillance system in a smart home environment,” Eng. Appl. Artif. Intell., vol. 25, no. 7, Oct. 2012, pp. 1355–1364.
[8] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules in Large Databases,” in Proceedings of the 20th International Conference on Very Large Data Bases, 1994, pp. 487–499.
[9] J. Han, J. Pei, and Y. Yin, “Mining Frequent Patterns Without Candidate Generation,” SIGMOD Rec., vol. 29, no. 2, 2000, pp. 1–12. 
[10] “Apriori algorithm.” [Online]. Available: https://en.wikipedia.org/wiki/Apriori_algorithm.
[11] “Precision and recall.” [Online]. Available: https://en.wikipedia.org/wiki/Precision_and_recall.
[12] G. Lee, U. Yun, and K. H. Ryu, “Sliding window based weighted maximal frequent pattern mining over data streams,” Expert Syst. Appl., vol. 41, no. 2, Feb. 2014, pp. 694–708.
[13] L. Chen and Q. Mei, “Mining frequent items in data stream using time fading model,” Inf. Sci. (Ny), vol. 257, Feb. 2014, pp. 54–69.
[14] C.-W. Li, K.-F. Jea, R.-P. Lin, S.-F. Yen, and C.-W. Hsu, “Mining frequent patterns from dynamic data streams with data load management,” J. Syst. Softw., vol. 85, no. 6, Jun. 2012, pp. 1346–1362.
[15] Y. Yamamoto, K. Iwanuma, and S. Fukuda, “Resource-oriented approximation for frequent itemset mining from bursty data streams,” in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD ’14), 2014, pp. 205–216.
[16] W. Ng and M. Dash, “A comparison between approximate counting and sampling methods for frequent pattern mining on data streams,” Intell. Data Anal., vol. 14, no. 6, 2010, pp. 749–771.
[17] G. S. Manku and R. Motwani, “Approximate frequency counts over data streams,” Proceedings of the 28th international conference on Very Large Data Bases. VLDB Endowment, 2002, pp. 346–357.
[18] A. Metwally, D. Agrawal, and A. El Abbadi, “Efficient Computation of Frequent and Top-k Elements in Data Streams,” in Proceedings of the 10th international conference on Database Theory, Springer-Verlag, 2004, pp. 398–412.
[19] L. P. Suresh, S. S. Dash, and B. K. Panigrahi, “Artificial Intelligence and Evolutionary Algorithms in Engineering Systems: Proceedings of ICAEES 2014, Volume 2,” Jan. 2015.
[20] S. Itkar and U. Kulkarni, “Efficient Frequent Pattern Mining Using Auto-Associative Memory Neural Network,” Br. J. Appl. Sci. Technol., vol. 4, no. 22, 2014, p. 3160.
[21] A. Choubey, R. Patel, and J. L. Rana, “GRAPH BASED NEW APPROACH FOR FREQUENT PATTERN MINING,” Int. J. Comput. Sci. Inf. Technol., vol. 4, no. 1, 2012, p. 221.
[22] A. Adhikari, J. Adhikari, and W. Pedrycz, “Data Analysis and Pattern Recognition in Multiple Databases”, Cham: Springer International Publishing, vol. 61, 2014.
[23] B. Goethals, "Frequent Itemset Mining Dataset Repository." [Online]. Available: http://fimi.ua.ac.be/.
[24] P. Fournier-Viger, A. Gomariz, T. Gueniche, A. Soltani, C. Wu, and V. S. Tseng, “SPMF: a Java Open-Source Pattern Mining Library.” [Online]. Available: http://www.philippe-fournier-viger.com/spmf/
Full-Text Access Rights
On campus
Printed thesis to be released 1 year after submission of the authorization form
Full electronic text authorized for on-campus release
On-campus electronic thesis to be released 1 year after submission of the authorization form
Off campus
Authorization granted
Off-campus electronic thesis to be released 1 year after submission of the authorization form
