淡江大學覺生紀念圖書館 (TKU Library)
System ID U0002-1408201621154300
Chinese title 尋求有效率之資料串流頻繁樣式探勘
English title In Pursuit of Efficient Frequent Pattern Mining in Data Streams
Institution Tamkang University (淡江大學)
Department (Chinese) 電機工程學系碩士班 (Master's Program, Department of Electrical Engineering)
Department (English) Department of Electrical Engineering
Academic year 104
Semester 2
Year of publication 105
Student name (Chinese) 凃耘昇
Student name (English) Yun-Sheng Tu
Student ID 603450023
Degree Master's
Language Chinese
Second language English
Oral defense date 2016-06-17
Number of pages 72
Thesis committee Advisor: 莊博任
Member: 陳省隆
Member: 許獻聰
Chinese keywords 資料串流 (data stream); 頻繁樣式探勘 (frequent pattern mining)
English keywords Data stream; Frequent pattern mining
Subject classification Applied sciences: electrical and electronic engineering
Chinese abstract (translated) Data stream mining is one of today's research topics in the big data field. A data stream is the large volume of data continuously generated over time in daily life, for example each user's click stream, sensor readings, purchase records, network traffic, and so on. These streams hide a great deal of valuable information; with data mining techniques, the results needed by different applications can be extracted and better decisions made. One topic within data mining is frequent pattern mining, whose goal is to find which patterns occur most often in the data; the discovered patterns can then be put to further use.
In the research on finding frequent patterns in data streams, the LCSS algorithm has appeared in recent years to meet the demands of real-time analysis and bursty data. It can process a data stream in bounded time and space, but because it does not use an efficient data structure, its computation time is long. To address this, we store the data in a better structure, a Trie, which is a lexicographically ordered tree. Patterns can therefore be searched in fewer steps, and speed improves markedly.
The LCSS algorithm has another problem. When the minimum support threshold is low, it cannot find the frequent patterns correctly, and its accuracy approaches zero. The cause is that LCSS stores patterns in a bounded space: when the data are large and complex, there is not enough buffer space for patterns. To improve the low accuracy, we propose Length Skip, which sacrifices long frequent patterns and concentrates computing resources on short frequent patterns. While processing the data stream, Length Skip uses a sliding window to keep the latest 10K records. At each check point, the data in the sliding window are used to compute the accuracy; if it is low, an appropriate X is computed to limit the pattern length, so that patterns of length 1 to X have more room in the summary, effectively raising accuracy.
Experimental results confirm that accessing patterns through a Trie requires fewer steps and greatly reduces execution time. For accuracy, the sliding window is used to analyze accuracy and compute an appropriate length limit X, which keeps the summary from being flooded with long patterns and thus greatly improves accuracy.
English abstract Data stream mining is a research topic in the big data field. Data streams consist of large amounts of data generated continuously in daily life, such as click streams, sensor data, transaction data, and network traffic data. These streams contain a great deal of valuable information: data mining techniques can analyze them for different applications so that better decisions can be made. Within data mining, frequent pattern mining is an important topic. Its goal is to find patterns that appear frequently; the discovered patterns can then be used in several applications and further analysis.
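To make the task concrete, frequent pattern mining can be illustrated with a toy support-counting sketch; this is not the thesis's algorithm, and the transactions and threshold below are made up:

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Count every itemset of size 1 or 2 and keep those whose support
    (fraction of transactions containing the itemset) meets the threshold."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for size in (1, 2):
            for combo in combinations(items, size):
                counts[combo] += 1
    n = len(transactions)
    return {p: c / n for p, c in counts.items() if c / n >= min_support}


transactions = [["a", "b"], ["a", "c"], ["a", "b", "c"], ["b"]]
r = frequent_itemsets(transactions, min_support=0.5)
print(r)  # ("a",) appears in 3 of 4 transactions, so its support is 0.75
```

An exhaustive count like this needs the whole dataset in memory, which is exactly what a data stream setting forbids; stream algorithms such as LCSS instead maintain a bounded summary of approximate counts.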
In recent years, there has been a great deal of research on frequent pattern mining in data streams. One method, the LCSS algorithm, can give real-time responses and handle bursty data. The LCSS algorithm processes data streams in bounded run time and memory space, but its data structure is inefficient and costs a lot of time. To address this problem, we use a better data structure, a Trie, which is a lexicographic tree. As a result, run time decreases because fewer steps are needed to access patterns in the Trie.
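The lexicographic-tree idea can be sketched as follows. This is a minimal illustrative Trie, not the thesis's implementation; the class and method names are hypothetical. The point is that a lookup walks at most one edge per item, rather than scanning a list of all stored patterns:

```python
class TrieNode:
    """One node per item; a path from the root spells a lexicographically
    sorted pattern, and `count` holds that pattern's estimated support."""

    def __init__(self):
        self.children = {}
        self.count = 0


class PatternTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, pattern, count=1):
        node = self.root
        for item in sorted(pattern):  # lexicographic order fixes one path per pattern
            node = node.children.setdefault(item, TrieNode())
        node.count += count

    def lookup(self, pattern):
        """Walk at most len(pattern) edges; no scan over all stored patterns."""
        node = self.root
        for item in sorted(pattern):
            if item not in node.children:
                return 0
            node = node.children[item]
        return node.count


trie = PatternTrie()
trie.insert(["b", "a"])  # stored along the path a -> b
trie.insert(["a", "b"])  # same path, so the same counter
trie.insert(["a"])
print(trie.lookup(["a", "b"]))  # -> 2
```

Because {b, a} and {a, b} sort to the same path, each itemset has exactly one node, so both lookup and update cost is bounded by the pattern length instead of the summary size.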
In addition, the LCSS algorithm has another problem. When the minimum support threshold is low, frequent patterns cannot be found accurately, and accuracy approaches zero. The cause is that the LCSS algorithm stores patterns in a bounded memory space; when data streams are large and complex, there is not enough space to store all the patterns. To raise accuracy, we propose Length Skip, which sacrifices long frequent patterns and concentrates resources on short frequent patterns. While processing the data stream, it uses a fixed-size sliding window to hold the most recent ten thousand transactions. At each check point, the data in the sliding window are used to estimate accuracy. If accuracy is low, an appropriate X is computed to limit the pattern length.
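The check-point logic just described might be sketched as below. The accuracy test and the rule for choosing X are hypothetical stand-ins (the abstract does not give the exact formula; here X is simply the largest length such that all patterns up to that length fit in the summary budget), and all names are made up:

```python
from collections import deque

WINDOW_SIZE = 10_000  # the "most recent ten thousand transactions" above


class LengthSkipMonitor:
    """Illustrative sketch of the Length Skip check-point step."""

    def __init__(self, summary_budget):
        self.window = deque(maxlen=WINDOW_SIZE)  # old entries fall out automatically
        self.summary_budget = summary_budget

    def observe(self, transaction):
        self.window.append(transaction)

    def choose_limit(self, patterns_by_length, accuracy, threshold=0.9):
        """At a check point: if accuracy over the window is low, return the
        largest X such that all patterns of length 1..X fit in the summary.
        Returns None when accuracy is already acceptable."""
        if accuracy >= threshold:
            return None
        total, x = 0, 0
        for length in sorted(patterns_by_length):
            total += patterns_by_length[length]
            if total > self.summary_budget:
                break
            x = length
        return x


monitor = LengthSkipMonitor(summary_budget=100)
# hypothetical histogram: number of candidate patterns per length in the window
hist = {1: 30, 2: 50, 3: 40, 4: 80}
print(monitor.choose_limit(hist, accuracy=0.4))  # lengths 1-2 fit (80 <= 100) -> X = 2
```

Discarding patterns longer than X trades away long frequent patterns, but it frees summary entries for the short patterns that dominate the result set, which is the trade-off Length Skip makes.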
Experimental results show that using a Trie to access patterns reduces the number of processing steps, so data streams are processed in much less time. For accuracy, the sliding window is used to estimate accuracy and compute an appropriate length limit X, so that the summary is not flooded with long frequent patterns. Accuracy therefore improves substantially.
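Accuracy in the evaluation is measured with the F-score (Section 4.2.2). The standard computation from precision and recall, shown here with made-up pattern sets, is:

```python
def f_score(found, true_patterns):
    """F1 score comparing a mined pattern set against the ground-truth
    frequent patterns: the harmonic mean of precision and recall."""
    found, true_patterns = set(found), set(true_patterns)
    tp = len(found & true_patterns)  # correctly mined patterns
    if not found or not true_patterns or tp == 0:
        return 0.0
    precision = tp / len(found)            # fraction of mined patterns that are correct
    recall = tp / len(true_patterns)       # fraction of true patterns that were found
    return 2 * precision * recall / (precision + recall)


mined = {("a",), ("b",), ("a", "b")}
truth = {("a",), ("a", "b"), ("c",)}
print(round(f_score(mined, truth), 3))  # precision = recall = 2/3, so F1 = 2/3 -> 0.667
```

A summary crowded with long patterns hurts both terms at once: spurious long patterns lower precision, while short true patterns squeezed out of the summary lower recall.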
Table of Contents
Chapter 1: Introduction 1
1.1 Thesis overview 1
1.2 Thesis organization 4
Chapter 2: Background and Related Work 5
2.1 Differences between database mining and data stream mining 5
2.2 Data stream mining models 6
2.2.1 Landmark 6
2.2.2 Sliding window 7
2.2.3 Time fading 8
2.2.4 Comparison of the models 9
2.3 Landmark model processing modes 9
2.3.1 Batch by batch 9
2.3.2 One by one 10
2.4 LCSS algorithm 11
2.4.1 LCSS features 11
2.4.2 Skip LCSS algorithm 13
2.4.3 Stream reduction 20
2.4.4 Skip LCSS example 23
2.4.5 R-skip 26
Chapter 3: Our Approach 27
3.1 LCSS problem description 27
3.1.1 LCSS execution time 27
3.1.2 LCSS accuracy 29
3.2 Reducing execution time 32
3.2.1 Using the Trie data structure 32
3.2.2 Big-O analysis 34
3.2.3 Cost of building the cmin queue 36
3.3 Improving accuracy 37
3.3.1 Summary size 37
3.3.2 Length Skip 38
Chapter 4: Simulation and Evaluation 48
4.1 Experimental environment and test data 48
4.2 Input parameters and evaluation metrics 49
4.2.1 Evaluating execution time 50
4.2.2 Evaluating accuracy (F-score) 50
4.3 Experimental results 52
4.3.1 Reducing execution time 52
4.3.2 Improving accuracy 53
Chapter 5: Conclusion 66
Chapter 6: References 69

List of Figures
Figure 2.1: Landmark 7
Figure 2.2: Sliding window 8
Figure 2.3: Time fading 8
Figure 2.4: Batch by batch 10
Figure 2.5: One by one 11
Figure 2.6: Entry table 12
Figure 2.7: Skip LCSS flowchart 14
Figure 2.8: Skip LCSS pseudocode 18
Figure 2.9: Case A: cs < me 19
Figure 2.10: Case B: cs = me 19
Figure 2.11: Case C: cs > me 19
Figure 2.12: Stream reduction diagram 21
Figure 2.13: Stream reduction pseudocode 22
Figure 2.14: Item table layout 22
Figure 2.15: Skip LCSS initialization 23
Figure 2.16: Skip LCSS example 24
Figure 3.1: LCSS algorithm execution time 28
Figure 3.2: Accuracy comparison of Skip LCSS and R-skip 29
Figure 3.3: Accessing patterns with a Trie 33
Figure 3.4: cmin queue 34
Figure 3.5: Comparison of list and Trie 35
Figure 3.6: Relationship between summary size and F-score 38
Figure 3.7: Length limit vs. accuracy for different datasets 42
Figure 3.8: Computing the length limit X with the sliding window 43
Figure 3.9: LCSS + Length Skip flowchart 45
Figure 3.10: Length Skip flowchart 47
Figure 4.1: First 10 records of the Retail data 49
Figure 4.2: Recall & precision 51
Figure 4.3: Reducing execution time with a Trie 53
Figure 4.4: Sliding-window accuracy trend in Retail data 54
Figure 4.5: Sliding-window accuracy trend in T10I4D100K data 55
Figure 4.6: Sliding-window accuracy trend in BMS2 data 55
Figure 4.7: Trend of the pattern length limit X in Retail data 57
Figure 4.8: Trend of the pattern length limit X in T10I4D100K data 57
Figure 4.9: Trend of the pattern length limit X in BMS2 data 58
Figure 4.10: Value of X at each check point 59
Figure 4.11: Reducing execution time with Length Skip 60
Figure 4.12: Improving accuracy with Length Skip in Retail data 62
Figure 4.13: Improving accuracy with Length Skip in T10I4D100K data 62
Figure 4.14: Improving accuracy with Length Skip in BMS2 data 63
Figure 4.15: Frequent pattern distribution 64


List of Tables
Table 2.1: Skip LCSS input parameters 17
Table 2.2: Skip LCSS variables 17
Table 2.3: Update methods for the different cases 20
Table 3.1: Differences between Skip LCSS and R-skip 27
Table 3.2: Skip LCSS accuracy 30
Table 3.3: Threshold relationships in Retail data 31
Table 3.4: R-skip accuracy 31
Table 3.5: List and Trie parameters 35
Table 3.6: Relationship between pattern length and count in Retail data 39
Table 3.7: Length limit vs. accuracy in Retail data 42
Table 4.1: Dataset characteristics 48
Table 4.2: Accuracy for patterns of length 1–4 in Retail data 61
Table 4.3: Accuracy with Length Skip in Retail data 61
Table 4.4: Pattern loss rate 64
References
[1] J. Gama, “Data Stream Mining: the Bounded Rationality,” Informatica, vol. 37, 2013, pp. 21–25.
[2] C. C. Aggarwal, “Applications of Frequent Pattern Mining,” in Frequent Pattern Mining, Cham: Springer International Publishing, 2014, pp. 443–467.
[3] C.-W. Tsai, C.-F. Lai, M.-C. Chiang, and L. T. Yang, “Data Mining for Internet of Things: A Survey,” IEEE Commun. Surv. Tutorials, vol. 16, no. 1, 2014, pp. 77–97.
[4] V. E. Lee, R. Jin, and G. Agrawal, “Frequent Pattern Mining in Data Streams,” in Frequent Pattern Mining, Cham: Springer International Publishing, 2014, pp. 199–224.
[5] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.
[6] V. Rao, “Numeric Measures for Association Rules.” [Online]. Available: http://www.datasciencecentral.com/profiles/blogs/numeric-measures-for-association-rules.
[7] K. J. Kang, B. Ka, and S. J. Kim, “A service scenario generation scheme based on association rule mining for elderly surveillance system in a smart home environment,” Eng. Appl. Artif. Intell., vol. 25, no. 7, Oct. 2012, pp. 1355–1364.
[8] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules in Large Databases,” in Proceedings of the 20th International Conference on Very Large Data Bases, 1994, pp. 487–499.
[9] J. Han, J. Pei, and Y. Yin, “Mining Frequent Patterns Without Candidate Generation,” SIGMOD Rec., vol. 29, no. 2, 2000, pp. 1–12.
[10] “Apriori algorithm.” [Online]. Available: https://en.wikipedia.org/wiki/Apriori_algorithm.
[11] “Precision and recall.” [Online]. Available: https://en.wikipedia.org/wiki/Precision_and_recall.
[12] G. Lee, U. Yun, and K. H. Ryu, “Sliding window based weighted maximal frequent pattern mining over data streams,” Expert Syst. Appl., vol. 41, no. 2, Feb. 2014, pp. 694–708.
[13] L. Chen and Q. Mei, “Mining frequent items in data stream using time fading model,” Inf. Sci. (Ny), vol. 257, Feb. 2014, pp. 54–69.
[14] C.-W. Li, K.-F. Jea, R.-P. Lin, S.-F. Yen, and C.-W. Hsu, “Mining frequent patterns from dynamic data streams with data load management,” J. Syst. Softw., vol. 85, no. 6, Jun. 2012, pp. 1346–1362.
[15] Y. Yamamoto, K. Iwanuma, and S. Fukuda, “Resource-oriented approximation for frequent itemset mining from bursty data streams,” in Proceedings of the 2014 ACM SIGMOD international conference on Management of data - SIGMOD ’14, 2014, pp. 205–216.
[16] W. Ng and M. Dash, “A comparison between approximate counting and sampling methods for frequent pattern mining on data streams,” Intell. Data Anal., vol. 14, no. 6, 2010, pp. 749–771.
[17] G. S. Manku and R. Motwani, “Approximate frequency counts over data streams,” Proceedings of the 28th international conference on Very Large Data Bases. VLDB Endowment, 2002, pp. 346–357.
[18] A. Metwally, D. Agrawal, and A. El Abbadi, “Efficient Computation of Frequent and Top-k Elements in Data Streams,” in Proceedings of the 10th international conference on Database Theory, Springer-Verlag, 2004, pp. 398–412.
[19] L. P. Suresh, S. S. Dash, and B. K. Panigrahi, Eds., Artificial Intelligence and Evolutionary Algorithms in Engineering Systems: Proceedings of ICAEES 2014, Volume 2, Jan. 2015.
[20] S. Itkar and U. Kulkarni, “Efficient Frequent Pattern Mining Using Auto-Associative Memory Neural Network,” Br. J. Appl. Sci. Technol., vol. 4, no. 22, 2014, p. 3160.
[21] A. Choubey, R. Patel, and J. L. Rana, “Graph Based New Approach for Frequent Pattern Mining,” Int. J. Comput. Sci. Inf. Technol., vol. 4, no. 1, 2012, p. 221.
[22] A. Adhikari, J. Adhikari, and W. Pedrycz, “Data Analysis and Pattern Recognition in Multiple Databases”, Cham: Springer International Publishing, vol. 61, 2014.
[23] B. Goethals, "Frequent Itemset Mining Dataset Repository." [Online]. Available: http://fimi.ua.ac.be/.
[24] P. Fournier-Viger, A. Gomariz, T. Gueniche, A. Soltani, C.-W. Wu, and V. S. Tseng, “SPMF: a Java Open-Source Pattern Mining Library.” [Online]. Available: http://www.philippe-fournier-viger.com/spmf/
Thesis usage permissions
  • The author agrees to grant the library royalty-free permission to reproduce the print copy for academic use by in-library readers, effective 2017-08-15.
  • The author agrees to authorize the browse/print electronic full-text service, publicly available from 2017-08-15.

