System ID | U0002-1408201621154300 |
---|---|
DOI | 10.6846/TKU.2016.00374 |
Title (Chinese) | 尋求有效率之資料串流頻繁樣式探勘 |
Title (English) | In Pursuit of Efficient Frequent Pattern Mining in Data Streams |
Title (third language) | |
University | 淡江大學 (Tamkang University) |
Department (Chinese) | 電機工程學系碩士班 |
Department (English) | Department of Electrical and Computer Engineering |
Foreign degree school | |
Foreign degree college | |
Foreign degree institute | |
Academic year | 104 |
Semester | 2 |
Year of publication | 105 |
Student (Chinese) | 凃耘昇 |
Student (English) | Yun-Sheng Tu |
Student ID | 603450023 |
Degree | Master |
Language | Traditional Chinese |
Second language | English |
Defense date | 2016-06-17 |
Pages | 72 |
Committee | Advisor: 莊博任 (pjchuang@ee.tku.edu.tw); Member: 陳省隆 (hlchen@mail.ntust.edu.tw); Member: 許獻聰 (stsheu@ce.ncu.edu.tw) |
Keywords (Chinese) | 資料串流 (data stream); 頻繁樣式探勘 (frequent pattern mining) |
Keywords (English) | Data stream; Frequent pattern mining |
Keywords (third language) | |
Subject classification | |
Abstract (Chinese; translated) |
Data stream mining is one of today's research topics in the field of big data. A data stream is the large volume of data generated continuously over time in daily life, for example each user's click stream, sensor readings, purchase records, and network traffic. These streams contain much valuable information: with data mining techniques, the results needed by different applications can be extracted, leading to better decisions. One topic within data mining is frequent pattern mining, whose goal is to find which patterns occur most often in the data so that they can be put to further use.

In research on frequent pattern mining over data streams, the LCSS algorithm was proposed in recent years to meet the need for real-time analysis and to cope with bursty data. It can process a data stream in bounded time and space, but because it does not use a well-suited data structure, its computation time is long. To address this, we store and access patterns with a better data structure, the Trie, which is a lexicographically ordered tree. Searching for a pattern therefore takes fewer steps, and speed improves substantially.

The LCSS algorithm has another problem: when the minimum support threshold is low, it cannot find the frequent patterns correctly, and accuracy approaches zero. Because LCSS stores patterns in bounded space, there is not enough buffer space to hold them when the data are large and complex. To remedy the low accuracy, we propose Length Skip, which sacrifices long frequent patterns and concentrates computing resources on short frequent patterns. While processing the data stream, Length Skip uses a sliding window to keep the latest 10K records. At each check point, the data in the sliding window are used to compute accuracy; if accuracy is low, an appropriate limit X on pattern length is computed, so that patterns of length 1 to X have more room in the summary, effectively raising accuracy.

Experimental results confirm that storing patterns in a Trie completes processing in fewer steps, greatly reducing execution time. For accuracy, the sliding window is used to analyze accuracy and compute an appropriate length limit X, preventing the summary from being flooded with long patterns and thereby greatly improving accuracy. |
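The trie-based pattern storage described in the abstract can be illustrated with a minimal sketch. The class and method names below are illustrative, not taken from the thesis; it assumes patterns are itemsets stored along lexicographically sorted paths, with an approximate count kept at each terminal node:

```python
class TrieNode:
    """One trie node; each outgoing edge is labeled with an item."""
    def __init__(self):
        self.children = {}  # item -> TrieNode
        self.count = 0      # support count of the pattern ending at this node

class PatternTrie:
    """Lexicographic trie for itemset patterns: looking up a pattern of
    length k costs k dictionary steps instead of a scan over all patterns."""
    def __init__(self):
        self.root = TrieNode()

    def insert(self, pattern, delta=1):
        """Add `delta` to the count of `pattern` (items sorted first)."""
        node = self.root
        for item in sorted(pattern):  # lexicographic order fixes the path
            node = node.children.setdefault(item, TrieNode())
        node.count += delta
        return node.count

    def support(self, pattern):
        """Return the stored count of `pattern`, or 0 if absent."""
        node = self.root
        for item in sorted(pattern):
            node = node.children.get(item)
            if node is None:
                return 0
        return node.count
```

Because siblings are reached through a dictionary keyed by item, finding a pattern of length k costs k lookups, rather than scanning a flat list of all stored patterns.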
Abstract (English) |
Data stream mining is a research topic in the big data field. Data streams consist of data generated continuously in our daily life, such as click streams, sensor data, transaction data, and network traffic. These streams contain much valuable information, and data mining techniques are used to analyze them in different applications so that better decisions can be made. Within data mining, frequent pattern mining is an important topic: it finds the patterns that appear frequently, which can then be used in further applications and analyses. In recent years there has been much research on frequent pattern mining in data streams. One method, the LCSS algorithm, supports real-time response and handles bursty data. It processes data streams in bounded run time and memory space, but its inappropriate data structure costs a lot of time. To solve this problem, we use a better data structure, the Trie, which is a lexicographic tree; run time decreases because fewer steps are needed to access patterns in the Trie. The LCSS algorithm has a further problem: frequent patterns cannot be found accurately when the minimum support threshold is low, so accuracy approaches zero. Accuracy is low because LCSS stores patterns in bounded memory space; there is not enough space to store them when the data stream is large and complex. To raise accuracy, we propose Length Skip, which sacrifices long frequent patterns and focuses on short frequent patterns. While processing the data stream, it uses a fixed-size sliding window holding the latest ten thousand records. At each check point, the data in the sliding window are used to calculate accuracy; if accuracy is low, an appropriate limit X on pattern length is computed.
Experimental results show that using a Trie to access patterns reduces the number of processing steps, so the data stream is processed in less time. In addition, the sliding window is used to calculate accuracy and an appropriate length limit X, so the summary is not flooded with long frequent patterns and accuracy improves markedly. |
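The check-point logic of Length Skip outlined in the abstract could look roughly like the following sketch. The window size of 10,000 comes from the thesis, while `ACC_THRESHOLD` and the concrete rule for choosing X are placeholders, since the abstract does not specify them:

```python
from collections import deque

WINDOW_SIZE = 10_000   # latest records kept in the sliding window (per the thesis)
ACC_THRESHOLD = 0.5    # hypothetical cutoff below which accuracy counts as "low"

def f_score(found, truth):
    """Harmonic mean of precision and recall over two sets of patterns."""
    tp = len(found & truth)
    if tp == 0:
        return 0.0
    precision = tp / len(found)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)

class LengthSkipMonitor:
    """Control-flow sketch of Length Skip's check-point logic."""
    def __init__(self):
        self.window = deque(maxlen=WINDOW_SIZE)  # sliding window of records
        self.length_limit = None                 # X; None means unlimited

    def add(self, transaction):
        """Feed one incoming transaction into the sliding window."""
        self.window.append(transaction)

    def checkpoint(self, mined_patterns, exact_patterns):
        """Compare mined patterns against the exact frequent patterns of the
        window; if accuracy is low, restrict pattern length to X (here the
        longest exact pattern length minus 1, a placeholder rule)."""
        acc = f_score(mined_patterns, exact_patterns)
        if acc < ACC_THRESHOLD and exact_patterns:
            self.length_limit = max(len(p) for p in exact_patterns) - 1
        return acc
```

A `deque` with `maxlen` discards the oldest record automatically, which matches the fixed-size sliding window described above; patterns are represented as `frozenset`s so set intersection gives the true positives.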
Abstract (third language) | |
Table of Contents |
Chapter 1. Introduction 1
  1.1 Thesis Overview 1
  1.2 Thesis Organization 4
Chapter 2. Related Background 5
  2.1 Differences between Database Mining and Data Stream Mining 5
  2.2 Data Stream Mining Frameworks 6
    2.2.1 Landmark 6
    2.2.2 Sliding Window 7
    2.2.3 Time Fading 8
    2.2.4 Comparison of the Frameworks 9
  2.3 Landmark Model Processing Modes 9
    2.3.1 Batch by Batch 9
    2.3.2 One by One 10
  2.4 LCSS Algorithm 11
    2.4.1 LCSS Features 11
    2.4.2 Skip LCSS Algorithm 13
    2.4.3 Stream Reduction 20
    2.4.4 Skip LCSS Example 23
    2.4.5 R-skip 26
Chapter 3. Our Strategy 27
  3.1 Problem Description of LCSS 27
    3.1.1 LCSS Run Time 27
    3.1.2 LCSS Accuracy 29
  3.2 Reducing Run Time 32
    3.2.1 Using the Trie Data Structure 32
    3.2.2 Big-O Analysis 34
    3.2.3 Cost of Building the cmin Queue 36
  3.3 Improving Accuracy 37
    3.3.1 Summary Size 37
    3.3.2 Length Skip 38
Chapter 4. Simulation Evaluation 48
  4.1 Experimental Environment and Test Data 48
  4.2 Input Parameters and Evaluation Metrics 49
    4.2.1 Evaluating Run Time 50
    4.2.2 Evaluating Accuracy (F-score) 50
  4.3 Experimental Results 52
    4.3.1 Reducing Run Time 52
    4.3.2 Improving Accuracy 53
Chapter 5. Conclusion 66
Chapter 6. References 69
List of Figures:
  Figure 2.1 Landmark 7
  Figure 2.2 Sliding window 8
  Figure 2.3 Time fading 8
  Figure 2.4 Batch by Batch 10
  Figure 2.5 One by One 11
  Figure 2.6 Entry table 12
  Figure 2.7 Skip LCSS flowchart 14
  Figure 2.8 Skip LCSS pseudocode 18
  Figure 2.9 Case A: cs < me 19
  Figure 2.10 Case B: cs = me 19
  Figure 2.11 Case C: cs > me 19
  Figure 2.12 Stream Reduction illustration 21
  Figure 2.13 Stream Reduction pseudocode 22
  Figure 2.14 Item table format 22
  Figure 2.15 Skip LCSS initialization 23
  Figure 2.16 Skip LCSS example 24
  Figure 3.1 LCSS algorithm run time 28
  Figure 3.2 Accuracy comparison of Skip LCSS and R-skip 29
  Figure 3.3 Accessing patterns with a Trie 33
  Figure 3.4 cmin Queue 34
  Figure 3.5 Comparison of List and Trie 35
  Figure 3.6 Relationship between summary size and F-score 38
  Figure 3.7 Relationship between length limit and accuracy for different datasets 42
  Figure 3.8 Computing length limit X with the sliding window 43
  Figure 3.9 LCSS + Length Skip flowchart 45
  Figure 3.10 Length Skip flowchart 47
  Figure 4.1 First 10 patterns of the Retail data 49
  Figure 4.2 Recall & Precision 51
  Figure 4.3 Run-time reduction with the Trie 53
  Figure 4.4 Sliding-window accuracy trend in Retail data 54
  Figure 4.5 Sliding-window accuracy trend in T10I4D100K data 55
  Figure 4.6 Sliding-window accuracy trend in BMS2 data 55
  Figure 4.7 Trend of pattern length limit X in Retail data 57
  Figure 4.8 Trend of pattern length limit X in T10I4D100K data 57
  Figure 4.9 Trend of pattern length limit X in BMS2 data 58
  Figure 4.10 X value at each check point 59
  Figure 4.11 Run-time reduction with Length Skip 60
  Figure 4.12 Accuracy improvement with Length Skip in Retail data 62
  Figure 4.13 Accuracy improvement with Length Skip in T10I4D100K data 62
  Figure 4.14 Accuracy improvement with Length Skip in BMS2 data 63
  Figure 4.15 Frequent pattern distribution 64
List of Tables:
  Table 2.1 Skip LCSS input parameters 17
  Table 2.2 Skip LCSS variables 17
  Table 2.3 Update methods for the different cases 20
  Table 3.1 Differences between Skip LCSS and R-skip 27
  Table 3.2 Skip LCSS accuracy 30
  Table 3.3 Relationship between threshold and accuracy in Retail data 31
  Table 3.4 R-skip accuracy 31
  Table 3.5 List and Trie parameters 35
  Table 3.6 Relationship between pattern length and count in Retail data 39
  Table 3.7 Relationship between length limit and accuracy in Retail data 42
  Table 4.1 Dataset characteristics 48
  Table 4.2 Accuracy of patterns of length 1–4 in Retail data 61
  Table 4.3 Accuracy with Length Skip in Retail data 61
  Table 4.4 Pattern loss rate 64 |
References |
[1] J. Gama, “Data Stream Mining: the Bounded Rationality,” Informatica, vol. 37, 2013, pp. 21–25.
[2] C. C. Aggarwal, “Applications of Frequent Pattern Mining,” in Frequent Pattern Mining, Cham: Springer International Publishing, 2014, pp. 443–467.
[3] C.-W. Tsai, C.-F. Lai, M.-C. Chiang, and L. T. Yang, “Data Mining for Internet of Things: A Survey,” IEEE Commun. Surv. Tutorials, vol. 16, no. 1, 2014, pp. 77–97.
[4] V. E. Lee, R. Jin, and G. Agrawal, “Frequent Pattern Mining in Data Streams,” in Frequent Pattern Mining, Cham: Springer International Publishing, 2014, pp. 199–224.
[5] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.
[6] V. Rao, “Numeric Measures for Association Rules.” [Online]. Available: http://www.datasciencecentral.com/profiles/blogs/numeric-measures-for-association-rules.
[7] K. J. Kang, B. Ka, and S. J. Kim, “A service scenario generation scheme based on association rule mining for elderly surveillance system in a smart home environment,” Eng. Appl. Artif. Intell., vol. 25, no. 7, Oct. 2012, pp. 1355–1364.
[8] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules in Large Databases,” in Proceedings of the 20th International Conference on Very Large Data Bases, 1994, pp. 487–499.
[9] J. Han, J. Pei, and Y. Yin, “Mining Frequent Patterns Without Candidate Generation,” SIGMOD Rec., vol. 29, no. 2, 2000, pp. 1–12.
[10] “Apriori algorithm.” [Online]. Available: https://en.wikipedia.org/wiki/Apriori_algorithm.
[11] “Precision and recall.” [Online]. Available: https://en.wikipedia.org/wiki/Precision_and_recall.
[12] G. Lee, U. Yun, and K. H. Ryu, “Sliding window based weighted maximal frequent pattern mining over data streams,” Expert Syst. Appl., vol. 41, no. 2, Feb. 2014, pp. 694–708.
[13] L. Chen and Q. Mei, “Mining frequent items in data stream using time fading model,” Inf. Sci. (Ny), vol. 257, Feb. 2014, pp. 54–69.
[14] C.-W. Li, K.-F. Jea, R.-P. Lin, S.-F. Yen, and C.-W. Hsu, “Mining frequent patterns from dynamic data streams with data load management,” J. Syst. Softw., vol. 85, no. 6, Jun. 2012, pp. 1346–1362.
[15] Y. Yamamoto, K. Iwanuma, and S. Fukuda, “Resource-oriented approximation for frequent itemset mining from bursty data streams,” in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data - SIGMOD ’14, 2014, pp. 205–216.
[16] W. Ng and M. Dash, “A comparison between approximate counting and sampling methods for frequent pattern mining on data streams,” Intell. Data Anal., vol. 14, no. 6, 2010, pp. 749–771.
[17] G. S. Manku and R. Motwani, “Approximate frequency counts over data streams,” in Proceedings of the 28th International Conference on Very Large Data Bases, VLDB Endowment, 2002, pp. 346–357.
[18] A. Metwally, D. Agrawal, and A. El Abbadi, “Efficient Computation of Frequent and Top-k Elements in Data Streams,” in Proceedings of the 10th International Conference on Database Theory, Springer-Verlag, 2004, pp. 398–412.
[19] L. P. Suresh, S. S. Dash, and B. K. Panigrahi, Artificial Intelligence and Evolutionary Algorithms in Engineering Systems: Proceedings of ICAEES 2014, Volume 2, Jan. 2015.
[20] S. Itkar and U. Kulkarni, “Efficient Frequent Pattern Mining Using Auto-Associative Memory Neural Network,” Br. J. Appl. Sci. Technol., vol. 4, no. 22, 2014, p. 3160.
[21] A. Choubey, R. Patel, and J. L. Rana, “Graph Based New Approach for Frequent Pattern Mining,” Int. J. Comput. Sci. Inf. Technol., vol. 4, no. 1, 2012, p. 221.
[22] A. Adhikari, J. Adhikari, and W. Pedrycz, Data Analysis and Pattern Recognition in Multiple Databases, Cham: Springer International Publishing, vol. 61, 2014.
[23] B. Goethals, “Frequent Itemset Mining Dataset Repository.” [Online]. Available: http://fimi.ua.ac.be/.
[24] P. Fournier-Viger, A. Gomariz, T. Gueniche, A. Soltani, C. Wu, and V. S. Tseng, “SPMF: a Java Open-Source Pattern Mining Library.” [Online]. Available: http://www.philippe-fournier-viger.com/spmf/ |
Full-text access rights | |