§ Thesis Bibliographic Record
  
System ID U0002-2506200915075200
DOI 10.6846/TKU.2009.00944
Title (Chinese) 一個處理概念漂移的垃圾郵件分類演算法
Title (English) An Anti-Spam Algorithm for Handling Concept Drift
Title (third language)
Institution Tamkang University
Department (Chinese) Master's Program, Department of Information Management
Department (English) Department of Information Management
Foreign institution name
Foreign college name
Foreign graduate institute name
Academic year 97
Semester 2
Year of publication 98 (ROC calendar, i.e., 2009)
Student name (Chinese) 陳昱辰
Student name (English) Yu-Chen Chen
Student ID 696630036
Degree Master's
Language Traditional Chinese
Second language
Date of oral defense 2009-06-07
Number of pages 61
Oral defense committee Advisor - 周清江
Committee member - 廖賀田
Committee member - 伍台國
Committee member - 翁頌舜
Keywords (Chinese) email classification
concept drift
data skew
Keywords (English) e-mail categorization
concept drift
data skewedness
Keywords (third language)
Subject classification
Abstract (Chinese)
The proliferation of spam has never been fully resolved, and a variety of anti-spam mechanisms have emerged; among them, machine-learning-based content classification and filtering is the most popular. These methods, however, largely rest on the assumption that all data come from a fixed, unchanging environment. In practice, email content keeps changing as concepts drift: a classifier performs well when its model is first built, but as time passes and concepts drift, its classification accuracy gradually declines. A learning and adjustment mechanism is therefore needed that adapts to both newly arrived and older emails in the dataset. Another problem in email classification is data skew: because spam is so prevalent, spam messages usually far outnumber legitimate ones, and although the spam class achieves high recall during classification, recall for the legitimate class is comparatively poor. This study therefore proposes the IFWB (Incremental Forgetting Weighted Bayesian) algorithm, which builds on Bayesian classification, extracts keywords with IGICF (Information Gain and Inverse Class Frequency), and combines a gradual forgetting mechanism with a misclassification-cost framework to address concept drift and data skew in email classification. Finally, the proposed classification method is validated through experiments.
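The gradual forgetting mechanism mentioned above can be sketched as a linear weighting over the training emails, in the spirit of Koychev's gradual forgetting [20] and the setting of the thesis's Figure 10 (N = 100, k = 80%); the function name and the oldest-to-newest ordering below are illustrative assumptions, not the thesis's exact formulation.

```python
def gradual_forgetting_weights(n, k=0.8):
    """Linear gradual-forgetting weights (after Koychev, 2000).

    Emails are ordered oldest (index 0) to newest (index n-1).
    The newest email gets weight 1+k, the oldest 1-k, and the
    mean weight over the window is 1, so the total evidence
    mass is preserved while old emails are gradually discounted.
    """
    if n == 1:
        return [1.0]
    return [1.0 - k + (2.0 * k / (n - 1)) * i for i in range(n)]

# With n=100 and forgetting rate k=0.8 (80%), the oldest email
# counts for 0.2 of a normal example and the newest for 1.8.
weights = gradual_forgetting_weights(100, k=0.8)
```

A larger k makes the classifier forget old concepts faster; k = 0 reduces to ordinary unweighted training.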
Abstract (English)
The problem of spam flooding has never been completely solved, and many anti-spam techniques have been proposed. Among them, machine-learning techniques are the most popular, but most are built on a static-environment assumption. In real-world applications, email content changes as concepts drift: classification results are usually good at the beginning, but as time passes and concepts drift, accuracy gradually declines. A mechanism is therefore needed to adjust the classifier according to both newly arriving and older emails in the dataset. Another problem in email categorization is data skew. Because of the spam flood, spam messages far outnumber legitimate ones; as a result, the majority class achieves a good recall rate while the minority class suffers a poor one. For these reasons, we propose IFWB (Incremental Forgetting Weighted Bayesian), an algorithm based on naïve Bayesian classification and IGICF (Information Gain and Inverse Class Frequency) feature extraction, combined with a gradual forgetting mechanism and a cost-sensitive model to tackle concept drift and data skew. Finally, we demonstrate the effectiveness of IFWB through a series of experiments.
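As a rough illustration of how the pieces named in the abstract fit together, the sketch below trains a weighted naïve Bayes model (per-email weights standing in for gradual-forgetting weights) and makes a cost-sensitive decision. All function names, the toy emails, and the cost values are assumptions for illustration only; this is not the thesis's IFWB implementation, and the IGICF feature-extraction step is omitted.

```python
import math
from collections import defaultdict

def train(emails, weights):
    """Accumulate weighted per-class and per-word counts.

    emails: list of (tokens, label) pairs, label in {"spam", "ham"};
    weights: one forgetting weight per email (newer -> larger).
    """
    word_w = {"spam": defaultdict(float), "ham": defaultdict(float)}
    class_w = {"spam": 0.0, "ham": 0.0}
    for (tokens, label), w in zip(emails, weights):
        class_w[label] += w
        for t in tokens:
            word_w[label][t] += w
    return word_w, class_w

def classify(tokens, word_w, class_w, cost_fp=9.0, cost_fn=1.0):
    """Cost-sensitive decision over weighted naive Bayes scores.

    cost_fp: cost of labeling a legitimate email as spam (illustrative value);
    cost_fn: cost of letting a spam email through.
    """
    vocab = set(word_w["spam"]) | set(word_w["ham"])
    total = class_w["spam"] + class_w["ham"]
    score = {}
    for c in ("spam", "ham"):
        s = math.log(class_w[c] / total)                  # weighted class prior
        denom = sum(word_w[c].values()) + len(vocab)      # Laplace smoothing
        for t in tokens:
            s += math.log((word_w[c][t] + 1.0) / denom)
        score[c] = s
    p_spam = 1.0 / (1.0 + math.exp(score["ham"] - score["spam"]))
    # Predict spam only when its expected cost is lower:
    #   cost(predict ham)  = p_spam * cost_fn
    #   cost(predict spam) = (1 - p_spam) * cost_fp
    return "spam" if p_spam * cost_fn > (1.0 - p_spam) * cost_fp else "ham"
```

With a high `cost_fp`, borderline messages fall back to the legitimate class, which is how a misclassification-cost matrix of the kind in the thesis's Table 3 lifts recall on the minority (legitimate) class.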
Abstract (third language)
Table of Contents
Chapter 1	Introduction	1
1.1	Background and Motivation	1
1.2	Research Objectives	3
1.3	Thesis Organization	5
Chapter 2	Literature Review	6
2.1	Anti-Spam Methods	6
2.2	Classification Algorithms	9
2.3	Concept Drift	13
2.4	Data Skew	15
2.5	Spam Filtering Methods under Concept Drift	16
Chapter 3	The IFWB Algorithm	21
3.1	Selection of Related Algorithms	21
3.2	IFWB Algorithm Flow	23
3.3	Term Extraction	29
3.4	Weighted Bayesian Classification	31
3.5	Misclassification Cost Framework	37
Chapter 4	Experiments	39
4.1	Email Datasets and Experiment Description	39
4.1.1	Experimental Environment and Email Datasets	39
4.1.2	Experiment Description	41
4.2	Experimental Discussion and Analysis	43
4.2.1	Experiment 1	43
4.2.2	Experiment 2	45
4.2.3	Experiment 3	46
4.2.4	Experiment 4	47
4.2.5	Experiment 5	49
4.2.6	Experiment 6	51
4.2.7	Experiment 7	52
Chapter 5	Conclusion	54
References	58

List of Figures
Figure 1: How kNN works	10
Figure 2: How artificial neural networks work	12
Figure 3: How SVM works	13
Figure 4: Association network between terms and cases	18
Figure 5: ICBC cluster-structure adjustment mechanism	19
Figure 6: Training-phase flow	23
Figure 7: Building the term base during training and generating the keyword set	27
Figure 8: Vector matrix of the email dataset	27
Figure 9: Classification-phase flow	28
Figure 10: Linear gradual forgetting equation, N=100, k=80%	32
Figure 11: Classification results of IFWB on each dataset	45
Figure 12: Classification results of the IG and IGICF keyword extraction methods	46
Figure 13: Classification results of Bayesian and IFWB under concept drift	47
Figure 14: Classification results of IFWB with SpamAssassin as the training set and LingSpam as the test set	49
Figure 15: Classification results on TREC and SpamAssassin under various forgetting rates k	50
Figure 16: Legitimate-email recall with and without the cost-sensitive learning framework under data skew	51

List of Tables
Table 1: Keyword extraction algorithm	31
Table 2: Keyword table for an example email	35
Table 3: Misclassification cost matrix	37
Table 4: Public email datasets	40
Table 5: Experimental email datasets	41
Table 6: Classification results under various legitimate-email misclassification costs	52
References
[1]徐燕, 李錦濤, 王斌, 孫春明, & 張森 (2007). A Study of Feature Selection for Text Classification on Imbalanced Data Sets. Journal of Computer Research and Development (計算機研究與發展), Vol. 44, No. z2, 58-62
[2]羅淑薰 (2007). A Neural Network Tree with Partial Incremental Learning Capability and Its Application to Spam Filtering. Master's thesis, Institute of Computer Science and Information Engineering, National Central University
[3]張僩鈞 (2006). A Study of a Two-Tier Spam Filtering Mechanism. Master's thesis, Graduate Institute of Information and Communication Engineering, Ming Chuan University
[4]Aha, D. W., Kibler, D., & Albert, M. K. (1991). Instance-Based Learning Algorithms. Machine Learning, Vol. 6, No. 1, 37-66
[5]Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Sakkis, G., Spyropoulos, C. D., & Stamatopoulos, P. (2000). Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach. Proceedings of the 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, 1-13
[6]Chawla, N. V., Japkowicz, N., & Kotcz, A. (2004). Editorial: Special Issue on Learning from Imbalanced Data Sets. ACM SIGKDD Explorations Newsletter, Vol. 6, No. 1, 1-6
[7]Crocker, D. (2005). Challenges in Anti-Spam Efforts. Internet Protocol Journal, Vol. 8, No. 4, 2-14
[8]Delany, S. J., Cunningham, P., & Tsymbal, A. (2005). A Comparison of Ensemble and Case-Based Maintenance Techniques for Handling Concept Drift in Spam Filtering. Technical Report TCD-CS-2005-19, Trinity College Dublin
[9]Delany, S. J., Cunningham, P., Tsymbal, A., & Coyle, L. (2005). A Case-Based Technique for Tracking Concept Drift in Spam Filtering. Knowledge-Based Systems, Vol. 18, No. 4-5, 187-195
[10]Dredze, M., Gevaryahu, R., & Bachrach, A. E. (2007). Learning Fast Classifiers for Image Spam. Proceedings of the 4th Conference on Email and Anti-Spam, 487-493
[11]Drucker, H., Wu, D., & Vapnik, V. N. (1999).  Support Vector Machines for Spam Categorization. IEEE Transactions on Neural Networks, Vol. 10, No. 5, 1048-1054
[12]Elkan, C. (2001). The Foundations of Cost-Sensitive Learning. Proceedings of the 17th International Joint Conference on Artificial Intelligence, 239-246
[13]Fawcett, T. (2004). In Vivo Spam Filtering: A Challenge Problem for Data Mining. ACM SIGKDD Explorations Newsletter, Vol. 5, No. 2, 140-148
[14]Fdez-Riverola, F., Iglesias, E. L., Díaz, F., Méndez J. R., & Corchado, J. M. (2007). Applying Lazy Learning Algorithms to Tackle Concept Drift in Spam Filtering. Expert Systems with Applications, Vol. 33, No. 1, 36-48
[15]Batista, G. E. A. P. A., Prati, R. C., & Monard, M. C. (2004). A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data. ACM SIGKDD Explorations Newsletter, Vol. 6, No. 1, 20-29
[16]Han, J., & Kamber, M. (2006). Data Mining: Concepts and Techniques, Second Edition, Morgan Kaufmann
[17]Hershkop, S. (2006). Behavior-Based Email Analysis with Application to Spam Detection, PhD thesis, Department of Computer Science, Columbia University 
[18]Hsiao, W. F., & Chang, T. M. (2008). An Incremental Cluster-Based Approach to Spam Filtering. Expert Systems with Applications, Vol. 34, No. 3, 1599-1608
[19]Katakis, I., Tsoumakas, G., & Vlahavas, I. (2005). On the Utility of Incremental Feature Selection for the Classification of Textual Data Streams. Proceedings of the 10th Panhellenic Conference on Informatics, Springer-Verlag, 338-348
[20]Koychev, I. (2000). Gradual Forgetting for Adaptation to Concept Drift. Proceedings of ECAI 2000 Workshop Current Issues in Spatio-Temporal Reasoning, 101-106
[21]Kuncheva, L. I. (2004). Classifier Ensembles for Changing Environments. Lecture Notes in Computer Science (LNCS), Vol. 3077, 1-15
[22]Monard, M. C., & Batista, G. E. A. P. A. (2002). Learning with Skewed Class Distributions. Proceedings of the 3rd Congress of Logic Applied to Technology (LAPTEC), 173-180
[23]Porter, M. F. (1980). An Algorithm for Suffix Stripping. Program, Vol. 14, No. 3, 130-137
[24]Ramachandran, A., Feamster, N., & Vempala, S. (2007). Filtering Spam with Behavioral Blacklisting. Proceedings of the 14th ACM Conference on Computer and Communications Security, 342-351
[25]Sahami, M., Dumais, S., Heckerman, D., & Horvitz, E. (1998). A Bayesian Approach to Filtering Junk E-mail. Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, 55-62
[26]Tretyakov, K. (2004). Machine Learning Techniques in Spam Filtering. Technical Report, Department of Computer Science, Tartu University
[27]Tsymbal, A. (2004). The Problem of Concept Drift: Definitions and Related Work. Technical Report TCD-CS-2004-15, Department of Computer Science, Trinity College Dublin
[28]Widmer, G., & Kubat, M. (1996). Learning in the Presence of Concept Drift and Hidden Contexts. Machine Learning, Vol. 23, No. 1, 69-101
[29]Yang, Y., & Pedersen, J. O. (1997). A Comparative Study on Feature Selection in Text Categorization. Proceedings of the 14th International Conference on Machine Learning (ICML-97), 412-420
Full-Text Access Authorization
On campus
Print copy to be made public 5 years after submission of the authorization form
Electronic full text authorized for on-campus public access
On-campus electronic full text to be made public 5 years after submission of the authorization form
Off campus
Authorization granted
Off-campus electronic full text to be made public 5 years after submission of the authorization form
