Electronic Theses and Dissertations Service

§ Browse Thesis Bibliographic Record

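The electronic full text of this thesis has been available off campus since 2006-06-19.
The print copy of this thesis has been publicly available since 2006-06-19.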

System ID	U0002-2905200611002600
DOI	10.6846/TKU.2006.00908
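Title (Chinese)	語意解析垃圾郵件過濾器
Title (English)	Semantic processing model for spam filter
Title (third language)
Institution	Tamkang University (淡江大學)
Department (Chinese)	資訊管理學系碩士班
Department (English)	Department of Information Management
Foreign degree: university name
Foreign degree: college name
Foreign degree: graduate institute name
Academic year (ROC calendar)	94
Semester	2
Year of publication (ROC calendar)	95
Author (Chinese)	謝文軒
Author (English)	Wen Hsuan Hsieh
Student ID	693520966
Degree	Master's
Language	Traditional Chinese
Second language
Oral defense date	2006-05-20
Pages	42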
Oral defense committee	Advisor - 梁德昭 (tcliang@mail.im.tku.edu.tw); Committee member - 梁德昭; Committee member - 魏世杰; Committee member - 連志成; Committee member - 陳安斌
Keywords (Chinese)	垃圾郵件; 語意處理; 特徵擷取
Keywords (English)	spam; feature extraction; semantic processing
Keywords (third language)
Subject classification
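Chinese abstract	As network infrastructure has matured, the number of Internet users has surged and many convenient network applications have emerged. Spam, however, is a negative example: its sheer volume and offensive content are a constant nuisance. This study aims to develop a client-side mail filtering technique that can handle both Chinese and English mail, requires no large pre-built mail blacklist, and is capable of adaptive learning, achieving high accuracy while keeping the training and classification phases fast enough for practical use in real environments. Mail filtering is similar to document classification. The first problem is how to extract features that are sufficient in both number and quality to represent a mail message; an automatic classification algorithm then uses these features to decide whether the message is spam. For feature extraction, this study filters the word-segmentation output with a part-of-speech-based stop-word filter and then applies a sliding window combined with keyword combinations to extract the surface features of spam. The classification algorithm is Bayesian classification. Because the feature extraction algorithm used in this study reaches into the semantic level, its accuracy is higher than that of keyword-based feature extraction; in the experiments, the mail filtering mechanism reached an accuracy of 92%. However, because of the semantic feature extraction procedure, its speed in both the training and classification phases is lower than that of keyword-based feature extraction.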
English abstract	In this information age, the network provides us with many convenient applications, but spam is a different matter. The huge volume of spam and its offensive content disturb people who use e-mail in daily life. This thesis develops a semantic-based, client-side spam filter that can handle mail messages in Chinese or English and does not need a large black/white list of mail to be built in advance. It is capable of adaptive learning, reaching a high precision rate while keeping the training and classification phases fast, so it can be used in a real environment. Mail filtering is similar to document classification. The first problem is how to extract enough features to represent a mail message accurately. Then, according to these features, an automatic classification algorithm decides whether the mail is spam or ham. We use a sliding window to extract features and a Bayesian algorithm for classification. Because the feature extraction method reaches into the semantic layer, its precision rate is higher than that of keyword-based feature extraction.
Abstract (third language)
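Table of contents	Contents
Chapter 1 Introduction 1
  Section 1 Research Background 1
  Section 2 Research Motivation and Objectives 3
  Section 3 Thesis Organization 3
Chapter 2 Literature Review 5
  Section 1 Document Classification Methods 5
    2.1.1 Decision tree induction 6
    2.1.2 Bayesian classification 11
    2.1.3 Bayesian belief networks 13
    2.1.4 Neural network 15
    2.1.5 K-NN (K-Nearest Neighbor) classification 16
    2.1.6 SVM (Support Vector Machine) 17
  Section 2 Document Classification Practice 17
    2.2.1 Data preparation for classification 18
    2.2.2 Document classification architecture 19
    2.2.3 Feature selection for document classification 20
  Section 3 Well-Known Mail Filtering Systems 24
Chapter 3 Semantic Processing Spam Filter 28
  Section 1 System Architecture 28
  Section 2 Feature Extraction 29
  Section 3 Mail Filtering Technique 33
Chapter 4 Implementation and Evaluation 35
  Section 1 System Implementation 35
  Section 2 Experimental Results 35
Chapter 5 Conclusions and Future Work 38
References 40
List of Figures and Tables
Figure 2.1 Decision tree 6
Figure 2.2 Bayesian belief network 14
Figure 2.3 Multilayer feedback network 15
Figure 3.1 System architecture 28
Figure 3.2 Feature extraction method for mail body text 29
Figure 3.3 Sliding Window 31
Table 1.1 CNET survey report 2
Table 1.2 Spam growth trend forecast by Radical Group 2
Table 2.1 Weather data 8
Table 2.2 Lung cancer CPT 14
Table 2.3 Part-of-speech tags of the CKIP word segmentation system and their meanings 20
Table 2.4 Summary of well-known domestic and foreign spam filtering systems 24
Table 3.1 Feature count statistics 33
Table 4.1 Test data distribution 35
Table 4.2 Experimental results of the semantic feature extraction method 36
Table 4.3 Experimental results of the control group 36
Table 5.1 Spam attack methods 38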
References	[1] About Bayesian Belief Networks, Charles River Analytics, Inc., 2004.
[2] Aron Culotta, Ron Bekkerman, and Andrew McCallum, Extracting social networks and contact information from email and the Web, MIT Spam Conference, 2004.
[3] Barry Leiba and Nathaniel Borenstein, A Multifaceted Approach to Spam Reduction, MIT Spam Conference, 2004.
[4] Brett Watson, Beyond Identity: Addressing Problems that Persist in an Electronic Mail System with Reliable Sender Identification, MIT Spam Conference, 2004.
[5] Bryan Klimt and Yiming Yang, Introducing the Enron Corpus, MIT Spam Conference, 2004.
[6] Cnet. http://taiwan.cnet.tw
[7] Eirinaios Michelakis, Ion Androutsopoulos, Georgios Paliouras, George Sakkis, and Panagiotis Stamatopoulos, Filtron: A Learning-Based Anti-Spam Filter, MIT Spam Conference, 2004.
[8] Geoff Hulten, Anthony Penta, Gopalakrishnan Seshadrinathan, and Manav Mishra, Trends in Spam Products and Methods, MIT Spam Conference, 2004.
[9] Gregory L. Wittel and S. Felix Wu, On Attacking Statistical Spam Filters, MIT Spam Conference, 2004.
[10] Hiromitsu Fujikawa, Katsuyuki Yamazaki, Fuminori Adachi, Takashi Washio, and Hiroshi Motoda, Density-Based Spam Detector, Industry/Government Track Paper, 2004.
[11] Jason Rennie, Ifile: An application of Machine Learning to E-Mail Filtering, KDD-2000 Text Mining Workshop, Boston, MA, USA.
[12] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Intelligent Database Systems Research Lab, School of Computing Science, Simon Fraser University, Canada.
[13] John Graham-Cumming, The Spammer’s Compendium, popfile.sourceforge.net, 2003.
[14] Joshua Goodman, IP Addresses in Email Clients, MIT Spam Conference, 2004.
[15] Marek J. Druzdzel, Qualitative verbal explanations in Bayesian Belief Network, Artificial Intelligence and Simulation of Behaviour Quarterly, Special issue on Bayesian belief networks, 94:43-54, 1996.
[16] Pedro Domingos, When and how to subsample, SIGKDD Explorations, 3(2), 2002.
[17] P.E.M. Huygen, Use of Bayesian Belief Networks in legal reasoning, 17th BILETA Annual Conference, 2002.
[18] Prasad, & E. K. Park, "AI-based Classification and Retrieval of Reusable Software Components," IEEE Software, 1993, pp. 519-523.
[19] Richard Clayton, Stopping Spam by Extrusion Detection, MIT Spam Conference, 2004.
[20] Richard Segal, Jason Crawford, Jeff Kephart, and Barry Leiba, SpamGuru: An Enterprise Anti-Spam Filtering System, MIT Spam Conference, 2004.
[21] Shabbir Ahmed and Farzana Mithun, Word Stemming to Enhance Spam Filtering, MIT Spam Conference, 2004.
[22] Shlomo Hershkop, Behavior based Spam Detection, MIT Spam Conference, 2004.
[23] Spam Conference 2003. http://spamconference.org/
[24] Susan Dumais, John Platt, David Heckerman, and Mehran Sahami, Inductive learning algorithms and representations for text classification, Seventh International Conference on Information and Knowledge Management.
[25] Thorsten Joachims, A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization, Proceedings of the Fourteenth International Conference on Machine Learning, 1997.
[26] T.A. Meyer and B. Whateley, SpamBayes: Effective open-source, Bayesian based, email classification system, MIT Spam Conference, 2004.
[27] William Yerazunis, Sparse Binary Polynomial Hashing and the CRM114 Discriminator, MIT Spam Conference, 2003.
[28] Yiming Yang and Xin Liu, A re-examination of text categorization methods, Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 1999.
[29] 王稔志 & 張俊盛, "適應性文件分類系統", 第十四屆計算語言學研討會, 2001.
[30] 柯淑津, 陳振南, 階層式文件自動分類之特徵選取研究, 第十二屆計算語言學研討會, pp. 137-150, 1999.
[31] 沈健誠 & 張俊盛, "多篇文件自動摘要系統", 第十四屆計算語言學研討會, 2001.
[32] 卓忠毅 & 盧祈安, Bayesian Network, Department of Computer and Information Science, Soochow University, 2003.
[33] 楊允言, 陳淑美, 陳克健, 中文文件自動分類之研究, 第六屆計算語言學研討會論文集, pp. 217-233, 1993.
[34] 顏國偉, 譚慧敏, 基於知網的常識知識標注, 中文計算語言學期刊, 第四卷第二期, pp. 39-86, 1999.
[35] 中研院CKIP小組. http://godel.iis.sinica.edu.tw/CKIP/
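Full-text access rights	On campus: print thesis publicly available immediately; electronic full text authorized for on-campus access; on-campus electronic thesis available immediately. Off campus: electronic thesis authorized for off-campus access, available immediately.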
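
Illustrative sketch (not taken from the thesis): the abstract describes extracting features with a sliding window over segmented words and classifying mail with a Bayesian algorithm. The Python below is a minimal sketch of that general idea under stated assumptions. Whitespace tokenization and the small `STOP_WORDS` set stand in for the CKIP word segmentation and part-of-speech-based stop-word filtering used in the thesis; the window size of 2, the Laplace smoothing, and the names `extract_features` and `NaiveBayesFilter` are hypothetical choices made for illustration.

```python
import math
from collections import Counter

# Hypothetical stop-word list; the thesis filters by part of speech instead.
STOP_WORDS = {"the", "a", "to", "is", "at", "for", "our", "now"}


def extract_features(text, window=2):
    """Return single words plus ordered word pairs found within the window."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    features = list(words)
    for i in range(len(words)):
        # Pair words[i] with each of the next `window` words.
        for j in range(i + 1, min(i + window + 1, len(words))):
            features.append(words[i] + "_" + words[j])
    return features


class NaiveBayesFilter:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.feature_counts = {label: Counter() for label in self.LABELS}
        self.mail_counts = Counter()

    def train(self, text, label):
        # Training is incremental: each new labeled mail just updates counts.
        self.feature_counts[label].update(extract_features(text))
        self.mail_counts[label] += 1

    def log_score(self, text, label):
        counts = self.feature_counts[label]
        total = sum(counts.values())
        vocab = len(set().union(*self.feature_counts.values())) or 1
        # Smoothed class prior, so an unseen class never yields log(0).
        prior = (self.mail_counts[label] + 1) / (
            sum(self.mail_counts.values()) + len(self.LABELS)
        )
        score = math.log(prior)
        for feature in extract_features(text):
            score += math.log((counts[feature] + 1) / (total + vocab))
        return score

    def classify(self, text):
        return max(self.LABELS, key=lambda label: self.log_score(text, label))


mail_filter = NaiveBayesFilter()
mail_filter.train("cheap pills buy now", "spam")
mail_filter.train("buy cheap watches now", "spam")
mail_filter.train("meeting agenda for monday", "ham")
mail_filter.train("project meeting notes", "ham")
print(mail_filter.classify("buy cheap pills"))
print(mail_filter.classify("monday project meeting"))
```

Because training only updates counters, newly labeled mail can be added at any time without retraining from scratch, which is one simple way to read the "adaptive learning" property mentioned in the abstract. The extra word-pair features also show why this kind of extraction costs more time than plain keyword features, as the Chinese abstract notes.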
