§ Browse Thesis Bibliographic Record
  
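System ID U0002-1807201111574000
DOI 10.6846/TKU.2011.00643
Title (Chinese) 中文語意結構在垃圾信過濾的應用
Title (English) Application of Chinese Semantic in Spam Mail Filtering
Title (Third Language)
Institution Tamkang University
Department (Chinese) 統計學系碩士班 (Master's Program, Department of Statistics)
Department (English) Department of Statistics
Foreign Degree: University
Foreign Degree: College
Foreign Degree: Graduate Institute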
Academic Year 99 (ROC calendar)
Semester 2
Publication Year 100 (ROC calendar; 2011 CE)
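Author (Chinese) 陳美華
Author (English) Mei-Hua Chen
Student ID 698650248
Degree Master's
Language English
Second Language
Oral Defense Date 2010-06-16
Pages 44
Oral Defense Committee Advisor - 陳景祥
Committee Member - 李百靈
Committee Member - 歐士田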
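Keywords (Chinese) 垃圾郵件 (spam mail)
資料採礦 (data mining)
中文斷詞 (Chinese word segmentation)
機率類神經網路 (probabilistic neural network)
多層感知機 (multilayer perceptron)
C4.5
Keywords (English) spam
data mining
Chinese Word Segmentation
MLP
PNN
C4.5
Keywords (Third Language)
Subject Classification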
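Chinese Abstract
Many parties have worked to block spam: countries have enacted anti-spam laws (also called spam control acts), and software companies have developed antivirus and anti-malware software, yet even the best protection cannot block spam completely. Data mining approaches to spam detection mostly try to improve classification on the technical side, for example by refining classifiers or seeking better classification methods, and rarely address the data input. The main purpose of this thesis is to improve classification performance by improving the input data.
Three types of input variable combinations are considered. In addition to the 14 sender-behavior features and the 20 keywords selected by TF-IDF weighting, both proposed in earlier work, we add 24 semantic components (that is, the part of speech of each word) to represent how spam senders write their messages. Results verified with C4.5, the multilayer perceptron, and the probabilistic neural network show that adding the 24 semantic components as input variables performs better than using only the 14 behavior features plus the keywords.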
English Abstract
Many sectors have worked collectively to block spam mail, and although protections keep improving, the challenge remains.
This study focuses on how much information is supplied to the model; we aim to improve the classification output by improving the input variables.
We use 14 features of sender behavior and 20 keywords selected as the most effective by TF-IDF weighting. In addition, we propose 24 new semantic-component variables that capture writing habits and the differences in expression between spam e-mail senders and legitimate e-mail senders. The results show that using all variables together achieves the best results for every classifier tested: C4.5, MLP, and PNN.
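As a rough illustration of the keyword-selection step mentioned in the abstract, the sketch below ranks terms by their summed TF-IDF weight and keeps the top k. It only assumes the general technique: the function names `tfidf_scores` and `top_keywords`, the relative-frequency TF, the logarithmic IDF, and the toy English tokens are all hypothetical and are not taken from the thesis, which works on segmented Chinese e-mail.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Map each term to its TF-IDF weight summed over all documents.

    docs: list of token lists (text already segmented into words).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    scores = Counter()
    for tokens in docs:
        if not tokens:
            continue  # an empty document contributes nothing
        tf = Counter(tokens)
        for term, count in tf.items():
            idf = math.log(n_docs / df[term])  # 0 when the term is in every document
            scores[term] += (count / len(tokens)) * idf
    return dict(scores)


def top_keywords(docs, k):
    """Return the k terms with the highest summed TF-IDF weight."""
    scores = tfidf_scores(docs)
    # Sort by descending score, then alphabetically so ties are stable.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy corpus; real input would come from a word segmenter.
docs = [
    ["free", "offer", "click", "free"],
    ["meeting", "agenda", "click"],
    ["free", "prize", "offer"],
]
keywords = top_keywords(docs, 20)
```

A term that occurs in every document receives an IDF of zero, so it never ranks as a keyword; this is the property that makes TF-IDF useful for picking discriminating words.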
Third-Language Abstract
Table of Contents
Contents
Chapter 1 Introduction
1.1 Background
1.2 Purpose of Research
Chapter 2 Literature Review
2.1 Techniques for Filtering Spam
2.2 Literature Relating to the Classification for Chinese E-mails
2.3 The Architecture of E-mail
2.4 Data Mining Techniques
2.4.1 C4.5
2.4.2 Multi-Layer Perceptron (MLP)
2.4.3 Probability Neural Network (PNN)
2.4.4 TF-IDF Weighting
2.5 The Chinese Word Segmented System
Chapter 3 Research Method and Procedures
3.1 The Source of Research Data
3.2 Research Methods
3.3 Data Mining Approaches for Classifying the E-mail
3.4 Data Processing
3.5 Classification Criteria
3.6 Research Tools
Chapter 4 Application of Algorithms
4.1 Using C4.5 and Variables of Semantic Component to Classify E-mails
4.2 Using MLP and Variables of Semantic Component to Classify E-mails
4.3 Using PNN and Variables of Semantic Component to Classify E-mails
4.4 Summary of the Analysis of the Three Methods
Chapter 5 Conclusions and Discussion
Reference

List of Tables
2.1 Defined the Agents
2.2 Fields of Header
2.3 Fields of MIME
3.1 League Tables for TF-IDF
3.2 Part of Academia Sinica Balanced Corpus of Modern Chinese
3.3 Summary of all input variables type
3.4 Summary of all output variables type
3.5 The result of classification
3.6 The formula for all principles
4.1 Summary of the measures for each situation when using C4.5
4.2 Confusion matrix using C4.5
4.3 Summary of the measures for each situation when using MLP
4.4 Confusion matrix using MLP
4.5 Summary of the measures for each situation when using PNN
4.6 Confusion matrix using PNN
4.7 Summary of the Analysis of the Three Methods

List of Figures
2.1 The Process of Sending E-mail
2.2 ANN with input, hide and output
2.3 Parzen window
2.4 Organization for classification of patterns into categories
3.1 Sample file for mailparsing e-mail
3.2 The result for using Mailparsing function
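Full-Text Access Rights
On campus
Print copy available on campus immediately
Author consents to on-campus release of the electronic full text
Electronic thesis available on campus immediately
Off campus
Authorization granted
Electronic thesis available off campus immediately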
