§ Thesis bibliographic record
  
System ID U0002-1501200716224000
DOI 10.6846/TKU.2007.00399
Title (Chinese) 漸進式網頁文件分類技術
Title (English) Progressive Analysis Scheme for Web Document Classification
Institution Tamkang University (淡江大學)
Department (Chinese) 資訊工程學系博士班
Department (English) Department of Computer Science and Information Engineering
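Academic year 95 (ROC calendar; 2006-2007)
Semester 1
Year of publication 96 (ROC calendar; 2007)
Author (Chinese) 宋立群
Author (English) Li-Chun Sung
Student ID 887190030
Degree Doctoral
Language English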
Date of oral defense 2007-01-04
Pages 85
Oral defense committee Advisor - 郭經華
Member - 劉遠楨
Member - 陳孟彰
Member - 王英宏
Member - 葛煥昭
Keywords (Chinese) 網頁探勘
網頁分類
漸進式分析
Keywords (English) Web Mining
Web Document Classification
Progressive Analysis
Abstract (translated from Chinese)
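In this thesis, we propose a progressive web document classification technique, the Progressive Analysis Scheme (PAS). Because the classifier only needs to analyze the content of a few key blocks of a document to confirm its category, classification efficiency improves.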
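In general, a web document can be divided into many small tag-regions according to its DOM structure. Each tag-region is usually rendered in the browser window with a specific visual type, which is formed by the paired HTML tags attached to that tag-region. We observe that, because of web authoring conventions, the profitability of a tag-region's content for classification tends to differ from one visual type to another. In addition, tag-regions that share the same visual type within a document can also differ in profitability, depending on the writing techniques used in the document.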
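In this thesis, we analyze a large number of web documents and, with the help of pattern recognition techniques such as EM and HMM, identify the profitability characteristics of each visual type: its profitability tendencies and its profitability transition patterns. We integrate these two characteristics into a profitability forecasting mechanism for tag-regions. During classification, this mechanism dynamically forecasts the profitability of every tag-region not yet analyzed, and the most profitable tag-regions are extracted progressively for classification until the category of the web page is confirmed.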
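To reduce the chance of wrong forecasts, the forecasting mechanism tunes itself according to the actual profitabilities of the tag-regions already analyzed. Moreover, when forecasting the profitability of a rare visual type, the mechanism also consults the profitability characteristics of similar visual types in order to obtain a more accurate forecast.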
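The idea of supporting a rare visual type with similar ones can be illustrated with a small Python sketch. It is not the thesis's method: it assumes that similarity is the Jaccard overlap of the enclosing tag sets, that each visual type is summarized by a mean profitability and a sample count, and that the smoothing constant `k` controls how quickly a type's own statistics are trusted. The names `jaccard`, `blended_forecast`, `STATS`, and `k` are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of tag names."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def blended_forecast(target, stats, k=20):
    """Forecast the profitability of `target`, leaning on similar types when rare.

    stats maps visual_type (tuple of tags) -> (mean_profitability, sample_count).
    k is an assumed smoothing constant: with n own samples, the own mean
    receives weight n / (n + k) and the neighbours receive the rest.
    """
    own_mean, own_n = stats.get(target, (0.0, 0))
    num = den = 0.0
    for visual_type, (mean, n) in stats.items():
        if visual_type == target:
            continue
        # Neighbours count more when they are similar and well sampled.
        weight = jaccard(target, visual_type) * n
        num += weight * mean
        den += weight
    neighbour_mean = num / den if den else own_mean
    trust = own_n / (own_n + k)
    return trust * own_mean + (1 - trust) * neighbour_mean


# Toy statistics (made up for illustration only).
STATS = {
    ("html", "body", "h1"): (0.8, 500),
    ("html", "body", "p"): (0.4, 1000),
    ("html", "body", "h1", "b"): (0.2, 2),   # rare visual type
}

rare = blended_forecast(("html", "body", "h1", "b"), STATS)
common = blended_forecast(("html", "body", "p"), STATS)
```

With these toy numbers the rare type's forecast is pulled toward its well-sampled neighbours, while the common type stays close to its own mean.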
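Through experiments, we show how parameter settings affect classifier performance and verify the advantage of the proposed web classification technique.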
Abstract (English)
In this thesis, we propose a web document classification scheme, the Progressive Analysis Scheme (PAS), which improves classification performance by analyzing only the few key parts of a document that suffice to confirm its category.
Based on the DOM tag-tree structure, a web document can be segmented into small tag-regions. Each tag-region is rendered with a visual type, which corresponds to a specific nested combination of tag-pairs. We observe that, because of web authoring conventions, the profitability of a tag-region for classification varies among visual types. In addition, within a single document, the profitabilities of tag-regions of the same visual type may also vary because of the writing techniques ("writing knacks") used in the document.
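The segmentation step can be sketched with Python's standard `html.parser` module. This is a minimal illustration rather than the thesis's implementation: it assumes that a visual type can be approximated by the tuple of tag names enclosing a text node (outermost first), and the names `TagRegionParser`, `segment`, and `VOID_TAGS` are hypothetical.

```python
from html.parser import HTMLParser

# Tags without a closing pair; they never enclose text, so they are skipped.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagRegionParser(HTMLParser):
    """Collects (visual_type, text) pairs, one per non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tag-pairs, outermost first
        self.regions = []    # list of (visual_type, text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Assumed definition: visual type = path of enclosing tag names.
            self.regions.append((tuple(self.stack), text))


def segment(html):
    """Return the tag-regions of an HTML string as (visual_type, text) pairs."""
    parser = TagRegionParser()
    parser.feed(html)
    parser.close()
    return parser.regions


regions = segment(
    "<html><body><h1>Title</h1><p>Some <b>bold</b> text<br>more</p></body></html>"
)
```

Here the heading, the paragraph text, and the bold span end up in three different visual types, which is the distinction a profitability model would then be trained on.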
For each visual type, we model these two kinds of profitability variation as profitability tendencies and tendency transition patterns, using the Expectation Maximization (EM) scheme and the Hidden Markov Model (HMM) scheme. For classification, we further integrate them into a profitability forecasting strategy. Using these forecasting strategies, we forecast the potential profitabilities of unanalyzed tag-regions and repeatedly extract the most profitable unanalyzed tag-regions for classification until the category is confirmed.
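The extract-until-confirmed loop can be sketched as follows. This is a toy under stated assumptions, not the PAS classifier itself: the forecast is a fixed lookup instead of an EM/HMM model, evidence is a keyword count, and the category is treated as confirmed once its share of the accumulated evidence reaches a threshold `t_confirm`. The names `progressive_classify`, `score_region`, `KEYWORDS`, `FORECAST`, and `t_confirm` are hypothetical.

```python
def progressive_classify(regions, forecast, score_region, t_confirm=0.8):
    """Analyze tag-regions in order of forecast profitability.

    regions: list of (visual_type, text) pairs.
    forecast(visual_type) -> expected profitability (assumed callback).
    score_region(text) -> dict of category -> non-negative evidence (assumed).
    Returns (category, number_of_regions_analyzed).
    """
    pending = list(range(len(regions)))
    totals = {}
    analyzed = 0
    while pending:
        # Pick the unanalyzed region with the highest forecast profitability.
        best = max(pending, key=lambda i: forecast(regions[i][0]))
        pending.remove(best)
        analyzed += 1
        for category, evidence in score_region(regions[best][1]).items():
            totals[category] = totals.get(category, 0.0) + evidence
        total = sum(totals.values())
        if total > 0:
            top = max(totals, key=totals.get)
            # Assumed confirmation rule: the leader holds enough of the evidence.
            if totals[top] / total >= t_confirm:
                return top, analyzed
    top = max(totals, key=totals.get) if totals else None
    return top, analyzed


# Toy inputs (all made up for illustration).
KEYWORDS = {"sports": {"sports", "football", "match"},
            "finance": {"stock", "market"}}


def score_region(text):
    words = set(text.lower().split())
    scores = {cat: len(words & kws) for cat, kws in KEYWORDS.items()}
    return {cat: s for cat, s in scores.items() if s > 0}


FORECAST = {("html", "head", "title"): 0.9,
            ("html", "body", "p"): 0.5,
            ("html", "body", "a"): 0.1}

regions = [(("html", "body", "p"), "football match score"),
           (("html", "head", "title"), "sports news"),
           (("html", "body", "a"), "click here")]

result = progressive_classify(regions, lambda vt: FORECAST.get(vt, 0.0), score_region)
```

In this toy run the title region is analyzed first and already confirms the category, so the remaining regions are never read, which is the efficiency gain the abstract describes.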
The forecasting strategies are optimized dynamically for the document by feeding the actual profitabilities of analyzed tag-regions back to them, so that the profitabilities of subsequent tag-regions can be forecast more accurately. In addition, for each unreliable model generated from a sparse set of training samples, we propose supporting its forecasting process with the strategies of other similar visual types.
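One standard way to realize this kind of feedback with a Hidden Markov Model is forward filtering: predict the hidden state, then correct the belief with what was actually observed. The sketch below assumes two hidden tendencies (low and high), a binary observation (0 = the analyzed tag-region turned out unprofitable, 1 = profitable), and made-up probabilities. The thesis's actual states, observations, and parameters are not given in this record, so every name and number here is hypothetical.

```python
def predict(belief, transition):
    """One-step state prediction: belief (row vector) times transition matrix."""
    n = len(belief)
    return [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]


def update(prior, emission, obs):
    """Bayes update of the state belief after observing symbol `obs`."""
    unnorm = [p * emission[state][obs] for state, p in enumerate(prior)]
    z = sum(unnorm)  # assumed non-zero: the observation is possible in some state
    return [u / z for u in unnorm]


def forecast_profitability(belief, transition, state_means):
    """Expected profitability of the next tag-region of this visual type."""
    prior = predict(belief, transition)
    return sum(p * m for p, m in zip(prior, state_means))


# Assumed model: state 0 = low tendency, state 1 = high tendency.
TRANSITION = [[0.8, 0.2],
              [0.3, 0.7]]
EMISSION = [[0.9, 0.1],   # low tendency: mostly unprofitable regions
            [0.2, 0.8]]   # high tendency: mostly profitable regions
STATE_MEANS = [0.1, 0.8]  # mean profitability under each tendency

belief = [0.5, 0.5]
before = forecast_profitability(belief, TRANSITION, STATE_MEANS)

# Feedback: the region just analyzed was actually profitable (obs = 1).
belief = update(predict(belief, TRANSITION), EMISSION, obs=1)
after = forecast_profitability(belief, TRANSITION, STATE_MEANS)
```

After a profitable observation the belief shifts toward the high tendency, so the forecast for the next tag-region of the same visual type rises; an unprofitable observation would lower it.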
Simulation results show that PAS achieves better classification performance than previous approaches, such as full-text classifiers (e.g. SVM) and sequential classifiers.
Table of contents
Chapter 1 Introduction
1.1 Traditional term-based document classification schemes
1.1.1 Vector representation of document
1.1.2 Vector comparison for document classification
1.1.3 Performance degradation by unprofitable terms
1.2 Structural characteristics of web documents
1.3 Performance improvement based on the authoring convention
1.4 Organization of the thesis
Chapter 2 Related works
2.1 Works based on hyperlink characteristics
2.2 Works based on visual characteristics
2.2.1 A single tag-region as a conspicuous block
2.2.2 A concept segment as a conspicuous block
2.2.3 Drawbacks of segment-based classifiers
Chapter 3 Issues derived from the profitability discrimination scheme based on the visual type of tag-region
3.1 The number of surrounding tag-pairs of a visual type adopted in discrimination
3.1.1 The data sparseness problem of visual types
3.2 The multiple profitability tendencies of a visual type
3.2.1 The appropriate number of tag-regions for classification
Chapter 4 The concept of progressive analysis scheme
4.1 Concept of progressive analysis
4.2 Notations of profitability and profitability tendency
4.2.1 Notation of profitability of a tag-region
4.2.2 Notation of profitability tendency of a visual type
4.3 Tendency transition patterns of a visual type
4.3.1 Observation and definition
4.3.2 Pattern identification
4.4 Profitability forecasting strategy
4.4.1 Tendency transition pattern modeling
4.4.2 Profitability forecasting process
4.4.3 Classification process based on the profitability forecasting strategy
4.4.4 Modification for active profitable tag-region discovery
4.5 Blurred visual type
4.5.1 The representation of a blurred visual type of a visual type
4.5.2 The similarity between the blurred visual types of two visual types
4.5.3 The modified profitability forecasting strategy of a visual type
Chapter 5 Performance evaluation
5.1 Experiment setting
5.2 Assumption verification
5.2.1 Assumptions about the averaged profitability of a visual type
5.2.2 Assumption about multiple profitability tendencies of a visual type
5.2.3 Featured similarity curves in PAS
5.3 Performance of PAS classifier
5.3.1 Performance measurement metrics
5.3.2 Influence of parameter setting
5.3.3 Performance comparison
Chapter 6 Conclusion
References
List of Figures

Fig. 1.1  An example of an HTML document
Fig. 2.1  An example of extracting the concept segments of a document
Fig. 3.1  The profitability distributions of two visual types whose enclosing tag-pair is an <A> tag-pair
Fig. 4.1  The iterative process of the progressive analysis strategy
Fig. 4.2  The similarity curves during progressive analysis
Fig. 4.3  Some profitability transition curves of tabular writing knack
Fig. 4.4  Some profitability transition curves of paragraph writing knack
Fig. 4.5  The algorithm for tendency transition pattern identification
Fig. 4.6  The algorithm for distance measurement between tendency sequences
Fig. 4.7  The architecture of the Profitability Forecaster & Tag-region Extractor
Fig. 4.8  The classification algorithm for PAS
Fig. 4.9  The ideal analysis sequencing of tag-regions
Fig. 4.10  The analysis sequencing of tag-regions based on the original HMM-based forecasting strategy
Fig. 4.11  The profitability distributions of two visual types which differ in tag-pair ordering
Fig. 5.1  The phenomenon of multiple profitability tendencies of a blurred visual type
Fig. 5.2  The intrinsic profitability transitions of {BODY, HTML, TABLE, TBODY, TD, TR}
Fig. 5.3  The intrinsic profitability transitions of {BODY, HTML, P}
Fig. 5.4  The similarity curves of four test web documents
Fig. 5.5  The recall curves for different Nperiod parameters
Fig. 5.6  The recall curves with and without sample supplement
Fig. 5.7  The recall curves with and without pattern inheritance
Fig. 5.8  The recall curves of five classifiers
List of Tables

Table 1.1  Formulas for term weight
Table 2.1  The class list of enclosing tag-pairs
Table 5.1  Partial list of blurred visual types with Header tag-pairs and their averaged profitabilities
Table 5.2  The list of inheritance ratios of some visual types
Table 5.3  The performance measurements for different Nperiod parameters
Table 5.4  The performance measurements with and without sample supplement
Table 5.5  The performance measurements with and without pattern inheritance
Table 5.6  The classification performance of the five classifiers
Table 5.7  The performance of the PAS classifier with different TConfirm
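Full-text access rights
On campus
Print copy available on campus immediately
Author consents to on-campus release of the full electronic text
Electronic copy available on campus immediately
Off campus
Author consents to licensing
Electronic copy available off campus immediately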
