System ID | U0002-1501200716224000 |
---|---|
DOI | 10.6846/TKU.2007.00399 |
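Thesis title (Chinese) | 漸進式網頁文件分類技術 |
Thesis title (English) | Progressive Analysis Scheme for Web Document Classification |
Thesis title (third language) | |
Institution | Tamkang University |
Department (Chinese) | 資訊工程學系博士班 |
Department (English) | Department of Computer Science and Information Engineering |
Foreign degree: university | |
Foreign degree: college | |
Foreign degree: graduate institute | |
Academic year | 95 (ROC calendar; 2006-2007) |
Semester | 1 |
Publication year | 96 (ROC calendar; 2007) |
Graduate student (Chinese) | 宋立群 |
Graduate student (English) | Li-Chun Sung |
Student ID | 887190030 |
Degree | Doctoral |
Language | English |
Second language | |
Oral defense date | 2007-01-04 |
Number of pages | 85 |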
Oral defense committee |
Advisor - 郭經華; Member - 劉遠楨; Member - 陳孟彰; Member - 王英宏; Member - 葛煥昭 |
Keywords (Chinese) |
網頁探勘; 網頁分類; 漸進式分析 |
Keywords (English) |
Web Mining; Web Document Classification; Progressive Analysis |
Keywords (third language) | |
Subject classification | |
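Chinese abstract (translated) |
In this thesis, we propose a progressive web document classification technique (PAS for short). With this technique, the classifier needs to analyze only the content of a few key regions of a document to confirm the document's category, which improves the efficiency of web page classification. In general, a web document can be segmented into many small tag-regions according to its DOM structure. Each tag-region is usually rendered in the browser window in a particular visual type, which is formed by the paired HTML tags attached to the tag-region. According to our observation, because of web authoring conventions, the profitability of a tag-region's content for classification tends to differ with its visual type. In addition, tag-regions that share the same visual type within a document can also differ in profitability because of document writing techniques. In this thesis, by analyzing a large number of web documents with the aid of pattern recognition techniques such as EM and HMM, we identify the profitability characteristics of each visual type, namely its profitability tendencies and its profitability transition patterns. We integrate these two characteristics into a tag-region profitability forecasting mechanism. During classification, this mechanism dynamically forecasts the profitability of every tag-region not yet analyzed and progressively extracts the most profitable tag-regions for classification until the document's category is confirmed. To reduce the chance of wrong forecasts, the forecasting mechanism tunes itself according to the actual profitability of the tag-regions already analyzed. Moreover, when forecasting the profitability of a rare visual type, the mechanism also consults the profitability characteristics of similar visual types in order to obtain a more accurate forecast. Through experiments, we show how parameter settings affect classifier performance and verify the superiority of the proposed classification technique. |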
English abstract |
In this thesis, we propose a web document classification scheme, the Progressive Analysis Scheme (PAS), which improves classification performance by analyzing only the few key parts of a document that suffice to confirm its category. Based on its DOM tag-tree structure, a web document can be segmented into small tag-regions. Each tag-region is rendered in a visual type, which corresponds to a specific nested combination of tag-pairs. We observe that, because of web authoring conventions, the profitability of a tag-region for classification varies with its visual type. In addition, within a single document, the profitabilities of tag-regions of the same visual type may also vary because of the document's writing knacks. For each visual type, we model these two kinds of variation as profitability tendencies and tendency transition patterns, using the Expectation Maximization (EM) scheme and the Hidden Markov Model (HMM) scheme, and integrate them into a profitability forecasting strategy. During classification, the forecasting strategies forecast the potential profitabilities of unanalyzed tag-regions, and the most profitable unanalyzed tag-regions are extracted one after another for classification until the category is confirmed. The forecasting strategies are optimized dynamically for the document by feeding back the actual profitabilities of the analyzed tag-regions, so that the profitabilities of subsequent tag-regions can be forecast more accurately. In addition, for each unreliable model generated from a sparse set of training samples, we propose supporting its forecasting process with the strategies of other similar visual types. Simulation results show that PAS achieves better classification performance than previous approaches such as full-text (e.g., SVM) and sequential classifiers. |
Third-language abstract | |
Table of contents |
Chapter 1 Introduction
1.1 Traditional term-based document classification schemes
1.1.1 Vector representation of document
1.1.2 Vector comparison for document classification
1.1.3 Performance degradation by unprofitable terms
1.2 Structural characteristics of web documents
1.3 Performance improvement based on the authoring convention
1.4 Organization of the thesis
Chapter 2 Related works
2.1 Works based on hyperlink characteristics
2.2 Works based on visual characteristics
2.2.1 A single tag-region as a conspicuous block
2.2.2 A concept segment as a conspicuous block
2.2.3 Drawbacks of segment-based classifiers
Chapter 3 Issues derived from the profitability discrimination scheme based on the visual type of tag-region
3.1 The number of surrounding tag-pairs of a visual type adopted in discrimination
3.1.1 The data sparseness problem of visual types
3.2 The multiple profitability tendencies of a visual type
3.2.1 The appropriate number of tag-regions for classification
Chapter 4 The concept of progressive analysis scheme
4.1 Concept of progressive analysis
4.2 Notations of profitability and profitability tendency
4.2.1 Notation of profitability of a tag-region
4.2.2 Notation of profitability tendency of a visual type
4.3 Tendency transition patterns of a visual type
4.3.1 Observation and definition
4.3.2 Pattern identification
4.4 Profitability forecasting strategy
4.4.1 Tendency transition pattern modeling
4.4.2 Profitability forecasting process
4.4.3 Classification process based on the profitability forecasting strategy
4.4.4 Modification for active profitable tag-region discovery
4.5 Blurred visual type
4.5.1 The representation of a blurred visual type of a visual type
4.5.2 The similarity between the blurred visual types of two visual types
4.5.3 The modified profitability forecasting strategy of a visual type
Chapter 5 Performance evaluation
5.1 Experiment setting
5.2 Assumption verification
5.2.1 Assumptions about the averaged profitability of a visual type
5.2.2 Assumption about multiple profitability tendencies of a visual type
5.2.3 Featured similarity curves in PAS
5.3 Performance of PAS classifier
5.3.1 Performance measurement metrics
5.3.2 Influence of parameter setting
5.3.3 Performance comparison
Chapter 6 Conclusion
References
List of Figures
Fig. 1.1 An example of an HTML document
Fig. 2.1 A extraction example of concept segments of a document
Fig. 3.1 The profitability distributions of two visual types whose enclosing tag-pair is a <A> tag-pair
Fig. 4.1 The iterative process of the progressive analysis strategy
Fig. 4.2 The similarity curves during progressive analysis
Fig. 4.3 Some profitability transition curves of tabular writing knack
Fig. 4.4 Some profitability transition curve of paragraph writing knack
Fig. 4.5 The algorithm for tendency transition pattern identification
Fig. 4.6 The algorithm for distance measurement between tendency sequences
Fig. 4.7 The architecture of the Profitability Forecaster & Tag-region Extractor
Fig. 4.8 The classification algorithm for PAS
Fig. 4.9 The ideal analysis sequencing of tag-regions
Fig. 4.10 The analysis sequencing of tag-regions based on the original HMM-based forecasting strategy
Fig. 4.11 The profitability distributions of two visual types which differ in tag-pair ordering
Fig. 5.1 The phenomenon of multiple profitability tendencies of a blurred visual type
Fig. 5.2 The intrinsic profitability transitions of {BODY, HTML, TABLE, TBODY, TD, TR}
Fig. 5.3 The intrinsic profitability transitions of {BODY, HTML, P}
Fig. 5.4 The similarity curves of four test web documents
Fig. 5.5 The recall curves for different Nperiod parameters
Fig. 5.6 The recall curves for whether enabling sample supplement
Fig. 5.7 The recall curves for whether enabling pattern inheritance
Fig. 5.8 The recall curves of five classifiers.
List of Tables
Table 1.1 Formulas for term weight
Table 2.1 The class list of enclosing tag-pairs
Table 5.1 Partial list of blurred visual types with Header tag-pairs and their averaged profitabilities
Table 5.2 The list of inheritance ratios of some visual types
Table 5.3 The performance measurements for different Nperiod parameters
Table 5.4 The performance measurement for whether enabling sample supplement
Table 5.5 The performance measurement for whether enabling pattern inheritance
Table 5.6 The classification performance of the five classifiers
Table 5.7 The performances of the PAS classifier with different TConfirm |
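The progressive analysis loop summarized in the abstract (forecast the profitability of unanalyzed tag-regions, analyze the most promising one, feed its actual profitability back, stop once the category is confirmed) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the cosine similarity against per-category term centroids, and the running-average feedback are hypothetical stand-ins, and the running average in particular replaces the EM/HMM-based forecasting strategy that the thesis describes but this record does not detail. The parameter `t_confirm` is loosely named after the TConfirm parameter listed in the table of contents; its exact definition in the thesis is not given here.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def progressive_classify(regions, centroids, prior, t_confirm=0.9):
    """Analyze tag-regions in order of forecast profitability.

    regions   : list of (visual_type, list_of_terms), hypothetical input format
    centroids : {category: Counter of terms}, hypothetical category models
    prior     : {visual_type: forecast profitability in [0, 1]}
    Returns (category, similarity, indices of analyzed regions).
    """
    forecast = dict(prior)   # per-document copy, adjusted by feedback
    seen = Counter()         # terms accumulated from analyzed regions
    remaining = list(range(len(regions)))
    analyzed = []
    best_cat, best_sim = None, 0.0
    while remaining:
        # Pick the unanalyzed region whose visual type has the highest forecast.
        i = max(remaining, key=lambda k: forecast.get(regions[k][0], 0.0))
        remaining.remove(i)
        analyzed.append(i)
        vtype, terms = regions[i]
        region_vec = Counter(terms)
        seen.update(region_vec)
        # Feedback (assumption): blend the forecast with the region's actual
        # profitability, taken here as its best similarity to any centroid.
        actual = max(cosine(region_vec, c) for c in centroids.values())
        forecast[vtype] = 0.5 * forecast.get(vtype, 0.0) + 0.5 * actual
        # Confirm the category once the accumulated evidence is strong enough.
        best_cat, best_sim = max(
            ((cat, cosine(seen, c)) for cat, c in centroids.items()),
            key=lambda pair: pair[1],
        )
        if best_sim >= t_confirm:
            break
    return best_cat, best_sim, analyzed


centroids = {
    "sports": Counter({"goal": 3, "match": 2, "team": 2}),
    "finance": Counter({"stock": 3, "market": 2, "bank": 2}),
}
regions = [
    ("nav", ["home", "about", "login"]),
    ("title", ["match", "goal"]),
    ("p", ["team", "goal", "match", "goal"]),
    ("footer", ["copyright", "contact"]),
]
prior = {"title": 0.9, "p": 0.6, "nav": 0.1, "footer": 0.05}
category, similarity, order = progressive_classify(regions, centroids, prior)
print(category, round(similarity, 3), order)
```

In this toy run the navigation and footer regions are never analyzed, which is the efficiency gain the abstract claims for PAS; how much is saved on real pages depends on the forecasting models and the confirmation threshold.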