Tamkang University Chueh Sheng Memorial Library (TKU Library)

(The electronic full text can be downloaded only from Tamkang University IP addresses.)
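System ID U0002-1501200716224000
Title (Chinese) 漸進式網頁文件分類技術
Title (English) Progressive Analysis Scheme for Web Document Classification
Institution Tamkang University
Department (Chinese) 資訊工程學系博士班 (doctoral program)
Department (English) Department of Computer Science and Information Engineering
Academic year 95 (ROC calendar; 2006-07)
Semester 1
Year of publication 96 (ROC calendar; 2007)
Author (Chinese) 宋立群
Author (English) Li-Chun Sung
Student ID 887190030
Degree Doctoral
Language English
Oral defense date 2007-01-04
Number of pages 85
Committee Advisor: 郭經華
Member: 劉遠楨
Member: 陳孟彰
Member: 王英宏
Member: 葛煥昭
Keywords (Chinese) 網頁探勘 (web mining); 網頁分類 (web page classification); 漸進式分析 (progressive analysis)
Keywords (English) Web Mining; Web Document Classification; Progressive Analysis
Subject classification Applied Sciences: Information Engineering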
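Abstract (translated from Chinese) In this thesis, we propose a progressive web document classification technique (PAS for short). With this technique, the classifier needs to analyze only the content of a few key blocks of a document to confirm the document's category, which improves the efficiency of web page classification.
In general, a web document can be divided into many small tag-regions according to its DOM structure. Each tag-region is usually presented in the browser window in a specific visual type, which is formed by the paired HTML tags attached to the tag-region. We observe that, because of web authoring conventions, the profitability of a tag-region's content for classification tends to differ with its visual type. In addition, tag-regions that share a visual type within one document can also differ in profitability because of document writing techniques.
In this thesis, by analyzing a large number of web documents with the help of pattern recognition techniques such as EM and HMM, we identify the profitability characteristics of each visual type, namely its profitability tendencies and its profitability variation patterns. We integrate these two characteristics into a profitability forecasting mechanism for tag-regions. During classification, this mechanism dynamically forecasts the profitability of each tag-region that has not yet been analyzed, and the most profitable tag-regions are extracted progressively for classification until the category of the web page is confirmed.
To reduce the probability of wrong forecasts, the forecasting mechanism optimizes itself according to the actual profitability of the tag-regions already analyzed. Furthermore, when forecasting the profitability of a rare visual type, the mechanism also consults the profitability characteristics of similar visual types in order to obtain a more accurate forecast.
Through experiments, we show how parameter settings affect classifier performance and verify the superiority of the proposed web classification technique.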
Abstract (English) In this thesis, we propose a web document classification scheme, the Progressive Analysis Scheme (PAS), which improves classification performance by analyzing only the few key parts of a document that suffice to confirm its category.
Based on the DOM tag-tree structure, a web document can be segmented into small tag-regions. Each tag-region is rendered in a visual type, which corresponds to a specific nested combination of tag-pairs. We observe that, owing to web authoring conventions, the profitability of a tag-region for classification varies among visual types. In addition, within a single document, the profitabilities of tag-regions of the same visual type may also vary, owing to document writing knacks.
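The segmentation described above can be illustrated with a minimal sketch. It assumes that a tag-region is a run of text and that its visual type is the ordered tuple of currently open tags enclosing it. The names `TagRegionParser`, `segment`, and `VOID_TAGS` are hypothetical, and the thesis's own segmentation rules may differ.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they are not treated as tag-pairs here.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagRegionParser(HTMLParser):
    """Collects (visual_type, text) pairs, one per non-empty text run."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tag-pairs, outermost first
        self.regions = []  # list of (tuple_of_enclosing_tags, text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag, which also
            # discards any unclosed tags nested inside it.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Assumption: the visual type is the nested combination of
            # enclosing tag-pairs at the point where the text occurs.
            self.regions.append((tuple(self.stack), text))


def segment(html):
    """Return the tag-regions of an HTML string in document order."""
    parser = TagRegionParser()
    parser.feed(html)
    parser.close()
    return parser.regions
```

For example, `segment("<html><body><h1>Title</h1><p>Hello</p></body></html>")` yields one region of visual type `("html", "body", "h1")` and one of type `("html", "body", "p")`.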
For each visual type, we model these two kinds of profitability variation as profitability tendencies and tendency transition patterns, using the Expectation Maximization (EM) scheme and the Hidden Markov Model (HMM) scheme. For classification, we further integrate them into a profitability forecasting strategy. Using these strategies, we forecast the potential profitabilities of unanalyzed tag-regions and repeatedly extract the most profitable unanalyzed tag-regions for classification until the category is confirmed.
During classification, the forecasting strategies are optimized dynamically for the document by feeding back the actual profitabilities of the analyzed tag-regions, so that the profitabilities of subsequent tag-regions can be forecast more accurately. In addition, for each unreliable model generated from a sparse set of training samples, we propose supporting its forecasting process with the strategies of other, similar visual types.
Through simulations, we show that PAS achieves better classification performance than previous approaches, such as full-text classifiers (e.g. SVM) and sequential classifiers.
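The progressive loop summarized in the abstract (forecast, extract the most promising tag-region, classify, feed back, stop on confirmation) can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's method: a plain per-visual-type score with exponential smoothing stands in for the EM/HMM-based forecasting strategies, cosine similarity to term-count centroids stands in for the classifier, and the names `classify_progressively`, `t_confirm`, and `rate` are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def classify_progressively(regions, centroids, forecast, t_confirm=0.8, rate=0.5):
    """Analyze tag-regions in order of forecast profitability.

    regions:   list of (visual_type, text) pairs
    centroids: {category: Counter of term counts}
    forecast:  {visual_type: prior profitability in [0, 1]} (assumed given)
    Returns (best_category, similarity, number_of_regions_analyzed).
    """
    forecast = dict(forecast)  # local copy, adapted to this document
    pending = list(regions)
    seen = Counter()           # terms accumulated from analyzed regions
    best, score, analyzed = None, 0.0, 0
    while pending:
        # Extract the unanalyzed region with the highest forecast.
        idx = max(range(len(pending)),
                  key=lambda i: forecast.get(pending[i][0], 0.0))
        vtype, text = pending.pop(idx)
        terms = Counter(text.lower().split())
        seen.update(terms)
        analyzed += 1
        # Feedback: here the "actual profitability" is assumed to be the
        # region's best similarity to any category centroid.
        actual = max(cosine(terms, c) for c in centroids.values())
        forecast[vtype] = (1 - rate) * forecast.get(vtype, 0.0) + rate * actual
        # Stop as soon as one category is confirmed.
        best, score = max(((cat, cosine(seen, c)) for cat, c in centroids.items()),
                          key=lambda pair: pair[1])
        if score >= t_confirm:
            break
    return best, score, analyzed
```

With a high prior for heading regions and a low prior for paragraph regions, a document whose heading already matches one centroid closely is classified after a single region, which is the efficiency gain the abstract describes.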
Table of contents Chapter 1 Introduction
1.1 Traditional term-based document classification schemes
1.1.1 Vector representation of document
1.1.2 Vector comparison for document classification
1.1.3 Performance degradation by unprofitable terms
1.2 Structural characteristics of web documents
1.3 Performance improvement based on the authoring convention
1.4 Organization of the thesis
Chapter 2 Related works
2.1 Works based on hyperlink characteristics
2.2 Works based on visual characteristics
2.2.1 A single tag-region as a conspicuous block
2.2.2 A concept segment as a conspicuous block
2.2.3 Drawbacks of segment-based classifiers
Chapter 3 Issues derived from the profitability discrimination scheme based on the visual type of tag-region
3.1 The number of surrounding tag-pairs of a visual type adopted in discrimination
3.1.1 The data sparseness problem of visual types
3.2 The multiple profitability tendencies of a visual type
3.2.1 The appropriate number of tag-regions for classification
Chapter 4 The concept of progressive analysis scheme
4.1 Concept of progressive analysis
4.2 Notations of profitability and profitability tendency
4.2.1 Notation of profitability of a tag-region
4.2.2 Notation of profitability tendency of a visual type
4.3 Tendency transition patterns of a visual type
4.3.1 Observation and definition
4.3.2 Pattern identification
4.4 Profitability forecasting strategy
4.4.1 Tendency transition pattern modeling
4.4.2 Profitability forecasting process
4.4.3 Classification process based on the profitability forecasting strategy
4.4.4 Modification for active profitable tag-region discovery
4.5 Blurred visual type
4.5.1 The representation of a blurred visual type of a visual type
4.5.2 The similarity between the blurred visual types of two visual types
4.5.3 The modified profitability forecasting strategy of a visual type
Chapter 5 Performance evaluation
5.1 Experiment setting
5.2 Assumption verification
5.2.1 Assumptions about the averaged profitability of a visual type
5.2.2 Assumption about multiple profitability tendencies of a visual type
5.2.3 Featured similarity curves in PAS
5.3 Performance of PAS classifier
5.3.1 Performance measurement metrics
5.3.2 Influence of parameter setting
5.3.3 Performance comparison
Chapter 6 Conclusion
References
List of Figures

Fig. 1.1 An example of an HTML document
Fig. 2.1 An extraction example of concept segments of a document
Fig. 3.1 The profitability distributions of two visual types whose enclosing tag-pair is a tag-pair
Fig. 4.1 The iterative process of the progressive analysis strategy
Fig. 4.2 The similarity curves during progressive analysis
Fig. 4.3 Some profitability transition curves of tabular writing knack
Fig. 4.4 Some profitability transition curves of paragraph writing knack
Fig. 4.5 The algorithm for tendency transition pattern identification
Fig. 4.6 The algorithm for distance measurement between tendency sequences
Fig. 4.7 The architecture of the Profitability Forecaster & Tag-region Extractor
Fig. 4.8 The classification algorithm for PAS
Fig. 4.9 The ideal analysis sequencing of tag-regions
Fig. 4.10 The analysis sequencing of tag-regions based on the original HMM-based forecasting strategy
Fig. 4.11 The profitability distributions of two visual types which differ in tag-pair ordering
Fig. 5.1 The phenomenon of multiple profitability tendencies of a blurred visual type
Fig. 5.2 The intrinsic profitability transitions of {BODY, HTML, TABLE, TBODY, TD, TR}
Fig. 5.3 The intrinsic profitability transitions of {BODY, HTML, P}
Fig. 5.4 The similarity curves of four test web documents
Fig. 5.5 The recall curves for different Nperiod parameters
Fig. 5.6 The recall curves with and without sample supplement
Fig. 5.7 The recall curves with and without pattern inheritance
Fig. 5.8 The recall curves of five classifiers
List of Tables

Table 1.1 Formulas for term weight
Table 2.1 The class list of enclosing tag-pairs
Table 5.1 Partial list of blurred visual types with Header tag-pairs and their averaged profitabilities
Table 5.2 The list of inheritance ratios of some visual types
Table 5.3 The performance measurements for different Nperiod parameters
Table 5.4 The performance measurements with and without sample supplement
Table 5.5 The performance measurements with and without pattern inheritance
Table 5.6 The classification performance of the five classifiers
Table 5.7 The performance of the PAS classifier with different TConfirm

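Thesis usage rights
  • The print copy is licensed royalty-free for reproduction by library patrons for academic purposes, available from 2007-02-06.
  • The electronic full text is licensed for the browsing and printing service, available from 2007-02-06.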

