System ID | U0002-0606200619091500 |
---|---|
DOI | 10.6846/TKU.2006.00087 |
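Thesis title (Chinese) | 計算語意相似度之方法及其應用於字義排歧 (A Method for Computing Semantic Similarity and Its Application to Word Sense Disambiguation) |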
Thesis title (English) | Word Sense Disambiguation Using Semantic Relatedness Measurement |
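Thesis title (third language) | |
Institution | Tamkang University (淡江大學) |
Department (Chinese) | 資訊工程學系博士班 (Doctoral Program, Department of Computer Science and Information Engineering) |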
Department (English) | Department of Computer Science and Information Engineering |
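Foreign degree: university | |
Foreign degree: college | |
Foreign degree: graduate institute | |
Academic year | 94 (ROC calendar; 2005-2006) |
Semester | 2 |
Publication year | 95 (ROC calendar; 2006) |
Graduate student (Chinese) | 楊哲宇 |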
Graduate student (English) | Che-Yu Yang |
Student ID | 890190100 |
Degree | Doctoral |
Language | English |
Second language | |
Oral defense date | 2006-05-19 |
Number of pages | 104 |
Oral defense committee |
Advisor - 施國琛 (tshih@cs.tku.edu.tw); Member - 廖弘源; Member - 張志勇; Member - 石貴平; Member - 王俊嘉 |
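Keywords (Chinese) |
concept representation; semantic similarity; word sense disambiguation; WordNet; natural language processing; information retrieval |
Keywords (English) |
concept representation; semantic relatedness; word sense disambiguation; Wordnet; natural language processing; information retrieval |
Keywords (third language) | |
Subject classification | |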
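Abstract (Chinese) |
Formalizing and quantifying the intuitive notion of similarity or relatedness has long been of interest in philosophy, psychology, artificial intelligence, and other fields, and many different perspectives have been proposed. The task of determining the semantic similarity, or more generally the relatedness, of two lexically expressed concepts can be applied in many different areas. Words in every human language can carry different meanings in different contexts, and words with multiple senses are potentially ambiguous. For almost every application involving human language, this ambiguity is often a source of error. Word sense disambiguation is the task of deciding which sense a word carries in a given context. Polysemy (one word carrying multiple senses) and synonymy (several different words carrying the same sense) are both important issues in natural language processing and artificial intelligence. In information retrieval, polysemy lowers precision, while synonymy lowers recall. In this thesis, we propose a novel hybrid method for measuring the semantic relatedness of any two concepts and apply it to word sense disambiguation. We also study how word sense disambiguation can overcome the problems of polysemy and synonymy to improve the performance of information retrieval systems. The thesis not only studies concept representation, concept distribution, and semantic relatedness from a theoretical standpoint, but also considers how these theories can be put to practical use in word sense disambiguation and information retrieval. |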
Abstract (English) |
The problem of formalizing and quantifying the intuitive notion of similarity or relatedness has a long history in philosophy, psychology, and artificial intelligence, and many different perspectives have been suggested. Determining the degree of semantic similarity, or more generally relatedness, between two lexically expressed concepts is needed in many applications. All human languages have words that can mean different things in different contexts; such words with multiple meanings are potentially “ambiguous”. For almost all applications of language technology, word sense ambiguity is a potential source of error. “Word Sense Disambiguation (WSD)” is the process of deciding which of a word's several meanings is intended in a given context. Polysemy (a single word form having more than one meaning) and synonymy (multiple words having the same meaning) are both important issues in natural language processing and artificial intelligence. In information retrieval, polysemy decreases precision by producing false matches, while synonymy decreases recall by missing true conceptual matches. In this thesis, we explore measures of semantic relatedness between word senses based on a novel hybrid approach, and we apply the resulting measure to the WSD task. We also investigate how WSD can benefit information retrieval by addressing the problems of polysemy and synonymy. The research not only takes a theoretical perspective on concept representation, concept distribution, and semantic relatedness, but also considers possible applications of the proposed theory to word sense disambiguation and information retrieval. |
Abstract (third language) | |
Table of contents |
CONTENTS
CHINESE ABSTRACT
ABSTRACT
CONTENTS
LIST OF FIGURES
LIST OF TABLES
CHAPTER
1. Introduction
2. Related Works
2.1 Wordnet
2.2 Word Sense Disambiguation based on Wordnet
3. Semantic Relatedness and Word Sense Disambiguation
3.1 Synonym set and basic set theory in modern mathematics
3.2 Variable Lexical Notations for a Concept
3.2.1 Generic Concept Notation for a Synset
3.2.2 Specific Concept Notation for a Synset
3.3 Semantic Relatedness and Word Sense Disambiguation
4. Evaluations
4.1 Experiment setup
4.2 Experimental Results
4.3 Discussion
5. Using WSD to Improve Internet Search
5.1 Background
5.2 Query Expansion based on Word Sense
5.3 Noise Filter Out Using WSD
5.4 Chapter Review
6. A Semantic based Online Question Answering System for Distance Learning
6.1 Background
6.2 The Question Answering System Architecture
6.3 Design of the Question Answering System
6.4 Chapter Review
7. Conclusions and Future Research
7.1 Conclusions
7.2 Future Research
Bibliography

LIST OF FIGURES
1. The intersection of Wordnet’s synonym sets (synsets)
2. The union of Wordnet’s synonym sets (synsets)
3. A “snapshot” of Wordnet hierarchy
4. An example “snapshot” of Wordnet hypernym/hyponym hierarchy (the nodes are synsets)
5. Examples of generic concept notation for a synset
6. The procedure for determining the semantic relatedness of two given Wordnet synsets
7. The WSD task using Wordnet – to generate the mappings between word forms and synsets
8. The procedure to find the most appropriate sense for a term
9. Example sentences and the tag format in Semcor
10(a). The precision against different i value of the generic notation on br-a01
10(b). The precision against different i value of the generic notation on br-b20
10(c). The precision against different i value of the generic notation on br-j09
10(d). The precision against different i value of the generic notation on br-r05
11(a). The precision against different sizes of context window on br-a01
11(b). The precision against different sizes of context window on br-b20
11(c). The precision against different sizes of context window on br-j09
11(d). The precision against different sizes of context window on br-r05
12. The average precision against different i value of the generic notation
13. The average precision (generic notation) against different sizes of context window
14(a). The precision against different i value of the specific notation on br-a01
14(b). The precision against different i value of the specific notation on br-b20
14(c). The precision against different i value of the specific notation on br-j09
14(d). The precision against different i value of the specific notation on br-r05
15(a). The precision against different sizes of context window on br-a01
15(b). The precision against different sizes of context window on br-b20
15(c). The precision against different sizes of context window on br-j09
15(d). The precision against different sizes of context window on br-r05
16. The average precision against different i value of the specific notation
17. The average precision (specific notation) against different sizes of context window
18. Precision against the degree of polysemy
19. Semantic-based interactive query expansion flow – add synonymy terms according to the intended word senses
20. Interactive query-expansion enabled interface that takes given search terms as input, looks for synonyms in the Wordnet, and let searcher to assign intended senses to their search terms
21. Semantic filtering to the initial search results. Only web pages that contain the word senses matched to the intended senses assigned by user (in the query expansion phrase) are sent to the user
22. System architecture of improved “semantic” search engine – integrated with semantic query expansion and semantic result filtering
23. The typical workflow of information retrieval
24. Architecture of the Semantic-based Automated Question Answering
25. The student interface – Ask a question
26. The student interface – display the answers. If the answers are not satisfactory to the student, he can then press the below button demanding for the answer from the instructor
27. The instructor interface – List unanswered-questions
28. The instructor interface – Answer a question
29. An interface for the instructor to manually collect Q&A sets

LIST OF TABLES
1. Number of words, synsets, and word-sense pairs in WordNet v2.0
2. Polysemy statistical information of WordNet v2.0
3. Polysemy average information of WordNet v2.0
4. Some semantic relations (links) defined in Wordnet
5. Statistics of semantic relations for nouns in Wordnet 2.0
6. Statistics of semantic relations for verbs in Wordnet 2.0
7. Statistics of semantic relations for adjectives in Wordnet 2.0
8. Statistics of semantic relations for adverbs in Wordnet 2.0
9. Mapping from the conception 6 to the semantic relatedness measurement
10. The genders of texts inventoried in Semcor
11. Polysemy statistical information of the four tested text files
12(a). The precision on br-a01 (generic notation)
12(b). The precision on br-b20 (generic notation)
12(c). The precision on br-j09 (generic notation)
12(d). The precision on br-r05 (generic notation)
13(a). The precision on br-a01 (specific notation)
13(b). The precision on br-b20 (specific notation)
13(c). The precision on br-j09 (specific notation)
13(d). The precision on br-r05 (specific notation) |
Full-text access rights |