Tamkang University Chueh Sheng Memorial Library (TKU Library)
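(Electronic full-text download is available only from Tamkang University IP addresses)
System ID: U0002-0606200619091500
Thesis title (Chinese): 計算語意相似度之方法及其應用於字義排歧 (A Method for Computing Semantic Similarity and Its Application to Word Sense Disambiguation)
Thesis title (English): Word Sense Disambiguation Using Semantic Relatedness Measurement
Institution: Tamkang University
Department (Chinese): 資訊工程學系博士班 (Doctoral Program, Department of Computer Science and Information Engineering)
Department (English): Department of Computer Science and Information Engineering
Academic year: 94 (ROC calendar; 2005-2006)
Semester: 2
Year of publication: 95 (ROC calendar; 2006)
Author (Chinese name): 楊哲宇
Author (English name): Che-Yu Yang
Student ID: 890190100
Degree: Doctoral
Language: English
Oral defense date: 2006-05-19
Number of pages: 104
Oral defense committee: Advisor - 施國琛
Member - 廖弘源
Member - 張志勇
Member - 石貴平
Member - 王俊嘉
Keywords (Chinese): 觀念表示  語意相似度  字義排歧  詞網  自然語言處理  資訊擷取
Keywords (English): concept representation  semantic relatedness  word sense disambiguation  Wordnet  natural language processing  information retrieval
Subject classification: Applied Sciences > Information Engineering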
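Chinese abstract: Formalizing and quantifying the intuitive notion of similarity or relatedness has long been of interest in philosophy, psychology, artificial intelligence, and other fields, and many different perspectives have been proposed. The task of determining the semantic similarity, or more generally the relatedness, of two lexically expressed concepts can be applied in many different areas.
Words in every human language can carry different meanings depending on the context in which they appear, and such words with multiple senses are potentially ambiguous. For almost all application areas related to human language, this ambiguity is often a source of error. Word sense disambiguation is the task of determining which sense a word carries in a given context.
Polysemy (a single word carrying multiple senses) and synonymy (several different words representing the same sense) are both important issues in natural language processing and artificial intelligence. In information retrieval, polysemy lowers precision, while synonymy lowers recall.
In this thesis, we propose a novel hybrid method for measuring the semantic relatedness of any two concepts, and we apply it to word sense disambiguation. In addition, we study how word sense disambiguation can be used to overcome the problems of polysemy and synonymy so as to improve the performance of information retrieval systems. The thesis not only studies concept representation, concept distribution, and semantic relatedness from a theoretical perspective, but also considers how these theories can be put to practical use in word sense disambiguation and information retrieval.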
English abstract: The problem of formalizing and quantifying the intuitive notion of similarity or relatedness has a long history in philosophy, psychology, and artificial intelligence, and many different perspectives have been suggested. Determining the degree of semantic similarity, or more generally relatedness, between two lexically expressed concepts is needed in many applications.
All human languages have words that can mean different things in different contexts; such words with multiple meanings are potentially “ambiguous”. For almost all applications of language technology, word sense ambiguity is a potential source of error. “Word Sense Disambiguation (WSD)” is the process of deciding which of a word's several meanings is intended in a given context.
Polysemy (a single word form having more than one meaning) and synonymy (multiple words having the same meaning) are both important issues in natural language processing and artificial intelligence. In information retrieval, polysemy decreases precision through false matches, whereas synonymy decreases recall by missing true conceptual matches.
In this thesis, we explore measures of semantic relatedness between word senses based on a novel hybrid approach, and we apply the resulting measure to the WSD task. In addition, we investigate how WSD can benefit information retrieval by addressing the problems of polysemy and synonymy. The research not only takes a theoretical perspective on concept representation, concept distribution, and semantic relatedness, but also considers possible applications of the proposed theory to word sense disambiguation and information retrieval.
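The abstract's claim that polysemy lowers precision and synonymy lowers recall can be illustrated with a toy example. The sketch below is not the thesis's method: the three documents, their hand-assigned sense labels, and the helper names (`keyword_search`, `precision_recall`, `sense_aware_search`) are all hypothetical, and the sense labels simply stand in for the output of a WSD step.

```python
# Toy corpus (hypothetical): each document has a text and a hand-assigned
# sense label for the ambiguous word. In a real system the label would come
# from a WSD step; here it is simply given.
DOCS = {
    "d1": ("the car needs new tires", "automobile"),
    "d2": ("the dining car was attached to the train", "railcar"),
    "d3": ("the automobile needs new tires", "automobile"),
}


def keyword_search(term, docs):
    """Return ids of documents whose text contains the exact token `term`."""
    return {doc_id for doc_id, (text, _) in docs.items() if term in text.split()}


def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for a retrieved set against a relevant set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def sense_aware_search(terms, wanted_sense, docs):
    """Expand the query with synonyms, then keep only documents with the wanted sense."""
    candidates = set()
    for term in terms:
        candidates |= keyword_search(term, docs)
    return {doc_id for doc_id in candidates if docs[doc_id][1] == wanted_sense}


# The user wants documents about automobiles.
RELEVANT = {doc_id for doc_id, (_, sense) in DOCS.items() if sense == "automobile"}

# Plain keyword match: d2 is a false match (polysemy), d3 is missed (synonymy).
plain = keyword_search("car", DOCS)
print(precision_recall(plain, RELEVANT))   # (0.5, 0.5)

# Synonym expansion recovers d3; sense filtering removes d2.
aware = sense_aware_search(["car", "automobile"], "automobile", DOCS)
print(precision_recall(aware, RELEVANT))   # (1.0, 1.0)
```

In this sketch, synonym expansion addresses the recall loss and sense filtering addresses the precision loss, which loosely mirrors the query expansion and noise filtering topics listed for Chapter 5 in the table of contents.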
Table of contents: CONTENTS

CHINESE ABSTRACT I
ABSTRACT II
CONTENTS III
LIST OF FIGURES V
LIST OF TABLES VII

CHAPTER
1. Introduction 1
2. Related Works 5
2.1 Wordnet 5
2.2 Word Sense Disambiguation based on Wordnet 11
3. Semantic Relatedness and Word Sense Disambiguation 19
3.1 Synonym set and basic set theory in modern mathematics 19
3.2 Variable Lexical Notations for a Concept 23
3.2.1 Generic Concept Notation for a Synset 28
3.2.2 Specific Concept Notation for a Synset 32
3.3 Semantic Relatedness and Word Sense Disambiguation 34
4. Evaluations 40
4.1 Experiment setup 40
4.2 Experimental Results 46
4.3 Discussion 54
5. Using WSD to Improve Internet Search 62
5.1 Background 62
5.2 Query Expansion based on Word Sense 65
5.3 Noise Filter Out Using WSD 70
5.4 Chapter Review 73
6. A Semantic based Online Question Answering System for Distance Learning 75
6.1 Background 76
6.2 The Question Answering System Architecture 81
6.3 Design of the Question Answering System 87
6.4 Chapter Review 90
7. Conclusions and Future Research 92
7.1 Conclusions 92
7.2 Future Research 96
Bibliography 98



LIST OF FIGURES

1. The intersection of Wordnet’s synonym sets (synsets) 21
2. The union of Wordnet’s synonym sets (synsets) 22
3. A “snapshot” of Wordnet hierarchy 26
4. An example “snapshot” of Wordnet hypernym/hyponym hierarchy (the nodes are synsets) 28
5. Examples of generic concept notation for a synset 31
6. The procedure for determining the semantic relatedness of two given Wordnet synsets 37
7. The WSD task using Wordnet – to generate the mappings between word forms and synsets 38
8. The procedure to find the most appropriate sense for a term 39
9. Example sentences and the tag format in Semcor 42
10(a). The precision against different i value of the generic notation on br-a01 47
10(b). The precision against different i value of the generic notation on br-b20 47
10(c). The precision against different i value of the generic notation on br-j09 48
10(d). The precision against different i value of the generic notation on br-r05 48
11(a). The precision against different sizes of context window on br-a01 48
11(b). The precision against different sizes of context window on br-b20 48
11(c). The precision against different sizes of context window on br-j09 51
11(d). The precision against different sizes of context window on br-r05 51
12. The average precision against different i value of the generic notation 49
13. The average precision (generic notation) against different sizes of context window 50
14(a). The precision against different i value of the specific notation on br-a01 51
14(b). The precision against different i value of the specific notation on br-b20 51
14(c). The precision against different i value of the specific notation on br-j09 51
14(d). The precision against different i value of the specific notation on br-r05 51
15(a). The precision against different sizes of context window on br-a01 52
15(b). The precision against different sizes of context window on br-b20 52
15(c). The precision against different sizes of context window on br-j09 52
15(d). The precision against different sizes of context window on br-r05 52
16. The average precision against different i value of the specific notation 53
17. The average precision (specific notation) against different sizes of context window 53
18. Precision against the degree of polysemy 54
19. Semantic-based interactive query expansion flow – add synonymous terms according to the intended word senses 68
20. Interactive query-expansion-enabled interface that takes given search terms as input, looks for synonyms in Wordnet, and lets the searcher assign intended senses to their search terms 69
21. Semantic filtering of the initial search results. Only web pages that contain word senses matching the intended senses assigned by the user (in the query expansion phase) are sent to the user 72
22. System architecture of improved “semantic” search engine – integrated with semantic query expansion and semantic result filtering 73
23. The typical workflow of information retrieval 78
24. Architecture of the Semantic-based Automated Question Answering 81
25. The student interface – Ask a question 87
26. The student interface – display the answers. If the answers are not satisfactory to the student, he can then press the button below to request an answer from the instructor 88
27. The instructor interface – List unanswered questions 89
28. The instructor interface – Answer a question 89
29. An interface for the instructor to manually collect Q&A sets 90

LIST OF TABLES

1. Number of words, synsets, and word-sense pairs in WordNet v2.0 6
2. Polysemy statistical information of WordNet v2.0 6
3. Polysemy average information of WordNet v2.0 7
4. Some semantic relations (links) defined in Wordnet 7
5. Statistics of semantic relations for nouns in Wordnet 2.0 8
6. Statistics of semantic relations for verbs in Wordnet 2.0 9
7. Statistics of semantic relations for adjectives in Wordnet 2.0 10
8. Statistics of semantic relations for adverbs in Wordnet 2.0 11
9. Mapping from the conception 6 to the semantic relatedness measurement 36
10. The genres of texts inventoried in Semcor 42
11. Polysemy statistical information of the four tested text files 43
12(a). The precision on br-a01 (generic notation) 47
12(b). The precision on br-b20 (generic notation) 47
12(c). The precision on br-j09 (generic notation) 47
12(d). The precision on br-r05 (generic notation) 47
13(a). The precision on br-a01 (specific notation) 50
13(b). The precision on br-b20 (specific notation) 50
13(c). The precision on br-j09 (specific notation) 51
13(d). The precision on br-r05 (specific notation) 51

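Usage permissions:
  • The author agrees to grant royalty-free authorization for the print copy to be reproduced by in-library readers for academic purposes; public from 2006-06-22.
  • The author agrees to authorize browsing and printing of the electronic full text; public from 2008-06-22.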