§ Browse thesis bibliographic record
  
System ID U0002-0606200619091500
DOI 10.6846/TKU.2006.00087
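Title (Chinese) 計算語意相似度之方法及其應用於字義排歧
Title (English) Word Sense Disambiguation Using Semantic Relatedness Measurement
Title (third language)
Institution Tamkang University (淡江大學)
Department (Chinese) 資訊工程學系博士班
Department (English) Department of Computer Science and Information Engineering
Foreign degree university
Foreign degree college
Foreign degree graduate institute
Academic year 94 (ROC calendar)
Semester 2
Publication year 95 (ROC calendar; 2006)
Author (Chinese) 楊哲宇
Author (English) Che-Yu Yang
Student ID 890190100
Degree Doctoral
Language English
Second language
Oral defense date 2006-05-19
Number of pages 104
Oral defense committee Advisor - 施國琛 (tshih@cs.tku.edu.tw)
Member - 廖弘源
Member - 張志勇
Member - 石貴平
Member - 王俊嘉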
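Keywords (Chinese) 觀念表示 (concept representation)
語意相似度 (semantic relatedness)
字義排歧 (word sense disambiguation)
詞網 (WordNet)
自然語言處理 (natural language processing)
資訊擷取 (information retrieval)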
Keywords (English) concept representation
semantic relatedness
word sense disambiguation
Wordnet
natural language processing
information retrieval
Keywords (third language)
Subject classification
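Chinese abstract (English translation)
Formalizing and quantifying the intuitive notion of similarity or relatedness has been a long-standing interest in philosophy, psychology, artificial intelligence, and other fields, and many different perspectives have been proposed. The task of determining the semantic similarity, or more generally the relatedness, between two lexically expressed concepts can be applied in many different areas.
Any word in a human language may carry different meanings in different contexts, and words with multiple senses are potentially ambiguous. For almost all applications related to human language, this ambiguity is often a source of error. Word sense disambiguation is the task of determining which sense a word carries in a given context.
Polysemy (a single word with multiple senses) and synonymy (multiple different words that share the same sense) are important issues in natural language processing and in fields related to artificial intelligence. In information retrieval, polysemy lowers precision, while synonymy lowers recall.
In this thesis, we propose a novel hybrid method for measuring the semantic relatedness of any two concepts, and we apply it to the task of word sense disambiguation. We also study how word sense disambiguation can be used to overcome the problems of polysemy and synonymy and thereby improve the performance of information retrieval systems. The thesis studies concept representation, concept distribution, and semantic relatedness from a theoretical perspective, and also considers how these theories can be put to practical use in word sense disambiguation and information retrieval.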
English abstract
The problem of formalizing and quantifying the intuitive notion of similarity or relatedness has a long history in philosophy, psychology, and artificial intelligence, and many different perspectives have been suggested. Determining the degree of semantic similarity, or more generally relatedness, between two lexically expressed concepts is useful in many applications.
All human languages have words that can mean different things in different contexts; such words with multiple meanings are potentially "ambiguous". For almost all applications of language technology, word sense ambiguity is a potential source of error. Word sense disambiguation (WSD) is the process of deciding which of a word's several meanings is intended in a given context.
Polysemy (a single word form having more than one meaning) and synonymy (multiple words having the same meaning) are both important issues in natural language processing and related areas of artificial intelligence. In information retrieval, polysemy decreases precision by producing false matches, whereas synonymy decreases recall by missing true conceptual matches.
In this thesis, we explore measures of semantic relatedness between word senses based on a novel hybrid approach, and we apply the resulting measure to the WSD task. We also investigate how WSD can benefit information retrieval by addressing the problems of polysemy and synonymy. The research not only takes a theoretical perspective on concept representation, concept distribution, and semantic relatedness, but also considers practical applications of the proposed theory to word sense disambiguation and information retrieval.
Abstract (third language)
Table of contents
CONTENTS

CHINESE ABSTRACT	I
ABSTRACT	II
CONTENTS	III
LIST OF FIGURES	V
LIST OF TABLES	VII

CHAPTER
1. Introduction	1
2. Related Works	5
2.1 Wordnet	5
2.2 Word Sense Disambiguation based on Wordnet	11
3. Semantic Relatedness and Word Sense Disambiguation	19
3.1 Synonym set and basic set theory in modern mathematics	19
3.2 Variable Lexical Notations for a Concept	23
3.2.1 Generic Concept Notation for a Synset	28
3.2.2 Specific Concept Notation for a Synset	32
3.3 Semantic Relatedness and Word Sense Disambiguation	34
4. Evaluations	40
4.1 Experiment setup	40
4.2 Experimental Results	46
4.3 Discussion	54
5. Using WSD to Improve Internet Search	62
5.1 Background	62
5.2 Query Expansion based on Word Sense	65
5.3 Noise Filter Out Using WSD	70
5.4 Chapter Review	73
6. A Semantic based Online Question Answering System for Distance Learning	75
6.1 Background	76
6.2 The Question Answering System Architecture	81
6.3 Design of the Question Answering System	87
6.4 Chapter Review	90
7. Conclusions and Future Research	92
7.1 Conclusions	92
7.2 Future Research	96
Bibliography	98


 
LIST OF FIGURES

1. The intersection of Wordnet’s synonym sets (synsets)	21
2. The union of Wordnet’s synonym sets (synsets)	22
3. A “snapshot” of Wordnet hierarchy	26
4. An example “snapshot” of Wordnet hypernym/hyponym hierarchy (the nodes are synsets)	28
5. Examples of generic concept notation for a synset	31
6. The procedure for determining the semantic relatedness of two given Wordnet synsets	37
7. The WSD task using Wordnet – to generate the mappings between word forms and synsets	38
8. The procedure to find the most appropriate sense for a term	39
9. Example sentences and the tag format in Semcor	42
10(a). The precision against different i value of the generic notation on br-a01	47
10(b). The precision against different i value of the generic notation on br-b20	47
10(c). The precision against different i value of the generic notation on br-j09	48
10(d). The precision against different i value of the generic notation on br-r05	48
11(a). The precision against different sizes of context window on br-a01	48
11(b). The precision against different sizes of context window on br-b20	48
11(c). The precision against different sizes of context window on br-j09	51
11(d). The precision against different sizes of context window on br-r05	51
12. The average precision against different i value of the generic notation	49
13. The average precision (generic notation) against different sizes of context window	50
14(a). The precision against different i value of the specific notation on br-a01	51
14(b). The precision against different i value of the specific notation on br-b20	51
14(c). The precision against different i value of the specific notation on br-j09	51
14(d). The precision against different i value of the specific notation on br-r05	51
15(a). The precision against different sizes of context window on br-a01	52
15(b). The precision against different sizes of context window on br-b20	52
15(c). The precision against different sizes of context window on br-j09	52
15(d). The precision against different sizes of context window on br-r05	52
16. The average precision against different i value of the specific notation	53
17. The average precision (specific notation) against different sizes of context window	53
18. Precision against the degree of polysemy	54
19. Semantic-based interactive query expansion flow – add synonymy terms according to the intended word senses	68
20. Interactive query-expansion enabled interface that takes given search terms as input, looks for synonyms in the Wordnet, and let searcher to assign intended senses to their search terms	69
21. Semantic filtering to the initial search results. Only web pages that contain the word senses matched to the intended senses assigned by user (in the query expansion phrase) are sent to the user	72
22. System architecture of improved “semantic” search engine – integrated with semantic query expansion and semantic result filtering	73
23. The typical workflow of information retrieval	78
24. Architecture of the Semantic-based Automated Question Answering	81
25. The student interface – Ask a question	87
26. The student interface – display the answers. If the answers are not satisfactory to the student, he can then press the below button demanding for the answer from the instructor	88
27. The instructor interface – List unanswered-questions	89
28. The instructor interface – Answer a question	89
29. An interface for the instructor to manually collect Q&A sets	90
 
LIST OF TABLES

1. Number of words, synsets, and word-sense pairs in WordNet v2.0	6
2. Polysemy statistical information of WordNet v2.0	6
3. Polysemy average information of WordNet v2.0	7
4. Some semantic relations (links) defined in Wordnet	7
5. Statistics of semantic relations for nouns in Wordnet 2.0	8
6. Statistics of semantic relations for verbs in Wordnet 2.0	9
7. Statistics of semantic relations for adjectives in Wordnet 2.0	10
8. Statistics of semantic relations for adverbs in Wordnet 2.0	11
9. Mapping from the conception 6 to the semantic relatedness measurement	36
10. The genders of texts inventoried in Semcor	42
11. Polysemy statistical information of the four tested text files	43
12(a). The precision on br-a01 (generic notation)	47
12(b). The precision on br-b20 (generic notation)	47
12(c). The precision on br-j09 (generic notation)	47
12(d). The precision on br-r05 (generic notation)	47
13(a). The precision on br-a01 (specific notation)	50
13(b). The precision on br-b20 (specific notation)	50
13(c). The precision on br-j09 (specific notation)	51
13(d). The precision on br-r05 (specific notation)	51
References
Agirre, E. and Rigau, G., 1996. Word Sense Disambiguation using Conceptual Density. In Proceedings of the 16th International Conference on Computational Linguistics (Coling'96), pages 16--22. Copenhagen, Denmark.

Attardi, G., Cisternino, A., Formica, F., Simi, M. and Tommasi, A., 2001. In: Proceedings of TREC-9 Conference, NIST, pp 633-641, 2001.

Baeza-Yates, R. and Ribeiro-Neto, B., 1999. Modern Information Retrieval, Addison-Wesley, 1999.

Banerjee, S. and Pedersen, T., 2002. An adapted Lesk algorithm for word sense disambiguation using WordNet. In Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City, 2002, pp. 136–145.

Bates, M. J., 1986. “Subject access in online catalogs: A design model.” Journal of the American Society for Information Science, 37, 357-376. 1986

Bradshaw, Scheinkman, and Hammond, 2000. “Guiding People to Information: Providing an Interface to a Digital Library Using Reference as a Basis for Indexing.” In Proceedings of the Fourth International Conference on Intelligent User Interfaces, New Orleans, LA, January 9-12, 2000.

Budanitsky, A. and Hirst, G., 2001. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. Workshop on WordNet and Other Lexical Resources, Second meeting of the North American Chapter of the Association for Computational Linguistics, Pittsburgh, June 2001.

Burke, R. D., Hammond, K. J., Kulyukin, V. A., Lytinen, S. L., Tomuro, N. and Schoenberg, S., 1997. Questions answering from frequently-asked question files: Experiences with the FAQ Finder System. The University of Chicago, Computer Science Department, TR-97-05.

Chua, S. and Kulathuramaiyer, N., 2004. Semantic Feature Selection Using WordNet. In Proceedings of IEEE/WIC/ACM International Conference on Web Intelligence (WI 2004), page(s): 166 – 172, 2004.

Clarke, C.L.A., Cormack, G.G.,  Kisman, D.I.E. and Lynam, K., 2000. “Question Answering by Passage Selection”, TREC-9, (2000).

Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harshman, R., 1990. “Indexing by latent semantic analysis.” Journal of the American Society for Information Science, 41:391–407, 1990.

Fellbaum, C., 1998. WordNet: An Electronic Lexical Database, MIT Press, Cambridge, Mass.

Fidel, R., 1985. “Individual variability in online searching behavior.” In C.A. Parkhurst (Ed.). ASIS'85: Proceedings of the ASIS 48th Annual Meeting, Vol. 22, October 20-24, 1985, 6972.

Fuhr, N. and Buckley, C., 1991. “A probabilistic learning approach for document indexing.” ACM Transactions on Information Systems, 9(3):223–248, 1991.

Furnas, G., Landauer, T., Gomez, L. and Dumais, S., 1987. “The Vocabulary Problem in Human-Systems Communication”, Communications of the ACM, 30 (11), 1987, pp. 964-971.

Google. http://www.google.com

Guarino, N., 1999. OntoSeek: Content-Based Access to the Web. IEEE Intelligent Systems, pp 70-80, 1999.

Harabagiu, S. and Moldovan, D. et al., 2000. “FALCON: Boosting knowledge for answer engines”, TREC-9, (2000).

Heenan, C. H., 2002. “A review of Academic Research on Information Retrieval”, Available at: http://eil.stanford.edu/publications/charles_heenan/AcademicInfoRetrievalResearch.pdf

Hirst, G. and St-Onge, D., 1998. Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms. in WordNet: An electronic lexical database, Christiane Fellbaum (editor), Cambridge, MA: The MIT Press, 1998.

Hovy, E., Gerber, L., Hermjakob, U., Junk, M. and Liu C.Y., 2001a. Question Answering in Webclopedia. In: Proceedings of TREC-9 Conference, NIST, 2001.

Hovy, E., Gerber, L., Hermjakob, U., Liu, C.Y. and Ravichandran, D., 2001b. Toward Semantics-Based Answer Pinpointing. In: Proceedings of DARPA Human Language Technology conference (HLT), 2001.

Hull, D.A., 2000. Xerox TREC-8 Question Answering Track Report, TREC-8, 1999.

Jiang, J.J. and Conrath, D.W., 1997. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. In Proceedings of ROCLING X (1997) International Conference on Research in Computational Linguistics, Taiwan, 1997.

Kamps, J., 2004. Improving retrieval effectiveness by reranking documents based on controlled vocabulary. In: S. McDonald and J. Tait, editors, Advances in Information Retrieval: 26th European Conference on IR Research (ECIR 2004), LNCS 2997, p. 283-295. Springer-Verlag.

Kilgarriff, A., 1998. I don't believe in word senses. Computers and the Humanities, 31(2):91-113

Kim, S. B., Seo, H. C. and Rim, H. C., 2004. Information retrieval using word senses: root sense tagging approach. In Proceedings of the 27th annual international conference on Research and development in information retrieval (SIGIR '04), Sheffield, United Kingdom, July 25 - 29, 2004.

Kupiec, J., 1993. MURAX: a robust linguistic approach for question answering using an on-line encyclopedia. In: Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval, 181-190. ACM Press.

Lawrence, S. and Giles, C., 1998. “Searching the world wide web.” Science 280, 98–100, 1998.

Leacock, C. and Chodorow, M., 1998. Combining local context with WordNet similarity for word sense identification. In Christiane Fellbaum, editor, WordNet: A Lexical Reference System and its Application. MIT Press, Cambridge, MA.

Lee, J.H., Kim, M.H. and Lee, Y.I., 1993. Information Retrieval based on conceptual distance in IS-A hierarchies. Journal of Documentation, 49(2), June 1993, pp. 188-207

Lesk, M., 1986. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the 5th annual international conference on Systems documentation, ACM Press, 1986, pp. 24–26.

Li, H. and Li, C., 2004. Word translation disambiguation using bilingual bootstrapping. Computational Linguistics, Vol. 30, Issue 1, pages 1-22, March 2004.

Lin, D., 1998. An information-theoretic definition of similarity. In Proceedings of the International Conference on Machine Learning.

Mano, H. and Ogawa, Y., 2001. “Selecting Expansion Terms in Automatic Query Expansion”, In Proceedings of SIGIR ’01, ACM Press: New Orleans, LA, USA, 2001, pp. 390-391.

Miller, George A., 1995. WordNet: A Lexical Database. Comm. ACM, Vol. 38, No. 11, 1995, pp. 39-41.

Miller, George A. and Beckwith, R., 1993. Introduction to WordNet: An On-line Lexical Database. Revised August 1993.

Mitra, M, Singhal, A., and Buckley, C., 1998. “Improving Automated Query Expansion”, In Proceedings of SIGIR ’98, ACM Press: Melbourne, Australia, 1998, pp. 206-214.

Moldovan, D. and Mihalcea, R., 2000. Using WordNet and Lexical Operators to improve Internet Searches. IEEE Internet Computing, vol. 4, no. 1, January 2000.

Patwardhan, S., Banerjee, S. and Pedersen, T., 2003. Using measures of semantic relatedness for word sense disambiguation, in: Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City, 2003, pp. 241–257.

Pinkerton, B., 1994. “Finding What People Want: Experiences with the WebCrawler”, In Proceedings of the Second International World Wide Web Conference, Chicago, Illinois, USA, July, 1994.

Plamondon, L., Lapalme, G., Diro, R. and Kosseim, L., 2001. The QUANTUM Question Answering System. In: Proceedings of TREC-9 Conference, NIST, 2001.

Prager, J., Brown, E., Coden, A. and Radev, D., 2000. Question-answering by predictive annotation. In: Proceedings, 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece, July 2000.

Rada, R., Mili, H., Bicknell, E. and Blettner, M., 1989. Development and Application of a Metric on Semantic Nets. IEEE Transactions on Systems, Man and Cybernetics, Vol. 19, No. 1, 17-30.

Radev, D. R., Prager, J. and Samn, V., 2000. Ranking potential answers to natural language questions. In: Proceedings of the 6th Conference on Applied Natural Language Processing, Seattle, WA, May 2000.

Resnik, P., 1995. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, Vol. 1, 448-453, Montreal, August 1995

Robertson, S. and Jones, S., 1977. “The probability ranking principle in information retrieval.” Journal of Documentation, 33:294–304, 1977.

Rosso, P., Masulli, F. and Buscaldi, D., 2003. Word sense disambiguation combining conceptual distance, frequency and gloss. In Proceedings of International Conference on Natural Language Processing and Knowledge Engineering, page(s): 120-125, Oct. 2003

Sahlgren, M., Karlgren, J., Cöster, R. and Järvinen, T., 2002. “Automatic Query Expansion using Random Indexing”, In Proceedings of CLEF 2002, Rome, Italy, 2002.

Slator, Brian M. and Wilks, Yorick A., 1987. Towards Semantic Structures from Dictionary Entries. In Proceedings of the Second Annual Rocky Mountain Conference on Artificial Intelligence (RMCAI-87), Boulder, CO, June 17-19, pp. 85-96.

Salton, G., Wong, A., and Yang, C., 1975. “A vector space model for automatic indexing.” Communications of the ACM, 18:613–620, 1975.

Sussna, M., 1993. Word sense disambiguation for free text indexing using a massive semantic network. In Proceedings of the Second International Conference on Information and Knowledge Management, Arlington, Virginia, 1993.

Turtle, H. and Croft, W., 1991. “Evaluation of an inference network-based retrieval model.” ACM Transactions on Information Systems, 9(3):187–222, 1991.

Van Hage, W., de Rijke, M. and Marx, M., 2004. Information retrieval support for ontology construction and use. In: S. McIlraith, D. Plexousakis, and F. van Harmelen, editors, Proceedings 3rd International Semantic Web Conference (ISWC 2004), LNCS 3298, p. 518-533, 2004.

Van Rijsbergen, C., 1979. Information Retrieval. Butterworths, London, 1979.

Van Rijsbergen, C., 1986. “A non-classical logic for information retrieval.” Computer Journal, 29:481–485, 1986.

Yahoo. http://tw.yahoo.com.

Yang, C.Y., Hung, J.C., Wang, C.S., Chiu, M.S. and Yee, G., 2005. Applying Word Sense Disambiguation to Question Answering System for E-Learning. The 19th International Conference on Advanced Information Networking and Applications (AINA 2005), 28-30 March 2005, Taipei, Taiwan. IEEE Computer Society 2005, ISBN 0-7695-2249-1.

Yarowsky, D., 1992. Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora. Proceedings of the 15th International Conference on Computational Linguistics (Coling '92). Nantes, France.

Yu, J.S., Wen, Z.S., Liu, Y. and Jin, Z.H., 2004. Statistical Overview of WordNet from 1.6 to 2.0. The Second Global Wordnet Conference (GWC 2004), Brno, Czech Republic, January 20-23, 2004.
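Full-text access rights
On campus
Print copy available on campus immediately
Author consents to on-campus release of the electronic full text
Electronic full text released on campus 2 years after submission of the authorization form
Off campus
Author consents to licensing to database vendors
Electronic full text released off campus 2 years after submission of the authorization form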
