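Electronic Theses and Dissertations Service

§ Browse Thesis Bibliographic Record

The electronic full text of this thesis has been publicly available off campus since 2008-08-07.
The print copy of this thesis has been publicly available since 2008-08-07.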

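System ID	U0002-0508200815260700
DOI	10.6846/TKU.2008.00131
Title (Chinese)	應用搭配詞概念提升語法搜尋系統之效能
Title (English)	Improving the Syntax-base Retrieval System Using Collocation Indexing
Title (third language)
Institution	Tamkang University (淡江大學)
Department (Chinese)	資訊工程學系碩士班
Department (English)	Department of Computer Science and Information Engineering
Foreign degree institution
Foreign degree college
Foreign degree graduate institute
Academic year	96 (ROC calendar)
Semester	2
Publication year	97 (ROC calendar)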
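Author (Chinese)	陳瑞璟
Author (English)	Ruey-Jinng Chen
Student ID	695410596
Degree	Master's
Language	Traditional Chinese
Second language	English
Oral defense date	2008-06-20
Pages	71
Committee	Advisor - 郭經華; Committee member - 陳孟彰; Committee member - 王英宏
Keywords (Chinese)	POS tagging; lemmatization; index construction
Keywords (English)	POS tagging Lemmatizing Collocation k-gram Indexing
Keywords (third language)
Subject classification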
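Abstract (Chinese)	This thesis designs a syntax query system and combines it with movies to form a movie retrieval system that can be searched by grammatical pattern. The combination is intended to help English teachers move away from rigid grammar instruction: the system's syntax queries find matching movie scenes and dialogue, which can raise students' interest in learning. To support such queries, the movie subtitles are first preprocessed with part-of-speech (POS) tagging and lemmatization. The tagged and lemmatized dialogue is stored in Extensible Markup Language (XML) so that the search engine can match it against regular expressions. Because regular expressions serve as the query language, matching a query against the sentences is time-consuming. To address this, index tables are built over the subtitles to reduce the number of sentences that the regular expression must be matched against. Two indexing methods are used. The first is K-gram indexing, which comprises multi-gram splitting, the useful index, and the presuf free set. The second is collocation indexing, which comprises collocation construction and collocation filtering. In the implementation, the two index tables are compared to determine whether adding the collocation index improves search performance. The experimental results show that the collocation index does reduce the number of sentences the search engine must match against the regular expression. When the number of index entries drops and fewer sentences need to be matched, the search time decreases accordingly. The collocation indexing concept proposed in this thesis therefore improves system performance, and this is its main contribution to index construction and system speed-up.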
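The index-then-match idea in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the `lemma/TAG` sentence encoding, the toy sentences, the function names, and the plain character 3-gram index are all hypothetical, and the sketch leaves out the multi-gram, useful index, and presuf free set steps named in the abstract.

```python
import re
from collections import defaultdict

# Toy subtitle lines as "lemma/TAG" tokens (invented data; tag names are illustrative).
SENTENCES = [
    "i/PRP have/VBP be/VBN wait/VBG for/IN you/PRP",
    "she/PRP pay/VBD ten/CD dollar/NNS",
    "we/PRP have/VBP see/VBN it/PRP",
    "pay/VB attention/NN",
]


def kgrams(text, k=3):
    """Return the set of character k-grams in text (empty if text is shorter than k)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def build_kgram_index(sentences, k=3):
    """Map each character k-gram to the ids of the sentences that contain it."""
    index = defaultdict(set)
    for sid, sentence in enumerate(sentences):
        for gram in kgrams(sentence, k):
            index[gram].add(sid)
    return index


def search(pattern, literal, sentences, index, k=3):
    """Run the regex only on sentences that contain every k-gram of a required literal.

    `literal` is a substring that any match of `pattern` must contain; choosing it
    by hand is an assumption of this sketch.
    """
    candidates = set(range(len(sentences)))
    for gram in kgrams(literal, k):
        candidates &= index.get(gram, set())
    hits = [sid for sid in sorted(candidates) if re.search(pattern, sentences[sid])]
    return sorted(candidates), hits


INDEX = build_kgram_index(SENTENCES)

# Present perfect: "have" followed by a past participle (VBN).
perfect_candidates, perfect_hits = search(r"have/\S+ \S+/VBN", "have/", SENTENCES, INDEX)

# "pay" followed by a cardinal number (CD): the index narrows 4 sentences to 2,
# and the regex then rejects one of them.
pay_candidates, pay_hits = search(r"pay/\S+ \S+/CD", "pay/", SENTENCES, INDEX)
```

The candidate list is always a superset of the true matches, so the regular expression remains the final check; the index only limits how many sentences it has to scan.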
Abstract (English)	The purpose of this thesis is to design a syntax search system and apply it to a movie search system. Concepts from linguistics, in particular collocation, are applied to speed up the syntax search. First, the words in the database are labeled with their parts of speech. From the results, a K-gram index and a Collocation index are constructed. Several common English syntax rules and sentence structures serve as test patterns. After the test runs, the K-gram index and the Collocation index are compared. For some of the sentence patterns, the Collocation index search yields a far smaller sample space than the K-gram index alone; that is, the Collocation index finds the correct results from fewer candidate sentences, which reduces the time spent in query matching.
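The smaller sample space that the abstract attributes to the Collocation index can be illustrated with a toy comparison. This is a sketch under assumptions, not the thesis's algorithm: the sentences are invented, the function names are hypothetical, and a "collocation" is simplified here to an ordered word pair at most `window` tokens apart, with no filtering step.

```python
from collections import defaultdict

# Toy lemmatized subtitle lines (invented data).
SENTENCES = [
    ["she", "pay", "ten", "dollar"],
    ["pay", "attention", "to", "he"],
    ["ten", "people", "pay", "cash"],
    ["i", "pay", "he", "ten", "time"],
    ["he", "pay", "the", "bill"],
]


def build_word_index(sentences):
    """Map each word to the ids of the sentences that contain it."""
    index = defaultdict(set)
    for sid, words in enumerate(sentences):
        for word in words:
            index[word].add(sid)
    return index


def build_collocation_index(sentences, window=2):
    """Map ordered word pairs at most `window` tokens apart to sentence ids.

    This pair definition is an assumption made for the sketch.
    """
    index = defaultdict(set)
    for sid, words in enumerate(sentences):
        for i, first in enumerate(words):
            for second in words[i + 1:i + 1 + window]:
                index[(first, second)].add(sid)
    return index


word_index = build_word_index(SENTENCES)
colloc_index = build_collocation_index(SENTENCES)

# Query: "pay" followed closely by "ten".
# Single-word indexes can only intersect the two posting sets ...
word_candidates = word_index["pay"] & word_index["ten"]
# ... while the pair index also encodes order and proximity.
colloc_candidates = colloc_index.get(("pay", "ten"), set())
```

In this toy data the word-level intersection keeps sentence 2, where "ten" precedes "pay", while the pair lookup excludes it, so fewer sentences reach the regular expression matcher.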
Abstract (third language)
Table of contents
Chapter 1 Introduction
1.1 Research motivation and objectives
1.2 Research content
1.3 Outline of the thesis
Chapter 2 Background and related work
2.1 XML document markup language
2.2 XML document indexing mechanisms
2.3 POS tagging
2.4 Collocations
Chapter 3 System architecture and design
3.1 System architecture
3.2 Index table design
3.2.1 K-gram indexing
3.2.1.1 Multi-gram
3.2.1.2 Useful index
3.2.1.3 Presuf free set
3.2.2 Collocation index table
3.2.3 Text retrieval system
3.3 Syntax search
Chapter 4 Implementation and discussion
4.1 System overview
4.2 Implementation tests of the collocation index table
Chapter 5 Conclusions and future work
5.1 Conclusions
5.2 Future research directions
References
Appendix
List of figures
Figure 2.1 Tag names defined in the DTD
Figure 2.2 Example of an XML document structure
Figure 2.3 Relationship between well-formed and valid XML documents
Figure 3.1 Architecture of the video retrieval system
Figure 3.2 XML form after POS tagging and lemmatizing
Figure 3.3 Illustration of prefix and suffix
Figure 3.4 Illustration of collocation selection
Figure 3.5 Subsystem architecture
Figure 3.6 K-gram Indexing Algorithm
Figure 3.7 Collocation Indexing Algorithm
Figure 3.8 Intersection of the sets containing the index terms "pay" and "ten"
Figure 4.1 System search interface
Figure 4.2 System search results
Figure 4.3 Ten sentence patterns and their regular expressions
Figure 4.4 Comparison before and after collocation indexing
List of tables
Table 2.1 TnT accuracy on different corpora
Table 2.2 POS categories of the example sentence after tagging
Table 2.3 Common English collocations
Table 3.1 Index term table
Table 4.1 POS categories provided by the system
Table 4.2 Number of index entries without and with presuf free set processing
Table 4.3 Number of index entries without and with filtering
Table 4.4 Difference in the number of sentences retrieved before and after adding collocation
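Full-text access rights	On campus: print copy available immediately; electronic full text authorized for on-campus access; electronic copy available on campus immediately. Off campus: electronic copy authorized for immediate off-campus access.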
