§ Browse Thesis Bibliographic Record
  
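System ID: U0002-2806200716574000
DOI: 10.6846/TKU.2007.00927
Thesis title (Chinese): 以聚合法(AGNES)提升檢索效果之研究—以中文新聞為例
Thesis title (English): The Research on Improving the Performance of Information Retrieval with the AGglomerative NESting (AGNES) Algorithm — Using a Chinese News Dataset
Thesis title (third language):
University: Tamkang University
Department (Chinese): 資訊管理學系碩士班
Department (English): Department of Information Management
Foreign degree university:
Foreign degree college:
Foreign degree institute:
Academic year: 95 (ROC calendar)
Semester: 2
Year of publication: 96 (ROC calendar; 2007)
Author (Chinese): 宋永杰
Author (English): Yung-Chieh Sung
Student ID: 693520560
Degree: Master's
Language: Traditional Chinese
Second language:
Oral defense date: 2007-06-09
Number of pages: 54
Oral defense committee: Advisor - 魏世杰
Member - 呂芳懌
Member - 陳彥良
Keywords (Chinese): 資訊檢索 (information retrieval)
聚合法 (agglomerative method)
分群 (clustering)
向量空間模式 (vector space model)
Keywords (English): Information Retrieval
Agglomerative Nesting Algorithm
Clustering
Vector Space Model
Keywords (third language):
Subject classification: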
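Chinese abstract
The relevant documents returned by a traditional vector space model retrieval system are often cluttered and unorganized, so users must filter them step by step to find the information that actually meets their needs. This study builds on the tree structure constructed by the agglomerative method (AGNES) and dynamically clusters, bottom-up, the results returned by a vector space model retrieval system into multiple clusters. Clusters are ranked by the average of the coupling and cohesion measures defined in this study, and documents within a cluster are ranked by their similarity to the query. The adjusted ranking improves precision, and the results are presented to the user as clusters for browsing.
This study uses a Chinese document collection and processes it through word segmentation, feature term selection, document vector construction, clustering, retrieval, clustering of the retrieval results, and ranking adjustment. Experimental results show that, in overall retrieval performance, the system improves the precision of a traditional vector space model retrieval system by about 20.9% to 24.0%. According to the Wilcoxon Signed Ranks Test, the system outperforms the traditional vector space model retrieval system for one-keyword and two-keyword queries.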
English abstract
The document ranking returned by an information retrieval system based on the traditional vector space model is usually unorganized: related documents often do not have adjacent ranks, so to avoid missing needed information the user has to read several unrelated documents before finding the next related one. In this research, we cluster the documents returned by the traditional vector space model based on the binary tree hierarchy constructed by the AGglomerative NESting (AGNES) algorithm. The clusters are ranked by the average of the coupling and cohesion measures proposed in this thesis, and the documents within each cluster are ranked by their similarity to the query. This ranking adjustment is intended to improve precision.
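The cluster-then-rerank idea described in this abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the record does not give the definitions of the coupling and cohesion measures, so the two formulas in `rerank` are hypothetical stand-ins, and the function names, the centroid-based merging rule, and the `threshold` parameter are likewise assumptions. The sketch also clusters the retrieved documents directly instead of reusing a tree built over the whole collection.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def agnes(docs, threshold):
    """Bottom-up merging: start from singleton clusters and repeatedly merge
    the two clusters with the most similar centroids, stopping once the best
    similarity falls below `threshold` (an assumed stopping rule)."""
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = centroid([docs[i] for i in clusters[a]])
                cb = centroid([docs[i] for i in clusters[b]])
                s = cosine(ca, cb)
                if s > best:
                    best, pair = s, (a, b)
        if best < threshold:
            break
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def rerank(docs, query, threshold=0.8):
    """Rank clusters by the average of two scores, then rank documents
    inside each cluster by their similarity to the query."""
    def score(cluster):
        members = [docs[i] for i in cluster]
        c = centroid(members)
        # Hypothetical stand-ins, NOT the thesis's definitions:
        cohesion = sum(cosine(d, c) for d in members) / len(members)
        coupling = cosine(c, query)
        return (cohesion + coupling) / 2

    ranked = sorted(agnes(docs, threshold), key=score, reverse=True)
    return [sorted(c, key=lambda i: cosine(docs[i], query), reverse=True)
            for c in ranked]

# Toy two-term document vectors, made up for illustration only.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
query = [1.0, 0.0]
print(rerank(docs, query))  # → [[0, 1], [3, 2]]
```

In this toy run the two documents close to the query form the top-ranked cluster, and each cluster lists its members in order of similarity to the query, which mirrors the two-level ranking the abstract describes.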
We used a Chinese news dataset and evaluated the system through word segmentation, vector representation, AGNES clustering, query-based document retrieval, and the final ranking adjustment. As a result, our system improves precision by 20.9% to 24.0% compared with the traditional vector space model. We also tested the results with the Wilcoxon Signed Ranks Test, which shows that our system is significantly better than the traditional vector space model for queries of one or two keywords.
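The evaluation quantities named here (precision, its relative improvement, and the Wilcoxon signed-rank comparison of paired per-query scores) can be sketched in plain Python as follows. The function names are hypothetical and the sample scores are made up for illustration; they are not the thesis's data. The thesis reports using SPSS for the test, so this hand-rolled statistic is only an illustration of what the test computes.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for d in retrieved if d in relevant) / len(retrieved)

def improvement_rate(baseline, system):
    """Relative improvement of `system` over a non-zero `baseline`."""
    return (system - baseline) / baseline

def wilcoxon_w(baseline, system):
    """Signed-rank sums (W+, W-) for paired scores.
    Zero differences are dropped; tied absolute differences share
    the average of the ranks they span."""
    diffs = [s - b for b, s in zip(baseline, system) if s != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus

# Made-up per-query precision values in percent (illustration only).
baseline = [40, 50, 60, 30, 55]
system = [50, 65, 55, 50, 55]
print(wilcoxon_w(baseline, system))  # → (9.0, 1.0)
```

A small W- relative to W+ indicates that most paired differences favor the second system; whether that is statistically significant depends on the number of non-zero pairs and the chosen significance level, which a statistics package would report as a p-value.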
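Abstract (third language):
Table of contents
Contents
Contents
List of Figures
List of Tables
1.	Introduction
1.1.	Research Motivation and Background
1.2.	Research Objectives
1.3.	Thesis Organization
2.	Related Work
2.1.	Chinese Word Segmentation
2.2.	Retrieval Systems
2.2.1.	Inverted File
2.2.2.	Vector Space Model
2.2.3.	Vector Model Retrieval and Ranking
2.2.4.	Evaluation Measures
2.3.	Document Clustering
2.3.1.	Partitional Clustering
2.3.2.	Hierarchical Clustering
3.	Improving Retrieval Performance with the Agglomerative Method
3.1.	System Architecture
3.2.	Document Preprocessing
3.2.1.	Chinese Word Segmentation
3.2.2.	Document Feature Representation
3.3.	Document Clustering
3.3.1.	Building Document Vectors
3.3.2.	Constructing the Tree Structure with the Agglomerative Method
3.4.	Clustering Retrieval Results and Adjusting the Ranking
4.	Experimental Results and Analysis
4.1.	Dataset and Experimental Method
4.1.1.	Dataset
4.1.2.	Experimental Environment
4.1.3.	Experimental Method
4.2.	Experimental Results and Analysis
4.2.1.	Effectiveness Experiment
4.2.2.	Experiment 1
4.2.3.	Experiment 2
4.2.4.	Experiment 3
4.2.5.	Analysis of Results
5.	Conclusions and Future Directions
5.1.	Conclusions
5.2.	Future Directions
6.	References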

 
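List of Figures
Figure 1: Two-dimensional illustration of the retrieval results for query q
Figure 2: The k-means clustering algorithm
Figure 3: Illustration of the agglomerative and divisive methods
Figure 4: The agglomerative clustering algorithm
Figure 5: System architecture
Figure 6: Tree structure
Figure 7: Illustration of agglomerative clustering
Figure 8: Two-dimensional illustration of the retrieval results for query q1
Figure 9: Illustration of the clustered retrieval results for query q1
Figure 10: Two-dimensional illustration of the retrieval results for query q2
Figure 11: Illustration of cohesion between query q2 and the clusters
Figure 12: Illustration of a cohesion misjudgment
Figure 13: Illustration of using coupling to correct a cohesion misjudgment
Figure 14: Illustration of a coupling misjudgment
Figure 15: Illustration of using cohesion to correct a coupling misjudgment
Figure 16: Clustering of retrieval results and ranking adjustment based on the agglomerative method
Figure 17: Precision chart
Figure 18: F-measure chart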
 
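List of Tables
Table 1: Retrieval ranking for query q
Table 2: List of articles
Table 3: Inverted file
Table 4: Reference labels and system responses
Table 5: Retrieval ranking for query q1
Table 6: Adjusted ranking for query q1
Table 7: Possible outcomes after clustering the retrieval results
Table 8: CIRB30 article relevance levels
Table 9: Statistics for the 590 articles in the dataset before and after word segmentation
Table 10: The 13 test topics
Table 11: Order in which keywords are added, using topic E ("computer viruses") as an example
Table 12: Number of keyword queries per topic
Table 13: F-measure improvement rate of the cluster ranking indicators for one-keyword queries
Table 14: F-measure improvement rate of the cluster ranking indicators for two-keyword queries
Table 15: F-measure improvement rate of the cluster ranking indicators for three-keyword queries
Table 16: Overall average F-measure improvement rate of the cluster ranking indicators
Table 17: Total number of times each cluster ranking indicator performed best
Table 18: Recall, precision, and F-measure at each recall point
Table 19: Improvement rates of average F-measure and average precision for 29 one-keyword queries over 13 topics
Table 20: Improvement rates of average F-measure and average precision for 26 two-keyword queries over 9 topics
Table 21: Improvement rates of average F-measure and average precision for 6 three-keyword queries over 4 topics
Table 22: Descriptive statistics of the system's retrieval performance improvement rates
Table 23: SPSS-Wilcoxon Signed Ranks Test-output1
Table 24: SPSS-Wilcoxon Signed Ranks Test-output2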
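Full-text access rights
On campus
Print thesis available on campus immediately
Author agrees to license the full electronic thesis for on-campus access
Electronic thesis available on campus immediately
Off campus
License granted
Electronic thesis available off campus immediately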
