§ Thesis Bibliographic Record
  
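System ID U0002-3007201212135600
DOI 10.6846/TKU.2012.01321
Title (Chinese) 商品展覽會深網整合及其關鍵字查詢排名策略
Title (English) Deep web integration of product exhibitions and its ranking strategy for keyword search
Title (third language)
University Tamkang University (淡江大學)
Department (Chinese) 資訊管理學系碩士班
Department (English) Department of Information Management
Foreign degree university
Foreign degree college
Foreign degree institute
Academic year 100 (ROC calendar; 2011-2012)
Semester 2
Year of publication 101 (ROC calendar; 2012)
Author (Chinese) 石永瑜
Author (English) Yung-Yu Shih
Student ID 699630157
Degree Master's
Language Traditional Chinese
Second language
Oral defense date 2012-05-26
Number of pages 78
Committee Advisor - 周清江
Member - 陳育亮
Member - 陸承志
Member - 廖賀田
Keywords (Chinese) 深網整合
關鍵字查詢
排名策略
Keywords (English) Deep Web Integration
Keyword Search
Ranking Strategy
Keywords (third language)
Subject classification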
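Abstract (Chinese; translated)
As Internet usage keeps growing, search engines have become an important tool for gathering information. Yet much valuable data remains hidden in deep web databases and cannot be found efficiently through traditional search engines. This study offers a solution, using online product exhibition databases as an example. Staff of small and medium enterprises and exhibitors often cannot reliably find out on the web when and where international exhibitions will be held, or which companies and related products will be exhibited; searching takes too long, the data found may be incomplete, and exhibition information cannot be collected effectively. This study integrates web data from different exhibitions in the same field and lets users query products by keyword; the results include the company that owns the product and that company's products related to the keyword. The work is carried out by two systems: (1) a data integration system, which uses web robots to collect data from multiple exhibition websites and integrates the information offered by the different sites into a relational database; and (2) a ranking system, which processes keyword queries and provides a ranking strategy. In addition to the adjustment factors used in past research (tuple tree size normalization, document length normalization, inverse document frequency normalization, and inter-document weight normalization), this study adds a specific-field occurrence weight and a heterogeneous-data multiplier weight to adjust the ordering, so that company and product information more relevant to the user's keywords ranks higher. A user test evaluation shows that when the specific-field occurrence weight is 9 and the heterogeneous-data multiplier weight is 2-7, the Mean Average Precision (MAP) is 0.6471, a 59.70% improvement over the approach that does not consider these two weights.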
Abstract (English)
With the rapid development of the World Wide Web, search engines have become an important tool for collecting information. However, a great deal of valuable information still resides in the deep web and cannot be found efficiently by traditional search engines. We tackle this problem using online product exhibition databases as an example. Personnel of small and medium enterprises (SMEs) and exhibitors often cannot find out on the web exactly when and where an international exhibition will be held, or which companies and related products will be in the exhibition. Collecting this information takes time, and the information found may be incomplete. In this study, we integrate information from different exhibition websites in the same field and let users search for products by keyword query. The query results include the product’s company and that company’s other products related to the keyword. The system combines two subsystems. The first is the crawler extracting system, which uses a network robot to collect data from many exhibition sites in the same field and integrates these data into a relational database. The second is the query processing system, which answers keyword queries using its ranking strategies. In addition to the tuple tree size normalization, document length normalization, document frequency normalization, and inter-document weight normalization used in past research, we add a specific field occurrences weight and a heterogeneous data weight to adjust the ranking list. The more closely company and product descriptions relate to the keywords, the nearer they are placed to the top of the results. When the specific field occurrences weight is 9 and the heterogeneous data weight is 2-7, our experiments yielded a MAP (Mean Average Precision) of 0.6471, a 59.70% improvement over past practices.
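The MAP figure reported in the abstract can be illustrated with a minimal sketch. The abstract does not say how MAP was computed in the experiments, so the code below assumes the standard information-retrieval definition (precision averaged over the ranks of the relevant results, then averaged over queries). The function names and the toy queries are hypothetical and are not taken from the thesis.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average precision of one ranked result list (standard IR definition)."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant item appears.
            precision_sum += hits / rank
    # Relevant items that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of the per-query average precision values."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)


if __name__ == "__main__":
    # Hypothetical toy data: two keyword queries with made-up result IDs.
    toy_runs = [
        (["A", "B", "C", "D"], {"A", "C"}),  # AP = (1/1 + 2/3) / 2
        (["X", "Y"], {"Y"}),                 # AP = (1/2) / 1
    ]
    print(round(mean_average_precision(toy_runs), 4))  # prints 0.6667
```

Under this definition, a ranking change that moves relevant companies or products toward the top raises the per-query average precision, which is the effect the two added weights are reported to have.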
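Abstract (third language)
Table of Contents
Contents	V
List of Figures	VIII
List of Tables	IX
Chapter 1	Introduction	1
1.1.	Research Background and Motivation	1
1.2.	Research Objectives	5
1.3.	Research Scope and Limitations	6
1.4.	Thesis Organization	7
Chapter 2	Literature Review	9
2.1.	Deep Web Data Integration	9
2.2.	Keyword Search in Relational Databases	13
2.3.	Ranking of Query Results	17
Chapter 3	Deep Web Integration of Product Exhibitions and Its Ranking Strategy for Keyword Search	20
3.1.	Current State of Exhibition Websites	20
3.2.	System Requirements and Analysis	25
3.2.1.	Functional Analysis for Integrating Heterogeneous Website Page Content	25
3.2.2.	Functional Analysis for Keyword Query Ranking	29
3.2.2.1.	Definitions of Terms Used in This Study	29
3.2.2.2.	Background on Keyword Query Ranking Strategies in Past Research	30
3.3.	System Design	35
3.3.1.	System Architecture and Components	35
3.3.2.	Data Integration System	37
3.3.3.	Ranking System	45
3.4.	Application Example under the Proposed Architecture	47
3.5.	System Implementation	54
Chapter 4	Experiments and Discussion	56
4.1.	Experimental Environment	56
4.2.	Experimental Dataset	56
4.3.	Experimental Goals	57
4.4.	Experimental Results and Discussion	61
4.4.1.	Experiment 1: 11-Point Average Precision Results for Adjusting the Specific-Field Occurrence Weight (α)	62
4.4.2.	Experiment 2: 11-Point Average Precision Results for Adjusting the Heterogeneous-Data Multiplier Weight (β)	63
4.4.3.	Experiment 3: Reciprocal Rank Difference (RRD) Results for Adjusting the Specific-Field Occurrence Weight (α)	64
4.4.4.	Experiment 4: Reciprocal Rank Difference (RRD) Results for Adjusting the Heterogeneous-Data Multiplier Weight (β)	66
4.4.5.	Experiment 5: User Study	68
Chapter 5	Conclusions and Future Work	69
References	71
Appendix 1: Table Schemas Used in This Study	75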

List of Figures
Figure 1: Deep web data integration framework of Liu Wei (劉偉) et al. [2]	10
Figure 2: System integration diagram of 黃執強 [1]	13
Figure 3: Exhibition information provided on the China Economic News Service (中經社) website	22
Figure 4: Example of exhibition information extracted by the web crawler	23
Figure 5: Basic company information provided on the Hong Kong Trade Development Council (香港貿易發展局) website	24
Figure 6: Basic product information provided on the Hong Kong Trade Development Council (香港貿易發展局) website	25
Figure 7: Example of exhibition classification formats	26
Figure 8: System architecture of this study	36
Figure 9: Breadth-first web crawling architecture of this study	40
Figure 10: Web crawling flowchart of this study	40
Figure 11: Entity-relationship diagram of the database of this study	43
Figure 12: Illustration of tuple tree construction in this study	48
Figure 13: 11-Point Average Precision results for adjusting the specific-field occurrence weight	63
Figure 14: 11-Point Average Precision results for adjusting the heterogeneous-data multiplier weight	64
Figure 15: Reciprocal rank difference results for adjusting the specific-field occurrence weight	66
Figure 16: Reciprocal rank difference results for adjusting the heterogeneous-data multiplier weight	67

List of Tables
Table 1: Characteristics of related research on keyword search in relational databases	16
Table 2: Tables in the database of this study	44
Table 3: Data statistics of the database of this study	48
Table 4: Example database contents for the keyword "notebook"	49
Table 5: Parameter values of the similarity computation for the keyword "notebook"	50
Table 6: Tuple tree size normalization results for the keyword "notebook"	51
Table 7: Database statistics for entries matching the keyword "notebook"	51
Table 8: Document length normalization results for the keyword "notebook"	51
Table 9: Relevant-term frequency normalization results for the keyword "notebook"	52
Table 10: Inverse document frequency normalization results for the keyword "notebook"	53
Table 11: Weight of the keyword () in each document () for the keyword "notebook"	54
Table 12: Experimental environment of this study	56
Table 13: Example of the reciprocal rank difference for the keyword "notebook"	59
Table 14: Schema of the exhibition table	75
Table 15: Schema of the company table	75
Table 16: Schema of the product table	76
Table 17: Schema of the exhibition-company relation table	77
Table 18: Schema of the exhibition-product relation table	77
Table 19: Schema of the data source table	78
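References
1.	黃執強. (2005). 同性質網頁資料整合之自動化研究 [A study on automating the integration of homogeneous web page data]. Master's thesis, Department of Information Management, National Central University.
2.	劉偉, 孟小峰, & 孟衛一. (2007). Deep Web 資料整合研究綜述 [A survey of Deep Web data integration]. 計算機學報 (Chinese Journal of Computers), 30(9), 1475-1489.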
3.	Agrawal, S., Chaudhuri, S., & Das, G. (2002). DBXplorer: A system for keyword-based search over relational databases. In: Proc. of the 18th Int’l Conf. on Data Engineering (ICDE 2002), 5-16.
4.	Ahonen-Myka, H., Heinonen, O., Klemettinen, M., & Verkamo, A. I. (1999). Finding co-occurring text phrases by combining sequence and frequent set discovery. In Proceedings of 16th International Joint Conference on Artificial Intelligence IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications, 1-9.
5.	Arlotta, L., Crescenzi, V., Mecca, G., & Merialdo, P. (2003). Automatic annotation of data extracted from large web sites. In Proceedings of the International Workshop on the Web and Databases, 7-12.
6.	Balmin, A., Hristidis, V., & Papakonstantinou, Y. (2004). Objectrank: Authority-based keyword search in databases. In: VLDB’04, 564-575.
7.	Bergman, M. K. (2001). White paper: the deep web: surfacing hidden value. Journal of Electronic Publishing, 7(1). 
8.	Bhalotia, G., Hulgeri, A., Nakhe, C., Chakrabarti, S., & Sudarshan, S. (2002). Keyword searching and browsing in databases using BANKS. In ICDE, 431-440.
9.	Cafarella, M. J., Halevy, A., & Khoussainova, N. (2009). Data integration for the relational web. Proceedings of the VLDB Endowment, 2(1), 1090-1101. 
10.	Cai, D., Yu, S., Wen, J. R., & Ma, W. Y. (2004). Block-based web search. The 27th Annual International ACM SIGIR Conference on Information Retrieval, 456-463.
11.	Cormack, G. V. & Lynam, T. R. (2007). Power and bias of subset pooling strategies. In Proceedings of SIGIR, 837-838.
12.	Dasgupta, A., Das, G., & Mannila, H. (2007). A random walk approach to sampling hidden databases. In Proc. SIGMOD, 629-640.
13.	De Felipe, I., Hristidis, V., & Rishe, N. (2008). Keyword search on spatial databases. In ICDE, 688-699.
14.	Fangjiao, J., Linlin, J., & Xiaofeng, M. (2007). Query translation on the fly in Deep Web integration. Wuhan University Journal of Natural Sciences, 12(5), 819-824.
15.	Florescu, D., Kossmann, D., & Manolescu, I. (2000). Integrating keyword search into XML query processing. Computer Networks, 33(1), 119-135. 
16.	Gil, P. (2012). What Is the 'Invisible Web'? The Content That Goes Beyond Google, Yahoo, Bing, and Ask.com. (http://netforbeginners.about.com/cs/secondaryweb1/a/secondaryweb.htm)
17.	Guo, L., Shao, F., Botev, C., & Shanmugasundaram, J. (2003). XRANK: ranked keyword search over XML documents. In Proc. 2003 ACM SIGMOD International Conference on Management of Data, 16-27.
18.	Hristidis, V., Gravano, L., & Papakonstantinou, Y. (2003). Efficient IR-style keyword search over relational databases. VLDB 2003, 850-861.
19.	Hristidis, V., Hwang, H., & Papakonstantinou, Y. (2008). Authority-based keyword search in databases. ACM Transactions on Database Systems (TODS), 33(1), 1. 
20.	Hristidis, V., & Papakonstantinou, Y. (2002). Discover: Keyword search in relational databases. In Proc. 28th Int. Conf. Very Large Data Bases, VLDB, 670-681.
21.	Hristidis, V., Papakonstantinou, Y., & Balmin, A. (2003). Keyword proximity search on XML graphs. In ICDE, 367-378.
22.	Kacholia, V., Pandit, S., Chakrabarti, S., Sudarshan, S., Desai, R., & Karambelkar, H. (2005). Bidirectional expansion for keyword search on graph databases. In VLDB, 505-516.
23.	Li, G., Ooi, B. C., Feng, J., Wang, J., & Zhou, L. (2008). EASE: an effective 3-in-1 keyword search method for unstructured, semi-structured and structured data. In SIGMOD, 903-914.
24.	Ling, Y. Y., Meng, X. F., & Liu, W. (2008). An attributes correlation based approach for estimating size of web databases. Ruan Jian Xue Bao(Journal of Software), 19(2), 224-236. 
25.	Liu, F., Yu, C., Meng, W., & Chowdhury, A. (2006). Effective keyword search in relational databases. In Proceedings of SIGMOD, 563-574.
26.	Liu, K. L., Santoso, A., Yu, C., & Meng, W. (2001). Discovering the representative of a search engine. In Proceedings of 10th ACM International Conference on Information and Knowledge Management (CIKM), 577-579.
27.	Liu, W., Meng, X. F., & Ling, Y. Y. (2008). A graph-based approach for Web database sampling. Ruan Jian Xue Bao(Journal of Software), 19(2), 179-193. 
28.	Luo, Y., Wang, W., Lin, X., Zhou, X., Wang, J., & Li, K. (2007). SPARK2: Top-k Keyword Query in Relational Databases. Knowledge and Data Engineering, IEEE Transactions on(99), 1763-1780. 
29.	Markowetz, A., Yang, Y., & Papadias, D. (2007). Keyword search on relational data streams. In Proc of SIGMOD’07, 605-616.
30.	Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. Stanford Digital Libraries Working Paper.
31.	Qin, L., Yu, J. X., & Chang, L. (2011). Scalable keyword search on large data streams. The VLDB Journal, 20(1), 35-57. 
32.	Qin, L., Yu, J. X., Chang, L., & Tao, Y. (2009). Querying communities in relational databases. In Proc. of ICDE’09, 724-735.
33.	Sayyadian, M., LeKhac, H., Doan, A. H., & Gravano, L. (2007). Efficient keyword search across heterogeneous relational databases. In ICDE, 346-355.
34.	Shokouhi, M., Zobel, J., Scholer, F., & Tahaghoghi, S. (2006). Capturing collection size for distributed non-cooperative retrieval. In Proc. ACM SIGIR Conf., 316-323.
35.	Si, L., & Callan, J. (2003). Relevant document distribution estimation method for resource selection. In Proc. of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 298-305.
36.	Singhal, A., Buckley, C., & Mitra, M. (1996). Pivoted document length normalization. In Proceedings of the 19th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR’96), 21-29.
37.	Vu, Q. H., Ooi, B. C., Papadias, D., & Tung, A. K. H. (2008). A graph method for keyword-based selection of the top-k databases. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, 915-926.
38.	Wei, W., Liu, M., & Li, S. (2004). Merging of XML documents. Conceptual Modeling–ER 2004, 273-285. 
39.	Yu, B., Li, G., Sollins, K., & Tung, A. K. H. (2007). Effective keyword-based selection of relational databases. In SIGMOD, 139-150.
40.	Zhang, D., Chee, Y. M., Mondal, A., Tung, A., & Kitsuregawa, M. (2009). Keyword search in spatial databases: Towards searching by document. In ICDE, 688-699.
41.	Zhang, J., Peng, Z., Wang, S., & Nie, H. (2006). Si-SEEKER: Ontology-based semantic search over databases. Knowledge Science, Engineering and Management, 599-611. 
42.	Zhou, B., & Pei, J. (2009). Answering aggregate keyword queries on relational databases using minimal group-bys. In EDBT’09, 108-119.
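Full-Text Access Rights
On campus
Print copy available on campus immediately
Author consents to on-campus release of the electronic full text
Electronic copy available on campus immediately
Off campus
Authorization granted
Electronic copy available off campus immediately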
