Electronic Theses and Dissertations Service

§ Browse Thesis Bibliographic Record

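The electronic full text of this thesis has been publicly available off campus since 2017-07-26.
The print copy of this thesis has been publicly available since 2017-07-26.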

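System ID	U0002-2107201710110100
DOI	10.6846/TKU.2017.00741
Thesis title (Chinese)	興趣點之多標籤分類方法研究
Thesis title (English)	A Study of Multi-label Classification of POI
Thesis title (third language)
University	淡江大學 (Tamkang University)
Department (Chinese)	電機工程學系碩士班
Department (English)	Department of Electrical and Computer Engineering
Foreign degree: university name
Foreign degree: college name
Foreign degree: graduate institute name
Academic year	105
Semester	2
Publication year	106
Graduate student (Chinese)	邱捷琦
Graduate student (English)	Chieh-Chi Chiu
Student ID	604450220
Degree	Master's
Language	Traditional Chinese
Second language
Oral defense date	2017-07-10
Number of pages	60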
Oral defense committee	Advisor - 李維聰 (wtlee@mail.tku.edu.tw); Member - 衛信文 (hsinwen.wei@gmail.com); Member - 朱國志 (kcchu@mail.lhu.edu.tw)
Keywords (Chinese)	分類; 機器學習; 爬蟲; 相似度; 支持向量機; 最近鄰居法
Keywords (English)	Classification; Similarity; Machine learning; Web crawler; SVM; kNN
Keywords (third language)
Subject classification
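Abstract (translated from Chinese)	In this era of information explosion and a pervasive Internet, the online world holds a vast amount of information, and recommendation systems emerged to help people find what they want within it. A good recommendation system depends on good classification, which is why classification systems became a field of study. A good classification system lets users find suitable material quickly; a poor one costs time and may return material the user did not want. Traditional single-label classification can only tell whether an item belongs to one given category. If users do not know beforehand which category the information they seek belongs to, the search takes longer. Multi-label classification, by contrast, increases the relatedness of information and gives unknown items several additional clues by which they can be found, so users can find material that meets their needs more quickly. This thesis therefore studies a multi-label classification mechanism for places that interest users (points of interest), using food, clothing, accommodation, transportation, and education and recreation as labels. We first use a web crawler to obtain web page data: when place data is received, the place name is used as the search term and the Google Custom Search API retrieves web pages for data collection. A word segmentation system (CKIP) then processes the collected page content, and weights are calculated for all terms; the weights indicate how relevant the page terms are to each category. Next, the retrieved page content is used to build keyword tables for five categories: food, clothing, accommodation, transportation, and education and recreation. We then quantify each place name and take the weight, the similarity, and the similarity match rate as three features. Finally, these three features are used with kNN and SVM to obtain single-label classification results, and the single-label results are classified once more to achieve the multi-label classification that this thesis targets. The experiment performs multi-label classification on unknown place information, so that a user in an unfamiliar place can enter a place name on a social networking site and find information about that place. The experimental results show that classification improves as more information is available and degrades as less is available, and that a larger k value gives better classification. In the future, we hope to widen the scope of classification: with more information, more items can be classified, and the wider scope should improve the diversity and accuracy of the data.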
Abstract (English)	With the rise of Internet technology and the development of mobile applications, more and more data surrounds us. However, it is not always easy to find the information people need. A good recommendation system is therefore required to deliver useful or interesting information, and such a system in turn needs a good classification of data. Good classification allows the system to process users' requests easily and efficiently; poor classification makes recommendations useless and time-consuming. Traditional single-label classification can only determine whether an item belongs to one particular category. If users do not know the category of the information before searching, finding it takes longer. In contrast, multi-label classification captures the relevance of information across categories, providing additional clues for unknown data and allowing users to obtain the needed information faster. This thesis therefore studies a multi-label classification mechanism that classifies data into the following categories: food, clothing, accommodation, transportation, and education and recreation. We first use a web crawler to obtain web page information. When we receive information about a place, we search for its name with the Google Custom Search API to retrieve web pages and collect data. A word segmentation system (CKIP) then processes the collected web content, and the weights of all terms are calculated. These weights indicate the relevance between page terms and categories. Second, we use the web content to construct keyword tables containing words related to the food, clothing, accommodation, transportation, and education and recreation categories.
Then we use three features with kNN and SVM to obtain single-label classification results. To improve the diversity of information, the single-label results are sorted after the unknown information has been classified into food, clothing, accommodation, transportation, and education and recreation. The classifiers are then applied to these results to obtain the multi-label classification results. The experiment sorts unknown location information into multi-label categories, so that a user in an unfamiliar place can enter a place name on a social networking site to find information about that location. The experimental results show that the more data we collect, the better the classification results; conversely, when less data is obtained, the classification results are poor. The results also show that classification improves as the k value increases. In the future, we want to extend the scope of the classification to include more data and thereby increase the diversity and accuracy of classification.
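The pipeline described in the abstract (keyword tables, three features per place, kNN on those features, then a multi-label result) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the keyword tables, token lists, and function names are hypothetical; the weight uses plain relative term frequency rather than the thesis's weighting formulas; tokens are assumed to be already segmented (the thesis uses CKIP); and the multi-label step uses binary relevance (one yes/no kNN per category) in place of the thesis's second-stage classification with combined keyword tables.

```python
import math
from collections import Counter

# Hypothetical keyword tables (term -> weight), invented for illustration.
KEYWORD_TABLES = {
    "food": {"restaurant": 0.9, "menu": 0.8, "noodle": 0.7},
    "accommodation": {"hotel": 0.9, "room": 0.8, "check-in": 0.6},
}

def cosine(a, b):
    """Cosine similarity between two sparse term -> weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def features(tokens, table):
    """Return (weight, similarity, match rate) of a token list against one table."""
    if not tokens:
        return (0.0, 0.0, 0.0)
    counts = Counter(tokens)
    doc = {t: c / len(tokens) for t, c in counts.items()}  # relative term frequency
    weight = sum(doc[t] * table[t] for t in doc if t in table)
    similarity = cosine(doc, table)
    match_rate = sum(1 for t in table if t in doc) / len(table)
    return (weight, similarity, match_rate)

def knn_predict(train, x, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    votes = sum(label for _, label in nearest)
    return votes * 2 > len(nearest)

def multi_label(tokens, training_sets, k=3):
    """Binary relevance: run one yes/no kNN per category and keep every 'yes'."""
    return [cat for cat, table in KEYWORD_TABLES.items()
            if knn_predict(training_sets[cat], features(tokens, table), k)]

# Toy, hand-made "pages" (already tokenized); not data from the thesis.
FOOD_PAGES = [["restaurant", "menu", "noodle"], ["menu", "noodle", "soup"],
              ["restaurant", "dinner", "menu"]]
HOTEL_PAGES = [["hotel", "room", "check-in"], ["room", "hotel", "lobby"],
               ["hotel", "check-in", "bed"]]

def labelled(table, positives, negatives):
    """Build (feature vector, 0/1 label) pairs for one category."""
    return ([(features(p, table), 1) for p in positives] +
            [(features(n, table), 0) for n in negatives])

TRAINING = {
    "food": labelled(KEYWORD_TABLES["food"], FOOD_PAGES, HOTEL_PAGES),
    "accommodation": labelled(KEYWORD_TABLES["accommodation"], HOTEL_PAGES, FOOD_PAGES),
}

print(multi_label(["hotel", "restaurant", "menu", "room"], TRAINING))
# ['food', 'accommodation']
```

A place whose pages mention both hotel and restaurant terms receives both labels here, which mirrors the combined categories (for example food plus accommodation) that the thesis evaluates. Swapping `knn_predict` for an SVM would keep the same three-feature input.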
Abstract (third language)
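Table of contents
Chapter 1 Introduction
  1.1 Preface
  1.2 Motivation and Objectives
  1.3 Thesis Organization
Chapter 2 Related Work and Background
  2.1 Machine Learning
    2.1.1 Support Vector Machine (SVM)
    2.1.2 k-nearest neighbors (kNN)
  2.2 Similarity Calculation
    2.2.1 Euclidean distance
    2.2.2 Cosine similarity
  2.3 Chinese Knowledge and Information Processing (CKIP)
  2.4 Related Work
Chapter 3 Research Method
  3.1 Weight Calculation
  3.2 Keyword Tables
    3.2.1 Keyword Table Construction
    3.2.2 Similarity Calculation
    3.2.3 Similarity Match Rate
  3.3 Experimental Procedure
Chapter 4 Research Results
  4.1 Experimental Environment
    4.1.1 Place Data Collection
    4.1.2 Web Page Data Collection
    4.1.3 Vocabulary Data Collection
  4.2 Keyword Tables
    4.2.1 Keyword Tables by Category for Single-label Classification
    4.2.2 Keyword Tables by Category for Multi-label Classification
  4.3 Features
  4.4 Discussion of Data
    4.4.1 Three Features in Single-label Classification
    4.4.2 Three Features in Multi-label Classification
    4.4.3 Data Analysis
Chapter 5 Conclusion and Future Work
References

List of figures
Figure 1.1 Check-in illustration
Figure 1.2 Location with only a place name
Figure 2.1 SVM example
Figure 2.2 kNN example
Figure 3.1 Weight example
Figure 3.2 Keyword table construction flowchart
Figure 3.3 Stage 1 experimental flowchart
Figure 3.4 Stage 2 experimental flowchart
Figure 4.1 Data collection area
Figure 4.2 LibSVM plot, food category, three features with Equation 3-5
Figure 4.3 LibSVM plot, clothing category, three features with Equation 3-5
Figure 4.4 LibSVM plot, accommodation category, three features with Equation 3-5
Figure 4.5 LibSVM plot, transportation category, three features with Equation 3-5
Figure 4.6 LibSVM plot, education and recreation category, three features with Equation 3-5
Figure 4.7 LibSVM plot, food category, three features with Equation 3-6
Figure 4.8 LibSVM plot, clothing category, three features with Equation 3-6
Figure 4.9 LibSVM plot, accommodation category, three features with Equation 3-6
Figure 4.10 LibSVM plot, transportation category, three features with Equation 3-6
Figure 4.11 LibSVM plot, education and recreation category, three features with Equation 3-6
Figure 4.12 LibSVM plot, food category, three features with Equation 3-7
Figure 4.13 LibSVM plot, clothing category, three features with Equation 3-7
Figure 4.14 LibSVM plot, accommodation category, three features with Equation 3-7
Figure 4.15 LibSVM plot, transportation category, three features with Equation 3-7
Figure 4.16 LibSVM plot, education and recreation category, three features with Equation 3-7
Figure 4.17 LibSVM plot, food and accommodation, unmodified keyword tables
Figure 4.18 LibSVM plot, food and education and recreation, unmodified keyword tables
Figure 4.19 LibSVM plot, clothing and education and recreation, unmodified keyword tables
Figure 4.20 LibSVM plot, transportation and education and recreation, unmodified keyword tables
Figure 4.21 LibSVM plot, food and accommodation, modified keyword tables
Figure 4.22 LibSVM plot, food and education and recreation, modified keyword tables
Figure 4.23 LibSVM plot, clothing and education and recreation, modified keyword tables
Figure 4.24 LibSVM plot, transportation and education and recreation, modified keyword tables

List of tables
Table 3-1 Accommodation keyword table
Table 3-2 Food and accommodation keyword table
Table 3-3 Deletion rule examples
Table 4-1 Place names obtained with the Facebook Place API
Table 4-2 Selected place names and their related web pages
Table 4-3 Parts of speech that carry no useful information
Table 4-4 Food keyword table
Table 4-5 Clothing related-word table
Table 4-6 Accommodation related-word table
Table 4-7 Transportation related-word table
Table 4-8 Education and recreation keyword table
Table 4-9 Food and accommodation keyword table
Table 4-10 Food and education and recreation keyword table
Table 4-11 Education and recreation and clothing keyword table
Table 4-12 Education and recreation and transportation keyword table
Table 4-13 Web page and food features
Table 4-14 Web page and food-accommodation features
Table 4-15 Place and food features
Table 4-16 Place and food-accommodation features
Table 4-17 iBk accuracy, three features with Equation 3-5
Table 4-18 iBk accuracy, three features with Equation 3-6
Table 4-19 iBk accuracy, three features with Equation 3-7
Table 4-20 LibSVM accuracy for each equation with three features
Table 4-21 iBk accuracy, unmodified keyword tables
Table 4-22 LibSVM plot, unmodified keyword tables
Table 4-23 iBk accuracy, modified keyword tables
Table 4-24 LibSVM plot, modified keyword tables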
References
[1] 民生六大需求 [Online]. Available: https://note.com.tw/2016/08/28/食、衣、住、行、育、樂/
[2] Machine Learning [Online]. Available: https://zh.wikipedia.org/wiki/機器學習
[3] J. Park, J. Park and J. Choi, "Web-Based Document Classification Using a Trie-Based Index Structure," in Web Intelligence and Intelligent Agent Technology Workshops, 2007 IEEE/WIC/ACM International Conferences on, 2007, pp. 52-55.
[4] M. Engin and T. Can, "Text classification in the Turkish marketing domain for context sensitive ad distribution," in Computer and Information Sciences, 2009. ISCIS 2009. 24th International Symposium on, 2009, pp. 105-110.
[5] I. C. Kim, D. X. Le and G. R. Thoma, "Automated method for extracting “citation sentences” from online biomedical articles using SVM-based text summarization technique," in 2014 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2014, pp. 1991-1996.
[6] k-nearest neighbors algorithm [Online]. Available: https://zh.wikipedia.org/wiki/最近鄰居法
[7] Rong-Lu Li and Yun-Fa Hu, "Noise reduction to text categorization based on density for kNN," in Machine Learning and Cybernetics, 2003 International Conference on, 2003, pp. 3119-3124, vol. 5.
[8] Hua Jiang, Ping Li, Xin Hu and Shuyan Wang, "An improved method of term weighting for text classification," in Intelligent Computing and Intelligent Systems, 2009. ICIS 2009. IEEE International Conference on, 2009, pp. 294-298.
[9] L. Li, Y. Che, H. Zhang, T. Li and M. Yang, "kNN text categorization algorithm based on LSA reduce dimensionality," in Information Technology and Artificial Intelligence Conference (ITAIC), 2011 6th IEEE Joint International, 2011, pp. 72-75.
[10] Euclidean distance [Online]. Available: https://en.wikipedia.org/wiki/Euclidean_distance
[11] K. Taghva and R. Veni, "Effects of Similarity Metrics on Document Clustering," in Information Technology: New Generations (ITNG), 2010 Seventh International Conference on, 2010, pp. 222-226.
[12] A. Amine, Z. Elberrichi, M. Simonet and M. Malki, "WordNet-Based and N-Grams-Based Document Clustering: A Comparative Study," in Broadband Communications, Information Technology & Biomedical Applications, 2008 Third International Conference on, 2008, pp. 394-401.
[13] Huaizhong Kou and G. Gardarin, "Similarity model and term association for document categorization," in Database and Expert Systems Applications, 2002. Proceedings. 13th International Workshop on, 2002, pp. 256-260.
[14] V.V. Raghavan and S.K.M. Wong, "A critical analysis of vector space model for information retrieval," Journal of the American Society for Information Science, vol. 37, no. 5, pp. 279-287, 1986.
[15] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing & Management, vol. 24, no. 5, pp. 513-523, 1988.
[16] J. Ni, F. Kong, P. Li and Q. Zhu, "Research on Cross-Document Coreference of Chinese Person Name," in Asian Language Processing (IALP), 2011 International Conference on, 2011, pp. 81-84.
[17] 中文詞知識庫小組 [Online]. Available: http://ckip.iis.sinica.edu.tw/CKIP/index.htm
[18] Xue X., Zhou Z., "Distributional Features for Text Categorization," 17th European conference on machine learning, Berlin, Germany, September 18-22, 2006.
[19] Zhong-Xing Xie, "A Study of Content-aware Classification of POI," 2019-08-08.
[20] 多元分類 [Online]. Available: https://zh.wikipedia.org/wiki/多元分類
[21] 多標籤分類 [Online]. Available: http://blog.csdn.net/bemachine/article/details/10471383
[22] Shibiao Wan, "Adaptive thresholding for multi-label SVM classification with application to protein subcellular localization prediction," Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, 2013, pp. 3547-3551.
[23] Jianqing Zhu, "Multi-label CNN based pedestrian attribute learning for soft biometrics," Biometrics (ICB), 2015 International Conference on, 2015, pp. 535-540.
[24] LI Li-shuang, HUANG De-gen, CHEN Chun-rong, YANG Yuan-sheng, "Identifying chinese place names based on support vector machines and rules," Industrial Mechatronics and Automation (ICIMA), 2010 2nd International Conference on, vol. 20, no. 5, pp. 53, 2006-10-16 2006.
[25] TF-IDF [Online]. Available: https://zh.wikipedia.org/wiki/Tf-idf
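Full-text access rights	On campus: print thesis publicly available on campus immediately; electronic full text authorized for on-campus access; electronic thesis publicly available on campus immediately. Off campus: electronic thesis authorized for off-campus access and publicly available immediately.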

