Electronic Theses and Dissertations Service

§ Browse Thesis Bibliographic Record

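The electronic full text of this thesis has been publicly available off campus since 2019-07-30.
The print copy of this thesis has been publicly available since 2019-07-30.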

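System ID	U0002-2907201913250700
DOI	10.6846/TKU.2019.00982
Thesis title (Chinese)	在PTT平台上比較以分群為主的議題偵測方法
Thesis title (English)	Comparing the clustering-based topic detection methods on the PTT platform
Thesis title (third language)
University	Tamkang University (淡江大學)
Department (Chinese)	Department of Statistics, Master's Program in Applied Statistics (統計學系應用統計學碩士班)
Department (English)	Department of Statistics
Foreign degree: university
Foreign degree: college
Foreign degree: graduate institute
Academic year	107 (ROC calendar; 2018-2019)
Semester	2
Year of publication	108 (ROC calendar; 2019)
Graduate student (Chinese)	陳柏瑋
Graduate student (English)	Bo-Wei Chen
Student ID	606650231
Degree	Master's
Language	Traditional Chinese
Second language
Oral defense date	2019-07-03
Number of pages	61
Oral defense committee	Advisor: 陳景祥 (steve@stat.tku.edu.tw); Co-advisor: 吳淑妃 (100665@mail.tku.edu.tw); Member: 李百靈; Member: 何宗武
Keywords (Chinese)	集群分析; 文字探勘; 局部敏感散列; 大數據資料分析
Keywords (English)	cluster analysis; text mining; locality-sensitive hashing; big data analysis
Keywords (third language)
Subject classification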
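Abstract (translated from Chinese)	A growing number of social networking platforms on the internet host user discussion. Posts on these platforms often concern recent events and express the authors' own opinions, and these opinions carry much important information. Given event topics that are updated every day and the large volume of data they generate, supervised classification does achieve a certain level of effectiveness, but it consumes a great deal of time, labor, and other resources. This thesis uses cluster analysis, an unsupervised learning approach, and compares three clustering methods; for k-means and k-modes clustering, optimization measures are adopted to address the problem of excessively high similarity between articles. The thesis also combines the idea behind the minhash + LSH search method with clustering techniques to cluster PTT Gossiping board articles efficiently, evaluates the result against the other clustering methods as a whole, and finally performs topic detection and discussion.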
Abstract (English)	Recently, more and more social networking platforms on the internet have offered spaces for user discussion. Articles on these platforms often concern recent events and express the authors' own opinions, and these opinions provide much important information. For constantly updated event topics and large volumes of streaming data, supervised learning methods for classifying articles do achieve certain results, but they take a lot of time and human resources. In this thesis, we adopt cluster analysis, an unsupervised learning approach, to compare three clustering methods; for the k-means and k-modes methods, we adopt optimization measures to address the problem of high similarity between articles. We also combine the concept of the minhash + LSH search method with clustering techniques to group PTT Gossiping board articles efficiently, evaluate the result against the other clustering methods overall, and finally conduct topic detection and discussion.
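The idea of combining minhash + LSH with clustering, as described in the abstract, can be illustrated with the short Python sketch below. It is not the thesis's own R implementation. The shingle size, the 240 hash functions split into 80 bands of 3 rows, the 0.5 similarity threshold, the union-find grouping step, and all function names are assumptions chosen only for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(tokens, k=2):
    """Return the set of k-token shingles of one segmented article."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(items, num_hashes=240):
    """MinHash signature: for each seeded hash, keep the minimum over the set.

    Assumes `items` is non-empty; md5 with a seed prefix stands in for a
    family of independent hash functions (an illustrative choice).
    """
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.md5(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big")
            for item in items
        ))
    return signature


def lsh_candidate_pairs(signatures, bands=80, rows=3):
    """Documents sharing an identical band land in one bucket and become candidates."""
    pairs = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in enumerate(signatures):
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(doc_id)
        for ids in buckets.values():
            pairs.update(combinations(ids, 2))
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """Share of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def cluster(signatures, threshold=0.5, bands=80, rows=3):
    """Merge candidate pairs above the threshold into groups (union-find)."""
    parent = list(range(len(signatures)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in lsh_candidate_pairs(signatures, bands, rows):
        if estimated_jaccard(signatures[i], signatures[j]) >= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(signatures))]
```

Calling `cluster([minhash_signature(shingles(doc)) for doc in docs])` on a list of token lists returns one group label per article. Only pairs that collide in at least one band are compared, which is what avoids computing all pairwise similarities.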
Abstract (third language)
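Table of contents
Contents
List of Tables
List of Figures
Chapter 1 Introduction
  1.1 Research Background
  1.2 Research Motivation and Objectives
  1.3 Thesis Structure
Chapter 2 Related Work
  2.1 Topic Detection and Tracking
  2.2 Text Mining
    2.2.1 Web Crawling
    2.2.2 Article Cleaning
    2.2.3 Cluster Analysis
  2.3 Locality-Sensitive Hashing
Chapter 3 Research Methods
  3.1 Framework and Workflow
  3.2 Chinese Text Processing
    3.2.1 Word Segmentation Tool: CKIP Web Version
    3.2.2 Word Segmentation Tool: R
  3.3 Term Weight Calculation
  3.4 Clustering Methods
    3.4.1 k-means Clustering
    3.4.2 k-modes Clustering
    3.4.3 LSH Search Clustering
Chapter 4 Case Analysis Results and Performance Evaluation
  4.1 Analysis Environment
  4.2 Data Description
  4.3 Data Processing
    4.3.1 Word Segmentation and Stop Words
    4.3.2 tdm and dtm Matrices
    4.3.3 TF-IDF Weights
  4.4 Cluster Analysis
    4.4.1 Partitional Clustering
    4.4.2 Fast Search Method
    4.4.3 Hierarchical Clustering
  4.5 Cluster Evaluation
    4.5.1 Evaluation by Cluster Indices
    4.5.2 Time Evaluation
    4.5.3 Overall Comparison
  4.6 Detecting Trends from Clusters
    4.6.1 Cluster Plots
    4.6.2 PTT Gossiping Board
    4.6.3 Word Clouds
Chapter 5 Conclusions and Suggestions for Future Work
  5.1 Conclusions
  5.2 Suggestions for Future Work
References

List of Tables
Table 1 Comparison of k-means and k-modes clustering
Table 2 Advantages and disadvantages of k-means and k-modes clustering
Table 3 Partitioning the signature matrix into groups
Table 4 Specifications of the computer used for the analysis in this thesis
Table 5 Clustering results from the R NbClust package (about 16.5 hours in total)
Table 6 Results after k-means clustering
Table 7 Comparison of k-means clusters with Gossiping board attribute data counts
Table 8 Results of optimized k-means_e clustering
Table 9 Comparison of optimized k-means clusters with Gossiping board attribute data counts
Table 10 Results after k-modes clustering
Table 11 Comparison of k-modes clusters with Gossiping board attribute data counts
Table 12 Results of optimized k-modes_h clustering
Table 13 Comparison of optimized k-modes clusters with Gossiping board attribute data counts
Table 14 Results after JH clustering
Table 15 Comparison of JH clusters with Gossiping board attribute data counts
Table 16 Purity scores of each clustering method
Table 17 Rand index scores of each clustering method
Table 18 Comparison of the total time cost of each clustering method

List of Figures
Figure 1 Concept of the LSH method
Figure 2 Example of results after Chinese word segmentation and stop-word removal with the R jiebaR package
Figure 3 A real example of original data and clustered data for k-means
Figure 4 Conceptual diagram of k-means clustering data
Figure 5 Centers of the k initial clusters in k-means
Figure 6 Generating the initial k clusters in k-means (fix μi and solve for the assigned cluster Si)
Figure 7 Generating new centroids in k-means (fix Si and solve for the cluster center μi)
Figure 8 An example of extreme initial value settings in k-means
Figure 9 Clustering data for k-modes
Figure 10 Generating the initial k cluster centers in k-modes
Figure 11 Generating k clusters in k-modes
Figure 12 Schematic of the difference between ordinary hashing and locality-sensitive hashing
Figure 13 Conceptual diagram of Jaccard similarity
Figure 14 Converting 4 original documents into a signature matrix
Figure 15 Finding the minimum hash value in a random signature matrix
Figure 16 Gossiping board articles on the PTT website
Figure 17 Crawling Gossiping board articles into R
Figure 18 Gossiping board article body on the PTT website
Figure 19 Body text of the 5,740 articles crawled with R
Figure 20 Differences between the original text and the text after word segmentation and stop-word removal
Figure 21 Terms with frequency over 100 in the tdm matrix of the original text
Figure 22 Terms with frequency over 35 among the 401 terms after applying TF-IDF
Figure 23 Optimal number of clusters shown by the factoextra package
Figure 24 Results computed by the dindex index of the NbClust package
Figure 25 Comparison of k-means clusters with Gossiping board attribute data proportions
Figure 26 Comparison of optimized k-means_e clusters with Gossiping board attribute data proportions
Figure 27 Comparison of k-modes clusters with Gossiping board attribute data proportions
Figure 28 Comparison of optimized k-modes_h clusters with Gossiping board attribute data proportions
Figure 29 R code example of the 240 hash labels stored for the first article
Figure 30 R code example of placing document 1 into different buckets and recording them by key
Figure 31 R code example comparing the similarity of two documents
Figure 32 R code example of pairwise similarity among the first 10 Gossiping board articles
Figure 33 Comparison of JH clusters with Gossiping board attribute data proportions
Figure 34 Validity comparison of each clustering method across all evaluation indices
Figure 35 Schematic of k-means clustering projected onto a plane
Figure 36 Schematic of optimized k-means clustering projected onto a plane
Figure 37 Schematic of JH clustering projected onto a plane
Figure 38 PTT Gossiping board posting rules (問卦, 爆卦)
Figure 39 PTT Gossiping board posting rules (Facebook, News)
Figure 40 R code example of computing term weights for the articles
Figure 41 Important terms in this dataset
Figure 42 Filtering out the more important terms in this dataset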
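The list of tables names two external evaluation measures, purity (Table 16) and the Rand index (Table 17). A minimal Python sketch of how such scores are commonly computed follows. It assumes each article has a reference label, for example a Gossiping board post category, and a cluster label; the function names and that choice of reference labels are assumptions, not details confirmed by this record.

```python
from collections import Counter
from itertools import combinations


def purity(true_labels, cluster_labels):
    """Fraction of items that belong to the majority reference class of their cluster."""
    by_cluster = {}
    for truth, cluster_id in zip(true_labels, cluster_labels):
        by_cluster.setdefault(cluster_id, Counter())[truth] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(true_labels)


def rand_index(true_labels, cluster_labels):
    """Fraction of item pairs on which both labelings agree (together or apart)."""
    pairs = list(combinations(range(len(true_labels)), 2))
    agreements = sum(
        (true_labels[i] == true_labels[j]) == (cluster_labels[i] == cluster_labels[j])
        for i, j in pairs
    )
    return agreements / len(pairs)
```

Both scores lie between 0 and 1, with higher values meaning closer agreement with the reference labels. Purity tends to increase as the number of clusters grows, which is one reason it is usually reported alongside a pair-based measure such as the Rand index.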
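Full-text access rights	On campus: print thesis available on campus immediately; electronic full text authorized for on-campus access; electronic thesis available on campus immediately. Off campus: authorization granted to database vendors; electronic thesis available off campus immediately.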
