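Electronic Theses and Dissertations Service

§ Browse thesis bibliographic record

The electronic full text of this thesis is available off campus from 2015-07-31.
The print copy of this thesis is available from 2015-07-31.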

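System ID	U0002-1307201214014900
DOI	10.6846/TKU.2012.00488
Thesis title (Chinese)	長句斷詞法和遺傳演算法對新聞分類的影響
Thesis title (English)	The Effects of Long Sentence Segmentation and Genetic Algorithm on News Classification
Thesis title (third language)
University	Tamkang University (淡江大學)
Department (Chinese)	資訊工程學系碩士班
Department (English)	Department of Computer Science and Information Engineering
Foreign degree: university
Foreign degree: college
Foreign degree: graduate institute
Academic year	100 (ROC calendar)
Semester	2
Year of publication	101 (ROC calendar)
Author (Chinese)	許桓瑜
Author (English)	Huang-Yu Hsu
Student ID	699410675
Degree	Master's
Language	Traditional Chinese
Second language	English
Oral defense date	2012-07-02
Number of pages	62
Oral defense committee	Advisor: 蔡憶佳; Committee member: 蔡憶佳; Committee member: 陳伯榮; Committee member: 林慶昌
Keywords (Chinese)	中文斷詞; 長詞優先法; 遺傳演算法; 單純貝氏分類法
Keywords (English)	Chinese word segmentation; Maximum Matching; Genetic Algorithm; Naive Bayesian classification
Keywords (third language)
Subject classification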
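Abstract (translated from Chinese)	The Internet has developed rapidly in recent years and has become an important part of many people's lives. Its convenience, immediacy, and global reach let users receive news quickly, so many news websites have been established, but their classification schemes are not unified. Helping users find the news they want efficiently is a problem that needs solving; because good news classification depends on Chinese word segmentation, accurate Chinese word segmentation is itself an important topic. This thesis addresses two systems. The first is a Chinese word segmentation system: we first build a lexicon from training articles, then use the lexicon together with maximum matching (longest word first) and a genetic algorithm to cut the long sentences of test news articles into short sentences, and analyze and compare the content to find the best segmentation of the full text. The second is an article classification system: for segmented news articles, we match the segmented words against the lexicon to find the categories in which each word frequently appears, then use Naive Bayes classification to determine the most likely category of the article. We also propose examining whether new-word detection helps segmentation and classification: the A-priori and adjacent characters algorithm finds unknown or new words, the new words are added to the lexicon, and the expanded lexicon is used to segment and classify subsequent articles. The experiments show that when long sentences are cut into short sentences before the genetic algorithm performs word segmentation, precision and recall are 1-2% higher than for the group without sentence shortening; once the genetic algorithm is used, precision and recall reach about 80% whether or not the sentences were shortened first. With segmentation this accurate, Naive Bayes classification of the news articles reaches a classification accuracy of up to 95%. Finally, we suggest as future work that adding new words to the lexicon may further improve the accuracy of news article segmentation and classification.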
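As an illustration of the maximum matching (longest word first) idea named in the abstract, here is a minimal sketch of forward maximum matching in Python. It is not the thesis's implementation: the function name, the `max_word_len` parameter, and the toy lexicon are assumptions made for this example, and the sketch only shows the basic greedy lookup, not how the thesis uses it to shorten long sentences before the genetic algorithm runs.

```python
def forward_maximum_matching(text, lexicon, max_word_len=4):
    """Greedy longest-match-first segmentation, scanning left to right.

    `lexicon` is a set of known words; `max_word_len` caps the candidate
    length. Both are illustrative assumptions, not values from the thesis.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as a fallback.
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Toy lexicon (hypothetical entries, for illustration only).
lexicon = {"遺傳", "演算法", "遺傳演算法", "新聞", "分類"}
segmented = forward_maximum_matching("遺傳演算法新聞分類", lexicon, max_word_len=5)
print(segmented)
```

Because the longest candidate is tried first, the five-character entry is chosen over its two shorter parts; characters with no lexicon match fall through as single-character tokens.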
Abstract (English)	The fast-growing Internet has become an important part of our lives. Its convenience, immediacy, and global reach let users receive news promptly, which has also encouraged the creation of news websites. However, there is no standard way of classifying news across these sites. Making news search more accurate and efficient is therefore a major problem to be solved. Good news classification also depends on the quality of word segmentation, so an appropriate Chinese word segmentation system is important. This thesis focuses on two systems. The first is a Chinese word segmentation system. We use training articles to build a vocabulary database, which two algorithms, the Maximum Matching Algorithm and a Genetic Algorithm, use to split long sentences into short sentences during content analysis. The second is a news classification system. After word segmentation, we compare the segmented words with the vocabulary database and use Naive Bayesian Classification to determine which category the article most likely belongs to. In addition, we adopt the A-Priori and Adjacent Characters Algorithm to identify unknown or new words. The detected new words are added to the database, and the expanded database is used to repeat the tasks and see whether word segmentation and news classification change. Splitting long sentences into short sentences increases the precision and recall of Genetic Algorithm word segmentation. Furthermore, news classification is fairly accurate when the word segmentation is appropriate. Adding new words to the database may further improve the accuracy of both word segmentation and news classification.
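The classification step described in the abstract can be sketched as a multinomial Naive Bayes classifier over segmented words. This is a minimal sketch under stated assumptions, not the thesis's code: the function names, the add-one (Laplace) smoothing, the category labels, and the toy training data are all hypothetical choices for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Count documents and words per category.

    `labeled_docs` is an iterable of (category, list_of_words) pairs.
    """
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for category, words in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(words)
        vocabulary.update(words)
    return doc_counts, word_counts, vocabulary


def classify(words, model):
    """Return the category with the highest log posterior for `words`."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        # log P(category) + sum of log P(word | category), add-one smoothed.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in words:
            count = word_counts[category][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Toy, already-segmented training articles (hypothetical data).
training = [
    ("sports", ["球隊", "比賽", "冠軍"]),
    ("finance", ["股市", "利率", "投資"]),
    ("sports", ["比賽", "球員"]),
]
model = train_naive_bayes(training)
print(classify(["比賽", "球員"], model))
print(classify(["股市", "投資"], model))
```

Working in log space avoids underflow when articles contain many words, and the smoothing keeps a word unseen in one category from forcing that category's probability to zero.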
Abstract (third language)
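Table of contents
Contents
  Chapter 1 Introduction
    1.1 Research Background
    1.2 Research Objectives
    1.3 Thesis Organization
  Chapter 2 Related Work
    2.1 Document Classification Methods
      2.1.1 Naive Bayes Classification
      2.1.2 K-Nearest Neighbors
      2.1.3 Genetic Algorithm
      2.1.4 Decision Tree
    2.2 Word Segmentation Methods
      2.2.1 Dictionary-Based Segmentation
      2.2.2 Statistical Segmentation
      2.2.3 Genetic Algorithm
      2.2.4 MMSeg
  Chapter 3 Research Method
    3.1 Research Method
    3.2 Procedure
      3.2.1 News Collection
      3.2.2 Building the Lexicon
      3.2.3 Sentence Splitting
      3.2.4 Shortening Sentences
      3.2.5 Word Segmentation
      3.2.6 Article Classification
      3.2.7 Analysis of Results
      3.2.8 New-Word Detection
  Chapter 4 Proposed Method
    4.1 Building the Lexicon
      4.1.1 N-gram Segmentation
    4.2 Article Segmentation
      4.2.1 Splitting Sentences at Non-Chinese Characters
      4.2.2 Cutting Long Sentences into Short Sentences with Maximum Matching
      4.2.3 Genetic-Algorithm-Based Word Segmentation
    4.3 Article Classification Method
      4.3.1 Naive Bayes Classification
    4.4 New-Word Detection
      4.4.1 A-priori with adjacent character algorithm
  Chapter 5 Experimental Results and Analysis
    5.1 Experimental Data and Environment
    5.2 Experimental Results
    5.3 Analysis and Discussion
  Chapter 6 Conclusions and Future Directions
  References
  Appendix: Paper in English
List of figures
  Figure 1 Example decision tree for a classification model
  Figure 2 Flowchart of the training module
  Figure 3 N-gram segmentation
  Figure 4 Example content of a news web page
  Figure 5 News article content after sentence splitting at non-Chinese characters
  Figure 6 Sentence length statistics (in characters) for Chinese articles
  Figure 7 Forward and backward maximum matching
  Figure 8 Using maximum matching to help cut long sentences into short sentences
  Figure 9 Chromosome encoding in the genetic algorithm
  Figure 10 Concrete example of chromosome encoding in the genetic algorithm
  Figure 11 Chromosome crossover in the genetic algorithm
  Figure 12 Chromosome mutation in the genetic algorithm
  Figure 13 The A-priori with adjacent character algorithm and the ACP and FACP concepts
  Figure 14 Segmentation results using two-, three-, and four-character words
  Figure 15 Original article and the initial sentence splitting at non-Chinese characters
  Figure 16 Unprocessed sentence-splitting result and the result after shortening with maximum matching
  Figure 17 Segmentation result of a test article after the genetic algorithm
List of tables
  Table 1 Training samples
  Table 2 Cumulative sentence lengths (in characters) for Chinese articles
  Table 3 Distribution of sentence lengths in characters
  Table 4 Word counts of the two-, three-, and four-character lexicons built by n-gram for each news category
  Table 5 Word recall after genetic algorithm segmentation
  Table 6 Word recall comparison between genetic algorithm and MMSeg segmentation
  Table 7 Word recall comparison between genetic algorithm and MMSeg segmentation combined with the "dictionary lexicon"
  Table 8 Classification results of test articles using Naive Bayes classification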
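Full-text usage rights	On campus: the print thesis is made public 3 years after the authorization form is submitted; the author agrees to make the electronic full text available on campus; the on-campus electronic thesis is made public 3 years after the authorization form is submitted. Off campus: the author agrees to license the thesis to database vendors; the off-campus electronic thesis is made public 3 years after the authorization form is submitted.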


If you have any questions, please contact us.
Library Digital Information Section, (02)2621-5656 ext. 2487, or by email