System ID | U0002-2107201114300300 |
---|---|
DOI | 10.6846/TKU.2011.00777 |
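Title (Chinese) | 適合高中生閱讀學習之英文文章推薦機制設計 |
Title (English) | The Design of English Article Recommendation Mechanism for Senior High School Students |
Title (third language) | |
Institution | Tamkang University (淡江大學) |
Department (Chinese) | 資訊工程學系碩士在職專班 |
Department (English) | Department of Computer Science and Information Engineering |
Foreign degree: university | |
Foreign degree: college | |
Foreign degree: graduate institute | |
Academic year (ROC calendar) | 99 |
Semester | 2 |
Publication year (ROC calendar) | 100 |
Author (Chinese) | 張俐旋 |
Author (English) | Li-Hsuan Chang |
Student ID | 798410089 |
Degree | Master's |
Language | Traditional Chinese |
Second language | English |
Oral defense date | 2011-07-21 |
Pages | 75 |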
Oral defense committee |
Advisor - 郭經華
Member - 陳孟彰; Member - 楊接期; Member - 陳建彰; Member - 郭經華 |
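Keywords (Chinese, translated) |
General English Proficiency Test (GEPT); Intelligent Web-based Interactive Language Learning (IWiLL) community; English news; k-nearest neighbor; naive Bayes classifier; support vector machine; features; smoothed model; cosine similarity; evaluation; confusion matrix; classification accuracy; F-measure; Brier score |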
Keywords (English) |
GEPT; IWiLL; Web News; KNN; Naive Bayes; SVM; Features; The Smoothed Unigram Model; Cosine similarity; Evaluation; Confusion Matrix; Classification accuracy; F-measure; Brier score |
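Keywords (third language) | |
Subject classification | |
Abstract (Chinese, translated) |
With globalization, learning English has become increasingly important, and in recent years many learners have turned to reading English articles to improve their English. For the average learner, however, it is not easy to choose an English article that is both interesting and of suitable difficulty. The purpose of this study is to build a recommendation mechanism: given an English news article as input, the mechanism judges whether the article is suitable for the user to read. The target users are senior high school students in Taiwan. The corpora used are the six-level vocabulary list of the General English Proficiency Test (GEPT), articles written by senior high school students in the Intelligent Web-based Interactive Language Learning (IWiLL) community, texts from senior high school English textbooks (SHSETs), and English news articles collected from the web (Web News). To find English articles suitable for senior high school students, we first compute feature values from the corpora, then classify English articles according to those features, and finally evaluate the feature selection methods and the classifiers to determine whether they classify English articles correctly. This study computes document features with two methods, the Smoothed Unigram Model and Cosine Similarity; classifies articles with three methods, Naive Bayes, k-nearest neighbor (KNN), and support vector machine (SVM); and evaluates performance with three measures, classification accuracy (the proportion of correctly classified articles), F-measure, and Brier score. Finally, we use a confusion matrix to present the classification accuracy. |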
Abstract (English) |
As a consequence of globalization, the English language has received growing attention worldwide, especially in non-English-speaking countries. For most English learners, reading English articles is a good way to improve proficiency. However, selecting interesting English articles of an appropriate difficulty level is not a trivial task. The purpose of this work is to devise a mechanism for selecting appropriate English articles. The proposed mechanism indicates whether a particular article, e.g., an English news article, is of adequate difficulty for its users. This research specifically targets senior high school students. Four corpora are used in this work: the six-level GEPT vocabulary list, articles written by senior high school students on Intelligent Web-based Interactive Language Learning (IWiLL), senior high school English textbooks (SHSETs), and Web News collected from the Internet. To determine the difficulty level of a given English article, we first obtain its document feature, which is then used as the only characteristic for classifying the article. Two approaches, the Smoothed Unigram Model and Cosine Similarity, are used to compute the document feature. We consider three classification methods: Naive Bayes, K-Nearest Neighbor (KNN), and Support Vector Machine (SVM). For performance evaluation, classification accuracy, F-measure, and Brier score are computed, and the classification results are presented in a confusion matrix. Finally, we analyze the evaluation results across the different combinations of document feature methods and classifiers to assess whether the proposed mechanism can select adequate English articles for senior high school students. |
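Abstract (third language) | |
Table of contents |
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Motivation and Objectives
1.2 Research Scope
1.3 Thesis Outline
Chapter 2 Related Work
2.1 Related Work on Recommendation Mechanisms
2.2 Classification
2.2.1 K-nearest neighbor
2.2.2 Naive Bayesian classifier
2.2.3 SVM
2.3 Orange
Chapter 3 Background
3.1 Corpora
3.1.1 GEPT
3.1.2 IWiLL
3.1.3 SHSETs
3.1.4 Web News
3.2 Data Input
3.3 Preprocessing
3.3.1 POS Tagging
3.3.2 Lemmatizing
3.4 Document Features
3.4.1 The Smoothed Unigram Model
3.5 Evaluation Results
3.5.1 Classification accuracy
3.5.2 F-measure
3.5.3 Brier score
3.6 Using Orange
Chapter 4 Results and Analysis
4.1 Recommendation Mechanism Architecture
4.2 Similarity Analysis
4.3 Results
4.3.1 Evaluation Results
4.3.2 Confusion Matrix
4.3.3 Analysis of Classification Results for the Two Document Features
4.3.4 Analysis of Classification Results for the Three Classifiers
Chapter 5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
References
Appendix: English Paper
List of Figures
Figure 1-1 Google News Search
Figure 1-2 Corpora
Figure 2-1 Illustration of classification
Figure 2-2 Illustration of KNN
Figure 2-3 1-nearest neighbor (k=1)
Figure 2-4 2-nearest neighbor (k=2)
Figure 2-5 Illustration of SVM
Figure 2-6 Screenshot of the Orange home page
Figure 2-7 Orange on first launch
Figure 3-1 Distribution of GEPT test takers by age group, ROC year 99 (2010)
Figure 3-2 Screenshot of the IWiLL home page
Figure 3-3 Screenshot of the Web News (CNN) web page
Figure 3-4 Screenshot of a Web News (CNN) news article
Figure 3-5 Preprocessing
Figure 3-6 Preprocessing workflow for corpus articles
Figure 3-7 Differences in training-set articles after POS tagging and lemmatization
Figure 3-8 Document feature computation workflow
Figure 3-9 The Smoothed Unigram Model
Figure 3-10 Evaluation Results
Figure 3-11 Importing data for statistical analysis
Figure 3-12 Viewing the imported data
Figure 3-13 Performance evaluation tools
Figure 3-14 The three classification methods
Figure 3-15 Performance evaluation of the classification methods
Figure 4-1 Recommendation mechanism architecture diagram
Figure 4-2 Recommendation mechanism flowchart
Figure 4-3 Similarity Analysis
Figure 4-4 Illustration of the Euclidean dot product formula
Figure 4-5 Schematic of Similarity Analysis
Figure 4-6 Document Recommendation Agent
Figure 4-7 Analysis of classification results for the two document features
Figure 4-8 Analysis of classification results for the two document features
Figure 4-9 Analysis of classification results for the three classifiers
Figure 4-10 Classification results of the three English article classifiers
List of Tables
Table 3-1 Screenshot of the senior high school English reference word list revised by the College Entrance Examination Center
Table 3-2 GEPT six-level vocabulary list (excerpt of level 1 and level 6 words)
Table 3-3 Part-of-speech tag table
Table 3-4 Proportion of GEPT vocabulary levels in each article before lemmatizing
Table 3-5 Proportion of GEPT vocabulary levels in each article after lemmatizing
Table 3-6 Example of the Smoothed Unigram Model
Table 3-7 Feature values for senior high school English textbook texts
Table 4-1 Feature values
Table 4-2 Summary of classifier accuracy evaluation
Table 4-3 Similarity Analysis - Naive Bayes
Table 4-4 Similarity Analysis - KNN
Table 4-5 Similarity Analysis - SVM
Table 4-6 The Smoothed Unigram Model - Naive Bayes
Table 4-7 The Smoothed Unigram Model - KNN
Table 4-8 The Smoothed Unigram Model - SVM |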
Full-text access rights |
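
The abstract names cosine similarity as one of the two ways of computing a document feature. This record does not give the thesis's exact formulation, so the following is only a minimal sketch of the general technique, assuming plain term-frequency vectors and a naive whitespace tokenizer. The function names and sample sentences are hypothetical and are not taken from the thesis or its corpora.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Lowercase, split on whitespace, and count tokens (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical texts for illustration only.
reference = term_frequencies("the student reads the news every day")
article = term_frequencies("the student reads a story every night")
score = cosine_similarity(article, reference)
```

A score near 1 means the article's word distribution is close to that of the reference text, and a score of 0 means the two share no words. A real pipeline would presumably apply the POS tagging and lemmatizing steps listed in the table of contents before counting terms.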
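
The abstract lists classification accuracy, F-measure, and Brier score as the evaluation measures, with results summarized in a confusion matrix. As a rough sketch of how these quantities relate, assuming a two-class setup (1 = suitable, 0 = not suitable) and made-up labels and probabilities, they can be computed as follows. The function names are hypothetical, and the thesis itself reports using the Orange toolkit rather than hand-written code.

```python
def confusion_matrix(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels, where 1 means 'suitable'."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Proportion of all examples that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def f_measure(tp, fp, fn):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def brier_score(actual, probabilities):
    """Mean squared difference between the predicted probability and the 0/1 outcome."""
    return sum((p - a) ** 2 for a, p in zip(actual, probabilities)) / len(actual)


# Made-up labels and predicted probabilities of the 'suitable' class.
actual = [1, 1, 1, 0, 0, 0]
probabilities = [0.9, 0.6, 0.4, 0.2, 0.7, 0.1]
predicted = [1 if p >= 0.5 else 0 for p in probabilities]

tp, fp, fn, tn = confusion_matrix(actual, predicted)
acc = accuracy(tp, fp, fn, tn)
f1 = f_measure(tp, fp, fn)
brier = brier_score(actual, probabilities)
```

Accuracy and F-measure depend only on the hard predictions, while the Brier score also penalizes overconfident probabilities. The single-class form shown here is one common convention; Brier's original formulation sums over all classes, which doubles the value in the two-class case.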