Tamkang University Chueh Sheng Memorial Library (TKU Library)
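System ID U0002-2107201114300300
Title (Chinese) 適合高中生閱讀學習之英文文章推薦機制設計
Title (English) The Design of English Article Recommendation Mechanism for Senior High School Students
Institution Tamkang University
Department (Chinese) 資訊工程學系碩士在職專班
Department (English) Department of Computer Science and Information Engineering
Academic year 99 (ROC calendar)
Semester 2
Publication year 100 (ROC calendar; 2011)
Student name (Chinese) 張俐旋
Student name (English) Li-Hsuan Chang
Student ID 798410089
Degree Master's
Language Chinese
Second language English
Oral defense date 2011-07-21
Pages 75
Committee Advisor-郭經華
Member-陳孟彰
Member-楊接期
Member-陳建彰
Member-郭經華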
Keywords (Chinese) 全民英文能力分級檢定測驗  智慧型互動式網路語言學習社群  英文新聞  第k位最接近的鄰居  貝式分類法  支援向量機  特徵值  平滑模型  餘弦相似度  評估  混淆矩陣  分類準確性  F-測量  Brier得分測量精度概率
Keywords (English) GEPT  IWiLL  Web News  KNN  Naive Bayes  SVM  Features  The Smoothed Unigram Model  Cosine similarity  Evaluation  Confusion Matrix  Classification accuracy  F-measure  Brier score
Subject classification Applied Sciences, Information Engineering
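Chinese abstract (English translation)  With globalization, learning English has become increasingly important, and in recent years many learners have turned to reading English articles to improve their English. For the typical learner, however, choosing an article that is both interesting and of suitable difficulty is not easy.
  The purpose of this study is to build a recommendation mechanism: given an English news article as input, the mechanism determines whether the article is suitable for the user to read. The target users are senior high school students in Taiwan. The corpora used are the six-level vocabulary list of the General English Proficiency Test (GEPT), articles written by senior high school students in the Intelligent Web-based Interactive Language Learning (IWiLL) community, senior high school English textbook texts (SHSETs), and English news articles collected from the web (Web News).
  To find English articles suitable for senior high school students, we first compute feature values for the corpora, then classify English articles according to those features, and finally evaluate how well the chosen feature extraction methods and classifiers perform, that is, whether they classify English articles correctly. This study uses two methods to compute document features, the Smoothed Unigram Model and Cosine Similarity; three classifiers, Naive Bayes, k-nearest neighbors (KNN), and the support vector machine (SVM); and three evaluation measures, classification accuracy (the proportion of correctly classified articles), F-measure, and Brier score. Finally, a confusion matrix is used to present the classification accuracy.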
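The two document features named in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the record does not state which smoothing the Smoothed Unigram Model uses, so additive (Laplace) smoothing is assumed here, and the function names, toy corpora, and sample article are all hypothetical.

```python
import math
from collections import Counter


def smoothed_unigram_score(tokens, corpus_counts, vocab_size, alpha=1.0):
    """Mean log-probability of `tokens` under a unigram model of one corpus.

    Additive smoothing with constant `alpha` is an assumption of this sketch;
    it keeps words unseen in the corpus from getting zero probability.
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    denom = sum(corpus_counts.values()) + alpha * vocab_size
    log_prob = sum(
        math.log((corpus_counts.get(t, 0) + alpha) / denom) for t in tokens
    )
    return log_prob / len(tokens)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpora standing in for SHSETs and Web News.
textbook_tf = Counter("the cat sat on the mat".split())
news_tf = Counter("the government announced fiscal policy".split())
vocab_size = len(set(textbook_tf) | set(news_tf))

# Hypothetical input article, already tokenized and lowercased.
article = "the cat sat".split()
article_tf = Counter(article)

# One feature per (method, corpus) pair; a classifier would consume these.
features = {
    "unigram_textbook": smoothed_unigram_score(article, textbook_tf, vocab_size),
    "unigram_news": smoothed_unigram_score(article, news_tf, vocab_size),
    "cosine_textbook": cosine_similarity(article_tf, textbook_tf),
    "cosine_news": cosine_similarity(article_tf, news_tf),
}
print(features)
```

In this toy setup the sample article scores higher against the textbook corpus than against the news corpus under both features, which is the kind of signal a downstream classifier such as Naive Bayes, KNN, or SVM could use.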
English abstract  As a consequence of globalization, English has been receiving more and more attention around the world, especially in non-English-speaking countries. For most learners, reading English articles is an effective way to improve proficiency. However, selecting interesting English articles of an appropriate difficulty level is not trivial. The purpose of this work is to devise a mechanism for selecting appropriate English articles. The proposed mechanism indicates whether a particular article, e.g., an English news article, is of suitable difficulty for the user. This research specifically targets senior high school students. Four corpora are used in this work: the six-level GEPT vocabulary list, articles from Intelligent Web-based Interactive Language Learning (IWiLL), senior high school English textbooks (SHSETs), and Web News collected from the Internet.
 To determine the difficulty level of a given English article, we first obtain its document features, which are then used as the sole basis for classifying the article. Two approaches, the Smoothed Unigram Model and Cosine Similarity, are used to compute the document features. We consider three classification methods: Naive Bayes, K-Nearest Neighbor (KNN), and Support Vector Machine (SVM). For performance evaluation, classification accuracy, F-measure, and Brier score are computed, and the classification results are presented as confusion matrices. Finally, we analyze the evaluation results over the different combinations of feature extraction methods and classifiers to assess whether the proposed mechanism can select suitable English articles for senior high school students.
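As a rough illustration of the three evaluation measures listed in the abstract, the sketch below derives classification accuracy and F-measure from a binary confusion matrix and computes a Brier score from predicted probabilities. It assumes a two-class setting (suitable = 1, not suitable = 0) and the binary form of the Brier score; the multi-class form sums the squared error over all classes. The labels, probabilities, and function names are hypothetical and do not come from the thesis.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_accuracy(tp, fp, fn, tn):
    """Proportion of all instances that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def f_measure(tp, fp, fn, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def brier_score(y_true, probabilities):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, probabilities)) / len(y_true)


# Hypothetical gold labels and predicted probabilities of "suitable".
y_true = [1, 1, 0, 0, 1]
probabilities = [0.9, 0.6, 0.4, 0.2, 0.3]
y_pred = [1 if p >= 0.5 else 0 for p in probabilities]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print("confusion (tp, fp, fn, tn):", (tp, fp, fn, tn))
print("accuracy:", classification_accuracy(tp, fp, fn, tn))
print("F-measure:", f_measure(tp, fp, fn))
print("Brier score:", brier_score(y_true, probabilities))
```

Accuracy and F-measure depend only on the hard predictions, while the Brier score also penalizes poorly calibrated probabilities, so the measures can rank classifiers differently.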
Table of contents List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Motivation and Objectives
1.2 Research Content
1.3 Thesis Outline
Chapter 2 Related Work
2.1 Related Work on Recommendation Mechanisms
2.2 Classification
2.2.1 K-nearest neighbor
2.2.2 Naive Bayesian classifier
2.2.3 SVM
2.3 Orange
Chapter 3 Background
3.1 Corpora
3.1.1 GEPT
3.1.2 IWiLL
3.1.3 SHSETs
3.1.4 Web News
3.2 Data Input
3.3 Preprocessing
3.3.1 POS Tagging
3.3.2 Lemmatizing
3.4 Document Features
3.4.1 The Smoothed Unigram Model
3.5 Evaluation Results
3.5.1 Classification accuracy
3.5.2 F-measure
3.5.3 Brier score
3.6 Using Orange
Chapter 4 Results and Analysis
4.1 Recommendation Mechanism Architecture
4.2 Similarity Analysis
4.3 Results
4.3.1 Evaluation Results
4.3.2 Confusion Matrix
4.3.3 Analysis of Classification Results for the Two Document Features
4.3.4 Analysis of Classification Results for the Three Classifiers
Chapter 5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
References
Appendix: English Paper

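List of Figures
Figure 1-1 Google News Search
Figure 1-2 Corpora
Figure 2-1 Illustration of classification
Figure 2-2 Illustration of KNN
Figure 2-3 1-nearest neighbor (k=1)
Figure 2-4 2-nearest neighbor (k=2)
Figure 2-5 Illustration of SVM
Figure 2-6 Screenshot of the Orange home page
Figure 2-7 Orange on first launch
Figure 3-1 Proportion of GEPT test takers by age group, 2010
Figure 3-2 Screenshot of the IWiLL home page
Figure 3-3 Screenshot of a Web News (CNN) web page
Figure 3-4 Screenshot of a Web News (CNN) news article
Figure 3-5 Preprocessing
Figure 3-6 Preprocessing workflow for corpus articles
Figure 3-7 Training-corpus articles before and after POS tagging and lemmatizing
Figure 3-8 Document feature computation workflow
Figure 3-9 The Smoothed Unigram Model
Figure 3-10 Evaluation Results
Figure 3-11 Importing data for statistical analysis
Figure 3-12 Viewing the imported data
Figure 3-13 Performance evaluation tools
Figure 3-14 The three classification methods
Figure 3-15 Performance evaluation of the classification methods
Figure 4-1 Recommendation mechanism architecture
Figure 4-2 Recommendation mechanism flowchart
Figure 4-3 Similarity Analysis
Figure 4-4 Illustration of the Euclidean dot product formula
Figure 4-5 Similarity Analysis diagram
Figure 4-6 Document Recommendation Agent
Figure 4-7 Analysis of classification results for the two document features
Figure 4-8 Analysis of classification results for the two document features
Figure 4-9 Analysis of classification results for the three classifiers
Figure 4-10 Classification results of the three classifiers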

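List of Tables
Table 3-1 Screenshot of the senior high school English reference word list revised by the College Entrance Examination Center
Table 3-2 GEPT six-level vocabulary list (excerpt of level 1 and level 6 words)
Table 3-3 Part-of-speech tag set
Table 3-4 Proportion of GEPT vocabulary levels in each article before lemmatizing
Table 3-5 Proportion of GEPT vocabulary levels in each article after lemmatizing
Table 3-6 The Smoothed Unigram Model example
Table 3-7 Feature values of the senior high school English textbook texts
Table 4-1 Feature values
Table 4-2 Summary of classifier accuracy evaluation
Table 4-3 Similarity Analysis-Naive Bayes
Table 4-4 Similarity Analysis-KNN
Table 4-5 Similarity Analysis-SVM
Table 4-6 The Smoothed Unigram Model-Naive Bayes
Table 4-7 The Smoothed Unigram Model-KNN
Table 4-8 The Smoothed Unigram Model-SVM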
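Thesis usage rights
  • The author grants a royalty-free license for library patrons to reproduce the print copy for academic purposes, public from 2014-08-04.
  • The author authorizes browsing and printing of the electronic full text, available from 2014-08-04.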

