System ID | U0002-3007200901024400 |
---|---|
DOI | 10.6846/TKU.2009.01147 |
Title (Chinese) | 使用混合特徵選取於蛋白質結晶預測 |
Title (English) | Hybrid Feature Selection in Protein Crystallization Prediction |
Title (Third Language) | |
University | Tamkang University (淡江大學) |
Department (Chinese) | 資訊工程學系碩士班 (Master's Program, Department of Computer Science and Information Engineering) |
Department (English) | Department of Computer Science and Information Engineering |
Foreign Degree University | |
Foreign Degree College | |
Foreign Degree Institute | |
Academic Year | 97 (2008-2009) |
Semester | 2 |
Publication Year | 98 (2009) |
Author (Chinese) | 葉致遠 |
Author (English) | Jr-Yuan Yeh |
Student ID | 696410652 |
Degree | Master's |
Language | Traditional Chinese |
Second Language | English |
Defense Date | 2009-07-22 |
Number of Pages | 68 |
Committee | Advisor - 許輝煌 (Hui-Huang Hsu, h_hsu@mail.tku.edu.tw); Member - 白敦文 (Tun-Wen Pai, twp@ntou.edu.tw); Member - 劉昭麟 (Chao-Lin Liu, chaolin@nccu.edu.tw); Member - 許輝煌 (Hui-Huang Hsu, h_hsu@mail.tku.edu.tw) |
Keywords (Chinese) | 支持向量機、適應推進演算法、特徵選取、機器學習、不平衡資料集、蛋白質結晶 |
Keywords (English) | Support Vector Machine, Adaboost, Feature Selection, Machine Learning, Imbalanced Data, Protein Crystallization |
Keywords (Third Language) | |
Subject Classification | |
Abstract (Chinese) |
Proteins are the fundamental building blocks of life. Because a protein's function varies with its structure, determining the three-dimensional structures of protein molecules is a major goal for scientists. Besides computational approaches based on statistical learning theory, structures are in practice usually determined by X-ray diffraction or Nuclear Magnetic Resonance (NMR) experiments. NMR is time-consuming and costly, and may still fail to resolve a structure. If a protein solution can yield crystals, however, X-ray diffraction can be used to analyze them. Since not every protein can crystallize, predicting whether a protein will crystallize is an important problem. We encode various kinds of information provided by protein amino acid sequences (i.e., the primary structure) obtained from the TargetDB protein database, and use two feature selection methods, F-score and Information Gain, to pick out the features that contribute most to predicting protein crystallization. The selected data are then learned with the support vector machine and the Adaboost algorithm, respectively. The support vector machine separates data of different classes in the feature space with a hyperplane to achieve classification; Adaboost repeatedly adjusts the weight of each training instance over a number of learning rounds to reduce the weak learner's error rate, and finally combines these learned weak learners into a strong learner for classification. In our experiments, the prediction accuracy on TargetDB data reaches 93.02%, with a sensitivity (crystallizable data correctly classified as crystallizable) of 95.49% and a specificity (non-crystallizable data correctly classified as non-crystallizable) of 86.08%. The goal of these experiments is to identify the factors that prevent proteins from crystallizing, so that these factors can be mitigated, the proteins crystallized, and X-ray diffraction applied to obtain structural information. |
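The F-score criterion named in the abstract can be sketched as below, following the formulation in Chen and Lin's "Combining SVMs with Various Feature Selection Strategies" (cited in this thesis's reference list): the score of a feature is the ratio of its between-class separation to its within-class variance. The toy data here is illustrative only, not the thesis's actual TargetDB encoding:

```python
import numpy as np

def f_score(X, y):
    """Per-feature F-score: between-class separation over within-class variance."""
    pos, neg = X[y == 1], X[y == 0]
    m, mp, mn = X.mean(0), pos.mean(0), neg.mean(0)
    num = (mp - m) ** 2 + (mn - m) ** 2          # separation of class means
    den = pos.var(0, ddof=1) + neg.var(0, ddof=1)  # pooled sample variances
    return num / den

# Toy data: feature 0 tracks the class label, feature 1 is pure noise.
rng = np.random.default_rng(0)
y = np.array([1] * 50 + [0] * 50)
X = np.column_stack([y + 0.1 * rng.standard_normal(100),
                     rng.standard_normal(100)])
scores = f_score(X, y)
print(scores.argmax())  # 0: the discriminative feature scores highest
```

Features are then ranked by score and the top-ranked subset is kept, which is how a filter-style selector of this kind is typically applied before training.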
Abstract (English) |
Proteins are the major components of organisms. The structure of a protein determines its functions, so finding the structures of proteins is important. Scientists usually use X-ray diffraction or Nuclear Magnetic Resonance (NMR) to determine protein structures. Because NMR is time-consuming and expensive, X-ray diffraction is usually preferred, but it requires that the target protein first be crystallized. Predicting the crystallization state of a target protein is therefore very important. In this thesis, we use the data in TargetDB to generate a data set of features that have significant relationships with protein crystallization. We then apply two feature selection methods to the data set to remove irrelevant or redundant features. After feature selection, we use the support vector machine (SVM) and Adaboost, respectively, to predict whether the proteins can be crystallized, and we compare and discuss the results generated by the two methods. According to our experimental results, Adaboost achieves higher accuracy than SVM on the same data set: its prediction accuracy is 93.02%, with a sensitivity (crystallized data) of 95.49% and a specificity (non-crystallized data) of 86.08%. The purpose of our experiments is to identify the factors that may prevent proteins from crystallizing, so that scientists can improve protein crystallization and X-ray diffraction can then be applied to determine the structures of these proteins. |
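The SVM-versus-Adaboost comparison described above can be sketched with scikit-learn; this is a stand-in illustration on synthetic imbalanced data (the class ratio, features, and parameters are assumptions, not the thesis's encoded TargetDB set, and scikit-learn's default decision-stump weak learners stand in for the thesis's CART-based weak learners):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic imbalanced binary data: majority class plays the role of
# "crystallizable", minority the role of "non-crystallizable".
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.3, 0.7], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

svm = SVC(kernel='rbf').fit(Xtr, ytr)
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)

for name, clf in [('SVM', svm), ('AdaBoost', ada)]:
    pred = clf.predict(Xte)
    tp = ((pred == 1) & (yte == 1)).sum()
    tn = ((pred == 0) & (yte == 0)).sum()
    sens = tp / (yte == 1).sum()  # positives correctly classified
    spec = tn / (yte == 0).sum()  # negatives correctly classified
    print(f"{name}: acc={accuracy_score(yte, pred):.3f} "
          f"sens={sens:.3f} spec={spec:.3f}")
```

Reporting sensitivity and specificity separately, as the thesis does, matters on imbalanced data because overall accuracy alone can hide poor performance on the minority class.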
Abstract (Third Language) | |
Table of Contents |
Chapter 1 Introduction 1
 1.1 Background 1
 1.2 Motivation 2
 1.3 Thesis Organization 3
Chapter 2 Literature Review 5
 2.1 Principles of Protein Crystallization 5
 2.2 Feature Selection 8
  2.2.1 Filter Methods 9
  2.2.2 Wrapper Methods 10
  2.2.3 Comparison of Filter and Wrapper Methods 11
 2.3 Support Vector Machines 12
  2.3.1 Kernel Functions 15
 2.4 The Adaboost Algorithm 18
  2.4.1 Gentle Adaboost 22
  2.4.2 Modest Adaboost 22
 2.5 Classification and Regression Trees 23
Chapter 3 Feature Selection and Protein Crystallization Prediction Methods 26
 3.1 Choosing the Feature Set 26
 3.2 Data Preprocessing and Encoding 32
 3.3 Feature Selection Methods 38
  3.3.1 F-Score 38
  3.3.2 Information Gain 39
 3.4 Handling Imbalanced Data Sets 40
Chapter 4 System Architecture and Experimental Results 45
 4.1 System Architecture 45
 4.2 Experimental Results and Discussion 47
Chapter 5 Conclusions and Future Work 56
References 58
Appendix: English Thesis 64
List of Figures
 Fig. 2.1 Vapor diffusion method 6
 Fig. 2.2 Dialysis method 7
 Fig. 2.3 Filter method 10
 Fig. 2.4 Wrapper method 11
 Fig. 2.5 Support vector machine 14
 Fig. 2.6 Kernel function illustration 17
 Fig. 2.7 Adaboost algorithm 19
 Fig. 2.8 Adaboost classification example 21
 Fig. 2.9 Classification and regression tree example 24
 Fig. 3.1 .pdb file data 30
 Fig. 3.2 TargetDB protein data 33
 Fig. 3.3 Cleaned TargetDB data 34
 Fig. 3.4 PDB IDs of TargetDB data 34
 Fig. 3.5 Protein sequence data in FASTA format 35
 Fig. 3.6 Protein sequence data with disordered-region information 36
 Fig. 3.7 LIBSVM data format 37
 Fig. 3.8 GML AdaBoost Matlab Toolbox data format 38
 Fig. 3.9 Information Gain computation flow 40
 Fig. 3.10 Effect of adjusting the bias on the hyperplane 42
 Fig. 3.11 Adaboost training on an imbalanced data set 44
 Fig. 4.1 System architecture 46
 Fig. 4.2 Adaboost prediction rates on imbalanced data sets 54
List of Tables
 Table 2.1 Comparison of filter and wrapper methods 12
 Table 3.1 CRYSTALP amino acid pair features 27
 Table 3.2 GES hydrophobicity definitions 28
 Table 3.3 Feature set for protein crystallization prediction 31
 Table 4.1 TargetDB test results 48
 Table 4.2 Selected feature set for TargetDB data 49
 Table 4.3 Feature codes 50
 Table 4.4 PDB test results 51
 Table 4.5 Selected feature set for PDB data 52
 Table 4.6 Comparison of the feature subsets of the two data sets 55 |
References |
[1] J.M. Tyszka, S.E. Fraser and R.E. Jacobs, "Magnetic resonance microscopy: recent advances and applications," Current Opinion in Biotechnology, Vol. 16, Issue 1, 2005, pp. 93-99.
[2] A. McPherson, "Introduction to protein crystallization," Methods, Vol. 34, Issue 3, Nov. 2004, pp. 254-265.
[3] H. Li and M. Niranjan, "Discriminant subspaces of some high dimensional pattern classification problems," IEEE Workshop on Machine Learning for Signal Processing, Aug. 2007, pp. 27-32.
[4] M. Dash, K. Choi, P. Scheuermann and H. Liu, "Feature selection for clustering - a filter solution," IEEE International Conference on Data Mining (ICDM), Dec. 2002, pp. 115-122.
[5] R. Kohavi and G.H. John, "Wrappers for feature subset selection," Artificial Intelligence, Vol. 97, Issues 1-2, Dec. 1997, pp. 273-324.
[6] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, Vol. 20, No. 3, Sep. 1995, pp. 273-297.
[7] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
[8] Y. Freund and R.E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, Vol. 55, Issue 1, 1997, pp. 119-139.
[9] Y. Freund and R.E. Schapire, "A short introduction to boosting," Journal of the Japanese Society for Artificial Intelligence, Vol. 14, Issue 5, Sep. 1999, pp. 771-780.
[10] X. Li, "Boosting - one of combining models," http://www.ist.temple.edu/~vucetic/cis526fall2007/xin.ppt, 2007.
[11] Y. Freund and R.E. Schapire, "Experiments with a new boosting algorithm," Machine Learning: Proceedings of the Thirteenth International Conference, 1996, pp. 148-156.
[12] J. Thongkam, G. Xu, Y. Zhang and F. Huang, "Breast cancer survivability via AdaBoost algorithms," Health Data and Knowledge Management, ACM, 2008, pp. 55-64.
[13] Y. Ma and X. Ding, "Robust real-time face detection based on cost-sensitive AdaBoost method," IEEE International Conference on Multimedia and Expo, Vol. 2, 2003, pp. 465-468.
[14] A. Vezhnevets and V. Vezhnevets, "'Modest AdaBoost' - teaching AdaBoost to generalize better," Graphicon, 2005.
[15] P. Smialowski, T. Schmidt, J. Cox, A. Kirschner and D. Frishman, "Will my protein crystallize? A sequence-based predictor," Proteins: Structure, Function, and Bioinformatics, Vol. 62, Issue 2, Nov. 2005, pp. 343-355.
[16] K. Chen, L. Kurgan and M. Rahbari, "Prediction of protein crystallization using collocation of amino acid pairs," Biochemical and Biophysical Research Communications, Vol. 355, Issue 3, Apr. 2007, pp. 764-769.
[17] D. Gilis, S. Massar, N.J. Cerf and M. Rooman, "Optimality of the genetic code with respect to protein stability and amino acid frequencies," Genome Biology, Vol. 2, Issue 11, 2001, research0049.1-0049.12.
[18] D.M. Engelman, T.A. Steitz and A. Goldman, "Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins," Annual Review of Biophysics and Biophysical Chemistry, Vol. 15, 1986, pp. 321-353.
[19] K.A. Kantardjieff and B. Rupp, "Protein isoelectric point as a predictor for increased crystallization screening efficiency," Bioinformatics, Vol. 20, No. 14, 2004, pp. 2162-2168.
[20] RCSB Protein Data Bank, http://www.rcsb.org/pdb.
[21] TargetDB - Structural Genomics target registration database, http://targetdb.pdb.org.
[22] W. Li and A. Godzik, "CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences," Bioinformatics, Vol. 22, No. 13, 2006, pp. 1658-1659.
[23] LIBSVM - A Library for Support Vector Machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[24] Y.W. Chen, "Combining SVMs with various feature selection strategies," http://www.csie.ntu.edu.tw/~cjlin/papers/features.pdf.
[25] J.R. Quinlan, "Discovering rules from large collections of examples: a case study," in D. Michie (Ed.), Expert Systems in the Micro-electronic Age, Edinburgh University Press, 1979, pp. 168-201.
[26] N.V. Chawla, K.W. Bowyer, L.O. Hall and W.P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, Vol. 16, 2002, pp. 321-357.
[27] X.W. Chen, B. Gerlach and D. Casasent, "Pruning support vectors for imbalanced data classification," IEEE International Joint Conference on Neural Networks, Vol. 3, Aug. 2005, pp. 1883-1888.
[28] X. Li, L. Wang and E. Sung, "A study of AdaBoost with SVM based weak learners," IEEE International Joint Conference on Neural Networks, Vol. 1, 2005, pp. 196-201.
[29] H.H. Hsu and S.M. Wang, "Protein crystallization prediction with a combined feature set," International Conference on Innovations in Information Technology, IEEE, Dec. 2008, pp. 702-706. |
Full-Text Access Rights |