Tamkang University Chueh Sheng Memorial Library (TKU Library)


System ID: U0002-3007200901024400
Chinese Title: 使用混合特徵選取於蛋白質結晶預測 (Hybrid Feature Selection in Protein Crystallization Prediction)
English Title: Hybrid Feature Selection in Protein Crystallization Prediction
University: Tamkang University
Department (Chinese name): Master's Program, Department of Computer Science and Information Engineering
Department (English name): Department of Computer Science and Information Engineering
Academic Year: 97 (ROC calendar)
Semester: 2
Year of Publication: 98 (ROC calendar, i.e. 2009)
Student's Chinese Name: 葉致遠
Student's English Name: Jr-Yuan Yeh
Student ID: 696410652
Degree: Master's
Language: Chinese
Second Language: English
Oral Defense Date: 2009-07-22
Number of Pages: 68
Committee: Advisor: 許輝煌 (Hui-Huang Hsu)
Member: 白敦文
Member: 劉昭麟
Member: 許輝煌
Chinese Keywords (translated): Support Vector Machine; Adaboost; Feature Selection; Machine Learning; Imbalanced Data Set; Protein Crystallization
English Keywords: Support Vector Machine; Adaboost; Feature Selection; Machine Learning; Imbalanced Data; Protein Crystallization
Subject Classification: Applied Sciences – Computer Science and Information Engineering
Chinese Abstract (translated)  Proteins are the principal building blocks of life. Because a protein's function depends on its structure, determining the three-dimensional structures of protein molecules is a major goal for scientists. Besides predicting structure with statistical learning theory, three-dimensional structures are in practice usually determined from X-ray diffraction or Nuclear Magnetic Resonance (NMR) experiments. NMR is time-consuming and costly, and does not always resolve a structure. If a protein can be crystallized from solution, however, the crystal can be analyzed by X-ray diffraction. Since not every protein crystallizes, predicting whether a protein can crystallize is an important problem.
  We encode each protein using information derived from its amino acid sequence, i.e. its primary structure, obtained from the TargetDB protein database, and apply two feature selection methods, F-score and Information Gain, to pick out the features most helpful for predicting protein crystallization. The selected data are then learned with a support vector machine (SVM) and with the Adaboost algorithm, respectively. The SVM classifies by separating the classes in space with a hyperplane; Adaboost repeatedly adjusts the weight of each training example over several learning rounds to reduce the weak learner's error rate, and finally combines these weak learners into a strong learner for classification.
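The F-score criterion used in this step ranks each feature by the ratio of its between-class separation to its within-class spread. The following is a minimal sketch on synthetic data; the array shapes, random seed, and feature indices are made up for illustration, and the thesis itself fed the selected features to LIBSVM and the GML AdaBoost Matlab Toolbox rather than to Python code:

```python
import numpy as np

def f_score(X, y):
    """Per-feature F-score: between-class separation divided by
    within-class spread; larger means more discriminative."""
    pos, neg = X[y == 1], X[y == 0]
    mean = X.mean(axis=0)
    num = (pos.mean(axis=0) - mean) ** 2 + (neg.mean(axis=0) - mean) ** 2
    den = pos.var(axis=0, ddof=1) + neg.var(axis=0, ddof=1)
    return num / (den + 1e-12)  # epsilon guards against constant features

# Toy stand-in for the encoded sequence features (hypothetical data).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 8))
y = (X[:, 2] - X[:, 5] > 0).astype(int)  # label driven by features 2 and 5

scores = f_score(X, y)
top_k = np.argsort(scores)[::-1][:3]  # keep the highest-scoring features
print(top_k)  # features 2 and 5 should rank first
```

Only the top-ranked features are kept; the rest are discarded before training the classifiers.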
  In our experiments, the prediction accuracy on the TargetDB data reaches 93.02%, with a sensitivity (crystallizable data correctly classified as crystallizable) of 95.49% and a specificity (non-crystallizable data correctly classified as non-crystallizable) of 86.08%. The ultimate aim of these experiments is to identify the factors that prevent proteins from crystallizing, so that those factors can be remedied, the proteins crystallized, and their structural information obtained by X-ray diffraction.

English Abstract  Proteins are the major components of organisms. The structure of a protein determines its functions, so it is important to find out the structures of proteins. Nowadays, scientists usually use X-ray diffraction or Nuclear Magnetic Resonance (NMR) to discover the structures of proteins. However, NMR is time-consuming and expensive, so X-ray diffraction is usually preferred. To use X-ray diffraction, the target protein must first be crystallized; if it can be, X-ray diffraction can reveal its structure. Thus, determining the crystallization state of a target protein is very important.
In this thesis, we use the data in TargetDB to generate a data set of features that have significant relationships with protein crystallization. We then apply two feature selection methods to the data set to remove irrelevant or redundant features. After the feature selection process, we use a support vector machine (SVM) and Adaboost, respectively, to predict whether a protein can be crystallized. Finally, we compare and discuss the results generated by these two methods.
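The second feature selection criterion, Information Gain, scores a feature by how much the label entropy drops once the feature's value is known. A minimal sketch for discrete features follows; the toy feature and label vectors are hypothetical, and the thesis's actual discretization is described in its Section 3.3.2:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) of a label sequence, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """IG = H(Y) - H(Y | feature) for a discrete-valued feature."""
    n = len(labels)
    cond = 0.0
    for v in set(feature):
        subset = [y for f, y in zip(feature, labels) if f == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

# Toy case: the feature perfectly predicts the label, so IG = H(Y) = 1 bit.
print(information_gain([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0
```

Features with near-zero gain carry almost no information about crystallization and are the ones a filter method discards.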
According to our experimental results, Adaboost achieves higher accuracy than the SVM on the same data set. The prediction accuracy of Adaboost is 93.02%, and its sensitivity (on crystallizable data) and specificity (on non-crystallizable data) are 95.49% and 86.08%, respectively. The purpose of our experiments is to identify the factors that may prevent proteins from crystallizing, so that scientists can improve protein crystallization; X-ray diffraction can then be applied to discover the structures of those proteins.
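The accuracy, sensitivity, and specificity quoted above follow the standard binary confusion-matrix definitions, sketched below; the counts are made up for illustration and are not the thesis's actual confusion matrix:

```python
def metrics(tp, fn, tn, fp):
    """Standard rates from a binary confusion matrix.
    tp/fn: crystallizable data classified correctly/incorrectly;
    tn/fp: non-crystallizable data classified correctly/incorrectly."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)  # fraction of crystallizable data caught
    specificity = tn / (tn + fp)  # fraction of non-crystallizable data caught
    return accuracy, sensitivity, specificity

# Hypothetical counts for illustration only.
print(metrics(90, 10, 80, 20))  # (0.85, 0.9, 0.8)
```

Reporting sensitivity and specificity separately matters here because the data set is imbalanced: a high overall accuracy alone could hide poor performance on the smaller non-crystallizable class.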

Table of Contents

Chapter 1  Introduction 1
1.1 Background 1
1.2 Motivation 2
1.3 Organization of the Thesis 3
Chapter 2  Literature Review 5
2.1 Principles of Protein Crystallization 5
2.2 Feature Selection 8
  2.2.1 Filter Methods 9
  2.2.2 Wrapper Methods 10
  2.2.3 Comparison of Filter and Wrapper Methods 11
2.3 Support Vector Machines 12
  2.3.1 Kernel Functions 15
2.4 The Adaboost Algorithm 18
  2.4.1 Gentle Adaboost 22
  2.4.2 Modest Adaboost 22
2.5 Classification and Regression Trees 23
Chapter 3  Feature Selection and Protein Crystallization Prediction 26
3.1 Choosing the Feature Set 26
3.2 Data Preprocessing and Encoding 32
3.3 Feature Selection Methods 38
  3.3.1 F-Score 38
  3.3.2 Information Gain 39
3.4 Handling Imbalanced Data Sets 40
Chapter 4  System Architecture and Experimental Results 45
4.1 System Architecture 45
4.2 Experimental Results and Discussion 47
Chapter 5  Conclusions and Future Work 56
References 58
Appendix  English Version of the Thesis 64


List of Figures

Figure 2.1 Illustration of the vapor diffusion method 6
Figure 2.2 Illustration of the dialysis method 7
Figure 2.3 The filter method 10
Figure 2.4 The wrapper method 11
Figure 2.5 Support vector machine 14
Figure 2.6 Illustration of kernel functions 17
Figure 2.7 The Adaboost algorithm 19
Figure 2.8 An Adaboost classification example 21
Figure 2.9 A classification and regression tree example 24
Figure 3.1 Data in a .pdb file 30
Figure 3.2 TargetDB protein data 33
Figure 3.3 Cleaned-up TargetDB data 34
Figure 3.4 PDB IDs of the TargetDB data 34
Figure 3.5 Protein sequence data in FASTA format 35
Figure 3.6 Protein sequence data with unstable-region information 36
Figure 3.7 The LIBSVM data format 37
Figure 3.8 The GML AdaBoost Matlab Toolbox data format 38
Figure 3.9 Information Gain computation flow 40
Figure 3.10 Effect of adjusting the bias on the hyperplane 42
Figure 3.11 An example of Adaboost training on an imbalanced data set 44
Figure 4.1 System architecture 46
Figure 4.2 Changes in Adaboost's prediction rate on imbalanced data sets 54

List of Tables

Table 2.1 Comparison of filter and wrapper methods 12
Table 3.1 CRYSTALP amino acid pair features 27
Table 3.2 Hydrophilicity definitions of the GES scale 28
Table 3.3 Feature set for protein crystallization prediction 31
Table 4.1 TargetDB test results 48
Table 4.2 Feature subset selected from the TargetDB data 49
Table 4.3 Feature codes 50
Table 4.4 PDB test results 51
Table 4.5 Feature subset selected from the PDB data 52
Table 4.6 Comparison of the feature subsets of the two data sets 55
References
[1] J.M. Tyszka, S.E. Fraser and R.E. Jacobs, “Magnetic resonance microscopy: recent advances and applications,” Current Opinion in Biotechnology, Vol. 16, Issue 1, 2005, pp. 93-99.
[2] A. McPherson, “Introduction to protein crystallization,” Methods, Elsevier, Vol. 34, Issue 3, Nov. 2004, pp. 254-265.
[3] H. Li and M. Niranjan, “Discriminant Subspaces of Some High Dimensional Pattern Classification Problems,” Machine Learning for Signal Processing, IEEE, Aug. 2007, pp. 27-32.
[4] M. Dash, K. Choi, P. Scheuermann and H. Liu, “Feature selection for clustering – A filter solution,” ICDM, IEEE International Conference, Dec. 2002, pp. 115-122.
[5] R. Kohavi and G.H. John, “Wrappers for Feature Subset Selection,” Artificial Intelligence, Elsevier, Vol. 97, Issues 1-2, Dec. 1997, pp. 273-324.
[6] C. Cortes and V. Vapnik, “Support-Vector Networks,” Machine Learning, Springer, Vol. 20, No. 3, Sep. 1995, pp. 273-297.
[7] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
[8] Y. Freund and R.E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, Vol. 55, Issue 1, 1997, pp. 119-139.
[9] Y. Freund and R.E. Schapire, “A Short Introduction to Boosting,” Journal of Japanese Society for Artificial Intelligence, Vol. 14, Issue 5, Sep. 1999, pp. 771-780.
[10] X. Li, “Boosting---one of combining models,” http://www.ist.temple.edu/~vucetic/cis526fall2007/xin.ppt, 2007.
[11] Y. Freund and R.E. Schapire, “Experiments with a new boosting algorithm,” Machine Learning: Proceedings of the Thirteenth International Conference, 1996, pp. 148-156.
[12] J. Thongkam, G. Xu, Y. Zhang and F. Huang, “Breast cancer survivability via AdaBoost algorithms,” Health Data and Knowledge Management, ACM, 2008, pp. 55-64.
[13] Y. Ma and X. Ding, “Robust real-time face detection based on cost-sensitive AdaBoost method,” International Conference on Multimedia and Expo, IEEE, Vol. 2, 2003, pp. 465-468.
[14] A. Vezhnevets and V. Vezhnevets, “Modest AdaBoost – teaching AdaBoost to generalize better,” Graphicon, 2005.
[15] P. Smialowski, T. Schmidt, J. Cox, A. Kirschner and D. Frishman, “Will My Protein Crystallize? A Sequence-Based Predictor,” Proteins: Structure, Function, and Bioinformatics, Vol. 62, Issue 2, Nov. 2005, pp. 343-355.
[16] K. Chen, L. Kurgan and M. Rahbari, “Prediction of protein crystallization using collocation of amino acid pairs,” Biochemical and Biophysical Research Communications, Vol. 355, Issue 3, Apr. 2007, pp. 764-769.
[17] D. Gilis, S. Massar, N.J. Cerf and M. Rooman, “Optimality of the genetic code with respect to protein stability and amino acid frequencies,” Genome Biology, Vol. 2, Issue 11, 2001, pp. research0049.1-0049.12.
[18] D.M. Engelman, T.A. Steitz and A. Goldman, “Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins,” Annu. Rev. Biophys. Biophys. Chem., Vol. 15, 1986, pp. 321-353.
[19] K.A. Kantardjieff and B. Rupp, “Protein isoelectric point as a predictor for increased crystallization screening efficiency,” Bioinformatics, Oxford University Press, Vol. 20, No. 14, 2004, pp. 2162-2168.
[20] RCSB Protein Data Bank, http://www.rcsb.org/pdb.
[21] TargetDB – Structural Genomics target registration database, http://targetdb.pdb.org.
[22] W. Li and A. Godzik, “CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences,” Bioinformatics, Oxford University Press, Vol. 22, No. 13, 2006, pp. 1658-1659.
[23] LIBSVM – A Library for Support Vector Machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[24] Y.W. Chen, “Combining SVMs with Various Feature Selection Strategies,” www.csie.ntu.edu.tw/~cjlin/papers/features.pdf.
[25] J.R. Quinlan, “Discovering Rules from Large Collections of Examples: A Case Study,” in D. Michie (Ed.), Expert Systems in the Microelectronic Age, Edinburgh University Press, 1979, pp. 168-201.
[26] N.V. Chawla, K.W. Bowyer, L.O. Hall and W.P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, Vol. 16, 2002, pp. 321-357.
[27] X.W. Chen, B. Gerlach and D. Casasent, “Pruning support vectors for imbalanced data classification,” International Joint Conference on Neural Networks, IEEE, Vol. 3, Aug. 2005, pp. 1883-1888.
[28] X. Li, L. Wang and E. Sung, “A Study of AdaBoost with SVM Based Weak Learners,” International Joint Conference on Neural Networks, IEEE, Vol. 1, 2005, pp. 196-201.
[29] H.H. Hsu and S.M. Wang, “Protein crystallization prediction with a combined feature set,” International Conference on Innovations in Information Technology, IEEE, Dec. 2008, pp. 702-706.
Thesis Use Authorization
  • The author agrees to grant in-library readers royalty-free permission to reproduce the print copy for academic purposes, to be made public on 2010-07-30.
  • The author agrees to authorize the browse/print electronic full-text service, available to the public starting 2010-07-30.

