§ Thesis Bibliographic Record
  
System ID U0002-3007200901024400
DOI 10.6846/TKU.2009.01147
Title (Chinese) 使用混合特徵選取於蛋白質結晶預測
Title (English) Hybrid Feature Selection in Protein Crystallization Prediction
Title (third language)
University 淡江大學 (Tamkang University)
Department (Chinese) 資訊工程學系碩士班
Department (English) Department of Computer Science and Information Engineering
Foreign degree: university
Foreign degree: college
Foreign degree: graduate institute
Academic year 97 (ROC calendar; 2008–2009)
Semester 2
Year of publication 98 (ROC calendar; 2009)
Author (Chinese) 葉致遠
Author (English) Jr-Yuan Yeh
Student ID 696410652
Degree Master's
Language Traditional Chinese
Second language English
Defense date 2009-07-22
Number of pages 68
Committee Advisor - 許輝煌 (h_hsu@mail.tku.edu.tw)
Member - 白敦文 (twp@ntou.edu.tw)
Member - 劉昭麟 (chaolin@nccu.edu.tw)
Member - 許輝煌 (h_hsu@mail.tku.edu.tw)
Keywords (Chinese) 支持向量機
適應推進演算法
特徵選取
機器學習
不平衡資料集
蛋白質結晶
Keywords (English) Support Vector Machine
AdaBoost
Feature Selection
Machine Learning
Imbalanced Data
Protein Crystallization
Keywords (third language)
Subject classification
Chinese Abstract
Proteins are the principal material from which life is built. A protein's function varies with its structure, so determining the three-dimensional structures of protein molecules is a goal scientists work toward. Besides predicting structures with statistical learning theory, current practice determines a protein's three-dimensional structure experimentally, by X-ray diffraction or by Nuclear Magnetic Resonance (NMR). NMR is time-consuming and costly, and still may not resolve a protein's structure. If a protein solution can be made to yield crystals, however, the crystals can be analyzed with X-ray diffraction. Since not every protein crystallizes, predicting whether a protein can crystallize becomes an important problem.
  We encode the information provided by the amino acid sequences (i.e., the primary structures) of proteins obtained from the TargetDB protein database, and apply two feature selection methods, F-score and Information Gain, to pick out the features that help most in predicting protein crystallization. The selected data are then learned with the support vector machine (SVM) and the AdaBoost algorithm, respectively. The SVM separates data of different classes with a hyperplane in the feature space to perform classification; AdaBoost repeatedly adjusts the weight of every training instance over a number of learning rounds to reduce the weak learner's error rate, and finally combines the learned results into a strong learner that performs the classification.
  In our experimental results, the prediction accuracy on the TargetDB data reaches 93.02%, with a sensitivity (crystallizable data correctly classified as crystallizable) of 95.49% and a specificity (non-crystallizable data correctly classified as non-crystallizable) of 86.08%. The purpose of these experiments is to identify the factors that keep proteins from crystallizing and then to improve on those factors, so that crystals of these proteins can be obtained and X-ray diffraction can be used to acquire structural information about them.
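The abstract describes ranking encoded sequence features with an F-score before training. As an illustration only, a minimal sketch of that ranking step in Python, assuming the common two-class F-score formula (between-class scatter over within-class scatter); the function and variable names here are mine, not the thesis's:

```python
# Hypothetical sketch of F-score feature ranking for two classes,
# e.g. crystallizable (positive) vs. non-crystallizable (negative).

def f_score(pos, neg):
    """F-score of one feature, given its values in each class."""
    all_vals = pos + neg
    m_all = sum(all_vals) / len(all_vals)
    m_pos = sum(pos) / len(pos)
    m_neg = sum(neg) / len(neg)
    # Numerator: how far each class mean sits from the overall mean.
    numer = (m_pos - m_all) ** 2 + (m_neg - m_all) ** 2
    # Denominator: spread of the feature within each class.
    denom = (sum((x - m_pos) ** 2 for x in pos) / (len(pos) - 1)
             + sum((x - m_neg) ** 2 for x in neg) / (len(neg) - 1))
    return numer / denom

def select_top_k(pos_rows, neg_rows, k):
    """Rank all feature columns by F-score and keep the top k indices."""
    n_features = len(pos_rows[0])
    scores = [f_score([r[j] for r in pos_rows],
                      [r[j] for r in neg_rows])
              for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j],
                  reverse=True)[:k]
```

A feature whose values separate the two classes well has class means far from the overall mean and little within-class spread, so it receives a high score and survives the selection.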
English Abstract
Proteins are the major components of organisms. The structure of a protein gives information about its functions. Therefore, it is important to find out the structures of proteins. Nowadays, scientists usually use X-ray diffraction or Nuclear Magnetic Resonance (NMR) to discover the structures of proteins. However, the process of NMR is time-consuming and expensive. Therefore, X-ray diffraction is usually used to determine the structures of proteins. In order to use X-ray diffraction, we have to make sure the target protein can be crystallized. If a target protein can be crystallized, we can use X-ray diffraction to discover its structure. Thus, discovering the crystallization state of the target protein is very important.
    In this thesis, we use the data in TargetDB to generate a data set whose features have significant relationships with protein crystallization. We then apply two feature selection methods on the data set to remove irrelevant or redundant features. After the feature selection process, we use the support vector machine (SVM) and AdaBoost respectively to predict whether the proteins can be crystallized or not. Furthermore, we compare and discuss the results generated by these two methods.
    According to our experimental results, applying AdaBoost generates higher accuracy than applying SVM on the same data set. The prediction accuracy for AdaBoost is 93.02%. Moreover, the sensitivity (crystallized data) and specificity (non-crystallized data) of AdaBoost are 95.49% and 86.08%, respectively. The purpose of our experiments is to find out the factors that may cause proteins to be non-crystallized, so that scientists can improve protein crystallization. As a result, X-ray diffraction can be applied to discover the structures of proteins.
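The abstract describes AdaBoost reweighting the training instances round by round to focus the weak learner on previously misclassified examples. A minimal sketch of that loop, assuming discrete AdaBoost with one-dimensional decision stumps as the weak learner purely for illustration (the thesis itself works on encoded protein features with the GML AdaBoost Matlab Toolbox):

```python
import math

def stump(xs, ys, w):
    """Weighted-error-minimizing threshold classifier on 1-D data.
    Returns (error, threshold, sign), predicting sign if x > threshold."""
    best = None
    for t in sorted(set(xs)):
        for sign in (1, -1):
            pred = [sign if x > t else -sign for x in xs]
            err = sum(wi for wi, p, y in zip(w, pred, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

def adaboost(xs, ys, rounds=5):
    """Train an ensemble; ys are +1 / -1 labels."""
    n = len(xs)
    w = [1.0 / n] * n                  # uniform initial instance weights
    ensemble = []                      # (alpha, threshold, sign) triples
    for _ in range(rounds):
        err, t, sign = stump(xs, ys, w)
        err = max(err, 1e-10)          # avoid log(0) on a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, sign))
        # Raise weights of misclassified points, lower the rest.
        w = [wi * math.exp(-alpha * y * (sign if x > t else -sign))
             for wi, x, y in zip(w, xs, ys)]
        z = sum(w)
        w = [wi / z for wi in w]       # renormalize to a distribution
    return ensemble

def predict(ensemble, x):
    """Strong learner: sign of the alpha-weighted vote of all stumps."""
    score = sum(a * (s if x > t else -s) for a, t, s in ensemble)
    return 1 if score > 0 else -1
```

Each round's stump gets a vote weight alpha that grows as its weighted error shrinks; the final strong learner is the signed, alpha-weighted vote of all the weak learners, which is the combination step the abstract refers to.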
Third-language Abstract
Thesis Table of Contents
Contents

Chapter 1 Introduction	1
1.1 Research Background	1
1.2 Motivation	2
1.3 Thesis Organization	3
Chapter 2 Literature Review	5
2.1 Principles of Protein Crystallization	5
2.2 Feature Selection	8
  2.2.1 Filter Methods	9
  2.2.2 Wrapper Methods	10
  2.2.3 Comparison of Filter and Wrapper Methods	11
2.3 Support Vector Machines	12
  2.3.1 Kernel Functions	15
2.4 The AdaBoost Algorithm	18
  2.4.1 Gentle AdaBoost	22
  2.4.2 Modest AdaBoost	22
2.5 Classification and Regression Trees	23
Chapter 3 Feature Selection and Protein Crystallization Prediction Methods	26
3.1 Selection of the Feature Set	26
3.2 Data Preprocessing and Encoding	32
3.3 Feature Selection Methods	38
  3.3.1 F-Score	38
  3.3.2 Information Gain	39
3.4 Handling Imbalanced Data Sets	40
Chapter 4 System Architecture and Experimental Results	45
4.1 System Architecture	45
4.2 Experimental Results and Discussion	47
Chapter 5 Conclusions and Future Work	56
References	58
Appendix: English Thesis	64

 
List of Figures

Figure 2.1 Diagram of the vapor diffusion method	6
Figure 2.2 Diagram of the dialysis method	7
Figure 2.3 The filter method	10
Figure 2.4 The wrapper method	11
Figure 2.5 Support vector machine	14
Figure 2.6 Illustration of kernel functions	17
Figure 2.7 The AdaBoost algorithm	19
Figure 2.8 An AdaBoost classification example	21
Figure 2.9 A classification and regression tree example	24
Figure 3.1 Data in a .pdb file	30
Figure 3.2 TargetDB protein data	33
Figure 3.3 Cleaned TargetDB data	34
Figure 3.4 PDB IDs of the TargetDB data	34
Figure 3.5 Protein sequence data in FASTA format	35
Figure 3.6 Protein sequence data with unstable-segment information	36
Figure 3.7 LIBSVM data format	37
Figure 3.8 GML AdaBoost Matlab Toolbox data format	38
Figure 3.9 Information Gain computation procedure	40
Figure 3.10 Effect of adjusting the bias on the hyperplane	42
Figure 3.11 Example of AdaBoost training on an imbalanced data set	44
Figure 4.1 System architecture	46
Figure 4.2 Changes in AdaBoost prediction rates on imbalanced data sets	54
 
List of Tables

Table 2.1 Comparison of filter and wrapper methods	12
Table 3.1 CRYSTALP amino acid pair features	27
Table 3.2 GES hydrophilicity definitions	28
Table 3.3 Feature set for protein crystallization prediction	31
Table 4.1 TargetDB test results	48
Table 4.2 Feature subset selected from the TargetDB data	49
Table 4.3 Feature codes	50
Table 4.4 PDB test results	51
Table 4.5 Feature subset selected from the PDB data	52
Table 4.6 Comparison of the feature subsets of the two data sets	55
References
[1]	J.M. Tyszka, S.E. Fraser and R.E. Jacobs, “Magnetic Resonance Microscopy: recent advances and applications,” Current Opinion in Biotechnology, Vol. 16, Issue 1, 2005, pp. 93-99.
[2]	A. McPherson, “Introduction to protein crystallization,” Methods, Elsevier, Vol. 34, Issue 3, Nov., 2004, pp. 254-265.
[3]	H. Li and M. Niranjan, “Discriminant Subspaces of Some High Dimensional Pattern Classification Problems,” Machine Learning for Signal Processing, IEEE, Aug., 2007, pp. 27-32.
[4]	M. Dash, K. Choi, P. Scheuermann and H. Liu, “Feature selection for clustering – A filter solution,” ICDM, IEEE International Conference, Dec., 2002, pp. 115-122.
[5]	R. Kohavi and G.H. John, “Wrappers for Feature Subset Selection,” Artificial Intelligence, Elsevier, Vol. 97, Issues 1-2, Dec., 1997, pp. 273-324.
[6]	C. Cortes and V. Vapnik, “Support-Vector Networks,” Machine Learning, SpringerLink, Vol. 20, No. 3, Sep., 1995, pp. 273-297.
[7]	J.S. Taylor and N. Cristianini, “Kernel Methods for Pattern Analysis,” Cambridge University Press, 2004.
[8]	Y. Freund and R.E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, Citeseer, Vol. 55, Issue 1, 1997, pp. 119-139.
[9]	Y. Freund and R.E. Schapire, “A Short Introduction to Boosting,” Journal of Japanese Society for Artificial Intelligence, Citeseer, Vol. 14, Issue 5, Sep., 1999, pp. 771-780.
[10]	X. Li, “Boosting---one of combining models,” http://www.ist.temple.edu/~vucetic/cis526fall2007/xin.ppt, 2007.
[11]	Y. Freund and R.E. Schapire, “Experiments with a new boosting algorithm,” Machine Learning: Proceedings of the Thirteenth International Conference, Citeseer, 1996, pp. 148-156.
[12]	J. Thongkam, G. Xu, Y. Zhang and F. Huang, “Breast cancer survivability via AdaBoost algorithms,” Health Data and Knowledge Management, ACM, 2008, pp. 55-64.
[13]	Y. Ma and X. Ding, “Robust real-time face detection based on cost-sensitive AdaBoost method,” International Conference on Multimedia and Expo, IEEE, Vol. 2, 2003, pp. 465-468.
[14]	A. Vezhnevets and V. Vezhnevets, “Modest AdaBoost – teaching AdaBoost to generalize better,” Graphicon, 2005.
[15]	P. Smialowski, T. Schmidt, J. Cox, A. Kirschner and D. Frishman, “Will My Protein Crystallize? A Sequence-Based Predictor,” PROTEINS: Structure, Function and Bioinformatics, InterScience, Vol. 62, Issue 2, Nov., 2005, pp. 343-355.
[16]	K. Chen, L. Kurgan and M. Rahbari, “Prediction of protein crystallization using collocation of amino acid pairs,” Biochemical and Biophysical Research Communications, Vol. 355, Issue 3, Apr., 2007, pp. 764-769.
[17]	D. Gilis, S. Massar, N.J. Cerf and M. Rooman, “Optimality of the genetic code with respect to protein stability and amino acid frequencies,” Genome Biology, Vol. 2, Issue 11, 2001, pp. research0049.1-0049.12.
[18]	D.M. Engelman, T.A. Steitz and A. Goldman, “Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins,” Annu. Rev. Biophys. Biophys. Chem., Vol. 15, 1986, pp. 321-353.
[19]	K.A. Kantardjieff and B. Rupp, “Protein isoelectric point as a predictor for increased crystallization screening efficiency,” Bioinformatics, Oxford University Press, Vol. 20, No. 14, 2004, pp. 2162-2168.
[20]	RCSB Protein Data Bank, http://www.rcsb.org/pdb.
[21]	TargetDB - Structural Genomics target registration database, http://targetdb.pdb.org.
[22]	W. Li and A. Godzik, “CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences,” Bioinformatics, Oxford University Press, Vol. 22, No. 13, 2006, pp. 1658-1659.
[23]	LIBSVM - A Library for Support Vector Machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[24]	Y.W. Chen, “Combining SVMs with Various Feature Selection Strategies,” www.csie.ntu.edu.tw/~cjlin/papers/features.pdf.
[25]	J.R. Quinlan, “Discovering Rules from Large Collections of Examples: A Case Study,” in Michie, D. (Ed.), Expert Systems in the Microelectronic Age, Edinburgh, Scotland: Edinburgh University Press, 1979, pp. 168-201.
[26]	N.V. Chawla, K.W. Bowyer, L.O. Hall and W.P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, Citeseer, Vol. 16, 2002, pp. 321-357.
[27]	X.W. Chen, B. Gerlach and D. Casasent, “Pruning support vectors for imbalanced data classification,” International Joint Conference on Neural Networks, IEEE, Vol. 3, Aug., 2005, pp. 1883-1888.
[28]	X. Li, L. Wang and E. Sung, “A Study of AdaBoost with SVM Based Weak Learners,” International Joint Conference on Neural Networks, IEEE, Vol. 1, 2005, pp. 196-201.
[29]	H.H. Hsu and S.M. Wang, “Protein crystallization prediction with a combined feature set,” International Conference on Innovations in Information Technology, IEEE, Dec., 2008, pp. 702-706.
Thesis Full-Text Use Authorization
On campus
The print thesis will be made publicly available 1 year after submission of the authorization form.
The author agrees to make the full electronic thesis publicly available on campus.
The on-campus electronic thesis will be made publicly available 1 year after submission of the authorization form.
Off campus
Authorization granted.
The off-campus electronic thesis will be made publicly available 1 year after submission of the authorization form.
