System ID | U0002-3006201012100300 |
---|---|
DOI | 10.6846/TKU.2010.01118 |
Title (Chinese) | 生物資訊中之特徵選取 |
Title (English) | Feature Selection in Bioinformatics |
Title (third language) | |
University | Tamkang University (淡江大學) |
Department (Chinese) | 資訊工程學系博士班 |
Department (English) | Doctoral Program, Department of Computer Science and Information Engineering |
Foreign degree school | |
Foreign degree college | |
Foreign degree institute | |
Academic year | 98 |
Semester | 2 |
Publication year | 99 (2010, ROC calendar) |
Author (Chinese) | 謝正葦 |
Author (English) | Cheng-Wei Hsieh |
Student ID | 892190108 |
Degree | Doctoral (Ph.D.) |
Language | English |
Second language | |
Defense date | 2010-06-29 |
Pages | 109 |
Defense committee | Advisor: 許輝煌; Members: 施國琛, 王英宏, 白敦文, 陳朝欽, 許輝煌 |
Keywords (Chinese) | 特徵選取, 生物資訊, 群集分析, 機器學習 |
Keywords (English) | Feature selection, Bioinformatics, Clustering, Machine Learning |
Keywords (third language) | |
Subject classification | |
Abstract (Chinese, translated) |
This dissertation addresses feature selection in bioinformatics research. In applications such as classification, prediction, and gene selection, the data typically contain thousands of attributes, sometimes more than ten thousand. Too many attributes cause predictors or classifiers to consume excessive time and computation; worse, prediction or classification performance can drop sharply. Feature selection is therefore an effective remedy for these problems. First, it selects only the important features so that the learning machine trains on a reduced set; second, it removes so-called noisy features so that overall performance improves. Compared with feature extraction, the feature selection adopted in this dissertation better preserves the important information in the attributes and makes it clearer which attributes are most relevant to the problem.

Among the many feature selection models, however, none is both fast and highly accurate. In general, feature selection falls into two categories. The first, called the filter, uses information theory to compute the information content of or dependency among attributes; it runs relatively fast but does not guarantee accuracy. The second, called the wrapper, embeds a learning machine in the selection mechanism and judges a candidate selection by the learner's results; it is more accurate but runs slowly.

This dissertation proposes two solutions intended to speed up the process while maintaining accuracy. The first is a hybrid feature selection method. In the first stage, a fast filter screens out most of the features in bulk; in the second stage, an accuracy-oriented wrapper refines the selection. Because the wrapper is slow, the first-stage filter pays off here: only a small subset of features is passed to the wrapper module, so the total computation drops sharply. After the two stages, not only speed but also accuracy and computational cost are improved. This is our proposed hybrid selection model.

The second is a clustering-based feature selection method. Clustering's ability to judge how close points lie in a space is applied to attribute comparison: features assigned to the same cluster can be regarded as similar or redundant, so one feature per cluster is kept and the rest are removed, which accomplishes feature selection. We replace the distance core of the clustering mechanism, changing Euclidean distance to a similarity computation, and use the correlation coefficient to judge whether two attributes are similar. The resulting clusters realize a many-to-many similarity computation over all data points in the space, and finally one feature per cluster is retained. This procedure guarantees that the selected features are highly dissimilar from one another, and it can be viewed as a redundancy-removing feature selection method.

This dissertation experiments on three bioinformatics applications: detection of protein disordered regions, prediction of protein crystallization, and gene selection on microarrays. The experiments show that the proposed models substantially improve prediction performance. We also compare against other well-known selection models, and the results confirm that our methods improve performance more effectively. |
Abstract (English) |
This dissertation focuses on feature selection problems, especially in bioinformatics. In most applications of prediction, classification, or gene selection, the number of features runs into the hundreds or even thousands. Too many features waste the classifier's computation time; even worse, the prediction accuracy may decrease greatly. Feature selection is therefore an efficient solution to these problems: it not only selects the most important features for the learning model to train on, but also removes noisy features from the original feature set. In contrast to feature extraction techniques, feature selection keeps the important information of the original feature set, and it reveals the relationships between the problem and its features. However, no existing feature selection model is both fast and satisfactorily accurate. Generally speaking, there are two main kinds of feature selection models: filters and wrappers. Filters are based on information theory; they calculate the amount of information in, or the dependency between, features, and they run faster than wrappers, but they do not guarantee prediction accuracy. A wrapper embeds a learning model in the selection procedure and uses the learning results to decide which features to remove; it is much more accurate than a filter but requires more computation time. In this dissertation, we propose two mechanisms to improve both computational performance and prediction accuracy. The first is a hybrid feature selection model: a filter removes most of the noisy features in the primary procedure, and a wrapper with a learning model performs the secondary selection. With these two procedures, the filter provides efficiency and the wrapper increases system accuracy.
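The two-stage flow can be sketched as follows. This is a minimal, self-contained illustration, not the dissertation's actual implementation: the filter stage ranks features by a simple F-score-style ratio (between-class over within-class variance), and the wrapper stage runs a sequential backward search scored by the leave-one-out accuracy of a toy 1-nearest-neighbour classifier, whereas the dissertation itself uses SVM and AdaBoost learners.

```python
def f_score(xs, ys, j):
    """Between-class over within-class variance ratio for feature j
    (binary labels 0/1)."""
    pos = [x[j] for x, y in zip(xs, ys) if y == 1]
    neg = [x[j] for x, y in zip(xs, ys) if y == 0]
    mean = lambda v: sum(v) / len(v)
    m_all, m_pos, m_neg = mean([x[j] for x in xs]), mean(pos), mean(neg)
    within = (sum((v - m_pos) ** 2 for v in pos) / max(len(pos) - 1, 1)
              + sum((v - m_neg) ** 2 for v in neg) / max(len(neg) - 1, 1))
    return ((m_pos - m_all) ** 2 + (m_neg - m_all) ** 2) / (within + 1e-12)

def filter_stage(xs, ys, keep):
    """Stage 1 (filter): rank all features by F-score, keep the top `keep`."""
    n = len(xs[0])
    ranked = sorted(range(n), key=lambda j: f_score(xs, ys, j), reverse=True)
    return ranked[:keep]

def accuracy(xs, ys, feats):
    """Leave-one-out accuracy of a toy 1-nearest-neighbour classifier
    restricted to the feature subset `feats`."""
    def dist(a, b):
        return sum((a[j] - b[j]) ** 2 for j in feats)
    hits = 0
    for i in range(len(xs)):
        nearest = min((k for k in range(len(xs)) if k != i),
                      key=lambda k: dist(xs[i], xs[k]))
        hits += ys[nearest] == ys[i]
    return hits / len(xs)

def wrapper_stage(xs, ys, feats):
    """Stage 2 (wrapper): sequential backward search; each round drops
    the feature whose removal gives the best accuracy, as long as the
    accuracy does not degrade."""
    feats = list(feats)
    while len(feats) > 1:
        base = accuracy(xs, ys, feats)
        trials = [(accuracy(xs, ys, [g for g in feats if g != f]), f)
                  for f in feats]
        best_acc, drop = max(trials)
        if best_acc < base:
            break
        feats.remove(drop)
    return feats

# toy example: feature 0 separates the classes, features 1-2 are noise
xs = [[0.0, 5, 1], [0.1, 3, 9], [0.2, 4, 5],
      [1.0, 5, 2], [1.1, 3, 8], [1.2, 4, 4]]
ys = [0, 0, 0, 1, 1, 1]
kept = filter_stage(xs, ys, keep=2)   # filter prunes to two candidates
final = wrapper_stage(xs, ys, kept)   # wrapper discards the remaining noise
```

On this toy data the filter keeps the informative feature among its top candidates, and the wrapper then removes the surviving noisy one, mirroring the fast-then-accurate division of labour described above.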
The second proposed solution is a clustering-based feature selection model. Clustering can analyze distances among data points in a high-dimensional space, so we use it to measure similarities among features. In the final clustering result, redundant features are removed and only one feature per cluster is retained as the representative of its group. In addition to the original clustering concept, we replace the core distance calculation with a correlation coefficient calculation, measuring the similarity between two features by the value of their correlation coefficient. This procedure guarantees that the selected features possess quite different characteristics from one another; this kind of feature selection can be regarded as redundant feature removal. In this dissertation, we evaluate the proposed methods on three bioinformatics applications: protein disordered region prediction, protein crystallization prediction, and gene selection on microarrays. In these experiments, the proposed models assist the prediction procedures and enhance their performance. We also compare our methods with other well-known feature selection models; the results show that performance is improved efficiently and effectively. |
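A minimal sketch of this clustering-based selection, assuming a K-medoids-style loop (as in the thesis's Section 3.3) with Euclidean distance swapped for 1 − |Pearson correlation|; the function names, the deterministic first-k initialization, and the empty-cluster handling are illustrative choices, not necessarily the dissertation's exact algorithm.

```python
def pearson(a, b):
    """Pearson correlation coefficient between two feature columns."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb) if sa and sb else 0.0

def corr_distance(a, b):
    """Strongly (anti-)correlated features count as near-duplicates."""
    return 1.0 - abs(pearson(a, b))

def kmedoids_select(columns, k, iters=20):
    """Cluster feature columns around k medoids under the correlation
    distance and return one representative feature index per cluster."""
    medoids = list(range(k))  # deterministic start: first k features
    for _ in range(iters):
        # assignment step: each feature joins its nearest medoid
        clusters = {m: [] for m in medoids}
        for j, col in enumerate(columns):
            nearest = min(medoids,
                          key=lambda m: corr_distance(col, columns[m]))
            clusters[nearest].append(j)
        # update step: new medoid minimizes total distance in its cluster
        new_medoids = []
        for m, members in clusters.items():
            if not members:            # keep an empty cluster's old medoid
                new_medoids.append(m)
                continue
            best = min(members, key=lambda c: sum(
                corr_distance(columns[c], columns[o]) for o in members))
            new_medoids.append(best)
        if sorted(new_medoids) == sorted(medoids):
            break                      # converged
        medoids = new_medoids
    return sorted(medoids)

# toy example: features 0 and 2 are perfectly correlated (redundant),
# feature 1 is unrelated; one of the duplicate pair should be dropped
cols = [[1, 2, 3, 4], [5, 1, 4, 2], [2, 4, 6, 8]]
selected = kmedoids_select(cols, k=2)
```

Here feature 2 is an exact multiple of feature 0, so the two fall into one cluster and only a single representative of the pair survives alongside the unrelated feature 1, which is the redundancy-removal behaviour described above.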
第三語言摘要 | |
Table of contents |
Chapter 1 Introduction 1
  1.1 Motivation 1
  1.2 Problem Definition 3
  1.3 Organization of Dissertation 5
Chapter 2 Literature Review 7
  2.1 Protein Disordered Regions Prediction 8
  2.2 Gene Selection on Microarray 14
  2.3 Protein Crystallization Prediction 16
  2.4 Feature Selection Studies 21
  2.5 Chapter Summary 25
Chapter 3 Feature Selection Models 28
  3.1 Filters and Wrappers 28
  3.2 Hybrid Feature Selection 31
    3.2.1 Preliminary Screening 32
    3.2.2 Combination 34
    3.2.3 Fine Tuning 36
  3.3 Clustering Feature Selection 39
    3.3.1 Clustering 40
    3.3.2 K-medoids 42
    3.3.3 Correlation Coefficient Clustering 45
  3.4 Chapter Summary 51
Chapter 4 Data Collection and Classification Model 54
  4.1 Bioinformatics Data 55
    4.1.1 Disordered Region of Protein 55
    4.1.2 Microarray 62
    4.1.3 Protein Crystallization 64
  4.2 Training Procedure 66
    4.2.1 Support Vector Machine 66
    4.2.2 AdaBoost 69
    4.2.3 Testing and Evaluation 72
  4.3 Chapter Summary 74
Chapter 5 Experimental Results 77
  5.1 Protein Disordered Regions Prediction 78
    5.1.1 Feature Combination 78
    5.1.2 Prediction with Hybrid Feature Selection Model 81
    5.1.3 Prediction with K-medoids Clustering Feature Selection 83
    5.1.4 Prediction with Correlation Coefficient Clustering Feature Selection 86
  5.2 Protein Crystallization Prediction 87
  5.3 Gene Selection 92
    5.3.1 Selection with Hybrid Feature Selection Model 92
    5.3.2 Selection with Correlation Coefficient Clustering 95
  5.4 Chapter Summary 97
Chapter 6 Discussions and Conclusions 100
References 103

List of Figures
Figure 1. The filters 29
Figure 2. The wrappers 30
Figure 3. Hybrid feature selection procedure 31
Figure 4. Intersection and exclusive-OR sets 35
Figure 5. Sequential backward searching (SBS) 37
Figure 6. Sequential forward searching (SFS) 37
Figure 7. Inversed sequential floating search method (iSFSM) 39
Figure 8. Separation of data via clustering 40
Figure 9. The process of correlation coefficient clustering feature selection 48
Figure 10. Remark 465: missing residues fields in 1BHS.pdb file 57
Figure 11. Feature generation architecture 58
Figure 12. Kernel function of SVM 66
Figure 13. Support vectors 67
Figure 14. Ten-fold cross-validation accuracy comparison 96

List of Tables
Table 1. The comparison of the filters and the wrappers 31
Table 2. Strength of F-score and Information Gain 34
Table 3. Collected features 65
Table 4. Training SVM with different kernels 79
Table 5. Prediction results of seven kinds of combinations 80
Table 6. Preliminary screening on disordered protein data 81
Table 7. Combinations of disordered data after screening 82
Table 8. Fine tuning result 83
Table 9. The feature selection results of 440 features 84
Table 10. Self training/testing analysis 85
Table 11. Five-fold cross-validation results on disordered protein data 87
Table 12. Crystallization prediction results 88
Table 13. Feature selection results 90
Table 14. Feature IDs and corresponding features 91
Table 15. Preliminary screening on microarray cancer data 92
Table 16. Combinations of microarray cancer data after screening 94
Table 17. Prediction accuracy after fine tuning 94
Table 18. The comparison with other methods (AML & ALL) 94 |
References | |
Full-text usage rights | |