§ Thesis Bibliographic Record
  
System ID U0002-3006201012100300
DOI 10.6846/TKU.2010.01118
Title (Chinese) 生物資訊中之特徵選取
Title (English) Feature Selection in Bioinformatics
Title (Third Language)
University Tamkang University (淡江大學)
Department (Chinese) 資訊工程學系博士班
Department (English) Department of Computer Science and Information Engineering
Foreign Degree School
Foreign Degree College
Foreign Degree Institute
Academic Year 98 (ROC calendar; 2009-2010)
Semester 2
Publication Year 99 (ROC calendar; 2010)
Graduate Student (Chinese) 謝正葦
Graduate Student (English) Cheng-Wei Hsieh
Student ID 892190108
Degree Doctoral
Language English
Second Language
Oral Defense Date 2010-06-29
Pages 109
Committee Advisor - 許輝煌
Member - 施國琛
Member - 王英宏
Member - 白敦文
Member - 陳朝欽
Member - 許輝煌
Keywords (Chinese) 特徵選取 (feature selection)
生物資訊 (bioinformatics)
群集分析 (cluster analysis)
機器學習 (machine learning)
Keywords (English) Feature selection
Bioinformatics
Clustering
Machine Learning
Keywords (Third Language)
Subject Classification
Chinese Abstract
This dissertation investigates feature selection in bioinformatics research. Across its many applications, whether classification, prediction, or gene selection, the data involved routinely contain thousands of attributes, sometimes more than ten thousand. Too many attributes make predictors or classifiers consume a great deal of time and machine computation; worse, prediction or classification performance may drop substantially.
  Feature selection is therefore an effective remedy for these problems. First, it keeps only the important features so that the learning machine trains on a reduced set; second, it removes the so-called noise among the features, lifting overall performance. Compared with feature extraction, the feature selection adopted in this dissertation better preserves the important information carried by the attributes and makes clearer which attributes are closely related to the problem.
  However, among the many feature selection models there is no selection mechanism that is both fast and highly accurate. In general, feature selection falls into two categories. One, called the filter, uses information theory to compute the amount of information in, or the dependencies among, attributes; it runs comparatively fast but does not guarantee accuracy. The other, called the wrapper, embeds a learning machine in the selection mechanism and judges a selection by observing the learner's results; it is more accurate but runs slowly.
  In this dissertation we propose two solutions that aim to speed up the computation while maintaining accuracy. The first is a hybrid feature selection scheme. In the first stage, a fast filter screens features in bulk and discards most of them; the second stage applies an accuracy-oriented wrapper. Because wrapper-style screening is slow, the first-stage filtering pays off here: only a small portion of the features is passed to the later wrapper module for analysis, so the total computation drops sharply. After the two stages, not only speed but also accuracy and computational cost improve. This is our proposed hybrid feature selection model.
  On the other hand, we also developed a clustering-based feature selection method. Clustering can judge whether multiple points in a space lie near one another, and we apply this property to the comparison of attributes: features that end up in the same cluster can be regarded as similar or redundant, so one feature per cluster is retained and the rest are discarded, which accomplishes the selection. Specifically, we replace the distance computation at the core of the clustering mechanism, substituting a similarity calculation for the original Euclidean distance and using the magnitude of the correlation coefficient to decide whether two attributes are similar. The resulting clusters thus realize a many-to-many similarity computation over all data points in the space, and in the end one feature is kept per cluster. This step guarantees that the selected features are highly dissimilar to one another, and it can be viewed as feature selection by redundant-data removal.
  In this dissertation we carry out experimental analyses on three bioinformatics applications: detection of protein disordered regions, prediction of protein crystallization, and gene selection on microarrays. The experiments show that the proposed models substantially improve the prediction performance of these tasks. We also compare against other well-known selection models, and the results again show that our methods improve performance more effectively.
English Abstract
This dissertation focuses on feature selection problems, especially in bioinformatics. In most prediction, classification, and gene selection applications, the number of features runs into the hundreds or even thousands. Too many features waste classifier computation time; worse, prediction accuracy may also decrease greatly.
  Feature selection is therefore an efficient solution to these problems. It not only selects the most important features for learning models to train on, but also removes noisy features from the original feature set. In contrast to feature extraction techniques, feature selection keeps the important information of the original features intact; in addition, it reveals the relationships between the problem and its features.
  However, no feature selection model is both fast and satisfactory in prediction accuracy. Generally speaking, feature selection models fall into two main kinds: filters and wrappers. Filter models are based on information theory; they compute the amount of information in, or the dependency between, features and run faster than wrappers, but they do not guarantee prediction accuracy. A wrapper embeds a learning model in the selection procedure and uses the learning results to decide which features to remove; it is much more accurate than a filter but requires more computation time.
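  To make the filter idea concrete, the following is a minimal Python/NumPy sketch (not code from the dissertation) of one common filter criterion, the F-score for two-class data; the dissertation's screening stage uses F-score together with information gain (cf. Table 2). The function name and the cutoff of 50 features are illustrative assumptions.

import numpy as np

def f_score(X, y):
    """F-score of each feature for binary labels y in {0, 1}.

    Squared distance of the two class means from the overall mean,
    divided by the within-class variances; higher suggests a more
    discriminative feature.  X has shape (n_samples, n_features).
    """
    pos, neg = X[y == 1], X[y == 0]
    mean_all = X.mean(axis=0)
    numer = (pos.mean(axis=0) - mean_all) ** 2 + (neg.mean(axis=0) - mean_all) ** 2
    denom = pos.var(axis=0, ddof=1) + neg.var(axis=0, ddof=1)
    return numer / (denom + 1e-12)  # guard against constant features

# Usage: keep the 50 top-scoring features of a feature matrix X.
# keep = np.argsort(f_score(X, y))[::-1][:50]
# X_filtered = X[:, keep]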
  In this dissertation, we propose two mechanisms that improve computational performance and prediction accuracy. The first is the hybrid feature selection model. It applies filter models to remove most of the noisy features in a primary screening procedure, and then applies a wrapper with a learning model in the secondary selection process. Across these two stages, the filter provides efficiency and the wrapper raises the system's accuracy.
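  Below is a minimal sketch of this two-stage scheme, assuming the f_score helper above plus scikit-learn's SVC and cross_val_score; the dissertation's actual screening thresholds, feature-set combination, and floating-search fine tuning (SBS, SFS, iSFSM in Chapter 3) are simplified here into a plain greedy forward search.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def hybrid_select(X, y, n_screen=100, n_final=20):
    """Two-stage hybrid selection (a sketch, not the exact dissertation setup)."""
    # Stage 1 (filter): cheap bulk screening keeps the n_screen best features.
    screened = list(np.argsort(f_score(X, y))[::-1][:n_screen])

    # Stage 2 (wrapper): greedy forward search scored by 5-fold SVM accuracy.
    chosen, best = [], 0.0
    while screened and len(chosen) < n_final:
        acc, j = max(
            (cross_val_score(SVC(kernel="rbf"), X[:, chosen + [f]], y, cv=5).mean(), f)
            for f in screened
        )
        if acc <= best:  # stop once no remaining candidate improves accuracy
            break
        best, chosen = acc, chosen + [j]
        screened.remove(j)
    return chosen, best

  The speed-up comes from the first stage: the wrapper's expensive cross-validated search runs over only n_screen surviving features instead of the full feature set.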
  The second solution is a clustering-based feature selection model. Clustering analyzes distances among data points in a high-dimensional space, so we use it to measure similarities among features. In the final clustering result, redundant features are removed and only one feature per group is retained as its representative. Departing from the usual notion of clustering, we replace the Euclidean core of the distance calculation with a correlation coefficient calculation and measure the similarity between two features by the value of their correlation coefficient. This procedure guarantees that the selected features possess characteristics quite different from one another; the model can be regarded as redundant-feature removal.
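  The following sketch, under the same assumptions (Python/NumPy, illustrative parameter values), clusters the feature columns with K-medoids using 1 - |Pearson correlation| as the distance and keeps the medoid of each cluster as its representative. The Voronoi-style update is a generic K-medoids step, not necessarily the dissertation's exact algorithm, and constant feature columns must be dropped first since their correlation is undefined.

import numpy as np

def correlation_kmedoids(X, k, n_iter=100, seed=0):
    """Cluster the columns of X into k groups; return one medoid per group."""
    # Feature-to-feature distance: strongly correlated pairs are near zero.
    D = 1.0 - np.abs(np.corrcoef(X.T))  # assumes no constant columns

    rng = np.random.default_rng(seed)
    medoids = rng.choice(D.shape[0], size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)  # nearest-medoid assignment
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:  # keep the old medoid if a cluster empties
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids  # indices of the retained, mutually dissimilar features

# Usage: keep 30 representative features.
# X_reduced = X[:, correlation_kmedoids(X, k=30)]

  Using a correlation-based distance means two features fall into the same cluster precisely when they carry nearly the same information, so keeping one medoid per cluster removes redundancy rather than merely shrinking dimensionality.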
  In this dissertation, we have evaluated the proposed methods on three bioinformatics applications: protein disordered region prediction, protein crystallization prediction, and gene selection on microarrays. In these experiments, the proposed models assist the prediction procedures and enhance their performance. We also compare our methods with other well-known feature selection models; the results show that performance is improved both efficiently and effectively.
Third-Language Abstract
Table of Contents
Chapter 1 Introduction	1
1.1 Motivation	1
1.2 Problem Definition	3
1.3 Organization of Dissertation	5
Chapter 2 Literature Review	7
2.1 Protein Disordered Regions Prediction	8
2.2 Gene Selection on Microarray	14
2.3 Protein Crystallization Prediction	16
2.4 Feature Selection Studies	21
2.5 Chapter Summary	25
Chapter 3 Feature Selection Models	28
3.1 Filters and Wrappers	28
3.2 Hybrid Feature Selection	31
3.2.1 Preliminary screening	32
3.2.2 Combination	34
3.2.3 Fine tuning	36
3.3 Clustering Feature Selection	39
3.3.1 Clustering	40
3.3.2 K-medoids	42
3.3.3 Correlation Coefficient Clustering	45
3.4 Chapter Summary	51
Chapter 4 Data Collection and Classification Model	54
4.1 Bioinformatics Data	55
4.1.1 Disordered Region of Protein	55
4.1.2 Microarray	62
4.1.3 Protein Crystallization	64
4.2 Training Procedure	66
4.2.1 Support Vector Machine	66
4.2.2 Adaboost	69
4.2.3 Testing and Evaluation	72
4.3 Chapter Summary	74
Chapter 5 Experimental Results	77
5.1 Protein Disordered Regions Prediction	78
5.1.1 Feature Combination	78
5.1.2 Prediction with Hybrid Feature Selection Model	81
5.1.3 Prediction with K-medoids Clustering Feature Selection	83
5.1.4 Prediction with Correlation Coefficient Clustering Feature Selection	86
5.2 Protein Crystallization Prediction	87
5.3 Gene Selection	92
5.3.1 Selection with Hybrid Feature Selection Model	92
5.3.2 Selection with Correlation Coefficient Clustering	95
5.4 Chapter Summary	97
Chapter 6 Discussions and Conclusions	100
References	103


List of Figures
Figure 1. The filters	29
Figure 2. The wrappers	30
Figure 3. Hybrid feature selection procedure	31
Figure 4. Intersection and exclusive-OR sets	35
Figure 5. Sequential backward searching (SBS)	37
Figure 6. Sequential forward searching (SFS)	37
Figure 7. Inversed sequential floating search method (iSFSM)	39
Figure 8. Separation of data via clustering	40
Figure 9. The process of correlation coefficient clustering feature selection	48
Figure 10. REMARK 465: missing residue fields in the 1BHS.pdb file	57
Figure 11. Feature generation architecture	58
Figure 12. Kernel function of SVM	66
Figure 13. Support vectors	67
Figure 14. Ten-fold cross-validation accuracy comparison	96



List of Tables
Table 1. The comparison of the filters and the wrappers	31
Table 2. Strength of F-score and Information Gain	34
Table 3. Collected features	65
Table 4. Training SVM with different kernels	79
Table 5. Prediction results of seven kinds of combinations	80
Table 6. Preliminary screening on disordered protein data	81
Table 7. Combinations of disordered data after screening	82
Table 8. Fine tuning result	83
Table 9. The feature selection results of 440 features	84
Table 10. Self training/testing analysis	85
Table 11. Five-fold cross-validation results on disordered protein data	87
Table 12. Crystallization prediction results	88
Table 13. Feature selection results	90
Table 14. Feature IDs and corresponding features	91
Table 15. Preliminary screening on microarray cancer data	92
Table 16. Combinations of microarray cancer data after screening	94
Table 17. Prediction accuracy after fine tuning	94
Table 18. The comparison with other methods (AML & ALL)	94
Full-Text Access Authorization
On campus
Printed thesis: released on campus immediately
Full-text electronic thesis authorized for on-campus release
Electronic thesis: released on campus immediately
Off campus
Authorization granted
Electronic thesis: released off campus immediately
