Tamkang University Chueh Sheng Memorial Library (TKU Library)


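(Downloading the electronic full text is restricted to Tamkang University IP addresses)
System ID: U0002-3012201004162000
Title (Chinese): 使用微陣列資料與本體論於基因調控關係預測
Title (English): Regulatory Genes Prediction with Microarray Data and Ontology
University: Tamkang University (淡江大學)
Department (Chinese): 資訊工程學系博士班
Department (English): Department of Computer Science and Information Engineering
Academic year: 99 (ROC calendar)
Semester: 1
Publication year: 100 (ROC calendar)
Student name (Chinese): 楊朝勛
Student name (English): Chao-Hsun Yang
Student ID: 894190114
Degree: Doctoral
Language: English
Oral defense date: 2010-12-20
Number of pages: 121
Oral defense committee: Advisor - 許輝煌
Committee member - 施國琛
Committee member - 曾新穆
Committee member - 白敦文
Committee member - 王英宏
Committee member - 許輝煌
Keywords (Chinese): 基因微陣列  調控基因預測  遺失值填補  動態時間配置法  基因本體論
Keywords (English): Microarray Time-Series Data  Gene Regulation Prediction  Missing Value Imputation  Dynamic Time Warping  Gene Ontology
Subject classification: Applied Sciences, Information Engineering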
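Chinese abstract (translated)
Gene microarrays have been widely used in biological research in recent years. Biologists can use the large volume of results produced by microarray experiments for follow-up research and analysis. However, identifying gene sets that have regulatory relationships within large amounts of microarray data is an important research problem in microarray data analysis. The methods proposed in the literature to date all have limitations and drawbacks.
On the other hand, the data obtained from many microarray experiments often contain entries that are absent; these entries are called missing values. Missing values can arise for many reasons, for example a weak response at the spot, instrument error, or human error. Because many of the algorithms used for downstream analysis of microarray data require relatively complete data, these missing values must be estimated and imputed with an effective method.
In this dissertation, we propose a method that effectively predicts regulatory genes. The prediction method is based on a distance measure between two genes that we designed, which combines dynamic time warping with gene ontology information. Dynamic time warping effectively computes the distance between two sequences, while the gene ontology structure provides, for every known gene, descriptions and annotations of that gene's attributes. By quantifying and evaluating these descriptive data, we can estimate the distance, or similarity, between two genes in a biological sense. Our distance measure therefore takes into account both the gene expression values in the microarray time-series data and the gene ontology descriptions associated with those genes.
In addition, we propose an imputation method to estimate and fill in the missing values in microarray data. This method combines our gene-pair distance measure with the k-nearest neighbor method. Our experimental results show that, compared with methods proposed in the related literature, our imputation method estimates and fills missing values in microarray data more effectively. We therefore first fill the missing values with our imputation method and then apply our regulatory gene prediction method to predict which regulatory gene sets may exist in the microarray data. The prediction experiments also show that, compared with other methods, the proposed method more effectively finds the known regulatory gene sets in the microarray data used, and thereby provides candidate gene sets that may have regulatory relationships.
The proposed missing value imputation and regulatory gene prediction methods will aid subsequent analysis and research on microarray data.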
English abstract
Microarray technology lets scientists analyze thousands of gene expression profiles simultaneously. However, microarray gene expression data often contain missing expression values, for many reasons. Effective methods for imputing these values are needed because many algorithms for microarray data analysis require a complete matrix of gene expression values. In addition, selecting informative genes is essential when analyzing such large amounts of data. A number of methods have been proposed from various points of view to meet this need, but most of them have limitations and disadvantages.
In this dissertation, we propose a novel approach for predicting potential regulatory gene pairs, using a distance measure that effectively estimates the distance between two genes. The measure is based on the dynamic time warping (DTW) algorithm and the well-defined gene ontology (GO) structure for genes and proteins. GO contains annotations that describe the biological meaning of genes. The semantic distance between two genes, in biological terms, can be measured by quantitatively assessing their GO annotations. Our distance measure takes into account both the DTW distance between the expression values and the GO semantic distance of a gene pair.
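To make the idea of a DTW-based gene-pair distance concrete, the following is a minimal Python sketch of the textbook DTW recurrence, not the refined DTW variant developed in the dissertation. The function names, the absolute-difference local cost, and the weighted sum used to mix a DTW distance with a GO semantic distance are all assumptions made for illustration; this record does not state how the dissertation actually combines the two distances.

```python
from math import inf


def dtw_distance(a, b):
    """Textbook dynamic time warping distance between two numeric sequences.

    Uses an absolute-difference local cost and the standard three-way
    recurrence; runs in O(len(a) * len(b)) time and space.
    """
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a[i-1] aligned with an already-used b[j-1]
                cost[i][j - 1],      # b[j-1] aligned with an already-used a[i-1]
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


def combined_distance(dtw_d, go_d, weight=0.5):
    """Hypothetical mix of a DTW distance and a GO semantic distance.

    The weighted sum is an assumption for illustration only.
    """
    return weight * dtw_d + (1.0 - weight) * go_d
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero because the repeated value can be absorbed by warping, which a point-by-point distance cannot do for sequences of different lengths.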
We also propose a novel missing value imputation approach that combines our distance measure with the k-nearest neighbor (KNN) method. Experimental results show that this approach outperforms other major methods under a commonly used assessment measure. After the missing values in the raw microarray time-series data have been estimated with our imputation approach, we apply our gene regulation prediction approach. According to the experimental results, it discovers more known regulatory gene pairs than other methods. Our approaches can thus improve and facilitate research on microarray time-series data.
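As a rough illustration of KNN imputation with a pluggable gene-pair distance, here is a minimal Python sketch. It is a generic KNN scheme under stated assumptions, not the dissertation's algorithm: the function name `knn_impute`, the use of `None` for missing entries, the restriction of the distance to the target's observed time points, the requirement that candidate rows be complete, and the inverse-distance weighting are all hypothetical choices.

```python
def knn_impute(target, candidates, distance, k=3, eps=1e-9):
    """Fill None entries of `target` from its k nearest candidate rows.

    `candidates` are assumed to be complete rows of the same length as
    `target`. `distance` is any function of two equal-length sequences;
    it is evaluated only on the positions observed in `target`.
    Returns a new list and leaves `target` unchanged.
    """
    observed = [i for i, v in enumerate(target) if v is not None]
    scored = []
    for row in candidates:
        d = distance([target[i] for i in observed], [row[i] for i in observed])
        scored.append((d, row))
    # Sort by distance only, so rows themselves are never compared.
    scored.sort(key=lambda pair: pair[0])
    nearest = scored[:k]

    # Closer neighbors get larger weights; eps avoids division by zero.
    weights = [1.0 / (d + eps) for d, _ in nearest]
    total = sum(weights)

    filled = list(target)
    for i, v in enumerate(target):
        if v is None:
            filled[i] = sum(w * row[i] for w, (_, row) in zip(weights, nearest)) / total
    return filled
```

Any distance function can be passed in, for instance a DTW-based one or a combined expression and GO distance, which is the kind of substitution the abstract describes.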
Table of contents
Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Problem Definition 4
1.3 Organization of Dissertation 8
Chapter 2 Literature Review 10
2.1 Existing Missing Value Imputation Methods 10
2.2 Regulatory Genes Prediction Methods 16
2.3 Gene Ontology Based Data Analysis 21
2.4 Chapter Summary 23
Chapter 3 DTW-GO Based Microarray Data Analysis 26
3.1 Dynamic Time Warping Algorithm 28
3.1.1 Refinement of the DTW Algorithm 32
3.2 Gene Ontology 42
3.2.1 Application of Gene Ontology 46
3.2.2 Finding Informative Data-Specific GO Terms 59
3.3 Combining DTW and GO for Microarray Data Analysis 67
3.3.1 Missing Value Imputation 68
3.3.2 Gene Regulation Prediction 73
3.4 Chapter Summary 74
Chapter 4 Datasets and Performance Assessment 77
4.1 Microarray Datasets 77
4.2 Performance Assessment 79
4.2.1 Missing Value Imputation 80
4.2.2 Gene Regulation Prediction 81
4.3 Chapter Summary 85
Chapter 5 Experimental Results 86
5.1 Design of the Experiments 86
5.2 Experimental Results 87
5.2.1 Effect of DTW adjustments 88
5.2.2 Effect of Various Parameters Used in GO Distance Measurement 90
5.2.3 Missing Value Imputation 96
5.2.4 Gene Regulation Prediction 98
5.3 Chapter Summary 108
Chapter 6 Discussions and Conclusions 110
Bibliography 112

List of Figures
Figure 1. Missing values in microarray time series data 6
Figure 2. Time series sequence similarity measurement 29
Figure 3. Example of dynamic time warping 30
Figure 4. Modification of the FastDTW method 35
Figure 5. Expectative alignment for two sequences 37
Figure 6. Singularity problem of DTW algorithm 37
Figure 7. Example of gene ontology 44
Figure 8. GO terms and the three domains: cellular component, biological process, molecular function 50
Figure 9. Contents of an ontology file 52
Figure 10. The tree-like structure of gene ontology 53
Figure 11. Contents of an annotation file 55
Figure 12. Parsed results of an annotation file 56
Figure 13. Three relations between GO term pairs 58
Figure 14. Content of each cell in the comparison matrix 64
Figure 15. K-nearest neighbor example 70
Figure 16. Spellman's Dataset 78
Figure 17. Filkov's Database 82
Figure 18. The saccharomyces genome database 84
Figure 19. Imputation results of alpha sub-dataset 97
Figure 20. Imputation results of cdc28 sub-dataset 97

List of Tables
Table 1. Microarray time series data 5
Table 2. Comparison matrix of each GO term 62
Table 3. Concentrated rows in the comparison matrix 66
Table 4. Parsing results for gene regulations 83
Table 5. NRMS values for DTW adjustments in alpha dataset 90
Table 6. NRMS values for DTW adjustments in cdc28 dataset 90
Table 7. NRMS values for different parameters of GO semantic distance measurement at 20% data missing rate 93
Table 8. NRMS values for different parameters of GO semantic distance measurement with found informative GO terms at 20% data missing rate 95
Table 9. Number of identified regulatory gene pairs 99
Table 10. Known alpha activations found by our approach 100
Table 11. Known alpha inhibitions found by our approach 103
Table 12. Known cdc28 activations found by our approach 103
Table 13. Known cdc28 inhibitions found by our approach 107
References
E. Acuna and C. Rodriguez, “The treatment of missing values and its effect in the classifier accuracy,” in: Classification, Clustering and Data Mining Applications, pp.639-648, 2004.
A.A. Alizadeh, M.B. Eisen, R.E. Davis, C. Ma, I.S. Lossos, A. Rosenwald, J.C. Boldrick, H. Sabet, T. Tran, X. Yu, J.I. Powell, L. Yang, G.E. Marti, T. Moore, J. Hudson, L. Lu, D.B. Lewis, R. Tibshirani, G. Sherlock, W.C. Chan, T.C. Greiner, D.D. Weisenburger, J.O. Armitage, R. Warnke, R. Levy, W. Wilson, M.R. Grever, J.C. Byrd, D. Botstein, P.O. Brown, and L.M. Staudt, “Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling,” Nature, Vol. 403, pp.503-511, 2000.
D. Berndt and J. Clifford, “Using dynamic time warping to find patterns in time series,” in: Proceedings of the Workshop on Knowledge Discovery in Databases, pp.359-370, Jul. 31 - Aug. 1, 1994.
Y. Cao, and K.L. Poh, “An accurate and robust missing value estimation for microarray data: least absolute deviation imputation,” in: Proceedings of the 5th International Conference on Machine Learning and Applications, pp.157-161, Dec.16, 2006.
L.C. Chen, Y.C. Lin, M. Arita, and V.S. Tseng, “A novel approach for handling missing values in microarray data,” in: Proceedings of the International Computer Symposium, pp.45-50, Jun. 21-25, 2008.
R. Cho, M. Campbell, E. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T. Wolfsberg, A. Gabrielian, D. Landsman, and D. Lockhart, “A genome-wide transcriptional analysis of the mitotic cell cycle,” Molecular Cell, Vol. 2, pp.65-73, 1998.
M.K. Choong, M. Charbit, and H. Yan, “Autoregressive-model-based missing value estimation for DNA microarray time series data,” IEEE Transactions on Information Technology in Biomedicine, Vol. 13, No. 1, pp.131-137, 2009.
M.K. Choong, D. Levy, and H. Yang, “Study of microarray time series data based on forward–backward linear prediction and singular value decomposition,” International Journal of Data Mining and Bioinformatics, Vol. 3, No. 2, pp.145-159, 2009.
M.K. Choong, K.C. Lye, D. Levy, and H. Yang, “Periodicity identification of microarray time series data based on spectral analysis,” in: Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, pp.1281-1285, Oct. 8-11, 2006.
M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein, “Cluster analysis and display of genome-wide expression patterns,” Proceedings of the National Academy of Sciences, Vol. 95, pp.14863-14868, 1998.
V. Filkov, S. Skiena, and J. Zhi, “Analysis techniques for microarray time-series data,” in: Proceedings of the 5th Annual International Conference on Computational Molecular Biology, pp.124-131, Apr. 22-25, 2001.
N. Friedman, M. Linial, I. Nachman, and D. Pe’er, “Using Bayesian networks to analyze expression data,” in: Proceedings of the 4th Annual International Conference on Computational Molecular Biology, pp.601-620, Apr. 8-11, 2000.
C. Furlanello, S. Merler, and G. Jurman, “Combining feature selection and DTW for time-varying functional genomics,” IEEE Transactions on Signal Processing, Vol. 54, Issue 6, Part 2, pp.2436-2443, 2006.
T. Hastie, R. Tibshirani, G. Sherlock, M. Eisen, P. Brown, and D. Botstein, “Imputing missing data for gene expression arrays,” in: Technical Report, Division of Biostatistics, Stanford University, pp.1-9, Sep. 9, 1999.
F. Itakura, “Minimum prediction residual principle applied to speech recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-23, pp.52-72, 1975.
K. Kalpakis, D. Gada, and V. Puttagunta, “Distance measures for effective clustering of ARIMA time-series,” in: Proceedings of the IEEE International Conference on Data Mining, pp.273-280, Nov. 29 - Dec. 2, 2001.
E.J. Keogh, and M.J. Pazzani, “Scaling up dynamic time warping to massive datasets,” in: Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.285-289, Aug. 20-23, 2000.
H. Kim, G.H. Golub, and H. Park, “Imputation of missing values in DNA microarray gene expression data,” in: Proceedings of the IEEE Computational Systems Bioinformatics Conference, pp.572-573, Aug. 16-19, 2004.
H. Kim, G. H. Golub, and H. Park, “Missing value estimation for DNA microarray gene expression data: local least squares imputation,” Bioinformatics, Vol. 21 no. 2, pp.187-198, 2005.
S. Kim, S. Imoto, and S. Miyano, “Dynamic Bayesian network and nonparametric regression for nonlinear modeling of gene networks from time series gene expression data,” Biosystems, Vol. 75, pp.57-65, 2004.
J.B. Kruskal, and M. Liberman, “The symmetric time-warping problem: from continuous to discrete,” in: Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Stanford: CSLI Publications, pp. 125-161, 1999.
P. Larsen, E. Almasri, G. Chen, and Y. Dai, “Correlated discretized expression score: a method for identifying gene interaction networks from time course microarray expression data,” in: Proceedings of the 28th IEEE Annual International Conference of the Engineering in Medicine and Biology Society, pp.5842-5845, Aug. 31- Sep. 3, 2006.
M.S. Lee, L.Y. Liu, and M.Y. Chen, “Similarity analysis of time series gene expression using dual-tree wavelet transform,” in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp.I-413 – I-416, Mar. 30 - Apr. 4, 2007.
J. Liu, B. Ni, C. Dai, and N. Wang, “A simple method of inferring pairwise gene interactions from microarray time series data,” in: Proceedings of the 4th International Conference on Machine Learning and Cybernetics, pp.3346-3351, Aug. 18-21, 2005.
P.W. Lord, R.D. Stevens, A. Brass, and C.A. Goble, “Investigating semantic similarity measures across the gene ontology: the relationship between sequence and annotation,” Bioinformatics, Vol. 19, pp.1275-1283, 2003.
X. Lu, Y. Li, and X. Zhang, “A simple strategy for detecting outlier samples in microarray data,” in: Proceedings of the 8th Control, Automation, Robotics and Vision Conference, Vol. 2, pp.1331-1335, Dec. 6-9, 2004.
J.W. Luo, T. Yang, and Y. Wang, “Missing value estimation for microarray data based on fuzzy C-means clustering,” in: Proceedings of the 8th International Conference on High-Performance Computing in Asia-Pacific Region, pp.611-616, Nov. 30 - Dec. 3, 2005.
A. Mohammadi, and M.H. Saraee, “Estimating missing value in microarray data using fuzzy clustering and gene ontology,” in: Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine, pp.382-385, Nov. 3-5, 2008.
C. Myers, L. Rabiner, and A. Rosenberg, “Performance tradeoffs in dynamic time warping algorithms for isolated word recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-28, pp.623-635, 1980.
S. Oba, M. Sato, I. Takemasa, M. Monden, K. Matsubara, and S. Ishii, “A Bayesian missing value estimation method for gene expression profile data,” Bioinformatics, Vol. 19, No. 16, pp.2088-2096, 2003.
M. Ouyang, W.J. Welsh, and P. Georgopoulos, “Gaussian mixture clustering and imputation of microarray data,” Bioinformatics, Vol. 20, No. 6, pp.917-923, 2004.
C. Phong, and R. Singh, “Missing value estimation for time series microarray data using linear dynamical systems modeling,” in: Proceedings of the 22nd International Conference on Advanced Information Networking and Applications, pp.814-819, Mar. 25-28, 2008.
L. Rabiner, A. Rosenberg, and S. Levinson, “Considerations in dynamic time warping algorithms for discrete word recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-26, pp.575-582, 1978.
S. Salvador and P. Chan, “Toward accurate dynamic time warping in linear time and space,” Intelligent Data Analysis, Vol. 11, Issue 5, pp.561-580, 2007.
H. Sakoe, and S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-26, pp.43-49, 1978.
A. Sanfilippo, B. Baddeley, N. Beagley, and B. Gopalan, “Enhancing automatic biological pathway generation with GO-based gene similarity,” in: Proceedings of International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing, pp.448-453, Aug. 3-5, 2009.
M.S.B. Sehgal, I. Gondal, and L. Dooley, “A collateral missing value estimation algorithm for DNA microarrays,” in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 5, pp.v/377-v/380, Mar. 18-23, 2005.
Y. Shan, and G. Deng, “Kernel PCA regression for missing data estimation in DNA microarray analysis,” in: Proceedings of the IEEE International Symposium on Circuits and Systems, pp.1477-1480, May 24-27, 2009.
P.T. Spellman, G. Sherlock, M.Q. Zhang, V.R. Iyer, K. Anders, M.B. Eisen, P.O. Brown, D. Botstein, and B. Futcher, “Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization,” Molecular Biology of the Cell, Vol. 9, pp.3273-3297, 1998.
O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R.B. Altman, “Missing value estimation methods for DNA microarrays,” Bioinformatics, Vol. 17, no. 6, pp.520-525, 2001.
V.S. Tseng, L.C. Chen, and J.J. Chen, “Gene relation discovery by mining similar subsequences in time-series microarray data,” in: Proceedings of the IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pp.106-112, Apr. 1-5, 2007.
T.Q. Tung, T. Ryu, K.H. Lee, and D. Lee, “Inferring gene regulatory networks from microarray time series data using transfer entropy,” in: Proceedings of the 20th IEEE International Symposium on Computer-Based Medical Systems, pp.383-388, Jun. 20-22, 2007.
M. Vlachos, G. Kollios, and G. Gunopulos, “Discovering similar multidimensional trajectories,” in: Proceedings of the 18th International Conference on Data Engineering, Feb. 26 - Mar. 1, pp.673-684, 2002.
X. Wang, A. Li, Z. Jiang, and H. Feng, “Missing value estimation for DNA microarray gene expression data by support vector regression imputation and orthogonal coding scheme,” BMC Bioinformatics, Vol. 7, pp.1-10, 2006.
W.S.V. Wong, F.K. Wong, and G.R. Wood, “A multi-stage approach to clustering and imputation of gene expression profiles,” Bioinformatics, Vol. 23, pp.998-1005, 2007.
Q. Xiang, and X. Dai, “Improving missing value imputation in microarray data by using gene regulatory information,” in: Proceedings of the 2nd International Conference on Bioinformatics and Biomedical Engineering, pp.326-329, May 16-18, 2008.
X. Xu and A. Zhang, “Selecting informative genes from microarray dataset by incorporating gene ontology,” in: Proceedings of the 5th IEEE Symposium on Bioinformatics and Bioengineering, pp.241-245, Oct. 19-21, 2005.
Y. Yamada, K. Hirotani, K. Satou, and K. Muramoto, “An identification method of data-specific GO terms from a microarray data set,” IEICE Transactions on Information and Systems, Vol. E92-D, No. 5, 2009.
A.C. Yang, H.H. Hsu, and M.D. Lu, “Outlier filtering for identification of gene regulations in microarray time-series data,” in: Proceedings of the 3rd International Conference on Complex, Intelligent and Software Intensive Systems, pp.854-859, Mar. 16-19, 2009.
L.K. Yeung, H. Yan, A.W.C. Liew, L.K. Szeto, M. Yang, and R. Kong, “Measuring correlation between microarray time series data using dominant spectral component,” in: Proceedings of the 2nd Asia-Pacific Bioinformatics Conference, Vol. 29, pp.309-314, Jan. 18-22, 2004.
H.M. Yu, W.H. Tsai, and H.M. Wang, “Query-by-singing system for retrieving karaoke music,” IEEE Transactions on Multimedia, Vol. 10, Issue 8, pp.1626-1637, 2008.
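Thesis usage rights
  • The author agrees to grant library patrons a royalty-free license to reproduce the print copy for academic purposes, effective 2013-01-10.
  • The author agrees to authorize browsing and printing of the electronic full text, effective 2013-01-10.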

