Tamkang University Chueh Sheng Memorial Library (TKU Library)
(Electronic full text can be downloaded only from Tamkang University IP addresses.)
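System ID U0002-0307200813293200
Title (Chinese) 應用支持向量機於癌症微陣列資料識別
Title (English) Cancer Classification on Microarray Expression Data with Support Vector Machine
Institution Tamkang University
Department (Chinese) 資訊工程學系碩士班
Department (English) Department of Computer Science and Information Engineering
Academic year 96
Semester 2
Year of publication 97
Author (Chinese name) 呂明達
Author (English name) Ming-Da Lu
Student ID 695411958
Degree Master's
Language Chinese
Second language English
Oral defense date 2008-06-23
Pages 77
Committee Advisor: 許輝煌
Member: 白敦文
Member: 林慧珍
Keywords (Chinese) 癌症分類  微陣列  支持向量機  特徵選擇  皮爾森相關係數
Keywords (English) Cancer Classification  Microarray  Support Vector Machine  Feature Selection  Pearson Correlation Coefficient
Subject classification Applied Sciences, Information Engineering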
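Abstract (translated from Chinese) Microarrays are an important tool for gene analysis today and can help distinguish among many types of cancer. We carried out a classification task on cancer microarray data. In this work we used feature selection methods from computer science to reduce the data, and the support vector machine, a machine learning method, to make predictions.
We applied these two tools to three cancer microarray data sets: leukemia, lung cancer, and prostate cancer. The feature selection methods fall into two categories: Euclidean distance feature selection, which is a distance measure, and Pearson correlation coefficient feature selection, which is a dependence measure. We ran the support vector machine with different numbers of features and three different kernel functions to perform the classification. Our results show that distance-based feature selection suits the support vector machine classifier, and that the linear kernel is the better kernel for the three problems we studied. Across the feature counts tried on the three data sets, reducing at least 7,129 features to between 15 and 100 features still gave equal or better prediction results.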
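The dependence-based selection step can be illustrated with a minimal sketch. It assumes that each gene is scored by the Pearson correlation coefficient between its expression values and a numeric class label, and that genes are ranked by the absolute value of that score. The record does not state the thesis's exact scoring rule; the function names `pearson` and `rank_genes_by_pcc` and the toy data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant gene (or constant labels) carries no signal here.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_genes_by_pcc(samples, labels, top_k):
    """Return the indices of the top_k genes with the largest |PCC|.

    samples: one row of expression values per sample.
    labels:  numeric class labels, e.g. 0 and 1 (an assumption of this sketch).
    """
    n_genes = len(samples[0])
    scored = []
    for g in range(n_genes):
        column = [row[g] for row in samples]
        scored.append((abs(pearson(column, labels)), g))
    # Highest score first; ties broken by gene index for reproducibility.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [g for _, g in scored[:top_k]]


# Toy data (hypothetical): 4 samples, 3 genes; gene 0 tracks the class label.
samples = [
    [1.0, 5.0, 0.3],
    [2.0, 4.0, 0.1],
    [8.0, 5.5, 0.2],
    [9.0, 4.5, 0.4],
]
labels = [0, 0, 1, 1]
selected = rank_genes_by_pcc(samples, labels, top_k=1)
```

Ranking on the signed score instead of its absolute value would give separate positive-correlation and negative-correlation variants.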
Abstract (English) Microarrays are an important tool in gene analysis research. They can help identify genes that might cause various cancers. In this thesis, we use feature selection methods and the support vector machine (SVM) to search for disease-causing genes in microarray data for three different cancers. The feature selection methods are based on Euclidean distance (ED) and the Pearson correlation coefficient (PCC). We selected three widely referenced microarray data sets for classification: the AML and ALL data sets, the lung cancer data sets, and the prostate data sets. We investigated how the prediction results change when the SVM is trained with different numbers of features and different kinds of kernels. The results show that the linear kernel is the best-suited kernel for these problems. Also, equal or higher accuracy can be achieved with only 15 to 100 features selected from the 7129 or more features of the original data sets.
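The abstract compares different kinds of kernels without naming them. As a rough illustration, the sketch below assumes the linear, polynomial, and radial basis function (RBF) kernels in the forms documented for LIBSVM [14]. The function names and default parameter values are hypothetical and are not taken from the thesis.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    """K(u, v) = u . v"""
    return dot(u, v)


def polynomial_kernel(u, v, degree=3, gamma=1.0, coef0=0.0):
    """K(u, v) = (gamma * u . v + coef0) ** degree"""
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=1.0):
    """K(u, v) = exp(-gamma * ||u - v||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


# Two toy expression profiles (hypothetical values).
u = [1.0, 2.0]
v = [3.0, 4.0]
k_lin = linear_kernel(u, v)
k_poly = polynomial_kernel(u, v, degree=2, gamma=1.0, coef0=1.0)
k_rbf = rbf_kernel(u, v, gamma=0.5)
```

With only tens of samples and thousands of genes, the classes are often already linearly separable, which is one common explanation for a linear kernel performing well on data of this shape.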
Table of contents
Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Thesis Organization
Chapter 2 Literature Review
2.1 Introduction to Microarray Data
2.2 Feature Selection
2.2.1 Feature Subset Generation
2.2.2 Feature Subset Search
2.2.3 Feature Subset Measures
2.3 Support Vector Machine
2.3.1 Principles of the Support Vector Machine
2.3.2 Introduction to Kernel Functions
Chapter 3 Cancer Microarray Data Classification
3.1 Workflow
3.2 Data Acquisition and Preprocessing
3.3 Data Normalization
3.4 Feature Selection by Distance Measure
3.5 Feature Selection by Dependence Measure
3.6 Classification with the Support Vector Machine
Chapter 4 Experiments and Analysis of Results
4.1 Experimental Results
4.1.1 Leukemia Microarray Data
4.1.2 Lung Cancer Microarray Data
4.1.3 Prostate Cancer Microarray Data
4.2 Comparison of Feature Selection Methods
4.3 Comparison of Kernel Functions
Chapter 5 Conclusion
References
Appendix: English Paper
Figure 1. Microarray experiment flowchart (from Introduction to Microarray [6])
Figure 2. Main characteristics of feature selection
Figure 3. Feature selection flowchart
Figure 4. Schematic of the support vector machine
Figure 5. Explanatory diagram of the support vector machine
Figure 6. Kernel function transformation
Figure 7. Workflow diagram
Figure 8. Libsvm data format
Figure 9. Leukemia microarray data: Euclidean distance feature selection results
Figure 10. Leukemia microarray data: Euclidean distance feature selection results after normalization
Figure 11. Leukemia microarray: Pearson correlation coefficient feature selection results
Figure 12. Leukemia microarray: Pearson correlation coefficient feature selection results after normalization
Figure 13. Leukemia microarray: positive correlation coefficient feature selection results
Figure 14. Leukemia microarray: positive correlation coefficient feature selection results after normalization
Figure 15. Leukemia microarray: negative correlation coefficient feature selection results
Figure 16. Leukemia microarray: negative correlation coefficient feature selection results after normalization
Figure 17. Lung cancer microarray data: Euclidean distance feature selection results
Figure 18. Lung cancer microarray data: Euclidean distance feature selection results after normalization
Figure 19. Lung cancer microarray: Pearson correlation coefficient feature selection results
Figure 20. Lung cancer microarray: Pearson correlation coefficient feature selection results after normalization
Figure 21. Lung cancer microarray: positive correlation coefficient feature selection results
Figure 22. Lung cancer microarray: negative correlation coefficient feature selection results
Figure 23. Prostate cancer microarray: Euclidean distance feature selection results
Figure 24. Prostate cancer microarray: positive correlation coefficient feature selection results
Figure 25. Leukemia microarray: polynomial degree tuning results
Figure 26. Lung cancer microarray: polynomial degree tuning results
Figure 27. Prostate cancer microarray: polynomial degree tuning results
Table 1. Comparison of feature selection measures
Table 2. Classification results before feature selection
Table 3. Summary of classification results for the leukemia microarray data
Table 4. Feature indices selected by correlation coefficient feature selection for the lung cancer microarray
Table 5. Summary of results for the lung cancer microarray
Table 6. Summary of results for the prostate cancer microarray
References [1] M. Gardiner-Garden and T. G. Littlejohn, "A Comparison of Microarray Databases," Briefings in Bioinformatics, Vol. 2, No. 2, May 2001, pp. 143-158.
[2] G. Piatetsky-Shapiro and P. Tamayo, "Microarray Data Mining: Facing the Challenges," ACM SIGKDD Explorations Newsletter, Vol. 5, No. 2, Dec. 2003, pp. 1-5.
[3] W. S. Noble, "Support Vector Machine Applications in Computational Biology," http://noble.gs.washington.edu/papers/noble_support.pdf (last accessed May 8, 2008).
[4] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene Selection for Cancer Classification using Support Vector Machines," Machine Learning, Vol. 46, No. 1-3, Jan. 2002, pp. 389-422.
[5] J. Zhang, R. Lee, and Y. J. Wang, "Support vector machine classifications for microarray expression data set," IEEE International Conference on Computational Intelligence and Multimedia Applications (ICCIMA) 2003, 27-30 Sep. 2003, pp. 67-71.
[6] V. Mittard-Runte, "Introduction to Microarray," http://www.cebitec.uni-bielefeld.de/groups/brf/software/emma_info/Microarray_introduction.pdf (last accessed May 8, 2008).
[7] H. Liu and H. Motoda, "Feature Selection for Knowledge Discovery and Data Mining," Kluwer Academic Publishers, 1998.
[8] M. Dash and H. Liu, "Feature Selection for Classification," Intelligent Data Analysis, Vol. 1, No. 3, Mar. 1997, pp. 131-156.
[9] C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, Vol. 2, No. 2, Jun. 1998, pp. 121-167.
[10] S. Cho and J. Ryu, "Classifying gene expression data of cancer using classifier ensemble with mutually exclusive features," Proceedings of the IEEE, Vol. 90, No. 11, Nov. 2002, pp. 1744-1753.
[11] S. Cho and H. Won, "Cancer classification using ensemble of neural networks with multiple significant gene subsets," Applied Intelligence, Vol. 26, No. 3, Jun. 2007, pp. 243-250.
[12] W. Fujibuchi and T. Kato, "Classification of heterogeneous microarray data by maximum entropy kernel," BMC Bioinformatics, Vol. 8, Jul. 26, 2007, pp. 267-277.
[13] W. Cao, L. Li, and X. Lv, "Kernel function characteristic analysis based on support vector machine in face recognition," Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, Hong Kong, Vol. 5, 19-22 Aug. 2007, pp. 2869-2873.
[14] "LIBSVM - A Library for Support Vector Machines," http://www.csie.ntu.edu.tw/~cjlin/libsvm/ (last accessed May 8, 2008).
[15] "Kent Ridge Bio-medical Data Set Repository," http://sdmc.lit.org.sg/GEDatasets/Datasets.html (last accessed Dec. 27, 2007).
[16] "Broad Institute Cancer Program Data Sets," http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi (last accessed May 8, 2008).
[17] H. Liu, "Evolving Feature Selection," IEEE Intelligent Systems, Vol. 20, No. 6, Nov. 2005, pp. 64-76.
[18] L. C. Molina, L. Belanche, and À. Nebot, "Feature Selection Algorithms: A Survey and Experimental Evaluation," http://www.lsi.upc.es/~belanche/research/R02-62.pdf (last accessed May 8, 2008).
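Thesis usage rights
  • The author grants library readers a royalty-free license to reproduce the print copy for academic purposes, available from 2009-07-17.
  • The author authorizes browsing and printing of the electronic full text, available from 2008-07-17.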

