| System ID | U0002-0209202403140400 |
|---|---|
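| DOI | 10.6846/tku202400735 |
| Thesis title (Chinese) | 基於機器學習之疾病風險預測的特徵選擇方法 |
| Thesis title (English) | Feature Selection Methods for Machine Learning-Based Disease Risk Prediction |
| Thesis title (third language) | |
| University | Tamkang University (淡江大學) |
| Department (Chinese) | 統計學系數據科學碩士班 |
| Department (English) | Master's Program in Data Science, Department of Statistics |
| Foreign degree: university | |
| Foreign degree: college | |
| Foreign degree: graduate institute | |
| Academic year | 112 (ROC calendar, 2023-24) |
| Semester | 2 |
| Year of publication | 113 (ROC calendar, 2024) |
| Graduate student (Chinese) | 陳品樺 |
| Graduate student (English) | Pin-Hua Chen |
| Student ID | 611890095 |
| Degree | Master's |
| Language | Traditional Chinese |
| Second language | |
| Oral defense date | 2024-07-18 |
| Number of pages | 64 |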
| Oral defense committee | Advisor: 謝璦如 (142438@mail.tku.edu.tw); Committee member: 張書瑋 (shwchang@mail.cgu.edu.tw); Committee member: 廖文伶 (wl0129@mail.cmu.edu.tw) |
| Keywords (Chinese) | 特徵選擇; 機器學習; 台灣人體生物資料庫; 多基因風險評分; 單核苷酸多型性集合分析 |
| Keywords (English) | Feature Selection; Machine Learning; Taiwan Biobank; Polygenic risk score; SNP-set analysis |
| Keywords (third language) | |
| Subject classification | |
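| Abstract (translated from the Chinese) | Because type 2 diabetes (T2D) has a high incidence worldwide and its complications are severe, accurate risk prediction is important for early diagnosis and prevention. This study examines how well several feature selection methods perform when machine learning models are used to predict T2D risk, and compares their accuracy. Using data from the Taiwan Biobank (TWB), we implemented three kinds of feature selection: C+T, Pruning, and SNP-set analysis. The SNP-set approach applied several machine learning algorithms, such as Random Forest, eXtreme Gradient Boosting (XGBoost), and Stochastic Gradient Descent (SGD); after feature selection, SNPs were clustered by Hamming distance, and the polygenic risk score (PRS) was then computed to evaluate risk prediction. Features selected by the Pruning method gave the highest prediction accuracy, with an AUC of 0.879; the C+T method reached an AUC of 0.805; within the SNP-set approach, machine learning and non-machine learning methods had average AUCs of 0.6 and 0.536, respectively. Random Forest was the most accurate of the machine learning methods, with an AUC of 0.6819, followed by SGD and XGBoost. These results indicate that machine learning methods are more accurate than traditional feature ranking methods for T2D risk prediction, although the Pruning method, which is commonly used for SNP selection, remains the most advantageous. The contribution of this study is a comparison of how different feature selection methods perform in disease risk prediction, backed by concrete data, which offers a useful reference for future applications in preventive medicine. Future research should explore data diversity, optimization of feature selection methods, and improved model interpretability in greater depth, to further raise the accuracy and widen the applicability of risk prediction. |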
| Abstract (English) | This study investigates how well various feature selection methods perform when machine learning models are used to predict the risk of type 2 diabetes (T2D), and compares their accuracy. Given the high incidence and severe public health impact of T2D worldwide, accurate risk prediction is crucial for early diagnosis and prevention. Using data from the Taiwan Biobank (TWB), we implemented three feature selection methods: the C+T method, the Pruning method, and SNP-set analysis. The SNP-set method applied several machine learning algorithms, such as Random Forest, eXtreme Gradient Boosting (XGBoost), and Stochastic Gradient Descent (SGD), for feature selection, then clustered SNPs by Hamming distance and calculated polygenic risk scores (PRS) to assess risk prediction. The Pruning method achieved the highest prediction accuracy, with an AUC of 0.879. The C+T method reached an AUC of 0.805, while the SNP-set method had lower average AUCs: 0.6 for machine learning methods and 0.536 for common non-machine learning methods. Among the machine learning methods, Random Forest had the highest accuracy, with an AUC of 0.6819, followed by SGD and XGBoost. These results suggest that, within SNP-set analysis, machine learning methods are more accurate than traditional feature ranking methods for T2D risk prediction; however, the Pruning method remains the most advantageous for SNP selection. The contribution of this study lies in comparing the efficacy of different feature selection methods in disease risk prediction, providing specific supporting data, and offering a reference for future applications in genomics and predictive medicine. Future research should examine data diversity, optimization of feature selection methods, and enhancement of model interpretability in greater depth to further improve the accuracy and applicability of risk prediction. |
| Abstract (third language) | |
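| Table of contents | Contents; List of Tables; List of Figures. Chapter 1 Introduction: 1.1 Research Background; 1.2 Research Motivation and Objectives; 1.3 Thesis Structure. Chapter 2 Literature Review: 2.1 Taiwan Biobank; 2.2 Applications of Machine Learning to Genetic Data; 2.3 Feature Selection Methods; 2.4 Disease Prediction. Chapter 3 Methods: 3.1 Research Framework; 3.2 Data Description and Preliminary Processing; 3.3 Genome-Wide Association Study (GWAS); 3.4 SNP-set Analysis Method; 3.5 Evaluation Metrics; 3.6 Polygenic Risk Score (PRS); 3.7 Cross-Validation (CV). Chapter 4 Results: 4.1 QC Results; 4.2 C+T; 4.3 Pruning; 4.4 SNP-Set; 4.5 Comparison of PRS Model Prediction Results. Chapter 5 Conclusions and Discussion: 5.1 Conclusions; 5.2 Discussion and Future Work. References. Appendix. List of Tables: Table 1 Quality control criteria and results; Table 2 Confusion matrix; Tables 3-4 QC results for the 10-fold training sets; Table 5 QC results for the 10-fold test sets; Tables 6-7 Clumping: T2D prediction results on the training and test sets; Tables 8-9 Clumping GWAS results; Tables 10-11 Pruning: T2D prediction results on the training and test sets; Tables 12-14 Pruning GWAS results; Table 15 XGB classifier classification results; Table 16 Gain Ratio: clustering results for the top 250 SNPs; Table 17 Gain Ratio PRSet results; Table 18 Gini Decrease: clustering results for the top 250 SNPs; Table 19 Gini Decrease PRSet results; Table 20 Random Forest: clustering results for the top 250 SNPs; Table 21 Random Forest PRSet results; Table 22 XGB: clustering results for the top 250 SNPs; Table 23 XGBoost PRSet results; Table 24 SGD: clustering results for the top 250 SNPs; Table 25 SGD PRSet results; Table 26 PRS prediction results for the five feature selection methods: lm(T2D~Sets); Table 27 Comparison of test-set prediction results of the PRS models by feature selection method. List of Figures: Figure 1 Research framework; Figure 2 SNP-set analysis flowchart; Figure 3 T2D Clumping Manhattan plot; Figure 4 Clumping PRS thresholds bar chart; Figure 5 Clumping PRS density plot; Figure 6 T2D Pruning Manhattan plot; Figure 7 Pruning PRS thresholds bar chart; Figure 8 Pruning PRS density plot; Figure 9 Overlap of the top 250 SNPs across feature rankings; Figure 10 XGBoost classification performance on the training set using the top 250 SNPs selected by machine learning methods; Figure 11 XGBoost classification performance on the training set using the top 100 SNPs selected by machine learning methods; Figure 12 Gain Ratio T2D PRS box plot; Figure 13 Gain Ratio PRS prediction ROC curve (training); Figure 14 Gini T2D PRS box plot; Figure 15 Gini PRS prediction ROC curve (training); Figure 16 Random Forest T2D PRS box plot; Figure 17 Random Forest PRS prediction ROC curve (training); Figure 18 XGBoost T2D PRS box plot; Figure 19 XGBoost PRS prediction ROC curve (training); Figure 20 SGD T2D PRS box plot; Figure 21 SGD PRS prediction ROC curve (training); Figure 22 Test-set PRS prediction results for the five feature selection methods. |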
| Full-text access rights | |
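
The abstract names three computations without giving formulas: Hamming distance between SNPs, a polygenic risk score (PRS), and AUC. The following is a minimal sketch of each under stated assumptions, not the thesis's implementation. The function names, the toy genotype matrix, and the weights are all hypothetical; genotypes are assumed to be coded as risk-allele counts (0, 1, or 2), and the PRS is assumed to be a plain weighted sum.

```python
from typing import Sequence


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions where two equal-length genotype vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def polygenic_risk_score(genotypes: Sequence[int], weights: Sequence[float]) -> float:
    """Weighted sum of allele counts for one individual (assumed PRS form)."""
    if len(genotypes) != len(weights):
        raise ValueError("one weight is needed per SNP")
    return sum(g * w for g, w in zip(genotypes, weights))


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random case outscores a random control (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


# Toy data, made up for illustration: rows are individuals, columns are SNPs.
genotype_matrix = [
    [2, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 2],
]
labels = [1, 1, 0, 0]        # 1 = case, 0 = control
weights = [0.5, 0.2, -0.3]   # hypothetical per-SNP effect sizes

# Distance between two SNPs, compared across individuals (column vectors).
snp0 = [row[0] for row in genotype_matrix]
snp2 = [row[2] for row in genotype_matrix]
snp_distance = hamming_distance(snp0, snp2)

# One score per individual, then the AUC of those scores against the labels.
scores = [polygenic_risk_score(row, weights) for row in genotype_matrix]
toy_auc = auc(scores, labels)
```

The pairwise AUC here is quadratic in sample size, which is fine for a toy example; a biobank-scale analysis would more plausibly rely on dedicated PRS software and a rank-based AUC routine.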