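§ Thesis Bibliographic Record
System ID U0002-2608202419502500
Title (Chinese) 基於機器學習與深度學習對肝癌患者存活期預測之支援系統
Title (English) A Support System for Predicting the Survival Time of Liver Cancer Patients Based on Machine Learning and Deep Learning
Title (third language)
Institution Tamkang University (淡江大學)
Department (Chinese) 資訊工程學系碩士班
Department (English) Department of Computer Science and Information Engineering
Foreign degree institution
Foreign degree college
Foreign degree graduate institute
Academic year 112 (ROC calendar)
Semester 2
Year of publication 113 (ROC calendar; 2024)
Author (Chinese) 蔡宗穎
Author (English) TSUNG-YING TSAI
Student ID 611410407
Degree Master's
Language Traditional Chinese
Second language
Oral defense date 2024-07-03
Pages 81
Oral defense committee
Committee member - 廖文華 (whliao@ntub.edu.tw)
Advisor - 黃仁俊 (victor@gms.tku.edu.tw)
Committee member - 張志勇 (cychang@mail.tku.edu.tw)
Keywords (Chinese) 人工智慧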
機器學習
深度學習
肝癌存活預測
數據分析
特徵提取
數據挖掘
Keywords (English) Artificial intelligence
machine learning
deep learning
liver cancer survival prediction
data analysis
feature extraction
data mining
Keywords (third language)
Subject classification
Abstract (Chinese)
在醫療領域,預測系統具有重要地位。準確的預測可以幫助醫療團隊制定更合適的治療計劃,包括手術、放療、化療等,從而提高治療效果。同時,對於患者的臨終關懷和心理支持,準確的預測能讓患者及早進行臨終關懷和心理支持,減少不必要的痛苦,並為家人提供適當的輔導和支持。由於人工智慧近年的快速發展,已可用在分析醫療數據、抓取特徵並進一步對結果進行預測。
本研究擬針對過去已去世肝癌病人的資料進行分析,並預測肝癌可能的存活期。此研究面臨以下挑戰。其一,由於肝癌相關的臨床數據中包含大量的變數,超過一百個欄位,因此從中篩選出對存活期具有顯著影響的特徵具有相當的難度。其二,每位患者各只有一筆資料,導致缺乏歷史資料,無法追蹤和分析其過去的病歷資料,故難以全面了解患者在治療過程中的反應及其與存活期之間的關聯。最後,數據中的不平衡問題也是一大挑戰。由於存活期較短的患者數量往往多於存活期較長的患者,這可能導致模型在預測時偏向於較短的存活期。
為克服醫療特徵選取的挑戰,本研究利用SHAP值分析找出最具影響力的特徵。在克服缺乏歷史資料的挑戰方面,本研究以Cox比例風險模型來計算患者的風險分數與風險機率,以分析治療方案的改進和新技術。此外,透過分群演算法,我們能夠找到與新患者相似的既往患者,以此參考最有可能的存活期長進行分析。接著,我們再使用機器學習與深度學習模型,分析和學習既有的醫療數據和經驗,以快速且精確地協助醫師評估肝癌患者的存活期。我們希望透過此方式,不僅為醫師提供有效的輔助工具以改進疾病診斷和治療決策,還能根據預測結果提供相應的醫療建議或策略,從而優化肝癌患者的治療過程。
實驗結果顯示,透過SHAP值分析、Cox比例風險模型的應用,以及分群演算法對模型的改進,肝癌患者存活期長預測模型的準確性得到了提升,特別是在增加分群數量和擴充訓練資料時。
Abstract (English)
In the medical field, predictive systems play a crucial role. Accurate predictions help medical teams develop more suitable treatment plans, including surgery, radiotherapy, and chemotherapy, thereby improving treatment outcomes. Accurate predictions also allow patients to begin end-of-life care and receive psychological support earlier, reducing unnecessary suffering, and allow their families to receive appropriate guidance and support. With the rapid development of artificial intelligence in recent years, it is now possible to analyze medical data, extract features, and predict outcomes.
This study analyzes data from deceased liver cancer patients to predict the likely survival time of liver cancer patients. The research faces several challenges. First, clinical data related to liver cancer contain a large number of variables, more than a hundred fields, so selecting the features that significantly affect survival time is difficult. Second, each patient has only a single record, so no historical data are available. Past medical records cannot be tracked or analyzed, which makes it hard to fully understand a patient's response to treatment and its relationship with survival time. Lastly, data imbalance is a significant challenge: patients with shorter survival times usually outnumber those with longer survival times, which may bias the model toward predicting shorter survival.
To address the challenge of medical feature selection, this study uses SHAP value analysis to identify the most influential features. To address the lack of historical data, we use the Cox proportional hazards model to calculate each patient's risk score and risk probability, in order to analyze treatment improvements and new technologies. Furthermore, clustering algorithms identify past patients similar to a new patient, whose outcomes serve as a reference for the most probable survival duration. Machine learning and deep learning models then learn from existing medical data and experience to help physicians assess the survival time of liver cancer patients quickly and accurately. We hope that this approach will not only give physicians an effective support tool for improving diagnosis and treatment decisions, but also offer corresponding medical recommendations or strategies based on the predicted results, optimizing the treatment process for liver cancer patients.
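As a rough illustration of the risk-score and similar-patient steps described above, the following minimal sketch computes a Cox-style relative risk, exp(β·x), and assigns a new patient to the nearest cluster centroid. It assumes features are already normalized; the function names, coefficients, centroids, and feature choices are hypothetical and are not taken from the thesis.

```python
import math


def relative_risk(features, coefficients):
    """Cox relative risk exp(beta . x); the baseline hazard is left out."""
    linear_predictor = sum(b * x for b, x in zip(coefficients, features))
    return math.exp(linear_predictor)


def nearest_cluster(features, centroids):
    """Index of the centroid closest to the patient (squared Euclidean distance)."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return min(range(len(centroids)),
               key=lambda i: squared_distance(features, centroids[i]))


# Hypothetical normalized features: [age at diagnosis, tumor size]
coefficients = [0.5, 1.0]              # illustrative values, not fitted
centroids = [[0.2, 0.1], [0.8, 0.9]]   # illustrative cluster centers
new_patient = [0.7, 0.8]

risk = relative_risk(new_patient, coefficients)
cluster = nearest_cluster(new_patient, centroids)
print(f"relative risk: {risk:.3f}, nearest cluster: {cluster}")
```

In practice the coefficients would come from a fitted survival model and the centroids from a clustering run over past patients; the sketch only shows how the two quantities are derived for one new record.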
Experimental results show that SHAP value analysis, the Cox proportional hazards model, and the clustering-based improvements increased the accuracy of the survival duration prediction model for liver cancer patients, particularly as the number of clusters increased and the training data were expanded.
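For reference, accuracy and the related metrics used to evaluate a classifier can be derived from confusion-matrix counts. The sketch below assumes a binary split (for example, shorter versus longer survival), which may differ from the class definition used in the thesis; the function name and the counts are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative counts only, not results from the thesis.
metrics = classification_metrics(tp=40, fp=10, fn=10, tn=40)
print(metrics)
```

With imbalanced survival classes, precision and F1 are worth reading alongside accuracy, since accuracy alone can look high when the model favors the majority class.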
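Abstract (third language)
Table of contents
Contents

Contents	VI
List of Figures	IX
List of Tables	XII
Chapter 1. Introduction	1
Chapter 2. Related Work	8
2-1 Machine Learning	9
2-2 Deep Learning	10
2-3 Survival Analysis	13
Chapter 3. Background	18
3-1 Medical Feature Analysis	18
3-1-1 Pearson Correlation Coefficient	18
3-1-2 Spearman Correlation Coefficient	21
3-1-3 SHAP Values	22
3-1-4 Cox Proportional Hazards Model	24
3-2 Similar-Patient Clustering	27
3-3 Machine Learning	28
3-3-1 Random Forest Model	29
3-3-2 Gradient Boosting Model	30
3-3-3 XGBoost Model	32
3-4 Deep Learning	34
3-4-1 MLP Model	35
3-4-2 DNN Model	37
Chapter 4. System Architecture	40
4-1 Environment and Problem Description	40
4-1-1 Problems to Be Solved	40
4-1-2 Objectives	40
4-2 System Architecture	41
4-2-1 Preprocessing	41
A.	Data Collection	42
B.	Data Preprocessing	42
C.	Visualization and Observation of Survival Duration and Liver Cancer Patient Data	45
D.	Feature Engineering	47
4-2-2 Model Construction	49
A.	Risk Analysis	51
B.	Similarity Analysis	55
C.	Model Construction and Training	57
D.	Model Evaluation	60
Chapter 5. Experimental Analysis	63
5-1 Environment Settings	63
5-2 Experimental Data	64
5-3 Experimental Results	64
Chapter 6. Conclusion	76
References	78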

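List of Figures

Figure 1. Research objectives	2
Figure 2. Research framework	3
Figure 3. Pearson correlation coefficient heatmap (from [25])	20
Figure 4. SHAP value feature analysis (from [26])	23
Figure 5. Cox proportional hazards model analysis (from [21])	26
Figure 6. K-Means clustering of similar patients (from [15])	28
Figure 7. Random Forest architecture (from [14])	29
Figure 8. MLP architecture (from [10])	36
Figure 9. DNN architecture (from [9])	38
Figure 10. Preprocessing workflow	42
Figure 11. Handling missing values	43
Figure 12. Data transformation	44
Figure 13. Data normalization	45
Figure 14. Distribution of survival time among liver cancer patients	46
Figure 15. Relationship between sex and survival duration probability among liver cancer patients	46
Figure 16. Relationship between survival time and age at diagnosis among liver cancer patients	46
Figure 17. Correlation analysis of tumor size and patient age	47
Figure 18. Important medical fields identified by Pearson correlation analysis	48
Figure 19. Important medical fields identified by Spearman correlation analysis	49
Figure 20. Feature extraction	49
Figure 21. Model construction overview	51
Figure 22. Risk score	52
Figure 23. Feature importance from SHAP value analysis	53
Figure 24. Feature analysis with XGBoost as the base model	54
Figure 25. Patient clustering strategy	56
Figure 26. Classification of patient survival duration	58
Figure 27. XGBoost model	59
Figure 28. XGBoost combined with DNN model	60
Figure 29. Effect of data size and number of clusters on Accuracy	66
Figure 30. Effect of data size and number of clusters on Precision	66
Figure 31. Effect of data size and number of clusters on F1-Score	67
Figure 32. Survival duration distribution of patients without surgical resection of the primary site	68
Figure 33. Survival duration distribution of patients who underwent tumor destruction surgery	69
Figure 34. Survival duration distribution of patients who underwent resection surgery	70
Figure 35. Effect of data size and surgery type on Accuracy	71
Figure 36. Effect of data size and surgery type on Precision	71
Figure 37. Effect of data size and surgery type on F1-Score	72
Figure 38. Relationship between number of clusters and Silhouette Score	73
Figure 39. Feature influence relationships with Random Forest as the base model	75
Figure 40. Feature influence relationships with Gradient Boosting as the base model	75
Figure 41. Feature influence relationships with XGBoost as the base model	75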

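List of Tables

Table 1. Comparison of related work	17
Table 2. Confusion matrix	61
Table 3. Environment packages and versions	63
Table 4. Experimental parameters	64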
References

[1]	P. Sathiyanarayanan, S. Pavithra, M. Sai Saranya and M. Makeswari, "Identification of Breast Cancer Using The Decision Tree Algorithm," 2019 IEEE International Conference on System, Computation, Automation and Networking (ICSCAN), Pondicherry, India, 2019, pp. 1-6.
[2]	Z.-J. Lee, C.-Y. Lee and J. Yao, "A Distributed Simulated Annealing Based Decision Tree (DSABDT) for Cancer Classification," 2021 IEEE 4th International Conference on Knowledge Innovation and Invention (ICKII), Taichung, Taiwan, 2021, pp. 1-4.
[3]	G. Sajiv and G. Ramkumar, "Machine Learning based Analysis of Histopathological Images of Breast Cancer Classification using Decision Tree Classifier," 2022 Sixth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), Dharan, Nepal, 2022, pp. 989-995.
[4]	N. Meenakshisundaram and G. Ramkumar, "An Optimized Machine learning model for Automatic Prediction of Cervical Cancer Using Decision Tree Classifier," 2022 International Conference on Computer, Power and Communications (ICCPC), Chennai, India, 2022, pp. 336-341.
[5]	P. Hamsagayathri and P. Sampath, "Priority based decision tree classifier for breast cancer detection," 2017 4th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, 2017, pp. 1-6.
[6]	G. Sruthi, C. L. Ram, M. K. Sai, B. P. Singh, N. Majhotra and N. Sharma, "Cancer Prediction using Machine Learning," 2022 2nd International Conference on Innovative Practices in Technology and Management (ICIPTM), Gautam Buddha Nagar, India, 2022, pp. 217-221.
[7]	D. P. Rajan, K. S. Kannan, P. Divya and S. Velliangiri, "Comparative Analysis of Liver diseases by using Machine Learning Techniques," 2022 International Conference on Emerging Smart Computing and Informatics (ESCI), Pune, India, 2022, pp. 1-5.
[8]	G. S. P. Ghantasala, A. Kunchala, S. R, V. N. B, Y. Raparthi and P. Vidyullatha, "Machine Learning Based Ensemble Classifier using Wisconsin Dataset For Breast Cancer Prediction," 2023 International Conference on Integrated Intelligence and Communication Systems (ICIICS), Kalaburagi, India, 2023, pp. 1-4.
[9]	V. Asha, B. Saju, S. Mathew, A. M. V, Y. Swapna and S. P. Sreeja, "Breast Cancer classification using Neural networks," 2023 International Conference on Intelligent and Innovative Technologies in Computing, Electrical and Electronics (IITCEE), Bengaluru, India, 2023, pp. 900-905.
[10]	I. S. Isa, Z. Saad, S. Omar, M. K. Osman, K. A. Ahmad and H. A. M. Sakim, "Suitable MLP Network Activation Functions for Breast Cancer and Thyroid Disease Detection," 2010 Second International Conference on Computational Intelligence, Modelling and Simulation, Bali, Indonesia, 2010, pp. 39-44.
[11]	Y.-W. Wan, J. Nagorski, G. I. Allen, Z. Li and Z. Liu, "Identifying cancer biomarkers through a network regularized Cox model," 2013 IEEE International Workshop on Genomic Signal Processing and Statistics, Houston, TX, USA, 2013, pp. 36-39.
[12]	N. Thongpim, C. Choksuchat, T. Bejrananda and S. Matayong, "On Predicting Survival Opportunities for Prostate Cancer by COX Regression in PSU Patients Data," 2020 17th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), Phuket, Thailand, 2020, pp. 775-778.
[13]	H. Kutrani and S. Eltalhi, "Decision Tree Algorithms for Predictive Modeling in Breast Cancer Treatment," 2022 IEEE 2nd International Maghreb Meeting of the Conference on Sciences and Techniques of Automatic Control and Computer Engineering (MI-STA), Sabratha, Libya, 2022, pp. 223-227.
[14]	M. Ranjan, A. Shukla, K. Soni, S. Varma, M. Kuliha and U. Singh, "Cancer Prediction Using Random Forest and Deep Learning Techniques," 2022 IEEE 11th International Conference on Communication Systems and Network Technologies (CSNT), Indore, India, 2022, pp. 227-231.
[15]	S. Marne, S. Churi and M. Marne, "Predicting Breast Cancer using effective Classification with Decision Tree and K Means Clustering technique," 2020 International Conference on Emerging Smart Computing and Informatics (ESCI), Pune, India, 2020, pp. 39-42.
[16]	A. J. M. Rani, S. Nishanthini, D. C. J. Josephine, H. Venugopal, S. G. Nissi and V. Jacintha, "Liver Disease Prediction using Semi Supervised based Machine Learning Algorithm," 2022 3rd International Conference on Smart Electronics and Communication (ICOSEC), Trichy, India, 2022, pp. 1389-1392.
[17]	C. N. Rao, K. Chatrapathy, A. J. Fathima, G. Sathish, S. Mukherjee and P. C. S. Reddy, "Intelligent Deep Learning Framework for Breast Cancer Prediction using Feature Ensemble Learning," 2023 4th IEEE Global Conference for Advancement in Technology (GCAT), Bangalore, India, 2023, pp. 1-5.
[18]	N. Jillani, A. M. Khattak, M. Z. Asghar and H. Ullah, "Efficient Diagnosis of Liver Disease using Deep Learning Technique," 2023 IEEE International Symposium on Medical Measurements and Applications (MeMeA), Jeju, Republic of Korea, 2023, pp. 1-6.
[19]	A. R and A. L, "Effective Methods to Detect Liver Cancer Using CNN and Deep Learning Algorithms," 2023 International Conference on Advances in Computing, Communication and Applied Informatics (ACCAI), Chennai, India, 2023, pp. 1-7.
[20]	M. Boddam and W. Kim, "Interpretable Deep Learning Models With Concept Whitening Layers," 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Istanbul, Turkiye, 2023, pp. 1824-1827.
[21]	N. Ashleik, R. Shembesh, S. Saad and R. E. Seaiti, "Factors Associated with Recurrence of Breast Cancer Using Cox Proportional Hazard Model," 2023 IEEE 11th International Conference on Systems and Control (ICSC), Sousse, Tunisia, 2023, pp. 626-631.
[22]	S. Arora, A. Kumar and S. Sambhav, "Analysing the Effect of Gender on Mortality of COVID-19 Patients through Cox-Proportional Hazard Model," 2021 International Conference on Intelligent Technologies (CONIT), Hubli, India, 2021, pp. 1-5.
[23]	A. Gupta, S. Bharuka and P. T, "Feature-Engineered Exploration: Prediction of Smoking Status using Machine Learning Ensemble Techniques and Survival Analysis Using Cox Proportional Hazard Model," 2024 2nd International Conference on Networking and Communications (ICNWC), Chennai, India, 2024, pp. 1-6.
[24]	P. Liu, B. Fu, S. X. Yang, L. Deng, X. Zhong and H. Zheng, "Optimizing Survival Analysis of XGBoost for Ties to Predict Disease Progression of Breast Cancer," in IEEE Transactions on Biomedical Engineering, vol. 68, no. 1, pp. 148-160.
[25]	M. S. Anggreainy and D. Fitrianah, "A Logistic Regression Model using Pearson Correlation for Breast Cancer Classification," 2023 6th International Seminar on Research of Information Technology and Intelligent Systems (ISRITI), Batam, Indonesia, 2023, pp. 284-288.
[26]	M. N. S. Choudary, V. B. Bommineni, G. Tarun, G. P. Reddy and G. Gopakumar, "Predicting Covid-19 Positive Cases and Analysis on the Relevance of Features using SHAP (SHapley Additive exPlanation)," 2021 Second International Conference on Electronics and Sustainable Communication Systems (ICESC), Coimbatore, India, 2021, pp. 1892-1896.
[27]	D. Kumar and B. Klefsjö, "Proportional hazards model: a review," Reliability Engineering & System Safety, vol. 44, no. 2, 1994, pp. 177-188.
[28]	S. Lundberg and S. Lee, "A Unified Approach to Interpreting Model Predictions," arXiv:1705.07874, 2017.
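Full-text access permissions
National Central Library
Does not consent to a royalty-free license to the National Central Library
On campus
Print thesis available on campus immediately
Electronic full text: authorization not granted
On-campus bibliographic record available immediately
Off campus
Does not consent to licensing to database vendors
Off-campus bibliographic record delayed until 2029-08-10; Chinese and English abstracts also delayed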