Electronic Theses and Dissertations Service

§ Browse Thesis Bibliographic Record

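The electronic full text of this thesis has been publicly available off campus since 2019-06-24.
The print copy of this thesis has been publicly available since 2019-06-24.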

System ID	U0002-2406201908534500
DOI	10.6846/TKU.2019.00758
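Title (Chinese)	資料探勘技術在信用卡不平衡資料上之應用
Title (English)	In Application of Data Mining Technology in Credit Card Imbalanced Data
Title (third language)
Institution	Tamkang University (淡江大學)
Department (Chinese)	大數據分析與商業智慧碩士學位學程
Department (English)	Master's Program In Big Data Analytics and Business Intelligence
Foreign degree: university name
Foreign degree: college name
Foreign degree: graduate institute name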
Academic year	107 (ROC calendar; 2018-2019)
Semester	2
Publication year	108 (ROC calendar; 2019)
Author (Chinese)	郭珉辰
Author (English)	Min-Chen Guo
Student ID	606890050
Degree	Master's
Language	Traditional Chinese
Second language
Oral defense date	2019-06-14
Pages	54
Committee	Advisor - 林志娟 (117604@gms.tku.edu.tw); Member - 張慶暉; Member - 羅惠瓊
Keywords (Chinese)	機器學習; 過採樣; 欠採樣; 隨機森林; 極限梯度提升
Keywords (English)	machine learning; oversampling; undersampling; random forests; extreme gradient boosting
Keywords (third language)
Subject classification
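Abstract (Chinese)	This study uses four classification models commonly used in machine learning (random forest, support vector machine, extreme gradient boosting, and neural network) to predict whether credit card holders will default. In the actual data the proportion of defaulting cardholders is low, so defaulters form the minority class of an imbalanced data set, and the overall prediction accuracy computed from a classifier's confusion matrix is easily distorted. To mitigate the poor predictive performance caused by imbalanced data, this study applies four sampling methods to improve classifier performance: the synthetic minority over-sampling technique (SMOTE), oversampling, undersampling, and random over-sampling examples (ROSE). The criteria used to evaluate the models are prediction accuracy, recall, precision, F1-score, the Matthews correlation coefficient, ROC, AUC, cross-validation, and model computation time. The empirical results show that, among the resampling methods, SMOTE and ROSE clearly improve the low recall originally caused by data imbalance; among the classification models, random forest and extreme gradient boosting both outperform the support vector machine and the neural network. Overall, when computation time is a consideration, extreme gradient boosting is the recommended model.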
Abstract (English)	This study mainly uses four traditional machine learning classification models (random forests, support vector machines, extreme gradient boosting, and neural networks) to predict whether credit card holders will default. The data are imbalanced because the proportion of defaulting cardholders is low, so the overall prediction accuracy obtained from the confusion matrix can be misleading. To overcome the disadvantages caused by imbalanced data, this study uses four resampling methods to enhance the performance of the classification models: the synthetic minority over-sampling technique, oversampling, undersampling, and random over-sampling examples. The criteria used to evaluate the models are prediction accuracy, recall, precision, F1-score, the Matthews correlation coefficient, ROC, AUC, cross-validation, and computing speed. From the perspective of resampling methods, the empirical results show that the synthetic minority over-sampling technique and random over-sampling examples improve recall significantly. From the perspective of classification models, random forest and extreme gradient boosting outperform the support vector machine and neural network models. Overall, from a real-time analytics viewpoint, oversampling combined with extreme gradient boosting is recommended because of its high recall and computing speed.
Abstract (third language)
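Table of contents
Contents
Chapter 1 Introduction 1
  Section 1 Research Background and Motivation 1
  Section 2 Research Objectives 2
  Section 3 Research Process 2
Chapter 2 Literature Review 4
  Section 1 Credit Risk and Credit Risk Scoring 4
  Section 2 Imbalanced Classification 5
  Section 3 Classification Models 8
  Section 4 Matthews Correlation Coefficient 10
Chapter 3 Research Methods 11
  Section 1 Research Framework 11
  Section 2 Methods for Handling Imbalanced Data 12
    1. Oversampling 12
    2. Undersampling 12
    3. Random over-sampling examples (ROSE) 12
    4. Synthetic minority over-sampling technique (SMOTE) 13
  Section 3 Classification Models 14
    1. Random forest 14
    2. Support vector machine 17
    3. Gradient boosting decision tree (GBDT) 19
    4. Extreme gradient boosting 20
    5. Neural network 23
  Section 4 Model Evaluation 25
    1. Confusion matrix 25
    2. Receiver operating characteristic (ROC) 27
    3. Area under the ROC curve 28
    4. Cross-validation 30
Chapter 4 Empirical Results and Analysis 32
  Section 1 Data Description 32
  Section 2 Empirical Results 34
Chapter 5 Conclusion 51
References 52
  Chinese-language references 52
  Web references 52
  English-language references 53
List of Tables
  Table 2.1 Comparison of credit risk assessment methods 5
  Table 2.2 Effect of resampling methods on classifiers 7
  Table 3.1 Confusion matrix 25
  Table 3.2 Model predictive ability 29
  Table 4.1 Variable descriptions 33
  Table 4.2 Evaluation summary of the classification models on the original data 34
  Table 4.3 AUC and computation time of the classification models on the original data 36
  Table 4.4 10-fold cross-validation results on the original data 36
  Table 4.5 Evaluation summary of the classification models under SMOTE 37
  Table 4.6 AUC and computation time of the classification models under SMOTE 39
  Table 4.7 10-fold cross-validation results under SMOTE 39
  Table 4.8 Evaluation summary of the classification models under oversampling 40
  Table 4.9 AUC and computation time of the classification models under oversampling 42
  Table 4.10 10-fold cross-validation results under oversampling 42
  Table 4.11 Evaluation summary of the classification models under undersampling 43
  Table 4.12 AUC and computation time of the classification models under undersampling 45
  Table 4.13 10-fold cross-validation results under undersampling 45
  Table 4.14 Evaluation summary of the classification models under ROSE 46
  Table 4.15 AUC and computation time of the classification models under ROSE 48
  Table 4.16 10-fold cross-validation results under ROSE 48
  Table 4.17 Overall summary of sampling methods and classification models 50
List of Figures
  Figure 1.1 Research process flowchart 3
  Figure 3.1 Research framework 11
  Figure 3.2 Illustration of SMOTE 13
  Figure 3.3 Illustration of a decision tree 15
  Figure 3.4 Illustration of the random forest construction process 16
  Figure 3.5 Illustration of a support vector machine 17
  Figure 3.6 Illustration of support vector machine kernel functions 19
  Figure 3.7 Illustration of the XGBoost tree ensemble model 21
  Figure 3.8 Illustration of XGBoost tree complexity 21
  Figure 3.9 Illustration of a neural network 23
  Figure 3.10 Illustration of the ROC curve 29
  Figure 4.1 ROC of the classification models on the original data 35
  Figure 4.2 ROC of the classification models under SMOTE 38
  Figure 4.3 ROC of the classification models under oversampling 41
  Figure 4.4 ROC of the classification models under undersampling 44
  Figure 4.5 ROC of the classification models under ROSE 47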
References
Chinese-language references
呂美慧，2000。銀行授信評等模式---Logistic Regression 之應用，國立政治大學金融學系碩士論文。
李美笑，2002。信用卡持卡人信用風險之研究，逢甲大學保險學系碩士論文。
周俊宏，2006。運用類神經網路與支撐向量機於個人信用卡授信決策之研究，國立台灣科技大學資訊管理系碩士論文。
林宸翊，2009。應用於行為評等之Random forests 及其變數選擇法，輔仁大學應用統計學研究所碩士論文。
黃衍浩，2012。應用資料探勘技術建置兩階段之信用評等預測模式，國立中正大學會計與資訊科技研究所碩士論文。
葉丞峻，2017。適用於分類變數資料的二元不平衡資料自動分類系統，淡江大學統計學系應用統計學碩士班碩士論文。
龔昶元，1998。Logistic Regression 模式應用於信用卡信用風險審核之研究－以國內某銀行信用卡中心為例，台北銀行月刊，28：9，35-49 頁。
Web references
財團法人聯合信用卡處理中心，2014。支付卡發展史，取自︰https://www.nccc.com.tw/wps/wcm/connect/zh/home/KnowledgeSharing/PaymentCardKnowledge
English-language references
Boser, B. E., Guyon, I. M. and Vapnik, V. N., 1992. A training algorithm for optimal margin classifiers, COLT '92 Proceedings of the fifth annual workshop on Computational learning theory, 144-152.
Breiman, L., 1997. Arcing The Edge, Technical Report 486, Statistics Department, University of California, Berkeley.
Breiman, L., 2001. Random Forests, Machine Learning, 45(1), 5-32.
Brown, I. and Mues, C., 2012. An experimental comparison of classification algorithms for imbalanced credit scoring data sets, Expert Systems with Applications, 39(3), 3245-3453.
Butaru, F., Chen, Q., Clark, B., Das, S., Lo, A. and Siddique, A., 2016. Risk and risk management in the credit card industry, Journal of Banking and Finance, 72, 218-239.
Chawla, N. V., Bowyer, K. W., Hall, L. O. and Kegelmeyer, W. P., 2002. SMOTE: Synthetic Minority Over-sampling Technique, Journal of Artificial Intelligence Research, 16, 321-357.
Chen, T. and Guestrin, C., 2016. XGBoost: A Scalable Tree Boosting System, In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, USA, 785-794, ACM.
Chicco, D., 2017. Ten quick tips for machine learning in computational biology, BioData Mining, 10(35).
Fan, H., 2013. Land-Cover Mapping in the Nujiang Grand Canyon: Integrating Spectral, Textural, and Topographic Data in a Random Forest Classifier, International Journal of Remote Sensing, 34(21), 7545-7567.
Fan, R. E., Chen, P. H. and Lin, C. J., 2005. Working Set Selection Using Second Order Information for Training Support Vector Machines, The Journal of Machine Learning Research, 6, 1889-1918.
Fawcett, T., 2006. An introduction to ROC analysis, Pattern Recognition Letters, 27(8), 861-874.
Fernández, A., García, S., Galar, M., Prati, R. C., Krawczyk, B. and Herrera, F., 2018. Learning from Imbalanced Data Sets, CH14, Springer International Publishing.
Hosmer, D. W. and Lemeshow, S., 2000. Applied Logistic Regression, 2nd ed., New York, Chichester, Wiley.
Liang, G., Chi, Z., Hai, B. G. and Yong, R. S., 2006. Credit Scoring Model Based on Neural Network with Particle Swarm Optimization, Advances in Natural Computation, 76-79.
Lunardon, N., Menardi, G. and Torelli, N., 2014. ROSE: A Package for Binary Imbalanced Learning, The R Journal, 6(1), 79-89.
Neema, S. and Soibam, B., 2017. The comparison of machine learning methods to achieve most cost-effective prediction for credit card default, Journal of Management Science and Business Intelligence, 2(2), 36-41.
Nello, C. and John, S. T., 2000. An introduction to support Vector Machines: and other kernel-based learning methods, Cambridge University Press, New York, USA.
Powers, D. M., 2011. Evaluation: from Precision, Recall and F-measure to ROC, Informedness, Markedness and Correlation, Journal of Machine Learning Technologies, 3(1), 37-63.
Stein, R. M., 2005. The relationship between default prediction and lending profits: Integrating ROC analysis and loan pricing, Journal of Banking & Finance, 29, 1213-1236.
Veganzones, D. and Séverina, E., 2018. An investigation of bankruptcy prediction in imbalanced datasets, Decision Support Systems, 112, 111-124.
Venables, W. N. and Ripley, B. D., 2002. Modern Applied Statistics with S, 4th ed., Springer Verlag.
Xie, J. and Qiu, Z., 2006. The effect of imbalanced data sets on LDA: A theoretical and empirical analysis, Pattern Recognition, 40(2), 557-562.
Yeh, I. C. and Lien, C. H., 2009. The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients, Expert Systems with Applications, 36(2), 2473-2480.
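Full-text access permissions	On campus: print thesis publicly available immediately; electronic full text authorized for on-campus access; on-campus electronic thesis publicly available immediately. Off campus: electronic thesis authorized for off-campus access and publicly available immediately.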
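
Illustrative note (not part of the thesis record): the abstract states that overall accuracy computed from the confusion matrix can mislead when defaulters are a small minority, and lists recall, precision, F1-score, and the Matthews correlation coefficient as evaluation criteria. The following is a minimal Python sketch of those metrics using only the standard library. The function names, the 100-cardholder data set, and both sets of predictions are hypothetical and chosen only for illustration; they are not taken from the thesis, which may have computed these quantities differently.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary problem; `positive` marks default."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def evaluate(tp, fp, tn, fn):
    """Compute accuracy, recall, precision, F1 and MCC from confusion counts.

    Ratios with a zero denominator are reported as 0.0 (a common convention).
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1, "mcc": mcc}


# Hypothetical data: 100 cardholders, of whom 10 default (label 1).
y_true = [1] * 10 + [0] * 90

# Classifier A (hypothetical): predicts "no default" for everyone.
pred_a = [0] * 100
baseline = evaluate(*confusion_counts(y_true, pred_a))

# Classifier B (hypothetical): catches 6 of 10 defaulters, with 9 false alarms.
pred_b = [1] * 6 + [0] * 4 + [1] * 9 + [0] * 81
model = evaluate(*confusion_counts(y_true, pred_b))

print("always-negative:", baseline)
print("hypothetical model:", model)
```

In this made-up example the always-negative classifier has the higher accuracy (0.90 versus 0.87) even though it never identifies a defaulter, while recall, F1 and MCC all favor the second classifier. This is the kind of distortion the abstract attributes to imbalanced data.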

For questions, please contact:
Library Digital Information Section, (02)2621-5656 ext. 2487, or by email