Keywords (Chinese) machine learning
Keywords (English) machine learning
random forests
extreme gradient boosting
This study uses four classification models commonly used in machine learning (random forests, support vector machines, extreme gradient boosting, and neural networks) to predict whether credit card holders will default. In the data, the proportion of cardholders who default is low, so defaulters form the minority class of an imbalanced data set, and the overall prediction accuracy computed from the confusion matrix can be misleading. To mitigate the poor predictive performance caused by imbalanced data, this study applies four resampling methods to improve the classification models: the synthetic minority over-sampling technique (SMOTE), oversampling, undersampling, and random over-sampling examples (ROSE).
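The two simplest resampling methods named above, oversampling and undersampling, can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions, not the study's implementation: the helper names `random_oversample` and `random_undersample` and the toy data are hypothetical. SMOTE and ROSE differ in that they generate new synthetic minority examples rather than duplicating or dropping existing rows, and they are not shown here.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):
            j = rng.choice(idx)  # sample with replacement
            X_out.append(X[j])
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        keep.extend(rng.sample(idx, target))  # sample without replacement
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: 8 non-defaulters (0) and 2 defaulters (1).
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print(Counter(y_over), Counter(y_under))
```

In practice, resampling of this kind is normally applied to the training split only, so that the test set keeps the original class proportions.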

The models are evaluated on predictive accuracy, recall, precision, F1-score, the Matthews correlation coefficient, the ROC curve, AUC, cross-validation, and computing speed. With respect to resampling methods, the empirical results show that the synthetic minority over-sampling technique and random over-sampling examples improve recall significantly. With respect to classification models, random forests and extreme gradient boosting outperform support vector machines and neural networks. Overall, from a real-time analytics viewpoint, oversampling combined with extreme gradient boosting is recommended because of its high recall and computing speed.
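Several of the criteria listed above follow directly from the four cells of a binary confusion matrix. The sketch below computes them and shows why accuracy alone can mislead on imbalanced data. The function name `binary_metrics` and all counts are hypothetical illustrations, not results from this study.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, recall, precision, F1 and MCC from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1, "mcc": mcc}


# Hypothetical data: 1000 cardholders, 100 of whom default (positive class).
# A trivial classifier that predicts "no default" for everyone:
trivial = binary_metrics(tp=0, fp=0, fn=100, tn=900)
# A classifier that catches 70 defaulters at the cost of 150 false alarms:
useful = binary_metrics(tp=70, fp=150, fn=30, tn=750)

print(trivial)
print(useful)
```

The trivial classifier has the higher accuracy yet zero recall and zero MCC, which is why recall, F1-score, and MCC are reported alongside accuracy when the positive class is rare.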
