§ Browse Thesis Bibliographic Record

System ID: U0002-1607202523354200
DOI: 10.6846/tku202500571
Title (Chinese): SMOTE插補法結合類神經網路、分類提升器與零膨脹伯努利迴歸模型之效能比較
Title (English): A Study on the Performance Comparison of the ANN, CatBoost, and Zero-Inflated Bernoulli Regression Model with SMOTE
Institution: Tamkang University (淡江大學)
Department (Chinese): 統計學系應用統計學碩士班
Department (English): Department of Statistics
Academic year: 113 (ROC calendar)
Semester: 2
Year of publication: 114 (ROC calendar; 2025 CE)
Author (Chinese): 張家瑞
Author (English): Chia-Jui Chang
Student ID: 612650217
Degree: Master's
Language: Traditional Chinese
Oral defense date: 2025-07-09
Pages: 48
Committee: Advisor - 蔡宗儒 (078031@mail.tku.edu.tw); Members - 楊文, 李名鏞
Keywords (Chinese): 零膨脹資料, SMOTE, 類神經網路, CatBoost, 零膨脹伯努利迴歸模型
Keywords (English): Zero-inflated data, SMOTE, ANN, CatBoost, zero-inflated Bernoulli regression model
Abstract (Chinese)
本研究探討針對具零膨脹特性的二分類資料結合過採樣技術SMOTE後,傳統機器學習模型與統計模型在分類效能上的表現比較。研究選用三種具代表性的模型進行分析:類神經網路(ANN)、分類提升器(CatBoost)與零膨脹伯努利迴歸模型(ZIB)。透過蒙地卡羅模擬與乳癌資料實例,評估不同模型於SMOTE插補前後之敏感度、準確度與特異度等效能指標。研究結果發現,ZIB模型在敏感度方面相對穩定,能有效處理結構性零值與類別不平衡的問題;而CatBoost則於整體分類效能上展現優異表現。此外,SMOTE插補對ANN與CatBoost的敏感度有顯著提升,惟同時也可能導致過度配適現象。研究結果可作為實務上處理零膨脹與不平衡資料時選擇模型的參考依據。
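As background for the zero-inflated Bernoulli (ZIB) model named in the abstract, a common way to write a zero-inflated Bernoulli mixture is sketched below. This follows the general zero-inflated formulation in the literature (e.g. Hall, 2000), not necessarily the exact parameterization used in this thesis; the symbols π (structural-zero probability) and p (Bernoulli success probability) are illustrative.

```latex
% A structural zero occurs with probability \pi; otherwise the outcome
% follows an ordinary Bernoulli(p) distribution:
P(Y = 0) = \pi + (1 - \pi)(1 - p), \qquad P(Y = 1) = (1 - \pi)\,p
% In regression form, both components are typically linked to covariates:
% \operatorname{logit}(\pi_i) = \mathbf{z}_i^\top \boldsymbol{\gamma}, \qquad
% \operatorname{logit}(p_i) = \mathbf{x}_i^\top \boldsymbol{\beta}
```

The mixture inflates the probability of zeros relative to a plain Bernoulli model, which is why such models can separate structural zeros from ordinary negative outcomes in imbalanced binary data.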
英文摘要
This study investigates the classification performance of traditional machine learning and statistical models on binary data with zero inflation, particularly after applying the synthetic minority over-sampling technique (SMOTE). Three models are compared, including the Artificial Neural Network (ANN), Categorical Boosting (CatBoost, CBT), and Zero-Inflated Bernoulli Model (ZIB). Monte Carlo simulations and a breast cancer data set are used to assess model’s performance in terms of the indices of sensitivity, accuracy, and specificity before and after using SMOTE augmentation. The results show that the ZIB model demonstrates stable sensitivity and effectively handles structural zeros and class imbalance. CatBoost, on the other hand, exhibits outstanding overall classification performance. While SMOTE improves the sensitivity of ANN and CatBoost, it may also introduce overfitting . This study provides practical guidance for selecting appropriate models when dealing with zero-inflated and imbalanced data.
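The two ingredients the abstract describes, SMOTE's neighbour-interpolation step and the confusion-matrix metrics (sensitivity, accuracy, specificity), can be sketched in a few lines. This is a minimal illustration of the general technique (Chawla et al., 2002), not the thesis's own code; the function names are illustrative.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority-class points by interpolating
    between a randomly chosen minority point and one of its k nearest
    minority-class neighbours (the core idea of SMOTE)."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    # k nearest neighbours of each point, excluding the point itself
    nn = np.argsort(d, axis=1)[:, 1:k + 1]
    out = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(n)             # pick a minority sample
        b = nn[a, rng.integers(k)]      # pick one of its neighbours
        lam = rng.random()              # interpolation weight in [0, 1]
        out[i] = X_min[a] + lam * (X_min[b] - X_min[a])
    return out

def sensitivity_accuracy_specificity(y_true, y_pred):
    """Compute the three metrics used in the study from a 2x2 confusion matrix."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sens = tp / (tp + fn)               # true positive rate
    spec = tn / (tn + fp)               # true negative rate
    acc = (tp + tn) / len(y_true)
    return sens, acc, spec
```

With an imbalanced training set, `smote_oversample(X[y == 1], n_new=(y == 0).sum() - (y == 1).sum())` would balance the classes before fitting any of the three classifiers; the synthetic points are convex combinations of real minority points, which is also why overfitting to the minority region can appear, as noted in the abstract.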
Table of Contents
Contents
List of Tables
List of Figures
Chapter 1  Introduction
1.1  Research Motivation and Background
1.2  Research Objectives
1.3  Thesis Organization
Chapter 2  Literature Review
2.1  Literature Survey
Chapter 3  Model Descriptions
3.1  Artificial Neural Networks
3.2  CatBoost
3.3  Zero-Inflated Bernoulli Model
3.4  SMOTE
Chapter 4  Monte Carlo Simulation
4.1  Performance Metrics
4.2  Simulation Parameter Settings
4.3  Performance Comparison of the Models
Chapter 5  Empirical Analysis
5.1  Breast Cancer Data Set
5.2  Model Analysis
Chapter 6  Conclusions and Future Research Directions
6.1  Conclusions
6.2  Future Research Directions
References

List of Tables
Table 4.1.1  Confusion matrix
Table 4.1.2  Confusion matrix terminology
Table 4.2.1  Resampling results of the ANN model with different numbers of hidden-layer nodes
Table 4.3.1  Results of 1000 simulation replications for ANN, CBT, and ZIB on test sets of different sample sizes
Table 4.3.2  Results of 1000 simulation replications combined with SMOTE on test sets of different sample sizes
Table 5.1.1  Variable descriptions for the breast cancer data set
Table 5.2.1  Performance comparison of the models on the breast cancer training set
Table 5.2.2  Performance comparison of the models on the breast cancer test set
Table 5.2.3  Performance of the ANN model on the breast cancer training set before and after augmentation
Table 5.2.4  Performance of the ANN model on the breast cancer test set before and after augmentation
Table 5.2.5  Performance of the CBT model on the breast cancer training set before and after augmentation
Table 5.2.6  Performance of the CBT model on the breast cancer test set before and after augmentation
Table 5.2.7  Performance of the ZIB model on the breast cancer training set before and after augmentation
Table 5.2.8  Performance of the ZIB model on the breast cancer test set before and after augmentation
 
List of Figures
Figure 4.3.1  Evaluation results of the ANN, CBT, and ZIB methods
Figure 4.3.2  Performance evaluation of the models with SMOTE
Figure 4.3.3  Performance of ANN before and after augmentation
Figure 4.3.4  Performance of CBT before and after augmentation
Figure 4.3.5  Performance of ZIB before and after augmentation
Figure 5.2.1  Metric values and distribution intervals of the models on the training set
Figure 5.2.2  Metric values and distribution intervals of the models on the test set
Figure 5.2.3  Metric values and distribution intervals of ANN on the training set before and after augmentation
Figure 5.2.4  Metric values and distribution intervals of ANN on the test set before and after augmentation
Figure 5.2.5  Metric values and distribution intervals of CBT on the training set before and after augmentation
Figure 5.2.6  Metric values and distribution intervals of CBT on the test set before and after augmentation
Figure 5.2.7  Metric values and distribution intervals of ZIB on the training set before and after augmentation
Figure 5.2.8  Metric values and distribution intervals of ZIB on the test set before and after augmentation
Thesis Full-Text Usage Authorization
National Central Library: A royalty-free license is granted to the National Central Library; the bibliographic record and electronic full text are made publicly available on the Internet immediately after the authorization form is submitted.
On campus: The printed thesis is available immediately; the electronic full text is licensed for worldwide public access and is available on campus immediately.
Off campus: Licensing to database vendors is authorized; the electronic full text is available off campus immediately.
