Tamkang University Chueh Sheng Memorial Library (TKU Library)


(Downloading the electronic full text is restricted to Tamkang University IP addresses.)
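System ID: U0002-0808201908424700
Title (Chinese): 以迴歸樹預測美國職棒大聯盟各球隊的年度勝率及晉級季後賽之名單
Title (English): Predicting Yearly Winning Percentage and Playoff List of MLB Teams by Regression Trees
Institution: Tamkang University
Department (Chinese): 大數據分析與商業智慧碩士學位學程
Department (English): Master's Program in Big Data Analytics and Business Intelligence
Academic year: 107 (ROC calendar; 2018-2019)
Semester: 2
Publication year: 108 (ROC calendar; 2019)
Author (Chinese name): 羅莉雯
Author (English name): Li-Wen LO
Student ID: 606890027
Degree: Master's
Language: English
Oral defense date: 2019-06-21
Number of pages: 60
Oral defense committee: Advisor: 周清江
Member: 戴敏育
Member: 陸承志
Keywords (translated from Chinese): Major League Baseball; regression tree; regression analysis; maximum likelihood regression tree; classification and regression tree; Akaike information criterion; mean absolute percentage error; mean reciprocal rank; winning percentage prediction; playoff list prediction
Keywords (English): MLB; Regression tree; Regression analysis; MLRT; CART; AIC; MAPE; Mean Reciprocal Rank; Winning percentage prediction; Playoff list prediction
Subject classification: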
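Abstract (translated from Chinese): Baseball is one of the most popular sports in the world. According to Google Trends, more and more people have watched Major League Baseball (MLB) games over the past five years. Many scholars and fans are keenly interested in predicting game outcomes from team performance statistics. Previous studies achieved prediction accuracies of around 50%. However, some teams rest their key players once they have secured a playoff berth and field mainly substitutes. This study therefore uses only each team's average performance in the first half of each season from 2016 to 2018 to predict its yearly winning percentage. The methods used are Classification and Regression Trees (CART), Maximum Likelihood Regression Trees (MLRT), Classification and Correlation Coefficient Regression Trees (CCRT), and Maximum Likelihood and Correlation Coefficient Regression Trees (MCRT). Mean Absolute Percentage Error (MAPE) is the evaluation criterion, and every model's prediction MAPE falls between 10% and 15%. Finally, the predicted winning percentages are applied to predict the MLB playoff list and the first-place team in each region (division). Because MAPE cannot take team rankings into account, Mean Reciprocal Rank (MRR) is used for the first-place prediction. The highest prediction accuracy for both the playoff list and the first-place teams is 88%. The results show that MLRT outperforms CART, and the playoff list predictions also improve on previous research. The experiments show that the methods used here can predict a team's yearly winning percentage, the MLB playoff list, and the first-place team in each region.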
Abstract (English): Baseball is one of the most popular sports in the world. According to Google Trends, a growing number of people have watched Major League Baseball (MLB) games over the past five years. Many scholars and fans are interested in predicting game outcomes from teams' yearly performances. The prediction accuracies of previous studies were around 50 percent. However, some teams do not play at full strength once they are sure to enter the playoffs. Thus, instead of basing predictions on a team's performance over the whole season, we use its performance in the first half of the season to predict its yearly winning percentage. We apply four models separately: Classification and Regression Trees (CART), Maximum Likelihood Regression Trees (MLRT), Classification and Correlation Coefficient Regression Trees (CCRT), and Maximum Likelihood and Correlation Coefficient Regression Trees (MCRT). The prediction error for winning percentage, measured by Mean Absolute Percentage Error (MAPE), is between 10 and 15 percent. We then apply the predicted winning percentages to obtain the playoff list, and we use Mean Reciprocal Rank (MRR) to evaluate the prediction of the best team in each region (division). The highest prediction accuracy for both the playoff list and the best team in each region is 88 percent. Our results show that MLRT performs better than CART, and our playoff list predictions are better than those of previous research. The experiments demonstrate that our methodology can be used to predict a team's yearly winning percentage, the MLB playoff list, and the best team in each region.
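The two evaluation metrics named in the abstract can be sketched in a few lines of Python. This is a minimal illustration using the standard textbook definitions of MAPE and MRR, not the thesis's own code. The function names, the team labels, and all numbers are hypothetical, and the assumption that each region's teams are ranked by predicted winning percentage (best first) is ours rather than something the record states.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned in percent.

    actual, predicted: equal-length sequences of winning percentages.
    Actual values must be non-zero because each error is divided by them.
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def mean_reciprocal_rank(rankings, true_winners):
    """Mean reciprocal rank of the true first-place team.

    rankings: one list per region, teams ordered by predicted winning
              percentage, best first (assumed ranking rule).
    true_winners: the actual first-place team of each region.
    A winner missing from its ranking contributes 0.
    """
    if len(rankings) != len(true_winners) or len(rankings) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for ranked, winner in zip(rankings, true_winners):
        if winner in ranked:
            total += 1.0 / (ranked.index(winner) + 1)
    return total / len(true_winners)


# Hypothetical data: three teams' actual vs. predicted winning percentages.
actual_pct = [0.600, 0.500, 0.400]
predicted_pct = [0.540, 0.550, 0.400]
print(f"MAPE: {mape(actual_pct, predicted_pct):.2f}%")

# Hypothetical data: two regions, teams ranked by predicted winning percentage.
predicted_rankings = [["A", "B", "C"], ["E", "D", "F"]]
actual_first_place = ["A", "D"]
print(f"MRR: {mean_reciprocal_rank(predicted_rankings, actual_first_place):.2f}")
```

MAPE scores each winning percentage independently, whereas MRR rewards a model for placing the true first-place team near the top of its predicted ordering, so it reflects ranking quality.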
Table of contents
I. Introduction 1
1.1 Background 1
1.2 Motivation 2
1.3 Research goal 3
II. Related work 4
2.1 Baseball game result prediction 4
2.2 Baseball performance 7
2.3 Decision tree 8
2.4 Regression analysis 10
2.5 Multicollinearity 10
2.6 Akaike information criterion (AIC) 11
2.7 Mean absolute percentage error (MAPE) 12
III. Our prediction methodology 14
3.1 Data source 15
3.2 Data preprocessing 17
3.3 Our models 20
3.3.1. CART 22
3.3.2. MLRT 25
3.3.3. CCRT 29
3.3.4. MCRT 31
IV. Experiment results 35
4.1 Regression tree 35
4.1.1. Results of the CART model 35
4.1.2. Results of the MLRT model 39
4.1.3. Results of the CCRT model 42
4.1.4. Results of the MCRT model 43
4.2 Performance comparison of models 44
4.2.1. Performance comparison of models in first phase 44
4.2.2. Performance comparison of models in second phase 44
4.2.3. Performance comparison of models in first and second phase 45
4.2.4. Performance comparison of models in third phase 46
4.2.5. Performance comparison of models in second and third phases 47
4.3 Applying predicted winning percentage in playoff list prediction 48
4.4 Applying predicted winning percentage for predicting the best team in each region 51
V. Conclusions and future work 54
VI. References 56


List of tables
Table 1. Fifteen variables used in our models 18
Table 2. The variables used in our models 20
Table 3. The steps using multicollinearity solutions and correlation in our models 21
Table 4. The example data 24
Table 5. Regression formula of CART-15 36
Table 6. Regression formula of CART-5 37
Table 7. Regression formula of CART-V 38
Table 8. Regression formula of MLRT-15 40
Table 9. Regression formula of MLRT-5 41
Table 10. Regression formula of MLRT-V 42
Table 11. Regression formula of CCRT 43
Table 12. Regression formula of MCRT 44
Table 13. Average accuracy of each model in playoff 50
Table 14. Average accuracy of each model in each region 53


List of figures
Figure 1. Process chart 14
Figure 2. The defensive variables in a game 16
Figure 3. The offensive variables in a game 17
Figure 4. The CART-15 model 36
Figure 5. The CART-5 model 37
Figure 6. The CART-V model 38
Figure 7. The MLRT-15 model 39
Figure 8. The MLRT-5 model 40
Figure 9. The MLRT-V model 41
Figure 10. The CCRT model 42
Figure 11. The MCRT model 43
Figure 12. The predicted MAPE of models in the first phase 45
Figure 13. The predicting MAPE in the second phase 45
Figure 14. Compare training and testing performances of models in first and second phases 46
Figure 15. The MAPE of CCRT and MCRT 47
Figure 16. Comparing testing performances of models in second, third phases and regression analysis 48
Figure 17. The predicted accuracy of each model in playoff 50
Figure 18. The predicted accuracy of each model in each region 53
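Thesis usage rights
  • The author agrees to grant library patrons a royalty-free license to reproduce the print copy for academic purposes, available from 2019-08-08.
  • The author agrees to license browsing and printing of the electronic full text, available from 2019-08-08.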

