System ID | U0002-0808201908424700 |
---|---|
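DOI | 10.6846/TKU.2019.00190 |
Thesis title (Chinese) | 以迴歸樹預測美國職棒大聯盟各球隊的年度勝率及晉級季後賽之名單 |
Thesis title (English) | Predicting Yearly Winning Percentage and Playoff List of MLB Teams by Regression Trees |
Thesis title (third language) | |
University | Tamkang University (淡江大學) |
Department (Chinese) | 大數據分析與商業智慧碩士學位學程 |
Department (English) | Master's Program in Big Data Analytics and Business Intelligence |
Foreign degree: university | |
Foreign degree: college | |
Foreign degree: graduate institute | |
Academic year | 107 (ROC calendar; 2018-2019) |
Semester | 2 |
Year of publication | 108 (ROC calendar; 2019) |
Author (Chinese) | 羅莉雯 |
Author (English) | Li-Wen LO |
Student ID | 606890027 |
Degree | Master's |
Language | English |
Second language | |
Oral defense date | 2019-06-21 |
Number of pages | 60 |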
Oral defense committee |
Advisor - 周清江 (cjou@mail.tku.edu.tw)
Committee member - 戴敏育 (myday@mail.tku.edu.tw)
Committee member - 陸承志 (imcjluh@saturn.yzu.edu.tw) |
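Keywords (Chinese, translated) |
Major League Baseball; regression tree; regression analysis; maximum likelihood regression tree; classification and regression tree; Akaike information criterion; mean absolute percentage error; mean reciprocal rank; winning percentage prediction; playoff list prediction |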
Keywords (English) |
MLB; Regression tree; Regression analysis; MLRT; CART; AIC; MAPE; Mean Reciprocal Rank; Winning percentage prediction; Playoff list prediction |
Keywords (third language) | |
Subject classification | |
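Chinese abstract (translated) |
Baseball is one of the most popular sports in the world. According to Google Trends, the number of people watching Major League Baseball (MLB) games has grown over the past five years. Many scholars and fans are keenly interested in predicting game outcomes, and they use various measures of team performance to do so. Previous studies achieved prediction accuracy of around 50%. However, some teams rest their key players once they have secured a playoff berth and field mainly substitutes. This study therefore uses only each team's average performance in the first half of each season from 2016 to 2018 to predict its yearly winning percentage. The methods used are Classification and Regression Trees (CART), Maximum Likelihood Regression Trees (MLRT), Classification and Correlation Coefficient Regression Trees (CCRT), and Maximum Likelihood and Correlation Coefficient Regression Trees (MCRT). Models are evaluated by Mean Absolute Percentage Error (MAPE), and the prediction MAPE of every model falls between 10% and 15%. Finally, we apply the predicted winning percentages to predict the MLB playoff list and the first-place team in each division. Because MAPE cannot take team rankings into account, this study uses Mean Reciprocal Rank (MRR) for predicting the first-place team in each division. The highest prediction accuracy for both the playoff list and the division first-place teams is 88%. The results show that MLRT outperforms CART. In playoff list prediction, our results are also better than those of previous studies. Experiments show that the methods used in this study can predict teams' yearly winning percentages, the MLB playoff list, and the first-place team in each division. |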
English abstract |
Baseball is one of the most popular sports in the world. According to Google Trends, more and more people have been watching Major League Baseball (MLB) games in the USA over the past five years. Many scholars and fans are interested in predicting game outcomes from teams' yearly performance. The prediction accuracy of previous studies was around 50 percent. However, some teams do not play at full strength once they have secured a playoff berth. Thus, instead of predicting from a team's performance over the whole season, we use its performance in the first half of the season to predict its yearly winning percentage. We apply Classification and Regression Trees (CART), Maximum Likelihood Regression Trees (MLRT), Classification and Correlation Coefficient Regression Trees (CCRT), and Maximum Likelihood and Correlation Coefficient Regression Trees (MCRT) separately. The prediction error for winning percentage, measured by Mean Absolute Percentage Error (MAPE), is between 10 and 15 percent. We then use the predicted winning percentages to obtain the playoff list, and we use Mean Reciprocal Rank (MRR) in predicting the best team in each division. The highest prediction accuracy for both the playoff list and the best team in each division is 88 percent. Our results show that MLRT performs better than CART. In playoff list prediction, our results are better than those of previous research. Experiments demonstrate that our methodology can be used to predict a team's yearly winning percentage, the MLB playoff list, and the best team in each division. |
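The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes the standard textbook definitions of MAPE and MRR; the record does not give the thesis's exact formulas, so the function names, team names, and numbers below are hypothetical and are not taken from the thesis.

```python
from typing import Dict, List, Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, returned in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


def mean_reciprocal_rank(divisions: List[Dict[str, float]], winners: List[str]) -> float:
    """Average of 1/rank of each true division winner in the predicted ordering."""
    total = 0.0
    for predicted_pct, winner in zip(divisions, winners):
        # Rank teams from highest to lowest predicted winning percentage.
        ranking = sorted(predicted_pct, key=predicted_pct.get, reverse=True)
        total += 1.0 / (ranking.index(winner) + 1)
    return total / len(winners)


# Hypothetical values for illustration only.
actual_pct = [0.500, 0.400, 0.625]
predicted_pct = [0.450, 0.440, 0.625]
divisions = [
    {"Team A": 0.610, "Team B": 0.550, "Team C": 0.480},
    {"Team D": 0.520, "Team E": 0.590, "Team F": 0.500},
]
true_winners = ["Team A", "Team D"]

print(round(mape(actual_pct, predicted_pct), 2))        # 6.67
print(mean_reciprocal_rank(divisions, true_winners))    # 0.75
```

In the second hypothetical division the true winner is ranked second by predicted winning percentage, so it contributes 1/2 rather than 0. This is how a rank-based measure credits a near miss that a plain hit-or-miss accuracy would not.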
Third-language abstract | |
Table of contents |
Contents
I. Introduction
1.1 Background
1.2 Motivation
1.3 Research goal
II. Related work
2.1 Baseball game result prediction
2.2 Baseball performance
2.3 Decision tree
2.4 Regression analysis
2.5 Multicollinearity
2.6 Akaike information criterion (AIC)
2.7 Mean absolute percentage error (MAPE)
III. Our prediction methodology
3.1 Data source
3.2 Data preprocessing
3.3 Our models
3.3.1. CART
3.3.2. MLRT
3.3.3. CCRT
3.3.4. MCRT
IV. Experiment results
4.1 Regression tree
4.2.1. Results of the CART model
4.2.2. Results of the MLRT model
4.2.3. Results of the CCRT model
4.2.4. Results of the MCRT model
4.2 Performance comparison of models
4.2.1. Performance comparison of models in first phase
4.2.2. Performance comparison of models in second phase
4.2.3. Performance comparison of models in first and second phase
4.2.4. Performance comparison of models in third phase
4.2.5. Performance comparison of models in second and third phases
4.3 Applying predicted winning percentage in playoff list prediction
4.4 Applying predicted winning percentage for predicting the best team in each region
V. Conclusions and future work
VI. References
Table contents
Table 1. Fifteen variables used in our models
Table 2. The variables used in our models
Table 3. The steps using multicollinearity solutions and correlation in our models
Table 4. The example data
Table 5. Regression formula of CART-15
Table 6. Regression formula of CART-5
Table 7. Regression formula of CART-V
Table 8. Regression formula of MLRT-15
Table 9. Regression formula of MLRT-5
Table 10. Regression formula of MLRT-V
Table 11. Regression formula of CCRT
Table 12. Regression formula of MCRT
Table 13. Average accuracy of each model in playoff
Table 14. Average accuracy of each model in each region
Figure contents
Figure 1. Process chart
Figure 2. The defensive variables in a game
Figure 3. The offensive variables in a game
Figure 4. The CART-15 model
Figure 5. The CART-5 model
Figure 6. The CART-V model
Figure 7. The MLRT-15 model
Figure 8. The MLRT-5 model
Figure 9. The MLRT-V model
Figure 10. The CCRT model
Figure 11. The MCRT model
Figure 12. The predicted MAPE of models in the first phase
Figure 13. The predicting MAPE in the second phase
Figure 14. Compare training and testing performances of models in first and second phases
Figure 15. The MAPE of CCRT and MCRT
Figure 16. Comparing testing performances of models in second, third phases and regression analysis
Figure 17. The predicted accuracy of each model in playoff
Figure 18. The predicted accuracy of each model in each region |
References |
[1] Akaike, H., A new look at the statistical model identification, IEEE Transactions on Automatic Control, 1974, Vol. 19, No. 6, p.716-723.
[2] Benesty, J., Chen, J., Huang, Y., Cohen, I., Noise Reduction in Speech Processing, 2009, p.37-40.
[3] Breiman, L., Friedman, J., Olshen, R., and Stone, C., Classification and Regression Trees, Belmont, CA: Wadsworth International Group, 1984.
[4] Chaudhuri, P., Loh, W., Nonparametric estimation of conditional quantiles using quantile regression trees, Bernoulli, 2002, Vol. 8, No. 5, p.561-576.
[5] Chen, C., Developing winner prediction models of professional baseball using data mining techniques, Master thesis of National Taiwan Sport University (in Chinese), 2011.
[6] Dong, Y., Liu, L., An optimized algorithm of decision trees based on correlation coefficients, Computer Engineering & Science (in Chinese), 2015, Vol. 37, No. 9.
[7] Draper, N., Smith, H., Applied regression analysis, 3rd edition, Wiley-Interscience, 1998.
[8] Fong, R., Studies on Predicting the Outcome of Professional Baseball Games with Data Mining Techniques: MLB as a Case, Master thesis of Chinese Culture University (in Chinese), 2013.
[9] Jiang, W., A One-to-One Game Forecast for The Professional Baseball League Using Neural Network - A Case Study on Los Angeles Dodgers and San Francisco Giants, Master thesis of Tungnan University (in Chinese), 2012.
[10] Kass, G., An Exploratory Technique for Investigating Large Quantities of Categorical Data, Applied Statistics, 1980, Vol. 29, No. 2, p.119-127.
[11] Lewis, C., Industrial and business forecasting methods: A practical guide to exponential smoothing and curve fitting, Butterworth-Heinemann, 1982.
[12] Loh, W., Regression trees with unbiased variable selection and interaction detection, Statistica Sinica, 2002, Vol. 12, No. 2, p.361-386.
[13] Ma, X., Diagnosis and Empirical Analysis on Multicollinearity in Linear Regression Model, Journal of Geodesy and Geoinformation Science (in Chinese), 2008, Vol. 35, No. 3, p.210-214.
[14] Menéndez, H., Vázquez, M., and Camacho, D., Mixed Clustering Methods to Forecast Baseball Trends, Intelligent Distributed Computing VIII, Studies in Computational Intelligence, 2015, Vol. 570.
[15] Miller, S., A derivation of the Pythagorean won-loss formula in baseball, Master thesis of Cornell University, 2006.
[16] Pan, Y., Purchase Decision of Sports Lottery by Money Line: A Case Study of MLB, Master thesis of Shih Hsin University (in Chinese), 2012.
[17] Pavitt, C., An Estimate of How Hitting, Pitching, Fielding, and Base stealing Impact Team Winning Percentages in Baseball, Journal of Quantitative Analysis in Sports, 2011, Vol. 7, No. 4.
[18] Quinlan, R., Induction of Decision Trees, Machine Learning, 1986, Vol. 1, No. 1, p.81-106.
[19] Schwarz, G., Estimating the dimension of a model, The Annals of Statistics, 1978, Vol. 6, No. 2, p.461-464.
[20] Shih, C., Huang, H., Ni, Y., Establishing Models to Predict the Outcomes of Baseball Games in CPBL, Physical Education Journal (in Chinese), 2010, Vol. 43, No. 2, p.115-130.
[21] Su, X., Wang, M., Fan, J., Maximum Likelihood Regression Trees, Journal of Computational and Graphical Statistics, 2004, Vol. 13, No. 3, p.586-598.
[22] Valero, S., Predicting Win-Loss outcomes in MLB regular season games - A comparative study using data mining methods, International Journal of Computer Science in Sport, 2016, Vol. 15, No. 2.
[23] Voorhees, E., The TREC-8 Question Answering Track Report, Proceedings of the 8th Text Retrieval Conference, 1999, p.77-82.
[24] Wang, C., A Study of the Winning Factors in Baseball Games: an Example of the 2012 Chinese Professional Baseball League (CPBL) Season, Master thesis of Fu Jen Catholic University (in Chinese), 2013.
[25] Weng, J., Zheng, Y., Qu, X., Yan, X., Development of a maximum likelihood regression tree-based model for predicting subway incident delay, Transportation Research Part C, 2015, Vol. 57, p.30-41.
[26] Yu, M., 2018 Preseason Major League Baseball Teams Performance Based upon DEA to Forecast Playoff Teams Model, Master thesis of Chung Hua University (in Chinese), 2018.
[27] Yu, T., Forecasting MLB Playoff Teams Using GA-SVM, 2017 IEEE International Conference on Applied System Innovation, 2017. |
Full-text access rights |
For questions, please contact the Library Digital Information Section at (02)2621-5656 ext. 2487, or by email.