Electronic Theses and Dissertations Service

§ Browse thesis bibliographic record

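The electronic full text of this thesis is publicly available off campus from 2019-08-08.
The print copy of this thesis is publicly available from 2019-08-08.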

System ID	U0002-0808201908424700
DOI	10.6846/TKU.2019.00190
Title (Chinese)	以迴歸樹預測美國職棒大聯盟各球隊的年度勝率及晉級季後賽之名單
Title (English)	Predicting Yearly Winning Percentage and Playoff List of MLB Teams by Regression Trees
Title (third language)
Institution	淡江大學 (Tamkang University)
Department (Chinese)	大數據分析與商業智慧碩士學位學程
Department (English)	Master's Program in Big Data Analytics and Business Intelligence
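Foreign degree: university
Foreign degree: college
Foreign degree: graduate institute
Academic year	107 (ROC calendar; 2018-2019)
Semester	2
Publication year	108 (ROC calendar; 2019)
Author (Chinese)	羅莉雯
Author (English)	Li-Wen LO
Student ID	606890027
Degree	Master's
Language	English
Second language
Oral defense date	2019-06-21
Pages	60
Committee	Advisor - 周清江 (cjou@mail.tku.edu.tw); Member - 戴敏育 (myday@mail.tku.edu.tw); Member - 陸承志 (imcjluh@saturn.yzu.edu.tw)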
Keywords (Chinese)	美國職棒大聯盟; 迴歸樹; 迴歸分析; 最大概似迴歸樹; 分類迴歸樹; 赤池信息量準則; 平均絕對百分比誤差; 平均精度均值; 預測勝率; 預測季後賽名單
Keywords (English)	MLB; Regression tree; Regression analysis; MLRT; CART; AIC; MAPE; Mean Reciprocal Rank; Winning percentage prediction; Playoff list prediction
Keywords (third language)
Subject classification
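Abstract (Chinese, translated)	Baseball is one of the most popular sports in the world. According to Google Trends, more and more people have watched Major League Baseball (MLB) games over the past five years. Many scholars and fans are keenly interested in predicting game results, and they use various measures of team performance to do so. Previous studies achieved a prediction accuracy of about 50%. However, some teams rest their main players after securing a playoff spot and play mostly with substitutes. Therefore, this study uses only the average performance of each team in the first half of each season from 2016 to 2018 to predict the yearly winning percentage. The methods used are Classification and Regression Trees (CART), Maximum Likelihood Regression Trees (MLRT), Classification and Correlation Coefficient Regression Trees (CCRT), and Maximum Likelihood and Correlation Coefficient Regression Trees (MCRT). The models are evaluated by Mean Absolute Percentage Error (MAPE); the prediction MAPE of every model is between 10% and 15%. Finally, we apply the predicted winning percentages to predict the MLB playoff list and the first-place team in each region (division). Because MAPE cannot take team rankings into account, this study uses Mean Reciprocal Rank (MRR) for predicting the first-place team in each region. The highest prediction accuracy for both the playoff list and the first-place team in each region is 88%. The results show that MLRT outperforms CART. In playoff list prediction, our results are also better than those of previous studies. Experiments show that the methods used in this study can predict a team's yearly winning percentage, the MLB playoff list, and the first-place team in each region.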
Abstract (English)	Baseball is one of the most popular sports in the world. According to Google Trends, more and more people have watched Major League Baseball (MLB) games in the USA over the past five years. Many scholars and fans are interested in predicting game outcomes from teams' yearly performance. The prediction accuracy of previous studies was around 50 percent. However, some teams do not play seriously once they are sure to enter the playoffs. Thus, instead of predicting from a team's performance over the whole season, we use its performance in the first half of the season to predict its yearly winning percentage. We apply Classification and Regression Trees (CART), Maximum Likelihood Regression Trees (MLRT), Classification and Correlation Coefficient Regression Trees (CCRT), and Maximum Likelihood and Correlation Coefficient Regression Trees (MCRT) separately. Our prediction error for the winning percentage, measured by Mean Absolute Percentage Error (MAPE), is between 10 and 15 percent. We then apply the predicted winning percentages to obtain the playoff list, and we use Mean Reciprocal Rank (MRR) to evaluate predictions of the best team in each region. Our highest prediction accuracy for both the playoff list and the best team in each region is 88 percent. Our results show that MLRT performs better than CART. In predicting the playoff list, our results are better than those of previous research. Experiments demonstrate that our methodology can be used to predict a team's yearly winning percentage, the playoff list, and the best team in each region of MLB.
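The two evaluation measures named in the abstract can be sketched as follows. This is a minimal illustration using the standard textbook definitions of MAPE and MRR; the record does not give the thesis's exact computation, so the function names, the ranking helper, and the placeholder team labels are all assumptions made for the example.

```python
from typing import Dict, List, Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, returned in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def rank_by_prediction(predicted_pct: Dict[str, float]) -> List[str]:
    """Order teams from highest to lowest predicted winning percentage."""
    return sorted(predicted_pct, key=predicted_pct.get, reverse=True)


def mean_reciprocal_rank(rankings: Sequence[Sequence[str]],
                         true_winners: Sequence[str]) -> float:
    """Average of 1/rank of the true winner in each ranking (0 if absent)."""
    if len(rankings) != len(true_winners) or not true_winners:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for ranking, winner in zip(rankings, true_winners):
        if winner in ranking:
            total += 1.0 / (list(ranking).index(winner) + 1)
    return total / len(true_winners)


# Hypothetical data: two regions with three placeholder teams each.
region_1 = rank_by_prediction({"A": 0.60, "B": 0.55, "C": 0.40})
region_2 = rank_by_prediction({"D": 0.52, "E": 0.58, "F": 0.45})
mrr = mean_reciprocal_rank([region_1, region_2], ["A", "D"])
```

In this made-up example the true winner is ranked first in one region and second in the other, so the reciprocal ranks are 1 and 1/2. Unlike MAPE, the score depends only on where the actual best team falls in the predicted ordering, which is the reason the abstract gives for adding MRR.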
Abstract (third language)
Table of contents	Contents
I. Introduction 1
  1.1 Background 1
  1.2 Motivation 2
  1.3 Research goal 3
II. Related work 4
  2.1 Baseball game result prediction 4
  2.2 Baseball performance 7
  2.3 Decision tree 8
  2.4 Regression analysis 10
  2.5 Multicollinearity 10
  2.6 Akaike information criterion (AIC) 11
  2.7 Mean absolute percentage error (MAPE) 12
III. Our prediction methodology 14
  3.1 Data source 15
  3.2 Data preprocessing 17
  3.3 Our models 20
    3.3.1. CART 22
    3.3.2. MLRT 25
    3.3.3. CCRT 29
    3.3.4. MCRT 31
IV. Experiment results 35
  4.1 Regression tree 35
    4.2.1. Results of the CART model 35
    4.2.2. Results of the MLRT model 39
    4.2.3. Results of the CCRT model 42
    4.2.4. Results of the MCRT model 43
  4.2 Performance comparison of models 44
    4.2.1. Performance comparison of models in first phase 44
    4.2.2. Performance comparison of models in second phase 44
    4.2.3. Performance comparison of models in first and second phase 45
    4.2.4. Performance comparison of models in third phase 46
    4.2.5. Performance comparison of models in second and third phases 47
  4.3 Applying predicted winning percentage in playoff list prediction 48
  4.4 Applying predicted winning percentage for predicting the best team in each region 51
V. Conclusions and future work 54
VI. References 56

Table contents
Table 1. Fifteen variables used in our models 18
Table 2. The variables used in our models 20
Table 3. The steps using multicollinearity solutions and correlation in our models 21
Table 4. The example data 24
Table 5. Regression formula of CART-15 36
Table 6. Regression formula of CART-5 37
Table 7. Regression formula of CART-V 38
Table 8. Regression formula of MLRT-15 40
Table 9. Regression formula of MLRT-5 41
Table 10. Regression formula of MLRT-V 42
Table 11. Regression formula of CCRT 43
Table 12. Regression formula of MCRT 44
Table 13. Average accuracy of each model in playoff 50
Table 14. Average accuracy of each model in each region 53

Figure contents
Figure 1. Process chart 14
Figure 2. The defensive variables in a game 16
Figure 3. The offensive variables in a game 17
Figure 4. The CART-15 model 36
Figure 5. The CART-5 model 37
Figure 6. The CART-V model 38
Figure 7. The MLRT-15 model 39
Figure 8. The MLRT-5 model 40
Figure 9. The MLRT-V model 41
Figure 10. The CCRT model 42
Figure 11. The MCRT model 43
Figure 12. The predicted MAPE of models in the first phase 45
Figure 13. The predicting MAPE in the second phase 45
Figure 14. Compare training and testing performances of models in first and second phases 46
Figure 15. The MAPE of CCRT and MCRT 47
Figure 16. Comparing testing performances of models in second, third phases and regression analysis 48
Figure 17. The predicted accuracy of each model in playoff 50
Figure 18. The predicted accuracy of each model in each region 53
References
[1] Akaike, H., A new look at the statistical model identification, IEEE Transactions on Automatic Control, 1974, Vol. 19, No. 6, p.716-723.
[2] Benesty, J., Chen, J., Huang, Y., Cohen, I., Noise Reduction in Speech Processing, 2009, p.37-40.
[3] Breiman, L., Friedman, J., Olshen, R., and Stone, C., Classification and Regression Trees, Belmont, CA: Wadsworth International Group, 1984.
[4] Chaudhuri, P., Loh, W., Nonparametric estimation of conditional quantiles using quantile regression trees, Bernoulli, 2002, Vol. 8, No. 5, p.561-576.
[5] Chen, C., Developing winner prediction models of professional baseball using data mining techniques, Master thesis of National Taiwan Sport University (in Chinese), 2011.
[6] Dong, Y., Liu, L., An optimized algorithm of decision trees based on correlation coefficients, Computer Engineering & Science (in Chinese), 2015, Vol. 37, No. 9.
[7] Draper, N., Smith, H., Applied regression analysis, 3rd edition, Wiley-Interscience, 1998.
[8] Fong, R., Studies on Predicting the Outcome of Professional Baseball Games with Data Mining Techniques: MLB as a Case, Master thesis of Chinese Culture University (in Chinese), 2013.
[9] Jiang, W., A One-to-One Game Forecast for The Professional Baseball League Using Neural Network: A Case Study on Los Angeles Dodgers and San Francisco Giants, Master thesis of Tungnan University (in Chinese), 2012.
[10] Kass, G., An Exploratory Technique for Investigating Large Quantities of Categorical Data, Applied Statistics, 1980, Vol. 29, No. 2, p.119-127.
[11] Lewis, C., Industrial and business forecasting methods: A practical guide to exponential smoothing and curve fitting, Butterworth-Heinemann, 1982.
[12] Loh, W., Regression trees with unbiased variable selection and interaction detection, Statistica Sinica, 2002, Vol. 12, No. 2, p.361-386.
[13] Ma, X., Diagnosis and Empirical Analysis on Multicollinearity in Linear Regression Model, Journal of Geodesy and Geoinformation Science (in Chinese), 2008, Vol. 35, No. 3, p.210-214.
[14] Menéndez, H., Vázquez, M., and Camacho, D., Mixed Clustering Methods to Forecast Baseball Trends, Intelligent Distributed Computing VIII, Studies in Computational Intelligence, 2015, Vol. 570.
[15] Miller, S., A derivation of the Pythagorean won-loss formula in baseball, Master thesis of Cornell University, 2006.
[16] Pan, Y., Purchase Decision of Sports Lottery by Money Line: A Case Study of MLB, Master thesis of Shih Hsin University (in Chinese), 2012.
[17] Pavitt, C., An Estimate of How Hitting, Pitching, Fielding, and Base stealing Impact Team Winning Percentages in Baseball, Journal of Quantitative Analysis in Sports, 2011, Vol. 7, No. 4.
[18] Quinlan, R., Induction of Decision Trees, Machine Learning, 1986, Vol. 1, Issue 1, p.81-106.
[19] Schwarz, G., Estimating the dimension of a model, The Annals of Statistics, 1978, Vol. 6, No. 2, p.461-464.
[20] Shih, C., Huang, H., Ni, Y., Establishing Models to Predict the Outcomes of Baseball Games in CPBL, Physical Education Journal (in Chinese), 2010, Vol. 43, No. 2, p.115-130.
[21] Su, X., Wang, M., Fan, J., Maximum Likelihood Regression Trees, Journal of Computational and Graphical Statistics, 2004, Vol. 13, No. 3, p.586-598.
[22] Valero, S., Predicting Win-Loss outcomes in MLB regular season games: A comparative study using data mining methods, International Journal of Computer Science in Sport, 2016, Vol. 15, Issue 2.
[23] Voorhees, E., TREC-8 Question Answering Track Report, Proceedings of the 8th Text Retrieval Conference, 1999, p.77-82.
[24] Wang, C., A Study of the Winning Factors in Baseball Games: an Example of the 2012 Chinese Professional Baseball League (CPBL) Season, Master thesis of Fu Jen Catholic University (in Chinese), 2013.
[25] Weng, J., Zheng, Y., Qu, X., Yan, X., Development of a maximum likelihood regression tree-based model for predicting subway incident delay, Transportation Research Part C, 2014, Vol. 57, p.30-41.
[26] Yu, M., 2018 Preseason Major League Baseball Teams Performance Based upon DEA to Forecast Playoff Teams Model, Master thesis of Chung Hua University (in Chinese), 2018.
[27] Yu, T., Forecasting MLB Playoff Teams Using GA-SVM, 2017 IEEE International Conference on Applied System Innovation, 2017.
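Full-text access rights	On campus: print thesis publicly available immediately; electronic full text authorized for on-campus access; on-campus electronic thesis available immediately. Off campus: electronic thesis authorized for off-campus access; off-campus electronic thesis available immediately.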


Questions are welcome.
Library Digital Information Section: (02)2621-5656 ext. 2487, or by email