§ Thesis Bibliographic Record
  
System ID U0002-0808201908424700
DOI 10.6846/TKU.2019.00190
Title (Chinese) 以迴歸樹預測美國職棒大聯盟各球隊的年度勝率及晉級季後賽之名單
Title (English) Predicting Yearly Winning Percentage and Playoff List of MLB Teams by Regression Trees
Title (third language)
Institution Tamkang University
Department (Chinese) 大數據分析與商業智慧碩士學位學程
Department (English) Master's Program in Big Data Analytics and Business Intelligence
Foreign degree: university
Foreign degree: college
Foreign degree: graduate institute
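Academic year 107 (ROC calendar; 2018-2019)
Semester 2
Publication year 108 (ROC calendar; 2019)
Author (Chinese) 羅莉雯
Author (English) Li-Wen LO
Student ID 606890027
Degree Master's
Language English
Second language
Oral defense date 2019-06-21
Number of pages 60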
Oral examination committee Advisor - 周清江 (cjou@mail.tku.edu.tw)
Committee member - 戴敏育 (myday@mail.tku.edu.tw)
Committee member - 陸承志 (imcjluh@saturn.yzu.edu.tw)
Keywords (Chinese) 美國職棒大聯盟
迴歸樹
迴歸分析
最大概似迴歸樹
分類迴歸樹
赤池信息量準則
平均絕對百分比誤差
平均精度均值
預測勝率
預測季後賽名單
Keywords (English) MLB
Regression tree
Regression analysis
MLRT
CART
AIC
MAPE
Mean Reciprocal Rank
Winning percentage prediction
Playoff list prediction
Keywords (third language)
Subject classification
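Chinese abstract (English translation)
Baseball is one of the most popular sports in the world. According to Google Trends, the number of people watching Major League Baseball (MLB) games has grown over the past five years. Many scholars and fans are keenly interested in predicting game outcomes, and they use various measures of team performance to do so. Previous studies reached a prediction accuracy of roughly 50%. However, some teams rest their key players once they have secured a playoff spot and field mainly substitutes. This study therefore uses only the average of each team's first-half-season performance in each year from 2016 to 2018 to predict its yearly winning percentage. The methods used are Classification and Regression Trees (CART), Maximum Likelihood Regression Trees (MLRT), Classification and Correlation Coefficient Regression Trees (CCRT), and Maximum Likelihood and Correlation Coefficient Regression Trees (MCRT). Models are evaluated by Mean Absolute Percentage Error (MAPE), and the prediction MAPE of every model lies between 10% and 15%. Finally, the predicted winning percentages are applied to predict the MLB playoff list and the first-place team in each region. Because MAPE cannot take team rankings into account, the study uses Mean Reciprocal Rank (MRR) when predicting the first-place team in each region. The highest prediction accuracy is 88% for both the playoff list and the first-place team in each region. The results show that MLRT outperforms CART. In playoff list prediction, the results also surpass those of previous studies. The experiments show that the methods used here can predict a team's yearly winning percentage, the MLB playoff list, and the first-place team in each region.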
English abstract
Baseball is one of the most popular sports in the world. According to Google Trends, the number of people watching Major League Baseball (MLB) games in the USA has grown over the past five years. Many scholars and fans are interested in predicting game outcomes from the teams' yearly performance. The prediction accuracy of previous studies was around 50 percent. However, some teams do not play at full strength once they are sure to enter the playoffs. Thus, instead of predicting from a team's performance over the whole season, we use its performance in the first half of the season to predict its yearly winning percentage. We apply Classification and Regression Trees (CART), Maximum Likelihood Regression Trees (MLRT), Classification and Correlation Coefficient Regression Trees (CCRT), and Maximum Likelihood and Correlation Coefficient Regression Trees (MCRT) separately. The prediction error for the winning percentage, measured by Mean Absolute Percentage Error (MAPE), is between 10 and 15 percent. We then use the predicted winning percentages to obtain the playoff list, and we use Mean Reciprocal Rank (MRR) when predicting the best team in each region. The highest prediction accuracy is 88 percent for both the playoff list and the best team in each region. Our results show that MLRT performs better than CART. In predicting the playoff list, our results are better than those of previous research. The experiments demonstrate that our methodology can be used to predict a team's yearly winning percentage, the MLB playoff list, and the best team in each region.
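The two evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It uses the standard textbook definitions of MAPE and MRR; the function names, team labels, and numbers are hypothetical and are not taken from the thesis, and the thesis may apply the measures differently (for example, in how MRR is turned into a reported accuracy).

```python
from typing import Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, returned in percent."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def mean_reciprocal_rank(rankings: Sequence[Sequence[str]],
                         winners: Sequence[str]) -> float:
    """Average of 1/rank of the true winner in each predicted ranking.

    A winner that is missing from its ranking contributes 0.
    """
    if len(rankings) != len(winners) or len(winners) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for ranking, winner in zip(rankings, winners):
        if winner in ranking:
            total += 1.0 / (list(ranking).index(winner) + 1)
    return total / len(winners)


# Hypothetical winning percentages for three teams (not thesis data).
actual_pct = [0.500, 0.600, 0.400]
predicted_pct = [0.550, 0.540, 0.440]
example_mape = mape(actual_pct, predicted_pct)

# Hypothetical predicted rankings for two regions, best team first.
predicted_rankings = [["A", "B", "C"], ["B", "A", "C"]]
true_winners = ["A", "A"]
example_mrr = mean_reciprocal_rank(predicted_rankings, true_winners)
```

In this toy data every prediction is off by 10 percent of the actual value, so the MAPE is about 10. The true winner is ranked first in one region and second in the other, so the MRR is (1 + 1/2) / 2 = 0.75. MRR rewards placing the true winner near the top of the predicted ranking, which MAPE on winning percentages alone does not capture.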
Abstract (third language)
Table of contents
Contents
I.	Introduction	1
1.1	Background	1
1.2	Motivation	2
1.3	Research goal	3
II.	Related work	4
2.1	Baseball game result prediction	4
2.2	Baseball performance	7
2.3	Decision tree	8
2.4	Regression analysis	10
2.5	Multicollinearity	10
2.6	Akaike information criterion (AIC)	11
2.7	Mean absolute percentage error (MAPE)	12
III.	Our prediction methodology	14
3.1	Data source	15
3.2	Data preprocessing	17
3.3	Our models	20
3.3.1.	CART	22
3.3.2.	MLRT	25
3.3.3.	CCRT	29
3.3.4.	MCRT	31
IV.	Experiment results	35
4.1	Regression tree	35
4.1.1.	Results of the CART model	35
4.1.2.	Results of the MLRT model	39
4.1.3.	Results of the CCRT model	42
4.1.4.	Results of the MCRT model	43
4.2	Performance comparison of models	44
4.2.1.	Performance comparison of models in first phase	44
4.2.2.	Performance comparison of models in second phase	44
4.2.3.	Performance comparison of models in first and second phase	45
4.2.4.	Performance comparison of models in third phase	46
4.2.5.	Performance comparison of models in second and third phases	47
4.3	Applying predicted winning percentage in playoff list prediction	48
4.4	Applying predicted winning percentage for predicting the best team in each region	51
V.	Conclusions and future work	54
VI.	References	56

 
List of Tables
Table 1. Fifteen variables used in our models	18
Table 2. The variables used in our models	20
Table 3. The steps using multicollinearity solutions and correlation in our models	21
Table 4. The example data	24
Table 5. Regression formula of CART-15	36
Table 6. Regression formula of CART-5	37
Table 7. Regression formula of CART-V	38
Table 8. Regression formula of MLRT-15	40
Table 9. Regression formula of MLRT-5	41
Table 10. Regression formula of MLRT-V	42
Table 11. Regression formula of CCRT	43
Table 12. Regression formula of MCRT	44
Table 13. Average accuracy of each model in playoff	50
Table 14. Average accuracy of each model in each region	53

 
List of Figures
Figure 1. Process chart	14
Figure 2. The defensive variables in a game	16
Figure 3. The offensive variables in a game	17
Figure 4. The CART-15 model	36
Figure 5. The CART-5 model	37
Figure 6. The CART-V model	38
Figure 7. The MLRT-15 model	39
Figure 8. The MLRT-5 model	40
Figure 9. The MLRT-V model	41
Figure 10. The CCRT model	42
Figure 11. The MCRT model	43
Figure 12. The predicted MAPE of models in the first phase	45
Figure 13. The predicting MAPE in the second phase	45
Figure 14. Compare training and testing performances of models in first and second phases	46
Figure 15. The MAPE of CCRT and MCRT	47
Figure 16. Comparing testing performances of models in second, third phases and regression analysis	48
Figure 17. The predicted accuracy of each model in playoff	50
Figure 18. The predicted accuracy of each model in each region	53
References
[1]	Akaike, H., A new look at the statistical model identification, IEEE Transactions on Automatic Control, 1974, Vol. 19, No. 6, p.716-723.
[2]	Benesty, J., Chen, J., Huang, Y., Cohen, I., Noise Reduction in Speech Processing, 2009, p.37-40.
[3]	Breiman, L., Friedman, J., Olshen, R., and Stone, C., Classification and Regression Trees, Belmont, CA: Wadsworth International Group, 1984.
[4]	Chaudhuri, P., Loh, W., Nonparametric estimation of conditional quantiles using quantile regression trees, Bernoulli, 2002, Vol. 8, No. 5, p. 561–576.
[5]	Chen, C., Developing winner prediction models of professional baseball using data mining techniques, Master thesis of National Taiwan Sport University (in Chinese), 2011.
[6]	Dong, Y., Liu, L., An optimized algorithm of decision trees based on correlation coefficients, Computer Engineering & Science (in Chinese), 2015, Vol. 37, No. 9.
[7]	Draper, N., Smith, H., Applied regression analysis, 3rd edition, Wiley-Interscience, 1998.
[8]	Fong, R., Studies on Predicting the Outcome of Professional Baseball Games with Data Mining Techniques: MLB as a Case, Master thesis of Chinese Culture University (in Chinese), 2013.
[9]	Jiang, W., A One-to-One Game Forecast for The Professional Baseball League Using Neural Network: A Case Study on Los Angeles Dodgers and San Francisco Giants, Master thesis of Tungnan University (in Chinese), 2012.
[10]	Kass, G., An Exploratory Technique for Investigating Large Quantities of Categorical Data, Applied Statistics, 1980, Vol. 29, No. 2, p.119-127.
[11]	Lewis, C., Industrial and business forecasting methods: A practical guide to exponential smoothing and curve fitting, Butterworth-Heinemann, 1982.
[12]	Loh, W., Regression trees with unbiased variable selection and interaction detection, Statistica Sinica, 2002, Vol. 12, No. 2, p.361-386.
[13]	Ma, X., Diagnosis and Empirical Analysis on Multicollinearity in Linear Regression Model, Journal of Geodesy and Geoinformation Science (in Chinese), 2008, Vol. 35, No. 3, p.210-214. 
[14]	Menéndez, H., Vázquez, M., and Camacho, D., Mixed Clustering Methods to Forecast Baseball Trends, Intelligent Distributed Computing VIII, Studies in Computational Intelligence, 2015, Vol. 570.
[15]	Miller, S., A derivation of the Pythagorean won-loss formula in baseball, Master thesis of Cornell University, 2006.
[16]	Pan, Y., Purchase Decision of Sports Lottery by Money Line: A Case Study of MLB, Master thesis of Shih Hsin University (in Chinese), 2012.
[17]	Pavitt, C., An Estimate of How Hitting, Pitching, Fielding, and Base stealing Impact Team Winning Percentages in Baseball, Journal of Quantitative Analysis in Sports, 2011, Vol. 7, No. 4.
[18]	Quinlan, R., Induction of Decision Trees, Machine Learning, 1986, Vol. 1, No. 1, p.81-106.
[19]	Schwarz, G., Estimating the dimension of a model, The Annals of Statistics, 1978, Vol. 6, No.2, p. 461-464.
[20]	Shih, C., Huang, H., Ni, Y., Establishing Models to Predict the Outcomes of Baseball Games in CPBL, Physical Education Journal (in Chinese), 2010, Vol. 43, No. 2, p.115-130.
[21]	Su, X., Wang, M., Fan, J., Maximum Likelihood Regression Trees, Journal of Computational and Graphical Statistics, 2004, Vol.13, No.3, p.586-598.
[22]	Valero, S., Predicting Win-Loss outcomes in MLB regular season games – A comparative study using data mining methods, International Journal of Computer Science in Sport, 2016, Vol. 15, Issue 2.
[23]	Voorhees, E., The TREC-8 Question Answering Track Report, Proceedings of the 8th Text Retrieval Conference, 1999, p.77-82.
[24]	Wang, C., A Study of the Winning Factors in Baseball Games: an Example of the 2012 Chinese Professional Baseball League (CPBL) Season, Master thesis of Fu Jen Catholic University (in Chinese), 2013.
[25]	Weng, J., Zheng, Y., Qu, X., Yan, X., Development of a maximum likelihood regression tree-based model for predicting subway incident delay, Transportation Research Part C, 2014, Vol. 57, p.30-41.
[26]	Yu, M., 2018 Preseason Major League Baseball Teams Performance Based upon DEA to Forecast Playoff Teams Model, Master thesis of Chung Hua University (in Chinese), 2018.
[27]	Yu, T., Forecasting MLB Playoff Teams Using GA-SVM, 2017 IEEE International Conference on Applied System Innovation, 2017.
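Full-text access rights
On campus
Print copy available on campus immediately
Author consents to on-campus release of the electronic full text
Electronic full text available on campus immediately
Off campus
Authorization granted
Electronic full text available off campus immediately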
