§ Thesis Bibliographic Record
  
System ID U0002-0907202017423100
DOI 10.6846/TKU.2020.00215
Title (Chinese) 結合長短期記憶模型與近端策略優化為基礎之策略增強式學習
Title (English) A Strategy-Enhanced Reinforcement Learning by fusing LSTM and PPO models
University Tamkang University
Department (Chinese) 資訊工程學系碩士班
Department (English) Department of Computer Science and Information Engineering
Academic Year 108 (2019-2020)
Semester 2
Year of Publication 109 (2020)
Author (Chinese) 林威廷
Author (English) Wei-Ting Lin
Student ID 607410247
Degree Master's
Language English
Date of Oral Defense 2020-07-14
Number of Pages 24
Oral Defense Committee Advisor - 王英宏 (inhon@mail.tku.edu.tw)
Committee member - 陳以錚 (ycchen@mgt.ncu.edu.tw)
Committee member - 慧霖 (121678@mail.tku.edu.tw)
Keywords (Chinese) 長短期記憶模型
近端策略優化
增強式學習
Keywords (English) Long short-term memory model
Proximal Policy Optimization
Reinforcement Learning
Chinese Abstract
With the rise of research on artificial intelligence, many machine learning techniques have gradually matured and have been applied to one field after another. The game domain, however, still leaves considerable room for development because of the complexity of games: a single action by an agent can give rise to many different situations, which not only greatly increases model complexity but also lengthens training time. This study therefore proposes a strategy-enhanced reinforcement learning method (SEPPO) that combines a long short-term memory model with proximal policy optimization: agent strategies are formulated from features, and proximal policy optimization is refined by incorporating the long short-term memory model. Using this strategy judgment, reinforcement learning can reach the same performance more quickly. Experimental results in the game domain confirm that SEPPO effectively alleviates the problem of excessively long training time.
English Abstract
With the rise of research related to artificial intelligence, many machine learning technologies have gradually matured and have been applied in various fields one after another. The game domain, however, still has great room for development; the reason is the complexity of games. A single action by an agent may create many different situations, which not only greatly increases model complexity but also lengthens training time. Therefore, this study proposes a strategy-enhanced proximal policy optimization (SEPPO) that combines a long short-term memory model with proximal policy optimization: it formulates agent strategies based on features and optimizes PPO by incorporating the LSTM model. Using this strategy judgment, reinforcement learning can achieve the same results faster. Experimental results in the game domain confirm that SEPPO effectively reduces the problem of excessively long training time.
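The abstract's core idea, conditioning a PPO-style actor-critic on an LSTM summary of recent observations, can be illustrated with a minimal sketch. The code below assumes PyTorch and uses hypothetical names (`LSTMActorCritic`, `ppo_clip_loss`); it shows the generic LSTM-plus-PPO combination rather than the thesis's actual SEPPO implementation.

```python
# Minimal illustrative sketch (assumes PyTorch); hypothetical names, not the SEPPO code.
import torch
import torch.nn as nn

class LSTMActorCritic(nn.Module):
    """Actor-critic whose policy and value heads are conditioned on an LSTM
    summary of the recent observation history (the 'strategy' context)."""
    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 128):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.policy_head = nn.Linear(hidden_dim, n_actions)  # action logits
        self.value_head = nn.Linear(hidden_dim, 1)            # state-value estimate

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, time, obs_dim)
        x = torch.relu(self.encoder(obs_seq))
        out, hidden = self.lstm(x, hidden)
        context = out[:, -1]  # context vector from the last time step
        return self.policy_head(context), self.value_head(context).squeeze(-1), hidden

def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate objective, negated so it can be minimized."""
    ratio = torch.exp(new_logp - old_logp)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Toy usage: score a batch of 4 observation sequences of length 8.
if __name__ == "__main__":
    net = LSTMActorCritic(obs_dim=16, n_actions=6)
    logits, values, _ = net(torch.randn(4, 8, 16))
    print(logits.shape, values.shape)  # torch.Size([4, 6]) torch.Size([4])
```

In a PPO update, `new_logp` would be the log-probabilities of the stored actions under the current policy and `old_logp` those under the policy that collected the data; the advantages would come from the value head, for example via generalized advantage estimation.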
Table of Contents
Chinese Abstract	I
Abstract	III
Table of Contents	IV
List of Figures	V
List of Tables	VI
1.	Introduction	1
2.	Related work	4
2.1.	Long short-term memory	4
2.2.	Single dimensional action space	5
2.3.	Multidimensional action space	6
3.	Preliminary	7
3.1.	Notation	7
3.2.	Problem Definition	7
4.	Proposed RL: SEPPO	8
4.1.	Feature extraction and Reward function	8
4.2.	Strategy prediction	9
4.3.	Strategy-enhanced proximal policy optimization (SEPPO)	11
5.	Performance Evaluation	13
5.1.	Experiment Setting	13
5.2.	The effectiveness of strategy prediction	15
5.3.	SEPPO Performance	17
5.4.	The effectiveness of SEPPO training episodes	18
6.	Conclusion	20
Reference	21

List of Figures
Fig. 1 The snapshot of StarCraft2 game	2
Fig. 2 The architecture of SEPPO	8
Fig. 3 The architecture of Strategy prediction	9
Fig. 4 The environment’s snapshot	14
Fig. 5 The total images in Feature sets	14
Fig. 6 The performance of different parameter settings of cell size in strategy prediction	15
Fig. 7 The performance of different parameter settings of batch size in strategy prediction	16
Fig. 8 Accuracy of training processes	17
Fig. 9 Accuracy of validation process	17
Fig. 10 Reward of training process	19

List of Tables
Table 1 Different models comparison	18
References
[1] 	D. Silver, T. Hubert, J. Schrittwieser, "Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm," in Science, 2018. 
[2] 	O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, J. Quan, S. Gaffney, S. Petersen, K. Simonyan, T. Schaul, H. van Hasselt, et al., "StarCraft II: A New Challenge for Reinforcement Learning," in arXiv, 2017. 
[3] 	X. Wang, L. Gao, J. Song, H. Shen, "Beyond Frame-level CNN: Saliency-Aware 3-D CNN With LSTM for Video Action Recognition," in IEEE Signal Processing Letters, vol. 24, no. 4, pp. 510-514, 2017. 
[4] 	Z. Wu, X. Wang, Y.G. Jiang, H. Ye, X. Xue, "Modeling spatial-temporal clues in a hybrid deep learning framework for video classification," in Proceedings of the 23rd ACM international conference on Multimedia, pp. 461-470, 2015. 
[5] 	Q. Li, Z. Qiu, T. Yao, T. Mei, Y. Rui, J. Luo, "Action recognition by learning deep multi-granular spatio-temporal video representation," in Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, pp. 159-166, 2016. 
[6] 	W. Lotter, G. Kreiman, D. Cox, "Deep predictive coding networks for video prediction and unsupervised learning," in arXiv, 2016. 
[7] 	X. Ouyang, S. Xu, C. Zhang, P. Zhou, Y. Yang, G. Liu, X. Li, "A 3D-CNN and LSTM Based Multi-Task Learning Architecture for Action Recognition," in IEEE Access, vol. 7, pp. 40757-40770, 2019. 
[8] 	T. Akilan, Q. J. Wu, A. Safaei, J. Huo and Y. Yang, "A 3D CNN-LSTM-Based Image-to-Image Foreground Segmentation," in IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 3, pp. 959-971, 2020. 
[9] 	V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, "Playing Atari with Deep Reinforcement Learning," in Neural Information Processing Systems, 2013. 
[10] 	T. Schaul, J. Quan, I. Antonoglou, D. Silver, "Prioritized Experience Replay," in International Conference on Learning Representations, 2016. 
[11] 	K. De Asis, J. F. Hernandez-Garcia, G. Z. Holland, R. S. Sutton, "Multi-Step Reinforcement Learning: A Unifying Algorithm," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018. 
[12] 	H. van Hasselt, A. Guez, D. Silver, "Deep Reinforcement Learning with Double Q-learning," in Association for the Advancement of Artificial Intelligence Conference on Artificial Intelligence, 2016. 
[13] 	Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, N. de Freitas, "Dueling Network Architectures for Deep Reinforcement Learning," in The 33rd International Conference on Machine Learning, 2016. 
[14] 	M. Hessel, J. Modayil, H. van Hasselt, "Rainbow: Combining Improvements in Deep Reinforcement Learning," in Association for the Advancement of Artificial Intelligence Conference on Artificial Intelligence, 2018. 
[15] 	R. S. Sutton, D. McAllester, S. Singh, Y. Mansour, "Policy Gradient Methods for Reinforcement Learning with Function Approximation," in 12th International Conference on Neural Information Processing Systems, 1999. 
[16] 	J. Schulman, S. Levine, P. Moritz, M. I. Jordan, P. Abbeel, "Trust Region Policy Optimization," in International conference on machine learning, 2015. 
[17] 	N. Heess, D. TB, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. M. A. Eslami, M. Riedmiller, D. Silver, "Emergence of Locomotion Behaviours in Rich Environments," in arXiv, 2017. 
[18] 	J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, "Proximal Policy Optimization Algorithms," in arXiv, 2017. 
[19] 	V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu, "Asynchronous Methods for Deep Reinforcement Learning," in International Conference on Machine Learning, 2016. 
[20] 	M. Hausknecht, P. Mupparaju, S. Subramanian, S. Kalyanakrishnan, P. Stone, "Half field offense: An environment for multiagent learning and ad hoc teamwork," in AAMAS Adaptive Learning Agents (ALA) Workshop, 2016. 
[21] 	M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," in Eleventh International Conference on Machine Learning, 1994. 
[22] 	W. Masson, P. Ranchod, G. Konidaris, "Reinforcement learning with parameterized actions," in Thirtieth Association for the Advancement of Artificial Intelligence Conference on Artificial Intelligence, 2016. 
[23] 	M. Hausknecht, P. Stone, "Deep reinforcement learning in parameterized action space," in International Conference on Learning Representations, 2016. 
[24] 	J. Xiong, Q. Wang, Z. Yang, P. Sun, L. Han, Y. Zheng, H. Fu, T. Zhang, J. Liu, H. Liu, "Parametrized Deep Q-Networks Learning: Reinforcement Learning with Discrete-Continuous Hybrid Action Space," in CoRR, abs/1810.06394, 2018. 
[25] 	E. Wei, D. Wicke, S. Luke, "Hierarchical Approaches for Reinforcement Learning in Parameterized Action Space," in AAAI Fall Symposium on Data Efficient Reinforcement Learning, 2018. 
[26] 	Y. Zhang, Q. H. Vuong, K. Song, X. Y. Gong, K. W. Ross, "Efficient Entropy for Policy Gradient with Multidimensional Action Space," in International Conference on Learning Representations, 2018. 
[27] 	Z. Fan, R. Su, W. Zhang, Y. Yu, "Hybrid Actor-Critic Reinforcement Learning in Parameterized Action Space," in International Joint Conference on Artificial Intelligence, 2019. 
[28] 	S. Kakade, J. Langford, "Approximately optimal approximate reinforcement learning," in Nineteenth International Conference on Machine Learning, 2002.
Full-Text Access Permissions
On campus
Release of the on-campus print copy is delayed until 2023-06-01
Full-text electronic access is authorized within campus
Release of the on-campus electronic thesis is delayed until 2023-06-01
The on-campus bibliographic record is available immediately
Off campus
Authorization granted
Release of the off-campus electronic thesis is delayed until 2023-06-01
