Application of reinforcement learning methods to computer game dynamics

Booth, David John (2018) Application of reinforcement learning methods to computer game dynamics. Doctoral thesis, London Metropolitan University.

Abstract

The dynamics of the game world present both challenges and opportunities for AI to make a useful difference. Learning smart behaviours for game assets is a first step towards realistic conflict or cooperation. The scope of this thesis is the application of Reinforcement Learning (RL) to moving assets in the game world. Game sessions generate a stream of data on an asset's performance which must be processed on the fly. The lead objective is to produce fast, lightweight and flexible learning algorithms suitable for run-time embedding. The motivation arising from current work is to shorten the time needed to reach a workable policy by investigating the exploration/exploitation balance, to overcome the curse of dimensionality of complex systems, to avoid extra endogenous parameters that require multiple passes over the data, and to use simple state aggregation rather than function approximation.

How action selection (AS) contributes to efficient learning is a key issue in RL, since it determines the balance between exploiting and confirming the current policy and exploring an apparently less promising policy which may prove better in the long run. The methodology simulates several AS methods on the 10-armed bandit problem, averaged over 10,000 epochs. The results show considerable variation in performance in terms of latency and asymptotic direction. The Upper Confidence Bound (UCB) method leads over most of the episode range, especially at around episode 100. Using insight from action selection, order statistics are applied to determine a criterion for the convergence of policy evaluation. The probability that the action of maximum sample mean is indeed the action of maximum population mean (PMSMMPM) is calculated for the 3-armed bandit problem. PMSMMPM reaches 0.988 by play 26, which provides evidence for its use as a convergence criterion. An iteration stopping rule is defined using PMSMMPM, and it shows plausible properties as the population parameters are varied. A mathematical analysis of the approximation (P21) obtained by considering only the top two actions yields a minimum sample size for any level of P21. Using the gradient of P21, a selection rule is derived; combined with UCB, it yields a new complete exploratory policy, demonstrated on the 3-armed bandit, that requires just over half the sample size of pure UCB. The results provide evidence that the augmented UCB selection rule will contribute to faster learning.

The TD Sarsa(0) learning algorithm has been applied to learn a steering policy for the previously untried caravan-reversing problem and for the kerb-avoiding steering problem of a racing car, both using negative rewards on failure and simple state aggregation. The output policy for the caravan is validated as non-jack-knifing for a high proportion of start states. The racing car policy has a similar validation outcome for two exploratory policies, which are compared and contrasted.
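
To make the action-selection comparison concrete, the following minimal Python sketch (not drawn from the thesis; the Gaussian reward model, the epsilon value, the UCB constant c, and the reduced epoch count are illustrative assumptions) simulates epsilon-greedy and UCB selection on the 10-armed bandit and averages reward over epochs:

    # Sketch: epsilon-greedy vs UCB action selection on the 10-armed bandit.
    # Arm means are drawn from N(0,1) and rewards from N(mean,1); these and
    # the parameter choices below are assumptions, not the thesis's setup.
    import numpy as np

    def run_bandit(select, k=10, plays=1000, epochs=2000, seed=0):
        rng = np.random.default_rng(seed)
        avg_reward = np.zeros(plays)
        for _ in range(epochs):
            means = rng.normal(0.0, 1.0, k)      # true population means
            counts = np.zeros(k)
            estimates = np.zeros(k)              # sample mean per arm
            for t in range(plays):
                a = select(estimates, counts, t, rng)
                r = rng.normal(means[a], 1.0)
                counts[a] += 1
                estimates[a] += (r - estimates[a]) / counts[a]
                avg_reward[t] += r
        return avg_reward / epochs

    def eps_greedy(estimates, counts, t, rng, eps=0.1):
        if rng.random() < eps:                   # explore uniformly at random
            return int(rng.integers(len(estimates)))
        return int(np.argmax(estimates))         # exploit current policy

    def ucb(estimates, counts, t, rng, c=2.0):
        untried = np.flatnonzero(counts == 0)
        if untried.size:                         # play each arm once first
            return int(untried[0])
        return int(np.argmax(estimates + c * np.sqrt(np.log(t + 1) / counts)))

    print(run_bandit(eps_greedy)[-1], run_bandit(ucb)[-1])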
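The PMSMMPM quantity can be approximated by Monte Carlo, as in this illustrative sketch (the Gaussian 3-armed bandit, its parameters, and the trial count are assumptions, not the thesis's calculation, which is analytical via order statistics):

    # Sketch: Monte Carlo estimate of PMSMMPM, the probability that the arm
    # with the maximum sample mean is also the arm with the maximum
    # population mean, after n plays of each arm of a 3-armed bandit.
    import numpy as np

    def pmsmmpm(means, sigma=1.0, n=26, trials=100_000, seed=1):
        rng = np.random.default_rng(seed)
        best = int(np.argmax(means))
        hits = 0
        for _ in range(trials):
            # A sample mean of n draws from N(m, sigma^2) is N(m, sigma^2/n),
            # so we can draw the sample means directly.
            sample_means = [rng.normal(m, sigma / np.sqrt(n)) for m in means]
            hits += int(np.argmax(sample_means) == best)
        return hits / trials

    # As the gaps between population means widen, PMSMMPM rises towards 1,
    # so a threshold on it can serve as an iteration stopping criterion.
    print(pmsmmpm(means=[0.0, 0.5, 1.0]))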
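A generic tabular Sarsa(0) learner with simple state aggregation, in the spirit of the steering experiments, might look like the following sketch (the env interface, its step/reset methods, and the aggregate() bucketing function are hypothetical placeholders; the caravan and racing-car environments themselves are defined in the thesis):

    # Sketch: tabular TD Sarsa(0) with state aggregation and a negative
    # reward on failure (e.g. jack-knife or kerb strike), zero otherwise.
    import numpy as np

    def sarsa0(env, aggregate, n_states, n_actions,
               episodes=5000, alpha=0.1, gamma=0.99, eps=0.1, seed=0):
        rng = np.random.default_rng(seed)
        Q = np.zeros((n_states, n_actions))

        def policy(s):
            if rng.random() < eps:               # epsilon-greedy exploration
                return int(rng.integers(n_actions))
            return int(np.argmax(Q[s]))

        for _ in range(episodes):
            s = aggregate(env.reset())           # map raw state to a bucket
            a = policy(s)
            done = False
            while not done:
                obs, reward, done = env.step(a)  # reward < 0 only on failure
                s2 = aggregate(obs)
                a2 = policy(s2)
                target = reward + (0.0 if done else gamma * Q[s2, a2])
                Q[s, a] += alpha * (target - Q[s, a])  # Sarsa(0) update
                s, a = s2, a2
        return np.argmax(Q, axis=1)              # greedy steering policy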

Documents

Booth,DavidJohn_Thesis.pdf - Published Version (2MB)