Reinforcement Learning in the Minecraft Gaming Environment


1.     Introduction


Researchers are keen to solve the challenge of a robot successfully interacting with an external environment. In this regard, reinforcement learning (RL) has become a powerful tool for achieving superhuman results in games: Google DeepMind’s agents learnt to play Atari 2600 games, AlphaGo defeated the reigning world champion at the board game Go, and OpenAI’s agents won a 5v5 match against the top players in the world in Dota 2. RL agents appear to be able to master any game, but what about a game such as Minecraft?

The long-term objective of this research is to use RL to teach an agent to survive a day-night cycle in the Minecraft gaming environment. To progress one step closer to this objective, the research tests a new method, referred to as dojo learning and based on curriculum learning, against current methods. Although Minecraft is used as a testing platform, the method investigated in this thesis could be generalised and adapted to work in any appropriate gaming environment.


2.     The Minecraft environment


Minecraft, a 3D game in which players can move, explore and build, allows for continuous movement, is partially observable, and is mostly deterministic and dynamic. While Minecraft uses a graphical aesthetic of blocks and is thus simpler than photo-realistic graphical environments, it easily accommodates large, easily modifiable, open worlds. This makes Minecraft a valuable stepping stone to more realistic environments; other 3D game environments tend to have smaller state spaces. The day-night cycle in Minecraft referred to above lasts for 20 minutes in real-world time.


3.     What is reinforcement learning?


RL is a machine learning (ML) paradigm that trains the policy of an agent so that it can make a sequence of decisions. The agent produces actions according to the observations it makes of its environment. These actions then lead to further observations and rewards. Training involves numerous trial-and-error runs as the agent interacts with the environment, and the policy can be improved in each iteration.

In Minecraft specifically, the player can gather all the resources they want during the day with little danger or consequence. However, once the sun sets, the world of Minecraft is swamped with mobs, dangerous creatures that roam the environment at night. The mobs can attack the player when in range, and if the player loses all their health, they are stripped of all the gathered resources and respawn elsewhere. Of the many game modes in Minecraft, this research focuses on survival mode, where the mobs pose a threat and the player must gather resources.


In other words, the idea behind RL is to teach a computer in much the same way that dogs learn new tricks: after performing an action, it receives either a positive or a negative reward. This is also true when humans learn. The major difference between a human and a computer learning a new task is that a human will use common sense (usually from life experience or instinct) to determine the next approach if the first was unsuccessful. Conversely, a computer will often perform a random action, or adjust only slightly from its previous action, until it achieves the goal of the task. However, the computer can attempt the task many times and so gathers experience more quickly than a human can in the same amount of time.
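
The observation, action and reward loop described in this section can be sketched with tabular Q-learning on a toy problem. This is a minimal illustration only: the five-cell corridor, the `step` and `train` functions, and all parameter values are assumptions made for the example and are not taken from the thesis.

```python
import random

# Hypothetical toy environment (not from the thesis): a corridor of 5 cells.
# The agent starts in cell 0 and receives a reward of +1 on reaching cell 4.
N_STATES = 5
ACTIONS = (-1, +1)  # index 0 moves left, index 1 moves right


def step(state, action_index):
    """Apply an action and return (next_state, reward, done)."""
    next_state = min(max(state + ACTIONS[action_index], 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=300, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Act randomly when exploring or when both estimates are tied;
            # otherwise exploit the current value estimates.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, action)
            # Move the estimate towards reward + discounted best next value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = train()
# Greedy action per non-terminal cell after training (1 means "move right").
greedy_policy = [0 if left > right else 1 for left, right in q_table[:-1]]
```

Early episodes are close to a random walk, which mirrors the trial-and-error behaviour described above; the reward signal then gradually shapes the value estimates until the greedy policy heads straight for the goal.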


4.     Research approach and initial Minecraft environment


Before attempting to solve the Minecraft environment using RL, we developed a snake game in Python using the PyGame library to run tests on Q-learning. The objective of this minigame was to understand the fundamentals of Q-learning and of the available ML libraries, namely TensorFlow, before delving into the complex world of Minecraft. To make the learning efficient, a 2-D Python environment of Minecraft was also created using PyGame. We use the premise of curriculum learning, in which an agent learns different skills in independent and isolated sub-environments referred to as dojos. The skills learnt in the dojos then serve as actions: the agent decides which skill best applies to the current game state and performs it. We evaluate this approach with experiments conducted in the Minecraft gaming environment.
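
The idea of treating dojo skills as the actions of a higher-level agent can be sketched as follows. This is a rough illustration under stated assumptions, not the thesis implementation: the three skills, the `featurise` function, the `SkillSelector` class and its simplified one-step update are all hypothetical, and the skills are hard-coded stand-ins for policies that would be trained in their own dojos.

```python
import random


# Hypothetical skills: each maps a game state (a dict here) to a primitive action.
# In the setting described above, each would be a policy trained in its own dojo.
def gather_wood(state):
    return "chop"


def flee(state):
    return "move_away"


def fight(state):
    return "attack"


SKILLS = [gather_wood, flee, fight]


def featurise(state):
    """Collapse the state into a small key for a tabular selector (an assumption)."""
    return (state["is_night"], state["mob_nearby"])


class SkillSelector:
    """Top-level agent whose actions are skill indices, not primitive actions."""

    def __init__(self, n_skills, epsilon=0.1, seed=0):
        self.n_skills = n_skills
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.q = {}  # state key -> list of value estimates, one per skill

    def values(self, state):
        return self.q.setdefault(featurise(state), [0.0] * self.n_skills)

    def choose(self, state, greedy=False):
        values = self.values(state)
        if not greedy and self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_skills)
        return max(range(self.n_skills), key=values.__getitem__)

    def update(self, state, skill, reward, alpha=0.5):
        # Simplified one-step update; a full agent would also bootstrap
        # from the value of the next state, as in Q-learning.
        values = self.values(state)
        values[skill] += alpha * (reward - values[skill])


selector = SkillSelector(len(SKILLS))
night_with_mob = {"is_night": True, "mob_nearby": True}
selector.update(night_with_mob, skill=1, reward=1.0)  # fleeing was rewarded
chosen = selector.choose(night_with_mob, greedy=True)
primitive_action = SKILLS[chosen](night_with_mob)
```

The design point is that the selector never sees primitive actions such as "chop" or "attack"; it only learns which skill to hand control to in a given game state.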


5.     Conclusion and future research 


The approach described above rests on the hypothesis that an agent that learnt a subset of complex actions in dojos, or sub-environments, before being exposed to the complex environment might outperform an agent that was trained in the complex environment from the start. The evaluated models do not appear to support this hypothesis: the standard deep Q-network (DQN) outperformed our dojo network. Evaluating the results shows that the dojo network is limited by the individual modules that were previously learnt in isolated dojos; when the agent is exposed to the complex environment, it is ‘stuck’ performing those previously learnt actions in a sub-optimal manner. It is possible to match the performance of the standard DQN by allowing the dojo modules to be trained further in the complex environment.


The dojo learning method has shown promise and might be viable in specific environments. In which environments it performs better, and why, needs to be explored in further research. One of the main limitations the agent currently faces is that the skills learnt in the different dojos must be chosen in advance, and the number of skills is then locked by predetermining the number of dojos. We need to improve the agent’s method of executing actions and not limit it by our choice of dojos.


It will be beneficial to explore a non-fixed action structure, or to use the chosen actions as a stepping stone to learning about the complex environment more efficiently. We could also investigate the idea of time integration, with the chosen actions having a duration with a specific endpoint. As previously mentioned, some environments seem to work better than others. We can thoroughly research the environments in which the dojo network outperforms the standard network, such as the complex environment with no zombies, and identify reasons for these results.
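
One possible reading of time integration is that a chosen skill keeps control until it reaches its own endpoint or a step limit, and only then hands control back to the agent. The sketch below illustrates that reading only; `run_skill`, `walk_step` and the number-line example are hypothetical and are not taken from the thesis.

```python
def run_skill(step_fn, policy, is_finished, state, max_steps=20, gamma=0.99):
    """Run one skill until its endpoint or a step limit is reached.

    Returns the final state, the discounted reward collected, and the steps used.
    """
    total, discount, steps = 0.0, 1.0, 0
    while steps < max_steps and not is_finished(state):
        state, reward = step_fn(state, policy(state))
        total += discount * reward
        discount *= gamma
        steps += 1
    return state, total, steps


# Hypothetical example: a "walk to shelter" skill on a number line,
# where the shelter is at position 3 and reaching it gives a reward of 1.
def walk_step(position, action):
    new_position = position + action
    return new_position, (1.0 if new_position == 3 else 0.0)


final_state, skill_return, steps_used = run_skill(
    step_fn=walk_step,
    policy=lambda position: 1,                   # always step towards the shelter
    is_finished=lambda position: position == 3,  # the skill's endpoint
    state=0,
    gamma=0.5,
)
```

The step limit guards against a skill that never reaches its endpoint, which is one way to avoid the agent being stuck in a previously learnt behaviour.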


We can also investigate the impact of more advanced algorithms and methods on our dojo network approach. We could further investigate different architectures for different modules, with the actions of each dojo module being unique and fitted to that particular skill.


Based on the following research thesis: