I’ve finally had a decent amount of time to invest in my Big 2 reinforcement learning AI, and it’s actually working really well (much, much better than I was ever expecting, in fact!). At some point I will do a full, detailed write-up, but for now I’ll just make a few notes about the process I used and summarize the results so far. The main result is that in initial testing the AI beat me (someone who has played the game a lot) pretty convincingly, and showed clear signs of being able to formulate good plans to get rid of its cards! I wasn’t expecting this to happen, so I’m really pleased, particularly as I was recently reading this article about how deep reinforcement learning doesn’t really work very well yet (or rather, that it’s very difficult to get it to work properly compared to supervised deep learning).
So in the end I decided to use the “Proximal Policy Optimization” (PPO) algorithm, which seems to be very popular at the moment (particularly at OpenAI) and, from what I’ve read, is one of the best in terms of sample efficiency and robustness to varying hyper-parameters. It’s also relatively simple to implement, and OpenAI have released an excellent implementation in their “Baselines” project which was incredibly useful as a basis for my own code (which can be found here). I won’t go into the details of the theory behind the PPO algorithm (I still need to read more about this myself to really understand why it works so well), but essentially it’s a policy gradient method that uses a surrogate loss function which is clipped so as to prevent updates that change the policy too much. The surrogate loss requires being able to estimate the advantage of a given action in a given state under the current policy. There are a number of ways to do this, but I followed the original paper, which suggests using “Generalized Advantage Estimation”, a method that tries to balance reducing the bias and reducing the variance of the advantage estimates as well as possible. This requires estimating the value function of each state under the policy being employed, making this an actor-critic algorithm, i.e. you are trying to learn a policy \( \pi \) and a value function \( v \) at the same time.
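To make the clipping idea a bit more concrete, here is a minimal numpy sketch of the clipped surrogate objective from the PPO paper. It is purely illustrative (the function and variable names are my own, and the real loss also includes the value-function term and an entropy bonus):

```python
import numpy as np

def clipped_surrogate_loss(new_logp, old_logp, advantages, clip_range=0.2):
    # probability ratio pi_new(a|s) / pi_old(a|s) for each sampled action
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # taking the elementwise minimum means the objective gives no extra reward
    # for moving the policy further than the clip range allows
    return -np.mean(np.minimum(unclipped, clipped))
```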
I also read quite a bit about how a lot of deep reinforcement learning algorithms (e.g. deep Q-learning) often don’t perform very well in the multi-agent setting, where you have a number of agents competing against each other in some way, because these environments are complex and non-stationary, while typical policy gradient methods suffer from very high variance which increases rapidly with the number of agents. This worried me, as it suggested that a four-player game of imperfect information would be very tricky to get working (especially with a complicated action space), but I was somewhat encouraged by this paper, which demonstrated that PPO with huge training batch sizes seemed to work really well in complicated competitive environments.
What I actually did:
The first thing I needed to do was to parallelize my Big 2 game class to allow multiple games to be run at once on different cores of my processor. The way PPO works is by running a number of environments \(n_e\) in parallel (for ATARI and MuJoCo they use \(n_e=8\)) and stepping them forward for some number of steps \(n_s\) with the current policy to generate a batch to train on. The reason for running many games at once is to get a batch of samples which aren’t all correlated (which they would be if they all came from the same game), and it provides a big speed-up if you can run these games on different cores as much as possible. Then, once the batch has been generated, you train the neural network using the PC’s GPU. There were a number of challenges, and things I had to alter from the OpenAI Baselines implementation, mainly because it is designed for training a single agent whereas I needed it set up for four players. The main difficulty is that to use generalized advantage estimation you need to work backwards through the batch, using the states that come after the one whose advantage you are estimating, but now the “next state” for a given player is actually the state four time steps after the current one, because the three other players have to make a decision first. Another issue is that you don’t know whether a state is terminal (and what reward to assign) until the other three players have had their next turn, because the game finishes at the point when a player plays their final cards. To account for this it’s necessary to run an extra four steps after each batch, save these, and then use them as the first four states of the next batch. It also makes vectorizing the whole batch processing a bit trickier to deal with, but I was eventually able to get this working.
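To give an idea of what this bookkeeping looks like, below is a rough numpy sketch of generalized advantage estimation for a single game where the four players take turns, with each player’s “next state” sitting four steps later in the trajectory. It assumes the rewards and new-game flags have already been filled in using the extra steps run after the batch; the names are purely illustrative and don’t necessarily match my actual code.

```python
import numpy as np

def four_player_gae(rewards, values, new_game, extra_values, extra_new_game,
                    gamma=0.995, lam=0.95, n_players=4):
    """Work backwards through one game's trajectory, where a player's next
    state is n_players steps later because the other players move in between.

    new_game[t] is 1 if the state at step t belongs to a new hand (i.e. the
    previous hand finished before this step); extra_values / extra_new_game
    come from the extra n_players steps run after the batch."""
    n_steps = len(rewards)
    advantages = np.zeros(n_steps, dtype=np.float32)
    running_gae = np.zeros(n_players, dtype=np.float32)  # one trace per seat
    for t in reversed(range(n_steps)):
        seat = t % n_players  # which player's running GAE trace this step uses
        if t + n_players < n_steps:
            next_value = values[t + n_players]
            next_nonterminal = 1.0 - new_game[t + n_players]
        else:
            # this player's next state falls in the extra steps after the batch
            idx = t + n_players - n_steps
            next_value = extra_values[idx]
            next_nonterminal = 1.0 - extra_new_game[idx]
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        running_gae[seat] = delta + gamma * lam * next_nonterminal * running_gae[seat]
        advantages[t] = running_gae[seat]
    returns = advantages + values
    return advantages, returns
```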
In terms of the neural network architecture I went for the following:
Here the input layer has size 412 and contains information about the player’s current hand and the cards/hands that the other players have already played, which I described in a previous blog post. This is fed into a shared fully connected layer of 512 neurons with a ReLU activation function, which in turn feeds into two further fully connected hidden layers of 256 neurons each (also with ReLU activations). Each of these is connected to a linear output layer (i.e. with no activation function). The first of these outputs represents the probability weighting the network gives to each possible action in the current state, while the second outputs the value estimate of the current state. To get the actual probabilities we consider only the outputs \( \{o_i \} \) of actions that are actually allowed in the current state and sample actions with probabilities \( p_i = \frac{ e^{ o_i } }{\sum_j e^{o_j} } \). This means there are nearly a million trainable parameters in this model! I chose the number of hidden neurons fairly arbitrarily and have made no attempt to play around with this so far, but Big 2 is a reasonably complicated game so I figured that having fairly large layers would be sensible. Also, sharing the first layer between the probability output and the value output seemed sensible, as there are likely to be a lot of features of the game state which are useful for calculating both!
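As a rough sketch of how this could be wired up (in TensorFlow 1.x style, since that is what Baselines uses; the function name, the `n_actions` argument for the size of the enumerated action space, and the masking trick are my own illustration rather than the exact code in the repository):

```python
import tensorflow as tf

def build_policy_and_value(obs, allowed_mask, n_actions):
    """obs: [batch, 412] game state; allowed_mask: [batch, n_actions] with 1s
    for actions that are legal in the current state and 0s otherwise."""
    shared = tf.layers.dense(obs, 512, activation=tf.nn.relu)
    # separate branches for the policy and the value estimate
    h_pi = tf.layers.dense(shared, 256, activation=tf.nn.relu)
    h_v = tf.layers.dense(shared, 256, activation=tf.nn.relu)
    logits = tf.layers.dense(h_pi, n_actions, activation=None)  # linear output
    value = tf.layers.dense(h_v, 1, activation=None)            # linear output
    # push the logits of illegal actions towards -infinity so the softmax only
    # assigns probability to actions that are actually allowed
    masked_logits = logits - (1.0 - allowed_mask) * 1e9
    probs = tf.nn.softmax(masked_logits)
    return probs, value
```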
In terms of the parameters I used for the PPO algorithm, I chose to run 48 games in parallel and run them all for 20 steps to generate a training batch. This leads to batch sizes of 960 samples, which is tiny in comparison to what is used in the OpenAI paper for multi-agent environments (where they apparently generated batches of around 400,000 samples), but a lot more than was used for Atari, where they had only 8 games running in parallel. These batches were then divided into four mini-batches of equal size and trained on with 5 epochs of SGD per batch. For the generalized advantage estimation I chose \(\gamma = 0.995 \) and \(\lambda = 0.95\), and for the main PPO algorithm I chose a learning rate of \( \alpha = 0.00025\) and a clip range of 0.2 (both of which are linearly annealed to zero as training progresses), and set the value/entropy coefficients in the loss function to 0.5 and 0.01 respectively. I then trained the agent by making it play against itself (always using the current version of the neural network – I’d quite like to experiment with some kind of opponent sampling and see if that makes any difference) for 136500 updates (~130 million time steps). I made absolutely no attempt to tune these hyperparameters in any way (although I know they are somewhat sensible from reading the results of the paper), so it’s really pretty cool that it ends up working so well!
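For reference, here are those settings gathered in one place (the variable names are just descriptive, not necessarily the ones used in the actual code):

```python
# training settings as described above (illustrative names only)
n_games = 48              # environments (games) run in parallel
n_steps = 20              # steps each game is rolled forward per update
batch_size = n_games * n_steps   # = 960 samples per update
n_minibatches = 4         # the batch is split into 4 minibatches of 240
n_epochs = 5              # epochs of SGD over each batch

gamma = 0.995             # discount factor for generalized advantage estimation
lam = 0.95                # GAE lambda
learning_rate = 2.5e-4    # linearly annealed to zero over training
clip_range = 0.2          # PPO clip parameter, also annealed to zero
vf_coef = 0.5             # weight of the value-function loss
ent_coef = 0.01           # weight of the entropy bonus
total_updates = 136500    # roughly 130 million time steps of self-play
```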
Results
The main result is that the trained network learns to play Big 2 really well! I’ve only played 15 games against it so far, because the GUI I’ve made is kind of clunky and needs improving a bit, but I got well and truly embarrassed – and I’ve played the game a lot (although I am by no means a proper expert). From the 15 games I played against three of the final trained AIs, my scores were: \( \{-3, -3, -1, -1, -10, -11, -1, -1, -4, -8, +16, -2, -5, +14, -8 \} \)
So I only won two games (and in one of those I had what was essentially an unbeatable hand). There were also situations where I could tell that the AI was playing really well and had clearly “planned”, in some sense, how to get rid of its cards, learning simple things like the value of passing to save 2s so as to gain control at a later stage in the game, but also things like how many cards to play when it does get control. I have to say this was really surprising and cool to see! Obviously I need to test this a bit more rigorously, so I am planning to make a web app where people can play against it and record the results, which should give me a better idea of how good it really is. One thing I have looked at is its performance against earlier versions of itself, as well as against random opponents, and I get the following results (each point is averaged over 10,000 simulated games):
Note that the first data point is not at zero updates but after 1000 updates – hence a lot is getting learned very quickly! More importantly we see that there seems to be steady improvement throughout training which suggests that training for even longer could yield further improvement still!
To Do:
- A proper write-up fully explaining the game, the PPO algorithm, etc.
- A web app that allows people to play against the currently trained neural net and save their results to a leaderboard.
- Experiment with different batch sizes (e.g. try >100 games running in parallel) and other parameters to see if performance can be improved further. I would also like to experiment with some kind of opponent sampling, i.e. only having one player guaranteed to be the most recent neural network, with the other players sampled from versions of the network from earlier in training. The point of this would be to try to ensure good play against all opponents, not just the most recent one. I guess what I’m thinking about here is the way a poker pro plays completely differently against a new player than against another pro, so it could be important to try to get the network to understand this sort of thing.
EDIT
Code is available on Github here. The webapp is finished, so you can play against the AI here (it will most likely take a while to load, as I am using a free Heroku account). A preprint of a paper describing things in more detail is here.