# Big 2 Reinforcement Learner Progress

This is just a post mainly to document (to myself more than anything) the progress that I’ve made over the past couple of weeks with the “Big 2 reinforcement learning project” that I’m working on as well as to lay out my plan for what to do next. The main goal of the project is to train an AI to learn how to play the four player card game “Big 2”. This is a game of imperfect information with a reasonably complicated action space and so should be an interesting challenge particularly as I have only recently become interested in the field of reinforcement learning.

So far I’ve mainly just been working on implementing the game logic in python and writing the code that generates what will be the input to the neural network. I spent some time creating a GUI in python with tkinter which allows you to generate a big 2 game, see what all of the options available to each player are, and essentially play through a full game. It also has a separate window that shows you what the input to the neural network for each player will be given the current state of the game. If you want to play around with it yourself you can find it here on Github or if you don’t have python installed I made a standalone executabe you can run (although it’s quite a big file). I did this partly because I thought it would be interesting to learn how to make a GUI with python (and tkinter turns out to be really cool and easy to use!) but it’s also been extremely useful in debugging the game logic. Below is a screenshot of it in action:

In the middle are the three most recently played hands. The green circle shows that it is player 1’s turn and that they have control (it is red otherwise) and then in the top right are the card indices (from 0 to 12) that are playable hands in the current situation. When I have a neural network to play against it should be easy to add this in to the GUI so that I can play against it as well as display statistics on things like the value it associates with each option that it has available to it, so I think this has definitely been a good use of my time. The second window it shows is the input state that will be provided to the neural network when it has to make a decision:

This may not be absolutely final but it’s going to be what I try out initially. The first part of the input is the player’s actual hand where each card (up to a maximum of 13 – the size of the initial hand you’re dealt) is given a value and a suit. I also decided to include whether or not each card is part of a hand that uses more than one card (e.g. pair, straight, flush etc). In principle this could be learned by the network but I suspect/hope that it will be easier for the network to learn somewhat advanced strategies that involve saving cards for later on if you provide this information manually. Next we have the information about what the other players have/what they have played previously. One of the things which makes Big 2 interesting is that it is a game of imperfect information which of course means that we cannot provide the neural network with complete information about what is in each of the other player’s hands, but instead can only include what is “allowed” to be known. So far I have included the number of cards they have left, and whether at some previous point they have played any ace or any two (the highest cards), as well as whether they’ve played any pairs, three of a kinds, two pairs, straights, flushes or full houses. It’s at this point that I’m injecting the most “prior information” about the game of Big 2 because I know from experience that noting who’s played certain high cards/hands is important in figuring out what decision you should make. Ideally you’d just like to include a full history of every hand that each player has played so far and try and learn directly from that but that simply isn’t practical here and so until I can think of a better approach I’ll stick with this for the time being. We then have an input which describes the previous hand which needs to be beaten – the value/suit as well as the type and then finally a “cards played” input which tracks whether anyone (so not specifically who) has played any of the 16 most valuable cards. Again this is injecting some prior knowledge about the game because I am not tracking whether the least valuable cards have been played as this is relatively unimportant (as the game is really all about gaining/maintaining control at the right times) and the input vector is already big enough as it is – 412 elements at the moment.

So this means everything is set up quite nicely meaning that I now need to start thinking about exactly how I’m going to try and train a neural network to play. Hopefully this weekend I will have a test set up working where I can get two untrained neural networks to play against each other in an essentially randomly manner so that I have an idea of how long it will take to simulate a single game, and hence have some idea about the amount of data I can generate in a reasonable time frame. This will have some effect on what I decide to do because unfortunately I don’t work for Google and have virtually unlimited computational resources at my disposal and certain algorithms are much more computationally expensive than others!

### PLAN

Deep-Q Network

Probably the first thing I’m going to try is just a normal Deep Q Network like the one used in the Deepmind paper that learned to play Atari from raw pixel inputs. This involves training a deep neural network to try and learn the value function $$Q(s,a)$$ for any given state s and choice of action a. The reason that I’ll try this first is because it is relatively simple (although there are a few tricks needed such as using experience replay and target networks to ensure that the training is stable), and there exist a number of good online tutorials for implementing this algorithm that I can use as a template (e.g. here and here). The obvious difficulty that I can see with using this approach is that the number of possible actions is quite high (see my previous post for a discussion of the Big 2 action space). The usual approach is to learn a function which outputs a Q-value for each possible action, i.e. you only provide the state s as input to the neural network rather than the state and the action. This has the big advantage that you then only need to carry out one forward pass of the network to obtain the Q-values for each action so that you can easily choose the one which is largest. The disadvantage is that it means that the network does nothing in terms of generalizing over actions which might be similar, i.e. you learn a different function for each action essentially independently. In Big 2 the number of possible actions is 1694 (in the way that I have decided to represent them) so it seems likely to me that this approach simply will not be possible and that the action will also have to be provided as an input to the network, such that the output is just a single number $$Q(s,a$$. This will cause an issue because then to choose the most “valuable” action you will have to evaluate the neural network for however many actions are available to the player in the current state each time requiring a full forward pass of the neural network. The only reason this might be possible is that although the number of possible actions is 1694 the actual number of actions that are allowable in any given situation is usually significantly lower meaning that this approach could be feasible. However being completely realistic I am expecting this to be too slow to get really good performance so the target here will be to hopefully train a network that can reliably beat a group of players who make random moves.

Large Action Spaces Paper

If this turns out to be too computationally expensive (or even if it doesn’t) I would then like to have a look at using the technique described in this paper by some of the folks at Deepmind that I found recently and which can be applied to situations where you have large discrete action spaces. I haven’t read through it in complete detail but the basic idea seems to be to describe actions as vectors $$\mathbf{a} \in \mathbb{R}^n$$ such that there is some (predefined) way of measuring how close actions are to each other. This will be quite natural given the way that I am choosing to represent the possible actions available in Big 2. The idea then seems to be to use an actor-critic approach whereby the actor chooses a “proto-action” $$\mathbf{\hat{a}} = f(s)$$ in continuous space and from which you evaluate the k-nearest actually allowable actions. Essentially the aim of the paper is to introduce an architecture that can generalize over actions without having the heavy cost associated with evaluating the neural network for every action that occurs when you explicitly include the action as an input to the neural network. Obviously I need to think more about the details (and read the paper properly!!) but it seems like it should be possible to apply this to the Big 2 action space so this will be interesting to try!

Alpha Zero and Generalizing the Monte Carlo Tree Search

The recent papers on “Alpha Go Zero” and then more generally “Alpha Zero” (which was applied to Chess and Shogi) are really, really cool! (and I should also mention this paper which uses a very similar technique and was developed at the same time!) The reason is because they learn to play these games completely from self-play, i.e. without any domain specific knowledge provided to them they just play games against each other and work out what moves lead to the best results. Doing this they beat their previous Alpha Go program with significantly less training time required and also beat the computer “world chess champion” which is pretty damn awesome if you ask me!

A key component of this algorithm is the Monte Carlo Tree Search (MCTS) which is used both during in training the neural network and afterwards once the neural network has been trained. I think the MCTS is a really cool algorithm and had a go at implementing my own version in c++ a while ago, applying it to the extremely complicated game of Tic Tac Toe. If you’ve never heard of it before then I would recommend reading this introductory blog post on the subject. Unfortunately despite it being a very cool algorithm it can only really be applied directly to games of perfect information. The basic idea is that you gradually build up a game tree:

(Image taken from “Recent Advances in General Game Playing”)

in a way that sensibly balances exploration and exploitation, i.e. you don’t build up a full game tree but just focus on “promising” branches. The actual implementation used by Alpha Zero is slightly different from the standard approach of using rollouts (and is again something I need to read up on in more detail – luckily there is an interesting post with code that implements the Alpha Zero algorithm on the much simpler game of Othello here. I am going to read through this and try to replicate their results when I get a chance). The reason this can only be applied to games of perfect information is that to properly build a game tree you need to know what state you will end up in when you take any particular action. In Big 2 you do not know this because you do not know what your opponents hand is and the number of possible things they could possibly do in their next move with all the possible cards they could have is far too large. Nevertheless I feel like it should be possible to do something similar where in the simplest case you just sample a certain number $$k$$ of possible hands that your opponents could have and use them to generate future states. This would be a very crude way of doing it – a more interesting way could be to train a neural network to try and learn how likely it is that a player has each card given what has happened during the game so far and then use these probabilities to sample the hands rather than doing so at random. Again I don’t have any details for how I could go about implementing this at the moment but it seems possible that something like this should be possible. Then if I have a neural network which has learned to play I would expect that using some generalization of MCTS on top of this would lead to much better play (as you explicitly explore a large number of possible futures for each option available to you and are just guided by the neural network – in fact even without a neural network you could expect that this might lead to decent play if you let it think for long enough. This would be interesting to check!). In an ideal world I would want to use the MCTS generalization to train the network too but this has got to be hugely expensive and I don’t have 5000 TPUs at my disposable so I suspect that this will not be realistic!

Anyway this has been a bit of a brain dump – it will be interesting to look back on this in a few months time and see whether anything I’ve thought about here ends up working out!

## Author: Henry Charlesworth

I am currently a PhD student at the University of Warwick, based in the Centre for Complexity Science. My research has been looking at how the principle of "future state maximisation" can be applied to both explaining behaviour in animal systems as well as generating behaviour in artificial systems. Loosely speaking, this is the idea that taking everything else to be equal it is preferable to keep your options open as much as possible. I am also very interested in artificial intelligence and machine learning, particularly reinforcement learning.