## Big 2 AI Working!!

I’ve finally had a decent amount of time to invest in my Big 2 reinforcement learning AI, and it’s actually working really well (much, much better than I was ever expecting in fact!). At some point I will do a full detailed write up, but for now I’ll just make a few notes about the process I used and summarize the results so far. The main result is that in initial testing the AI actually beat me (and I have played the game a lot) pretty damn convincingly, and showed clear signs of being able to formulate good plans to get rid of its cards! I totally wasn’t expecting this to happen so I’m really pleased with it, particularly as I was recently reading this article about how deep reinforcement learning doesn’t really work very well yet (or rather that it’s very difficult to get it to work properly compared to supervised deep learning).

So in the end I decided to use the “Proximal Policy Optimization” (PPO) algorithm, which seems to be very popular at the moment (particularly at OpenAI) and from what I’ve read is one of the best in terms of sample efficiency and robustness to varying hyper-parameters. It’s also relatively simple to implement, and OpenAI have released an excellent implementation in their “Baselines” project which was incredibly useful as a basis for my own code (which can be found here). I won’t go into the details of the theory behind the PPO algorithm (as I actually still need to read more about this myself to really understand why it works so well), but it’s a policy gradient method of sorts in which you use a surrogate loss function that is clipped so as to stop updates which change the policy too much. The surrogate function requires being able to estimate the advantage of a given action in a given state under the current policy. There are a number of ways you can do this, but I followed the original paper, which suggests using “Generalized Advantage Estimation”, which tries to strike a good balance between the bias and the variance of the advantage estimates. This requires being able to estimate the value function of each state under the policy being employed, making this an actor-critic algorithm, i.e. you are trying to learn a policy $$\pi$$ and a value function $$v$$ at the same time.
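To make the clipping idea a bit more concrete, here is a minimal NumPy sketch of the clipped surrogate objective. It is only an illustration of the general PPO idea, not the code from Baselines or from my repository; the function name and arguments are made up for this example.

```python
import numpy as np

def clipped_surrogate(new_logp, old_logp, advantages, clip_range=0.2):
    """Mean PPO clipped surrogate objective (a quantity to be maximised)."""
    ratio = np.exp(new_logp - old_logp)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return np.mean(np.minimum(unclipped, clipped))

# A ratio of 1.5 with a positive advantage of 2.0 is capped at 1.2 * 2.0.
value = clipped_surrogate(np.log([1.5]), np.log([1.0]), np.array([2.0]))
```

In practice this term is combined with a value-function loss and an entropy bonus, and the whole thing is minimised with its sign flipped.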

I also read quite a bit about how a lot of deep reinforcement learning algorithms (e.g. deep Q-learning) often don’t perform very well in the multi-agent setting, where you have a number of agents competing against each other in some way, because these environments are complex and non-stationary, while typical policy gradient methods suffer from very high variance which increases rapidly with the number of agents. This made me worry that a four-player game of imperfect information would be very tricky to get working (especially with a complicated action space), but I was somewhat encouraged by this paper, which demonstrated that using PPO with huge training batch sizes seemed to work really well in complicated competitive environments.

### What I actually did:

The first thing I needed to do was to parallelize my Big2 game class to allow for multiple games to be run at once on different cores of my processor. The way PPO works is by running a number of environments $$n_e$$ in parallel (for Atari and MuJoCo they use $$n_e=8$$) and running them forward for some number of steps $$n_s$$ with the current policy to generate a batch to train on. The reason for running many games at once is to have a batch of samples which aren’t all correlated (which they would be if they all came from the same game), and so it provides a big speed up if you can run these games on different cores as much as possible. Then when the batch has been generated you train the neural network using the PC’s GPU.

There are a number of challenges and things I had to alter from the OpenAI Baselines implementation, mainly to do with the fact that it is designed for training a single agent whereas I needed it to be set up for four players. The main difficulty here is that to use generalized advantage estimation you need to work backwards, using all of the states in the simulated batch that come after the one you’re estimating the advantage for, but now the “next state” is actually the state four time steps after the current time step as the three other players have to make a decision first. Another issue is that you don’t know if a state is terminal (and what reward to assign) until the other three players have had their next turn, because the game finishes at the point when a player plays their final cards. To account for this it’s necessary to run an extra four steps after each batch, save these, and then put them in as the first four states of the next batch. It also makes vectorizing the whole batch processing a bit trickier to deal with but I was eventually able to get this working.
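As a rough illustration of the “next state is four steps later” point, here is a sketch of generalized advantage estimation with a stride. It assumes a single game whose time steps are interleaved between the four players, and that the reward and terminal flag for each decision have already been filled in (which is what the extra four steps are for). The function name and array layout are assumptions made for this example, not the code in my repository.

```python
import numpy as np

def gae_strided(rewards, values, dones, gamma=0.995, lam=0.95, stride=4):
    """GAE where the successor of step t is step t + stride (same player's next turn).

    rewards, dones: length T. values: length T + stride, where the last
    `stride` entries are the bootstrap values from the extra steps.
    """
    T = len(rewards)
    adv = np.zeros(T + stride)  # entries beyond T stay zero (truncated estimate)
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + stride] * not_done - values[t]
        adv[t] = delta + gamma * lam * not_done * adv[t + stride]
    return adv[:T]

# Toy check: a reward of 1 at a terminal step 4 only flows back to step 0,
# i.e. the same player's previous decision, not to steps 1-3.
rewards = np.zeros(8)
rewards[4] = 1.0
dones = np.zeros(8)
dones[4] = 1.0
adv = gae_strided(rewards, np.zeros(12), dones)
```

With `stride=1` this reduces to the usual single-agent recursion.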

In terms of the neural network architecture I went for the following:

Here the input layer has size 412 and contains information about the player’s current hand and the cards/hands that other players have already played, which I described in a previous blog post. This is fed into a fully connected layer of 512 neurons which use a “ReLU” activation function. This layer is shared and fed into two further hidden fully connected layers of 256 neurons each (also with ReLU activation), each of which is in turn connected to its own linear output layer (i.e. one with no activation function). The first of these outputs represents the probability weighting that the network gives to each possible action in the current state whilst the second outputs the value estimate of the current state. To get the actual probabilities we consider only the outputs of actions $$\{o_i \}$$ that are actually allowed in the current state and sample actions with probabilities: $$p_i = \frac{ e^{ o_i } }{\sum_j e^{o_j} }$$. This means there are nearly a million trainable parameters in this model! I chose the number of hidden neurons in a fairly arbitrary way really and have made no attempt to play around with this so far, but Big 2 is a reasonably complicated game so I figured that having fairly large layers would be sensible. Also sharing the first layer between the probability output and the value output seemed sensible as there are likely to be a lot of features of the game state which are useful for calculating both things!
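The restriction of the softmax to allowed actions can be sketched as follows. This is a stand-alone NumPy illustration with made-up names, not the TensorFlow code used for training.

```python
import numpy as np

def masked_action_probs(logits, allowed):
    """Softmax over the allowed actions only; disallowed actions get probability 0."""
    logits = np.asarray(logits, dtype=float)
    allowed = np.asarray(allowed, dtype=bool)
    masked = np.where(allowed, logits, -np.inf)
    masked = masked - masked[allowed].max()  # subtract the max for numerical stability
    weights = np.exp(masked)                 # exp(-inf) is exactly 0
    return weights / weights.sum()

# Four raw outputs, of which only the first two actions are legal right now.
probs = masked_action_probs([2.0, 1.0, 0.5, 3.0], [True, True, False, False])
rng = np.random.default_rng(0)
action = rng.choice(len(probs), p=probs)
```

Note that the largest raw output (3.0) belongs to an illegal action here and is simply ignored.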

In terms of the parameters I used for the PPO algorithm I chose to run 48 games in parallel and run them all for 20 steps to generate a training batch. This leads to batch sizes of 960 samples (which is tiny in comparison to what is used in the OpenAI paper for multi-agent environments, where they generated batches of around 400,000 samples apparently) but a lot more than was used for Atari, where they had only 8 games running in parallel. These batches were then divided into four mini-batches of equal size to be trained on with 5 epochs of SGD per batch. For the generalized advantage estimation I chose $$\gamma = 0.995$$ and $$\lambda = 0.95$$ and for the main PPO algorithm I chose a learning rate of $$\alpha = 0.00025$$ and a clip range of 0.2 (but both of these are linearly annealed to zero as the training progresses) and set the value/entropy coefficients in the loss function to 0.5 and 0.01 respectively. I then trained the agent by making it play against itself (always using the current version of the neural network – I’d quite like to experiment with some kind of opponent sampling and see if that makes any difference) for 136500 updates (~130 million time steps). I made absolutely no attempt to tune these hyperparameters in any way (although I know they are somewhat sensible from reading the results of the paper) so it’s really pretty cool that it ends up working so well!
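The arithmetic behind those numbers, plus a sketch of the linear annealing, is below. The exact form of the schedule (e.g. whether it starts counting at update 0 or 1) is an assumption for illustration.

```python
n_games, n_steps = 48, 20
n_minibatches, n_epochs = 4, 5
n_updates = 136_500

batch_size = n_games * n_steps                # 960 samples per update
minibatch_size = batch_size // n_minibatches  # 240 samples per mini-batch
total_steps = batch_size * n_updates          # 131,040,000, i.e. ~130 million

def annealed(initial, update, total_updates=n_updates):
    """Linearly decay `initial` to zero over the course of training."""
    return initial * (1.0 - update / total_updates)

lr_start = annealed(2.5e-4, 0)              # full learning rate at the start
lr_half = annealed(2.5e-4, n_updates // 2)  # half of it midway through
clip_half = annealed(0.2, n_updates // 2)   # same schedule for the clip range
```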

### Results

The main result is that the network which is trained learns to play Big 2 really well! I’ve only played 15 games against it so far because the GUI I’ve made is kind of clunky and needs improving a bit but I got well and truly embarrassed – and I’ve played the game a lot (although I am by no means a proper expert). From the 15 games I played against three of the final trained AIs my scores were: $$\{-3, -3, -1, -1, -10, -11, -1, -1, -4, -8, +16, -2, -5, +14, -8 \}$$

So I only won two games (and in one of those I had what was essentially an unbeatable hand). There were also situations where I could tell that the AI was playing really well and had clearly “planned” in some sense how to get rid of its cards, having learned both simple things like the value of passing to save 2s to gain control at a later stage in the game, and also when to play a certain number of cards when it gets control. I have to say this was really surprising and cool to see! Obviously I need to test this a bit more rigorously and so I am planning to make a web app where people can play against it and record the results so that I can get a better idea of how good it really is. One thing I have looked at is its performance against earlier versions of itself as well as against random opponents, and I get the following results (each point is averaged over 10,000 simulated games):

Note that the first data point is not at zero updates but after 1000 updates – hence a lot is getting learned very quickly! More importantly we see that there seems to be steady improvement throughout training which suggests that training for even longer could yield further improvement still!

To Do:

• A proper write up fully explaining the game, the PPO algorithm etc.
• A web app that allows for people to play and save their results in a leaderboard against the currently trained neural net.
• Experiment with different batch sizes (e.g. try >100 games running in parallel) and other parameters to see if performance can be improved further. Also would like to experiment with some kind of opponent sampling, i.e. only having one player definitely being the most recent neural network with other players being sampled from versions of the NN earlier on in the training. The point of this would be to try and ensure good play vs. all opponents, not just the new one. I guess what I’m thinking about here is the way a poker pro plays vs. a new player vs. another pro is completely different so this could be important to try and get the network to try and understand this sort of thing.
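For the opponent sampling idea in the last item, something along these lines is what I have in mind. This is purely a sketch of an untested idea: the checkpoint names are placeholders and uniform sampling over older versions is just the simplest assumption.

```python
import random

def sample_table(checkpoints, rng, n_players=4):
    """Seat the latest network plus (n_players - 1) opponents drawn from older checkpoints."""
    latest = checkpoints[-1]
    older = checkpoints[:-1] or [latest]  # fall back to pure self-play if there is no history
    opponents = [rng.choice(older) for _ in range(n_players - 1)]
    seats = [latest] + opponents
    rng.shuffle(seats)                    # randomise the seating order
    return seats

rng = random.Random(0)
table = sample_table(["net_1000", "net_2000", "net_3000"], rng)
```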

EDIT

Code is available on Github here. The webapp is finished, so you can play against the AI here (it will take a while to load most likely, as I am using a free Heroku account). A preprint of a paper describing things in more detail is here.

## Big 2 Reinforcement Learner Progress

This is just a post mainly to document (to myself more than anything) the progress that I’ve made over the past couple of weeks with the “Big 2 reinforcement learning project” that I’m working on as well as to lay out my plan for what to do next. The main goal of the project is to train an AI to learn how to play the four player card game “Big 2”. This is a game of imperfect information with a reasonably complicated action space and so should be an interesting challenge particularly as I have only recently become interested in the field of reinforcement learning.

So far I’ve mainly just been working on implementing the game logic in python and writing the code that generates what will be the input to the neural network. I spent some time creating a GUI in python with tkinter which allows you to generate a big 2 game, see what all of the options available to each player are, and essentially play through a full game. It also has a separate window that shows you what the input to the neural network for each player will be given the current state of the game. If you want to play around with it yourself you can find it here on Github or if you don’t have python installed I made a standalone executable you can run (although it’s quite a big file). I did this partly because I thought it would be interesting to learn how to make a GUI with python (and tkinter turns out to be really cool and easy to use!) but it’s also been extremely useful in debugging the game logic. Below is a screenshot of it in action:

In the middle are the three most recently played hands. The green circle shows that it is player 1’s turn and that they have control (it is red otherwise) and then in the top right are the card indices (from 0 to 12) that are playable hands in the current situation. When I have a neural network to play against it should be easy to add this in to the GUI so that I can play against it as well as display statistics on things like the value it associates with each option that it has available to it, so I think this has definitely been a good use of my time. The second window it shows is the input state that will be provided to the neural network when it has to make a decision:

This may not be absolutely final but it’s going to be what I try out initially. The first part of the input is the player’s actual hand where each card (up to a maximum of 13 – the size of the initial hand you’re dealt) is given a value and a suit. I also decided to include whether or not each card is part of a hand that uses more than one card (e.g. pair, straight, flush etc). In principle this could be learned by the network but I suspect/hope that it will be easier for the network to learn somewhat advanced strategies that involve saving cards for later on if you provide this information manually. Next we have the information about what the other players have/what they have played previously. One of the things which makes Big 2 interesting is that it is a game of imperfect information which of course means that we cannot provide the neural network with complete information about what is in each of the other player’s hands, but instead can only include what is “allowed” to be known. So far I have included the number of cards they have left, and whether at some previous point they have played any ace or any two (the highest cards), as well as whether they’ve played any pairs, three of a kinds, two pairs, straights, flushes or full houses. It’s at this point that I’m injecting the most “prior information” about the game of Big 2 because I know from experience that noting who’s played certain high cards/hands is important in figuring out what decision you should make. Ideally you’d just like to include a full history of every hand that each player has played so far and try and learn directly from that but that simply isn’t practical here and so until I can think of a better approach I’ll stick with this for the time being. 
We then have an input which describes the previous hand which needs to be beaten – the value/suit as well as the type and then finally a “cards played” input which tracks whether anyone (so not specifically who) has played any of the 16 most valuable cards. Again this is injecting some prior knowledge about the game because I am not tracking whether the least valuable cards have been played as this is relatively unimportant (as the game is really all about gaining/maintaining control at the right times) and the input vector is already big enough as it is – 412 elements at the moment.

So this means everything is set up quite nicely, and I now need to start thinking about exactly how I’m going to try and train a neural network to play. Hopefully this weekend I will have a test set up working where I can get two untrained neural networks to play against each other in an essentially random manner so that I have an idea of how long it will take to simulate a single game, and hence have some idea about the amount of data I can generate in a reasonable time frame. This will have some effect on what I decide to do because unfortunately I don’t work for Google and don’t have virtually unlimited computational resources at my disposal, and certain algorithms are much more computationally expensive than others!

### PLAN

Deep-Q Network

Probably the first thing I’m going to try is just a normal Deep Q Network like the one used in the Deepmind paper that learned to play Atari from raw pixel inputs. This involves training a deep neural network to try and learn the value function $$Q(s,a)$$ for any given state s and choice of action a. The reason that I’ll try this first is because it is relatively simple (although there are a few tricks needed such as using experience replay and target networks to ensure that the training is stable), and there exist a number of good online tutorials for implementing this algorithm that I can use as a template (e.g. here and here). The obvious difficulty that I can see with using this approach is that the number of possible actions is quite high (see my previous post for a discussion of the Big 2 action space). The usual approach is to learn a function which outputs a Q-value for each possible action, i.e. you only provide the state s as input to the neural network rather than the state and the action. This has the big advantage that you then only need to carry out one forward pass of the network to obtain the Q-values for each action so that you can easily choose the one which is largest. The disadvantage is that it means that the network does nothing in terms of generalizing over actions which might be similar, i.e. you learn a different function for each action essentially independently. In Big 2 the number of possible actions is 1694 (in the way that I have decided to represent them) so it seems likely to me that this approach simply will not be possible and that the action will also have to be provided as an input to the network, such that the output is just a single number $$Q(s,a)$$. This will cause an issue because then to choose the most “valuable” action you will have to evaluate the neural network for however many actions are available to the player in the current state, each time requiring a full forward pass of the neural network. 
The only reason this might be possible is that although the number of possible actions is 1694 the actual number of actions that are allowable in any given situation is usually significantly lower meaning that this approach could be feasible. However being completely realistic I am expecting this to be too slow to get really good performance so the target here will be to hopefully train a network that can reliably beat a group of players who make random moves.
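To illustrate what “one forward pass per allowed action” looks like, here is a toy sketch. The “network” is just a fixed random linear function standing in for a trained model, and the 5-dimensional action encoding is a hypothetical placeholder; only the sizes 412 and 1694 come from the text.

```python
import numpy as np

def best_action(q_fn, state, allowed_actions, action_features):
    """Evaluate Q(s, a) once per allowed action and return the arg-max action."""
    q_values = [q_fn(state, action_features[a]) for a in allowed_actions]
    return allowed_actions[int(np.argmax(q_values))]

# Toy stand-in for a trained network: a random linear function of [state, action].
rng = np.random.default_rng(0)
state_dim, action_dim, n_actions = 412, 5, 1694
weights = rng.normal(size=state_dim + action_dim)
action_features = rng.normal(size=(n_actions, action_dim))  # hypothetical encoding

def toy_q(state, action_vec):
    return float(weights @ np.concatenate([state, action_vec]))

state = rng.normal(size=state_dim)
allowed = [3, 17, 250]  # only a handful of actions are legal in any given state
choice = best_action(toy_q, state, allowed, action_features)
```

The cost scales with `len(allowed)` rather than with the full 1694 actions, which is the only reason this might be feasible.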

Large Action Spaces Paper

If this turns out to be too computationally expensive (or even if it doesn’t) I would then like to have a look at using the technique described in this paper by some of the folks at Deepmind that I found recently and which can be applied to situations where you have large discrete action spaces. I haven’t read through it in complete detail but the basic idea seems to be to describe actions as vectors $$\mathbf{a} \in \mathbb{R}^n$$ such that there is some (predefined) way of measuring how close actions are to each other. This will be quite natural given the way that I am choosing to represent the possible actions available in Big 2. The idea then seems to be to use an actor-critic approach whereby the actor chooses a “proto-action” $$\mathbf{\hat{a}} = f(s)$$ in continuous space and from which you evaluate the k-nearest actually allowable actions. Essentially the aim of the paper is to introduce an architecture that can generalize over actions without having the heavy cost associated with evaluating the neural network for every action that occurs when you explicitly include the action as an input to the neural network. Obviously I need to think more about the details (and read the paper properly!!) but it seems like it should be possible to apply this to the Big 2 action space so this will be interesting to try!
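My current understanding of the selection step can be sketched like this. It is based only on the rough description above (I still need to read the paper properly), so treat the function and its arguments as assumptions rather than the method from the paper.

```python
import numpy as np

def wolpertinger_select(proto_action, action_vectors, allowed, critic, state, k=3):
    """Pick the best-scoring of the k allowed actions nearest to the proto-action."""
    allowed = np.asarray(allowed)
    dists = np.linalg.norm(action_vectors[allowed] - proto_action, axis=1)
    nearest = allowed[np.argsort(dists)[:k]]  # the k nearest allowed actions
    scores = [critic(state, action_vectors[a]) for a in nearest]
    return int(nearest[int(np.argmax(scores))])

# Toy example in 2D with a critic that prefers a large first component.
action_vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
critic = lambda state, a: a[0]
chosen = wolpertinger_select(np.array([0.9, 0.1]), action_vectors,
                             allowed=[0, 1, 3], critic=critic, state=None, k=2)
```

Here action 3 would score highest with the critic, but it is too far from the proto-action to make the shortlist, so only `k` critic evaluations are needed.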

Alpha Zero and Generalizing the Monte Carlo Tree Search

The recent papers on “Alpha Go Zero” and then more generally “Alpha Zero” (which was applied to Chess and Shogi) are really, really cool! (and I should also mention this paper which uses a very similar technique and was developed at the same time!) The reason is because they learn to play these games completely from self-play, i.e. without any domain specific knowledge provided to them they just play games against each other and work out what moves lead to the best results. Doing this they beat their previous Alpha Go program with significantly less training time required and also beat the computer “world chess champion” which is pretty damn awesome if you ask me!

A key component of this algorithm is the Monte Carlo Tree Search (MCTS) which is used both during training of the neural network and afterwards once the neural network has been trained. I think the MCTS is a really cool algorithm and had a go at implementing my own version in c++ a while ago, applying it to the extremely complicated game of Tic Tac Toe. If you’ve never heard of it before then I would recommend reading this introductory blog post on the subject. Unfortunately despite it being a very cool algorithm it can only really be applied directly to games of perfect information. The basic idea is that you gradually build up a game tree:

(Image taken from “Recent Advances in General Game Playing”)

in a way that sensibly balances exploration and exploitation, i.e. you don’t build up a full game tree but just focus on “promising” branches. The actual implementation used by Alpha Zero is slightly different from the standard approach of using rollouts (and is again something I need to read up on in more detail – luckily there is an interesting post with code that implements the Alpha Zero algorithm on the much simpler game of Othello here. I am going to read through this and try to replicate their results when I get a chance). The reason this can only be applied to games of perfect information is that to properly build a game tree you need to know what state you will end up in when you take any particular action. In Big 2 you do not know this because you do not know what your opponents’ hands are, and the number of things they could do in their next move with all the possible cards they could have is far too large. Nevertheless I feel like it should be possible to do something similar where in the simplest case you just sample a certain number $$k$$ of possible hands that your opponents could have and use them to generate future states. This would be a very crude way of doing it – a more interesting way could be to train a neural network to try and learn how likely it is that a player has each card given what has happened during the game so far and then use these probabilities to sample the hands rather than doing so at random. Again I don’t have any details for how I could go about implementing this at the moment but it seems like something along these lines should be possible. 
Then if I have a neural network which has learned to play I would expect that using some generalization of MCTS on top of this would lead to much better play (as you explicitly explore a large number of possible futures for each option available to you and are just guided by the neural network – in fact even without a neural network you could expect that this might lead to decent play if you let it think for long enough. This would be interesting to check!). In an ideal world I would want to use the MCTS generalization to train the network too but this has got to be hugely expensive and I don’t have 5000 TPUs at my disposal so I suspect that this will not be realistic!
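The crude version of the sampling idea (deal the unseen cards to the opponents uniformly at random, $$k$$ times) could look something like the sketch below. Cards are just integer ids here and the function name is made up; nothing like this is implemented yet.

```python
import random

def sample_opponent_hands(unseen_cards, hand_sizes, k, rng):
    """Return k random deals of the unseen cards matching each opponent's hand size."""
    assert sum(hand_sizes) == len(unseen_cards)
    deals = []
    for _ in range(k):
        cards = list(unseen_cards)
        rng.shuffle(cards)
        deal, start = [], 0
        for size in hand_sizes:
            deal.append(sorted(cards[start:start + size]))
            start += size
        deals.append(deal)
    return deals

rng = random.Random(0)
unseen = list(range(20))  # hypothetical ids of cards not in our hand and not yet played
deals = sample_opponent_hands(unseen, hand_sizes=[7, 7, 6], k=5, rng=rng)
```

Each sampled deal turns the position into a perfect-information one, on which a normal tree search could then be run. Replacing the uniform shuffle with learned per-card probabilities would be the more interesting variant.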

Anyway this has been a bit of a brain dump – it will be interesting to look back on this in a few months time and see whether anything I’ve thought about here ends up working out!

## Big 2 Action Space

This is just a post with some thoughts on how I am planning to go about setting up the action space for the Big 2 reinforcement learner I am working on at the moment. For anyone unfamiliar with the game you can see the rules and try out a single player version of the game against a very basic AI (essentially just a series of “if” statements) here. The plan for this project is to train a neural network that learns to play the game through self-play reinforcement learning. There are a couple of things that make this an interesting and challenging game to apply reinforcement learning to (in my opinion anyway!). Firstly, unlike games such as Chess and Go it’s a game of imperfect information, i.e. each player does not have full knowledge of the current state of the game because they do not know what cards the other players have (although of course things can be deduced as the game progresses). Secondly, Big 2 is most naturally a four player game whereas the most famous applications of reinforcement learning have been primarily to two player games. Finally (and in this case similarly to Chess), the set of possible actions that can be taken is actually fairly complicated and state dependent. This is because the actions that can be taken depend on what the previous player has played and whether or not you have “control” of the game, and include all 2, 3, 4 and 5 card “poker hands”, i.e. pairs, three of a kinds, two pairs, four of a kinds, straights, flushes, full houses and straight flushes. This post is concerned with how I am planning to go about representing all of the actions that the neural network has available to it in a sensible way such that it is able to learn a sensible policy.

The reason this is tricky is because of being able to play hands with more than one card in them as it is possible for there to be many ways to do this. Let us consider the most extreme example where we have the following starting hand made up of 13 cards of the same suit:

Imagine that the previous player has played a straight. Clearly we can play a flush that beats this, but which flush should be chosen? In this case we can choose any 5 of the 13 cards to make a flush, which means there are $${13 \choose 5} = 1287$$ unique flushes that can be played here. I will do a separate post about what information will be used as the input state to the neural network but I think at the moment that it will most likely include the following (along with other information such as cards other people have played so far):

i.e. for each of the 13 initial cards there will be 13 binary values (0 or 1) for the value of each card as well as for the suit. The alternative would be to use 52 binary inputs representing whether each card is present in the current hand but I think the first method will be necessary if we’re going to use a neural network that can choose any possible action, as we can use the indices of the cards (1-13) to systematically describe all of the available actions, e.g. we can say that $$\{1,3,4,6,10 \}$$ would be the 5 card hand represented by the 1st, 3rd, 4th, 6th and 10th card in the current hand (of course this may not be a real hand, in which case the move would not be valid). If we are to consider every possible action we must allow for $${13 \choose 5}=1287$$ five card moves, $${13 \choose 4}=715$$ four card moves, $${13 \choose 3}=286$$ three card moves, $${13 \choose 2} = 78$$ two card moves and 13 one card moves. In fact this is actually more than is strictly necessary as for 2, 3 and 4 card hands there are certain index combinations that will never be possible (e.g. card 1 and card 13 can never be played together as a pair assuming the hand is sorted initially. But it might be better to not actually order the cards – this might need some more thought!) This leaves a total of 2379 possible actions (theoretically, most will of course not correspond to valid moves at any particular time). Initially I was worried that this would be far too many actions for a neural network to output and to be honest I am still worried, but reading the recent paper where the Alpha Go algorithm was applied to Chess I read the following:

So clearly their neural network considered an output space with an even larger number of possible moves, but they also have a hell of a lot more computing power at their disposal! Still this gives me hope that this approach is at least worth a go and may be able to work!
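As a quick sanity check on the figure of 2379 index combinations quoted above, the binomial coefficients can be summed directly:

```python
from math import comb

# Number of ways to pick n card positions out of a 13 card hand, for n = 1..5.
counts = {n: comb(13, n) for n in range(1, 6)}
total = sum(counts.values())
```

This gives 13, 78, 286, 715 and 1287 for one to five card moves respectively, and 2379 in total.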

To get anywhere with this approach we need a way of indexing the actions for a particular combination of cards so that we do not have to consider whether every single one of the theoretically possible actions is available each time a hand is updated. So for 5 card hands we can define an action by a set of integers $$\{c_1, c_2, c_3, c_4, c_5 \}$$ (with $$c_5 > c_4 > c_3 > c_2 > c_1$$ ) which we need to convert into a unique integer between 1 and 1287. It took me probably longer than it should have done to come up with a way to do this, but we can evaluate the sum:

$$\begin{eqnarray}a(c_1,c_2,c_3,c_4,c_5) = \sum_{i'=1}^{c_1-1} \sum_{j'=i'+1}^{10} \sum_{k'=j'+1}^{11} \sum_{l'=k'+1}^{12} \sum_{m'=l'+1}^{13} 1 + \sum_{i'=c_1+1}^{c_2-1} \sum_{j'=i'+1}^{11} \sum_{k'=j'+1}^{12} \sum_{l'=k'+1}^{13} 1 \nonumber \\ + \sum_{i'=c_2+1}^{c_3-1} \sum_{j'=i'+1}^{12} \sum_{k'=j'+1}^{13} 1 + \sum_{i'=c_3+1}^{c_4-1} \sum_{j'=i'+1}^{13} 1 + (c_5-c_4) \end{eqnarray}$$

which can be evaluated with Mathematica:

Now that I think about it we need to be able to invert this as well, i.e. for a given action index retrieve the $$\{c_1, c_2, c_3, c_4, c_5 \}$$ indices. This calculation can be done once and we can just store the result in memory so should be trivial.
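For reference, the same index can be computed in Python. Each nested sum above just counts the combinations that come earlier in lexicographic order, so it collapses to a binomial coefficient. The names `rank5` and `unrank5` are made up for this sketch, and the inverse is simply a table built once by enumeration.

```python
from itertools import combinations
from math import comb

def rank5(cards, n=13):
    """1-based lexicographic index of a sorted 5-tuple of card positions from 1..n."""
    index, prev = 1, 0
    for pos, c in enumerate(cards):
        remaining = len(cards) - pos - 1    # positions still to be chosen after this one
        for skipped in range(prev + 1, c):  # smaller values this slot could have taken
            index += comb(n - skipped, remaining)
        prev = c
    return index

# Precompute the inverse once and keep it in memory.
all_hands = list(combinations(range(1, 14), 5))
unrank5 = {i + 1: hand for i, hand in enumerate(all_hands)}
```

For example `rank5((1, 2, 3, 4, 5))` is 1 and `rank5((9, 10, 11, 12, 13))` is 1287, and `unrank5[rank5(h)] == h` for every hand `h`.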

The reason this is useful is that it means we don’t have to evaluate all 1287 combinations of $$\{c_i\}$$ and check whether each is a valid move in the current state. Imagine we have the following hand and are trying to calculate the 5 card options available to us:

we can play a straight with any combination of 4,5,6,7,8 (one of each). In general for calculating the straights available we can calculate the set of consecutive numbers (including repeats), so in the example above this would be $$\{4C,5H,5S,6D,6H,7C,8D,8H,QS,2H \}$$ and then calculate only the actions which are actually valid in this state and so need to be considered by the neural network. We can do similar things for flushes and full houses as well as for the 4,3,2 card hands.
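A rough sketch of enumerating the straights in the example above is given below. It treats a straight as any five consecutive ranks in the order 3, 4, ..., K, A, 2, which is an assumption (the exact rules for straights involving A and 2 vary between versions of the game), and the card strings and function name are just for illustration.

```python
from itertools import product

RANKS = ["3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A", "2"]

def straights(hand):
    """Yield every 5-card straight as a tuple of 0-based positions in `hand`."""
    by_rank = {}
    for idx, card in enumerate(hand):
        by_rank.setdefault(RANKS.index(card[:-1]), []).append(idx)
    for low in range(len(RANKS) - 4):
        run = [by_rank.get(r, []) for r in range(low, low + 5)]
        if all(run):                  # every rank in the run is present
            yield from product(*run)  # pick one card of each rank

hand = ["4C", "5H", "5S", "6D", "6H", "7C", "8D", "8H", "QS", "2H"]
options = list(straights(hand))
```

For this hand there are 1 × 2 × 2 × 1 × 2 = 8 straights, each given as sorted card positions that can be turned directly into an action index.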

This is the approach I would really like to take as this is basically not giving the neural network any human knowledge about the game and letting it learn everything from scratch which to me would be a lot more elegant and interesting. The other approach I have been considering is to consider a much smaller action space that uses a bit more game knowledge with options that represent what I consider to be reasonable potential options from any given state. E.g:

This approach requires a lot more calculation and analysis of each hand the neural network sees to find out e.g. what the best card is that is not a part of another hand. This quickly becomes inelegant and pretty annoying to code up and also means that there are possible moves that the neural network will never consider in certain situations. Given that I’m not an expert player at all I would rather only fall back on this approach if it seems like the action space is too large with the first approach I have suggested.

## EDIT

Having thought about it a bit more I have a much better solution I think which slightly reduces the size of the action space which needs to be considered by ensuring that we use a sorted hand and then just use a lookup table – for example considering 5 card moves we define a look up matrix of dimensions $$13 \times 13 \times 13 \times 13 \times 13$$ which we can index with the values of $$\{c_i \}$$. Although most entries will correspond to moves which can never be made $$13^5 = 371293$$ is small enough that it doesn’t take up too much memory. The following code is used to create a lookup table and an inverse lookup table:

    import numpy as np

    nActions5 = 1287  # number of potentially possible 5 card moves, i.e. C(13,5)
    # fiveCardIndices[c1][c2][c3][c4][c5] -> action index (for sorted positions c1 < ... < c5)
    fiveCardIndices = np.zeros((13,13,13,13,13), dtype=int)
    # inverseFiveCardIndices[action index] -> the five card positions
    inverseFiveCardIndices = np.zeros((nActions5,5), dtype=int)
    c = 0
    for c1 in range(9):
        for c2 in range(c1+1,10):
            for c3 in range(c2+1,11):
                for c4 in range(c3+1,12):
                    for c5 in range(c4+1,13):
                        fiveCardIndices[c1][c2][c3][c4][c5] = c
                        inverseFiveCardIndices[c][0] = c1
                        inverseFiveCardIndices[c][1] = c2
                        inverseFiveCardIndices[c][2] = c3
                        inverseFiveCardIndices[c][3] = c4
                        inverseFiveCardIndices[c][4] = c5
                        c += 1

For four card hands (i.e. two pairs or four of a kinds) we really don’t need to consider all $${13 \choose 4} = 715$$ moves as many are not allowable in any situation. Instead we can just create a lookup table as follows:

    nActions4 = 330  # number of four card index combinations kept after pruning
    fourCardIndices = np.zeros((13,13,13,13), dtype=int)
    inverseFourCardIndices = np.zeros((nActions4,4), dtype=int)
    c = 0
    for c1 in range(10):
        # in a sorted hand c2 can be at most 3 positions after c1
        nextInd = np.min([c1+3,10])
        for c2 in range(c1+1,nextInd+1):
            for c3 in range(c2+1,12):
                # likewise c4 can be at most 3 positions after c3
                nextInd = np.min([c3+3,12])
                for c4 in range(c3+1,nextInd+1):
                    fourCardIndices[c1][c2][c3][c4] = c
                    inverseFourCardIndices[c][0] = c1
                    inverseFourCardIndices[c][1] = c2
                    inverseFourCardIndices[c][2] = c3
                    inverseFourCardIndices[c][3] = c4
                    c += 1

And we can do a similar thing for three and two card hands as well. Doing this we have a total number of actions of $$13 + 33 + 31 + 330 + 1287 = 1694$$ rather than 2379 moves from the previous method. This possibly doesn’t help that much because we still have an enormous number of possible 5 card moves but every little helps!
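The counts of 33 and 31 for two and three card hands can be reproduced with the sketch below. It assumes the pruning rule is that, in a sorted hand, cards of the same value sit next to each other and there are at most four of each, so the first and last positions of a pair or three of a kind can be at most 3 apart. The helper name is made up for this example.

```python
from itertools import combinations

def index_sets(n_cards, max_span=3, hand_size=13):
    """Sorted position tuples whose first and last entries are at most max_span apart."""
    return [c for c in combinations(range(hand_size), n_cards)
            if c[-1] - c[0] <= max_span]

two_card = index_sets(2)
three_card = index_sets(3)
lookup2 = {c: i for i, c in enumerate(two_card)}  # position tuple -> action index
total_actions = 13 + len(two_card) + len(three_card) + 330 + 1287
```

This gives 33 two card and 31 three card index combinations, and hence the total of 1694.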