Machine Learning – A Probabilistic Perspective Exercises – Chapter 6

There are far fewer exercises in this chapter than in the previous ones and so this will be a much shorter post! Chapter 6 is on frequentist statistics, and it is clear that the author is not a fan!

6.1 Pessimism of LOOCV

Suppose we have a completely random labelled dataset (i.e. the features tell us nothing about the class labels) with \(N_1\) examples of class 1 and \(N_2\) examples of class 2, with \(N_1 = N_2\). What is the best misclassification rate that any method can achieve? What does LOOCV predict as the misclassification rate?

Let’s say \(N_1 = N_2 = N\) for notational simplicity. Clearly on average there is no method that can do better than a 50% misclassification rate. If we think about what happens when we apply leave-one-out-cross-validation, every time we will leave out one sample (either of class 1 or class 2, it doesn’t matter, but to be concrete let’s imagine it’s class 1). In the remaining data, which the best method can only predict completely randomly, we will then have \(N-1\) examples of the class 1 and \(N\) of class 2. As such this classifier will, on average, predict class 1 only with probability \(\frac{N-1}{2N-1} < \frac{1}{2}\) . This means that the predicted misclassification rate will be \( 100 \times \frac{N}{2N-1} \% \), which is worse than it should be (hence the “pessimism” of LOOCV).


6.3 MLE estimate for variance is biased

Show that \( \hat{\sigma}_{MLE}^2 = \frac{1}{N} \sum_{n=1}^N (x_n – \hat{\mu})^2 \) is a biased estimator of \(\sigma^2\). i.e. show:

\( \mathbb{E}_{X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2)} \left[ \hat{\sigma}^2(X_1, \dots, X_n) \neq \sigma^2\right] \)

To start with we can say:

\( N \mathbb{E} \left[ \hat{\sigma}_{MLE}^2 \right] = \sum_{n=1}^N \left( \mathbb{E}(x_n^2) – 2 \mathbb{E}(x_n \hat{\mu}) + \mathbb{E}(\hat{\mu}) \right) \)

The first term is just: \( \mathbb{E}(x_n^2) = \sigma^2 + \mu^2\), where \(\mu = \mathbb{E}(x_n)\). Next, we need to calculate \(\mathbb{E}(\hat{\mu} x_n) \). We can say:

\( \mathbb{E}(\hat{\mu} x_n) = \frac{1}{N} \mathbb{E} \left[ \sum_{n’=1}^N x_n x_n’ \right] = \frac{1}{N} \left[ \mathbb{E}(x_n^2) + (N-1) \mu^2 \right] = \frac{\sigma^2}{N} + \mu^2\)

We then note that \( \mathbb{E}(\hat{\mu}^2) = \frac{1}{N} \sum_{n=1}^N \mathbb{E}(\hat{\mu} x_n) = \mathbb{E}(\hat{\mu} x_n) = \frac{\sigma^2}{N} + \mu^2\)

Putting this together we find:

\( N \mathbb{E} \left[ \hat{\sigma}_{MLE}^2 \right] = \sum_{n=1}^N \left[ \sigma^2 + \mu^2 – (\frac{\sigma^2}{N} + \mu^2) \right] = \sum_{n=1}^N \left( \frac{N-1}{N} \sigma^2\right) \)

This means that:

\( \mathbb{E} \left[ \hat{\sigma}_{MLE}^2 \right] = \left( \frac{N-1}{N} \right) \sigma^2 \)

and hence this estimate is biased!


6.4 Estimation of the variance when the mean is known

Suppose we sample \( x_1, \dots, x_N \sim \mathcal{N}(\mu, \sigma^2)\), where \(\mu\) is a known constant. Derive an expression for the MLE for \(\sigma^2\). Is it unbiased?

We can write the log-likelihood as:

\( \log p(D | \sigma^2) = -\frac{N}{2} \log(\sigma^2) – \sum_i \frac{(x_i-\mu)^2}{2 \sigma^2} \)

Taking the derivative with respect to \(\sigma^2\):

\( – \frac{n}{2 \sigma^2} + \sum_i \frac{(x_i – \mu)^2}{2 \sigma^4} = 0 \)

So it’s clear that the MLE is \( \hat{\sigma}^2 = \frac{1}{N} \sum_i (x_i – \mu)^2 \), which is intuitively obvious. The interesting part is that this is now very clearly unbiased because:

\( \mathbb{E}(\hat{\sigma}^2) = \mathbb{E} \left[ \frac{1}{N} \sum_i (x_i – \mu)^2 \right] = \sigma^2 \)

just by definition. The difference between this result and that in the previous question is that here we are using the exact, known value of \(\mu\) here rather than our MLE estimate.

Machine Learning – A Probabilistic Perspective Exercises – Chapter 5

I imagine my rate of doing these exercises is going to drop slightly as I’ve started my machine learning internship at BMLL Technologies and will be there up until Christmas. Still, my plan is to (hopefully) get up to about Chapter 8 finished by then – although this may be optimistic as my working days are pretty long at the moment! Still, I am learning a lot which means the exercises should be easier, right?! Anyway, chapter 5 is about Bayesian statistics and there are significantly less exercises than in the previous two chapters!


5.1 Proof that a mixture of conjugate priors is indeed conjugate

We consider the case where we have a prior of the form: \(p(\theta) = \sum_k p(Z=k) p(\theta | Z=k) \) where each \(p(\theta | Z=k)\) is conjugate to the likelihood, i.e. \(P(\theta | D, Z=k) \)  is the same type of probability distribution as \(p(\theta | Z=k)\). We want to derive the general result:

\(p(\theta | D) = \sum_k p(Z=k | D) p(\theta | D, Z=k) \)

as this shows that the posterior is a weighted sum of the posterior of each mixture component – so the whole thing is conjugate to the mixture prior we started with. I’m not sure if I’m missing something and this is too simplistic but it seems we can just say:

\( \sum_k p(Z=k | D) p(\theta | Z=k, D) = \sum_k p(\theta, Z=k | D) = p(\theta | D) \)

An alternative way is to use Baye’s theorem:

\(p(\theta | D) = p( D | \theta) p(\theta)/p(D) = \sum_k \frac{p(D|\theta) p(Z=k) p(\theta | Z=k)}{p(D)}\)

We then note that \(P(D | \theta) = P(D | \theta, Z=k)\) since if we already know \(\theta\) then knowing what Z is tells us nothing. We then use that \(p(D | \theta, Z=k) p(\theta | Z=k) = p(\theta | D, Z=k) p(D | Z=k) \) and then \(\frac{p(D| Z=k) p(Z=k)}{p(D)} = p(Z=k | D) \) which gives us the result we want!

5.2 Optimal threshold on classification probability

Consider a case where we have learned a conditional probability distribution \(p(y | x)\), and suppose there are only two classes. Let \(p_0 = P(Y=0 | x)\) and \(p_1 = P(Y=1 | x)\). Consider the following loss matrix:

Predicted label \(\hat{y}\) True label y
0 1
0 0 \(\lambda_{01}\)
1 \(\lambda_{10}\) 0

(a) Show that the decision \( \hat{y}\) which minimises the expected loss is equivalent to setting a probability threshold \(\theta\)

The expected loss if we choose 0 is simply:

\(\rho( \hat{y}=0 | x) = \lambda_{01} p(y=1 | x) = \lambda_{01} p_1\)

if we choose 1:

\(\rho(\hat{y} =1 | x) = \lambda_{10} p(y=0 | x) = \lambda_{10} p_0 \)

So we will pick 0 if:

\(\lambda_{01} p_1 < \lambda_{10} p_0 = \lambda_{10} (1-p_1) \)

or re-arranging, if:

\(p_1 < \frac{\lambda_{1o}}{\lambda_{01} + \lambda{10}} \equiv \theta \)

Otherwise we pick 1.

(b) Show a loss matrix where the threshold is 0.1

We just want \( 0.1 = \frac{\lambda_{10}}{\lambda_{01} + \lambda_{10}}\), which implies that \(9 \lambda_{10} = \lambda_{01}\). So any loss matrix which satisfies this.


5.3 Reject option in classifiers

In many classification problems it is useful to include a reject option – where we are too uncertain to make a classification. Let \(\alpha_i\) be the action you choose with \(i \in \{1, \dots, C+1 \}\), where \(C\) is the number of classes and action \(C+1\) is reject. Let \(Y=j\) be the true (but unknown) state of nature. Consider the following loss:

\( L(\alpha_i | Y=j) = 0 \ \text{if} \ i=j\)

\( \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ = \lambda_r \ \text{if} \ i=C+1 \)

\( \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ = \lambda_s \ \text{otherwise} \)

(a) Describe how to minimise the risk

Let’s first define \(P(Y=j | x) = p_j\). Now if we choose an action \( a \in \{1, \dots, C\}\), then we have the following expected loss:

\( \rho(a | x) = \sum_{i=1}^C L(a | Y=i) p_i\)

But of course \(L(i | Y=i) = 0\) and for all other values of Y is a constant, so this gives:

\(\rho(a | x) = (1-p_a) \lambda_s\)

So choosing amongst classes we want to choose the a which  has the minimum value of: \(1-p_a) \lambda_s\). So we need:

\(p_a \ge p_b \ \forall b \in \{1,\dots,C\}\)

This shows that if we are going to choose a class then as we should intuitively expect we choose the most probable one. However, we can also choose to reject. In this case we have an expected loss equal to \(\lambda_r\), and so if we are going to choose a class rather than reject then we need the best class to satisfy:

\((1-p_a) \lambda_s \le \lambda_r \ \ \implies p_a \ge 1 – \frac{\lambda_r}{\lambda_s} \)

(b) Describe qualitatively what happens as \(\lambda_r\) is increased from 0 to 1.

The threshold for rejection changes from 1 down to 0.


5.4 More reject options

Consider, for example, a case in which the cost of misclassification is $10, but the cost of having a human manually make the decision is only $3.

(a) Suppose \(P(y=1 | x)\) is predicted to be 0.2. Which decision minimises the expected loss?

This is pretty much the same question as the previous one – clearly we are going to choose \(y=0\) if we are going to classify at all. The threshold for rejection is then: \(1-\frac{\lambda_r}{\lambda_s}) = 1 – \frac{3}{10} = 0.7\). So as \(P(y=0 | x) = 0.8\), we don’t reject and classify as \(y=0\).

(b) Now suppose \(P(y=1 | x) = 0.4\). Now which decision is optimal?

In this case, \(P(y=0 | x)=0.6\) which is below the reject threshold, and so we choose the reject option as we are not sure enough to predict \(y=0\).

(c) Show that in general there are two thresholds \(\theta_0\) and \(\theta_1\) such that the optimal decision is to predict 0 if \(p_1 < \theta_0\), reject if \( \theta_0 \le p_1 \le \theta_1\) and predict 1 if \(p_1 > \theta_1\)

This is clear given the results of the previous question: \(\theta_0 = 0.3\) and \(\theta_1 = 0.7\)


5.5 Newsvendor problem

Consider the following classic problem from decision theory/ economics. Suppose you are trying to decide how much quantity of some product (e.g. newspapers) to buy to maximize your profits. The optimal amount will depend on how much demand D you think there is, as well as its cost C and its selling price P. Suppose D is unknown but has pdf f(D) and cdf F(D), We can evaluate the expected profit by considering two cases: if \(D > Q\), then we sell all Q items, and make profit \(\pi = (P-C)Q\), but if \( D < Q\) we only sell D items, at profit \( (P-C)D\), but have wasted \(C(Q-D)\) on the unsold items. So the expected profit if we buy quantity Q is:

\( \mathbb{E}_{\pi}(Q)  = \int_Q^{\infty} (P-C)Q \ f(D) \ dD + \int_0^Q (P-C)D \ f(D) – \int_0^Q C(Q-D) \ f(D) \ dD \)

Show that the optimal \(Q*\) satisfies:

\(F(Q*) = \frac{P-C}{P}\)

Let’s start with saying \(F(a) = P(D \le a) = \int_0^a f(D) dD\), since \(F(0)=0\). Now:

\(\int_0^Q f(D) dD + \int_Q^{\infty} f(D)dD = F(Q) + \int_Q^{\infty} f(D)dD = 1 \ \implies \int_Q^{\infty} f(D) dD = 1-F(Q)\)

So the first term in the expectation is just: \((P-C)Q(1-F(Q))\). So using this and multiplying out a bit we have:

\( \mathbb{E}_{\pi}(Q)  = (P-C)Q – PQ F(Q) + CQ F(Q) + \int_0^Q (P-C)Df(D)dD – \int_0^Q CQf(D) dD + C\int_0^Q D f(D) dD\)

But the second to last term is just \(-CQ F(Q)\) which cancels with the third term. The last term also cancels with a part of the fourth term, so we just have:

\( \mathbb{E}_{\pi}(Q)  = (P-C)Q – PQ F(Q) + \int_0^Q P D f(D) dD\)

Differentiating with respect to Q:

\( (P-C) – P F(Q) – PQ f(Q) + PQ f(Q) = 0\)

And so therefore:

\((P-C) = P F(Q*) \implies F(Q*) = \frac{P-C}{P}\)


5.6 Bayes factors and ROC curves

Let \( B= p(D | H_1) / p(D | H_0)\) be the Bayes factor in favour of model 1. Suppose we plot two ROC curves, one computed by thresholding \(B\) and the other computed by thresholding \(p(H_1 | D)\). Will they be the same or different?

I’m not sure if I fully understand what this question is trying to get at, but we can write \(B = \frac{p(D | H_1)}{ p(D | H_0)} = \frac{ p(H_0) p(H_1 | D) }{ p (H_1) p(H_0 | D)}\), and so it seems like in general using B as a threshold will not be the same as using \(p(H_1 | D) \)


5.7 Bayes model averaging helps predictive accuracy

Let \(\Delta\) be a quantity we are trying to predict. Then let \(\mathcal{D}\) is the observed data and \(\mathcal{M}\) be a finite set of models. Suppose our “action” is to provide a probabilistic prediction p(), with a loss function \( L(\Delta, p()) = – \log(p(\Delta)) \). We can predict \(\Delta\) in two different ways – firstly by using Bayes model averaging:

\( p^{BMA}(\Delta) = \sum_{m \in \mathcal{M}} p(\Delta | m, D) p(m | D) \)

or using a single model (plugin approximation):

\(p^m(\Delta) = p(\Delta | m, D) \)

Show that for all models the posterior expected loss using BMA is lower, i.e.:

\( \mathbb{E} \left[ L(\Delta, p^{BMA}) \right] \le \mathbb{E} \left[ L(\Delta, p^m) \right] \)

where the expectation over \(\Delta\) is with respect to:

\(p(\Delta | D) = \sum_m p(\Delta | m, D) p(m | D) \)

This actually turns out to be pretty straightforward – we can write:

\( \mathbb{E} \left[ L(\Delta, p^{BMA}) \right] = – \int p^{BMA}(\Delta | D) \log(p^{BMA}(\Delta | D)) d \Delta \)


\( \mathbb{E} \left[ L(\Delta, p^m) \right] = -\int p^{BMA}(\Delta | D) \log(p^m(\Delta | D)) d \Delta \)

Now we consider the KL divergence between \(p^m\) and \(p^{BMA}\), using the fact that we know that it is always non-negative:

\(KL(p^{BMA}(\Delta | D), p^m(\Delta | D)) = \int p^{BMA}(\Delta | D) \log \left( \frac{p^{BMA}(\Delta | D)}{p^m(\Delta | D)} \right) d \Delta \ge 0 \)

\( = – \mathbb{E} \left[ L(\Delta, p^{BMA}) \right] + \mathbb{E} \left[ L(\Delta, p^m) \right] \ \implies \mathbb{E} \left[ L(\Delta, p^m) \right] \ge \mathbb{E} \left[ L(\Delta, p^{BMA}) \right] \)


5.8 MLE and model selection for a 2D distribution

Let \( x \in \{0,1\} \) denote a result of a coin toss (\(x=0\) for tails, \(x=1\) for heads). Let’s say that heads occurs with probability \(\theta_1\). Suppose that someone else observes the coin flip and observes the outcome, y. But this person is unreliable and only reports the result correctly with probability \(\theta_2\). Assume that \(\theta_2\) is independent of x and \(\theta_1\).

(a) Write down the probability distribution \(p(x,y | \theta\) in terms of \(\theta_1\) and \(\theta_2\).

\(p(x=0, y=0 | \theta) = (1-\theta_1) \theta_2\)

\(p(x=0, y=1 | \theta) = (1-\theta_1)(1-\theta_2) \)

\(p(x=1, y=0 | \theta) = \theta_1 (1-\theta_2) \)

\(p(x=1, y=1 | \theta) = \theta_1 \theta_2 \)

(b) Suppose we have the following dataset: \(\mathbf{x} = (1,1,0,1,1,0,0), \mathbf{y} = (1,0,0,0,1,0,1)\). What are the MLEs for \(\theta_1\) and \(\theta_2\)? What is \(P(D | \mathbf{\hat{\theta}}, M_2\)) (where \(M_2\) denotes this 2-parameter model)?

There are 4 heads in the data, and 4 values where the values of x and y are in agreement, so:

\(P(D | \mathbf{\theta}) = \theta_1^4 \theta_2^4 (1-\theta_1)^3 (1-\theta_2)^3\)

So the log-likelihood is:

\( \log(P(D|\mathbf{\theta})) = 4 \log(\theta_1) + 4 \log(\theta_2) + 3 \log(1-\theta_1) + 3 \log(1-\theta_2) \)

Taking the derivative with respect to \(\theta_1\) is identical to taking it with respect to \(\theta_2\), so the MLEs will be identical:

\( \frac{4}{\theta_1} – \frac{3}{1-\theta_1} = 0 \implies 4(1-\theta_1) = 3 \theta_1 \implies \theta_1 = \frac{4}{7} \)

Now looking at \(P(D | \mathbf{\hat{\theta}}, M_2)\):

\( \left(\frac{4}{7}\right)^4 \left(\frac{4}{7}\right)^4 \left(\frac{3}{7}\right)^3 \left(\frac{3}{7}\right)^3 \simeq 7 \times 10^{-5} \)

(c) Now consider a model with 4 parameters, \(\mathbf{\theta} = (\theta_{0,0}, \theta_{0,1}, \theta_{1,0}, \theta_{1,1})\), representing \(p(x,y|\mathbf{\theta}) = \theta_{x,y}\). Only three of these parameters are free (as they must sum to zero). What is the MLE of \(\mathbf{\theta}\)? What is \(p(D|\mathbf{\hat{\theta}}, M_4)\), where \(M_4\) denotes this 4-parameter model?

\(P(D | \mathbf{\theta}) = \theta_{0,0}^2 \theta_{1,0}^2 \theta_{1,1}^2 \theta_{0,1} \)

Now let’s say \( \theta_{0,1} = (1-\theta_{0,0} – \theta_{1,0} – \theta_{1,1}) \). Let’s take the log of this:

\( \log(P(D|\mathbf{\theta})) = 2 \log(\theta_{1,1}) + 2 \log(\theta_{1,0}) + 2 \log(\theta_{0,0}) + \log(1-\theta_{0,0} – \theta_{1,0} – \theta_{1,1}) \)

Clearly the MLE estimates of \(\theta_{0,0}\), \(\theta_{1,0}\) and \(\theta_{1,1}\) are equal, and \(\theta_{1,1} = 2 \theta_{0,1}\). It follows immediately that:

\( \hat{\theta}_{1,1} = \hat{\theta}_{1,0} = \hat{\theta}_{0,0} = \frac{2}{7} \)

\( \hat{\theta}_{0,1} = \frac{1}{7} \)

We can then evaluate \(P(D| \mathbf{\hat{\theta}}, M_4) = \frac{2^6}{7^7} \simeq 7.8 \times 10^{-5}\). This doesn’t seem significantly higher than the 2 parameter model, so already this is reason to suspect that the 2 parameter model would be preferable.

(d) Suppose we are not sure which model is correct. We compute the leave-one-out cross-validated log-likelihood of each model:

\(L(m) = \sum_{i=1}^n \log p(x_i, y_i | m, \hat{\theta}(\mathcal{D}_{-i}))\)

where \(\hat{\theta}(\mathcal{D}_{-i}) \) denotes the MLE computed on \(\mathcal{D}\) excluding row i. Which model will CV pick and why?

Initially I wrote some code to generate this for each model, but actually to answer the question this really isn’t necessary. The reason is that for \(M_4\) there is only one data point of \(x=0, y=1\), and so when we estimate the MLE for \(\theta_{0,1}\) this is zero. This means when we use it to evaluate the full log-likehood using our MLE estimate we get a \(\log(0) \to -\infty\) term. This means that clearly \(M_2\) will be favoured by this cross-validation technique.

(e) Recall that an alternative to CV is to use the BIC score, defined as:

\(\text{BIC}(M, \mathcal{D}) = \log p(\mathcal{D} | \mathbf{\hat{\theta}_{MLE}}) – \frac{\text{dof}(M)}{2} \log(N) \)

where \(\text{dof}(M)\) is the number of free parameters in the model. Compute the BIC scores for each model. Which is preferred?

For \(M_2\) we just have: \( \text{BIC} \simeq \log(7 \times 10^{-5}) – \log(7) \simeq -11.51\). For \(M_4\) we have:

\(\text{BIC} \simeq \log(7.8 \times 10^{-5}) – \frac{3}{2} \log(7) \simeq -12.38\)

Therefore we see that the BIC score also favours the simpler 2-parameter model!


5.9 Prove that the median is the optimal estimate under L1 loss

L1 loss means \( L(y,a) = |y-a|\), so the expected loss is:

\( \rho(a | x) = \mathbb{E} \left[ |y-a| | x \right] = \int p(y | x) |y-a| dy\)

Differentiating this:

\( \frac{d \rho}{da} = \int_{-\infty}^{\infty} – \text{sgn}(y-a) p(y|x) dy = \int_{-\infty}^a p(y|x)dy – \int_a^{\infty} p(y|x)dy = 0\)

This means:

\(\int_{-\infty}^a p(y|x)dy = \int_a^{\infty} p(y|x)dy\)

Which is essentially the definition of the median, so clearly a is the median.


5.10 Decision rule for trading off FPs and FNs

Let \(L_{FN} = c L_{FP}\). I’m pretty sure there is an error in the stated result because it just doesn’t make sense. If we pick \(\hat{y} = 1\), the expected loss is:

\(\rho(\hat{y}=1) = P(y=0 | x) L_{FP}\)

and similarly:

\(\rho(\hat{y}=0)=P(y=1 | x) L_{FN}\)

So we pick \(\hat{y}=1\) iff \(P(y=0|x) L_{FP} < P(y=1|x) L_{FN}\), which we can write as:

\( \frac{P(y=1 | x)}{P(y=0|x)} > \frac{L_{FP}}{L_{FN}} = \frac{1}{c} \)

Machine Learning – A Probabilistic Perspective Exercises – Chapter 4

This is a continuation of the exercises in “Machine learning – a probabilistic perspective” by Kevin Murphy. Chapter 4 is on “Gaussian Models”. Let’s get started!

4.1 Uncorrelated does not imply independent

Let \( X \sim U(-1,1) \) and \(Y = X^2\). Clearly Y is dependent on X, show \(\rho(X,Y)=0\).

\(\rho(X,Y)\) is just a normalised version of the covariance, so we just need to show the covariance is zero, i.e.:

\(\text{Cov}(X,Y) = \mathbb{E}[XY] – \mathbb{E}[X]\mathbb{E}[Y]\)

Clearly \( \mathbb{E}[X] = 0\) and so we just need to calculate \(\mathbb{E}[XY]\) and show this is zero. We can write:

\( \mathbb{E}[XY] = \int_{-1}^1 dx \int_0^1 dy \ xy p(x,y) \)

Then we say \(p(x,y) = p(y|x) p(x)\), but \(p(y|x) = \delta(y – x^2)\), i.e. a dirac-delta function, and \(p(x)=1/2\), i.e. just a constant. This means we can evaluate the integral over y to get:

\( \mathbb{E}[XY] = 1/2 \int_{-1}^1 x^3\)

This is the integral of an odd function and so is clearly equal to zero.


4.2 Uncorrelated and Gaussian does not imply independent, unless jointly Gaussian

Let \(X \sim \mathcal{N}(0,1)\) and \(Y=WX\), where W takes values \( \pm 1\) with equal probability. Clearly X and Y are not independent, as Y is a function of X.

(a) Show\(Y \sim \mathcal{N}(0,1)\)

This is kind of obvious from symmetry because \(\mathcal{N}(0,1)\) is symmetric, i.e. \( \mathcal{N}(x|0,1) = \mathcal{N}(-x|0,1) \). This means we can write:

\(P(Y=y) = P(W=1)P(X=y) + P(W=-1)P(X=-y) = P(X=y) = \mathcal{N}(0,1) \)

(b) Show covariance between X and Y is zero

We know that \(\mathbb{E}[X] = \mathbb{E}[Y] = 0\), so we just need to evaluate \( \mathbb{E}[XY]\):

\( \mathbb{E}[XY] = \int \int dx \ dy \ xy p(x,y)\)

But again \(p(x,y) = p(y|x)p(x)\), and we can write \(p(y|x) = 0.5 \delta(y-x) + 0.5 \delta(y+x)\). This means we are left with:

\(\mathbb{E}[XY] = \int_{-\infty}^{\infty} x \mathcal{N}(x|0,1)(0.5(x-x)) dx = 0\)

which proves the result.


4.3 Prove \(-1 \le \rho(X,Y) \le 1\)

Let us start with the definitions:

\(\rho(X,Y) = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X) \text{Var}(Y)}}\)

\(\text{Cov}(X,Y) = \mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])]\)

\(\text{Var}(X) = \mathbb{E}[(X-\mathbb{E}[X])^2] \)

Let us write \(\mu_X = \mathbb{E}[X]\) and \(\mu_Y = \mathbb{E}[Y]\), for notational convenience. If now for any constants a and b we consider:

\( \mathbb{E}[(a(X-\mu_X) + b(Y-\mu_Y))^2] \)

which is clearly greater than or equal to zero. Multiplying out, this inequality gives:

\(a^2 \mathbb{E}[(X-\mu_X)^2] + b^2 \mathbb{E}[(Y-\mu_Y)^2] + 2ab \mathbb{E}[(X-\mu_X)(Y-\mu_Y)] \ge 0 \)

Which we can re-write as:

\(2ab \text{Cov}(X,Y) \ge -a^2 \text{Var}(X) – b^2 \text{Var}(Y)\)

Now let us substitute in \(a^2 = \text{Var}(Y)\) and \(b^2 = \text{Var}(X)\):

\(2 \sqrt{\text{Var}(X) \text{Var}(Y)} \text{Cov}(X,Y) \ge -2 \text{Var}(X) \text{Var}(Y) \)

\( \implies \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X) \text{Var}(Y)}} = \rho(X,Y) \ge -1\)

If we do the same thing, but instead now consider \(\mathbb{E}[(a(X-\mu_X) – b(Y-\mu_Y))^2]\), with the same definitions of a and b, it’s easy to show that \( \rho(X,Y) \le 1\) as well.


4.4 Correlation coefficient for linearly related variables

If \(Y=aX + b\), then if \( a > 0 \) show that \( \rho(X,Y)=1\), and if \(a < 0\) that \( \rho(X,Y) = -1\).

Let’s say \(\mathbb{E}[X] = \mu_X\) and \(\text{Var}(X) = \sigma_X^2\). It follows that:

\( \mathbb{E}[Y] = a \mu_X + b\) and \( \text{Var}(Y) = a^2 \sigma_X^2\).

Now, to evaluate the correlation we need \(\mathbb{E}[XY] = \mathbb{E}[aX^2 + bX] = a \mathbb{E}[X^2] + b \mu_X\)

This means that the covariance is:

\( \text{Cov}(X,Y) = a \mathbb{E}[X^2] + b \mu_X – \mu_X(a \mu_X + b) = a \sigma_X^2\)

This allows us to get the correlation:

\( \rho(X,Y) = \frac{ \text{Cov}(X,Y)}{ \sqrt{\sigma_X^2 \sigma_Y^2}} = \frac{a \sigma_X^2}{\sqrt{a^2 \sigma_X^4}} = \frac{a \sigma_X^2}{|a| \sigma_X^2} = sgn(a)\)

Which is all we were asked to show!


4.5 Normalization constant for MV Gaussian

Prove that: \( (2 \pi)^{d/2} | \mathbf{\Sigma}|^{1/2} = \int \exp(-\frac{1}{2} (\mathbf{x-\mu}^T \mathbf{\Sigma}^{-1} (\mathbf{x-\mu})) d \mathbf{x} \)

We are told to diagonalize the covariance matrix, which can always be done since it is symmetric. That is, we can write:

\(D = P^{-1} \Sigma P\)

Where D is a diagonal matrix where the entries are the eigenvalues of \(\Sigma\) and the columns of P are the eigenvectors. In fact, since \(\Sigma\) is symmetric the eigenvectors can form an orthogonal basis it is possible to make P an orthogonal matrix, such that \(P^{-1} = P^T\). This allows us to say:

\(D^{-1} = P^T \Sigma^{-1} P \implies \Sigma^{-1} = P D^{-1} P^T\)

As such, we can write the integral as:

\( \int \exp(-\frac{1}{2}(x-\mu)^T P D^{-1} P^T(x-\mu)) dx = \int \exp(-\frac{1}{2} (P(x-\mu))^T \begin{bmatrix} \frac{1}{\lambda_1} & & \\ & \ddots & \\ & & \frac{1}{\lambda_d} \end{bmatrix} (P(x-\mu))) dx \)

Now let us define \(y = P(x-\mu)\). Because P is an orthogonal matrix (which has determinant 1), the Jacobian is 1 and we can replace \(dx\) with \(dy\). The term inside the exponential is then:

\( \sum_{ij} y_i \delta_{ij} \frac{1}{\lambda_i} y_j = \sum_i \frac{y_i^2}{\lambda_i}\). Effectively by transforming to the eigenbasis we have decoupled the components of y, so we can write:

\( = \int_{-\infty}^{\infty} dy_1 e^{-\frac{y_1^2}{2 \lambda_1}} \dots \int_{-\infty}^{\infty} dy_d e^{-\frac{y_d^2}{2 \lambda_d}}\)

i.e. just the product of many one-dimensional Gaussians. This is equal to:

\( \sqrt{2 \pi \lambda_1} \sqrt{2 \pi \lambda_2} \dots \sqrt{2 \pi \lambda_d} = (2 \pi)^{d/2} \sqrt{\lambda_1 \dots \lambda_d}\)

We then use that \(det(\Sigma) = \prod_{i=1}^d \lambda_i\), which gives us the final answer we want!


4.7 Conditioning a bivariate Gaussian

Take \(\Sigma = \sigma_1 \sigma_2 \begin{bmatrix} \frac{\sigma_1}{\sigma_2} & \rho \\ \rho & \frac{\sigma_2}{\sigma_1} \end{bmatrix} \)

(a) What is \(P(X_2 | x_1) \)?

Here we just need to apply the general result which was derived in the chapter – which is that the conditional distribution is also Normal with mean and covariance:

\( \mu_{1 | 2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1}(x_2-\mu_2) \)

\( \Sigma_{1 | 2} = \Sigma_{11} – \Sigma_{21} \Sigma_{22}^{-1} \Sigma_{21} \)

So in this bivariate case we simply have:

\( \mu_{2 | 1} = \mu_2 + \frac{\rho \sigma_1 \sigma_2}{\sigma_1^2}(x_1-\mu_1) \)

\( \sigma_{2 | 1}^2 = \sigma_2^2(1-\rho^2) \)

(b) What if \(\sigma_1 = \sigma_2 = 1\)?

Then we just have:

\(P(x_2 | x_1) = \mathcal{N}(x_2 | \mu_2 + \rho(x_1 – \mu_1), 1-\rho^2) \)

4.8 Whitening vs. Standardizing

(a) Firstly we load in the data it mentions (height vs weight) and calculate the empirical mean and covariance, and plot:

data = dlmread("heightWeightData.txt");
data = data(:,2:3);

dataMean = mean(data);
centredData = data - dataMean;
N = length(data);
covariance = (1/N)*(centredData')*centredData;

%%part a
hold on
scatter(data(:,1), data(:,2));
gaussPlot2d(dataMean, covariance);

(b) Standardizing – subtract the mean and divide by the standard deviation

%%part b
empStd = sqrt([covariance(1,1), covariance(2,2)]);
standardData = centredData ./ empStd;
sMean = mean(standardData);
sCovariance = (1/N)*(standardData')*standardData;

hold on
scatter(standardData(:,1), standardData(:,2));
gaussPlot2d(sMean, sCovariance);

(c) This section of the question had a typo in it which confused me for a while – it says U is the matrix of eigenvectors of X but this should be the eigenvectors of \( \Sigma\). I decided to go into a bit more detail to understand what the whitening transformation actually does – the idea is it transforms the data in such a way that that the covariance matrix is the identity matrix.

If we take X to be the \(k \times N\) data matrix (in this case k=2), we are looking for a transformation W such that the covariance of \(Y = W X\) is the \(k \times k\) identity matrix. Now, the (empirical) covariance matrix of the data is given by: \( \Sigma = \frac{1}{N} X X^T \). If we perform an eigendecomposition of this we can write:

\( \Sigma = U \Lambda U^T\)

Where U is a matrix with columns made up of the eigenvectors of \(\Sigma\) and \(\Lambda\) is a diagonal matrix where the entries are the eigenvalues of \(\Sigma\). We can do this because \(\Sigma\) is real and symmetric, and we can also rewrite this as: \(\Lambda = U^T \Sigma U\). Now, let’s consider the empirical covariance of Y:

\( \Sigma_Y = \frac{1}{N} Y Y^T = \frac{1}{N} W X X^T W^T = W \Sigma W^T\)

So we see that if we wanted \(\Sigma_Y\) to be diagonal and equal to \(\Lambda\), we could choose \(W = U^T\). If we want it to be the identity matrix, we can achieve this by going one step further and saying \( W = \Lambda^{-1/2} U^T\). This provides the motivation for the formula given in the question – in code:

%%part c
[V, D] = eig(covariance);
W = (D^(-0.5))*(V');
Y = W*(centredData');
yMean = mean(Y,2);
yCovariance = (1/N)*Y*(Y');
hold on;
scatter(Y(1,:), Y(2,:));
gaussPlot2d(yMean, yCovariance);


4.11 Derivation of the NIW posterior

We are asked to show that the posterior for the MV normal parameters are given by:

\(NIW(\mu, \Sigma | m_N, \kappa_N, \nu_n, S_N) \)

When we use an NIW (normal inverse-Wishart) prior:

\(NIW(\mu, \Sigma | m_0, \kappa_0, \nu_0, S_0)\)

and \(m_N = \frac{ \kappa_0 m_0 + N \bar{x}}{\kappa_N}\), \(\kappa_N = \kappa_0 + N\), \(\nu_N = \nu_0 + N\), \(S_N = S_0 + S_{\bar{x}} + \frac{\kappa_0 N }{\kappa_0 + N} (\bar{x}-m_0)(\bar{x}-m_0)^T\)

We are given the following hint: \( N(\bar{x}-\mu)(\bar{x}-\mu)^T + \kappa_0 (\mu-m_0)(\mu-m_0)^T = \kappa_N (\mu – m_N)(\mu-m_N)^T + \frac{\kappa_0 N}{\kappa_N}(\bar{x}-m_0)(\bar{x}-m_0)^T\)

Now the actual form of the prior is:

\( NIW(\mu, \Sigma | m_0, \kappa_0, \nu_0, S_0) = \frac{1}{Z_{NIW}} |\Sigma|^{-1/2} \exp(-\frac{\kappa_0}{2} (\mu-m_0)^T \Sigma^{-1} (\mu-m_0)) |\Sigma|^{-\frac{\nu_0 + D + 1}{2}} \exp(-\frac{1}{2} tr(\Sigma^{-1} S_0)) \)

and we can write the likelihood as:

\( P(D | \mu, \Sigma) = (2\pi)^{-ND/2} |\Sigma|^{-N/2} \exp(-1/2 \sum_{i=1}^N (x_i – \mu)^T \Sigma^{-1} (x_i – \mu)) \)

during the chapter it was stated that this could be rewritten as:

\( P(D | \mu, \Sigma) = (2\pi)^{-ND/2} |\Sigma|^{-N/2} \exp( -\frac{N}{2}(\mu – \bar{x})^T \Sigma^{-1} (\mu – \bar{x})) \exp( -\frac{1}{2} tr(\Sigma^{-1} S_{\bar{x}})) \)

although this was not proven. Since it’s crucial for getting this result, I thought I would fill in the details. If we apply the “trace trick” we can say:

\( \sum_i (x_i – \mu)^T \Sigma^{-1} (x_i-\mu) = tr(\Sigma^{-1} \sum_i (x_i – \mu) (x_i – \mu)^T = tr(\Sigma^{-1} \left[ (\sum_i x_i x_i^T) – N \mu \bar{x}^T – N \bar{x} \mu^T + N \mu \mu^T \right]) \)

Then if we consider:

\(tr(\Sigma^{-1} S_{\bar{x}}) = tr(\Sigma^{-1} \sum_i (x_i-\bar{x})(x_i – \bar{x})^T ) = tr(\Sigma^{-1} ((\sum_i x_i x_i^T) – N \bar{x} \bar{x}^T)) \)

and then:

\(N(\bar{x}-\mu)^T \Sigma^{-1} (\bar{x}-\mu) = tr(\Sigma^{-1} N (\bar{x} \bar{x}^T – \mu \bar{x}^T – \bar{x} \mu^T + \mu \mu^T))

Writing them out explicitly like this it should be easy to see that:

\( \sum_i (x_i-\mu)^T \Sigma^{-1} (x_i – \mu) = tr(\Sigma^{-1} S_{\bar{x}}) + N(\bar{x}-\mu)^T \Sigma^{-1} (\bar{x}-\mu) \)

from which we can easily get the second expression for the likelihood. Now the posterior is proportional to the likelihood multiplied by the prior:

\(P(\mu, \Sigma | D) \propto |\Sigma|^{-1/2} |\Sigma|^{-\frac{\nu_0 + D + N + 1}{2}} \exp(-\frac{1}{2} (\kappa_0 (\mu-m_0)^T \Sigma^{-1} (\mu-m_0)  + N (\mu-\bar{x})^T \Sigma^{-1} (\mu-\bar{x})))  \exp(-\frac{1}{2} tr(\Sigma^{-1}(S_0 + S_{\bar{x}}))) \)

Now if we substitute the result in the hint we can rewrite the first exponential term as:

\(\exp( -\frac{1}{2} (\kappa_N (\mu – m_N)^T \Sigma^{-1} (\mu-m_N) + tr(\Sigma^{-1} \frac{\kappa_0 N}{\kappa_N} (\bar{x}-m_0)(\bar{x}-m_0)^T)) \)

From this point we just have to collect terms and look at the definitions and it should be clear that the stated result is true!


4.12 BIC for Gaussians

The BIC or “Bayesian Information Criterion” is a concept actually introduced in the next chapter for model selection, and represents an approximation to the marginal likelihood given the model. Read ahead to chapter 5 for more info! The definition is:

\( BIC = \log(P(D | \hat{\theta_{ML}})) – \frac{d}{2} \log(N) \)

where here d is the number of parameters in the model. We are asked to derive this in the case of a multivariate Gaussian model. We start by writing the log-likelihood (using the trace trick):

\( \log P(D | \hat{\Sigma}, \hat{\mu}) -\frac{N}{2} tr( \hat{\Sigma} S) – \frac{N}{2} \log( |\Sigma|) \)

where S is the scatter matrix: \( S = \frac{1}{N} \sum_{i=1}^N (x_i – \hat{\mu}) (x_i – \hat{\mu})^T \)

But of course this is exactly equal to the empirical covariance, which is equal to the maximum likeilhood, and so the first term is simply \( -ND/2)\) where D is the number of dimensions.

(a) Derive BIC when we use a full covariance matrix.

In this case, there are D parameters in the mean and D(D+1)/2 in the covariance matrix (which has to be symmetric), so a total of D(D+3)/2. This means the BIC is simply:

\( BIC = -\frac{ND}{2} – \frac{N}{2} \log( |\hat{\Sigma}|) – \frac{D(D+3)}{4} \log(N) \)

(b) How about when we have a diagonal covariance matrix

The hint seems to suggest there is more to this, but as far as I can tell the only difference is the number of parameters (which is now just 2D). I don’t see how having a diagonal covariance simplifies things any further, but I may be missing something! Anyway, this gives the BIC as:

\( BIC = \frac{ND}{2} – \frac{N}{2} \log( |\hat{\Sigma}|) – D \log(N) \)


4.13 Gaussian posterior credible intervals

Let \( X \sim \mathcal{N}(\mu, \sigma^2 = 4)\) where \(\mu\) is unknown but has prior \( \mu \sim \mathcal{N}(\mu_0, \sigma_0^2 = 9) \). The posterior after seeing n samples is \( \mu \sim \mathcal{N}(\mu_n, \sigma_n^2) \). How big does n have to be to ensure \( p(l \le \mu_n \le u | D) \ge 0.95\), where \( (l,u)\) is an interval of width 1 centred on \(\mu_n\). Hint: recall 95% of the probability mass is within \( \pm 1.96 \sigma\) of the mean.

Essentially what we need to do is to calculate n such that \( \sigma_n = 0.5 / 1.96\). To do this, we just need to use the result derived in the chapter for the posterior distribution of a Gaussian mean, given that \( \Sigma\) is given:

\( P(\mu | D, \Sigma) = \mathcal{N}(\mu | m_n, V_n) \)

where \( V_n^{-1} = V_0^{-1} + n \Sigma^{-1} \)

In our case this means:

\( \frac{1}{\sigma_n^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2} = \frac{\sigma^2 + n \sigma_0^2}{\sigma_0^2 \ \sigma^2} \)

Rearranging this we get:

\( n = \frac{\sigma_2}{\sigma_n^2} – \frac{\sigma_2}{\sigma_0^2} \)

So putting in the numbers I find that we need \( n \simeq 61\).


4.14 MAP estimation for 1D Gaussians

Consider samples \( x_1 \dots x_n \) from a Gaussian RV with known variance \( \sigma^2\) and unknown mean \( \mu\). We assume a prior over the mean \( \mu \sim \mathcal{N}(m, s^2) \), where m and s are fixed.

(a) Calculate the MAP estimate \( \hat{\mu}_{MAP}\)

Rather than actually differentiating and setting equal to zero, we know the most probable value of a Gaussian is equal to its mean, and so we know that \( \hat{\mu}_{MAP}\) is simply the posterior mean. We know from results derived in the main chapter than \( m_N = V_N (\Sigma^{-1}(N \bar{x}) + V_0^{-1} m_0) \). We just calculated \(V_N\) (or really \(\sigma_N\) as we are in 1D) in the previous question, so we find that:

\(m_N = \hat{\mu}_{MAP} = \frac{s^2 \sigma^2}{\sigma^2 + N s^2} ( \frac{N \bar{X}}{\sigma^2} + \frac{m}{s^2}) = \frac{s^2 N \bar{x} + \sigma^2 m}{\sigma^2 + N s^2} \)

(b) Show that as n increases this converges to the MLE.

as \(N \to \infty\), \(\hat{\mu}_{MAP} \to \frac{s^2 N \bar{x}}{N s^2} = \bar{x} \), which we know is the MLE.

(c) increase prior variance s^2

\(\hat{\mu}_{MAP} \to \frac{s^2 N \bar{x}}{N s^2} = \bar{x} \)

so it converges to the MLE again.

(d) decrease s^2 (s -> 0)

clearly \( \hat{\mu}_{MAP} \to m\), i.e. the prior mean.

All of these results make intuitive sense.


4.15 Sequential updating of \( \hat{\Sigma}\)

The unbiased estimate for the covariance of a d-dimensional Gaussian based on n samples is given by:

\( \hat{\Sigma} = C_n = \frac{1}{n-1} \sum_{i=1}^n (x_i – m_n) (x_i – m_m)^T \)

This takes \(O(nd^2)\) to compute.

(a) Show that the covariance can be updated sequentially as follows:

\( C_{n+1} = \frac{n-1}{n} C_n + \frac{1}{n+1} (x_{n+1} – m_n)(x_{n+1}-m_n)^T\)

To start with let’s write:

\(C_{n+1} = \frac{1}{n} \sum_{i=1}^{n+1} (x_i – m_{n+1})(x_i – m_{n+1})^T\)

Next, let us note now that we can update the estimate of the mean sequentially as follows:

\( m_{n+1} = \frac{1}{n+1} \sum_{i=1}^{n+1} (n \ m_n + x_{n+1}) \)

However, for now we’ll leave it as \(m_{n+1}\). It’s easy to show that we can expand out the previous two equations for \(C_n\) and \(C_{n+1}\) in the following way:

\( n C_{n+1} = (\sum_{i=1}^{n+1} x_i x_i^T) – (n+1) m_{n+1} m_{n+1}^T \)

\((n-1) C_n = (\sum_{i=1}^n x_i x_i^T) – n \ m_n m_n^T \)

If we now expand out \(m_{n+1}m_{n+1}^T\):

\( m_{n+1} m_{n+1}^T = \frac{1}{(n+1)^2} ( n^2 m_n m_n^T + n m_n \_{n+1}^T + n x_{n+1} m_n^T + x_{n+1} x_{n+1}^T) \)

then subtract \( n C_{n+1} – (n-1) C_n\), and substitute in the above expression we find:

\( n C_{n+1} – (n-1) C_n = x_{n+1} x_{n+1}^T + n m_n m_n^T – \frac{1}{n+1}(n^2 m_n m_n^T + n m_n \_{n+1}^T + n x_{n+1} m_n^T + x_{n+1} x_{n+1}^T) \)

Simplifying and rearranging this we get the result which is stated.

(b) How much time does it take per sequential update?

The most expensive part here is the outer product, which is \(O(d^2)\).

(c) Show that we can sequentially update the precision matrix

We just showed that:

\(C_{n+1} = \frac{n-1}{n} C_n + \frac{1}{n+1} u u^T \)

where \(u = x_{n+1}-m_n\). We just need to use the matrix inversion lemma, given as a hint in the question:

\( (E + u v^T)^{-1} = E^{-1} – \frac{ E^{-1} u v^T E^{-1}}{1 + v^T E^{-1} u}\)

so using \( E = \frac{n-1}{n} C_n\) and \(u = v = \frac{1}{\sqrt{n+1}}(x_{n+1}-m_n) \) we get:

\( C_{n+1}^{-1} = \frac{n}{n-1} C_n^{-1} – \frac{ \frac{n}{n-1} C_n^{-1} \frac{1}{n+1} u u^T \frac{n}{n-1} C_n^{-1}}{1 + \frac{1}{n+1} u^T \frac{n}{n-1} C_n^{-1}  u} \)

Some simple rearranging gives the result given in the textbook:

\(C_{n+1}^{-1} = \frac{n}{n-1} \left[ C_n^{-1} – \frac{ C_n^{-1}(x_{n+1}-m_n)(x_{n+1}-m_n)^T C_n^{-1}}{\frac{n^2-1}{n} + (x_{n+1} – m_n)^T C_n^{-1} (x_{n+1}-m_n)} \right] \)

(d) What is the time complexity per update?

There are two key things to note here – one is that we don’t need to compute \(C_n^{-1}\) as we already have this in memory (or at least we should do). The second is that we can split the matrix multiplications up:

\( A_1 = C_n^{-1} (x_{n+1}-m_n) \)

this is a \( d \times d\) matrix multiplied by a \(d \times 1\) column vector, so this is \(O(d^2)\). Then we can do:

\( (x_{n+1}-m_n)^T C_n^{-1}\)

which is a \( 1 \times d\) row vector times a \( d \times d\) matrix, which is again \(O(d^2)\). Combining the resulting vectors together as an outer product is again \(O(d^2)\). We can make the same argument about the denominator where all the matrix multiplications are again \(O(d^2)\), and so this is the overall update complexity!


4.16 Likelihood ratio for Gaussians

Consider a binary classifier where the K class conditional densities are MVN \(p(x|y=j) = \mathcal{N}(x | \mu_j, \Sigma_j)\). Derive an expression for the log-likelihood ratio \( \log \frac{p(x | y=1)}{p(x | y=0)} \) for:

(a) arbitrary covariance \( \Sigma_j\), (b) shared covariance \(\Sigma_j = \Sigma\), (c) shared, axis-aligned covariance \( \Sigma_j = \Sigma\) with \(\Sigma_{ij}=0\) for \(i \ne j\) and (d) shared spherical covariance \(\Sigma_j = \sigma^2 I\).

(a) full form:

\( \frac{1}{2}((x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1) – (x-\mu_0)^T \Sigma_0^{-1} (x-\mu_0) \)

(b) we can simplify slightly:

\( -\frac{1}{2} tr(\Sigma^{-1} \left( (x-\mu_1)(x-\mu_1)^T – (x-\mu_0)(x-\mu_0)^T \right) \)

(c) We can multiply the matrices out here to simplify it a bit further:

\( -\frac{1}{2} \sum_{i=1}^K \frac{1}{\sigma_i^2} \left[ (x_i – \mu_{1i})^2 – (x_i – \mu_{0i})^2 \right] \)

(d) The only further simplification I can see if to factor out the variance:

\( -\frac{1}{2 \sigma^2} \sum_{i=1}^K \frac{1}{\sigma_i^2} \left[ (x_i – \mu_{1i})^2 – (x_i – \mu_{0i})^2 \right] \)


4.17 LDA/QDA on height/weight data

If we run the code (which you can get from the book’s github repository) we can see how the classifier works:

The following is my MATLAB code to calculate the misclassification rates for the LDA and the QDA:

rawdata = dlmread("heightWeightData.txt");
data.Y = rawdata(:,1); % 1=male, 2=female
data.X = [rawdata(:,2) rawdata(:,3)]; % height, weight
maleNdx = find(data.Y == 1);
femaleNdx = find(data.Y == 2);
classNdx = {maleNdx, femaleNdx};

%untied - i.e. class-specific covariances
X_male = data.X(classNdx{1},:);
X_female = data.X(classNdx{2},:);

mu{1} = mean(X_male); mu{2} = mean(X_female);
Sigma{1} = cov(X_male); Sigma{2} = cov(X_female);

prob_c1 = gaussProb(data.X, mu{1}, Sigma{1});
prob_c2 = gaussProb(data.X, mu{2}, Sigma{2});

predictedClasses = (prob_c2 > prob_c1) + 1;
misclassificationQDA = 100*(sum(predictedClasses ~= data.Y) / length(data.X));

%shared covariance, i.e. LDA
mu{1} = mean(X_male); mu{2} = mean(X_female);
Sigma{1} = cov(data.X); Sigma{2} = Sigma{1};

prob_c1 = gaussProb(data.X, mu{1}, Sigma{1});
prob_c2 = gaussProb(data.X, mu{2}, Sigma{2});

predictedClasses = (prob_c2 > prob_c1) + 1;
misclassificationLDA = 100*(sum(predictedClasses ~= data.Y) / length(data.X));

I find that for the LDA the misclassification rate is 12.4%, and for the QDA it’s reduced slightly at 11.9%.


4.18 Naive bayes with mixed features

Consider a 3 class naive Bayes classifier with one binary feature and one Gaussian feature.:

\( y \sim Mu(y | \pi, 1)\), \( x_1 | y = c \sim Ber(x_1 | \theta_c)\), \( x_2 | y=c \sim \mathcal{N}(x_2 | \mu_c, \sigma_c^2) \)

Let the parameter vectors be the following:

\(\pi = (0.5, 0.25, 0.25), \theta = (0.5,0.5,0.5), \mu=(-1,0,1), \sigma^2 = (1,1,1) \)

(a) Compute \(p(y|x_1=0, x_2=0)\)

\(p(y | x) \propto p(x | y) p(y) \), so:

\(p(y=c | x) \propto Ber(0, \theta_c) \mathcal{N}(0 | \mu_c, \sigma_c^2) \pi_c \propto (1-\theta_c) \exp( – \frac{\mu_c^2}{2 \sigma_c^2}) \pi_c \)

So we just have to put these numbers in for each class, and then normalize afterwards. Doing this I find: \( [0.4302, 0.3547, 0.2151] \).

(b) Compute \(p(y | x_1 = 0) \)

\(p(y | x_1 = 0) \propto p(x_1 = 0 | y) p(y) = (1-\theta_c) \pi_c \)

So we just put the numbers in again and normalise: \([0.5, 0.25, 0.25]\)

(c) Compute \(p(y | x_2 = 0)\)

\( p(y | x_2 = 0) \propto p(x_2 = 0 | y) p(y) = \mathcal{N}(0 | \mu_c, \sigma_c^2) \pi_c \)

Put numbers in and normalise: \([0.4302, 0.3547, 0.2151]\).

(d) Explain any interesting patterns you see in your results

I guess it’s interesting that the answers to (a) and (c) are identical – we see this arises because all of the \(\theta\) values are equal. This is also the reason why the answer to (b) is equal to the prior.


4.19 Decision boundary for LDA with semi-tied covariance

Consider a generative classifier with class-conditional densities of the form \(\mathcal{N}(x | \mu_c, \Sigma_c ) \). We consider a two class case in which \( \Sigma_1 = k \Sigma_0 \), with \( k > 1\). Derive an expression for \(p(y=1 | x, \theta)\), simplifying as much as possible. Give a geometric interpretation of the result.

\(p(y=c | x) \propto \pi_c \mathcal{N}(x| \mu_c, \sigma_c^2) \)

so we can say:

\( p(y=1 | x) = \frac{ \pi_c |k \Sigma_0 |^{-1/2} \exp( -\frac{k}{2}(x-\mu_1)^T \Sigma_0^{-1} (x-\mu_1))}{ \pi_1 |k \Sigma_0|^{-1/2} \exp( -\frac{k}{2}(x-\mu_1)^T \Sigma_0^{-1} (x-\mu_1)) + \pi_0 |\Sigma_0|^{-1/2} \exp( -\frac{1}{2}(x-\mu_0)^T \Sigma_0^{-1} (x-\mu_0))} \)

Using \( |k \Sigma_0|^{-1/2} = k^{-d/2} |\Sigma_0| \), we can simplify this a bit:

\( = \frac{1}{1 + \frac{\pi_0}{\pi_1} k^{d/2} \exp( -\frac{1}{2} \left[ (x-\mu_0)^T \Sigma_0^{-1} (x-\mu_0) – k(x-\mu_1)^T \Sigma_0^{-1}(x-\mu_1) \right]} \)

We can go a bit further by saying the term in the exponential is:

\( -\frac{1}{2} tr( \Sigma_0^{-1}((x-\mu_0)(x-\mu_0)^T – k(x-\mu_1)(x-\mu_1)^T) \)

and then expanding out the outer products:

\( x x^T – \mu_0 x^T – x \mu_0^T + \mu_0 \mu_0^T – k x x^T + k \mu_1 x^T + k x \mu_1^T – k \mu_1 \mu_1^T \)

we can then complete the square:

\( = (1-k) (x- \frac{(\mu_0 – k \mu_1)}{1-k})(x-\frac{\mu_0-k \mu_1}{1-k})^T + C \)

where C is a constant involving only k, \(\mu_1\) and \(\mu_0\) which I can’t be bothered to calculate. This means the x-dependence in the exponential is:

\( \exp(-\frac{(1-k)}{2} (x-\frac{\mu_0-k\mu_1}{1-k})^T \Sigma_0^{-1} (x-\frac{\mu_0 – k \mu_1}{1-k})) \)

But I don’t really see any way of simplifying this further, and I’m not really sure how to interpret it geometrically. If anyone has an answer to this I’d be interested to hear about it in the comments!


4.21 Gaussian decision boundaries

Let \(p(x|y=j) = \mathcal{N}(x | \mu_j, \sigma_j^2) \) where j=1,2 and \(\mu_1 = 0, \sigma_1^2 = 1, \mu_2 = 1, \sigma_2^2 = 10^6\). Let the class priors be equal.

(a) Find the decision region: \(R_1 = \{ x : p(x|\mu_1, \sigma_1) \ge p(x_2 | \mu_2, \sigma_2) \} \)

To start with let’s just visualize this by plotting the probability of x given each class:

We now solve for the two points of intersection here – I did this in Mathematica:

So we see that the region R1 is for \( -3.72 < x < 3.72\). In the case where \(\sigma_2=1\) as well, we find only one point of intersection (which makes sense) – and for R1 is any \( x < 0.5\)


4.22 QDA with three classes

Consider a three category classification problem. Let the prior probabilities be uniform, and have the class-conditional densities be MVNs with parameters:

\( \mu_1 = [0,0]^T, \mu_2 = [1,1]^T, \mu_3 = [-1,1]^T \)

\( \Sigma_1 = \begin{bmatrix} 0.7 & 0 \\ 0 & 0.7 \end{bmatrix} , \Sigma_2 = \begin{bmatrix} 0.8 & 0.2 \\ 0.2 & 0.8 \end{bmatrix}, \Sigma_3 = \begin{bmatrix} 0.8 & 0.2 \\ 0.2 & 0.8 \end{bmatrix} \)

Classify the following points:

(a) \( x = [-0.5, 0.5]\), (b) \( x = [0.5, 0.5] \)


4.23 Scalar QDA

Consider the following training set of heights and the corresponding labels of gender: \( x = (67, 79, 71, 68, 67, 60), y=(m,m,m,f,f,f) \).

(a) Fit a Bayes classifier to this data using MLE.

Clearly the class priors are uniform, i.e. \( \pi_m = \pi_f = 0.5\). Then the MLE estimates for the means are just the empirical values:

\( \mu_m = \frac{67 + 79 + 71}{3} = 72.33 \)

\(mu_f = \frac{68 + 67 + 60}{3} = 65 \)

And likewise for the estimates of the variances:

\( \sigma_m^2 = \frac{1}{3} ((67 – \mu_m)^2 + (79-\mu_m)^2 + (71-\mu_m)^2) = 24.89 \)

\( \sigma_f^2 = \frac{1}{3}((68-\mu_f)^2 + (67-\mu_f)^2 + (60-\mu_f)^2) = 12.67 \)

(b) Compute \(p(y=m | x, \hat{\theta}) \) where \(x=72\).

\( p(y=m | x=72) = \frac{\frac{1}{2} \mathcal{N}(72 | \mu_m, \sigma_m^2)}{ \frac{1}{2} \mathcal{N}(72 | \mu_m, \sigma_m^2) + \frac{1}{2} \mathcal{N}(72 | \mu_f, \sigma_f^2) } \)

Putting in the numbers I find this gives \( \simeq 0.83 \), which seems about right just looking at the data by eye.

Continuous Blackjack

Recently I had an interview where as an extra “bonus” question at the end I was asked an interesting maths problem. With a couple of hints from the interviewer, I was able to sketch out a rough solution, however afterwards I wanted to look up a proper solution just to verify it. Interestingly, I wasn’t able to find one (I’m sure it’s out there, I just need to look harder). Anyway, I thought it was a nice little problem and so I thought it was worth posting what I believe to be the correct solution.

The problem is this – consider that you are playing a game with one other person, and you are going to be going first. The rules are you pick a number randomly from the Uniform(0,1) distribution (i.e. a random number between 0 and 1). You then decide either to stick with this total, or you play on and choose another such random number. You can do this as many times as you like, however if the sum total of the numbers you pick goes over 1, you go bust and automatically lose. If you decide to stick with a number less than 1, the other player has a go and plays by the same rules. The person who sticks at the higher number (or doesn’t go bust when their opponent does) is the winner. Clearly it is an advantage to go second, and the optimal strategy for player 2 is extremely simple – keep playing until you get a higher total than your opponent or until you go bust. The question is, given that you are player 1 what is the best strategy you can adopt?

The first thing to realise is that our optimal strategy will be divided at some number t, which I shall call the “decision boundary”, and where if we have a sum less than t we will draw a new number, and if we have a sum greater than t we will stick. We can then think about what the probability of winning is, given that we stick at a particular value t. This is equal to 1 minus the probability that we lose – and the probability that we lose is the probability that the second player is able to land their sum within the interval \([t,1]\), given that they play on until they either reach this interval or go bust. To go about calculating this, let us define \(P_t[x]\) to be the probability that we are able to land in the interval \([t,1]\), given that we are currently at x and definitely going to play on if we have not yet reached t. We can write down an equation for this as follows:

\(P_t[x] = (1-t) + \int_x^t P_t[y]dy\)

where the first term is the probability that we reach the interval in the next turn, and the second term is the integral of the probability of reaching y < t (=1 as we are drawing uniform(0,1) random variables) multiplied by the probability that we reach the interval \([t,1]\) starting from y. We can convert this from an integral equation into an ODE:

\(\frac{dP_t[x]}{dx} = -P_t[x] \ \ \ \implies P_t[x] = A e^{-x} \)

We can obtain the constant A by noting that \(P_t[t] = 1-t \), and hence that \(A = (1-t)e^t\). This means that:

\(P_t[x] = (1-t)e^{t-x}\). Now, player 2 starts from \(x=0\), and so the probability of losing given that we stuck at a value t is simply \(P_t[0] = (1-t)e^t\). The probability that we win is \(1-P_t[0] = 1-(1-t)e^t\).

The final step we need to solve this is to say that the threshold at which our strategy changes should be the following point: where the probability of winning given that we stick at t is exactly the same as the probability that we win given that we choose one more number. We can write this condition as:

\(1-(1-t)e^t = \int_t^1 \left[ 1-(1-t’)e^{t’}\right]dt’ = (1-t) – e^t(t-2) – e \)

This gives a non-linear equation for the optimal decision boundary t, which cannot be re-arranged nicely but numerically we can solve to find that \(t \approx 0.57\). That is, if our sum is less than approximately 0.57 we should pick another number, and if it’s more we should stick!

Machine Learning – A Probabilistic Perspective Exercises – Chapter 3

This is a continuation of the exercises in “Machine learning – a probabilistic perspective” by Kevin Murphy. Chapter 3 is on “Generative Models for Discrete Data”.

3.1 MLE for the Bernoulli/ binomial model

We start off with a nice simple one. If we have data D consisting of N1 heads out of a total number N trials, then the likelihood is \(P(D|\theta) = \theta^{N_1} (1-\theta)^{N-N_1} \). However, it’s a lot easier to work with the log likelihood here:

\( \log(P(D|\theta)) = N_1 \log(\theta) + (N-N_1) \log(1-\theta) \)

Now we take the derivative:

\(\frac{d}{d \theta} = \frac{N_1}{\theta} – \frac{N-N_1}{1-\theta} = 0 \ \ \implies \frac{N_1}{\theta} = \frac{N-N_1}{1-\theta} \)

and rearrange to find that the MLE is \( \hat{\theta} = \frac{N_1}{N} \)


3.2 Marginal likelihood for the Beta-Bernoulli model.

This question is looking at deriving the marginal likelihood, \(P(D) = \int P(D|\theta) P(\theta) d\theta \). We are told to use the chain rule of probability: \(P(x_{1:N}) = p(x_1) p(x_2 | x_1) p(x_3: x_{1:2})\dots \)

and reminded that in the chapter we derived the posterior predictive distribution:

\( P(X=k | D_{1:N}) = \frac{N_k + \alpha_k}{\sum_i N_i + \alpha_i} \)

We are the given an example – suppose D = H,T,T,H,H (or D=1,0,0,1,1). It follows that:

\(P(D) = \frac{\alpha_1}{\alpha} \frac{\alpha_0}{\alpha + 1} \frac{\alpha_0+1}{\alpha+2} \frac{\alpha_1 + 1}{\alpha + 3} \frac{\alpha_1+2}{\alpha+4} \)

where we have just applied the chain rule, using the posterior predictive distribution after each data point has been collected. It’s clear that if we do this more generally (for any collection of data), that we will be left with:

\(P(D) = \frac{\left[ \alpha_0 \dots (\alpha_0 + N_0 – 1) \right] \left[ \alpha_1 \dots (\alpha_1 + N_1 – 1) \right]}{\alpha \dots (\alpha + N-1)} \)

We then note that this can be re-written using factorials as follows:

\(P(D) = \frac{(\alpha_0+N_0-1)! (\alpha_1 + N_1 -1)! (\alpha-1)!}{(\alpha_0-1)! (\alpha_1-1)! (\alpha + N -1)!} \)

Now remember that \( \Gamma(N) = (N-1)! \), and \( \alpha = \alpha_0 + \alpha_1 \), so we get the result which is given in the question:

\(P(D) = \frac{ \Gamma(\alpha_0 + N_0) \Gamma(\alpha_1 + N_1) \Gamma(\alpha_1 + \alpha_0) }{\Gamma(\alpha_1 + \alpha_0 + N) \Gamma(\alpha_1) \Gamma(\alpha_0) } \)


3.3 Posterior predictive for Beta-Binomial model

In the text the posterior predictive distribution for the Beta-Binomial model was derived for the case of predicting the outcome of multiple future trials, given the data:

\(P(x|n,D) = \frac{B(x+\alpha_1, n-x+\alpha_0)}{B(\alpha_1, \alpha_0)} \binom{n}{x} \)

where \(\alpha_1\) and \(\alpha_0\) involve the prior parameters and the data. The question simply asks to show that when \(n=1\) that we have: \(P(x=1|D) = \frac{\alpha_1}{\alpha_0 + \alpha_1}\).

To do this we need to remember that by definition, \(B(a,b) = \frac{\Gamma(a) \Gamma(b)}{\Gamma(a+b)} \), hence:

\( \frac{B(1+\alpha_1, \alpha_0)}{B(\alpha_1, \alpha_0)} = \frac{\Gamma(1+\alpha_1) \Gamma(\alpha_0)}{\Gamma(1 + \alpha_0 + \alpha_1)} \frac{\Gamma(\alpha_0 + \alpha_1)}{\Gamma(\alpha_0) \Gamma(\alpha_1)} \)

But then we simply note the following: \(\Gamma(1+\alpha_1) = \alpha_1! = \alpha_1 (\alpha_1-1)! = \alpha_1 \Gamma(\alpha_1)\). Using this and simplifying clearly leaves us with the desired result.


3.4 Beta updating from censored likelihood

Suppose we toss a coin \(n=5\) times. Let X be the number of heads. Suppose that we observe there are fewer than 3 heads  – but we don’t know precisely how many. The prior we use is \(P(\theta) = \text{Beta}(\theta|1,1)\). Compute the posterior, \(P(\theta | X < 3) \).

Now \( \text{Beta}(\theta|1,1)\) is a constant, so it plays no real role here. As such, the posterior is:

\(P(\theta | X<3) \propto P(X < 3|\theta) P(\theta) \propto P(X<3 | \theta) \)

So we just need to consider the likelihood, \(P(X<3|\theta)\). This is straightforward to calculate as it is the sum of the probability of no heads, plus the probability of one head, plus the probability of two heads, i.e.:

\(P(X<3 | \theta) = (1-\theta)^5 + \binom{5}{4}(1-\theta)^4 \theta + \binom{5}{3} (1-\theta)^3 \theta^2 = Bin(0|\theta, 5) + Bin(1|\theta,5) + Bin(2|\theta,5) \)

It follows that:

\(P(X<3 | \theta) \propto \text{Beta}(1,6) + \text{Beta}(2,5) + \text{Beta}(3,4) \)

which is a mixture distribution.


3.5 Uninformative prior for log-odds ratio

Let \( \phi = \text{logit}(\theta) = \log(\frac{\theta}{1-\theta}) \). If \(p(\phi) \propto 1\), show \(p(\theta) \propto \text{Beta}(\theta| 0,0) \).

If we simply apply the change of variables formula from chapter 2:

\(p(\theta) = \bigg| \frac{d\phi}{d\theta} \bigg| p(\phi) \)

but \(p(\phi)\) is a constant, and \(\phi = \log(\theta) – \log(1-\theta)\), so:

\( \frac{d\phi}{d\theta} = \frac{1}{\theta} + \frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)} \)

Remembering the definition for the Beta distribution: \(\text{Beta}(x|a,b) = \frac{1}{B(a,b)} x^{a-1}(1-x)^{b-1}\), and so clearly \(p(\theta) \propto \text{Beta}(\theta|0,0) \).


3.6 MLE for Poisson distribution

Definition: \( \text{Poi}(x|\lambda) = \frac{\lambda^x}{x!} e^{-\lambda}\)

So the likelihood of a set of data \( \{x_i\} \) is given by:

\(L(\lambda ; \{x_i\}) =\prod_i \frac{\lambda^{x_i}}{x_i!} e^{-\lambda} \)

Unsurprisingly, it’s easier to work with the log-likelihood:

\( l(\lambda; \{x_i\}) = \sum_i \left[ -\lambda + x_i \log(\lambda) – \log(x_i !) \right]\)

If we ignore the term that doesn’t depend on \(\lambda\) then we are left with \(-\lambda N + \log(\lambda \left( \sum_i x_i \right) \). Differentiating this w.r.t \(\lambda\) and setting it equal to zero: we are left with

\( -N + \frac{1}{\lambda} \sum_i x_i = 0 \ \ \ \implies \hat{\lambda} = \frac{1}{N} \sum_i x_i \)


3.7 Bayesian analysis of the Poisson distribution

(a) Derive the posterior assuming a Gamma prior

The prior we are told to use is:

\(P(\lambda) = Ga(\lambda |a,b) \propto \lambda^{a-1} e^{-\lambda b}\).

Then the posterior is proportional to the likelihood times the prior, i.e. \(P(\lambda | D) \propto P(D | \lambda) P(\lambda) \).

We already looked at the likelihood for the Poisson distribution in the previous section, so:

\(P(\lambda | D) \propto \prod_{i=1}^N \frac{e^{-\lambda} \lambda^{x_i}}{x_1!} \lambda^{a-1} e^{-\lambda b} \propto e^{-\lambda(N+b)}\lambda^{a-1+\sum_i x_i} = Ga(\lambda | a + \sum_i x_i, N+b) \)

So we see that the posterior is also a Gamma distribution, making the Gamma distribution a conjugate prior to the Poisson distribution.

(b) Posterior mean as a->0, b->0

We use that the fact that we derived the mean of a Gamma distribution in the text, finding that it is equal to a/b. So clearly this is just:

\( \frac{1}{N} \sum_{i=1}^N x_i \).

This is equal to the MLE that we found in the previous section.


3.8 MLE for the uniform distribution

Consider a uniform distribution centered on 0 with width 2a. \(p(x) = \frac{1}{2a} I(x \in [-a,a]) \) is the density function.

(a) Given a data set x1, …, xn, what is the MLE estimate of a?

The key point here is that \(P(D|a) = 0\) for any a which is less than the data point with the largest magnitude, and equal to \(\frac{1}{(2a)^n}\) for any a larger than this. This is clearly minimised when a is made as small as possible, i.e. \(\hat{a} = max|x_i|\).

(b)What probability would the model assign to a new data point x_n+1 using the MLE estimate for a?

Clearly \( \frac{1}{2\hat{a}}\) if \(|x_{n+1}| \le \hat{a}\), and 0 otherwise.

(c) Do you see a problem with this approach?

Yes, there is an issue here as any value with an absolute value larger than \(max |x_i|\) is assigned zero probability. For relatively small data sets this will be a big issue, but even for larger data sets it seems far from ideal.


3.9 Bayesian analysis of the uniform distribution

This is very much a continuation of the previous question, although here we are told to consider a \(\text{Unif}(0,\theta)\) distribution. The MLE is now \(\text{max}(D)\). Overall I’m fairly sure the question is either extremely poorly worded or has some mistakes, so I’m just going to go through it in the way that makes sense to me.

We are told to use a Pareto prior – the Pareto distribution is defined as:

\(p(x | k,m) = k m^k x^{-(k+1)} I(x \ge m) \)

where I is an indicator function. So a \(\text{Pareto}(\theta | b, K)\) prior is:

\( P(\theta) = K b^K \theta^{-(K+1)} I(\theta \ge b) \)

We more or less established the likelihood in the previous section, which is given by:

\(P(D | \theta) = \frac{1}{\theta^N} I(\theta \ge max(D)) \)

This means that the joint distribution, \(P(D,\theta)\), is given by \(P(D,\theta) = P(D | \theta) P(\theta) = \frac{K b^K}{\theta^{N+K+1}} I(\theta \ge m)\) where \(m = \text{max}(b, D)\).

We can use this to write the marginal likelihood:

\(P(D) = \int P(D,\theta) d\theta = \int_m^{\infty} \frac{K b^K}{\theta^{N+K+1}} d\theta = \frac{K b^K}{(N+K) m^{N+K}} \)

Now the posterior is given by:

\(P(\theta | D) = \frac{P(\theta, D)}{P(D)} = \frac{(N+K) m^{N+K}}{\theta^{N+K+1}} I(\theta \ge m) = \text{Pareto}(\theta | N+K, m=\text{max}(D,b))\)


3.11 Bayesian analysis of the exponential distribution

\(p(x|\theta) = \theta e^{-\theta x}\) for \(x \ge 0, \theta > 0\) defines the exponential distribution.

(a) Derive the MLE

We can write the likelihood function as:

\(L(\mathbf{x};\theta) = \theta^N e^{-\theta \sum_{i=1}^N x_i} \)

Clearly working with the log-likelihood will be better here:

\(l(\mathbf{x}; \theta) = N \log(\theta) – \theta \sum_i x_i \)

Taking the derivative wrt theta and setting it equal to zero:

\(0 = \frac{N}{\theta} – \sum_i x_i \)

and so the MLE is: \( \hat{\theta} = \frac{1}{\frac{1}{N} \sum_i x_i} = \frac{1}{\bar{x}} \)

(b) Suppose we observe X1=5, X2=6, X3=4. What is the MLE?

The mean is 5, so \(\hat{\theta} = 1/5\).

(c) Assume a prior \(p(\theta) = \text{Expon}(\theta | \lambda) \). What value should \(\lambda\) take to give \( \mathbb{E} [\theta] = 1/3\)?

The exponential distribution is just a special case of the Gamma distribution where \(a=1\) and \(b=\lambda\). We derived that the mean of a Gamma distribution is just given by a/b, and so we want:

\( 1/3 = \frac{1}{\hat{\lambda}}\), hence \(\hat{\lambda}=3\).

(d) What is the posterior?

\(P(\theta | D) \propto \theta^N e^{-N \theta \bar{x}} \lambda e^{-\lambda \theta} \propto \theta^N e^{-\theta(N \bar{x} + \lambda)} = Ga(\theta | N+1, N\bar{x}+\lambda)\)

(e) Is exponential prior conjugate to exponential likelihood?

Kind of, in the sense that both the prior and the posterior are Gamma distributions. But the posterior is not also an Exponential distribution.

(f) What is the posterior mean?

Again, mean of Gamma is a/b so we have \(\frac{N+1}{N \bar{x} + \lambda}\). If \(\lambda = 0\) and as \(N \to \infty\) we recover the MLE.

(g) Why do they differ?

This is a bit of a stupid question – because we calculated them in a different manner. We should expect the posterior mean to be less prone to overfitting though, and just generally be a bit better.


3.12 MAP estimation for Bernoulli with non-conjugate priors

In the book we discussed Bayesian inference of a Bernoulli rate parameter when we used a \(\text{Beta}(\theta|\alpha, \beta)\) prior. In this case, the MAP estimate was given by:

\( \theta_{MAP} = \frac{N_1 + \alpha – 1}{N + \alpha + \beta – 2} \)

(a) Now consider the following prior:

\(p(\theta) = 0.5\) if \(\theta = 0.5\), \(p(\theta) = 0.5\) if \(\theta = 0.4\) and \(0\) otherwise. Clearly this is a bit of a weird prior to use, but I guess it’s just for an exercise to make a point so let’s go with it. The question is to derive the MAP estimate as a function of \(N_1\) and \(N\).

We can write the posterior as:

\(P(\theta | D) \propto \theta^{N_1} (1-\theta)^{N-N_1} I( \theta \in \{0.4, 0.5 \}) \)

So the MAP is simply: \(\text{max}(0.5^{N_1} 0.5^{N-N_1}, 0.4^{N_1} 0.6^{N-N_1}) = \text{max}(0.5^{N}, 0.4^{N_1} 0.6^{N-N_1})\).

With some algebraic manipulations we can show that the MAP is 0.5 if: \(N_1 > \frac{\log(6/5)}{\log(6/4)} N \approx 0.45 N\). Otherwise it is \(0.4\). I was kind of surprised that it wasn’t exactly \(0.45N\), but I guess it has something with it being constrained to be between 0 and 1. I’m actually not sure though.

(b) Suppose the true theta is 0.41. Which prior leads to the better estimate?

When N is small, we would expect this second approach to work better (as it is choosing only between 0.4 and 0.5), however as N becomes larger eventually the other prior, where \(\theta\) can take any value, will give a better estimate.


3.13 Posterior predictive for a batch of data using the dirichlet-multinomial model

Derive an expression for \(P(\tilde{D} | D, \alpha)\), i.e. use the posterior to predict the results for a whole batch of data. Now, the definition of the Dirichlet distribution is:

\(\text{Dir}(\mathbf{\theta} | \mathbf{\alpha}) = \frac{1}{B(\mathbf{\alpha})} \prod_{k=1}^K \theta_k^{\alpha_k-1} I(\mathbf{\theta} \in S_k) \) (the identity function is ensuring the components of \(\theta\) sum to one.)

where \(B(\mathbf{\alpha}) = \frac{\prod_k \Gamma(\alpha_k)}{\Gamma[\sum_k \alpha_k]}\).

Using a \(\text{Dir}(\mathbf{\theta} | \mathbf{\alpha})\) prior for \(\theta\) we showed in the book that the posterior was given by:

\( P(\mathbf{\theta} | D) = \text{Dir}(\mathbf{\theta} | \alpha_1 + N_1, \dots, \alpha_K + N_K) \)

This means that we can write:

\(P(\tilde{D} | D, \alpha) = \int P(\tilde{D} | \theta) P(\theta | D) d\theta = \int \prod_{k=1}^K \theta_k^{x_k} \text{Dir}(\theta | \alpha_1 + N_1, \dots, \alpha_K + N_K) d\theta\)

where \( \mathbf{x} = \{x_k\} \) is the numbers of each class in the data we are predicting. This is equal to:

\( \int \prod_k \theta_k^{x_k} \frac{1}{B(\mathbf{\alpha + N})} \prod_k \theta_k^{\alpha_k + N_k} I(\theta \in S_k) d\theta \)

\( = \int \frac{B(\mathbf{\alpha + N + X})}{B(\mathbf{\alpha+N})} \text{Dir}(\theta | \alpha_1 + N_1 + x_1, \dots, \alpha_K + N_K + x_K) d\theta = \frac{B(\mathbf{\alpha + N + X})}{B(\mathbf{\alpha+N})} \)

where we just converted to a Dirichlet distribution and introduced the correct normalisation parameter. Since it is a probability distribution, it then of course integrates to 1 (the final step).


3.14 Posterior predictive for the Dirichlet-Multinomial

(a) Suppose we compute the empirical distribution over letters of the Roman alphabet plus the space character (27 values) from 2000 samples. Suppose we see the letter “e” 260 times. What is \(P(x_{2001} = e | D)\), assuming a Dirichlet prior with alpha_k = 10 for all k?

We showed in the text that the posterior predictive is given by:

\(P(X=j | D) = \frac{\alpha_j + N_j}{\sum_k (\alpha_k + N_k)} = \frac{10 + 260}{270 + 2000} \simeq 0.119 \)

(b) Suppose actually we saw “e” 260 times, “a” 100 times and “p” 87 times. What is \(P(x_{2001}=p, x_{2002} = a | D)\) under the same assumptions?

We basically just derived what we need for this in the previous question. We are looking for the probability of the data vector \(\mathbf{X} = (1,0,\dots, 1, 0, \dots, 0)\), where the non-zero components are at indices 1 (“a”) and 16 (“p”). We showed:

\(P(X | D) = \frac{B(\mathbf{\alpha + N + X})}{B(\mathbf{\alpha+N})} \)

This is equal to:

\( \frac{\prod_k \Gamma(\alpha_k + N_k + x_k) \Gamma(\sum_k \alpha_k + N_k)}{\Gamma(\sum_k \alpha_k + N_k + x_k) \prod_k \Gamma(\alpha_k + N_k)} \)

Now in the product terms, everything cancels except the components corresponding to p and a, where we pick up factors of \(\frac{\Gamma(98)}{\Gamma(97)}\) and \(\frac{\Gamma(111)}{\Gamma(110)} \) respectively. Overall, we are left with:

\(P(X|D) = \frac{ \Gamma(111) \Gamma(98) \Gamma(2270)}{\Gamma(110) \Gamma(97) \Gamma(2272)} = \frac{(110!)(97!)(2269!)}{(109!)(96!)(2271!)} = \frac{110*97}{2270*2271} \simeq 0.002 \)


3.17 Marginal likelihood for beta-binomial under uniform prior

Suppose we toss a coin N times and observe N1 heads. Let \(N_1 \sim Bin(N,\theta)\) and \(\theta \sim Beta(1,1)\). Show that the marginal likelihood is given by: \(P(N_1 | N) = \frac{1}{N+1}\).

We can write:

\(P(N_1 | N) = \int P(N_1 | N, \theta) P(\theta) d\theta \)

But a \(Beta(1,1)\) distribution is just uniform, so we don’t need to take this into account. This means we have:

\(P(N_1 | N) = \int_0^1 \text{Bin}(N_1 | N, \theta) d\theta = \int_0^1 \frac{N!}{N_1! (N-N_1)!} \theta^{N_1} (1-\theta)^{N-N_1} d\theta \)

It helps to re-write the factorials in terms of Gamma functions:

\( P(N_1 | N) = \int_0^1 \frac{\Gamma(N+1)}{\Gamma(N_1+1) \Gamma(N-N_1 + 1)} \theta^{N_1} (1-\theta)^{N-N_1} d\theta \)

Now, by definition:

\(\text{Beta}(\theta | N_1 + 1, N-N_1+1) = \frac{\Gamma(N+2)}{\Gamma(N_1+1) \Gamma(N-N_1+1)} \theta^{N_1} (1-\theta)^{N-N_1}\)

This means that we can say:

\(P(N_1 | N) = \int_0^1 \frac{\Gamma(N+1)}{\Gamma(N+2)} \text{Beta}(\theta | N_1+1, N-N_1+1) d\theta = \frac{N!}{(N+1)!} = \frac{1}{N+1}\)

since a probability distribution must integrate to 1. This result kind of surprised me to be honest – I guess it kind of makes sense although intuitively I would have expected a uniform prior over \(\theta\) to lead to it being most likely to have \(N_1 = N/2\), rather than being completely uniform!


3.18 Bayes factor for coin tossing

Suppose we toss a coin \(N=10\) times and observe \(N_1 = 9\) heads. Let the null hypothesis be that the coin is fair, and the alternative be that the coin can have any bias – \(p(\theta) = \text{Unif}(0,1)\). Derive the Baye’s factor in favour of the biased coin hypothesis. What if \(N_1=90\) and \(N=100\)?

I think this just means we need to look at the ratios of the likelihoods under each assumption. Under the fair assumption:

\(P(N_1 | \theta = 1/2) = \frac{N!}{N_1!(N-N_1)!} \frac{1}{2}^N\)

Under the biased assumption, we just calculated this in the previous exercise:

\(P(N_1 | \theta \sim \text{Unif}(0,1)) = \frac{1}{N+1} \)

So \(BF = \frac{N_1! (N-N_1)! 2^N}{(N+1)!} \). For \(N_1 = 9\) and \(N=10\) I find that this gives \(BF \simeq 9.31 \). For \(N_1 = 90\), \(N=100\) I find: \(BF \simeq 7.25 \times 10^{14} \) – clearly the coin is amazingly unlikely to be fair in the second case.


3.19 Irrelevant features with Naive Bayes

Let \(x_{iw}=1\) if word w occurs in document i, and be 0 otherwise. Let \(\theta_{cw}\) be the estimated probability that word w occurs in documents of class c. The log-likelihood that document x belongs to class c is:

\( \log(p(\mathbf{x_i}|c,\theta)) = \log \prod_w \theta_{cw}^{x_{iw}}(1-\theta_{cw})^{1-x_{iw}} \)

\( = \sum_w x_{iw} \log(\frac{\theta_{cw}}{1-\theta_{cw}}) + \sum_w \log(1-\theta_{cw}) \)

This can be written more succinctly as \( \log(p(\mathbf{x_i}) = \mathbf{\phi(x_i)}^T \mathbf{\beta_c} \), where \( \mathbf{\phi(x_i)} = (\mathbf{x_i},1) \) and:

\(\mathbf{\beta_c} = (\log \frac{\theta_{c,1}}{1-\theta_{c,1}}, \dots, \log \frac{\theta_{c,W}}{1-\theta_{c,W}} , \sum_w \log(1-\theta_{cw}))^T \)

i.e. a linear classifier where the class-conditional densities are linear functions of the params \(\mathbf{\beta_c}\).

(a) Assuming P(C=1) = P(C=2) = 0.5, write an expression for the log posterior odds ratio in terms of the features and the parameters.

We just use Bayes’s theorem: \( P(C=1 | \mathbf{x_i}) = P(\mathbf{x_i} | C=1) P(C) / P(\mathbf{x_i}) \), and likewise for C=2. However, as \(P(C=1) = P(C=2)\) we get a cancellation such that:

\( \log \frac{P(C=1 | \mathbf{x_i})}{P(C=2 | \mathbf{x_i})} = \mathbf{\phi(x_i)}^T(\mathbf{\beta_1 – \beta_2}) \)

(b) Consider a particular word w. State the conditions on \(\theta_{1,w}\) and \(\theta_{2,w}\) under which the presence or absence of the word will have no effect on the class posterior.

For this, we want to poster odds ratio to be 1, and hence for the logarithm of this to be zero. This means that \( \beta_{1,w} = \beta_{2,w} \).

(c) The posterior mean estimate of theta, using a Beta(1,1) prior, is given by:

\( \hat{\theta_{cw}} = \frac{1+ \sum_{i \in c} x_{iw}}{2 + n_c} \)

where the sum is over the nc documents in class c. Consider a word w, and suppose it occurs in every document, regardless of class. Let there be n1 documents of class 1, and n2 of class 2, with n1 not equal to n2. Will this word be ignored by our classifier?

Clearly not, as we are told that \(\hat{\theta_{1,w}} = \frac{1+n_1}{2+n_c}\) and \(\hat{\theta_{2,w}} = \frac{1+n_2}{2+n_c}\), which are not equal as \(n_1 \neq n_2\), and hence the necessary condition derived in (b) does not hold.

(d) What other ways can you think of to encourage “irrelevant” words to be ignored?

I guess things like preprocessing and feature selection would be good for this.


3.21 Mutual information for Naive Baye’s with binary features

The result was stated in the chapter, here we are asked to derive it. We are looking for the mutual information between feature j and the class label Y, i.e. \(I(X_j, Y)\). By definition, this is equal to:

\(I(X_j;Y) = \sum_{x_j \in \{0,1\}} \sum_{y \in C} P(x_j, y) \log \left( \frac{P(x_j, y)}{p(x_j) p(y)} \right) \)

To get the joint values, we can say \(P(x_j = 1, y=c) = P(x_j=1 | y=c) P(y=c) = \theta_{jc} \pi_c \)

and then:

\( P(x_j=0, y=c) = P(x_j=0 | y=c) P(y=c) = (1-\theta_{jc}) \pi_c \).

By definition, \(P(y=c) = \pi_c\), so we can say:

\(P(x_j=1) = \sum_{c’} P(x_j=1, y=c’) = \sum_{c’} \theta_{jc’} \pi_{c’} \equiv \theta_j\)

\(P(x_j=0) = \sum_{c’} P(x_j=0, y=c’) = \sum_{c’} (1-\theta_{jc’}) \pi_{c’} = 1-\theta_j \)

Putting this together we get the desired result:

\( I(X_j, Y) = \sum_c \left[ \theta_{jc} \pi_c \log \left( \frac{\theta_{jc}}{\theta_j} \right) + (1-\theta_{jc})\pi_c \log \left( \frac{1-\theta_{jc}}{1-\theta_j} \right) \right] \)

Machine Learning – A Probabilistic Perspective Exercises – Chapter 2

Recently I’ve been becoming more and more interested in machine learning, and so far my attention has been primarily focused on reinforcement learning. I had a lot of fun working on the Big 2 AI, but I feel like I really need to invest more time to studying the fundamentals of machine learning. I’ve got myself a copy of “Machine Learning – A Probabilistic Perspective”, which seems like a great text book, and so I’m going to work my way through it. I’ve decided to make a decent attempt at doing as many of the exercises as possible, and I feel like actually writing up an explanation for them is quite useful for me in making sure I actually understand what’s going on. Potentially it might also be useful for other people too, so I thought I would post my answers as I go! I may skip some exercises if I think they’re boring (or too “wordy”), or of course if I’m unable to do them (although I will probably mention them in this case)!

2.1 My neighbor has two children. Assuming that the gender of a child is like a coin flip, it is most likely, a priori, that my neighbour has one boy and one girl, with probability 1/2. The other possibilities—two boys or two girls—have probabilities 1/4 and 1/4.
(a) Suppose I ask him whether he has any boys, and he says yes. What is the probability that one child is a girl?

This is a standard probability puzzle. The key here is in the wording of the question we ask him – whether he has any boys? A priori, there are four possible options for child 1 and child 2: BB (boy boy), BG, GB and GG, each with equal probability. If he says yes to our question, this is compatible with three of the initial four options: BB, BG and GB. That is, we can only exclude the possibility GG. In two out of these three remaining possibilities he has one girl, and so the probability is 2/3.

(b) Suppose instead that I happen to see one of his children run by, and it is a boy. What is the probability that the other child is a girl

In this case the question is more specific with its wording – we have seen one specific child, but this tells us nothing about the other child, and so we get the more intuitive answer of 1/2.


2.3 Show Var[X+Y] = Var[X] + Var[Y] + 2 Cov[X,Y] for any two random variables X and Y

This one is relatively straightforward and just involves starting from a basic relationship for the variance:

\( Var[X+Y] = \mathbb{E}[(X+Y)^2] – \mathbb{E}[X+Y]^2 = \mathbb{E}[(X+Y)^2] – (\mathbb{E}[X] + \mathbb{E}[Y])^2 \)

(using the linearity of expectation). Expanding this out:

\( \mathbb{E}[X]^2 + \mathbb{E}[Y]^2 + 2 \mathbb{E}[XY] – \mathbb{E}[X]^2 – 2\mathbb{E}[X] \mathbb{E}[Y] – \mathbb{E}[Y]^2 \)

which clearly gives the required result (as \( Cov[X,Y] = \mathbb{E}[XY]-\mathbb{E}[X]\mathbb{E}[Y] \) ).


2.4 After your yearly checkup, the doctor has bad news and good news. The bad news is that you tested positive for a serious disease, and that the test is 99% accurate (i.e., the probability of testing positive given that you have the disease is 0.99, as is the probability of testing negative given that you don’t have the disease). The good news is that this is a rare disease, striking only one in 10,000 people. What are the chances that you actually have the disease?

This is another famous example of the use of Baye’s theorem. If we define two events as follows:

A – you have the disease

B – the test returns a positive result

If you go and take the test, and you get a positive result saying that you have the disease, then what you want to calculate is \(p(A|B)\). That is, the probability that you have the disease given that you tested positive. By Baye’s theorem, this is equivalent to:

\( p(A|B) = \frac{p(B|A) p(A)}{p(B)} \)

Now, since the test is 99% accurate, \( P(B|A) = 0.99 \), i.e. if you have the disease you will test positive 99% of the time. But p(A) = 1/10000, i.e. the probability of an “average” person in the population having the disease (with no other information about them). Finally, we can calculate p(B) as follows:

\(P(B) = 0.99 \frac{1}{10000} + 0.01 \frac{9999}{10000} \)

where the first term is the probability of being positive if you have the disease*prob of actually having the disease and the second term is prob of testing positive if you don’t have the disease (false positive) * probability of not having the disease.

Putting the numbers in you find that \(P(B|A) \approx 0.98 \% \), which makes it extremely unlikely that you have the disease despite testing positive! This is one example of why it would be a good idea if doctors learned some stats!


2.5 Monty Hall problem – On a game show, a contestant is told the rules as follows: There are three doors, labelled 1, 2, 3. A single prize has been hidden behind one of them. You get to select one door. Initially your chosen door will not be opened. Instead, the gameshow host will open one of the other two doors, and he will do so in such a way as not to reveal the prize. For example, if you first choose door 1, he will then open one of doors 2 and 3, and it is guaranteed that he will choose which one to open so that the prize will not be revealed. At this point, you will be given a fresh choice of door: you can either stick with your first choice, or you can switch to the other closed door. All the doors will then be opened and you will receive whatever is behind your final choice of door. Imagine that the contestant chooses door 1 first; then the gameshow host opens door 3, revealing nothing behind the door, as promised. Should the contestant (a) stick with door 1, or (b) switch to door 2, or (c) does it make no difference? You may assume that initially, the prize is equally likely to be behind any of the 3 doors.

This is another famous problem which has quite an interesting history, apparently leading to many genuine mathematicians complaining about the correct solution which was given in a newspaper column (although this was a little while ago, I still find this extremely surprising!) It also comes up online every now and then with people arguing/trying to explain why you should switch, so I’ll add my best attempt at explaining it here I guess!

For me, the crucial detail that helps in understanding it is that the host has information about where the prize is, which means that he is not choosing a door at random to open, since he will never open the door with the prize behind. If you take the starting point as the prize having equal probability of being behind each door, then you only need to consider two possibilities – 1) the initial door you chose does not have the prize behind it (which happens 2/3 of the time), and 2) the initial door does have the prize behind it. In case 1), you have selected one of the doors without the prize behind it – which means the host has no choice but to open the only other door that doesn’t have the prize behind it. This means in this case, if you switch, you win the prize – and remember this happens 2/3 of the time. Obviously if you did initially pick the winning door and you switch then you lose, but this only happens 1/3 of the time. So it’s better to switch!


2.6 Conditional Independence

(a) Let \(H \in \{1,\dots,K\}\) be a discrete RV, and let e1 and e2 be the observed values of two other RVs E1 and E2. Suppose we wish to calculate the vector:

\( \vec{P}(H|e_1, e_2) = (P(H=1|e_1, e_2), \dots, P(H=K|e_1, e_2)) \)

Which of the following are sufficient for the calculation?

(i) P(e1,e2), P(H), P(e1|H), P(e2|H)

(ii) P(e1,e2), P(H), P(e1,e2|H)

(iii) P(e1|H), P(e2|H), P(H)

We can use Baye’s theorem to write:

\( P(H| E_1, E_2) = \frac{P(E_1, E_2 | H) P(H)}{ P(E_1, E_2)} \)

so clearly (ii) is sufficient. The others are not in general sufficient.

(b) Now suppose we assume \(E_1 \bot E_2 | H\). Now which are sufficient?

Well, clearly (ii) still is. But conditional independence of E1 and E2 given H means that we can write:

\( P(E_1, E_2 | H) = P(E_1 | H) P(E_2 | H) \)

which means that (i) is also sufficient now.

However, we can also write:

\( P(e_1, e_2) = \sum_h p(e_1, e_2, h) = \sum_h p(e_1, e_2 | h) p(h) \)

which means that (iii) is also sufficient too!


2.7 Show that pairwise independence does not imply mutual independence

The best way to do this is to show an example where we have pairwise independence, but not mutual independence. Consider rolling a fair, four-sided dice. If we define 3 events, A = {1,2}, B={1,3} and C={1,4}, then clearly P(A) = P(B) = P(C) = 1/2. But also, P(A,B) = P(A,C) = P(B,C) = P({1}) = 1/4 = P(A)P(B) = P(A)P(C) = P(B)P(C). This means we have pairwise independence.

However, P(A,B,C) = P({1}) also, which is not equal to P(A)P(B)P(C) = 1/8, and so we do not have mutual independence.


2.9 Conditional Independence. Are the following properties true?

(a) \( (X \bot W|Z,Y) \ AND \ (X \bot Y|Z) \implies (X \bot Y , W|Z) \)

OK initially this looks pretty confusing, but once you break it down it’s not too bad.

The first condition tells us that:

\( P(X,W | Z,Y) = P(X | Z,Y) P(W| Z,Y) \)

and the second tells us:

\( P(X,Y | Z) = P(X | Z) P(Y | Z) \)

The condition we need to show is:

\( P(X,Y,W | Z) = P(X | Z) P(Y, W | Z) \)

Completely generally, we can use the chain rule of probability to say:

\( P(X,Y,W | Z) = P(W| X,Y,Z) P(Y| X,Z) P(X | Z) \)

However, since X and W are conditionally independent given Z,Y, we can say that \(P(W| X,Y,Z) = P(W | Y,Z) \) (this is because generally \(P(X,W | Y,Z) = P(W | X, Y, Z) P(X | Y, Z) \), so if conditional dependence is true, then so must be the previous identity. Similarly, P(Y|X,Z) = P(Y|Z) since X and Y are conditionally independent given Z. This means that:

\(P(X,Y,W | Z) = P(W|Y,Z)P(Y|Z) P(X|Z) \)

Then we can say that \(P(W|Y,Z)P(Y|Z) = P(W,Y|Z) \), which gives us exactly what we need.

I couldn’t be bothered to do part (b), but I expect it’s kind of similar.


2.10 Deriving the inverse Gamma density

Let \( X \sim Ga(a,b) \), such that:

\( p(x|a,b) = \frac{b^a}{\Gamma(a)} x^{a-1} e^{-xb} \).

If \( Y = 1/X \), show that \( Y \sim IG(a,b) \), where:

\(IG(x |a,b) = \frac{b^a}{\Gamma(a)} x^{-(a+1)}e^{-b/x} \)

Let’s start from the CDF of Y:

\( P(Y \le y) = P(\frac{1}{X} \le y) = P(X \ge \frac{1}{y}) = 1 – P(X \le \frac{1}{y}) \)

If we let \( u = 1/y \), then:

\( p(y) = \frac{d}{dy} P(Y \le y) = \frac{du}{dy} \frac{d}{du}(1-P(X \le u)) = \frac{1}{y^2} Ga(u | a, b) \)

Then we substitute y back in to find:

\( p(y) = \frac{b^a}{\Gamma(a)} y^{-(a+1)} e^{-b/y} \)

which is the result we require.


2.11 Basically asking to evaluate: \( \int_{-\infty}^{\infty} e^{-\frac{x^2}{2 \sigma^2}} dx \)

I remember seeing how to do this before – the trick is to consider the squared value and then convert to circular polar coordinates, so that you’re just left with an integral from \(r=0 \to \infty \) and a constant integral over \(\theta\). Since we change from \( dx dy \) to \(r dr d\theta \) this allows the integral in r to be evaluated. I can’t really be bothered to fill in the details but it should be easy to find elsewhere.


2.12 Expressing mutual information in terms of entropies: show that \( I(X;Y) = H(X) – H(X|Y) = H(Y) – H(Y|X) \)

If we start from the definition in the textbook of the MI being equal to the KL divergence between \(P(X,Y)\) and \(P(X)P(Y)\) then we have:

\( I(X;Y) = \sum_{x,y} P(x,y) \log(\frac{P(x,y)}{P(x)P(y)}) = \sum_{x,y} P(x,y) [ \log(P(x,y)) – \log(P(x)) – \log(P(y)) ] \)

If we write \(P(x,y) = P(x | y) p(y) \) then we have:

\(I(X;Y) = \sum_{x,y} P(x|y)P(y) [ \log(P(x|y) – \log(P(x))] \)

We can then use that the conditional entropy \(H(X|Y) = -\sum_y p(y) \sum_x p(x|y) \log(p(x|y)) \), which is the negative of the first term we have. The second term, \(-\sum_{x,y} P(x,y) \log(P(x)) = H(X)\), clearly, and so we have found that \(I(X;Y) = H(X) – H(X|Y)\). It is easy to show the second identity, replacing \(P(x,y) = P(y|x)p(x)\) instead.


2.13 Mutual information for correlated normals. Find the mutual information I(X1; X2) where X has a bivariate normal distribution:

\( \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \sim \mathcal{N} \left( \mathbf{0}, \begin{bmatrix} \sigma^2 & \rho \sigma^2 \\ \rho \sigma^2 & \sigma^2 \end{bmatrix} \right) \)

Evaluate when \(\rho = -1, 0, 1 \)

The question actually gives you the form for the differential entropy for a multivariate normal, but I found a derivation which I thought was pretty nice and so I’m going to include it here (this is for a completely general MV normal of any dimension).

The pdf of a MV normal is:

\( p(\mathbf{x}) = \frac{1}{\sqrt{det(2 \pi \Sigma)}} e^{-\frac{(\mathbf{x}-\mathbf{\mu})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu})}{2}} \)

Now the differential entropy is:

\( H(\mathbf{x}) = \int \int d\mathbf{x} \left[ \log(\sqrt{det(2 \pi \Sigma)}) + \frac{1}{2}(\mathbf{x}-\mathbf{\mu})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu}) \right] \mathcal{N}(\mathbf{x}; \mathbf{\mu}, \Sigma) \)

Now the nice trick here is that, by definition, the covariance matrix \( \Sigma = \mathbb{E} \left[ (\mathbf{x}-\mathbf{\mu}) (\mathbf{x}-\mathbf{\mu}) ^T \right] \), and the second term in the differential entropy is 1/2 times \( \mathbb{E} \left[ (\mathbf{x}-\mathbf{\mu})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu}) \right] \). Crucially, since this is a scalar we can say it’s equal to it’s trace, i.e.

\( \mathbb{E} \left[ (\mathbf{x}-\mathbf{\mu})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu}) \right] = \mathbb{E} \left[ Tr\left( (\mathbf{x}-\mathbf{\mu})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu}) \right) \right] \)

Then we can use the cyclic properties of traces, i.e. \( Tr(ABC) = Tr(BCA) = Tr(CAB) \), such that this is equal to:

\( \mathbb{E} \left[ Tr\left( (\mathbf{x} -\mathbf{u})(\mathbf{x}-\mathbf{u})^T \Sigma^{-1} \right) \right] = Tr \left[ \Sigma \Sigma^{-1} \right] = d \)

where d is the number of random variables (i.e. the dimension). Now \( d = \log(e^d) \) and \(det(2\pi \Sigma) = (2 \pi)^d det(\Sigma) \), so we are left with:

\(H(\mathbf{x}) = \frac{1}{2} \log((2 \pi)^d det(\Sigma)) \)

Now that we have this the rest of the question is relatively straightforward as we can rewrite the mutual information as \(I(X;Y) = H(X) + H(Y) – H(X,Y)\), and so in this case:

\(I(X_1; X_2) = \log(2 \pi e \sigma^2) – \frac{1}{2} \log((2 \pi e)^2 det(\Sigma)) \)

Now, \(det(\Sigma) = \sigma^4 – \rho^2 \sigma^4\), and so making a few algebraic manipulations we arrive at:

\(I(X_1; X_2) = \frac{1}{2} \log(\frac{1}{1-\rho^2}) \)

As \( \rho \to 0\), \(I \to 0 \), and as \( \rho \to \pm 1 \), \(I \to \infty \). Intuitively this makes sense – if \(\rho = 0\), there is no correlation and so the variables give us no information about each other. If they are perfectly correlated, we learn everything about \(X_1\) from \(X_2\), and an infinite amount of information can be stored in a real number.


2.14 Normalised mutual information. Let X and Y be discrete and identically distributed RVs (so H(X) = H(Y)). Let:

\(r=1 – \frac{H(Y|X)}{H(X)}\)

(a) Show \( r = \frac{I(X;Y)}{H(X)} \):

This is quite straightforward: \(r = \frac{H(X) – H(Y|X)}{H(X)} = \frac{I(X;Y)}{H(X)} \)

(b) Show \( 0 \le r \le 1 \)

Entropy is always positive, so clearly \( r \le 1 \) from it’s initial definition. Then we know the mutual information is \( \ge 0 \) too, and so it’s greater than 0. I guess we should prove that \(I(X;Y) \ge 0 \). We do this starting from the initial definition:

\(I(X;Y) = -\sum_{x,y} p(x,y) \log(\frac{p(x)p(y)}{p(x,y)}) \)

Now, since the negative logarithm is convex we can apply Jensen’s inequality \(\sum_i \lambda_i f(x_i) \ge f(\sum_i \lambda_i x_i) \), where \( \sum_i \lambda_i = 1\)):

\( I(X;Y) \ge -\log \left( \sum_{x,y} p(x,y) \frac{p(x) p(y)}{p(x,y)} \right) = 0 \)

(c) When is r=0?

When \(I(X;Y) = 0\), i.e. when the variables give us no information about each other.

(d) When is r=1?

When \(H(Y|X) = 0\), i.e. when the variables tell us everything about each other (i.e. once you know x, you get no information by learning y). I guess this is why r can be thought of as a normalised mutual information!


2.15 MLE minimises the KL divergence with empirical distribution

\(P_{emp}(x_i) = N_i / N\) (sticking with the discrete case for now, where \(N_i\) is the number of occurrences of \(x_i\) and N is the total amount of data.

\(KL(p_{emp} || q(x;\theta)) = \sum_i p_{emp}(x_i) \log(\frac{p_{emp}(x_i)}{q(x_i;\theta)}) = \sum p_{emp} \log(p_{emp}) – \sum p_{emp} \log(q(x_i;\theta)) \)

The first term here is fixed by the data and so it is clear that:

\( argmin_{\theta} KL (p_{emp} || q) = argmax_{\theta} \frac{1}{N} \sum_i N_i \log(q(x_i;\theta)) \)

which is the same as maximising the log-likelihood, and hence the likelihood.


2.16 Derive the mean, mode and variance of the Beta(a,b) distribution.

\(Beta(x; a,b) = \frac{\Gamma(a+b)}{\Gamma(a) \Gamma(b)} x^{a-1} (1-x)^{b-1} \equiv \frac{x^{a-1}(1-x)^{b-1}}{B(a,b)} \)

To get the mode you just differentiate wrt x and set equal to zero – I really couldn’t be bothered.

For the mean:

\( \text{mean} = \mathbb{E}[X] = \frac{1}{B(a,b)} \int_0^1 x^a (1-x)^{b-1} dx \)

Integrating by parts we find:

\( \mathbb{E}[X] = \frac{a}{B(a,b) b} \int_0^1 x^{a-1} (1-x)^b dx = \frac{a}{B(a,b) b} \left[ \int_0^1 x^{a-1} (1-x)^{b-1}dx – \int_0^1 x^a (1-x)^{b-1} dx \right]\)

However, one of these just integrates to 1 because it is a Beta distribution, and the second is exactly what we started with, i.e. \( \mathbb{E}[X]\), and hence:

\( \mathbb{E}[X] = \frac{a}{b}(1-\mathbb{E}[X])\)

rearranging we find: \( \mathbb{E}[X] = \frac{a}{a+b} \). You should be able to get the variance using the same kind of trick but it looked like a lot more algebra and I was too tired to attempt it!


2.17 Expected value of the minimum. Let X, Y be iid U(0,1) RVs. What is the expected value of min(X,Y).

If we let the CDF of X (and Y) be \(F(x) = x\) (for x between 0 and 1), then:

\(P(\text{min}(X,Y) \ge x) = 1-P(\text{min}(X,Y) \le x) = P( X \ge x, Y \ge x) = (1-F(x))^2 = (1-x)^2\)

Therefore \(P(\text{min}(X,Y) \le x) = 1 – (1-x)^2 \), and so the can write the probability density of the minimum as:

\(p_{min}(x) = 2(1-x)\)

as such, the expected value of the minimum, which is what we want, is:

\( \mathbb{E}[min] = 2 \int_0^1 x(1-x) dx = 1/3 \)


There are a couple of questions I skipped, maybe at some point I’ll get back to them but I don’t think they were especially interesting/informative. If there are any questions, or if you spot any errors, please leave a comment! I have done most of the CH3 exercises now, and so will be writing them up soon as well!

Big 2 AI Working!!

I’ve finally had a decent amount of time to invest in my Big 2 reinforcement learning AI, and it’s actually working really well (much, much better than I was ever expecting in fact!). At some point I will do a full detailed write up but for now I’ll just make a few notes about the process I used and summarize the results so far but the main result is that in the initial testing the AI actually beat me (who has played the game a lot) pretty damn convincingly and showed clear signs of being able to formulate good plans to get rid of its cards! I totally wasn’t expecting this to happen so I’m really pleased with this, particularly as I was recently reading this article recently about how deep reinforcement learning doesn’t really work very well yet (or rather that it’s very difficult to get it to work properly compared to supervised deep learning).

So in the end I decided to use the “Proximal Policy Optimization” algorithm which seems to be very popular atm (particularly at OpenAI) and from what I’ve read is one of the best in terms of sample efficiency and robustness to varying hyper-parameters. It’s also relatively simple to implement and OpenAI have released an excellent implementation in their “Baselines” project which was incredibly useful to use as a basis for my own code (which can be found here). I won’t go into the details of the theory behind the PPO algorithm (as I actually still need to read more about this myself to really understand why it works so well) but it’s a policy where you policy gradient method of sorts but where you use a surrogate loss function which is clipped so as to stop updates which change the policy too much. The surrogate function requires being able to estimate the advantage of a given action in a given state given the current policy you have. There are a number of ways you can do this but I followed the original paper which suggests using “Generalized Advantage Estimation” which tries to balance as best as possible reducing the bias and reducing the variance of the advantage estimates. This requires being able to estimate the value function of each state under the policy being employed making this an actor-critic algorithm, i.e. you are trying to learn a policy \( \pi \) and a value function \( v \) at the same time.

I also read quite a bit about how a lot of deep reinforcement learning algorithms (e.g. deep Q-learning) often don’t perform very well in the multi-agent setting where you have a number of agents who are competing against each other in some way because these environments are complex and non-stationary whereas typical policy gradient methods suffer from very high variance which increases rapidly with the number of agents. This worried me that a four-player game of imperfect information would be very tricky to get working (especially with a complicated action space) but I was somewhat encouraged by this paper which demonstrated that using PPO with huge training batch sizes seemed to work really well in complicated competitive environments.

What I actually did:

The first thing I needed to do was to parallelize my Big2 game class to allow for multiple games to be ran at once on different cores of my processor. The way PPO works is by running a number of environments \(n_e\) in parallel (for ATARI and MuJoCo they use \(n_e=8\) and run them forward for some number of steps \(n_s\) into the future with the current policy to generate a batch to train on. The reason for running many games at once is so as to have a batch of samples which aren’t all correlated (which they would be if they all game from the same game) and so it provides a big speed up if you can run these games on different cores as much as possible. Then when the batch has been generated you train the neural network using the PC’s GPU. There are a number of challenges and things I had to alter from the OpenAI Baselines implementation which are mainly to do with the fact that this is designed for training a single agent whereas I needed it to be set up for four players. The main difficulty here is that to use generalized advantage estimation you need to work backwards and use all of the states in the batch you simulate after the one you’re estimating the advantage for to get this estimate but now the “next state” is actually the state four time steps after the current time step as the three other players have to make a decision first. Another issue is the fact that you don’t know if a state is terminal (and what reward to assign) until the other three players have had their next turn because the game finishes at one point when a player plays their final cards. To account for this it’s necessary to run an extra four steps after each batch, save these, and then put them in as the first four states on the next batch. It also makes vectorizing the whole batch processing a bit trickier to deal with but I was eventually able to get this working.

In terms of the neural network architecture I went for the following:

Here the input layer has size 412 and contains information about the player’s current hand and the cards/hands that other players have already played and which I described in a previous blog post. This is fed into a fully connected layer of 512 neurons which use a “RelU” activation function. This layer is shared and fed into two further hidden fully connected layers of 256 neurons each (also with RelU activation) which are each in turn connected to an output layer (each of which is linear (i.e. has no activation function). The first of these outputs represents the probability weighting that the network gives to each possible action in this current state whilst the second outputs the value estimate of the current state. To get the actual probabilities we consider only the outputs of actions \( \{o_i \} \) that are actually allowed in the current state and sample actions with probabilities: \( p_i = \frac{ e^{ o_i } }{\sum_j e^{o_j} } \). This means there are nearly a million trainable parameters in this model! I chose the number of hidden neurons in a fairly arbitrary way really and have made no attempt to play around with this so far but Big 2 is a reasonably complicated game so I figured that having fairly large layers would be sensible. Also sharing the first layer between the probability output and the value output seemed sensible as there are likely to be a lot of features of the game state which are useful for calculating both things!

In terms of the parameters I used for the PPO algorithm I chose to run 48 games in parallel and run them all for 20 steps to generate a training batch. This leads to batch sizes of 960 samples (which is tiny in comparison to what is used in the OpenAI paper for multi-agent environments where they generated batches of around 400,000 samples apparently) but a lot more than was used for Atari where they had only 8 games running in parallel. These batches were then divided into four mini-batches of equal size to be trained on with 5 epochs of SGD per batch. For the generalized advantage estimation I chose \(\gamma = 0.995 \) and \(\lambda = 0.95\) and for the main PPO algorithm I chose a learning rate of \( \alpha = 0.00025\) and a clip range of 0.2 (but both of these are linearly annealed to zero as the training progresses) and set the value/entropy coefficients in the loss function to 0.5 and 0.01 respectively. I then trained the agent by making it play against itself (always using the current version of the neural network – I’d quite like to experiment with some kind of opponent sampling and see if that makes any difference) for 136500 updates (~130 million time steps). I made absolutely no attempt to tune these hyperparameters in anyway (although I know they are somewhat sensible from reading the results of the paper) so it’s really pretty cool that it ends up working so well!


The main result is that the network which is trained learns to play Big 2 really well! I’ve only played 15 games against it so far because the GUI I’ve made is kind of clunky and needs improving a bit but I got well and truly embarrassed – and I’ve played the game a lot (although I am by no means a proper expert). From the 15 games I played against three of the final trained AIs my scores were: \( \{-3, -3, -1, -1, -10, -11, -1, -1, -4, -8, +16, -2, -5, +14, -8 \} \)

So I only won two games (and in one of those I had what was essentially an unbeatable hand). There were also situations where I could tell that the AI was playing really well and had clearly “planned” in some sense how to get rid of its cards by learning both simple things like the value of passing to save 2s to gain control at a later stage in the game but also in terms of when to play certain number of cards when it gets control. I have to say this was really surprising and cool to see! Obviously I need to test this a bit more rigorously and so I am planning to make a web app where people can play against it and record the results so that I can get a better idea of how good it really is. One thing I have looked at is just its performance against earlier versions of itself as well just against random opponents and I get the following results (each point is averaged over 10,000 simulated games):

Note that the first data point is not at zero updates but after 1000 updates – hence a lot is getting learned very quickly! More importantly we see that there seems to be steady improvement throughout training which suggests that training for even longer could yield further improvement still!

To Do:

  • A proper write up fully explaining the game, the PPO algorithm etc.
  • A web app that allows for people to play and save their results in a leaderboard against the currently trained neural net.
  • Experiment with different batch sizes (e.g. try >100 games running in parallel) and other parameters to see if performance can be improved further. Also would like to experiment with some kind of opponent sampling, i.e. only having one player definitely being the most recent neural network with other players being sampled from versions of the NN earlier on in the training. The point of this would be to try and ensure good play vs. all opponents, not just the new one. I guess what I’m thinking about here is the way a poker pro plays vs. a new player vs. another pro is completely different so this could be important to try and get the network to try and understand this sort of thing.


Code is available on Github here. The webapp is finished, so you can play against the AI here (it will take a while to load most likely, as I am using a free Heroku account). A preprint of a paper describing things in more detail is here.

Big 2 Reinforcement Learner Progress

This is just a post mainly to document (to myself more than anything) the progress that I’ve made over the past couple of weeks with the “Big 2 reinforcement learning project” that I’m working on as well as to lay out my plan for what to do next. The main goal of the project is to train an AI to learn how to play the four player card game “Big 2”. This is a game of imperfect information with a reasonably complicated action space and so should be an interesting challenge particularly as I have only recently become interested in the field of reinforcement learning.

So far I’ve mainly just been working on implementing the game logic in python and writing the code that generates what will be the input to the neural network. I spent some time creating a GUI in python with tkinter which allows you to generate a big 2 game, see what all of the options available to each player are, and essentially play through a full game. It also has a separate window that shows you what the input to the neural network for each player will be given the current state of the game. If you want to play around with it yourself you can find it here on Github or if you don’t have python installed I made a standalone executabe you can run (although it’s quite a big file). I did this partly because I thought it would be interesting to learn how to make a GUI with python (and tkinter turns out to be really cool and easy to use!) but it’s also been extremely useful in debugging the game logic. Below is a screenshot of it in action:

In the middle are the three most recently played hands. The green circle shows that it is player 1’s turn and that they have control (it is red otherwise) and then in the top right are the card indices (from 0 to 12) that are playable hands in the current situation. When I have a neural network to play against it should be easy to add this in to the GUI so that I can play against it as well as display statistics on things like the value it associates with each option that it has available to it, so I think this has definitely been a good use of my time. The second window it shows is the input state that will be provided to the neural network when it has to make a decision:

This may not be absolutely final but it’s going to be what I try out initially. The first part of the input is the player’s actual hand where each card (up to a maximum of 13 – the size of the initial hand you’re dealt) is given a value and a suit. I also decided to include whether or not each card is part of a hand that uses more than one card (e.g. pair, straight, flush etc). In principle this could be learned by the network but I suspect/hope that it will be easier for the network to learn somewhat advanced strategies that involve saving cards for later on if you provide this information manually. Next we have the information about what the other players have/what they have played previously. One of the things which makes Big 2 interesting is that it is a game of imperfect information which of course means that we cannot provide the neural network with complete information about what is in each of the other player’s hands, but instead can only include what is “allowed” to be known. So far I have included the number of cards they have left, and whether at some previous point they have played any ace or any two (the highest cards), as well as whether they’ve played any pairs, three of a kinds, two pairs, straights, flushes or full houses. It’s at this point that I’m injecting the most “prior information” about the game of Big 2 because I know from experience that noting who’s played certain high cards/hands is important in figuring out what decision you should make. Ideally you’d just like to include a full history of every hand that each player has played so far and try and learn directly from that but that simply isn’t practical here and so until I can think of a better approach I’ll stick with this for the time being. We then have an input which describes the previous hand which needs to be beaten – the value/suit as well as the type and then finally a “cards played” input which tracks whether anyone (so not specifically who) has played any of the 16 most valuable cards. Again this is injecting some prior knowledge about the game because I am not tracking whether the least valuable cards have been played as this is relatively unimportant (as the game is really all about gaining/maintaining control at the right times) and the input vector is already big enough as it is – 412 elements at the moment.

So this means everything is set up quite nicely meaning that I now need to start thinking about exactly how I’m going to try and train a neural network to play. Hopefully this weekend I will have a test set up working where I can get two untrained neural networks to play against each other in an essentially randomly manner so that I have an idea of how long it will take to simulate a single game, and hence have some idea about the amount of data I can generate in a reasonable time frame. This will have some effect on what I decide to do because unfortunately I don’t work for Google and have virtually unlimited computational resources at my disposal and certain algorithms are much more computationally expensive than others!


Deep-Q Network

Probably the first thing I’m going to try is just a normal Deep Q Network like the one used in the Deepmind paper that learned to play Atari from raw pixel inputs. This involves training a deep neural network to try and learn the value function \( Q(s,a) \) for any given state s and choice of action a. The reason that I’ll try this first is because it is relatively simple (although there are a few tricks needed such as using experience replay and target networks to ensure that the training is stable), and there exist a number of good online tutorials for implementing this algorithm that I can use as a template (e.g. here and here). The obvious difficulty that I can see with using this approach is that the number of possible actions is quite high (see my previous post for a discussion of the Big 2 action space). The usual approach is to learn a function which outputs a Q-value for each possible action, i.e. you only provide the state s as input to the neural network rather than the state and the action. This has the big advantage that you then only need to carry out one forward pass of the network to obtain the Q-values for each action so that you can easily choose the one which is largest. The disadvantage is that it means that the network does nothing in terms of generalizing over actions which might be similar, i.e. you learn a different function for each action essentially independently. In Big 2 the number of possible actions is 1694 (in the way that I have decided to represent them) so it seems likely to me that this approach simply will not be possible and that the action will also have to be provided as an input to the network, such that the output is just a single number \(Q(s,a\). This will cause an issue because then to choose the most “valuable” action you will have to evaluate the neural network for however many actions are available to the player in the current state each time requiring a full forward pass of the neural network. The only reason this might be possible is that although the number of possible actions is 1694 the actual number of actions that are allowable in any given situation is usually significantly lower meaning that this approach could be feasible. However being completely realistic I am expecting this to be too slow to get really good performance so the target here will be to hopefully train a network that can reliably beat a group of players who make random moves.

Large Action Spaces Paper

If this turns out to be too computationally expensive (or even if it doesn’t) I would then like to have a look at using the technique described in this paper by some of the folks at Deepmind that I found recently and which can be applied to situations where you have large discrete action spaces. I haven’t read through it in complete detail but the basic idea seems to be to describe actions as vectors \( \mathbf{a} \in \mathbb{R}^n \) such that there is some (predefined) way of measuring how close actions are to each other. This will be quite natural given the way that I am choosing to represent the possible actions available in Big 2. The idea then seems to be to use an actor-critic approach whereby the actor chooses a “proto-action” \( \mathbf{\hat{a}} = f(s) \) in continuous space and from which you evaluate the k-nearest actually allowable actions. Essentially the aim of the paper is to introduce an architecture that can generalize over actions without having the heavy cost associated with evaluating the neural network for every action that occurs when you explicitly include the action as an input to the neural network. Obviously I need to think more about the details (and read the paper properly!!) but it seems like it should be possible to apply this to the Big 2 action space so this will be interesting to try!

Alpha Zero and Generalizing the Monte Carlo Tree Search

The recent papers on “Alpha Go Zero” and then more generally “Alpha Zero” (which was applied to Chess and Shogi) are really, really cool! (and I should also mention this paper which uses a very similar technique and was developed at the same time!) The reason is because they learn to play these games completely from self-play, i.e. without any domain specific knowledge provided to them they just play games against each other and work out what moves lead to the best results. Doing this they beat their previous Alpha Go program with significantly less training time required and also beat the computer “world chess champion” which is pretty damn awesome if you ask me!

A key component of this algorithm is the Monte Carlo Tree Search (MCTS) which is used both during in training the neural network and afterwards once the neural network has been trained. I think the MCTS is a really cool algorithm and had a go at implementing my own version in c++ a while ago, applying it to the extremely complicated game of Tic Tac Toe. If you’ve never heard of it before then I would recommend reading this introductory blog post on the subject. Unfortunately despite it being a very cool algorithm it can only really be applied directly to games of perfect information. The basic idea is that you gradually build up a game tree:


(Image taken from “Recent Advances in General Game Playing”)

in a way that sensibly balances exploration and exploitation, i.e. you don’t build up a full game tree but just focus on “promising” branches. The actual implementation used by Alpha Zero is slightly different from the standard approach of using rollouts (and is again something I need to read up on in more detail – luckily there is an interesting post with code that implements the Alpha Zero algorithm on the much simpler game of Othello here. I am going to read through this and try to replicate their results when I get a chance). The reason this can only be applied to games of perfect information is that to properly build a game tree you need to know what state you will end up in when you take any particular action. In Big 2 you do not know this because you do not know what your opponents hand is and the number of possible things they could possibly do in their next move with all the possible cards they could have is far too large. Nevertheless I feel like it should be possible to do something similar where in the simplest case you just sample a certain number \(k\) of possible hands that your opponents could have and use them to generate future states. This would be a very crude way of doing it – a more interesting way could be to train a neural network to try and learn how likely it is that a player has each card given what has happened during the game so far and then use these probabilities to sample the hands rather than doing so at random. Again I don’t have any details for how I could go about implementing this at the moment but it seems possible that something like this should be possible. Then if I have a neural network which has learned to play I would expect that using some generalization of MCTS on top of this would lead to much better play (as you explicitly explore a large number of possible futures for each option available to you and are just guided by the neural network – in fact even without a neural network you could expect that this might lead to decent play if you let it think for long enough. This would be interesting to check!). In an ideal world I would want to use the MCTS generalization to train the network too but this has got to be hugely expensive and I don’t have 5000 TPUs at my disposable so I suspect that this will not be realistic!

Anyway this has been a bit of a brain dump – it will be interesting to look back on this in a few months time and see whether anything I’ve thought about here ends up working out!

Big 2 Action Space

This is just a post with some thoughts on how I am planning to go about setting up the action space for the Big 2 reinforcement learner I am working on at the moment. For anyone unfamiliar with the game you can see the rules and try out a single player version of the game against a very basic AI (essentially just a series of “if” statements) here. The plan for this project is to train a neural network that learns to play the game through self-play reinforcement learning. There are a couple of things that make this an interesting and challenging game to apply reinforcement learning to (in my opinion anyway!). Firstly, unlike games such as Chess and Go it’s a game of imperfect information, i.e. each player does not have full knowledge of the current state of the game because they do not know what cards the other players have (although of course things can be deduced as the game progresses). Secondly, Big 2 is most naturally a four player game whereas the most famous applications of reinforcement learning have been primarily to two player games and finally (and in this case similarly to Chess) the state of possible actions that can be taken is actually fairly complicated and state dependent. This is because the actions that can be taken depend on what the previous player has played and whether or not you have “control” of the game, and include all 2,3,4 and 5 card “poker hands” i.e. pairs, three of a kinds, two pairs, four of a kinds, straights, flushes, full houses and straight flushes. This post is concerned with how I am planning to go about representing all of the actions that the neural network has available to it in a sensible way such that it is able to learn a sensible policy.

The reason this is tricky is because of being able to play hands with more than one card in them as it is possible for there to be many ways to do this. Let us consider the most extreme example where we have the following starting hand made up of 13 cards of the same suit:

Imagine that the previous player has played a straight. Clearly we can play a flush that beats this, but which flush should be chosen? In this case we can choose any of the 5 cards to make a flush which means there are \( {13 \choose 5} = 1287 \) unique flushes that can be played here. I will do a separate post about what information will be used as the input state to the neural network but I think at the moment that it will most likely include the following (along with other information such as cards other people have played so far):

i.e. for each of the 13 initial cards there will be 13 binary values (0 or 1) for the value of each card as well as for the suit. The alternative would be to use 52 binary inputs representing whether each card is present in the current hand but I think the first method will be necessary if we’re going to use a neural network that can choose any possible action as the can use the indices of the cards (1-13) to systematically describe all of the available actions. e.g. we can say that \( \{1,3,4,6,10 \} \) would be the 5 card hand represented by the 1st, 3rd, 4th, 6th and 10th card in the current hand (of course this may not be a real hand, in which case the move would not be valid). If we are to consider every possible action we must allow for \({13 \choose 5}=1287\) five card moves, \({13 \choose 4}=715\) four card moves, \({13 \choose 3}=286\) three card moves, \( {13 \choose 2} = 78 \) two card moves and 13 one card moves. In fact this is actually more than is strictly necessary as for 2,3 and 4 card hands there are certain index combinations that will never be possible (e.g. card 1 and card 13 can never be played together as a pair assuming the hand is sorted initially. But it might be better to not actually order the cards – this might need some more thought!) This leaves a total of 2379 possible actions (theoretically, most will of course not correspond to valid moves at any particular time). Initially I was worried that this would be far too many actions for a neural network to output and to be honest I am still worried, but reading the recent paper where the Alpha Go algorithm was applied to Chess I read the following:

So clearly their neural network considered an output space with an even larger number of possible moves, but they also have a hell of a lot more computing power at their disposal! Still this gives me hope that this approach is at least worth a go and may be able to work!

To get anywhere with this approach we need a way of indexing the actions for a particular combination of cards so that we do not have to consider whether every single one of the theoretically possible actions are available each time a hand is updated. So for 5 card hands we can define an action by a set of integers\( \{c_1, c_2, c_3, c_4, c_5 \} \) (with \( c_5 > c_4 > c_3 > c_2 > c_1 \) ) which we need to convert into a unique integer between 1 and 1287. It took me probably longer than it should have done to come up with a way to do this, but we can evaluate the sum:

$$\begin{eqnarray}a(c_1,c_2,c_3,c_4,c_5) = \sum_{i’=1}^{c_1-1} \sum_{j’=i’+1}^{10} \sum_{k’=j’+1}^{11} \sum_{l’=k’+1}^{12} \sum_{m’=l’+1}^{13} 1 + \sum_{i’=c_1+1}^{c_2-1} \sum_{j’=i’+1}^{11} \sum_{k’=j’+1}^{12} \sum_{l’=k’+1}^{13} 1 \nonumber \\ + \sum_{i’=c_2+1}^{c_3-1} \sum_{j’=i’+1}^{12} \sum_{k’=j’+1}^{13} 1 + \sum_{i’=c_3+1}^{c_4-1} \sum_{j’+1}^{13} 1 + (c_5-c_4) \end{eqnarray} $$

which can be evaluated with Mathematica:

Now that I think about it we need to be able to invert this as well, i.e. for a given action index retrieve the \( \{c_1, c_2, c_3, c_4, c_5 \} \) indices. This calculation can be done once and we can just store the result in memory so should be trivial.

The reason this is useful is that it means we don’t have to evaluate all 1287 combinations of \( \{c_i\} \) and check whether each is a valid move in the current state. Imagine we have the following hand and are trying to calculate the 5 card options available to us:

we can play a straight with any combination of 4,5,6,7,8 (one of each). In general for calculating the straights available we can calculate the set of consecutive numbers (including repeats), so in the example above this would be \( \{4C,5H,5S,6D,6H,7C,8D,8H,QS,2H \} \) and then calculate of the actions which are actually valid in this state and so that need to be considered by the neural network. We can do similar things for flushes and full houses as well as for the 4,3,2 card hands.

This is the approach I would really like to take as this is basically not giving the neural network any human knowledge about the game and letting it learn everything from scratch which to me would be a lot more elegant and interesting. The other approach I have been considering is to consider a much smaller action space that uses a bit more game knowledge with options that represent what I consider to be reasonable potential options from any given state. E.g:

This approach requires a lot more calculation and analyzing each hand the neural network sees to find out e.g. what the best card is that is not apart of another hand. This quickly becomes inelegant and pretty annoying to code up and also means that there are possible moves that the neural network will never consider in certain situations. Given that I’m not an expert player at all I would rather only fall back on this approach if it seems like the action space is too large with the first approach I have suggested.


Having thought about it a bit more I have a much better solution I think which slightly reduces the size of the action space which needs to be considered by ensuring that we use a sorted hand and then just use a lookup table – for example considering 5 card moves we define a look up matrix of dimensions \( 13 \times 13 \times 13 \times 13 \times 13 \) which we can index with the values of \( \{c_i \} \). Although most entries will correspond to moves which can never be made \(13^5 = 371293\) is small enough that it doesn’t take up too much memory. The following code is used to create a lookup table and an inverse lookup table:

nActions5 = 1287 #number of potentially possible 5 card moves
fiveCardIndices = np.zeros((13,13,13,13,13),dtype=int)
inverseFiveCardIndices = np.zeros((nActions5,5),dtype=int)
c = 0
for c1 in range(9):
    for c2 in range(c1+1,10):
        for c3 in range(c2+1,11):
            for c4 in range(c3+1,12):
                for c5 in range(c4+1,13):
                    fiveCardIndices[c1][c2][c3][c4][c5] = c
                    inverseFiveCardIndices[c][0] = c1
                    inverseFiveCardIndices[c][1] = c2
                    inverseFiveCardIndices[c][2] = c3
                    inverseFiveCardIndices[c][3] = c4
                    inverseFiveCardIndices[c][4] = c5
                    c += 1

For four card hands (i.e. two pairs or four of a kinds) we really don’t need to consider all \( {13 \choose 4} = 715 \) moves as many are not allowable in any situation. Instead we can just create a lookup table as follows:

nActions4 = 330
fourCardIndices = np.zeros((13,13,13,13), dtype=int)
inverseFourCardIndices = np.zeros((nActions4,4), dtype=int)
for c1 in range(10):
    nextInd = np.min([c1+3,10])
    for c2 in range(c1+1,nextInd+1):
        for c3 in range(c2+1,12):
            nextInd = np.min([c3+3,12])
            for c4 in range(c3+1,nextInd+1):
                fourCardIndices[c1][c2][c3][c4] = c
                inverseFourCardIndices[c][0] = c1
                inverseFourCardIndices[c][1] = c2
                inverseFourCardIndices[c][2] = c3
                inverseFourCardIndices[c][3] = c4
                c += 1

And we can do a similar thing for three and two card hands as well. Doing this we have a total number of actions of \( 13 + 33 + 31 + 330 + 1287 = 1694 \) rather than 2379 moves from the previous method. This possibly doesn’t help that much because we still have an enormous number of possible 5 card moves but every little helps!