Machine Learning – A Probabilistic Perspective Exercises – Chapter 6

There are far fewer exercises in this chapter than in the previous ones and so this will be a much shorter post! Chapter 6 is on frequentist statistics, and it is clear that the author is not a fan!

6.1 Pessimism of LOOCV

Suppose we have a completely random labelled dataset (i.e. the features tell us nothing about the class labels) with \(N_1\) examples of class 1 and \(N_2\) examples of class 2, with \(N_1 = N_2\). What is the best misclassification rate that any method can achieve? What does LOOCV predict as the misclassification rate?

Let’s say \(N_1 = N_2 = N\) for notational simplicity. Since the features carry no information about the labels, no method can do better than a 50% misclassification rate on average. Now think about what happens when we apply leave-one-out cross-validation: each time we leave out one sample (of either class, it doesn’t matter, but to be concrete let’s imagine it’s class 1). The remaining training data then contain \(N-1\) examples of class 1 and \(N\) examples of class 2, and the best any method can do is predict according to these empirical class frequencies. Such a classifier will, on average, predict class 1 only with probability \(\frac{N-1}{2N-1} < \frac{1}{2}\), so the probability of misclassifying the held-out class-1 point is \(1 - \frac{N-1}{2N-1} = \frac{N}{2N-1}\). The LOOCV estimate of the misclassification rate is therefore \( 100 \times \frac{N}{2N-1} \% \), which is worse than the true 50% (hence the “pessimism” of LOOCV).
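To make the pessimism concrete, here is a minimal numerical sketch (assuming numpy is available; the class size, number of repeats and seed are arbitrary choices of mine, not from the book). It runs LOOCV on purely random labels with a classifier that predicts according to the empirical class frequencies of each training fold, and the estimated error comes out close to \( \frac{N}{2N-1} \) rather than the true best rate of 50%:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 10                      # examples per class (arbitrary)
labels = np.array([1] * N + [2] * N)
n_repeats = 5000            # Monte Carlo repeats over the classifier's randomness

errors = 0
trials = 0
for _ in range(n_repeats):
    for i in range(2 * N):                  # leave each point out in turn
        train = np.delete(labels, i)
        # the features are uninformative, so the best a method can do is
        # predict according to the empirical class frequencies of the fold
        p_class1 = np.mean(train == 1)
        prediction = 1 if rng.random() < p_class1 else 2
        errors += (prediction != labels[i])
        trials += 1

print("LOOCV misclassification estimate:", errors / trials)
# ≈ N / (2N - 1) = 10/19 ≈ 0.53, above the true best rate of 0.5
```

Note that the gap \( \frac{N}{2N-1} - \frac{1}{2} \) shrinks as \(N\) grows, so the pessimism is most severe for small datasets.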

 

6.3 MLE estimate for variance is biased

Show that \( \hat{\sigma}_{MLE}^2 = \frac{1}{N} \sum_{n=1}^N (x_n - \hat{\mu})^2 \) is a biased estimator of \(\sigma^2\), i.e. show:

\( \mathbb{E}_{X_1, \dots, X_N \sim \mathcal{N}(\mu, \sigma^2)} \left[ \hat{\sigma}_{MLE}^2(X_1, \dots, X_N) \right] \neq \sigma^2 \)

To start with we can say:

\( N \mathbb{E} \left[ \hat{\sigma}_{MLE}^2 \right] = \sum_{n=1}^N \mathbb{E}\left[ (x_n - \hat{\mu})^2 \right] = \sum_{n=1}^N \left( \mathbb{E}(x_n^2) - 2 \mathbb{E}(x_n \hat{\mu}) + \mathbb{E}(\hat{\mu}^2) \right) \)

The first term is just \( \mathbb{E}(x_n^2) = \sigma^2 + \mu^2\), where \(\mu = \mathbb{E}(x_n)\), since \( \mathrm{Var}(x_n) = \mathbb{E}(x_n^2) - \mu^2 \). Next, we need to calculate \(\mathbb{E}(\hat{\mu} x_n) \). We can say:

\( \mathbb{E}(\hat{\mu} x_n) = \frac{1}{N} \mathbb{E} \left[ \sum_{n'=1}^N x_n x_{n'} \right] = \frac{1}{N} \left[ \mathbb{E}(x_n^2) + (N-1) \mu^2 \right] = \frac{\sigma^2}{N} + \mu^2\)

where the second equality uses the independence of \(x_n\) and \(x_{n'}\) for \(n' \neq n\), which gives \( \mathbb{E}(x_n x_{n'}) = \mu^2 \).

We then note that \( \mathbb{E}(\hat{\mu}^2) = \frac{1}{N} \sum_{n=1}^N \mathbb{E}(\hat{\mu} x_n) = \mathbb{E}(\hat{\mu} x_n) = \frac{\sigma^2}{N} + \mu^2\)

Putting this together we find:

\( N \mathbb{E} \left[ \hat{\sigma}_{MLE}^2 \right] = \sum_{n=1}^N \left[ \sigma^2 + \mu^2 - 2\left(\frac{\sigma^2}{N} + \mu^2\right) + \left(\frac{\sigma^2}{N} + \mu^2\right) \right] = \sum_{n=1}^N \left( \frac{N-1}{N} \sigma^2\right) \)

This means that:

\( \mathbb{E} \left[ \hat{\sigma}_{MLE}^2 \right] = \left( \frac{N-1}{N} \right) \sigma^2 \)

and hence this estimate is biased!
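If you want to check this numerically, here is a quick simulation sketch (again just numpy; the values of \(\mu\), \(\sigma^2\), \(N\) and the number of trials are arbitrary choices). The average of \( \hat{\sigma}_{MLE}^2 \) over many datasets sits near \( \frac{N-1}{N} \sigma^2 \) rather than \( \sigma^2 \):

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma2 = 2.0, 4.0        # true mean and variance (arbitrary values)
N = 5                        # small N makes the (N-1)/N factor easy to see
n_trials = 200_000

x = rng.normal(mu, np.sqrt(sigma2), size=(n_trials, N))
mu_hat = x.mean(axis=1, keepdims=True)          # MLE of the mean, per dataset

# MLE of the variance: divide by N, using the estimated mean
sigma2_mle = ((x - mu_hat) ** 2).mean(axis=1)

print("average of sigma^2_MLE:", sigma2_mle.mean())   # ≈ (N-1)/N * sigma2 = 3.2
print("true sigma^2          :", sigma2)              # 4.0
```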

 

6.4 Estimation of the variance when the mean is known

Suppose we sample \( x_1, \dots, x_N \sim \mathcal{N}(\mu, \sigma^2)\), where \(\mu\) is a known constant. Derive an expression for the MLE for \(\sigma^2\). Is it unbiased?

We can write the log-likelihood, up to an additive constant that does not depend on \(\sigma^2\), as:

\( \log p(D | \sigma^2) = -\frac{N}{2} \log(\sigma^2) - \sum_i \frac{(x_i-\mu)^2}{2 \sigma^2} \)

Taking the derivative with respect to \(\sigma^2\) and setting it to zero:

\( - \frac{N}{2 \sigma^2} + \sum_i \frac{(x_i - \mu)^2}{2 \sigma^4} = 0 \)

Solving this gives the MLE \( \hat{\sigma}^2 = \frac{1}{N} \sum_i (x_i - \mu)^2 \), which is intuitively what you would expect. The interesting part is that this estimator is now very clearly unbiased, because:

\( \mathbb{E}(\hat{\sigma}^2) = \mathbb{E} \left[ \frac{1}{N} \sum_i (x_i - \mu)^2 \right] = \sigma^2 \)

directly from the definition of the variance. The difference between this result and that of the previous question is that here we plug in the exact, known value of \(\mu\) rather than its MLE \(\hat{\mu}\).
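The contrast is easy to see numerically as well. This sketch (same arbitrary constants as before, assuming numpy) compares the known-mean estimator from this exercise with the MLE from exercise 6.3:

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma2 = 2.0, 4.0        # true (known) mean and true variance, arbitrary values
N = 5
n_trials = 200_000

x = rng.normal(mu, np.sqrt(sigma2), size=(n_trials, N))

# estimator from this exercise: known mean, divide by N
sigma2_known_mu = ((x - mu) ** 2).mean(axis=1)

# MLE from exercise 6.3: estimated mean, divide by N
mu_hat = x.mean(axis=1, keepdims=True)
sigma2_mle = ((x - mu_hat) ** 2).mean(axis=1)

print("known-mean estimator   :", sigma2_known_mu.mean())  # ≈ 4.0, unbiased
print("MLE with estimated mean:", sigma2_mle.mean())        # ≈ 3.2, shrunk by (N-1)/N
```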

Author: Henry Charlesworth

I am currently a PhD student at the University of Warwick, based in the Centre for Complexity Science. My research has been looking at how the principle of "future state maximisation" can be applied to both explaining behaviour in animal systems as well as generating behaviour in artificial systems. Loosely speaking, this is the idea that taking everything else to be equal it is preferable to keep your options open as much as possible. I am also very interested in artificial intelligence and machine learning, particularly reinforcement learning.
