There are far fewer exercises in this chapter than in the previous ones and so this will be a much shorter post! Chapter 6 is on frequentist statistics, and it is clear that the author is not a fan!

**6.1 Pessimism of LOOCV**

Suppose we have a completely random labelled dataset (i.e. the features tell us nothing about the class labels) with \(N_1\) examples of class 1 and \(N_2\) examples of class 2, with \(N_1 = N_2\). What is the best misclassification rate that any method can achieve? What does LOOCV predict as the misclassification rate?

Let’s say \(N_1 = N_2 = N\) for notational simplicity. Clearly, no method can do better than a 50% misclassification rate on average. Now consider what happens when we apply leave-one-out cross-validation: each time we leave out one sample (of either class; to be concrete, say it’s class 1). The remaining data, which still tell the classifier nothing, contain \(N-1\) examples of class 1 and \(N\) examples of class 2. The best such a classifier can do is predict according to these label proportions, so on average it predicts class 1 only with probability \(\frac{N-1}{2N-1} < \frac{1}{2}\). This means that the estimated misclassification rate will be \( 100 \times \frac{N}{2N-1} \% \), which is worse than the true 50% rate (hence the “pessimism” of LOOCV).
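A quick Monte Carlo check of this (a sketch; modelling the "best" classifier as one that predicts according to the training-fold label proportions, as in the argument above):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10                                  # examples per class, N_1 = N_2 = N
labels = np.array([1] * N + [2] * N)    # the labels carry no information
n_repeats = 1_000                       # repeat full LOOCV to average out noise

errors = []
for _ in range(n_repeats):
    for i in range(2 * N):                       # leave out example i
        train = np.delete(labels, i)
        p1 = np.mean(train == 1)                 # class-1 proportion in the
        pred = 1 if rng.random() < p1 else 2     # training fold: (N-1)/(2N-1)
        errors.append(pred != labels[i])

rate = float(np.mean(errors))
print(rate)        # close to N/(2N-1) = 10/19 ≈ 0.526, i.e. worse than 0.5
```

With \(N = 10\) the LOOCV estimate comes out near \(10/19 \approx 52.6\%\) rather than the true 50%.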

**6.3 MLE estimate for variance is biased**

Show that \( \hat{\sigma}_{MLE}^2 = \frac{1}{N} \sum_{n=1}^N (x_n - \hat{\mu})^2 \) is a biased estimator of \(\sigma^2\), i.e. show that:

\( \mathbb{E}_{X_1, \dots, X_N \sim \mathcal{N}(\mu, \sigma^2)} \left[ \hat{\sigma}^2(X_1, \dots, X_N) \right] \neq \sigma^2 \)

To start with we can say:

\( N \mathbb{E} \left[ \hat{\sigma}_{MLE}^2 \right] = \sum_{n=1}^N \left( \mathbb{E}(x_n^2) - 2 \mathbb{E}(x_n \hat{\mu}) + \mathbb{E}(\hat{\mu}^2) \right) \)

The first term is just: \( \mathbb{E}(x_n^2) = \sigma^2 + \mu^2\), where \(\mu = \mathbb{E}(x_n)\). Next, we need to calculate \(\mathbb{E}(\hat{\mu} x_n) \). We can say:

\( \mathbb{E}(\hat{\mu} x_n) = \frac{1}{N} \mathbb{E} \left[ \sum_{n'=1}^N x_n x_{n'} \right] = \frac{1}{N} \left[ \mathbb{E}(x_n^2) + (N-1) \mu^2 \right] = \frac{\sigma^2}{N} + \mu^2\)

where we used the fact that \(x_n\) and \(x_{n'}\) are independent for \(n' \neq n\), so \(\mathbb{E}(x_n x_{n'}) = \mathbb{E}(x_n)\mathbb{E}(x_{n'}) = \mu^2\).

We then note that \( \mathbb{E}(\hat{\mu}^2) = \frac{1}{N} \sum_{n=1}^N \mathbb{E}(\hat{\mu} x_n) = \mathbb{E}(\hat{\mu} x_n) = \frac{\sigma^2}{N} + \mu^2\)

Putting this together we find:

\( N \mathbb{E} \left[ \hat{\sigma}_{MLE}^2 \right] = \sum_{n=1}^N \left[ \sigma^2 + \mu^2 - \left( \frac{\sigma^2}{N} + \mu^2 \right) \right] = \sum_{n=1}^N \left( \frac{N-1}{N} \sigma^2\right) \)

This means that:

\( \mathbb{E} \left[ \hat{\sigma}_{MLE}^2 \right] = \left( \frac{N-1}{N} \right) \sigma^2 \)

and hence this estimate is biased!
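We can sanity-check the bias numerically (a minimal sketch; the particular values of \(\mu\), \(\sigma^2\) and \(N\) below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 2.0, 4.0        # true parameters (arbitrary illustrative values)
N = 5                        # small N makes the (N-1)/N factor obvious
n_datasets = 200_000         # average over many simulated datasets

x = rng.normal(mu, np.sqrt(sigma2), size=(n_datasets, N))
mu_hat = x.mean(axis=1, keepdims=True)            # MLE of the mean per dataset
sigma2_mle = np.mean((x - mu_hat) ** 2, axis=1)   # divide by N, not N - 1

mean_mle = float(sigma2_mle.mean())
print(mean_mle)              # close to (N-1)/N * sigma2 = 3.2, not 4.0
```

The average of the MLE over many datasets lands near \(\frac{N-1}{N}\sigma^2 = 3.2\) rather than the true \(\sigma^2 = 4\), exactly as the derivation predicts.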

**6.4 Estimation of the variance when the mean is known**

Suppose we sample \( x_1, \dots, x_N \sim \mathcal{N}(\mu, \sigma^2)\), where \(\mu\) is a known constant. Derive an expression for the MLE for \(\sigma^2\). Is it unbiased?

We can write the log-likelihood (up to an additive constant) as:

\( \log p(D | \sigma^2) = -\frac{N}{2} \log(\sigma^2) - \sum_i \frac{(x_i-\mu)^2}{2 \sigma^2} \)

Taking the derivative with respect to \(\sigma^2\):

\( - \frac{N}{2 \sigma^2} + \sum_i \frac{(x_i - \mu)^2}{2 \sigma^4} = 0 \)

So it’s clear that the MLE is \( \hat{\sigma}^2 = \frac{1}{N} \sum_i (x_i - \mu)^2 \), which is intuitively obvious. The interesting part is that this is now very clearly unbiased because:

\( \mathbb{E}(\hat{\sigma}^2) = \mathbb{E} \left[ \frac{1}{N} \sum_i (x_i - \mu)^2 \right] = \sigma^2 \)

just by definition. The difference between this result and that in the previous question is that here we are using the exact, known value of \(\mu\) rather than our MLE estimate.
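A short simulation confirms the unbiasedness (the parameter values are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 2.0, 4.0        # true parameters (arbitrary illustrative values)
N = 5
n_datasets = 200_000         # average over many simulated datasets

x = rng.normal(mu, np.sqrt(sigma2), size=(n_datasets, N))
# Use the exact, known mu instead of the per-dataset sample mean.
sigma2_known_mu = np.mean((x - mu) ** 2, axis=1)

mean_est = float(sigma2_known_mu.mean())
print(mean_est)              # close to sigma2 = 4.0: no bias
```

Swapping the sample mean for the true \(\mu\) removes the \(\frac{N-1}{N}\) shrinkage seen in the previous exercise.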