Machine Learning – A Probabilistic Perspective Exercises – Chapter 6

There are far fewer exercises in this chapter than in the previous ones and so this will be a much shorter post! Chapter 6 is on frequentist statistics, and it is clear that the author is not a fan!

6.1 Pessimism of LOOCV

Suppose we have a completely random labelled dataset (i.e. the features tell us nothing about the class labels) with \(N_1\) examples of class 1 and \(N_2\) examples of class 2, with \(N_1 = N_2\). What is the best misclassification rate that any method can achieve? What does LOOCV predict as the misclassification rate?

Let’s say \(N_1 = N_2 = N\) for notational simplicity. Clearly on average there is no method that can do better than a 50% misclassification rate. If we think about what happens when we apply leave-one-out-cross-validation, every time we will leave out one sample (either of class 1 or class 2, it doesn’t matter, but to be concrete let’s imagine it’s class 1). In the remaining data, which the best method can only predict completely randomly, we will then have \(N-1\) examples of the class 1 and \(N\) of class 2. As such this classifier will, on average, predict class 1 only with probability \(frac{N-1}{2N-1} < \frac{1}{2}\) . This means that the predicted misclassification rate will be \( 100 \times \frac{N}{2N-1} \% \), which is worse than it should be (hence the “pessimism” of LOOCV).

 

6.3 MLE estimate for variance is biased

Show that \( \hat{\sigma}_{MLE}^2 = \frac{1}{N} \sum_{n=1}^N (x_n – \hat{\mu})^2 \) is a biased estimator of \(\sigma^2\). i.e. show:

\( \mathbb{E}_{X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2)} \left[ \hat{\sigma}^2(X_1, \dots, X_n) \neq \sigma^2\right] \)

To start with we can say:

\( N \mathbb{E} \left[ \hat{\sigma}_{MLE}^2 \right] = \sum_{n=1}^N \left( \mathbb{E}(x_n^2) – 2 \mathbb{E}(x_n \hat{\mu}) + \mathbb{E}(\hat{\mu}) \right) \)

The first term is just: \( \mathbb{E}(x_n^2) = \sigma^2 + \mu^2\), where \(\mu = \mathbb{E}(x_n)\). Next, we need to calculate \(\mathbb{E}(\hat{\mu} x_n) \). We can say:

\( \mathbb{E}(\hat{\mu} x_n) = \frac{1}{N} \mathbb{E} \left[ \sum_{n’=1}^N x_n x_n’ \right] = \frac{1}{N} \left[ \mathbb{E}(x_n^2) + (N-1) \mu^2 \right] = \frac{\sigma^2}{N} + \mu^2\)

We then note that \( \mathbb{E}(\hat{\mu}^2) = \frac{1}{N} \sum_{n=1}^N \mathbb{E}(\hat{\mu} x_n) = \mathbb{E}(\hat{\mu} x_n) = \frac{\sigma^2}{N} + \mu^2\)

Putting this together we find:

\( N \mathbb{E} \left[ \hat{\sigma}_{MLE}^2 \right] = \sum_{n=1}^N \left[ \sigma^2 + \mu^2 – (\frac{\sigma^2}{N} + \mu^2) \right] = \sum_{n=1}^N \left( \frac{N-1}{N} \sigma^2\right) \)

This means that:

\( \mathbb{E} \left[ \hat{\sigma}_{MLE}^2 \right] = \left( \frac{N-1}{N} \right) \sigma^2 \)

and hence this estimate is biased!

 

6.4 Estimation of the variance when the mean is known

Suppose we sample \( x_1, \dots, x_N \sim \mathcal{N}(\mu, \sigma^2)\), where \(\mu\) is a known constant. Derive an expression for the MLE for \(\sigma^2\). Is it unbiased?

We can write the log-likelihood as:

\( \log p(D | \sigma^2) = -\frac{N}{2} \log(\sigma^2) – \sum_i \frac{(x_i-\mu)^2}{2 \sigma^2} \)

Taking the derivative with respect to \(\sigma^2\):

\( – \frac{n}{2 \sigma^2} + \sum_i \frac{(x_i – \mu)^2}{2 \sigma^4} = 0 \)

So it’s clear that the MLE is \( \hat{\sigma}^2 = \frac{1}{N} \sum_i (x_i – \mu)^2 \), which is intuitively obvious. The interesting part is that this is now very clearly unbiased because:

\( \mathbb{E}(\hat{\sigma}^2) = \mathbb{E} \left[ \frac{1}{N} \sum_i (x_i – \mu)^2 \right] = \sigma^2 \)

just by definition. The difference between this result and that in the previous question is that here we are using the exact, known value of \(\mu\) here rather than our MLE estimate.

Machine Learning – A Probabilistic Perspective Exercises – Chapter 5

I imagine my rate of doing these exercises is going to drop slightly as I’ve started my machine learning internship at BMLL Technologies and will be there up until Christmas. Still, my plan is to (hopefully) get up to about Chapter 8 finished by then – although this may be optimistic as my working days are pretty long at the moment! Still, I am learning a lot which means the exercises should be easier, right?! Anyway, chapter 5 is about Bayesian statistics and there are significantly less exercises than in the previous two chapters!

 

5.1 Proof that a mixture of conjugate priors is indeed conjugate

We consider the case where we have a prior of the form: \(p(\theta) = \sum_k p(Z=k) p(\theta | Z=k) \) where each \(p(\theta | Z=k)\) is conjugate to the likelihood, i.e. \(P(\theta | D, Z=k) \)  is the same type of probability distribution as \(p(\theta | Z=k)\). We want to derive the general result:

\(p(\theta | D) = \sum_k p(Z=k | D) p(\theta | D, Z=k) \)

as this shows that the posterior is a weighted sum of the posterior of each mixture component – so the whole thing is conjugate to the mixture prior we started with. I’m not sure if I’m missing something and this is too simplistic but it seems we can just say:

\( \sum_k p(Z=k | D) p(\theta | Z=k, D) = \sum_k p(\theta, Z=k | D) = p(\theta | D) \)

An alternative way is to use Baye’s theorem:

\(p(\theta | D) = p( D | \theta) p(\theta)/p(D) = \sum_k \frac{p(D|\theta) p(Z=k) p(\theta | Z=k)}{p(D)}\)

We then note that \(P(D | \theta) = P(D | \theta, Z=k)\) since if we already know \(\theta\) then knowing what Z is tells us nothing. We then use that \(p(D | \theta, Z=k) p(\theta | Z=k) = p(\theta | D, Z=k) p(D | Z=k) \) and then \(\frac{p(D| Z=k) p(Z=k)}{p(D)} = p(Z=k | D) \) which gives us the result we want!

5.2 Optimal threshold on classification probability

Consider a case where we have learned a conditional probability distribution \(p(y | x)\), and suppose there are only two classes. Let \(p_0 = P(Y=0 | x)\) and \(p_1 = P(Y=1 | x)\). Consider the following loss matrix:

Predicted label \(\hat{y}\) True label y
0 1
0 0 \(\lambda_{01}\)
1 \(\lambda_{10}\) 0

(a) Show that the decision \( \hat{y}\) which minimises the expected loss is equivalent to setting a probability threshold \(\theta\)

The expected loss if we choose 0 is simply:

\(\rho( \hat{y}=0 | x) = \lambda_{01} p(y=1 | x) = \lambda_{01} p_1\)

if we choose 1:

\(\rho(\hat{y} =1 | x) = \lambda_{10} p(y=0 | x) = \lambda_{10} p_0 \)

So we will pick 0 if:

\(\lambda_{01} p_1 < \lambda_{10} p_0 = \lambda_{10} (1-p_1) \)

or re-arranging, if:

\(p_1 < \frac{\lambda_{1o}}{\lambda_{01} + \lambda{10}} \equiv \theta \)

Otherwise we pick 1.

(b) Show a loss matrix where the threshold is 0.1

We just want \( 0.1 = \frac{\lambda_{10}}{\lambda_{01} + \lambda_{10}}\), which implies that \(9 \lambda_{10} = \lambda_{01}\). So any loss matrix which satisfies this.

 

5.3 Reject option in classifiers

In many classification problems it is useful to include a reject option – where we are too uncertain to make a classification. Let \(\alpha_i\) be the action you choose with \(i \in \{1, \dots, C+1 \}\), where \(C\) is the number of classes and action \(C+1\) is reject. Let \(Y=j\) be the true (but unknown) state of nature. Consider the following loss:

\( L(\alpha_i | Y=j) = 0 \ \text{if} \ i=j\)

\( \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ = \lambda_r \ \text{if} \ i=C+1 \)

\( \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ = \lambda_s \ \text{otherwise} \)

(a) Describe how to minimise the risk

Let’s first define \(P(Y=j | x) = p_j\). Now if we choose an action \( a \in \{1, \dots, C\}\), then we have the following expected loss:

\( \rho(a | x) = \sum_{i=1}^C L(a | Y=i) p_i\)

But of course \(L(i | Y=i) = 0\) and for all other values of Y is a constant, so this gives:

\(\rho(a | x) = (1-p_a) \lambda_s\)

So choosing amongst classes we want to choose the a which  has the minimum value of: \(1-p_a) \lambda_s\). So we need:

\(p_a \ge p_b \ \forall b \in \{1,\dots,C\}\)

This shows that if we are going to choose a class then as we should intuitively expect we choose the most probable one. However, we can also choose to reject. In this case we have an expected loss equal to \(\lambda_r\), and so if we are going to choose a class rather than reject then we need the best class to satisfy:

\((1-p_a) \lambda_s \le \lambda_r \ \ \implies p_a \ge 1 – \frac{\lambda_r}{\lambda_s} \)

(b) Describe qualitatively what happens as \(\lambda_r\) is increased from 0 to 1.

The threshold for rejection changes from 1 down to 0.

 

5.4 More reject options

Consider, for example, a case in which the cost of misclassification is $10, but the cost of having a human manually make the decision is only $3.

(a) Suppose \(P(y=1 | x)\) is predicted to be 0.2. Which decision minimises the expected loss?

This is pretty much the same question as the previous one – clearly we are going to choose \(y=0\) if we are going to classify at all. The threshold for rejection is then: \(1-\frac{\lambda_r}{\lambda_s}) = 1 – \frac{3}{10} = 0.7\). So as \(P(y=0 | x) = 0.8\), we don’t reject and classify as \(y=0\).

(b) Now suppose \(P(y=1 | x) = 0.4\). Now which decision is optimal?

In this case, \(P(y=0 | x)=0.6\) which is below the reject threshold, and so we choose the reject option as we are not sure enough to predict \(y=0\).

(c) Show that in general there are two thresholds \(\theta_0\) and \(\theta_1\) such that the optimal decision is to predict 0 if \(p_1 < \theta_0\), reject if \( \theta_0 \le p_1 \le \theta_1\) and predict 1 if \(p_1 > \theta_1\)

This is clear given the results of the previous question: \(\theta_0 = 0.3\) and \(\theta_1 = 0.7\)

 

5.5 Newsvendor problem

Consider the following classic problem from decision theory/ economics. Suppose you are trying to decide how much quantity of some product (e.g. newspapers) to buy to maximize your profits. The optimal amount will depend on how much demand D you think there is, as well as its cost C and its selling price P. Suppose D is unknown but has pdf f(D) and cdf F(D), We can evaluate the expected profit by considering two cases: if \(D > Q\), then we sell all Q items, and make profit \(\pi = (P-C)Q\), but if \( D < Q\) we only sell D items, at profit \( (P-C)D\), but have wasted \(C(Q-D)\) on the unsold items. So the expected profit if we buy quantity Q is:

\( \mathbb{E}_{\pi}(Q)  = \int_Q^{\infty} (P-C)Q \ f(D) \ dD + \int_0^Q (P-C)D \ f(D) – \int_0^Q C(Q-D) \ f(D) \ dD \)

Show that the optimal \(Q*\) satisfies:

\(F(Q*) = \frac{P-C}{P}\)

Let’s start with saying \(F(a) = P(D \le a) = \int_0^a f(D) dD\), since \(F(0)=0\). Now:

\(\int_0^Q f(D) dD + \int_Q^{\infty} f(D)dD = F(Q) + \int_Q^{\infty} f(D)dD = 1 \ \implies \int_Q^{\infty} f(D) dD = 1-F(Q)\)

So the first term in the expectation is just: \((P-C)Q(1-F(Q))\). So using this and multiplying out a bit we have:

\( \mathbb{E}_{\pi}(Q)  = (P-C)Q – PQ F(Q) + CQ F(Q) + \int_0^Q (P-C)Df(D)dD – \int_0^Q CQf(D) dD + C\int_0^Q D f(D) dD\)

But the second to last term is just \(-CQ F(Q)\) which cancels with the third term. The last term also cancels with a part of the fourth term, so we just have:

\( \mathbb{E}_{\pi}(Q)  = (P-C)Q – PQ F(Q) + \int_0^Q P D f(D) dD\)

Differentiating with respect to Q:

\( (P-C) – P F(Q) – PQ f(Q) + PQ f(Q) = 0\)

And so therefore:

\((P-C) = P F(Q*) \implies F(Q*) = \frac{P-C}{P}\)

 

5.6 Bayes factors and ROC curves

Let \( B= p(D | H_1) / p(D | H_0)\) be the Bayes factor in favour of model 1. Suppose we plot two ROC curves, one computed by thresholding \(B\) and the other computed by thresholding \(p(H_1 | D)\). Will they be the same or different?

I’m not sure if I fully understand what this question is trying to get at, but we can write \(B = \frac{p(D | H_1)}{ p(D | H_0)} = \frac{ p(H_0) p(H_1 | D) }{ p (H_1) p(H_0 | D)}\), and so it seems like in general using B as a threshold will not be the same as using \(p(H_1 | D) \)

 

5.7 Bayes model averaging helps predictive accuracy

Let \(\Delta\) be a quantity we are trying to predict. Then let \(\mathcal{D}\) is the observed data and \(\mathcal{M}\) be a finite set of models. Suppose our “action” is to provide a probabilistic prediction p(), with a loss function \( L(\Delta, p()) = – \log(p(\Delta)) \). We can predict \(\Delta\) in two different ways – firstly by using Bayes model averaging:

\( p^{BMA}(\Delta) = \sum_{m \in \mathcal{M}} p(\Delta | m, D) p(m | D) \)

or using a single model (plugin approximation):

\(p^m(\Delta) = p(\Delta | m, D) \)

Show that for all models the posterior expected loss using BMA is lower, i.e.:

\( \mathbb{E} \left[ L(\Delta, p^{BMA}) \right] \le \mathbb{E} \left[ L(\Delta, p^m) \right] \)

where the expectation over \(\Delta\) is with respect to:

\(p(\Delta | D) = \sum_m p(\Delta | m, D) p(m | D) \)

This actually turns out to be pretty straightforward – we can write:

\( \mathbb{E} \left[ L(\Delta, p^{BMA}) \right] = – \int p^{BMA}(\Delta | D) \log(p^{BMA}(\Delta | D)) d \Delta \)

and:

\( \mathbb{E} \left[ L(\Delta, p^m) \right] = -\int p^{BMA}(\Delta | D) \log(p^m(\Delta | D)) d \Delta \)

Now we consider the KL divergence between \(p^m\) and \(p^{BMA}\), using the fact that we know that it is always non-negative:

\(KL(p^{BMA}(\Delta | D), p^m(\Delta | D)) = \int p^{BMA}(\Delta | D) \log \left( \frac{p^{BMA}(\Delta | D)}{p^m(\Delta | D)} \right) d \Delta \ge 0 \)

\( = – \mathbb{E} \left[ L(\Delta, p^{BMA}) \right] + \mathbb{E} \left[ L(\Delta, p^m) \right] \ \implies \mathbb{E} \left[ L(\Delta, p^m) \right] \ge \mathbb{E} \left[ L(\Delta, p^{BMA}) \right] \)

 

5.8 MLE and model selection for a 2D distribution

Let \( x \in \{0,1\} \) denote a result of a coin toss (\(x=0\) for tails, \(x=1\) for heads). Let’s say that heads occurs with probability \(\theta_1\). Suppose that someone else observes the coin flip and observes the outcome, y. But this person is unreliable and only reports the result correctly with probability \(\theta_2\). Assume that \(\theta_2\) is independent of x and \(\theta_1\).

(a) Write down the probability distribution \(p(x,y | \theta\) in terms of \(\theta_1\) and \(\theta_2\).

\(p(x=0, y=0 | \theta) = (1-\theta_1) \theta_2\)

\(p(x=0, y=1 | \theta) = (1-\theta_1)(1-\theta_2) \)

\(p(x=1, y=0 | \theta) = \theta_1 (1-\theta_2) \)

\(p(x=1, y=1 | \theta) = \theta_1 \theta_2 \)

(b) Suppose we have the following dataset: \(\mathbf{x} = (1,1,0,1,1,0,0), \mathbf{y} = (1,0,0,0,1,0,1)\). What are the MLEs for \(\theta_1\) and \(\theta_2\)? What is \(P(D | \mathbf{\hat{\theta}}, M_2\)) (where \(M_2\) denotes this 2-parameter model)?

There are 4 heads in the data, and 4 values where the values of x and y are in agreement, so:

\(P(D | \mathbf{\theta}) = \theta_1^4 \theta_2^4 (1-\theta_1)^3 (1-\theta_2)^3\)

So the log-likelihood is:

\( \log(P(D|\mathbf{\theta})) = 4 \log(\theta_1) + 4 \log(\theta_2) + 3 \log(1-\theta_1) + 3 \log(1-\theta_2) \)

Taking the derivative with respect to \(\theta_1\) is identical to taking it with respect to \(\theta_2\), so the MLEs will be identical:

\( \frac{4}{\theta_1} – \frac{3}{1-\theta_1} = 0 \implies 4(1-\theta_1) = 3 \theta_1 \implies \theta_1 = \frac{4}{7} \)

Now looking at \(P(D | \mathbf{\hat{\theta}}, M_2)\):

\( \left(\frac{4}{7}\right)^4 \left(\frac{4}{7}\right)^4 \left(\frac{3}{7}\right)^3 \left(\frac{3}{7}\right)^3 \simeq 7 \times 10^{-5} \)

(c) Now consider a model with 4 parameters, \(\mathbf{\theta} = (\theta_{0,0}, \theta_{0,1}, \theta_{1,0}, \theta_{1,1})\), representing \(p(x,y|\mathbf{\theta}) = \theta_{x,y}\). Only three of these parameters are free (as they must sum to zero). What is the MLE of \(\mathbf{\theta}\)? What is \(p(D|\mathbf{\hat{\theta}}, M_4)\), where \(M_4\) denotes this 4-parameter model?

\(P(D | \mathbf{\theta}) = \theta_{0,0}^2 \theta_{1,0}^2 \theta_{1,1}^2 \theta_{0,1} \)

Now let’s say \( \theta_{0,1} = (1-\theta_{0,0} – \theta_{1,0} – \theta_{1,1}) \). Let’s take the log of this:

\( \log(P(D|\mathbf{\theta})) = 2 \log(\theta_{1,1}) + 2 \log(\theta_{1,0}) + 2 \log(\theta_{0,0}) + \log(1-\theta_{0,0} – \theta_{1,0} – \theta_{1,1}) \)

Clearly the MLE estimates of \(\theta_{0,0}\), \(\theta_{1,0}\) and \(\theta_{1,1}\) are equal, and \(\theta_{1,1} = 2 \theta_{0,1}\). It follows immediately that:

\( \hat{\theta}_{1,1} = \hat{\theta}_{1,0} = \hat{\theta}_{0,0} = \frac{2}{7} \)

\( \hat{\theta}_{0,1} = \frac{1}{7} \)

We can then evaluate \(P(D| \mathbf{\hat{\theta}}, M_4) = \frac{2^6}{7^7} \simeq 7.8 \times 10^{-5}\). This doesn’t seem significantly higher than the 2 parameter model, so already this is reason to suspect that the 2 parameter model would be preferable.

(d) Suppose we are not sure which model is correct. We compute the leave-one-out cross-validated log-likelihood of each model:

\(L(m) = \sum_{i=1}^n \log p(x_i, y_i | m, \hat{\theta}(\mathcal{D}_{-i}))\)

where \(\hat{\theta}(\mathcal{D}_{-i}) \) denotes the MLE computed on \(\mathcal{D}\) excluding row i. Which model will CV pick and why?

Initially I wrote some code to generate this for each model, but actually to answer the question this really isn’t necessary. The reason is that for \(M_4\) there is only one data point of \(x=0, y=1\), and so when we estimate the MLE for \(\theta_{0,1}\) this is zero. This means when we use it to evaluate the full log-likehood using our MLE estimate we get a \(\log(0) \to -\infty\) term. This means that clearly \(M_2\) will be favoured by this cross-validation technique.

(e) Recall that an alternative to CV is to use the BIC score, defined as:

\(\text{BIC}(M, \mathcal{D}) = \log p(\mathcal{D} | \mathbf{\hat{\theta}_{MLE}}) – \frac{\text{dof}(M)}{2} \log(N) \)

where \(\text{dof}(M)\) is the number of free parameters in the model. Compute the BIC scores for each model. Which is preferred?

For \(M_2\) we just have: \( \text{BIC} \simeq \log(7 \times 10^{-5}) – \log(7) \simeq -11.51\). For \(M_4\) we have:

\(\text{BIC} \simeq \log(7.8 \times 10^{-5}) – \frac{3}{2} \log(7) \simeq -12.38\)

Therefore we see that the BIC score also favours the simpler 2-parameter model!

 

5.9 Prove that the median is the optimal estimate under L1 loss

L1 loss means \( L(y,a) = |y-a|\), so the expected loss is:

\( \rho(a | x) = \mathbb{E} \left[ |y-a| | x \right] = \int p(y | x) |y-a| dy\)

Differentiating this:

\( \frac{d \rho}{da} = \int_{-\infty}^{\infty} – \text{sgn}(y-a) p(y|x) dy = \int_{-\infty}^a p(y|x)dy – \int_a^{\infty} p(y|x)dy = 0\)

This means:

\(\int_{-\infty}^a p(y|x)dy = \int_a^{\infty} p(y|x)dy\)

Which is essentially the definition of the median, so clearly a is the median.

 

5.10 Decision rule for trading off FPs and FNs

Let \(L_{FN} = c L_{FP}\). I’m pretty sure there is an error in the stated result because it just doesn’t make sense. If we pick \(\hat{y} = 1\), the expected loss is:

\(\rho(\hat{y}=1) = P(y=0 | x) L_{FP}\)

and similarly:

\(\rho(\hat{y}=0)=P(y=1 | x) L_{FN}\)

So we pick \(\hat{y}=1\) iff \(P(y=0|x) L_{FP} < P(y=1|x) L_{FN}\), which we can write as:

\( \frac{P(y=1 | x)}{P(y=0|x)} > \frac{L_{FP}}{L_{FN}} = \frac{1}{c} \)