Machine Learning – A Probabilistic Perspective Exercises – Chapter 2

Recently I’ve been becoming more and more interested in machine learning, and so far my attention has been primarily focused on reinforcement learning. I had a lot of fun working on the Big 2 AI, but I feel like I really need to invest more time in studying the fundamentals of machine learning. I’ve got myself a copy of “Machine Learning – A Probabilistic Perspective”, which seems like a great textbook, and so I’m going to work my way through it. I’ve decided to make a decent attempt at doing as many of the exercises as possible, and I feel like actually writing up an explanation for them is quite useful in making sure I actually understand what’s going on. Potentially it might also be useful for other people too, so I thought I would post my answers as I go! I may skip some exercises if I think they’re boring (or too “wordy”), or of course if I’m unable to do them (although I will probably mention them in this case)!

2.1 My neighbour has two children. Assuming that the gender of a child is like a coin flip, it is most likely, a priori, that my neighbour has one boy and one girl, with probability 1/2. The other possibilities (two boys or two girls) have probability 1/4 each.
(a) Suppose I ask him whether he has any boys, and he says yes. What is the probability that one child is a girl?

This is a standard probability puzzle. The key here is the wording of the question we ask him: whether he has any boys. A priori, there are four equally likely options for child 1 and child 2: BB (boy boy), BG, GB and GG. If he says yes to our question, this is compatible with three of the initial four options: BB, BG and GB. That is, we can only exclude the possibility GG. In two out of these three remaining possibilities he has one girl, and so the probability is 2/3.

(b) Suppose instead that I happen to see one of his children run by, and it is a boy. What is the probability that the other child is a girl?

In this case the question is more specific with its wording – we have seen one specific child, but this tells us nothing about the other child, and so we get the more intuitive answer of 1/2.
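Since it’s easy to get confused by these puzzles, here’s a quick Monte Carlo sanity check of both answers (just a sketch in Python; the sample size and variable names are arbitrary):

```python
import random

random.seed(0)
N = 100_000

# (a) Condition on the answer "yes" to "do you have any boys?"
hits = total = 0
for _ in range(N):
    kids = [random.choice("BG"), random.choice("BG")]
    if "B" in kids:              # he has at least one boy
        total += 1
        hits += "G" in kids      # ...and also a girl
print("(a)", hits / total)       # ~2/3

# (b) We happen to observe one specific child, and it is a boy.
hits = total = 0
for _ in range(N):
    kids = [random.choice("BG"), random.choice("BG")]
    seen = random.randrange(2)   # which child runs by is random
    if kids[seen] == "B":
        total += 1
        hits += kids[1 - seen] == "G"
print("(b)", hits / total)       # ~1/2
```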

 

2.3 Show Var[X+Y] = Var[X] + Var[Y] + 2 Cov[X,Y] for any two random variables X and Y

This one is relatively straightforward and just involves starting from a basic relationship for the variance:

\( Var[X+Y] = \mathbb{E}[(X+Y)^2] - \mathbb{E}[X+Y]^2 = \mathbb{E}[(X+Y)^2] - (\mathbb{E}[X] + \mathbb{E}[Y])^2 \)

(using the linearity of expectation). Expanding this out:

\( \mathbb{E}[X^2] + \mathbb{E}[Y^2] + 2 \mathbb{E}[XY] - \mathbb{E}[X]^2 - 2\mathbb{E}[X] \mathbb{E}[Y] - \mathbb{E}[Y]^2 \)

which clearly gives the required result (as \( Cov[X,Y] = \mathbb{E}[XY]-\mathbb{E}[X]\mathbb{E}[Y] \) ).
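As a quick numerical sanity check (a sketch using numpy; the correlated samples are just an arbitrary example, and bias=True makes np.cov use the same population convention as np.var):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)    # deliberately correlated with x

lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2 * np.cov(x, y, bias=True)[0, 1]
print(lhs, rhs)                     # agree up to sampling noise
```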

 

2.4 After your yearly checkup, the doctor has bad news and good news. The bad news is that you tested positive for a serious disease, and that the test is 99% accurate (i.e., the probability of testing positive given that you have the disease is 0.99, as is the probability of testing negative given that you don’t have the disease). The good news is that this is a rare disease, striking only one in 10,000 people. What are the chances that you actually have the disease?

This is another famous example of the use of Bayes’ theorem. If we define two events as follows:

A – you have the disease

B – the test returns a positive result

If you go and take the test, and you get a positive result saying that you have the disease, then what you want to calculate is \(p(A|B)\). That is, the probability that you have the disease given that you tested positive. By Bayes’ theorem, this is equivalent to:

\( p(A|B) = \frac{p(B|A) p(A)}{p(B)} \)

Now, since the test is 99% accurate, \( P(B|A) = 0.99 \), i.e. if you have the disease you will test positive 99% of the time. Meanwhile, \(P(A) = 1/10000\) is the probability of an “average” person in the population having the disease (with no other information about them). Finally, we can calculate \(P(B)\) as follows:

\(P(B) = 0.99 \times \frac{1}{10000} + 0.01 \times \frac{9999}{10000} \)

where the first term is the probability of testing positive given that you have the disease times the probability of actually having it, and the second term is the probability of testing positive given that you don’t have the disease (a false positive) times the probability of not having it.

Putting the numbers in you find that \(P(A|B) \approx 0.98 \% \), which makes it extremely unlikely that you have the disease despite testing positive! This is one example of why it would be a good idea if doctors learned some stats!
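Plugging the numbers in explicitly (a small Python check of the calculation above):

```python
p_disease = 1 / 10_000      # p(A): prior probability of the disease
p_pos_given_disease = 0.99  # p(B|A)
p_pos_given_healthy = 0.01  # p(B|not A): the false positive rate

# Total probability of testing positive, p(B):
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

posterior = p_pos_given_disease * p_disease / p_pos
print(posterior)            # ~0.0098, i.e. roughly 0.98%
```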

 

2.5 Monty Hall problem – On a game show, a contestant is told the rules as follows: There are three doors, labelled 1, 2, 3. A single prize has been hidden behind one of them. You get to select one door. Initially your chosen door will not be opened. Instead, the gameshow host will open one of the other two doors, and he will do so in such a way as not to reveal the prize. For example, if you first choose door 1, he will then open one of doors 2 and 3, and it is guaranteed that he will choose which one to open so that the prize will not be revealed. At this point, you will be given a fresh choice of door: you can either stick with your first choice, or you can switch to the other closed door. All the doors will then be opened and you will receive whatever is behind your final choice of door. Imagine that the contestant chooses door 1 first; then the gameshow host opens door 3, revealing nothing behind the door, as promised. Should the contestant (a) stick with door 1, or (b) switch to door 2, or (c) does it make no difference? You may assume that initially, the prize is equally likely to be behind any of the 3 doors.

This is another famous problem which has quite an interesting history, apparently leading many genuine mathematicians to complain about the correct solution when it was given in a newspaper column (although this was a little while ago, I still find it extremely surprising!). It also comes up online every now and then with people arguing about it/trying to explain why you should switch, so I’ll add my best attempt at explaining it here I guess!

For me, the crucial detail that helps in understanding it is that the host has information about where the prize is, which means that he is not choosing a door to open at random, since he will never open the door with the prize behind it. If you take the starting point as the prize having equal probability of being behind each door, then you only need to consider two possibilities: 1) the initial door you chose does not have the prize behind it (which happens 2/3 of the time), and 2) the initial door does have the prize behind it (which happens 1/3 of the time). In case 1), you have selected one of the doors without the prize behind it, which means the host has no choice but to open the only other door that doesn’t have the prize behind it. This means that in this case, if you switch, you win the prize – and remember, this happens 2/3 of the time. If instead you initially picked the winning door and you switch, then you lose, but this only happens 1/3 of the time. So it’s better to switch!
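And since this is one of those problems people never quite believe, here’s a quick Monte Carlo check (a sketch; when the host has two valid doors to open he picks one at random, which doesn’t affect the answer):

```python
import random

random.seed(0)
N = 100_000
wins_stick = wins_switch = 0
for _ in range(N):
    prize = random.randrange(3)
    pick = random.randrange(3)
    # The host opens a door that is neither our pick nor the prize.
    opened = random.choice([d for d in range(3) if d not in (pick, prize)])
    # Switching means taking the one remaining closed door.
    switched = next(d for d in range(3) if d not in (pick, opened))
    wins_stick += (pick == prize)
    wins_switch += (switched == prize)
print(wins_stick / N, wins_switch / N)   # ~1/3 vs ~2/3
```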

 

2.6 Conditional Independence

(a) Let \(H \in \{1,\dots,K\}\) be a discrete RV, and let e1 and e2 be the observed values of two other RVs E1 and E2. Suppose we wish to calculate the vector:

\( \vec{P}(H|e_1, e_2) = (P(H=1|e_1, e_2), \dots, P(H=K|e_1, e_2)) \)

Which of the following are sufficient for the calculation?

(i) P(e1,e2), P(H), P(e1|H), P(e2|H)

(ii) P(e1,e2), P(H), P(e1,e2|H)

(iii) P(e1|H), P(e2|H), P(H)

We can use Bayes’ theorem to write:

\( P(H| E_1, E_2) = \frac{P(E_1, E_2 | H) P(H)}{ P(E_1, E_2)} \)

so clearly (ii) is sufficient. The others are not in general sufficient.

(b) Now suppose we assume \(E_1 \bot E_2 | H\). Now which are sufficient?

Well, clearly (ii) still is. But conditional independence of E1 and E2 given H means that we can write:

\( P(E_1, E_2 | H) = P(E_1 | H) P(E_2 | H) \)

which means that (i) is also sufficient now.

However, we can also write:

\( P(e_1, e_2) = \sum_h p(e_1, e_2, h) = \sum_h p(e_1, e_2 | h) p(h) \)

which means that (iii) is sufficient too!

 

2.7 Show that pairwise independence does not imply mutual independence

The best way to do this is to show an example where we have pairwise independence, but not mutual independence. Consider rolling a fair, four-sided die. If we define three events, A = {1,2}, B = {1,3} and C = {1,4}, then clearly P(A) = P(B) = P(C) = 1/2. But also, P(A,B) = P(A,C) = P(B,C) = P({1}) = 1/4, which equals P(A)P(B) = P(A)P(C) = P(B)P(C). This means we have pairwise independence.

However, P(A,B,C) = P({1}) = 1/4 as well, which is not equal to P(A)P(B)P(C) = 1/8, and so we do not have mutual independence.
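Because the sample space is tiny, we can verify this exhaustively (using exact fractions so there is no floating-point ambiguity):

```python
from fractions import Fraction

# The four-sided die example: uniform over {1, 2, 3, 4}.
A, B, C = {1, 2}, {1, 3}, {1, 4}
p = lambda s: Fraction(len(s), 4)

# Pairwise independence holds:
for S, T in [(A, B), (A, C), (B, C)]:
    assert p(S & T) == p(S) * p(T)

# ...but mutual independence fails:
print(p(A & B & C), p(A) * p(B) * p(C))   # 1/4 vs 1/8
```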

 

2.9 Conditional Independence. Are the following properties true?

(a) \( (X \perp W \mid Z,Y) \text{ and } (X \perp Y \mid Z) \implies (X \perp Y, W \mid Z) \)

OK initially this looks pretty confusing, but once you break it down it’s not too bad.

The first condition tells us that:

\( P(X,W | Z,Y) = P(X | Z,Y) P(W| Z,Y) \)

and the second tells us:

\( P(X,Y | Z) = P(X | Z) P(Y | Z) \)

The condition we need to show is:

\( P(X,Y,W | Z) = P(X | Z) P(Y, W | Z) \)

Completely generally, we can use the chain rule of probability to say:

\( P(X,Y,W | Z) = P(W| X,Y,Z) P(Y| X,Z) P(X | Z) \)

However, since X and W are conditionally independent given Z,Y, we can say that \(P(W| X,Y,Z) = P(W | Y,Z) \) (this is because in general \(P(X,W | Y,Z) = P(W | X, Y, Z) P(X | Y, Z) \), so if conditional independence holds, then so must the previous identity). Similarly, \(P(Y|X,Z) = P(Y|Z)\), since X and Y are conditionally independent given Z. This means that:

\(P(X,Y,W | Z) = P(W|Y,Z)P(Y|Z) P(X|Z) \)

Then we can say that \(P(W|Y,Z)P(Y|Z) = P(W,Y|Z) \), which gives us exactly what we need.

I couldn’t be bothered to do part (b), but I expect it’s kind of similar.

 

2.10 Deriving the inverse Gamma density

Let \( X \sim Ga(a,b) \), such that:

\( p(x|a,b) = \frac{b^a}{\Gamma(a)} x^{a-1} e^{-xb} \).

If \( Y = 1/X \), show that \( Y \sim IG(a,b) \), where:

\(IG(x |a,b) = \frac{b^a}{\Gamma(a)} x^{-(a+1)}e^{-b/x} \)

Let’s start from the CDF of Y:

\( P(Y \le y) = P(\frac{1}{X} \le y) = P(X \ge \frac{1}{y}) = 1 - P(X \le \frac{1}{y}) \)

If we let \( u = 1/y \), then:

\( p(y) = \frac{d}{dy} P(Y \le y) = \frac{du}{dy} \frac{d}{du}(1-P(X \le u)) = \frac{1}{y^2} Ga(u | a, b) \)

where the two minus signs (one from \( du/dy = -1/y^2 \), and one from differentiating \( 1 - P(X \le u) \)) cancel. Substituting \( u = 1/y \) back in, we find:

\( p(y) = \frac{b^a}{\Gamma(a)} y^{-(a+1)} e^{-b/y} \)

which is the result we require.
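A quick check with scipy (a sketch; note that scipy parameterises the Gamma by a scale, so a rate of b corresponds to scale=1/b, while its invgamma with scale=b matches the IG(a, b) density above):

```python
import numpy as np
from scipy import stats

a, b = 3.0, 2.0
rng = np.random.default_rng(0)

# Draw X ~ Ga(a, rate=b) and transform to Y = 1/X.
y = 1.0 / stats.gamma.rvs(a, scale=1 / b, size=200_000, random_state=rng)

# Compare the sample mean of Y against the known IG mean b/(a-1),
# which is finite for a > 1, and evaluate the derived pdf at a point.
print(y.mean(), b / (a - 1))
print(stats.invgamma.pdf(1.0, a, scale=b))
```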

 

2.11 Basically asking to evaluate: \( \int_{-\infty}^{\infty} e^{-\frac{x^2}{2 \sigma^2}} dx \)

I remember seeing how to do this before – the trick is to consider the square of the integral and then convert to circular polar coordinates, so that you’re just left with an integral from \(r=0 \to \infty \) and a constant integral over \(\theta\). Since we change from \( dx \, dy \) to \(r \, dr \, d\theta \), the extra factor of r allows the integral over r to be evaluated, as sketched below.
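Explicitly, calling the integral \(I\) and squaring it:

\( I^2 = \int_{-\infty}^{\infty} e^{-\frac{x^2}{2 \sigma^2}} dx \int_{-\infty}^{\infty} e^{-\frac{y^2}{2 \sigma^2}} dy = \int_0^{2\pi} d\theta \int_0^{\infty} e^{-\frac{r^2}{2 \sigma^2}} r \, dr = 2 \pi \left[ -\sigma^2 e^{-\frac{r^2}{2 \sigma^2}} \right]_0^{\infty} = 2 \pi \sigma^2 \)

so \( I = \sqrt{2 \pi \sigma^2} \), which is exactly the normalisation constant of a Gaussian.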

 

2.12 Expressing mutual information in terms of entropies: show that \( I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) \)

If we start from the definition in the textbook of the MI being equal to the KL divergence between \(P(X,Y)\) and \(P(X)P(Y)\) then we have:

\( I(X;Y) = \sum_{x,y} P(x,y) \log(\frac{P(x,y)}{P(x)P(y)}) = \sum_{x,y} P(x,y) [ \log(P(x,y)) - \log(P(x)) - \log(P(y)) ] \)

If we write \(P(x,y) = P(x | y) p(y) \) then we have:

\(I(X;Y) = \sum_{x,y} P(x|y)P(y) [ \log(P(x|y)) - \log(P(x))] \)

We can then use the fact that the conditional entropy \(H(X|Y) = -\sum_y p(y) \sum_x p(x|y) \log(p(x|y)) \), which is the negative of the first term we have. The second term, \(-\sum_{x,y} P(x,y) \log(P(x))\), is clearly just \(H(X)\), and so we have found that \(I(X;Y) = H(X) - H(X|Y)\). It is easy to show the second identity by writing \(P(x,y) = P(y|x)p(x)\) instead.
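These identities are easy to check numerically on a random discrete joint distribution (a sketch with numpy; the 3×4 table is arbitrary, and I use the standard chain rule \(H(X|Y) = H(X,Y) - H(Y)\) to get the conditional entropies):

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((3, 4))
P /= P.sum()                    # a random joint distribution p(x, y)
px, py = P.sum(axis=1), P.sum(axis=0)

H = lambda p: -np.sum(p * np.log(p))   # entropy (all entries > 0 here)

I_kl = np.sum(P * np.log(P / np.outer(px, py)))  # KL definition of MI
H_x_given_y = H(P) - H(py)      # H(X|Y) = H(X,Y) - H(Y)
H_y_given_x = H(P) - H(px)      # H(Y|X) = H(X,Y) - H(X)
print(I_kl, H(px) - H_x_given_y, H(py) - H_y_given_x)   # all equal
```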

 

2.13 Mutual information for correlated normals. Find the mutual information I(X1; X2) where X has a bivariate normal distribution:

\( \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \sim \mathcal{N} \left( \mathbf{0}, \begin{bmatrix} \sigma^2 & \rho \sigma^2 \\ \rho \sigma^2 & \sigma^2 \end{bmatrix} \right) \)

Evaluate when \(\rho = -1, 0, 1 \)

The question actually gives you the form for the differential entropy for a multivariate normal, but I found a derivation which I thought was pretty nice and so I’m going to include it here (this is for a completely general MV normal of any dimension).

The pdf of a MV normal is:

\( p(\mathbf{x}) = \frac{1}{\sqrt{det(2 \pi \Sigma)}} e^{-\frac{(\mathbf{x}-\mathbf{\mu})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu})}{2}} \)

Now the differential entropy is:

\( H(\mathbf{x}) = \int d\mathbf{x} \left[ \log(\sqrt{det(2 \pi \Sigma)}) + \frac{1}{2}(\mathbf{x}-\mathbf{\mu})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu}) \right] \mathcal{N}(\mathbf{x}; \mathbf{\mu}, \Sigma) \)

Now the nice trick here is that, by definition, the covariance matrix \( \Sigma = \mathbb{E} \left[ (\mathbf{x}-\mathbf{\mu}) (\mathbf{x}-\mathbf{\mu})^T \right] \), and the second term in the differential entropy is 1/2 times \( \mathbb{E} \left[ (\mathbf{x}-\mathbf{\mu})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu}) \right] \). Crucially, since this is a scalar we can say it’s equal to its trace, i.e.

\( \mathbb{E} \left[ (\mathbf{x}-\mathbf{\mu})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu}) \right] = \mathbb{E} \left[ Tr\left( (\mathbf{x}-\mathbf{\mu})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu}) \right) \right] \)

Then we can use the cyclic properties of traces, i.e. \( Tr(ABC) = Tr(BCA) = Tr(CAB) \), such that this is equal to:

\( \mathbb{E} \left[ Tr\left( (\mathbf{x} -\mathbf{\mu})(\mathbf{x}-\mathbf{\mu})^T \Sigma^{-1} \right) \right] = Tr \left[ \Sigma \Sigma^{-1} \right] = d \)

where d is the number of random variables (i.e. the dimension). Now \( d = \log(e^d) \) and \(det(2\pi \Sigma) = (2 \pi)^d det(\Sigma) \), so we are left with:

\(H(\mathbf{x}) = \frac{1}{2} \log((2 \pi e)^d det(\Sigma)) \)

Now that we have this, the rest of the question is relatively straightforward, as we can rewrite the mutual information as \(I(X;Y) = H(X) + H(Y) - H(X,Y)\), and so in this case:

\(I(X_1; X_2) = \log(2 \pi e \sigma^2) - \frac{1}{2} \log((2 \pi e)^2 det(\Sigma)) \)

Now, \(det(\Sigma) = \sigma^4 - \rho^2 \sigma^4 = \sigma^4 (1 - \rho^2)\), and so making a few algebraic manipulations we arrive at:

\(I(X_1; X_2) = \frac{1}{2} \log(\frac{1}{1-\rho^2}) \)

As \( \rho \to 0\), \(I \to 0 \), and as \( \rho \to \pm 1 \), \(I \to \infty \). Intuitively this makes sense – if \(\rho = 0\), there is no correlation and so the variables give us no information about each other. If they are perfectly correlated, we learn everything about \(X_1\) from \(X_2\), and an infinite amount of information can be stored in a real number.
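A quick check that the closed form matches the entropy decomposition (a sketch with numpy; the value of \(\sigma^2\) and the choice of \(\rho\) values are arbitrary):

```python
import numpy as np

sigma2 = 1.5
for rho in [0.0, 0.5, 0.9, 0.99]:
    Sigma = sigma2 * np.array([[1.0, rho], [rho, 1.0]])
    H1 = 0.5 * np.log(2 * np.pi * np.e * sigma2)       # H(X1) = H(X2)
    H12 = 0.5 * np.log((2 * np.pi * np.e) ** 2 * np.linalg.det(Sigma))
    mi = 2 * H1 - H12                # I = H(X1) + H(X2) - H(X1, X2)
    print(rho, mi, -0.5 * np.log(1 - rho**2))          # the two agree
```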

 

2.14 Normalised mutual information. Let X and Y be discrete and identically distributed RVs (so H(X) = H(Y)). Let:

\(r = 1 - \frac{H(Y|X)}{H(X)}\)

(a) Show \( r = \frac{I(X;Y)}{H(X)} \):

This is quite straightforward: \(r = \frac{H(X) - H(Y|X)}{H(X)} = \frac{I(X;Y)}{H(X)} \), using the fact that \(H(X) = H(Y)\), so that \(H(X) - H(Y|X) = H(Y) - H(Y|X) = I(X;Y)\).

(b) Show \( 0 \le r \le 1 \)

Entropy is always non-negative (for discrete RVs), so clearly \( r \le 1 \) from its initial definition. Then, since the mutual information is \( \ge 0 \), we also have \( r \ge 0 \). I guess we should prove that \(I(X;Y) \ge 0 \) though. We do this starting from the initial definition:

\(I(X;Y) = -\sum_{x,y} p(x,y) \log(\frac{p(x)p(y)}{p(x,y)}) \)

Now, since the negative logarithm is convex, we can apply Jensen’s inequality, \(\sum_i \lambda_i f(x_i) \ge f(\sum_i \lambda_i x_i) \) for convex \(f\), where \( \sum_i \lambda_i = 1\):

\( I(X;Y) \ge -\log \left( \sum_{x,y} p(x,y) \frac{p(x) p(y)}{p(x,y)} \right) = 0 \)

(c) When is r=0?

When \(I(X;Y) = 0\), i.e. when the variables give us no information about each other.

(d) When is r=1?

When \(H(Y|X) = 0\), i.e. when the variables tell us everything about each other (once you know x, there is no remaining uncertainty in y). I guess this is why r can be thought of as a normalised mutual information!
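As a sanity check, here’s r computed at the two extremes for a simple pair of binary variables (a sketch; the joint tables are toy examples):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]                     # ignore zero-probability entries
    return -np.sum(p * np.log(p))

joints = {"independent": np.full((2, 2), 0.25),
          "Y = X":       np.array([[0.5, 0.0], [0.0, 0.5]])}
for name, P in joints.items():
    px, py = P.sum(axis=1), P.sum(axis=0)
    I = entropy(px) + entropy(py) - entropy(P)   # mutual information
    print(name, I / entropy(px))                 # r = 0.0, then r = 1.0
```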

 

2.15 MLE minimises the KL divergence with empirical distribution

\(P_{emp}(x_i) = N_i / N\) (sticking with the discrete case for now), where \(N_i\) is the number of occurrences of \(x_i\) and \(N\) is the total amount of data.

\(KL(p_{emp} || q(x;\theta)) = \sum_i p_{emp}(x_i) \log(\frac{p_{emp}(x_i)}{q(x_i;\theta)}) = \sum_i p_{emp}(x_i) \log(p_{emp}(x_i)) - \sum_i p_{emp}(x_i) \log(q(x_i;\theta)) \)

The first term here is fixed by the data and so it is clear that:

\( argmin_{\theta} KL (p_{emp} || q) = argmax_{\theta} \frac{1}{N} \sum_i N_i \log(q(x_i;\theta)) \)

which is the same as maximising the log-likelihood, and hence the likelihood.
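For a concrete illustration, here’s a Bernoulli example where we minimise the KL divergence numerically and recover the MLE (a sketch; the toy data and the use of scipy’s bounded minimiser are just illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize_scalar

data = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])       # toy sample
p_emp = np.array([np.mean(data == 0), np.mean(data == 1)])

def kl(theta):                     # KL(p_emp || Bernoulli(theta))
    q = np.array([1 - theta, theta])
    return np.sum(p_emp * np.log(p_emp / q))

res = minimize_scalar(kl, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, data.mean())          # both ~0.7: the minimiser is the MLE
```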

 

2.16 Derive the mean, mode and variance of the Beta(a,b) distribution.

\(Beta(x; a,b) = \frac{\Gamma(a+b)}{\Gamma(a) \Gamma(b)} x^{a-1} (1-x)^{b-1} \equiv \frac{x^{a-1}(1-x)^{b-1}}{B(a,b)} \)

To get the mode you just differentiate wrt x and set equal to zero; doing the algebra gives \( (a-1)(1-x) = (b-1)x \), i.e. \( x_{mode} = \frac{a-1}{a+b-2} \) (valid for \(a, b > 1\)).

For the mean:

\( \text{mean} = \mathbb{E}[X] = \frac{1}{B(a,b)} \int_0^1 x^a (1-x)^{b-1} dx \)

Integrating by parts we find:

\( \mathbb{E}[X] = \frac{a}{B(a,b) b} \int_0^1 x^{a-1} (1-x)^b dx = \frac{a}{B(a,b) b} \left[ \int_0^1 x^{a-1} (1-x)^{b-1}dx - \int_0^1 x^a (1-x)^{b-1} dx \right]\)

However, the first of these integrals is just \(B(a,b)\) (the normalisation of the Beta distribution), and the second, divided by \(B(a,b)\), is exactly what we started with, i.e. \( \mathbb{E}[X]\), and hence:

\( \mathbb{E}[X] = \frac{a}{b}(1-\mathbb{E}[X])\)

rearranging we find: \( \mathbb{E}[X] = \frac{a}{a+b} \). You should be able to get the variance using the same kind of trick but it looked like a lot more algebra and I was too tired to attempt it!
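A quick check of the mean with scipy (and, while we’re at it, of the variance we skipped, which should be the standard result \( \frac{ab}{(a+b)^2(a+b+1)} \)):

```python
from scipy import stats

a, b = 2.0, 5.0
print(stats.beta.mean(a, b), a / (a + b))                       # both 2/7
print(stats.beta.var(a, b), a * b / ((a + b)**2 * (a + b + 1)))  # agree
```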

 

2.17 Expected value of the minimum. Let X, Y be iid U(0,1) RVs. What is the expected value of min(X,Y)?

If we let the CDF of X (and Y) be \(F(x) = x\) (for x between 0 and 1), then:

\(P(\text{min}(X,Y) \ge x) = 1-P(\text{min}(X,Y) \le x) = P( X \ge x, Y \ge x) = (1-F(x))^2 = (1-x)^2\)

Therefore \(P(\text{min}(X,Y) \le x) = 1 - (1-x)^2 \), and so we can write the probability density of the minimum as:

\(p_{min}(x) = 2(1-x)\)

as such, the expected value of the minimum, which is what we want, is:

\( \mathbb{E}[min] = 2 \int_0^1 x(1-x) dx = 1/3 \)
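And a one-line simulation to confirm (a sketch with numpy):

```python
import numpy as np

rng = np.random.default_rng(0)
xy = rng.random((1_000_000, 2))      # iid U(0,1) pairs (X, Y)
print(xy.min(axis=1).mean())         # ~1/3
```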

 

There are a couple of questions I skipped; maybe at some point I’ll get back to them, but I don’t think they were especially interesting/informative. If you have any questions, or if you spot any errors, please leave a comment! I have done most of the Chapter 3 exercises now, and so will be writing those up soon as well!

Author: Henry Charlesworth

I am currently a PhD student at the University of Warwick, based in the Centre for Complexity Science. My research has been looking at how the principle of "future state maximisation" can be applied both to explaining behaviour in animal systems and to generating behaviour in artificial systems. Loosely speaking, this is the idea that, all else being equal, it is preferable to keep your options open as much as possible. I am also very interested in artificial intelligence and machine learning, particularly reinforcement learning.
