This is a continuation of the exercises in “Machine learning – a probabilistic perspective” by Kevin Murphy. Chapter 3 is on “Generative Models for Discrete Data”.

**3.1 MLE for the Bernoulli/ binomial model**

We start off with a nice simple one. If we have data D consisting of \(N_1\) heads out of a total of \(N\) trials, then the likelihood is \(P(D|\theta) = \theta^{N_1} (1-\theta)^{N-N_1} \). However, it’s a lot easier to work with the log likelihood here:

\( \log(P(D|\theta)) = N_1 \log(\theta) + (N-N_1) \log(1-\theta) \)

Now we take the derivative:

\(\frac{d}{d \theta} \log P(D|\theta) = \frac{N_1}{\theta} - \frac{N-N_1}{1-\theta} = 0 \ \ \implies \frac{N_1}{\theta} = \frac{N-N_1}{1-\theta} \)

and rearrange to find that the MLE is \( \hat{\theta} = \frac{N_1}{N} \)
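As a quick sanity check on the algebra, here is a minimal sketch that maximises the log likelihood over a grid of \(\theta\) values and compares the result with \(N_1/N\). The function names and the counts (7 heads in 10 tosses) are just illustrative assumptions, not part of the exercise.

```python
import math


def bernoulli_log_lik(theta, n1, n):
    """Log likelihood N1*log(theta) + (N - N1)*log(1 - theta)."""
    return n1 * math.log(theta) + (n - n1) * math.log(1 - theta)


def grid_mle(n1, n, steps=1000):
    """Maximise the log likelihood over an interior grid of theta values."""
    grid = [i / steps for i in range(1, steps)]  # excludes 0 and 1
    return max(grid, key=lambda t: bernoulli_log_lik(t, n1, n))


# Hypothetical data: 7 heads out of 10 tosses.
print(grid_mle(7, 10), 7 / 10)
```

The grid search is only there to confirm the closed form; in practice you would just use \(N_1/N\) directly.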

**3.2 Marginal likelihood for the Beta-Bernoulli model.**

This question is looking at deriving the marginal likelihood, \(P(D) = \int P(D|\theta) P(\theta) d\theta \). We are told to use the chain rule of probability: \(P(x_{1:N}) = p(x_1) p(x_2 | x_1) p(x_3 | x_{1:2})\dots \)

and reminded that in the chapter we derived the posterior predictive distribution:

\( P(X=k | D_{1:N}) = \frac{N_k + \alpha_k}{\sum_i (N_i + \alpha_i)} \)

We are then given an example: suppose D = H,T,T,H,H (or D=1,0,0,1,1). It follows that:

\(P(D) = \frac{\alpha_1}{\alpha} \frac{\alpha_0}{\alpha + 1} \frac{\alpha_0+1}{\alpha+2} \frac{\alpha_1 + 1}{\alpha + 3} \frac{\alpha_1+2}{\alpha+4} \)

where we have just applied the chain rule, using the posterior predictive distribution after each data point has been collected. It’s clear that if we do this more generally (for any collection of data), we will be left with:

\(P(D) = \frac{\left[ \alpha_0 \dots (\alpha_0 + N_0 - 1) \right] \left[ \alpha_1 \dots (\alpha_1 + N_1 - 1) \right]}{\alpha \dots (\alpha + N-1)} \)

We then note that this can be re-written using factorials as follows:

\(P(D) = \frac{(\alpha_0+N_0-1)! (\alpha_1 + N_1 -1)! (\alpha-1)!}{(\alpha_0-1)! (\alpha_1-1)! (\alpha + N -1)!} \)

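Now remember that \( \Gamma(N) = (N-1)! \) for integer \(N\) (the same cancellation works for non-integer \(\alpha\) via \(\Gamma(x+1) = x\Gamma(x)\)), and \( \alpha = \alpha_0 + \alpha_1 \), so we get the result which is given in the question: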

\(P(D) = \frac{ \Gamma(\alpha_0 + N_0) \Gamma(\alpha_1 + N_1) \Gamma(\alpha_1 + \alpha_0) }{\Gamma(\alpha_1 + \alpha_0 + N) \Gamma(\alpha_1) \Gamma(\alpha_0) } \)
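To convince myself the two routes agree, here is a small sketch that computes \(P(D)\) for the example sequence both ways: by multiplying the sequential posterior predictive probabilities, and from the Gamma function formula. The prior values \(\alpha_1 = 2, \alpha_0 = 3\) are arbitrary assumptions chosen for illustration.

```python
import math


def marginal_chain_rule(data, a1, a0):
    """P(D) via the chain rule, updating the counts after every toss."""
    n1 = n0 = 0
    prob = 1.0
    for x in data:
        total = a1 + a0 + n1 + n0
        if x == 1:
            prob *= (a1 + n1) / total
            n1 += 1
        else:
            prob *= (a0 + n0) / total
            n0 += 1
    return prob


def marginal_gamma(n1, n0, a1, a0):
    """P(D) from the closed-form ratio of Gamma functions, in log space."""
    log_p = (math.lgamma(a0 + n0) + math.lgamma(a1 + n1) + math.lgamma(a1 + a0)
             - math.lgamma(a1 + a0 + n1 + n0) - math.lgamma(a1) - math.lgamma(a0))
    return math.exp(log_p)


data = [1, 0, 0, 1, 1]  # H, T, T, H, H
print(marginal_chain_rule(data, a1=2, a0=3))
print(marginal_gamma(n1=3, n0=2, a1=2, a0=3))
```

Working with `lgamma` rather than `gamma` is a design choice to avoid overflow when the counts get large.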

**3.3 Posterior predictive for Beta-Binomial model**

In the text the posterior predictive distribution for the Beta-Binomial model was derived for the case of predicting the outcome of multiple future trials, given the data:

\(P(x|n,D) = \frac{B(x+\alpha_1, n-x+\alpha_0)}{B(\alpha_1, \alpha_0)} \binom{n}{x} \)

where \(\alpha_1\) and \(\alpha_0\) involve the prior parameters and the data. The question simply asks to show that when \(n=1\) that we have: \(P(x=1|D) = \frac{\alpha_1}{\alpha_0 + \alpha_1}\).

To do this we need to remember that by definition, \(B(a,b) = \frac{\Gamma(a) \Gamma(b)}{\Gamma(a+b)} \), hence:

\( \frac{B(1+\alpha_1, \alpha_0)}{B(\alpha_1, \alpha_0)} = \frac{\Gamma(1+\alpha_1) \Gamma(\alpha_0)}{\Gamma(1 + \alpha_0 + \alpha_1)} \frac{\Gamma(\alpha_0 + \alpha_1)}{\Gamma(\alpha_0) \Gamma(\alpha_1)} \)

But then we simply note the following: \(\Gamma(1+\alpha_1) = \alpha_1! = \alpha_1 (\alpha_1-1)! = \alpha_1 \Gamma(\alpha_1)\). Using this and simplifying clearly leaves us with the desired result.

**3.4 Beta updating from censored likelihood**

Suppose we toss a coin \(n=5\) times. Let X be the number of heads. Suppose that we observe there are fewer than 3 heads – but we don’t know precisely how many. The prior we use is \(P(\theta) = \text{Beta}(\theta|1,1)\). Compute the posterior, \(P(\theta | X < 3) \).

Now \( \text{Beta}(\theta|1,1)\) is a constant, so it plays no real role here. As such, the posterior is:

\(P(\theta | X<3) \propto P(X < 3|\theta) P(\theta) \propto P(X<3 | \theta) \)

So we just need to consider the likelihood, \(P(X<3|\theta)\). This is straightforward to calculate as it is the sum of the probability of no heads, plus the probability of one head, plus the probability of two heads, i.e.:

\(P(X<3 | \theta) = (1-\theta)^5 + \binom{5}{1}(1-\theta)^4 \theta + \binom{5}{2} (1-\theta)^3 \theta^2 = \text{Bin}(0|\theta, 5) + \text{Bin}(1|\theta,5) + \text{Bin}(2|\theta,5) \)

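Each term is an unnormalised Beta density in \(\theta\): \(\text{Bin}(k|\theta,5) = \binom{5}{k} B(k+1, 6-k) \, \text{Beta}(\theta|k+1, 6-k)\), and \(\binom{5}{k} B(k+1, 6-k) = \frac{1}{6}\) for every \(k\). It follows that:

\(P(\theta | X<3) = \frac{1}{3} \left[ \text{Beta}(\theta|1,6) + \text{Beta}(\theta|2,5) + \text{Beta}(\theta|3,4) \right] \)

which is a mixture of three Beta distributions with equal weights.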
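The weights of the mixture components can be checked numerically. The sketch below (function names are my own) computes the unnormalised weight \(\binom{5}{k} B(k+1, 6-k)\) of each Beta component and normalises them; under the uniform prior they should all come out equal.

```python
import math


def beta_fn(a, b):
    """Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)


def mixture_weights(n=5, max_heads=2):
    """Normalised weights of the Beta(k+1, n-k+1) components, k = 0..max_heads."""
    raw = [math.comb(n, k) * beta_fn(k + 1, n - k + 1) for k in range(max_heads + 1)]
    total = sum(raw)
    return [w / total for w in raw]


print(mixture_weights())  # one weight per component Beta(1,6), Beta(2,5), Beta(3,4)
```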

**3.5 Uninformative prior for log-odds ratio**

Let \( \phi = \text{logit}(\theta) = \log(\frac{\theta}{1-\theta}) \). If \(p(\phi) \propto 1\), show \(p(\theta) \propto \text{Beta}(\theta| 0,0) \).

If we simply apply the change of variables formula from chapter 2:

\(p(\theta) = \bigg| \frac{d\phi}{d\theta} \bigg| p(\phi) \)

but \(p(\phi)\) is a constant, and \(\phi = \log(\theta) - \log(1-\theta)\), so:

\( \frac{d\phi}{d\theta} = \frac{1}{\theta} + \frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)} \)

Hence \(p(\theta) \propto \theta^{-1}(1-\theta)^{-1}\). Remembering the definition of the Beta distribution, \(\text{Beta}(x|a,b) = \frac{1}{B(a,b)} x^{a-1}(1-x)^{b-1}\), this is the functional form with \(a=b=0\), and so \(p(\theta) \propto \text{Beta}(\theta|0,0) \).

**3.6 MLE for Poisson distribution**

Definition: \( \text{Poi}(x|\lambda) = \frac{\lambda^x}{x!} e^{-\lambda}\)

So the likelihood of a set of data \( \{x_i\} \) is given by:

\(L(\lambda ; \{x_i\}) =\prod_i \frac{\lambda^{x_i}}{x_i!} e^{-\lambda} \)

Unsurprisingly, it’s easier to work with the log-likelihood:

\( l(\lambda; \{x_i\}) = \sum_i \left[ -\lambda + x_i \log(\lambda) - \log(x_i !) \right]\)

If we ignore the term that doesn’t depend on \(\lambda\) then we are left with \(-\lambda N + \log(\lambda) \sum_i x_i \). Differentiating this w.r.t. \(\lambda\) and setting it equal to zero, we are left with

\( -N + \frac{1}{\lambda} \sum_i x_i = 0 \ \ \ \implies \hat{\lambda} = \frac{1}{N} \sum_i x_i \)

**3.7 Bayesian analysis of the Poisson distribution**

(a) Derive the posterior assuming a Gamma prior

The prior we are told to use is:

\(P(\lambda) = Ga(\lambda |a,b) \propto \lambda^{a-1} e^{-\lambda b}\).

Then the posterior is proportional to the likelihood times the prior, i.e. \(P(\lambda | D) \propto P(D | \lambda) P(\lambda) \).

We already looked at the likelihood for the Poisson distribution in the previous section, so:

\(P(\lambda | D) \propto \left[ \prod_{i=1}^N \frac{e^{-\lambda} \lambda^{x_i}}{x_i!} \right] \lambda^{a-1} e^{-\lambda b} \propto e^{-\lambda(N+b)}\lambda^{a-1+\sum_i x_i} \propto Ga(\lambda | a + \sum_i x_i, N+b) \)

So we see that the posterior is also a Gamma distribution, making the Gamma distribution a conjugate prior to the Poisson distribution.

**(b) Posterior mean as a->0, b->0**

We use the fact that we derived the mean of a Gamma distribution in the text, finding that it is equal to a/b. The posterior mean is therefore \(\frac{a + \sum_i x_i}{N + b}\), which in the limit \(a \to 0, b \to 0\) is just:

\( \frac{1}{N} \sum_{i=1}^N x_i \).

This is equal to the MLE that we found in the previous section.
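The update is simple enough to write down in a few lines. This is only a sketch with made-up counts (the data and prior values are assumptions for illustration), showing the Gamma posterior parameters and how the posterior mean approaches the sample mean as the prior parameters shrink.

```python
def gamma_poisson_posterior(data, a, b):
    """Posterior Ga(a + sum(x), b + N) for Poisson data under a Ga(a, b) prior."""
    return a + sum(data), b + len(data)


def gamma_mean(a, b):
    """Mean of a Gamma(a, b) distribution in the shape/rate parameterisation."""
    return a / b


data = [3, 1, 4, 2]  # hypothetical Poisson counts
print(gamma_poisson_posterior(data, a=2, b=1))
print(gamma_mean(*gamma_poisson_posterior(data, a=1e-9, b=1e-9)))  # close to the MLE
print(sum(data) / len(data))  # the MLE, i.e. the sample mean
```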

**3.8 MLE for the uniform distribution**

Consider a uniform distribution centered on 0 with width 2a. \(p(x) = \frac{1}{2a} I(x \in [-a,a]) \) is the density function.

(a) Given a data set x1, …, xn, what is the MLE estimate of a?

The key point here is that \(P(D|a) = 0\) for any a which is less than the data point with the largest magnitude, and equal to \(\frac{1}{(2a)^n}\) for any a at least as large as this. This is a decreasing function of a, so the likelihood is clearly maximised when a is made as small as possible, i.e. \(\hat{a} = \max_i |x_i|\).

(b) What probability would the model assign to a new data point \(x_{n+1}\) using the MLE estimate for a?

Clearly \( \frac{1}{2\hat{a}}\) if \(|x_{n+1}| \le \hat{a}\), and 0 otherwise.

**(c) Do you see a problem with this approach?**

Yes, there is an issue here as any value with an absolute value larger than \(max |x_i|\) is assigned zero probability. For relatively small data sets this will be a big issue, but even for larger data sets it seems far from ideal.

**3.9 Bayesian analysis of the uniform distribution**

This is very much a continuation of the previous question, although here we are told to consider a \(\text{Unif}(0,\theta)\) distribution. The MLE is now \(\text{max}(D)\). Overall I’m fairly sure the question is either extremely poorly worded or has some mistakes, so I’m just going to go through it in the way that makes sense to me.

We are told to use a Pareto prior – the Pareto distribution is defined as:

\(p(x | k,m) = k m^k x^{-(k+1)} I(x \ge m) \)

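where I is an indicator function. So a Pareto prior on \(\theta\) with shape \(K\) and scale \(b\), i.e. \(\text{Pareto}(\theta | K, b)\) in the argument order of the definition above, is: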

\( P(\theta) = K b^K \theta^{-(K+1)} I(\theta \ge b) \)

We more or less established the likelihood in the previous section, which is given by:

\(P(D | \theta) = \frac{1}{\theta^N} I(\theta \ge max(D)) \)

This means that the joint distribution, \(P(D,\theta)\), is given by \(P(D,\theta) = P(D | \theta) P(\theta) = \frac{K b^K}{\theta^{N+K+1}} I(\theta \ge m)\) where \(m = \text{max}(b, \text{max}(D))\).

We can use this to write the marginal likelihood:

\(P(D) = \int P(D,\theta) d\theta = \int_m^{\infty} \frac{K b^K}{\theta^{N+K+1}} d\theta = \frac{K b^K}{(N+K) m^{N+K}} \)

Now the posterior is given by:

\(P(\theta | D) = \frac{P(\theta, D)}{P(D)} = \frac{(N+K) m^{N+K}}{\theta^{N+K+1}} I(\theta \ge m) = \text{Pareto}(\theta | N+K, m=\text{max}(D,b))\)
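Here is a rough sketch of the update, plus a numerical check of the marginal likelihood integral. The data, the prior values and all function names are assumptions of mine; the numerical integral uses the substitution \(u = 1/\theta\) so that the range becomes finite.

```python
def pareto_posterior(data, shape_k, scale_b):
    """Posterior Pareto parameters (shape, scale) = (N + K, max(max(D), b))."""
    return len(data) + shape_k, max(max(data), scale_b)


def marginal_closed_form(data, shape_k, scale_b):
    """P(D) = K b^K / ((N + K) m^(N + K)) with m = max(max(D), b)."""
    n = len(data)
    m = max(max(data), scale_b)
    return shape_k * scale_b**shape_k / ((n + shape_k) * m**(n + shape_k))


def marginal_numeric(data, shape_k, scale_b, steps=100_000):
    """Midpoint-rule integral of the joint over theta, substituting u = 1/theta."""
    n = len(data)
    m = max(max(data), scale_b)
    h = (1.0 / m) / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * h
        # K b^K theta^-(N+K+1) dtheta becomes K b^K u^(N+K-1) du
        total += shape_k * scale_b**shape_k * u**(n + shape_k - 1)
    return total * h


data = [0.5, 2.0, 1.2]  # hypothetical observations from Unif(0, theta)
print(pareto_posterior(data, shape_k=2, scale_b=1.0))
print(marginal_closed_form(data, 2, 1.0), marginal_numeric(data, 2, 1.0))
```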

**3.11 Bayesian analysis of the exponential distribution**

\(p(x|\theta) = \theta e^{-\theta x}\) for \(x \ge 0, \theta > 0\) defines the exponential distribution.

(a) Derive the MLE

We can write the likelihood function as:

\(L(\mathbf{x};\theta) = \theta^N e^{-\theta \sum_{i=1}^N x_i} \)

Clearly working with the log-likelihood will be better here:

\(l(\mathbf{x}; \theta) = N \log(\theta) - \theta \sum_i x_i \)

Taking the derivative wrt theta and setting it equal to zero:

\(0 = \frac{N}{\theta} - \sum_i x_i \)

and so the MLE is: \( \hat{\theta} = \frac{1}{\frac{1}{N} \sum_i x_i} = \frac{1}{\bar{x}} \)

(b) Suppose we observe X1=5, X2=6, X3=4. What is the MLE?

The mean is 5, so \(\hat{\theta} = 1/5\).

(c) Assume a prior \(p(\theta) = \text{Expon}(\theta | \lambda) \). What value should \(\lambda\) take to give \( \mathbb{E} [\theta] = 1/3\)?

The exponential distribution is just a special case of the Gamma distribution where \(a=1\) and \(b=\lambda\). We derived that the mean of a Gamma distribution is just given by a/b, and so we want:

\( 1/3 = \frac{1}{\hat{\lambda}}\), hence \(\hat{\lambda}=3\).

(d) What is the posterior?

\(P(\theta | D) \propto \theta^N e^{-N \theta \bar{x}} \lambda e^{-\lambda \theta} \propto \theta^N e^{-\theta(N \bar{x} + \lambda)} = Ga(\theta | N+1, N\bar{x}+\lambda)\)

(e) Is exponential prior conjugate to exponential likelihood?

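Kind of: the exponential prior is a \(Ga(\theta | 1, \lambda)\) distribution and the posterior is also a Gamma distribution, so the Gamma family is conjugate to the exponential likelihood. But the posterior is not itself an Exponential distribution.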

(f) What is the posterior mean?

Again, mean of Gamma is a/b so we have \(\frac{N+1}{N \bar{x} + \lambda}\). If \(\lambda = 0\) and as \(N \to \infty\) we recover the MLE.
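Plugging in the numbers from parts (b) and (c) makes the comparison concrete. This is a small sketch using exact fractions; the function names are my own.

```python
from fractions import Fraction


def exponential_mle(data):
    """MLE of the rate: N / sum(x) = 1 / mean(x)."""
    return Fraction(len(data), sum(data))


def posterior_params(data, lam):
    """Gamma posterior (a, b) = (N + 1, sum(x) + lambda) under an Expon(lambda) prior."""
    return len(data) + 1, sum(data) + lam


def posterior_mean(data, lam):
    a, b = posterior_params(data, lam)
    return Fraction(a, b)


data = [5, 6, 4]  # the observations from part (b)
print(exponential_mle(data))        # 1/5
print(posterior_params(data, 3))    # (4, 18)
print(posterior_mean(data, 3))      # 2/9
```

So with \(\lambda = 3\) the posterior mean is \(2/9 \approx 0.22\), pulled slightly from the MLE of \(0.2\) towards the prior mean of \(1/3\).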

**(g) Why do they differ?**

This is a bit of a stupid question – because we calculated them in a different manner. We should expect the posterior mean to be less prone to overfitting though, and just generally be a bit better.

**3.12 MAP estimation for Bernoulli with non-conjugate priors**

In the book we discussed Bayesian inference of a Bernoulli rate parameter when we used a \(\text{Beta}(\theta|\alpha, \beta)\) prior. In this case, the MAP estimate was given by:

\( \theta_{MAP} = \frac{N_1 + \alpha - 1}{N + \alpha + \beta - 2} \)

(a) Now consider the following prior:

\(p(\theta) = 0.5\) if \(\theta = 0.5\), \(p(\theta) = 0.5\) if \(\theta = 0.4\) and \(0\) otherwise. Clearly this is a bit of a weird prior to use, but I guess it’s just for an exercise to make a point so let’s go with it. The question is to derive the MAP estimate as a function of \(N_1\) and \(N\).

We can write the posterior as:

\(P(\theta | D) \propto \theta^{N_1} (1-\theta)^{N-N_1} I( \theta \in \{0.4, 0.5 \}) \)

So the MAP estimate is whichever of the two allowed values gives the larger likelihood, i.e. we compare \(0.5^{N_1} 0.5^{N-N_1} = 0.5^{N}\) with \(0.4^{N_1} 0.6^{N-N_1}\) and pick the corresponding \(\theta\).

With some algebraic manipulations we can show that the MAP is 0.5 if: \(N_1 > \frac{\log(6/5)}{\log(6/4)} N \approx 0.45 N\). Otherwise it is \(0.4\). I was kind of surprised that it wasn’t *exactly* \(0.45N\), but the threshold comes from comparing log-likelihoods, which are not linear in \(\theta\), so there is no reason for it to sit exactly at the midpoint of 0.4 and 0.5.
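A short sketch of the two-point MAP rule, for checking the threshold numerically. Since the prior puts equal mass on both values, it drops out of the comparison; the function name and the test counts are my own assumptions.

```python
import math


def map_two_point(n1, n, candidates=(0.4, 0.5)):
    """Return the candidate theta with the larger log likelihood (equal prior mass)."""
    def log_lik(theta):
        return n1 * math.log(theta) + (n - n1) * math.log(1 - theta)
    return max(candidates, key=log_lik)


threshold = math.log(6 / 5) / math.log(6 / 4)
print(threshold)              # roughly 0.4497, not exactly 0.45
print(map_two_point(4, 10))   # 4 heads is below 0.4497 * 10
print(map_two_point(5, 10))   # 5 heads is above it
```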

(b) Suppose the true theta is 0.41. Which prior leads to the better estimate?

When N is small, we would expect this second approach to work better (as it is choosing only between 0.4 and 0.5), however as N becomes larger eventually the other prior, where \(\theta\) can take any value, will give a better estimate.

**3.13 Posterior predictive for a batch of data using the dirichlet-multinomial model**

Derive an expression for \(P(\tilde{D} | D, \alpha)\), i.e. use the posterior to predict the results for a whole batch of data. Now, the definition of the Dirichlet distribution is:

\(\text{Dir}(\mathbf{\theta} | \mathbf{\alpha}) = \frac{1}{B(\mathbf{\alpha})} \prod_{k=1}^K \theta_k^{\alpha_k-1} I(\mathbf{\theta} \in S_K) \) (the indicator function restricts \(\theta\) to the probability simplex \(S_K\), i.e. the components are non-negative and sum to one.)

where \(B(\mathbf{\alpha}) = \frac{\prod_k \Gamma(\alpha_k)}{\Gamma[\sum_k \alpha_k]}\).

Using a \(\text{Dir}(\mathbf{\theta} | \mathbf{\alpha})\) prior for \(\theta\) we showed in the book that the posterior was given by:

\( P(\mathbf{\theta} | D) = \text{Dir}(\mathbf{\theta} | \alpha_1 + N_1, \dots, \alpha_K + N_K) \)

This means that we can write:

\(P(\tilde{D} | D, \alpha) = \int P(\tilde{D} | \theta) P(\theta | D) d\theta = \int \prod_{k=1}^K \theta_k^{x_k} \text{Dir}(\theta | \alpha_1 + N_1, \dots, \alpha_K + N_K) d\theta\)

where \( \mathbf{x} = \{x_k\} \) is the numbers of each class in the data we are predicting. This is equal to:

\( \int \prod_k \theta_k^{x_k} \frac{1}{B(\mathbf{\alpha + N})} \prod_k \theta_k^{\alpha_k + N_k - 1} I(\theta \in S_K) d\theta \)

\( = \int \frac{B(\mathbf{\alpha + N + X})}{B(\mathbf{\alpha+N})} \text{Dir}(\theta | \alpha_1 + N_1 + x_1, \dots, \alpha_K + N_K + x_K) d\theta = \frac{B(\mathbf{\alpha + N + X})}{B(\mathbf{\alpha+N})} \)

where we just converted to a Dirichlet distribution and introduced the correct normalisation constant. Since it is a probability distribution, it then of course integrates to 1 (the final step).

**3.14 Posterior predictive for the Dirichlet-Multinomial**

(a) Suppose we compute the empirical distribution over letters of the Roman alphabet plus the space character (27 values) from 2000 samples. Suppose we see the letter “e” 260 times. What is \(P(x_{2001} = e | D)\), assuming a Dirichlet prior with alpha_k = 10 for all k?

We showed in the text that the posterior predictive is given by:

\(P(X=j | D) = \frac{\alpha_j + N_j}{\sum_k (\alpha_k + N_k)} = \frac{10 + 260}{270 + 2000} \simeq 0.119 \)

(b) Suppose actually we saw “e” 260 times, “a” 100 times and “p” 87 times. What is \(P(x_{2001}=p, x_{2002} = a | D)\) under the same assumptions?

We basically just derived what we need for this in the previous question. We are looking for the probability of the data vector \(\mathbf{X} = (1,0,\dots, 1, 0, \dots, 0)\), where the non-zero components are at indices 1 (“a”) and 16 (“p”). We showed:

\(P(X | D) = \frac{B(\mathbf{\alpha + N + X})}{B(\mathbf{\alpha+N})} \)

This is equal to:

\( \frac{\prod_k \Gamma(\alpha_k + N_k + x_k) \, \Gamma(\sum_k (\alpha_k + N_k))}{\Gamma(\sum_k (\alpha_k + N_k + x_k)) \prod_k \Gamma(\alpha_k + N_k)} \)

Now in the product terms, everything cancels except the components corresponding to p and a, where we pick up factors of \(\frac{\Gamma(98)}{\Gamma(97)}\) and \(\frac{\Gamma(111)}{\Gamma(110)} \) respectively. Overall, we are left with:

\(P(X|D) = \frac{ \Gamma(111) \Gamma(98) \Gamma(2270)}{\Gamma(110) \Gamma(97) \Gamma(2272)} = \frac{(110!)(97!)(2269!)}{(109!)(96!)(2271!)} = \frac{110 \times 97}{2270 \times 2271} \simeq 0.002 \)
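The same numbers drop out if we just apply the one-step posterior predictive twice, updating the total count in between. A minimal sketch with exact fractions (the helper name `predictive` is my own):

```python
from fractions import Fraction

ALPHA = 10       # Dirichlet pseudo-count per symbol
NUM_SYMBOLS = 27  # 26 letters plus space
N_TOTAL = 2000
counts = {"e": 260, "a": 100, "p": 87}


def predictive(count, n_total, alpha=ALPHA, k=NUM_SYMBOLS):
    """Posterior predictive (alpha_j + N_j) / sum_k (alpha_k + N_k)."""
    return Fraction(alpha + count, k * alpha + n_total)


p_e = predictive(counts["e"], N_TOTAL)
# Chain rule: first "p", then "a" with one extra observation in the total.
p_pa = predictive(counts["p"], N_TOTAL) * predictive(counts["a"], N_TOTAL + 1)

print(float(p_e))   # part (a)
print(float(p_pa))  # part (b)
```

Seeing "p" first does not change the count for "a", only the denominator, which is why the second factor is \(110/2271\).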

**3.17 Marginal likelihood for beta-binomial under uniform prior**

Suppose we toss a coin N times and observe N1 heads. Let \(N_1 \sim Bin(N,\theta)\) and \(\theta \sim Beta(1,1)\). Show that the marginal likelihood is given by: \(P(N_1 | N) = \frac{1}{N+1}\).

We can write:

\(P(N_1 | N) = \int P(N_1 | N, \theta) P(\theta) d\theta \)

But a \(Beta(1,1)\) distribution is just uniform, so we don’t need to take this into account. This means we have:

\(P(N_1 | N) = \int_0^1 \text{Bin}(N_1 | N, \theta) d\theta = \int_0^1 \frac{N!}{N_1! (N-N_1)!} \theta^{N_1} (1-\theta)^{N-N_1} d\theta \)

It helps to re-write the factorials in terms of Gamma functions:

\( P(N_1 | N) = \int_0^1 \frac{\Gamma(N+1)}{\Gamma(N_1+1) \Gamma(N-N_1 + 1)} \theta^{N_1} (1-\theta)^{N-N_1} d\theta \)

Now, by definition:

\(\text{Beta}(\theta | N_1 + 1, N-N_1+1) = \frac{\Gamma(N+2)}{\Gamma(N_1+1) \Gamma(N-N_1+1)} \theta^{N_1} (1-\theta)^{N-N_1}\)

This means that we can say:

\(P(N_1 | N) = \int_0^1 \frac{\Gamma(N+1)}{\Gamma(N+2)} \text{Beta}(\theta | N_1+1, N-N_1+1) d\theta = \frac{N!}{(N+1)!} = \frac{1}{N+1}\)

since a probability distribution must integrate to 1. This result kind of surprised me to be honest – I guess it kind of makes sense although intuitively I would have expected a uniform prior over \(\theta\) to lead to it being most likely to have \(N_1 = N/2\), rather than being completely uniform!
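The uniformity can also be verified without Beta functions at all. This sketch expands \((1-\theta)^{N-N_1}\) with the binomial theorem and integrates term by term using exact fractions; the function name is my own.

```python
from fractions import Fraction
from math import comb


def marginal_uniform_prior(n1, n):
    """Integral over [0, 1] of Bin(n1 | n, theta), computed term by term."""
    m = n - n1
    # integral of theta^(n1 + j) over [0, 1] is 1 / (n1 + j + 1)
    integral = sum(Fraction((-1) ** j * comb(m, j), n1 + j + 1) for j in range(m + 1))
    return comb(n, n1) * integral


print([marginal_uniform_prior(k, 10) for k in range(11)])  # every entry is 1/11
```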

**3.18 Bayes factor for coin tossing**

Suppose we toss a coin \(N=10\) times and observe \(N_1 = 9\) heads. Let the null hypothesis be that the coin is fair, and the alternative be that the coin can have any bias, with \(p(\theta) = \text{Unif}(0,1)\). Derive the Bayes factor in favour of the biased coin hypothesis. What if \(N_1=90\) and \(N=100\)?

I think this just means we need to look at the ratio of the marginal likelihoods under each hypothesis. Under the fair assumption:

\(P(N_1 | \theta = 1/2) = \frac{N!}{N_1!(N-N_1)!} \left(\frac{1}{2}\right)^N\)

Under the biased assumption, we just calculated this in the previous exercise:

\(P(N_1 | \theta \sim \text{Unif}(0,1)) = \frac{1}{N+1} \)

So \(BF = \frac{N_1! (N-N_1)! 2^N}{(N+1)!} \). For \(N_1 = 9\) and \(N=10\) I find that this gives \(BF \simeq 9.31 \). For \(N_1 = 90\), \(N=100\) I find: \(BF \simeq 7.25 \times 10^{14} \) – clearly the coin is amazingly unlikely to be fair in the second case.
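For reference, this is roughly how the two numbers can be computed. It is a minimal sketch using exact integer arithmetic, so the large factorials are not a problem; the function name is my own.

```python
from fractions import Fraction
from math import comb


def bayes_factor(n1, n):
    """Bayes factor for the Unif(0,1) prior on theta against a fair coin."""
    p_fair = Fraction(comb(n, n1), 2**n)   # Bin(n1 | n, 1/2)
    p_biased = Fraction(1, n + 1)          # marginal likelihood from exercise 3.17
    return p_biased / p_fair


print(float(bayes_factor(9, 10)))    # about 9.31
print(float(bayes_factor(90, 100)))  # about 7.25e14
```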

**3.19 Irrelevant features with Naive Bayes**

Let \(x_{iw}=1\) if word w occurs in document i, and be 0 otherwise. Let \(\theta_{cw}\) be the estimated probability that word w occurs in documents of class c. The log-likelihood that document x belongs to class c is:

\( \log(p(\mathbf{x_i}|c,\theta)) = \log \prod_w \theta_{cw}^{x_{iw}}(1-\theta_{cw})^{1-x_{iw}} \)

\( = \sum_w x_{iw} \log(\frac{\theta_{cw}}{1-\theta_{cw}}) + \sum_w \log(1-\theta_{cw}) \)

This can be written more succinctly as \( \log p(\mathbf{x_i} | c, \theta) = \mathbf{\phi(x_i)}^T \mathbf{\beta_c} \), where \( \mathbf{\phi(x_i)} = (\mathbf{x_i},1) \) and:

\(\mathbf{\beta_c} = (\log \frac{\theta_{c,1}}{1-\theta_{c,1}}, \dots, \log \frac{\theta_{c,W}}{1-\theta_{c,W}} , \sum_w \log(1-\theta_{cw}))^T \)

i.e. a linear classifier where the log class-conditional densities are linear functions of the params \(\mathbf{\beta_c}\).

(a) Assuming P(C=1) = P(C=2) = 0.5, write an expression for the log posterior odds ratio in terms of the features and the parameters.

We just use Bayes’ theorem: \( P(C=1 | \mathbf{x_i}) = P(\mathbf{x_i} | C=1) P(C=1) / P(\mathbf{x_i}) \), and likewise for C=2. However, as \(P(C=1) = P(C=2)\), the priors and \(P(\mathbf{x_i})\) cancel in the ratio, such that:

\( \log \frac{P(C=1 | \mathbf{x_i})}{P(C=2 | \mathbf{x_i})} = \mathbf{\phi(x_i)}^T(\mathbf{\beta_1 - \beta_2}) \)

(b) Consider a particular word w. State the conditions on \(\theta_{1,w}\) and \(\theta_{2,w}\) under which the presence or absence of the word will have no effect on the class posterior.

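For this, we want the log posterior odds ratio to not depend on whether the word is present, i.e. on \(x_{iw}\). The only term involving \(x_{iw}\) is \(x_{iw}(\beta_{1,w} - \beta_{2,w})\), so we need \( \beta_{1,w} = \beta_{2,w} \), which is equivalent to \(\theta_{1,w} = \theta_{2,w}\).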

(c) The posterior mean estimate of theta, using a Beta(1,1) prior, is given by:

\( \hat{\theta_{cw}} = \frac{1+ \sum_{i \in c} x_{iw}}{2 + n_c} \)

where the sum is over the nc documents in class c. Consider a word w, and suppose it occurs in every document, regardless of class. Let there be n1 documents of class 1, and n2 of class 2, with n1 not equal to n2. Will this word be ignored by our classifier?

Clearly not, as we are told that \(\hat{\theta_{1,w}} = \frac{1+n_1}{2+n_1}\) and \(\hat{\theta_{2,w}} = \frac{1+n_2}{2+n_2}\), which are not equal as \(n_1 \neq n_2\), and hence the necessary condition derived in (b) does not hold.
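To see the size of the effect, here is a tiny sketch computing the weight \(\beta_{1,w} - \beta_{2,w}\) for a word that appears in every document. The class sizes (10 and 100 documents) are hypothetical numbers I picked for illustration.

```python
import math


def posterior_mean_theta(n_docs_with_word, n_docs):
    """Posterior mean of theta under a Beta(1,1) prior: (1 + count) / (2 + n_c)."""
    return (1 + n_docs_with_word) / (2 + n_docs)


def logit(theta):
    return math.log(theta / (1 - theta))


n1, n2 = 10, 100  # hypothetical class sizes; the word occurs in every document
theta1 = posterior_mean_theta(n1, n1)
theta2 = posterior_mean_theta(n2, n2)
weight = logit(theta1) - logit(theta2)  # beta_{1,w} - beta_{2,w}

print(theta1, theta2, weight)  # weight is non-zero, so the word is not ignored
```

The weight is negative here, so the presence of this uninformative word nudges the classifier towards the larger class.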

(d) What other ways can you think of to encourage “irrelevant” words to be ignored?

I guess things like preprocessing and feature selection would be good for this.

**3.21 Mutual information for Naive Bayes with binary features**

The result was stated in the chapter, here we are asked to derive it. We are looking for the mutual information between feature j and the class label Y, i.e. \(I(X_j, Y)\). By definition, this is equal to:

\(I(X_j;Y) = \sum_{x_j \in \{0,1\}} \sum_{y \in C} P(x_j, y) \log \left( \frac{P(x_j, y)}{p(x_j) p(y)} \right) \)

To get the joint values, we can say \(P(x_j = 1, y=c) = P(x_j=1 | y=c) P(y=c) = \theta_{jc} \pi_c \)

and then:

\( P(x_j=0, y=c) = P(x_j=0 | y=c) P(y=c) = (1-\theta_{jc}) \pi_c \).

By definition, \(P(y=c) = \pi_c\), so we can say:

\(P(x_j=1) = \sum_{c'} P(x_j=1, y=c') = \sum_{c'} \theta_{jc'} \pi_{c'} \equiv \theta_j\)

\(P(x_j=0) = \sum_{c'} P(x_j=0, y=c') = \sum_{c'} (1-\theta_{jc'}) \pi_{c'} = 1-\theta_j \)

Putting this together we get the desired result:

\( I(X_j, Y) = \sum_c \left[ \theta_{jc} \pi_c \log \left( \frac{\theta_{jc}}{\theta_j} \right) + (1-\theta_{jc})\pi_c \log \left( \frac{1-\theta_{jc}}{1-\theta_j} \right) \right] \)
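The final formula translates directly into code. This is a sketch under my own naming, using natural logarithms (so the result is in nats) and the usual convention that \(0 \log 0 = 0\); the example parameter values are made up.

```python
import math


def mutual_information(theta_jc, pi):
    """I(X_j; Y) for a binary feature, given theta_jc[c] = P(x_j=1 | y=c) and pi[c] = P(y=c)."""
    theta_j = sum(t * p for t, p in zip(theta_jc, pi))  # marginal P(x_j = 1)
    mi = 0.0
    for t, p in zip(theta_jc, pi):
        if t > 0:  # skip terms of the form 0 * log(0)
            mi += t * p * math.log(t / theta_j)
        if t < 1:
            mi += (1 - t) * p * math.log((1 - t) / (1 - theta_j))
    return mi


print(mutual_information([0.3, 0.3], [0.5, 0.5]))  # feature independent of the class
print(mutual_information([1.0, 0.0], [0.5, 0.5]))  # feature determines the class
print(mutual_information([0.8, 0.2], [0.5, 0.5]))  # somewhere in between
```

A feature with the same \(\theta_{jc}\) in every class carries no information, while one that perfectly determines the class of two equally likely classes carries \(\log 2\) nats.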