Machine Learning – A Probabilistic Perspective Exercises – Chapter 4

This is a continuation of the exercises in “Machine learning – a probabilistic perspective” by Kevin Murphy. Chapter 4 is on “Gaussian Models”. Let’s get started!

4.1 Uncorrelated does not imply independent

Let \( X \sim U(-1,1) \) and \(Y = X^2\). Clearly Y is dependent on X, show \(\rho(X,Y)=0\).

\(\rho(X,Y)\) is just a normalised version of the covariance, so we just need to show the covariance is zero, i.e.:

\(\text{Cov}(X,Y) = \mathbb{E}[XY] – \mathbb{E}[X]\mathbb{E}[Y]\)

Clearly \( \mathbb{E}[X] = 0\) and so we just need to calculate \(\mathbb{E}[XY]\) and show this is zero. We can write:

\( \mathbb{E}[XY] = \int_{-1}^1 dx \int_0^1 dy \ xy p(x,y) \)

Then we say \(p(x,y) = p(y|x) p(x)\), but \(p(y|x) = \delta(y – x^2)\), i.e. a dirac-delta function, and \(p(x)=1/2\), i.e. just a constant. This means we can evaluate the integral over y to get:

\( \mathbb{E}[XY] = 1/2 \int_{-1}^1 x^3\)

This is the integral of an odd function and so is clearly equal to zero.


4.2 Uncorrelated and Gaussian does not imply independent, unless jointly Gaussian

Let \(X \sim \mathcal{N}(0,1)\) and \(Y=WX\), where W takes values \( \pm 1\) with equal probability. Clearly X and Y are not independent, as Y is a function of X.

(a) Show\(Y \sim \mathcal{N}(0,1)\)

This is kind of obvious from symmetry because \(\mathcal{N}(0,1)\) is symmetric, i.e. \( \mathcal{N}(x|0,1) = \mathcal{N}(-x|0,1) \). This means we can write:

\(P(Y=y) = P(W=1)P(X=y) + P(W=-1)P(X=-y) = P(X=y) = \mathcal{N}(0,1) \)

(b) Show covariance between X and Y is zero

We know that \(\mathbb{E}[X] = \mathbb{E}[Y] = 0\), so we just need to evaluate \( \mathbb{E}[XY]\):

\( \mathbb{E}[XY] = \int \int dx \ dy \ xy p(x,y)\)

But again \(p(x,y) = p(y|x)p(x)\), and we can write \(p(y|x) = 0.5 \delta(y-x) + 0.5 \delta(y+x)\). This means we are left with:

\(\mathbb{E}[XY] = \int_{-\infty}^{\infty} x \mathcal{N}(x|0,1)(0.5(x-x)) dx = 0\)

which proves the result.


4.3 Prove \(-1 \le \rho(X,Y) \le 1\)

Let us start with the definitions:

\(\rho(X,Y) = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X) \text{Var}(Y)}}\)

\(\text{Cov}(X,Y) = \mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])]\)

\(\text{Var}(X) = \mathbb{E}[(X-\mathbb{E}[X])^2] \)

Let us write \(\mu_X = \mathbb{E}[X]\) and \(\mu_Y = \mathbb{E}[Y]\), for notational convenience. If now for any constants a and b we consider:

\( \mathbb{E}[(a(X-\mu_X) + b(Y-\mu_Y))^2] \)

which is clearly greater than or equal to zero. Multiplying out, this inequality gives:

\(a^2 \mathbb{E}[(X-\mu_X)^2] + b^2 \mathbb{E}[(Y-\mu_Y)^2] + 2ab \mathbb{E}[(X-\mu_X)(Y-\mu_Y)] \ge 0 \)

Which we can re-write as:

\(2ab \text{Cov}(X,Y) \ge -a^2 \text{Var}(X) – b^2 \text{Var}(Y)\)

Now let us substitute in \(a^2 = \text{Var}(Y)\) and \(b^2 = \text{Var}(X)\):

\(2 \sqrt{\text{Var}(X) \text{Var}(Y)} \text{Cov}(X,Y) \ge -2 \text{Var}(X) \text{Var}(Y) \)

\( \implies \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X) \text{Var}(Y)}} = \rho(X,Y) \ge -1\)

If we do the same thing, but instead now consider \(\mathbb{E}[(a(X-\mu_X) – b(Y-\mu_Y))^2]\), with the same definitions of a and b, it’s easy to show that \( \rho(X,Y) \le 1\) as well.


4.4 Correlation coefficient for linearly related variables

If \(Y=aX + b\), then if \( a > 0 \) show that \( \rho(X,Y)=1\), and if \(a < 0\) that \( \rho(X,Y) = -1\).

Let’s say \(\mathbb{E}[X] = \mu_X\) and \(\text{Var}(X) = \sigma_X^2\). It follows that:

\( \mathbb{E}[Y] = a \mu_X + b\) and \( \text{Var}(Y) = a^2 \sigma_X^2\).

Now, to evaluate the correlation we need \(\mathbb{E}[XY] = \mathbb{E}[aX^2 + bX] = a \mathbb{E}[X^2] + b \mu_X\)

This means that the covariance is:

\( \text{Cov}(X,Y) = a \mathbb{E}[X^2] + b \mu_X – \mu_X(a \mu_X + b) = a \sigma_X^2\)

This allows us to get the correlation:

\( \rho(X,Y) = \frac{ \text{Cov}(X,Y)}{ \sqrt{\sigma_X^2 \sigma_Y^2}} = \frac{a \sigma_X^2}{\sqrt{a^2 \sigma_X^4}} = \frac{a \sigma_X^2}{|a| \sigma_X^2} = sgn(a)\)

Which is all we were asked to show!


4.5 Normalization constant for MV Gaussian

Prove that: \( (2 \pi)^{d/2} | \mathbf{\Sigma}|^{1/2} = \int \exp(-\frac{1}{2} (\mathbf{x-\mu}^T \mathbf{\Sigma}^{-1} (\mathbf{x-\mu})) d \mathbf{x} \)

We are told to diagonalize the covariance matrix, which can always be done since it is symmetric. That is, we can write:

\(D = P^{-1} \Sigma P\)

Where D is a diagonal matrix where the entries are the eigenvalues of \(\Sigma\) and the columns of P are the eigenvectors. In fact, since \(\Sigma\) is symmetric the eigenvectors can form an orthogonal basis it is possible to make P an orthogonal matrix, such that \(P^{-1} = P^T\). This allows us to say:

\(D^{-1} = P^T \Sigma^{-1} P \implies \Sigma^{-1} = P D^{-1} P^T\)

As such, we can write the integral as:

\( \int \exp(-\frac{1}{2}(x-\mu)^T P D^{-1} P^T(x-\mu)) dx = \int \exp(-\frac{1}{2} (P(x-\mu))^T \begin{bmatrix} \frac{1}{\lambda_1} & & \\ & \ddots & \\ & & \frac{1}{\lambda_d} \end{bmatrix} (P(x-\mu))) dx \)

Now let us define \(y = P(x-\mu)\). Because P is an orthogonal matrix (which has determinant 1), the Jacobian is 1 and we can replace \(dx\) with \(dy\). The term inside the exponential is then:

\( \sum_{ij} y_i \delta_{ij} \frac{1}{\lambda_i} y_j = \sum_i \frac{y_i^2}{\lambda_i}\). Effectively by transforming to the eigenbasis we have decoupled the components of y, so we can write:

\( = \int_{-\infty}^{\infty} dy_1 e^{-\frac{y_1^2}{2 \lambda_1}} \dots \int_{-\infty}^{\infty} dy_d e^{-\frac{y_d^2}{2 \lambda_d}}\)

i.e. just the product of many one-dimensional Gaussians. This is equal to:

\( \sqrt{2 \pi \lambda_1} \sqrt{2 \pi \lambda_2} \dots \sqrt{2 \pi \lambda_d} = (2 \pi)^{d/2} \sqrt{\lambda_1 \dots \lambda_d}\)

We then use that \(det(\Sigma) = \prod_{i=1}^d \lambda_i\), which gives us the final answer we want!


4.7 Conditioning a bivariate Gaussian

Take \(\Sigma = \sigma_1 \sigma_2 \begin{bmatrix} \frac{\sigma_1}{\sigma_2} & \rho \\ \rho & \frac{\sigma_2}{\sigma_1} \end{bmatrix} \)

(a) What is \(P(X_2 | x_1) \)?

Here we just need to apply the general result which was derived in the chapter – which is that the conditional distribution is also Normal with mean and covariance:

\( \mu_{1 | 2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1}(x_2-\mu_2) \)

\( \Sigma_{1 | 2} = \Sigma_{11} – \Sigma_{21} \Sigma_{22}^{-1} \Sigma_{21} \)

So in this bivariate case we simply have:

\( \mu_{2 | 1} = \mu_2 + \frac{\rho \sigma_1 \sigma_2}{\sigma_1^2}(x_1-\mu_1) \)

\( \sigma_{2 | 1}^2 = \sigma_2^2(1-\rho^2) \)

(b) What if \(\sigma_1 = \sigma_2 = 1\)?

Then we just have:

\(P(x_2 | x_1) = \mathcal{N}(x_2 | \mu_2 + \rho(x_1 – \mu_1), 1-\rho^2) \)

4.8 Whitening vs. Standardizing

(a) Firstly we load in the data it mentions (height vs weight) and calculate the empirical mean and covariance, and plot:

data = dlmread("heightWeightData.txt");
data = data(:,2:3);

dataMean = mean(data);
centredData = data - dataMean;
N = length(data);
covariance = (1/N)*(centredData')*centredData;

%%part a
hold on
scatter(data(:,1), data(:,2));
gaussPlot2d(dataMean, covariance);

(b) Standardizing – subtract the mean and divide by the standard deviation

%%part b
empStd = sqrt([covariance(1,1), covariance(2,2)]);
standardData = centredData ./ empStd;
sMean = mean(standardData);
sCovariance = (1/N)*(standardData')*standardData;

hold on
scatter(standardData(:,1), standardData(:,2));
gaussPlot2d(sMean, sCovariance);

(c) This section of the question had a typo in it which confused me for a while – it says U is the matrix of eigenvectors of X but this should be the eigenvectors of \( \Sigma\). I decided to go into a bit more detail to understand what the whitening transformation actually does – the idea is it transforms the data in such a way that that the covariance matrix is the identity matrix.

If we take X to be the \(k \times N\) data matrix (in this case k=2), we are looking for a transformation W such that the covariance of \(Y = W X\) is the \(k \times k\) identity matrix. Now, the (empirical) covariance matrix of the data is given by: \( \Sigma = \frac{1}{N} X X^T \). If we perform an eigendecomposition of this we can write:

\( \Sigma = U \Lambda U^T\)

Where U is a matrix with columns made up of the eigenvectors of \(\Sigma\) and \(\Lambda\) is a diagonal matrix where the entries are the eigenvalues of \(\Sigma\). We can do this because \(\Sigma\) is real and symmetric, and we can also rewrite this as: \(\Lambda = U^T \Sigma U\). Now, let’s consider the empirical covariance of Y:

\( \Sigma_Y = \frac{1}{N} Y Y^T = \frac{1}{N} W X X^T W^T = W \Sigma W^T\)

So we see that if we wanted \(\Sigma_Y\) to be diagonal and equal to \(\Lambda\), we could choose \(W = U^T\). If we want it to be the identity matrix, we can achieve this by going one step further and saying \( W = \Lambda^{-1/2} U^T\). This provides the motivation for the formula given in the question – in code:

%%part c
[V, D] = eig(covariance);
W = (D^(-0.5))*(V');
Y = W*(centredData');
yMean = mean(Y,2);
yCovariance = (1/N)*Y*(Y');
hold on;
scatter(Y(1,:), Y(2,:));
gaussPlot2d(yMean, yCovariance);


4.11 Derivation of the NIW posterior

We are asked to show that the posterior for the MV normal parameters are given by:

\(NIW(\mu, \Sigma | m_N, \kappa_N, \nu_n, S_N) \)

When we use an NIW (normal inverse-Wishart) prior:

\(NIW(\mu, \Sigma | m_0, \kappa_0, \nu_0, S_0)\)

and \(m_N = \frac{ \kappa_0 m_0 + N \bar{x}}{\kappa_N}\), \(\kappa_N = \kappa_0 + N\), \(\nu_N = \nu_0 + N\), \(S_N = S_0 + S_{\bar{x}} + \frac{\kappa_0 N }{\kappa_0 + N} (\bar{x}-m_0)(\bar{x}-m_0)^T\)

We are given the following hint: \( N(\bar{x}-\mu)(\bar{x}-\mu)^T + \kappa_0 (\mu-m_0)(\mu-m_0)^T = \kappa_N (\mu – m_N)(\mu-m_N)^T + \frac{\kappa_0 N}{\kappa_N}(\bar{x}-m_0)(\bar{x}-m_0)^T\)

Now the actual form of the prior is:

\( NIW(\mu, \Sigma | m_0, \kappa_0, \nu_0, S_0) = \frac{1}{Z_{NIW}} |\Sigma|^{-1/2} \exp(-\frac{\kappa_0}{2} (\mu-m_0)^T \Sigma^{-1} (\mu-m_0)) |\Sigma|^{-\frac{\nu_0 + D + 1}{2}} \exp(-\frac{1}{2} tr(\Sigma^{-1} S_0)) \)

and we can write the likelihood as:

\( P(D | \mu, \Sigma) = (2\pi)^{-ND/2} |\Sigma|^{-N/2} \exp(-1/2 \sum_{i=1}^N (x_i – \mu)^T \Sigma^{-1} (x_i – \mu)) \)

during the chapter it was stated that this could be rewritten as:

\( P(D | \mu, \Sigma) = (2\pi)^{-ND/2} |\Sigma|^{-N/2} \exp( -\frac{N}{2}(\mu – \bar{x})^T \Sigma^{-1} (\mu – \bar{x})) \exp( -\frac{1}{2} tr(\Sigma^{-1} S_{\bar{x}})) \)

although this was not proven. Since it’s crucial for getting this result, I thought I would fill in the details. If we apply the “trace trick” we can say:

\( \sum_i (x_i – \mu)^T \Sigma^{-1} (x_i-\mu) = tr(\Sigma^{-1} \sum_i (x_i – \mu) (x_i – \mu)^T = tr(\Sigma^{-1} \left[ (\sum_i x_i x_i^T) – N \mu \bar{x}^T – N \bar{x} \mu^T + N \mu \mu^T \right]) \)

Then if we consider:

\(tr(\Sigma^{-1} S_{\bar{x}}) = tr(\Sigma^{-1} \sum_i (x_i-\bar{x})(x_i – \bar{x})^T ) = tr(\Sigma^{-1} ((\sum_i x_i x_i^T) – N \bar{x} \bar{x}^T)) \)

and then:

\(N(\bar{x}-\mu)^T \Sigma^{-1} (\bar{x}-\mu) = tr(\Sigma^{-1} N (\bar{x} \bar{x}^T – \mu \bar{x}^T – \bar{x} \mu^T + \mu \mu^T))

Writing them out explicitly like this it should be easy to see that:

\( \sum_i (x_i-\mu)^T \Sigma^{-1} (x_i – \mu) = tr(\Sigma^{-1} S_{\bar{x}}) + N(\bar{x}-\mu)^T \Sigma^{-1} (\bar{x}-\mu) \)

from which we can easily get the second expression for the likelihood. Now the posterior is proportional to the likelihood multiplied by the prior:

\(P(\mu, \Sigma | D) \propto |\Sigma|^{-1/2} |\Sigma|^{-\frac{\nu_0 + D + N + 1}{2}} \exp(-\frac{1}{2} (\kappa_0 (\mu-m_0)^T \Sigma^{-1} (\mu-m_0)  + N (\mu-\bar{x})^T \Sigma^{-1} (\mu-\bar{x})))  \exp(-\frac{1}{2} tr(\Sigma^{-1}(S_0 + S_{\bar{x}}))) \)

Now if we substitute the result in the hint we can rewrite the first exponential term as:

\(\exp( -\frac{1}{2} (\kappa_N (\mu – m_N)^T \Sigma^{-1} (\mu-m_N) + tr(\Sigma^{-1} \frac{\kappa_0 N}{\kappa_N} (\bar{x}-m_0)(\bar{x}-m_0)^T)) \)

From this point we just have to collect terms and look at the definitions and it should be clear that the stated result is true!


4.12 BIC for Gaussians

The BIC or “Bayesian Information Criterion” is a concept actually introduced in the next chapter for model selection, and represents an approximation to the marginal likelihood given the model. Read ahead to chapter 5 for more info! The definition is:

\( BIC = \log(P(D | \hat{\theta_{ML}})) – \frac{d}{2} \log(N) \)

where here d is the number of parameters in the model. We are asked to derive this in the case of a multivariate Gaussian model. We start by writing the log-likelihood (using the trace trick):

\( \log P(D | \hat{\Sigma}, \hat{\mu}) -\frac{N}{2} tr( \hat{\Sigma} S) – \frac{N}{2} \log( |\Sigma|) \)

where S is the scatter matrix: \( S = \frac{1}{N} \sum_{i=1}^N (x_i – \hat{\mu}) (x_i – \hat{\mu})^T \)

But of course this is exactly equal to the empirical covariance, which is equal to the maximum likeilhood, and so the first term is simply \( -ND/2)\) where D is the number of dimensions.

(a) Derive BIC when we use a full covariance matrix.

In this case, there are D parameters in the mean and D(D+1)/2 in the covariance matrix (which has to be symmetric), so a total of D(D+3)/2. This means the BIC is simply:

\( BIC = -\frac{ND}{2} – \frac{N}{2} \log( |\hat{\Sigma}|) – \frac{D(D+3)}{4} \log(N) \)

(b) How about when we have a diagonal covariance matrix

The hint seems to suggest there is more to this, but as far as I can tell the only difference is the number of parameters (which is now just 2D). I don’t see how having a diagonal covariance simplifies things any further, but I may be missing something! Anyway, this gives the BIC as:

\( BIC = \frac{ND}{2} – \frac{N}{2} \log( |\hat{\Sigma}|) – D \log(N) \)


4.13 Gaussian posterior credible intervals

Let \( X \sim \mathcal{N}(\mu, \sigma^2 = 4)\) where \(\mu\) is unknown but has prior \( \mu \sim \mathcal{N}(\mu_0, \sigma_0^2 = 9) \). The posterior after seeing n samples is \( \mu \sim \mathcal{N}(\mu_n, \sigma_n^2) \). How big does n have to be to ensure \( p(l \le \mu_n \le u | D) \ge 0.95\), where \( (l,u)\) is an interval of width 1 centred on \(\mu_n\). Hint: recall 95% of the probability mass is within \( \pm 1.96 \sigma\) of the mean.

Essentially what we need to do is to calculate n such that \( \sigma_n = 0.5 / 1.96\). To do this, we just need to use the result derived in the chapter for the posterior distribution of a Gaussian mean, given that \( \Sigma\) is given:

\( P(\mu | D, \Sigma) = \mathcal{N}(\mu | m_n, V_n) \)

where \( V_n^{-1} = V_0^{-1} + n \Sigma^{-1} \)

In our case this means:

\( \frac{1}{\sigma_n^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2} = \frac{\sigma^2 + n \sigma_0^2}{\sigma_0^2 \ \sigma^2} \)

Rearranging this we get:

\( n = \frac{\sigma_2}{\sigma_n^2} – \frac{\sigma_2}{\sigma_0^2} \)

So putting in the numbers I find that we need \( n \simeq 61\).


4.14 MAP estimation for 1D Gaussians

Consider samples \( x_1 \dots x_n \) from a Gaussian RV with known variance \( \sigma^2\) and unknown mean \( \mu\). We assume a prior over the mean \( \mu \sim \mathcal{N}(m, s^2) \), where m and s are fixed.

(a) Calculate the MAP estimate \( \hat{\mu}_{MAP}\)

Rather than actually differentiating and setting equal to zero, we know the most probable value of a Gaussian is equal to its mean, and so we know that \( \hat{\mu}_{MAP}\) is simply the posterior mean. We know from results derived in the main chapter than \( m_N = V_N (\Sigma^{-1}(N \bar{x}) + V_0^{-1} m_0) \). We just calculated \(V_N\) (or really \(\sigma_N\) as we are in 1D) in the previous question, so we find that:

\(m_N = \hat{\mu}_{MAP} = \frac{s^2 \sigma^2}{\sigma^2 + N s^2} ( \frac{N \bar{X}}{\sigma^2} + \frac{m}{s^2}) = \frac{s^2 N \bar{x} + \sigma^2 m}{\sigma^2 + N s^2} \)

(b) Show that as n increases this converges to the MLE.

as \(N \to \infty\), \(\hat{\mu}_{MAP} \to \frac{s^2 N \bar{x}}{N s^2} = \bar{x} \), which we know is the MLE.

(c) increase prior variance s^2

\(\hat{\mu}_{MAP} \to \frac{s^2 N \bar{x}}{N s^2} = \bar{x} \)

so it converges to the MLE again.

(d) decrease s^2 (s -> 0)

clearly \( \hat{\mu}_{MAP} \to m\), i.e. the prior mean.

All of these results make intuitive sense.


4.15 Sequential updating of \( \hat{\Sigma}\)

The unbiased estimate for the covariance of a d-dimensional Gaussian based on n samples is given by:

\( \hat{\Sigma} = C_n = \frac{1}{n-1} \sum_{i=1}^n (x_i – m_n) (x_i – m_m)^T \)

This takes \(O(nd^2)\) to compute.

(a) Show that the covariance can be updated sequentially as follows:

\( C_{n+1} = \frac{n-1}{n} C_n + \frac{1}{n+1} (x_{n+1} – m_n)(x_{n+1}-m_n)^T\)

To start with let’s write:

\(C_{n+1} = \frac{1}{n} \sum_{i=1}^{n+1} (x_i – m_{n+1})(x_i – m_{n+1})^T\)

Next, let us note now that we can update the estimate of the mean sequentially as follows:

\( m_{n+1} = \frac{1}{n+1} \sum_{i=1}^{n+1} (n \ m_n + x_{n+1}) \)

However, for now we’ll leave it as \(m_{n+1}\). It’s easy to show that we can expand out the previous two equations for \(C_n\) and \(C_{n+1}\) in the following way:

\( n C_{n+1} = (\sum_{i=1}^{n+1} x_i x_i^T) – (n+1) m_{n+1} m_{n+1}^T \)

\((n-1) C_n = (\sum_{i=1}^n x_i x_i^T) – n \ m_n m_n^T \)

If we now expand out \(m_{n+1}m_{n+1}^T\):

\( m_{n+1} m_{n+1}^T = \frac{1}{(n+1)^2} ( n^2 m_n m_n^T + n m_n \_{n+1}^T + n x_{n+1} m_n^T + x_{n+1} x_{n+1}^T) \)

then subtract \( n C_{n+1} – (n-1) C_n\), and substitute in the above expression we find:

\( n C_{n+1} – (n-1) C_n = x_{n+1} x_{n+1}^T + n m_n m_n^T – \frac{1}{n+1}(n^2 m_n m_n^T + n m_n \_{n+1}^T + n x_{n+1} m_n^T + x_{n+1} x_{n+1}^T) \)

Simplifying and rearranging this we get the result which is stated.

(b) How much time does it take per sequential update?

The most expensive part here is the outer product, which is \(O(d^2)\).

(c) Show that we can sequentially update the precision matrix

We just showed that:

\(C_{n+1} = \frac{n-1}{n} C_n + \frac{1}{n+1} u u^T \)

where \(u = x_{n+1}-m_n\). We just need to use the matrix inversion lemma, given as a hint in the question:

\( (E + u v^T)^{-1} = E^{-1} – \frac{ E^{-1} u v^T E^{-1}}{1 + v^T E^{-1} u}\)

so using \( E = \frac{n-1}{n} C_n\) and \(u = v = \frac{1}{\sqrt{n+1}}(x_{n+1}-m_n) \) we get:

\( C_{n+1}^{-1} = \frac{n}{n-1} C_n^{-1} – \frac{ \frac{n}{n-1} C_n^{-1} \frac{1}{n+1} u u^T \frac{n}{n-1} C_n^{-1}}{1 + \frac{1}{n+1} u^T \frac{n}{n-1} C_n^{-1}  u} \)

Some simple rearranging gives the result given in the textbook:

\(C_{n+1}^{-1} = \frac{n}{n-1} \left[ C_n^{-1} – \frac{ C_n^{-1}(x_{n+1}-m_n)(x_{n+1}-m_n)^T C_n^{-1}}{\frac{n^2-1}{n} + (x_{n+1} – m_n)^T C_n^{-1} (x_{n+1}-m_n)} \right] \)

(d) What is the time complexity per update?

There are two key things to note here – one is that we don’t need to compute \(C_n^{-1}\) as we already have this in memory (or at least we should do). The second is that we can split the matrix multiplications up:

\( A_1 = C_n^{-1} (x_{n+1}-m_n) \)

this is a \( d \times d\) matrix multiplied by a \(d \times 1\) column vector, so this is \(O(d^2)\). Then we can do:

\( (x_{n+1}-m_n)^T C_n^{-1}\)

which is a \( 1 \times d\) row vector times a \( d \times d\) matrix, which is again \(O(d^2)\). Combining the resulting vectors together as an outer product is again \(O(d^2)\). We can make the same argument about the denominator where all the matrix multiplications are again \(O(d^2)\), and so this is the overall update complexity!


4.16 Likelihood ratio for Gaussians

Consider a binary classifier where the K class conditional densities are MVN \(p(x|y=j) = \mathcal{N}(x | \mu_j, \Sigma_j)\). Derive an expression for the log-likelihood ratio \( \log \frac{p(x | y=1)}{p(x | y=0)} \) for:

(a) arbitrary covariance \( \Sigma_j\), (b) shared covariance \(\Sigma_j = \Sigma\), (c) shared, axis-aligned covariance \( \Sigma_j = \Sigma\) with \(\Sigma_{ij}=0\) for \(i \ne j\) and (d) shared spherical covariance \(\Sigma_j = \sigma^2 I\).

(a) full form:

\( \frac{1}{2}((x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1) – (x-\mu_0)^T \Sigma_0^{-1} (x-\mu_0) \)

(b) we can simplify slightly:

\( -\frac{1}{2} tr(\Sigma^{-1} \left( (x-\mu_1)(x-\mu_1)^T – (x-\mu_0)(x-\mu_0)^T \right) \)

(c) We can multiply the matrices out here to simplify it a bit further:

\( -\frac{1}{2} \sum_{i=1}^K \frac{1}{\sigma_i^2} \left[ (x_i – \mu_{1i})^2 – (x_i – \mu_{0i})^2 \right] \)

(d) The only further simplification I can see if to factor out the variance:

\( -\frac{1}{2 \sigma^2} \sum_{i=1}^K \frac{1}{\sigma_i^2} \left[ (x_i – \mu_{1i})^2 – (x_i – \mu_{0i})^2 \right] \)


4.17 LDA/QDA on height/weight data

If we run the code (which you can get from the book’s github repository) we can see how the classifier works:

The following is my MATLAB code to calculate the misclassification rates for the LDA and the QDA:

rawdata = dlmread("heightWeightData.txt");
data.Y = rawdata(:,1); % 1=male, 2=female
data.X = [rawdata(:,2) rawdata(:,3)]; % height, weight
maleNdx = find(data.Y == 1);
femaleNdx = find(data.Y == 2);
classNdx = {maleNdx, femaleNdx};

%untied - i.e. class-specific covariances
X_male = data.X(classNdx{1},:);
X_female = data.X(classNdx{2},:);

mu{1} = mean(X_male); mu{2} = mean(X_female);
Sigma{1} = cov(X_male); Sigma{2} = cov(X_female);

prob_c1 = gaussProb(data.X, mu{1}, Sigma{1});
prob_c2 = gaussProb(data.X, mu{2}, Sigma{2});

predictedClasses = (prob_c2 > prob_c1) + 1;
misclassificationQDA = 100*(sum(predictedClasses ~= data.Y) / length(data.X));

%shared covariance, i.e. LDA
mu{1} = mean(X_male); mu{2} = mean(X_female);
Sigma{1} = cov(data.X); Sigma{2} = Sigma{1};

prob_c1 = gaussProb(data.X, mu{1}, Sigma{1});
prob_c2 = gaussProb(data.X, mu{2}, Sigma{2});

predictedClasses = (prob_c2 > prob_c1) + 1;
misclassificationLDA = 100*(sum(predictedClasses ~= data.Y) / length(data.X));

I find that for the LDA the misclassification rate is 12.4%, and for the QDA it’s reduced slightly at 11.9%.


4.18 Naive bayes with mixed features

Consider a 3 class naive Bayes classifier with one binary feature and one Gaussian feature.:

\( y \sim Mu(y | \pi, 1)\), \( x_1 | y = c \sim Ber(x_1 | \theta_c)\), \( x_2 | y=c \sim \mathcal{N}(x_2 | \mu_c, \sigma_c^2) \)

Let the parameter vectors be the following:

\(\pi = (0.5, 0.25, 0.25), \theta = (0.5,0.5,0.5), \mu=(-1,0,1), \sigma^2 = (1,1,1) \)

(a) Compute \(p(y|x_1=0, x_2=0)\)

\(p(y | x) \propto p(x | y) p(y) \), so:

\(p(y=c | x) \propto Ber(0, \theta_c) \mathcal{N}(0 | \mu_c, \sigma_c^2) \pi_c \propto (1-\theta_c) \exp( – \frac{\mu_c^2}{2 \sigma_c^2}) \pi_c \)

So we just have to put these numbers in for each class, and then normalize afterwards. Doing this I find: \( [0.4302, 0.3547, 0.2151] \).

(b) Compute \(p(y | x_1 = 0) \)

\(p(y | x_1 = 0) \propto p(x_1 = 0 | y) p(y) = (1-\theta_c) \pi_c \)

So we just put the numbers in again and normalise: \([0.5, 0.25, 0.25]\)

(c) Compute \(p(y | x_2 = 0)\)

\( p(y | x_2 = 0) \propto p(x_2 = 0 | y) p(y) = \mathcal{N}(0 | \mu_c, \sigma_c^2) \pi_c \)

Put numbers in and normalise: \([0.4302, 0.3547, 0.2151]\).

(d) Explain any interesting patterns you see in your results

I guess it’s interesting that the answers to (a) and (c) are identical – we see this arises because all of the \(\theta\) values are equal. This is also the reason why the answer to (b) is equal to the prior.


4.19 Decision boundary for LDA with semi-tied covariance

Consider a generative classifier with class-conditional densities of the form \(\mathcal{N}(x | \mu_c, \Sigma_c ) \). We consider a two class case in which \( \Sigma_1 = k \Sigma_0 \), with \( k > 1\). Derive an expression for \(p(y=1 | x, \theta)\), simplifying as much as possible. Give a geometric interpretation of the result.

\(p(y=c | x) \propto \pi_c \mathcal{N}(x| \mu_c, \sigma_c^2) \)

so we can say:

\( p(y=1 | x) = \frac{ \pi_c |k \Sigma_0 |^{-1/2} \exp( -\frac{k}{2}(x-\mu_1)^T \Sigma_0^{-1} (x-\mu_1))}{ \pi_1 |k \Sigma_0|^{-1/2} \exp( -\frac{k}{2}(x-\mu_1)^T \Sigma_0^{-1} (x-\mu_1)) + \pi_0 |\Sigma_0|^{-1/2} \exp( -\frac{1}{2}(x-\mu_0)^T \Sigma_0^{-1} (x-\mu_0))} \)

Using \( |k \Sigma_0|^{-1/2} = k^{-d/2} |\Sigma_0| \), we can simplify this a bit:

\( = \frac{1}{1 + \frac{\pi_0}{\pi_1} k^{d/2} \exp( -\frac{1}{2} \left[ (x-\mu_0)^T \Sigma_0^{-1} (x-\mu_0) – k(x-\mu_1)^T \Sigma_0^{-1}(x-\mu_1) \right]} \)

We can go a bit further by saying the term in the exponential is:

\( -\frac{1}{2} tr( \Sigma_0^{-1}((x-\mu_0)(x-\mu_0)^T – k(x-\mu_1)(x-\mu_1)^T) \)

and then expanding out the outer products:

\( x x^T – \mu_0 x^T – x \mu_0^T + \mu_0 \mu_0^T – k x x^T + k \mu_1 x^T + k x \mu_1^T – k \mu_1 \mu_1^T \)

we can then complete the square:

\( = (1-k) (x- \frac{(\mu_0 – k \mu_1)}{1-k})(x-\frac{\mu_0-k \mu_1}{1-k})^T + C \)

where C is a constant involving only k, \(\mu_1\) and \(\mu_0\) which I can’t be bothered to calculate. This means the x-dependence in the exponential is:

\( \exp(-\frac{(1-k)}{2} (x-\frac{\mu_0-k\mu_1}{1-k})^T \Sigma_0^{-1} (x-\frac{\mu_0 – k \mu_1}{1-k})) \)

But I don’t really see any way of simplifying this further, and I’m not really sure how to interpret it geometrically. If anyone has an answer to this I’d be interested to hear about it in the comments!


4.21 Gaussian decision boundaries

Let \(p(x|y=j) = \mathcal{N}(x | \mu_j, \sigma_j^2) \) where j=1,2 and \(\mu_1 = 0, \sigma_1^2 = 1, \mu_2 = 1, \sigma_2^2 = 10^6\). Let the class priors be equal.

(a) Find the decision region: \(R_1 = \{ x : p(x|\mu_1, \sigma_1) \ge p(x_2 | \mu_2, \sigma_2) \} \)

To start with let’s just visualize this by plotting the probability of x given each class:

We now solve for the two points of intersection here – I did this in Mathematica:

So we see that the region R1 is for \( -3.72 < x < 3.72\). In the case where \(\sigma_2=1\) as well, we find only one point of intersection (which makes sense) – and for R1 is any \( x < 0.5\)


4.22 QDA with three classes

Consider a three category classification problem. Let the prior probabilities be uniform, and have the class-conditional densities be MVNs with parameters:

\( \mu_1 = [0,0]^T, \mu_2 = [1,1]^T, \mu_3 = [-1,1]^T \)

\( \Sigma_1 = \begin{bmatrix} 0.7 & 0 \\ 0 & 0.7 \end{bmatrix} , \Sigma_2 = \begin{bmatrix} 0.8 & 0.2 \\ 0.2 & 0.8 \end{bmatrix}, \Sigma_3 = \begin{bmatrix} 0.8 & 0.2 \\ 0.2 & 0.8 \end{bmatrix} \)

Classify the following points:

(a) \( x = [-0.5, 0.5]\), (b) \( x = [0.5, 0.5] \)


4.23 Scalar QDA

Consider the following training set of heights and the corresponding labels of gender: \( x = (67, 79, 71, 68, 67, 60), y=(m,m,m,f,f,f) \).

(a) Fit a Bayes classifier to this data using MLE.

Clearly the class priors are uniform, i.e. \( \pi_m = \pi_f = 0.5\). Then the MLE estimates for the means are just the empirical values:

\( \mu_m = \frac{67 + 79 + 71}{3} = 72.33 \)

\(mu_f = \frac{68 + 67 + 60}{3} = 65 \)

And likewise for the estimates of the variances:

\( \sigma_m^2 = \frac{1}{3} ((67 – \mu_m)^2 + (79-\mu_m)^2 + (71-\mu_m)^2) = 24.89 \)

\( \sigma_f^2 = \frac{1}{3}((68-\mu_f)^2 + (67-\mu_f)^2 + (60-\mu_f)^2) = 12.67 \)

(b) Compute \(p(y=m | x, \hat{\theta}) \) where \(x=72\).

\( p(y=m | x=72) = \frac{\frac{1}{2} \mathcal{N}(72 | \mu_m, \sigma_m^2)}{ \frac{1}{2} \mathcal{N}(72 | \mu_m, \sigma_m^2) + \frac{1}{2} \mathcal{N}(72 | \mu_f, \sigma_f^2) } \)

Putting in the numbers I find this gives \( \simeq 0.83 \), which seems about right just looking at the data by eye.

Author: Henry Charlesworth

I am currently a PhD student at the University of Warwick, based in the Centre for Complexity Science. My research has been looking at how the principle of "future state maximisation" can be applied to both explaining behaviour in animal systems as well as generating behaviour in artificial systems. Loosely speaking, this is the idea that taking everything else to be equal it is preferable to keep your options open as much as possible. I am also very interested in artificial intelligence and machine learning, particularly reinforcement learning.

Leave a Reply

Your email address will not be published. Required fields are marked *