

Exact predictive density

For Gaussian regression the predictive density under training data $D$ and prior $D_0$ can be found analytically without resorting to a saddle point approximation. The predictive density is defined as the $h$-integral

$\displaystyle p(y\vert x,D,D_0)$ $\textstyle =$ $\displaystyle \int\!d{h} \, p(y\vert x,{h}) p({h}\vert D,D_0)$  
  $\textstyle =$ $\displaystyle \frac{\int\!d {h} \, p(y\vert x,{h}) p(y_D\vert x_D,{h}) p({h}\vert D_0)}
{\int\!d {h} \, p(y_D\vert x_D,{h}) p({h}\vert D_0)}$  
  $\textstyle =$ $\displaystyle \frac{p(y,y_D\vert x,x_D,D_0)}
{p(y_D\vert x_D,D_0)}
.$ (307)

Denoting the training data values $y_i$ by $t_{i}$, sampled with inverse covariance ${\bf K}_{i}$ concentrated on $x_i$, and analogously the test data value $y$ = $y_{n+1}$ by $t_{n+1}$, sampled with inverse (co-)variance ${\bf K}_{n+1}$, we have for $1\le i\le n+1$
\begin{displaymath}
p(y_i\vert x_i,{h}) =
\det ({\bf K}_i/2\pi )^{\frac{1}{2}}
e^{-\frac{1}{2}
\Big( {h}-t_{i},\, {\bf K}_{i} ({h}-t_{i}) \Big) }
,
\end{displaymath} (308)

and
\begin{displaymath}
p({h}\vert D_0) =
\det ({\bf K_0}/2\pi)^{\frac{1}{2}}
e^{-\frac{1}{2}
\Big( {h}-t_0,\, {\bf K}_0 ({h}-t_0) \Big) }
,
\end{displaymath} (309)

hence
\begin{displaymath}
p(y\vert x,D,D_0) =
\frac{\int\!d{h} \, e^{-\frac{1}{2} \sum_{i=0}^{n+1}
\Big( {h}-t_{i},\, {\bf K}_{i} ({h}-t_{i}) \Big)
+\frac{1}{2} \sum_{i=0}^{n+1} \ln \det_i ({\bf K}_i/2\pi)}}
{\int\!d{h} \, e^{-\frac{1}{2} \sum_{i=0}^{n}
\Big( {h}-t_{i},\, {\bf K}_{i} ({h}-t_{i}) \Big)
+\frac{1}{2} \sum_{i=0}^{n} \ln \det_i ({\bf K}_i/2\pi)}}
.
\end{displaymath} (310)

Here we have written explicitly $\det_i({\bf K}_i/2\pi)$ for the determinant calculated in the space where ${\bf K}_i$ is invertible. This distinction is useful because, for example, in general $\det_i {\bf K}_i \det {\bf K}_0 \ne \det_i ({\bf K}_i {\bf K}_0)$. Using the generalized `bias-variance'-decomposition (230) yields
\begin{displaymath}
p(y\vert x,D,D_0) =
\frac{\int\!d{h} \, e^{-\frac{1}{2}
\Big( {h}-t_{+},\, {\bf K}_{+} ({h}-t_{+}) \Big)
-\frac{n+2}{2} V_+
+\frac{1}{2} \sum_{i=0}^{n+1} \ln \det_i ({\bf K}_i/2\pi)}}
{\int\!d{h} \, e^{-\frac{1}{2}
\Big( {h}-t,\, {\bf K} ({h}-t) \Big)
-\frac{n+1}{2} V
+\frac{1}{2} \sum_{i=0}^{n} \ln \det_i ({\bf K}_i/2\pi)}}
,
\end{displaymath} (311)

with
$\displaystyle t$ $\textstyle =$ $\displaystyle {\bf K}^{-1} \sum_{i=0}^{n} {\bf K}_i t_i
,\qquad {\bf K} = \sum_{i=0}^{n} {\bf K}_i,$ (312)
$\displaystyle t_+$ $\textstyle =$ $\displaystyle {\bf K}_+^{-1} \sum_{i=0}^{n+1} {\bf K}_i t_i
,\qquad {\bf K}_+ = \sum_{i=0}^{n+1} {\bf K}_i,$ (313)
$\displaystyle V$ $\textstyle =$ $\displaystyle \frac{1}{n+1} \sum_{i=0}^{n}
\Big( t_i, \, {\bf K}_i t_i \Big)
-\Big( t, \, \frac{{\bf K}}{n+1} \,t \Big),$ (314)
$\displaystyle V_+$ $\textstyle =$ $\displaystyle \frac{1}{n+2} \sum_{i=0}^{n+1}
\Big( t_i,\, {\bf K}_i t_i \Big)
-\Big( t_+,\, \frac{{\bf K}_+}{n+2}\, t_+ \Big)
.$ (315)

Now the ${h}$-integration can be performed
\begin{displaymath}
p(y\vert x,D,D_0) =
\frac{e^{-\frac{n+2}{2} V_+
+\frac{1}{2} \sum_{i=0}^{n+1} \ln \det_i ({\bf K}_i/2\pi)
-\frac{1}{2} \ln \det ({\bf K}_+/2\pi)}}
{e^{-\frac{n+1}{2} V
+\frac{1}{2} \sum_{i=0}^{n} \ln \det_i ({\bf K}_i/2\pi)
-\frac{1}{2} \ln \det ({\bf K}/2\pi)}}
.
\end{displaymath} (316)

Canceling common factors, writing again $y$ for $t_{n+1}$, ${\bf K}_x$ for ${\bf K}_{n+1}$, $\det_x$ for $\det_{n+1}$, and using ${\bf K}_+ t_+$ = ${\bf K}t + {\bf K}_{x}y$, this becomes
\begin{displaymath}
p(y\vert x,D,D_0) =
e^{-\frac{1}{2}
(y,{\bf K}_y \,y) + (y,\, {\bf K}_{x}{\bf K}_+^{-1} {\bf K}\, t)
-\frac{1}{2} (t,\, {\bf K}{\bf K}_+^{-1} {\bf K}_{x}\, t)
+\frac{1}{2} \ln\det_{x}({\bf K}_{x}{\bf K}_+^{-1} {\bf K}/2\pi)}
.
\end{displaymath} (317)

Here we introduced ${\bf K}_y$ = ${\bf K}_y^T$ = ${\bf K}_{x}-{\bf K}_{x}{\bf K}_+^{-1} {\bf K}_{x}$ and used that
\begin{displaymath}
\det {\bf K}^{-1}{\bf K}_+
=\det ( {\bf I}+{\bf K}^{-1}{\bf K}_{x})
=\det\!{}_{x} {\bf K}^{-1}{\bf K}_+
\end{displaymath} (318)

can be calculated in the space of test data $x$. This follows from ${\bf K}$ = ${\bf K}_+ - {\bf K}_{x}$ and the equality
\begin{displaymath}
\det\left(\begin{array}{cc}1+A&0\\ B&1\end{array}\right) = \det(1+A)
\end{displaymath} (319)

with $A$ = ${\bf I}_x {\bf K}^{-1} {\bf K}_{x}$, $B$ = $({\bf I} - {\bf I}_x) {\bf K}^{-1} {\bf K}_{x}$, and ${\bf I}_{x}$ denoting the projector into the space of test data $x$. Finally
\begin{displaymath}
{\bf K}_y
={\bf K}_{x}-{\bf K}_{x}{\bf K}_+^{-1} {\bf K}_{x}
={\bf K}_{x}{\bf K}_+^{-1} {\bf K}
= ({\bf K}-{\bf K}{\bf K}_+^{-1}{\bf K})
,
\end{displaymath} (320)

yields the correct normalization of the predictive density
\begin{displaymath}
p(y\vert x,D,D_0) =
e^{-\frac{1}{2} \Big( y - \bar y,\, {\bf K}_y\, (y - \bar y)\Big)
+\frac{1}{2} \ln \det\!{}_{x} ({\bf K}_y/2\pi)
}
,
\end{displaymath} (321)

with mean and covariance
$\displaystyle \bar y$ $\textstyle =$ $\displaystyle t
= {\bf K}^{-1} \sum_{i=0}^n {\bf K}_i t_i,$ (322)
$\displaystyle {\bf K}_y^{-1}$ $\textstyle =$ $\displaystyle \left({\bf K}_{x}
-{\bf K}_{x}{\bf K}_+^{-1} {\bf K}_{x}\right)^{-1}
= {\bf K}_{x}^{-1} + {\bf I}_{x}{\bf K}^{-1}{\bf I}_{x}.$ (323)

It is useful to express the posterior covariance ${\bf K}^{-1}$ in terms of the prior covariance ${\bf K}_0^{-1}$. According to
\begin{displaymath}
\left(\begin{array}{cc}1+A&B\\ 0&1\end{array}\right)^{-1}
=
\left(\begin{array}{cc}(1+A)^{-1}&-(1+A)^{-1}B\\ 0&1\end{array}\right),
\end{displaymath} (324)

with $A$ = ${\bf K}_D {\bf K}_{0,DD}^{-1}$, $B$ = ${\bf K}_D {\bf K}_{0,D\bar D}^{-1}$, and ${\bf K}_{0,DD}^{-1}$ = ${\bf I}_{D}{\bf K}_0^{-1}{\bf I}_{D}$, ${\bf K}_{0,D\bar D}^{-1}$ = ${\bf I}_{D}{\bf K}_0^{-1}{\bf I}_{\bar D}$, ${\bf I}_{\bar D}$ = ${\bf I}-{\bf I}_{D}$ we find
$\displaystyle {\bf K}^{-1}$ $\textstyle =$ $\displaystyle {\bf K}_0^{-1} \left( {\bf I}+{\bf K}_D {\bf K}_0^{-1}\right)^{-1}$ (325)
  $\textstyle =$ $\displaystyle {\bf K}_0^{-1} \! \left(
\left( {\bf I}_D+{\bf K}_D {\bf K}_{0,DD}^{-1}\right)^{-1}
-\left( {\bf I}_D+{\bf K}_D {\bf K}_{0,DD}^{-1}\right)^{-1}
{\bf K}_D {\bf K}_{0,D\bar D}^{-1} + {\bf I}_{\bar D}\right).$

Notice that, while ${\bf K}_{D}^{-1}$ = $({\bf I}_{D}{\bf K}_D {\bf I}_{D})^{-1}$, in general ${\bf K}_{0,DD}^{-1}$ = ${\bf I}_{D}{\bf K}_0^{-1}{\bf I}_{D}
\ne ({\bf I}_{D}{\bf K}_0{\bf I}_{D})^{-1}$. This means, for example, that ${\bf K}_{0}^{-1}$ has to be known to find ${\bf K}_{0,DD}^{-1}$; it is not enough to invert ${\bf I}_{D}{\bf K}_0{\bf I}_{D}$ = ${\bf K}_{0,DD}\ne ({\bf K}_{0,DD}^{-1})^{-1}$. In data space $\left( {\bf I}_D+{\bf K}_D {\bf K}_{0,DD}^{-1}\right)^{-1}$ = $\left( {\bf K}_D^{-1}+{\bf K}_{0,DD}^{-1}\right)^{-1}{\bf K}_D^{-1}$, so Eq. (325) can be manipulated to give
\begin{displaymath}
{\bf K}^{-1}
=
{\bf K}_0^{-1} \left( {\bf I}- {\bf I}_D
\left({\bf K}_D^{-1}+{\bf K}_{0,DD}^{-1}\right)^{-1}
{\bf I}_D{\bf K}_0^{-1}
\right)
.
\end{displaymath} (326)
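
Equation (326) is what makes the computation practical, since only a matrix of the size of the data space has to be inverted. The sketch below (illustrative numbers; equal data variances and a random positive definite prior are assumptions of the example, not of the text) verifies it against a direct inversion of ${\bf K}={\bf K}_0+{\bf K}_D$, with all $D$-block inverses taken inside the data space as emphasized above.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
m = 6
D = [1, 3, 4]                                   # grid indices carrying training data
sigma2 = 0.2**2                                 # data variance, K_D = sigma^{-2} on D

A = rng.standard_normal((m, m))
K0 = A @ A.T + m * np.eye(m)                    # prior inverse covariance (SPD)
K0_inv = np.linalg.inv(K0)

KD = np.zeros((m, m))
KD[D, D] = 1.0 / sigma2
K = K0 + KD                                     # posterior inverse covariance

# Right-hand side of Eq. (326); D-block inverses are taken inside the data space
KD_inv_DD = sigma2 * np.eye(len(D))             # (I_D K_D I_D)^{-1} on D
K0_inv_DD = K0_inv[np.ix_(D, D)]                # I_D K_0^{-1} I_D on D
M = np.linalg.inv(KD_inv_DD + K0_inv_DD)        # (K_D^{-1} + K_{0,DD}^{-1})^{-1}

embed = np.zeros((m, m))
embed[np.ix_(D, D)] = M                         # I_D (...)^{-1} I_D in the full space
rhs = K0_inv @ (np.eye(m) - embed @ K0_inv)

print(np.allclose(np.linalg.inv(K), rhs))       # -> True
\end{verbatim}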

Equation (326) now allows the predictive mean (322) and covariance (323) to be expressed in terms of the prior covariance
$\displaystyle \bar y$ $\textstyle =$ $\displaystyle t_0 + {{\bf K}}_0^{-1}
\left({\bf K}_D^{-1} + {{\bf K}}_{0,DD}^{-1} \right)^{-1} (t_D - t_0)
,$ (327)
$\displaystyle {\bf K}_y^{-1}$ $\textstyle =$ $\displaystyle {\bf K}_{x}^{-1}+
{\bf K}_{0,xx}^{-1}
-{\bf K}_{0,xD}^{-1}
\left({\bf K}_D^{-1}+{\bf K}_{0,DD}^{-1}\right)^{-1}
{\bf K}_{0,Dx}^{-1}
.$ (328)

Thus, for a given prior covariance ${\bf K}_0^{-1}$, both $\bar y$ and ${\bf K}_y^{-1}$ can be calculated by inverting the $\tilde n\times\tilde n$ matrix $\widetilde {\bf K}$ = $\left({\bf K}_{0,DD}^{-1} + {\bf K}_D^{-1}\right)^{-1}$.
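
As a sketch of this procedure (again with illustrative numbers; a single test point and equal data variances are assumptions of the example), the predictive mean and covariance can be computed once from the prior covariance via Eqs. (327,328) and cross-checked against Eqs. (322,323).
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
m = 7
D = [1, 2, 5]                                   # training locations (grid indices)
x = 4                                           # test location
sigma2 = 0.25**2                                # training-data variance
Kx = 1.0 / 0.5**2                               # test-data inverse variance

A = rng.standard_normal((m, m))
K0 = A @ A.T + m * np.eye(m)                    # prior inverse covariance
K0_inv = np.linalg.inv(K0)
t0 = rng.standard_normal(m)                     # prior template
tD = rng.standard_normal(len(D))                # observed values on D

# The n~ x n~ matrix that has to be inverted
Ktilde = np.linalg.inv(sigma2 * np.eye(len(D)) + K0_inv[np.ix_(D, D)])

# Eq. (327): y_bar = t_0 + K_0^{-1} Ktilde (t_D - t_0)
ybar = t0 + K0_inv[:, D] @ Ktilde @ (tD - t0[D])

# Eq. (328), a scalar for a single test point
Ky_inv = 1.0 / Kx + K0_inv[x, x] - K0_inv[x, D] @ Ktilde @ K0_inv[D, x]

# Cross-check against Eqs. (322) and (323) built from K = K_0 + K_D directly
KD = np.zeros((m, m))
KD[D, D] = 1.0 / sigma2
K = K0 + KD
t_direct = np.linalg.solve(K, K0 @ t0 + KD[:, D] @ tD)
print(np.allclose(ybar, t_direct))                              # -> True
print(np.isclose(Ky_inv, 1.0 / Kx + np.linalg.inv(K)[x, x]))    # -> True
\end{verbatim}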

Comparison of Eqs. (327,328) with the maximum posterior solution ${h}^*$ of Eq. (277) now shows that for Gaussian regression the exact predictive density $p(y\vert x,D,D_0)$ and its maximum posterior approximation $p(y\vert x,{h}^*)$ have the same mean

\begin{displaymath}
t = \int \!dy\, y\, p(y\vert x,D,D_0) = \int \!dy\, y\, p(y\vert x,{h}^*)
.
\end{displaymath} (329)

The variances, however, differ by the term ${\bf I}_x {\bf K}^{-1}{\bf I}_x$.
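
This difference can also be seen in a brute-force evaluation of the defining $h$-integral (307). The sketch below (a two-point grid with one training and one test location; all numbers are illustrative) marginalizes the Gaussian posterior numerically and recovers the mean $t$ of Eq. (322) and the variance ${\bf K}_x^{-1}+{\bf I}_x{\bf K}^{-1}{\bf I}_x$ of Eq. (323), while the maximum posterior approximation $p(y\vert x,{h}^*)$ has variance ${\bf K}_x^{-1}$ only.
\begin{verbatim}
import numpy as np

# Two-point grid: h = (h(x_1), h(x_2)); training datum at x_1, test point at x_2
K0 = np.array([[2.0, -0.8], [-0.8, 1.5]])       # prior inverse covariance
t0 = np.array([0.3, -0.1])                      # prior template
K1, t1 = 1.0 / 0.3**2, 1.2                      # training inverse variance and value at x_1
Kx = 1.0 / 0.5**2                               # test inverse variance at x_2

K = K0 + np.diag([K1, 0.0])                     # Eq. (312)
t = np.linalg.solve(K, K0 @ t0 + np.array([K1 * t1, 0.0]))
var_exact = 1.0 / Kx + np.linalg.inv(K)[1, 1]   # Eq. (323) at the test point
var_map = 1.0 / Kx                              # variance of p(y|x,h*)

# Numerical h-integration of p(y|x,h) p(h|D,D_0), Eq. (307)
hs = np.linspace(-5.0, 5.0, 400)
H1, H2 = np.meshgrid(hs, hs, indexing="ij")
post = np.exp(-0.5 * (K[0, 0] * (H1 - t[0])**2
                      + 2.0 * K[0, 1] * (H1 - t[0]) * (H2 - t[1])
                      + K[1, 1] * (H2 - t[1])**2))
post /= post.sum()                               # normalized posterior on the grid

ys = np.linspace(-5.0, 5.0, 400)
dy = ys[1] - ys[0]
pred = np.array([(np.exp(-0.5 * Kx * (y - H2)**2) * post).sum() for y in ys])
pred /= pred.sum() * dy                          # normalized predictive density

mean_num = (ys * pred).sum() * dy
var_num = ((ys - mean_num)**2 * pred).sum() * dy
print(np.isclose(mean_num, t[1], atol=1e-2))     # -> True: same mean as h* = t
print(np.isclose(var_num, var_exact, atol=1e-2)) # -> True: exact variance
print(var_num - var_map)                         # approx. (K^{-1})_{22}, the extra term
\end{verbatim}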

According to the results of Section 2.2.2 the mean of the predictive density is the optimal choice under squared-error loss (51). For Gaussian regression, therefore, the optimal regression function $a^*(x)$ is the same under squared-error loss in the exact and in the maximum posterior treatment, and thus also under log-loss (for Gaussian $p(y\vert x,a)$ with fixed variance)

\begin{displaymath}
a^*_{\rm MAP,log}
=a^*_{\rm exact,log}
= a^*_{\rm MAP,sq.}
= a^*_{\rm exact,sq.}
={h}^* = t.
\end{displaymath} (330)

If the space of possible $p(y\vert x,a)$ is not restricted to Gaussian densities with fixed variance, the variance of the optimal density under log-loss, $p(y\vert x,a^*_{\rm exact,log})$ = $p(y\vert x,D,D_0)$, differs by ${\bf I}_x {\bf K}^{-1}{\bf I}_x$ from its maximum posterior approximation $p(y\vert x,a^*_{\rm MAP,log})$ = $p(y\vert x,{h}^*)$.

