

Exact predictive density

For Gaussian regression the predictive density under training data $D$ and prior $D_0$ can be found analytically without resorting to a saddle point approximation. The predictive density is defined as the $h$-integral

$\displaystyle p(y\vert x,D,D_0)$ $\textstyle =$ $\displaystyle \int\!d{h} \, p(y\vert x,{h}) p({h}\vert D,D_0)$  
  $\textstyle =$ $\displaystyle \frac{\int\!d {h} \, p(y\vert x,{h}) p(y_D\vert x_D,{h}) p({h}\vert D_0)}
{\int\!d {h} \, p(y_D\vert x_D,{h}) p({h}\vert D_0)}$  
  $\textstyle =$ $\displaystyle \frac{p(y,y_D\vert x,x_D,D_0)}
{p(y_D\vert x_D,D_0)}
.$ (307)

Denoting the training data values $y_i$ by $t_{i}$, sampled with inverse covariance ${\bf K}_{i}$ concentrated on $x_i$, and analogously the test data value $y$ = $y_{n+1}$ by $t_{n+1}$, sampled with inverse (co-)variance ${\bf K}_{n+1}$, we have for $1\le i\le n+1$
\begin{displaymath}
p(y_i\vert x_i,{h}) =
\det ({\bf K}_i/2\pi )^{\frac{1}{2}}
e^{-\frac{1}{2}
\Big( {h}-t_{i},\, {\bf K}_{i} ({h}-t_{i}) \Big) }
,
\end{displaymath} (308)

and
\begin{displaymath}
p({h}\vert D_0) =
\det ({\bf K_0}/2\pi)^{\frac{1}{2}}
e^{-\frac{1}{2}
\Big( {h}-t_0,\, {\bf K}_0 ({h}-t_0) \Big) }
,
\end{displaymath} (309)

hence
\begin{displaymath}
p(y\vert x,D,D_0) =
\frac{\int\!d{h} \, e^{-\frac{1}{2} \sum_{i=0}^{n+1}
\Big( {h}-t_{i},\, {\bf K}_{i} ({h}-t_{i}) \Big)
+\frac{1}{2} \sum_{i=0}^{n+1} \ln \det_i ({\bf K}_i/2\pi)}}
{\int\!d{h} \, e^{-\frac{1}{2} \sum_{i=0}^{n}
\Big( {h}-t_{i},\, {\bf K}_{i} ({h}-t_{i}) \Big)
+\frac{1}{2} \sum_{i=0}^{n} \ln \det_i ({\bf K}_i/2\pi)}}
.
\end{displaymath} (310)

Here we have written explicitly $\det_i({\bf K}_i/2\pi)$ for the determinant calculated in the space where ${\bf K}_i$ is invertible. This distinction is useful because, for example, in general $\det_i {\bf K}_i \det {\bf K}_0 \ne \det_i ({\bf K}_i {\bf K}_0)$. Using the generalized `bias-variance'-decomposition (230) yields
\begin{displaymath}
p(y\vert x,D,D_0) =
\frac{\int\!d{h} \, e^{-\frac{1}{2}
\Big( {h}-t_{+},\, {\bf K}_{+} ({h}-t_{+}) \Big)
-\frac{n+2}{2} V_+
+\frac{1}{2} \sum_{i=0}^{n+1} \ln \det_i ({\bf K}_i/2\pi)}}
{\int\!d{h} \, e^{-\frac{1}{2}
\Big( {h}-t,\, {\bf K} ({h}-t) \Big)
-\frac{n+1}{2} V
+\frac{1}{2} \sum_{i=0}^{n} \ln \det_i ({\bf K}_i/2\pi)}}
,
\end{displaymath} (311)

with
$\displaystyle t$ $\textstyle =$ $\displaystyle {\bf K}^{-1} \sum_{i=0}^{n} {\bf K}_i t_i
,\qquad {\bf K} = \sum_{i=0}^{n} {\bf K}_i,$ (312)
$\displaystyle t_+$ $\textstyle =$ $\displaystyle {\bf K}_+^{-1} \sum_{i=0}^{n+1} {\bf K}_i t_i
,\qquad {\bf K}_+ = \sum_{i=0}^{n+1} {\bf K}_i,$ (313)
$\displaystyle V$ $\textstyle =$ $\displaystyle \frac{1}{n+1} \sum_{i=0}^{n}
\Big( t_i, \, {\bf K}_i t_i \Big)
-\Big( t, \, \frac{{\bf K}}{n+1} \,t \Big),$ (314)
$\displaystyle V_+$ $\textstyle =$ $\displaystyle \frac{1}{n+2} \sum_{i=0}^{n+1}
\Big( t_i,\, {\bf K}_i t_i \Big)
-\Big( t_+,\, \frac{{\bf K}_+}{n+2}\, t_+ \Big)
.$ (315)

Now the ${h}$-integration can be performed
\begin{displaymath}
p(y\vert x,D,D_0) =
\frac{e^{-\frac{n+2}{2} V_+
+\frac{1}{2} \sum_{i=0}^{n+1} \ln \det_i ({\bf K}_i/2\pi)
-\frac{1}{2} \ln \det ({\bf K}_+/2\pi)}}
{e^{-\frac{n+1}{2} V
+\frac{1}{2} \sum_{i=0}^{n} \ln \det_i ({\bf K}_i/2\pi)
-\frac{1}{2} \ln \det ({\bf K}/2\pi)}}
.
\end{displaymath} (316)

Canceling common factors, writing again $y$ for $t_{n+1}$, ${\bf K}_x$ for ${\bf K}_{n+1}$, $\det_x$ for $\det_{n+1}$, and using ${\bf K}_+ t_+$ = ${\bf K}t + {\bf K}_{x}y$, this becomes
\begin{displaymath}
p(y\vert x,D,D_0) =
e^{-\frac{1}{2}
(y,{\bf K}_y \,y) + (y,\, {\bf K}_{x}{\bf K}_+^{-1} {\bf K}\, t)
-\frac{1}{2} (t,\, {\bf K}{\bf K}_+^{-1} {\bf K}_{x}\, t)
+\frac{1}{2} \ln\det_{x}({\bf K}_{x}{\bf K}_+^{-1} {\bf K}/2\pi)}
.
\end{displaymath} (317)

Here we introduced ${\bf K}_y$ = ${\bf K}_y^T$ = ${\bf K}_{x}-{\bf K}_{x}{\bf K}_+^{-1} {\bf K}_{x}$ and used that
\begin{displaymath}
\det {\bf K}^{-1}{\bf K}_+
=\det ( {\bf I}+{\bf K}^{-1}{\bf K}_{x})
=\det\!{}_{x} {\bf K}^{-1}{\bf K}_+
\end{displaymath} (318)

can be calculated in the space of test data $x$. This follows from ${\bf K}$ = ${\bf K}_+ - {\bf K}_{x}$ and the equality
\begin{displaymath}
\det\left(\begin{array}{cc}1+A&0\\ B&1\end{array}\right) = \det(1+A)
\end{displaymath} (319)

with $A$ = ${\bf I}_x {\bf K}^{-1} {\bf K}_{x}$, $B$ = $({\bf I} - {\bf I}_x) {\bf K}^{-1} {\bf K}_{x}$, and ${\bf I}_{x}$ denoting the projector into the space of test data $x$. Finally
\begin{displaymath}
{\bf K}_y
={\bf K}_{x}-{\bf K}_{x}{\bf K}_+^{-1} {\bf K}_{x}
={\bf K}_{x}{\bf K}_+^{-1} {\bf K}
= ({\bf K}-{\bf K}{\bf K}_+^{-1}{\bf K})
,
\end{displaymath} (320)

yields the correct normalization of the predictive density
\begin{displaymath}
p(y\vert x,D,D_0) =
e^{-\frac{1}{2} \Big( y - \bar y,\, {\bf K}_y\, (y - \bar y)\Big)
+\frac{1}{2} \ln \det\!{}_{x} ({\bf K}_y/2\pi)
}
,
\end{displaymath} (321)

with mean and covariance
$\displaystyle \bar y$ $\textstyle =$ $\displaystyle t
= {\bf K}^{-1} \sum_{i=0}^n {\bf K}_i t_i,$ (322)
$\displaystyle {\bf K}_y^{-1}$ $\textstyle =$ $\displaystyle \left({\bf K}_{x}
-{\bf K}_{x}{\bf K}_+^{-1} {\bf K}_{x}\right)^{-1}
= {\bf K}_{x}^{-1} + {\bf I}_{x}{\bf K}^{-1}{\bf I}_{x}.$ (323)

It is useful to express the posterior covariance ${\bf K}^{-1}$ in terms of the prior covariance ${\bf K}_0^{-1}$. According to
\begin{displaymath}
\left(\begin{array}{cc}1+A&B\\ 0&1\end{array}\right)^{-1}
=
\left(\begin{array}{cc}(1+A)^{-1}&-(1+A)^{-1}B\\ 0&1\end{array}\right),
\end{displaymath} (324)

with $A$ = ${\bf K}_D {\bf K}_{0,DD}^{-1}$, $B$ = ${\bf K}_D {\bf K}_{0,D\bar D}^{-1}$, and ${\bf K}_{0,DD}^{-1}$ = ${\bf I}_{D}{\bf K}_0^{-1}{\bf I}_{D}$, ${\bf K}_{0,D\bar D}^{-1}$ = ${\bf I}_{D}{\bf K}_0^{-1}{\bf I}_{\bar D}$, ${\bf I}_{\bar D}$ = ${\bf I}-{\bf I}_{D}$ we find
$\displaystyle {\bf K}^{-1}$ $\textstyle =$ $\displaystyle {\bf K}_0^{-1} \left( {\bf I}+{\bf K}_D {\bf K}_0^{-1}\right)^{-1}$ (325)
  $\textstyle =$ $\displaystyle {\bf K}_0^{-1} \! \left(
\left( {\bf I}_D+{\bf K}_D {\bf K}_{0,DD}^{-1}\right)^{-1}
-\left( {\bf I}_D+{\bf K}_D {\bf K}_{0,DD}^{-1}\right)^{-1}
{\bf K}_D {\bf K}_{0,D\bar D}^{-1} + {\bf I}_{\bar D}\right).$

Notice that, while ${\bf K}_{D}^{-1}$ = $({\bf I}_{D}{\bf K}_D {\bf I}_{D})^{-1}$, in general ${\bf K}_{0,DD}^{-1}$ = ${\bf I}_{D}{\bf K}_0^{-1}{\bf I}_{D}
\ne ({\bf I}_{D}{\bf K}_0{\bf I}_{D})^{-1}$. This means, for example, that ${\bf K}_{0}^{-1}$ has to be known to find ${\bf K}_{0,DD}^{-1}$; it is not enough to invert ${\bf I}_{D}{\bf K}_0{\bf I}_{D}$ = ${\bf K}_{0,DD}\ne ({\bf K}_{0,DD}^{-1})^{-1}$. In data space $\left( {\bf I}_D+{\bf K}_D {\bf K}_{0,DD}^{-1}\right)^{-1}$ = $\left( {\bf K}_D^{-1}+{\bf K}_{0,DD}^{-1}\right)^{-1}{\bf K}_D^{-1}$, so Eq. (325) can be manipulated to give
\begin{displaymath}
{\bf K}^{-1}
=
{\bf K}_0^{-1} \left( {\bf I}- {\bf I}_D
\left({\bf K}_D^{-1}+{\bf K}_{0,DD}^{-1}\right)^{-1}
{\bf I}_D{\bf K}_0^{-1}
\right)
.
\end{displaymath} (326)
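
Equation (326) is what makes the computation practical, since only a matrix of the size of the data space has to be inverted. The sketch below (illustrative numbers; equal data variances and a random positive definite prior are assumptions of the example, not of the text) verifies it against a direct inversion of ${\bf K}={\bf K}_0+{\bf K}_D$, with all $D$-block inverses taken inside the data space as emphasized above.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
m = 6
D = [1, 3, 4]                                   # grid indices carrying training data
sigma2 = 0.2**2                                 # data variance, K_D = sigma^{-2} on D

A = rng.standard_normal((m, m))
K0 = A @ A.T + m * np.eye(m)                    # prior inverse covariance (SPD)
K0_inv = np.linalg.inv(K0)

KD = np.zeros((m, m))
KD[D, D] = 1.0 / sigma2
K = K0 + KD                                     # posterior inverse covariance

# Right-hand side of Eq. (326); D-block inverses are taken inside the data space
KD_inv_DD = sigma2 * np.eye(len(D))             # (I_D K_D I_D)^{-1} on D
K0_inv_DD = K0_inv[np.ix_(D, D)]                # I_D K_0^{-1} I_D on D
M = np.linalg.inv(KD_inv_DD + K0_inv_DD)        # (K_D^{-1} + K_{0,DD}^{-1})^{-1}

embed = np.zeros((m, m))
embed[np.ix_(D, D)] = M                         # I_D (...)^{-1} I_D in the full space
rhs = K0_inv @ (np.eye(m) - embed @ K0_inv)

print(np.allclose(np.linalg.inv(K), rhs))       # -> True
\end{verbatim}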

Equation (326) now allows the predictive mean (322) and covariance (323) to be expressed in terms of the prior covariance
$\displaystyle \bar y$ $\textstyle =$ $\displaystyle t_0 + {{\bf K}}_0^{-1}
\left({\bf K}_D^{-1} + {{\bf K}}_{0,DD}^{-1} \right)^{-1} (t_D - t_0)
,$ (327)
$\displaystyle {\bf K}_y^{-1}$ $\textstyle =$ $\displaystyle {\bf K}_{x}^{-1}+
{\bf K}_{0,xx}^{-1}
-{\bf K}_{0,xD}^{-1}
\left({\bf K}_D^{-1}+{\bf K}_{0,DD}^{-1}\right)^{-1}
{\bf K}_{0,Dx}^{-1}
.$ (328)

Thus, for a given prior covariance ${\bf K}_0^{-1}$, both $\bar y$ and ${\bf K}_y^{-1}$ can be calculated by inverting the $\tilde n\times\tilde n$ matrix $\widetilde {\bf K}$ = $\left({\bf K}_{0,DD}^{-1} + {\bf K}_D^{-1}\right)^{-1}$.
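
As a sketch of this procedure (again with illustrative numbers; a single test point and equal data variances are assumptions of the example), the predictive mean and covariance can be computed once from the prior covariance via Eqs. (327,328) and cross-checked against Eqs. (322,323).
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
m = 7
D = [1, 2, 5]                                   # training locations (grid indices)
x = 4                                           # test location
sigma2 = 0.25**2                                # training-data variance
Kx = 1.0 / 0.5**2                               # test-data inverse variance

A = rng.standard_normal((m, m))
K0 = A @ A.T + m * np.eye(m)                    # prior inverse covariance
K0_inv = np.linalg.inv(K0)
t0 = rng.standard_normal(m)                     # prior template
tD = rng.standard_normal(len(D))                # observed values on D

# The n~ x n~ matrix that has to be inverted
Ktilde = np.linalg.inv(sigma2 * np.eye(len(D)) + K0_inv[np.ix_(D, D)])

# Eq. (327): y_bar = t_0 + K_0^{-1} Ktilde (t_D - t_0)
ybar = t0 + K0_inv[:, D] @ Ktilde @ (tD - t0[D])

# Eq. (328), a scalar for a single test point
Ky_inv = 1.0 / Kx + K0_inv[x, x] - K0_inv[x, D] @ Ktilde @ K0_inv[D, x]

# Cross-check against Eqs. (322) and (323) built from K = K_0 + K_D directly
KD = np.zeros((m, m))
KD[D, D] = 1.0 / sigma2
K = K0 + KD
t_direct = np.linalg.solve(K, K0 @ t0 + KD[:, D] @ tD)
print(np.allclose(ybar, t_direct))                              # -> True
print(np.isclose(Ky_inv, 1.0 / Kx + np.linalg.inv(K)[x, x]))    # -> True
\end{verbatim}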

Comparison of Eqs. (327,328) with the maximum posterior solution ${h}^*$ of Eq. (277) now shows that for Gaussian regression the exact predictive density $p(y\vert x,D,D_0)$ and its maximum posterior approximation $p(y\vert x,{h}^*)$ have the same mean

\begin{displaymath}
t = \int \!dy\, y\, p(y\vert x,D,D_0) = \int \!dy\, y\, p(y\vert x,{h}^*)
.
\end{displaymath} (329)

The variances, however, differ by the term ${\bf I}_x {\bf K}^{-1}{\bf I}_x$.
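
This difference can also be seen in a brute-force evaluation of the defining $h$-integral (307). The sketch below (a two-point grid with one training and one test location; all numbers are illustrative) marginalizes the Gaussian posterior numerically and recovers the mean $t$ of Eq. (322) and the variance ${\bf K}_x^{-1}+{\bf I}_x{\bf K}^{-1}{\bf I}_x$ of Eq. (323), while the maximum posterior approximation $p(y\vert x,{h}^*)$ has variance ${\bf K}_x^{-1}$ only.
\begin{verbatim}
import numpy as np

# Two-point grid: h = (h(x_1), h(x_2)); training datum at x_1, test point at x_2
K0 = np.array([[2.0, -0.8], [-0.8, 1.5]])       # prior inverse covariance
t0 = np.array([0.3, -0.1])                      # prior template
K1, t1 = 1.0 / 0.3**2, 1.2                      # training inverse variance and value at x_1
Kx = 1.0 / 0.5**2                               # test inverse variance at x_2

K = K0 + np.diag([K1, 0.0])                     # Eq. (312)
t = np.linalg.solve(K, K0 @ t0 + np.array([K1 * t1, 0.0]))
var_exact = 1.0 / Kx + np.linalg.inv(K)[1, 1]   # Eq. (323) at the test point
var_map = 1.0 / Kx                              # variance of p(y|x,h*)

# Numerical h-integration of p(y|x,h) p(h|D,D_0), Eq. (307)
hs = np.linspace(-5.0, 5.0, 400)
H1, H2 = np.meshgrid(hs, hs, indexing="ij")
post = np.exp(-0.5 * (K[0, 0] * (H1 - t[0])**2
                      + 2.0 * K[0, 1] * (H1 - t[0]) * (H2 - t[1])
                      + K[1, 1] * (H2 - t[1])**2))
post /= post.sum()                               # normalized posterior on the grid

ys = np.linspace(-5.0, 5.0, 400)
dy = ys[1] - ys[0]
pred = np.array([(np.exp(-0.5 * Kx * (y - H2)**2) * post).sum() for y in ys])
pred /= pred.sum() * dy                          # normalized predictive density

mean_num = (ys * pred).sum() * dy
var_num = ((ys - mean_num)**2 * pred).sum() * dy
print(np.isclose(mean_num, t[1], atol=1e-2))     # -> True: same mean as h* = t
print(np.isclose(var_num, var_exact, atol=1e-2)) # -> True: exact variance
print(var_num - var_map)                         # approx. (K^{-1})_{22}, the extra term
\end{verbatim}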

According to the results of Section 2.2.2 the mean of the predictive density is the optimal choice under squared-error loss (51). For Gaussian regression, therefore, the optimal regression function $a^*(x)$ is the same under squared-error loss in the exact and in the maximum posterior treatment, and thus also under log-loss (for Gaussian $p(y\vert x,a)$ with fixed variance)

\begin{displaymath}
a^*_{\rm MAP,log}
=a^*_{\rm exact,log}
= a^*_{\rm MAP,sq.}
= a^*_{\rm exact,sq.}
={h}^* = t.
\end{displaymath} (330)

If the space of possible $p(y\vert x,a)$ is not restricted to Gaussian densities with fixed variance, the variance of the optimal density under log-loss, $p(y\vert x,a^*_{\rm exact,log})$ = $p(y\vert x,D,D_0)$, differs by ${\bf I}_x {\bf K}^{-1}{\bf I}_x$ from its maximum posterior approximation $p(y\vert x,a^*_{\rm MAP,log})$ = $p(y\vert x,{h}^*)$.

