
Exact posterior for hyperparameters

In the previous sections we have studied saddle point approximations which lead us to maximize the joint posterior $p(h,\theta\vert D,D_0)$ simultaneously with respect to the hidden variables $h$ and the hyperparameters $\theta$

\begin{displaymath}
p(y\vert x,D,D_0) =
p(y_D\vert x_D,D_0)^{-1}
\!\!\int \!\!dh\, d\theta\,
p(y\vert x,h)
\underbrace{
p(y_D\vert x_D,h)\, p(h,\theta\vert D_0)
}_{\propto\, p(h,\theta\vert D,D_0),\; \rm max\, w.r.t.\, \theta\; and\; h}
,
\end{displaymath} (486)

assuming, for the maximization with respect to $h$, a slowly varying $p(y\vert x,h)$ at the stationary point.
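
Equation (486) arises from the general decomposition of the predictive density together with Bayes' theorem for the joint posterior,

\begin{displaymath}
p(y\vert x,D,D_0) =
\int \!dh\, d\theta\,
p(y\vert x,h)\, p(h,\theta\vert D,D_0)
,\qquad
p(h,\theta\vert D,D_0) =
\frac{p(y_D\vert x_D,h)\, p(h,\theta\vert D_0)}{p(y_D\vert x_D,D_0)}
,
\end{displaymath}

using that, given $h$, the test outcome $y$ is independent of the training data and of $\theta$.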

This simultaneous maximization with respect to both variables is consistent with the usual asymptotic justification of a saddle point approximation. For example, for a function $f(h,\theta)$ of two (say, one-dimensional) variables $h$, $\theta$

\begin{displaymath}
\int \!dh\, d\theta\, e^{-\beta f(h,\theta)}
\approx
e^{-\beta f(h^*,\theta^*)
-\frac{1}{2}\ln \det (\beta {\bf H}/2\pi)
}
\end{displaymath} (487)

for large enough $\beta $ (assuming a unique minimum of $f$). Here $f(h^*,\theta^*)$ denotes the joint minimum and ${\bf H}$ the Hessian of $f$ with respect to $h$ and $\theta$. A $\theta$-dependent determinant of the covariance, together with the usual definition of $\beta $, results in a function $f$ of the form $f(h,\theta)$ = $E(h,\theta) + (1/2\beta)\ln \det (\beta {\bf K}(\theta)/2 \pi)$, where both terms are relevant for the minimization of $f$ with respect to $\theta$. For large $\beta $, however, the second term becomes small compared to the first one. (Of course, there is the possibility that a saddle point approximation is not adequate for the $\theta$ integration. Also, we have seen that the condition of a positive definite covariance may lead to a solution for $\theta$ on the boundary, where the (unrestricted) stationarity equation is not fulfilled.)
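
For instance, for a one-dimensional ${\bf K}(\theta)$ = $\theta>0$ the second term reads

\begin{displaymath}
\frac{1}{2\beta}\ln \frac{\beta\theta}{2\pi}
,
\end{displaymath}

which is of order $\ln\beta/\beta$ and contributes only a term $1/(2\beta\theta)$ to the stationarity equation for $\theta$, while $\partial E/\partial\theta$ remains of order one for $\beta\to\infty$.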

Alternatively, one might think of performing the two integrals stepwise. This seems especially useful if one integral can be calculated analytically. Consider, for example

\begin{displaymath}
\int \!dh\, d\theta\, e^{-\beta f(h,\theta)}
\approx
\int \!d\theta\,
e^{-\beta f(h^*(\theta),\theta)
-\frac{1}{2}\ln \det (\frac{\beta}{2\pi}
\frac{\partial^2 f(h^*(\theta))}{\partial h^2})
}
\end{displaymath} (488)

which would be exact for a Gaussian $h$-integral. One sees now that minimizing the complete negative exponent $\beta f(h^*(\theta),\theta)$ + $\frac{1}{2}\ln \det (\beta (\partial^2 f/\partial h^2)/2\pi)$ with respect to $\theta$ differs from minimizing only $f$ as in (487) whenever the second derivative of $f$ with respect to $h$ depends on $\theta$ (which is not the case, for example, for a Gaussian $h$-integral with $\theta$-independent covariance). Again, this additional term becomes negligible for large enough $\beta $. Thus, at least asymptotically, this term may be altered or even skipped, and the differences between the results of the variants of the saddle point approximation are expected to be small.
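
As a simple check, take an exactly Gaussian $h$-integral, $f(h,\theta)$ = $\frac{1}{2}a(\theta)(h-m(\theta))^2 + g(\theta)$ with $a(\theta)>0$. Then $h^*(\theta)$ = $m(\theta)$, (488) holds exactly, and the exponent to be minimized with respect to $\theta$ becomes

\begin{displaymath}
\beta g(\theta)
+\frac{1}{2}\ln \frac{\beta a(\theta)}{2\pi}
,
\end{displaymath}

with stationarity equation $\beta g^\prime(\theta) + a^\prime(\theta)/(2a(\theta))$ = $0$. The determinant term shifts the stationary point only at relative order $1/\beta$, and it drops out entirely if $a$ = $\partial^2 f/\partial h^2$ does not depend on $\theta$.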

Stepwise approaches like (488) can be used, for example, to perform Gaussian integrations analytically, and lead to somewhat simpler stationarity equations for $\theta$-dependent covariances [236].

In particular, let us look at the case of Gaussian regression in a bit more detail. The following discussion, however, also applies to density estimation if, as in (488), the Gaussian first-step integration is replaced by a saddle point approximation including the normalization factor. (This requires the calculation of the determinant of the Hessian.) Consider the two-step procedure for Gaussian regression

$\displaystyle p(y\vert x,D,D_0)\!$ $\textstyle =$ $\displaystyle p(y_D\vert x_D,D_0)^{-1}
\!\!\int \!\!d\theta
\underbrace{
p(\theta\vert D_0)\,
p(y,y_D\vert x,x_D,D_0,\theta)
}_{\propto\, p(y,\theta\vert x,D,D_0),
{\rm\, max\, w.r.t.\,} \theta}
,$  
  $\textstyle =$ $\displaystyle \int \! d\theta\,
\underbrace{
\underbrace{
p(\theta\vert D,D_0)\,
p(y\vert x,D,D_0,\theta)
}_{\rm exact}
}_{p(y,\theta\vert x,D,D_0),\; \rm max\, w.r.t.\, \theta}$ (489)

where in a first step $p(y,y_D\vert x,x_D,D_0,\theta)$ can be calculated analytically, and in a second step the $\theta$-integral is performed in Gaussian approximation around a stationary point. Instead of maximizing the joint posterior $p(h,\theta\vert D,D_0)$ with respect to $h$ and $\theta$, this approach performs the $h$-integration analytically and maximizes $p(y,\theta\vert x,D,D_0)$ with respect to $\theta$. The disadvantage of this approach is the $y$- and $x$-dependence of the resulting solution.

Thus, assuming a slowly varying $p(y\vert x,D,D_0,\theta)$ at the stationary point, it appears simpler to maximize the $h$-marginalized posterior, $p(\theta\vert D,D_0)$ = $\int dh\, p(h,\theta\vert D,D_0)$, provided the $h$-integration can be performed exactly,

\begin{displaymath}
p(y\vert x,D,D_0) =
\int \! d\theta\,
\underbrace{
\underbrace{
p(\theta\vert D,D_0)
}_{=\,\int \!dh\, p(h,\theta\vert D,D_0)}
}_{\rm max\, w.r.t.\, \theta }
\underbrace{
p(y\vert x,D,D_0,\theta)
}_{\rm exact}
.
\end{displaymath} (490)

Having found a maximum posterior solution $\theta^*$, the corresponding analytical solution for $p(y\vert x,D,D_0,\theta^*)$ is then given by Eq. (321). The posterior density $p(\theta\vert D,D_0)$ can be obtained from the likelihood of $\theta$ and a specified prior $p(\theta)$
\begin{displaymath}
p(\theta\vert D,D_0) =
\frac{p(y_D\vert x_D,D_0,\theta)
p(\theta)}{p(y_D\vert x_D,D_0)}
.
\end{displaymath} (491)

Thus, if the $\theta$-likelihood can be calculated analytically, the $\theta$-integral is evaluated in saddle point approximation by maximizing the posterior for $\theta$ with respect to $\theta$. In the case of a uniform $p(\theta)$ the optimal $\theta^*$ is obtained by maximizing the $\theta$-likelihood. Technically, this corresponds to an empirical Bayes approach [35]. As $h$ is integrated out in $p(y_D\vert x_D,D_0,\theta)$, the $\theta$-likelihood is also called the marginalized likelihood.

Indeed, for Gaussian regression the $h$-integration defining the $\theta$-likelihood can be performed analytically. Analogously to Section 3.7.2 one finds [228,237,236],

$\displaystyle p(y_D\vert x_D,D_0,\theta)$ $\textstyle =$ $\displaystyle \int\!dh\, p(y_D\vert x_D,h) \, p(h\vert D_0,\theta)$  
  $\textstyle =$ $\displaystyle \int\!dh\,
e^{-\frac{1}{2} \sum_{i=0}^{n}
\big( {h}-t_i,\, {\bf K}_i ({h}-t_i) \big)
+\frac{1}{2} \sum_{i=0}^{n} \ln \det_i ({\bf K}_i/2\pi)
}$  
  $\textstyle =$ $\displaystyle e^{
-\frac{1}{2}\sum_{i=0}^n \big( t_i,\, {\bf K}_i t_i \big)
+\frac{1}{2}\big( t,\, {\bf K} t \big)
+\frac{1}{2}\ln \det_D(\widetilde {\bf K} /2\pi)
}$  
  $\textstyle =$ $\displaystyle e^{-\frac{1}{2}
\Big( t_D - t_0,\, \widetilde {\bf K}
( t_D - t_0) \Big)
+\frac{1}{2}\ln\det_D \widetilde {\bf K}
-\frac{\tilde n}{2}\ln(2\pi)
}$  
  $\textstyle =$ $\displaystyle e^{-\widetilde E
+\frac{1}{2}\ln\det_D \widetilde {\bf K}
-\frac{\tilde n}{2}\ln(2\pi)
}
,$ (492)

where $\widetilde E$ = $\frac{1}{2}
\Big( t_D - t_0,\, \widetilde {\bf K}
( t_D - t_0) \Big)$, $\widetilde {\bf K}$ = $({\bf K}_D^{-1}+{\bf K}_{0,DD}^{-1}(\theta))^{-1}$ = ${\bf K}_D - {\bf K}_D{\bf K}^{-1}{\bf K}_D$, and $\det_D$ denotes the determinant in data space. We have used that ${\bf K}_i^{-1}{\bf K}_j$ = $\delta_{ij}$ for $i,j>0$ implies $\sum_{i=0}^n \big( t_i,\, {\bf K}_i t_i \big)$ = $\big( t_D,\, {\bf K}_D t_D \big)$ + $\big( t_0,\, {\bf K}_0 t_0 \big)$, with ${\bf K} t$ = ${\bf K}_D t_D + {\bf K}_0 t_0$ and ${\bf K}$ = $\sum_{i=0}^{n} {\bf K}_i$. In cases where the marginalization over $h$, which is necessary to obtain the evidence, cannot be performed analytically and all $h$-integrals are calculated in saddle point approximation, one recovers for the predictive density the same result as for a direct simultaneous MAP with respect to $h$ and $\theta$, as indicated in (486).
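
In the simplest case of a single data point with scalar ${\bf K}_D$ = $a$ and ${\bf K}_0$ = $b$, one has $\widetilde {\bf K}$ = $(a^{-1}+b^{-1})^{-1}$ = $ab/(a+b)$ = $a - a^{2}/(a+b)$, illustrating the second form of $\widetilde {\bf K}$.

The maximization of the $\theta$-likelihood (492) can also be illustrated numerically. The following sketch assumes a vanishing prior mean $t_0$ = $0$, a squared-exponential parameterization of the prior covariance ${\bf K}_{0,DD}^{-1}(\theta)$ on the data points with hyperparameters amplitude and length scale, an i.i.d. Gaussian noise variance as ${\bf K}_D^{-1}$, and toy data; these choices, like the function and variable names, are illustrative and not taken from the text.

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_theta, x_D, t_D):
    # -ln p(y_D|x_D,D_0,theta) for Gaussian regression with zero prior mean.
    # theta = (amplitude, length scale, noise std), optimized in log space
    # to keep all hyperparameters positive (illustrative parameterization).
    amp, length, noise = np.exp(log_theta)
    d2 = (x_D[:, None] - x_D[None, :]) ** 2
    C0 = amp ** 2 * np.exp(-0.5 * d2 / length ** 2)  # prior covariance K_{0,DD}^{-1}
    C = C0 + noise ** 2 * np.eye(len(x_D))           # = K_tilde^{-1}
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, t_D))
    # -ln p = E_tilde - (1/2) ln det_D K_tilde + (n/2) ln(2 pi)
    return (0.5 * t_D @ alpha
            + np.sum(np.log(np.diag(L)))
            + 0.5 * len(x_D) * np.log(2 * np.pi))

# toy data, for illustration only
rng = np.random.default_rng(0)
x_D = np.linspace(0.0, 5.0, 30)
t_D = np.sin(x_D) + 0.1 * rng.standard_normal(x_D.size)

res = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(x_D, t_D))
print("theta* = (amplitude, length scale, noise std):", np.exp(res.x))
\end{verbatim}

The optimal $\theta^*$ found in this way is then inserted into the analytical predictive density $p(y\vert x,D,D_0,\theta^*)$ of Eq. (321).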

Now we are able to compare the three resulting stationarity equations for a $\theta$-dependent mean $t_0(\theta)$, inverse covariance ${\bf K}_0(\theta)$, and prior $p(\theta)$. Setting the derivative of the joint posterior $p(h,\theta\vert D,D_0)$ with respect to $\theta$ to zero yields

$\displaystyle 0$ $\textstyle =$ $\displaystyle \left( \frac{\partial t_0}{\partial \theta},\; {\bf K}_0(t_0-{h})\right)
+\frac{1}{2} \Big({h}-t_0 ,
\,\frac{\partial {{\bf K}_0}(\theta)}{\partial \theta}\,({h}-t_0)\Big)$  
    $\displaystyle - \frac{1}{2} {\rm Tr} \left({\bf K}_0^{-1}
\frac{\partial {{\bf K}_0}(\theta)}{\partial \theta}\right)
-\frac{1}{p(\theta)} \frac{\partial p(\theta)}{\partial \theta}
.$ (493)
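
The trace terms in this and the following stationarity equations follow from the identity

\begin{displaymath}
\frac{\partial}{\partial \theta}\ln\det {\bf K}(\theta)
= {\rm Tr}\left({\bf K}^{-1}(\theta)\,
\frac{\partial {\bf K}(\theta)}{\partial \theta}\right)
,
\end{displaymath}

applied to ${\bf K}_0(\theta)$ here and to $\widetilde {\bf K}(\theta)$ and ${\bf K}_y(\theta)$ in Eqs. (494) and (495) below.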

Equation (493), which we have already discussed, has to be solved simultaneously with the stationarity equation for $h$. While this approach is easily adapted to general density estimation problems, its difficulty for $\theta$-dependent covariance determinants lies in the calculation of the derivative of the determinant of ${\bf K}_0$. Maximizing the $h$-marginalized posterior $p(\theta\vert D,D_0)$, on the other hand, only requires the derivative of the determinant of the $\tilde n\times\tilde n$ matrix $\widetilde {\bf K}$
$\displaystyle 0$ $\textstyle =$ $\displaystyle \left( \frac{\partial t_0}{\partial \theta},\;
\widetilde {\bf K}\,(t_0-t_D)\right)
+\frac{1}{2} \left((t_D-t_0),\,
\frac{\partial \widetilde {\bf K}}{\partial \theta}(t_D-t_0)\right)$  
    $\displaystyle -\frac{1}{2} {\rm Tr}\left(\widetilde {\bf K}^{-1}
\frac{\partial \widetilde {\bf K}}{\partial \theta}\right)
-\frac{1}{p(\theta)} \frac{\partial p(\theta)}{\partial \theta}
.$ (494)

Evaluated at the stationary point $h^*$ = $t_0+{\bf K}_0^{-1} \widetilde {\bf K} (t_D-t_0)$, the first term of Eq. (493), which does not contain derivatives of the inverse covariances, becomes equal to the first term of Eq. (494). The last terms of Eqs. (493) and (494) are always identical. Typically, the data-independent ${\bf K}_0$ has a more regular structure than the data-dependent $\widetilde {\bf K}$. Thus, at least for one- or two-dimensional $x$, a straightforward numerical solution of Eq. (493) by discretizing $x$ can also be a good choice for Gaussian regression problems.
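
The equality of the first terms noted above can be verified directly: at the stationary point,

\begin{displaymath}
{\bf K}_0\,(t_0-h^*)
= -{\bf K}_0{\bf K}_0^{-1}\widetilde {\bf K}\,(t_D-t_0)
= \widetilde {\bf K}\,(t_0-t_D)
,
\end{displaymath}

so that $\left(\partial t_0/\partial\theta,\, {\bf K}_0(t_0-h^*)\right)$ coincides with the first term $\left(\partial t_0/\partial\theta,\, \widetilde{\bf K}(t_0-t_D)\right)$ of Eq. (494).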

Analogously, using Eq. (321), maximizing $p(y,\theta\vert x,D,D_0)$ with respect to $\theta$ yields

$\displaystyle 0$ $\textstyle =$ $\displaystyle \left( \frac{\partial t}{\partial\theta},\;{\bf K}_{y}(t - y)\right)
+\frac{1}{2} \left((y-t),\,
\frac{\partial {\bf K}_{y}}{\partial \theta}(y-t)\right)$  
    $\displaystyle -\frac{1}{2} {\rm Tr}\left({\bf K}_{y}^{-1}
\frac{\partial {\bf K}_{y}}{\partial \theta}\right)
-\frac{1}{p(\theta\vert D,D_0)} \frac{\partial p(\theta\vert D,D_0)}{\partial \theta}
,$ (495)

which is $y$- and $x$-dependent. Such an approach may be considered if one is interested only in specific test data $x$, $y$.

We may remark that also in Gaussian regression the $\theta$-integral may be quite different from a Gaussian integral, so a saddle point approximation does not necessarily give satisfactory results. If one encounters problems one can, for example, try variable transformations $\int f(\theta)\,d\theta$ = $\int \det (\partial \theta/\partial \theta^\prime)
f(\theta(\theta^\prime))\,d\theta^\prime$ to obtain a more Gaussian shape of the integrand. Due to the presence of the Jacobian determinant, however, the asymptotic interpretation of the corresponding saddle point approximation is different for the two integrals. The variability of saddle point approximations results from the freedom to add terms which vanish asymptotically but remain finite in the nonasymptotic region. Similar effects are known in quantum many-body theory (see, for example, [172], Chapter 7). Alternatively, the $\theta$-integral can be solved numerically by Monte Carlo methods [237,236].
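
A minimal random-walk Metropolis sketch of such a Monte Carlo treatment of the $\theta$-integral might look as follows. The target neg_log_posterior is only a stand-in for $-\ln [p(y_D\vert x_D,D_0,\theta)\,p(\theta)]$ (for Gaussian regression it would be built from Eq. (492)), and the step size and number of samples are hypothetical tuning choices.

\begin{verbatim}
import numpy as np

def neg_log_posterior(log_theta):
    # Stand-in for -ln[ p(y_D|x_D,D_0,theta) p(theta) ]; for Gaussian
    # regression replace this by the marginalized likelihood of Eq. (492)
    # (times a prior), parameterized e.g. in terms of log theta.
    return 0.5 * np.sum((log_theta - 1.0) ** 2)

def metropolis(neg_log_post, theta0, n_samples=5000, step=0.3, seed=1):
    # Random-walk Metropolis sampling of p(theta|D,D_0) in log space.
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.array(theta0, dtype=float))
    f = neg_log_post(theta)
    samples = []
    for _ in range(n_samples):
        proposal = theta + step * rng.standard_normal(theta.shape)
        f_prop = neg_log_post(proposal)
        # accept with probability min(1, exp(f - f_prop))
        if np.log(rng.uniform()) < f - f_prop:
            theta, f = proposal, f_prop
        samples.append(theta.copy())
    return np.array(samples)

samples = metropolis(neg_log_posterior, theta0=[0.0])
print("posterior mean of log(theta):", samples.mean(axis=0))
\end{verbatim}

The predictive density (490) is then approximated by averaging $p(y\vert x,D,D_0,\theta)$ over the sampled $\theta$ values instead of evaluating it at a single $\theta^*$.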

