

Prior normalization

In Chapter 4 parameterizations of $\phi $ have been studied. This section now discusses parameterizations of the prior density $p(\phi\vert D_0)$. For Gaussian prior densities this means a parameterization of the mean and/or the covariance. The parameters of the prior functional, which we will denote by $\theta$, are in a Bayesian context also known as hyperparameters. Hyperparameters $\theta$ can be considered part of the hidden variables.

In a full Bayesian approach the ${h}$-integral therefore has to be completed by an integral over the additional hidden variables $\theta$. Analogously, the prior densities can be supplemented by priors for $\theta$, also called hyperpriors, with corresponding energies $E_\theta$.

In saddle point approximation an additional stationarity equation thus appears, resulting from the derivative with respect to $\theta$. The saddle point approximation of the $\theta$-integration (in the case of a uniform hyperprior $p(\theta)$ and with the ${h}$-integral calculated exactly or in approximation) is also known as ML-II prior [16] or evidence framework [85,86,213,147,148,149,24].

There are some cases where it is convenient to let the likelihood $p(y\vert x,h)$ depend, besides on a function $\phi $, on a few additional parameters. In regression such a parameter can be the variance of the likelihood. Another example is the inverse temperature $\beta $ introduced in Section 6.3, which, like $\phi $, also appears in the prior. Such parameters may formally be added to the ``direct'' hidden variables $\phi $, yielding an enlarged $\tilde \phi$. As those ``additional likelihood parameters'' are, like other hyperparameters, typically just real numbers and not functions like $\phi $, they can often be treated analogously to hyperparameters. For example, they may also be determined by cross-validation (see below) or by a low dimensional integration. In contrast to pure prior parameters, however, the derivatives with respect to such ``additional likelihood parameters'' contain terms arising from the derivative of the likelihood.
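For instance, for a Gaussian regression likelihood with variance $\sigma^2$ (written here, as an illustration, for one-dimensional $y$),
\begin{displaymath}
p(y\vert x,h) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-h(x))^2}{2\sigma^2}}
,
\end{displaymath}
the likelihood part of the energy acquires, besides the quadratic data term, the $\sigma$-dependent contribution $\frac{n}{2}\ln (2\pi \sigma^2)$ for $n$ training data, which enters the stationarity equation for $\sigma^2$ but not that for $\phi $.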

Within the Frequentist interpretation of error minimization as empirical risk minimization, hyperparameters $\theta$ can be determined by minimizing the empirical generalization error on a new set of test or validation data $D_T$, independent of the training data $D$. Here the empirical generalization error is meant to be the pure data term $E_D(\theta)$ = $E_D(\phi^*(\theta))$ of the error functional, where $\phi^*$ is the optimal $\phi $ for the full regularized $E_\phi(\theta)$ at $\theta$ and for given training data $D$. More elaborate techniques include cross-validation and bootstrap methods, which have been mentioned in Sections 2.5 and 4.9.
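As a minimal sketch of such a validation-based choice of $\theta$ (using, for illustration only, a ridge-regression stand-in for the regularized functional and a plain grid search over a scalar hyperparameter):
\begin{verbatim}
import numpy as np

def fit_phi(X, y, theta):
    # Optimal phi for fixed hyperparameter theta; here a ridge-regression
    # stand-in: phi* = argmin ||y - X phi||^2 + theta ||phi||^2.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + theta * np.eye(d), X.T @ y)

def validation_error(X_val, y_val, phi):
    # Pure data term E_D(phi*(theta)) on validation data independent of D.
    return np.sum((y_val - X_val @ phi) ** 2)

def select_theta(X_train, y_train, X_val, y_val, theta_grid):
    # Choose theta minimizing the empirical generalization error.
    errors = [validation_error(X_val, y_val, fit_phi(X_train, y_train, t))
              for t in theta_grid]
    return theta_grid[int(np.argmin(errors))]
\end{verbatim}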

Within the Bayesian interpretation of error minimization as posterior maximization the introduction of hyperparameters leads to a new difficulty. The problem arises from the fact that it is usually desirable to interpret the error term $E_\theta$ as prior energy for $\theta$, meaning that

\begin{displaymath}
p(\theta) = \frac{e^{-E_\theta}}{Z_\theta}
,
\end{displaymath} (419)

with normalization
\begin{displaymath}
Z_\theta
=
{\int\!d\theta\, e^{-E_\theta}}
,
\end{displaymath} (420)

represents the prior density for $\theta$. Because the joint prior factor for $\phi $ and $\theta$ is given by the product
\begin{displaymath}
p(\phi,\theta) = p(\phi\vert\theta) p(\theta)
,
\end{displaymath} (421)

one finds
\begin{displaymath}
p(\phi\vert\theta) = \frac{e^{-E (\phi\vert\theta)}}{Z_\phi(\theta)}
.
\end{displaymath} (422)

Hence, the $\phi $-dependent part of the energy represents a conditional prior energy denoted here $E(\phi\vert\theta)$. As this conditional normalization
\begin{displaymath}
Z_\phi(\theta)
=
\int\!d\phi\, e^{-E(\phi\vert\theta)}
,
\end{displaymath} (423)

is in general $\theta$-dependent, a normalization term
\begin{displaymath}
E_N (\theta) = \ln Z_\phi(\theta)
\end{displaymath} (424)

must be included in the error functional when minimizing with respect to $\theta$.
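For a Gaussian conditional prior with energy
\begin{displaymath}
E(\phi\vert\theta) = \frac{1}{2}\Big(\phi-t(\theta),\, {{\bf K}}(\theta)\, (\phi-t(\theta))\Big)
,
\end{displaymath}
for example, a (formal) Gaussian integration yields
\begin{displaymath}
E_N(\theta) = \ln Z_\phi(\theta) = -\frac{1}{2} \ln\det {{\bf K}}(\theta) + {\rm const}
,
\end{displaymath}
so the normalization term depends on $\theta$ only through the determinant of the inverse covariance ${{\bf K}}(\theta)$ and not through a parameterized mean $t(\theta)$.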

It is interesting to look at what happens if $p(\phi,\theta)$ of Eq. (421) is expressed in terms of a joint energy $E(\phi,\theta)$ as follows

\begin{displaymath}
p(\phi,\theta) = \frac{e^{-E (\phi,\theta)}}{Z_{\phi,\theta}}
.
\end{displaymath} (425)

Then the joint normalization
\begin{displaymath}
Z_{\phi,\theta}
=
\int\!d\phi\, d\theta\, e^{-E(\phi,\theta)}
,
\end{displaymath} (426)

is independent of $\phi $ and $\theta$ and could be omitted from the functional. However, in that case the term $E_\theta$ cannot easily be related to the prior $p(\theta)$.

Notice especially that this discussion also applies to the case where $E_\theta$ is assumed to be uniform, so that it does not have to appear explicitly in the error functional. The two ways of expressing $p(\phi,\theta)$, by a joint or a conditional energy, respectively, are equivalent if the joint density factorizes. In that case, however, $\theta$ and $\phi $ are independent, so $\theta$ cannot be used to parameterize the density of $\phi $.

Numerically, the need to calculate $Z_\phi(\theta)$ can be disastrous because the normalization factors $Z_\phi(\theta)$ often represent an extremely high dimensional (functional) integral and are, in contrast to the normalization of $P$ over $y$, very difficult to calculate.

There are, however, situations in which $Z_\phi(\theta)$ remains $\theta$-independent. Let $p(\phi,\theta)$ stand, for example, for a Gaussian specific prior $p(\phi,\theta\vert\tilde D_0)$ (with the normalization condition factored out as in Eq. (90)). Then, because the normalization of a Gaussian is independent of its mean, parameterizing the mean $t$ = $t(\theta)$ results in a $\theta$-independent $Z_\phi(\theta)$.

Besides their mean, Gaussian processes are characterized by their covariance operators ${{\bf K}}^{-1}$. Because the normalization only depends on $\det {{\bf K}}$, a second possibility yielding a $\theta$-independent $Z_\phi(\theta)$ is given by parameterized transformations of the form ${{\bf K}} \rightarrow {\bf O}{{\bf K}}{\bf O}^{-1}$ with orthogonal ${\bf O}$ = ${\bf O}(\theta)$. Indeed, such transformations do not change the determinant $\det {{\bf K}}$. They are only non-trivial for multi-dimensional Gaussians.
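Explicitly,
\begin{displaymath}
\det\!\left({\bf O}{{\bf K}}{\bf O}^{-1}\right)
= \det {\bf O}\, \det {{\bf K}}\, \det {\bf O}^{-1}
= \det {{\bf K}}
,
\end{displaymath}
so $Z_\phi(\theta)$, and with it $E_N(\theta)$, is unchanged under such $\theta$-dependent rotations of the inverse covariance.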

For general parameterizations of density estimation problems, however, the normalization term $\ln Z_\phi(\theta)$ must be included. The only way to get rid of that normalization term would be to assume a compensating hyperprior

\begin{displaymath}
p(\theta)\propto Z_\phi (\theta)
,
\end{displaymath} (427)

resulting in an error term $E(\theta)$ = $-\ln Z_\phi (\theta)$ compensating $E_N (\theta)$.

Thus, in the general case we have to consider the functional

\begin{displaymath}
E_{\theta,\phi} =
-(\ln P(\phi ),\,N)
+ (P(\phi),\, \Lambda_X )
+ E_\phi (\theta)
+ E_\theta
+\ln Z_\phi (\theta)
.
\end{displaymath} (428)

Here we write $E(\phi\vert\theta)$ = $E_\phi(\theta)$ and $E(\theta)$ = $E_\theta$. The stationarity conditions have the form
\begin{displaymath}
\frac{\delta E_\phi}{\delta \phi}
=
{\bf P}^\prime(\phi)\, {\bf P}^{-1}(\phi)\, N
- {\bf P}^\prime (\phi)\, \Lambda_X
,
\end{displaymath} (429)

\begin{displaymath}
\frac{\partial E_\phi}{\partial \theta}
=
-{\bf Z}^\prime\, Z_\phi^{-1}(\theta)
- E_{\theta}^\prime
,
\end{displaymath} (430)

with
\begin{displaymath}
{\bf Z}^\prime (l,k)
= \delta (l-k)\, \frac{\partial Z_\phi(\theta)}{\partial \theta_l}
,\qquad
E_{\theta}^\prime (l)
=
\frac{\partial E_\theta}{\partial \theta_l}
.
\end{displaymath} (431)

For a compensating hyperprior $ E_\theta = -\ln Z_\phi (\theta)$, i.e., $E_{\theta}^\prime (l)$ = $-Z_\phi^{-1}(\theta)\, \partial Z_\phi(\theta)/\partial \theta_l$, the right hand side of Eq. (430) vanishes.

Finally, we want to remark that in case function evaluation of $p(\phi,\theta)$ is much cheaper than calculation of the gradient (430), minimization methods that do not use the gradient should be considered, for example the downhill simplex method [196].
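A minimal sketch of such a gradient-free minimization (the quadratic toy objective merely stands in for the true, expensive hyperparameter energy):
\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

def total_energy(theta):
    # Placeholder for the hyperparameter objective, i.e. E_{theta,phi}
    # evaluated at the optimal phi for the given theta, including the
    # normalization term ln Z_phi(theta).
    return np.sum((theta - 1.0) ** 2)

# The downhill simplex (Nelder-Mead) method needs only function values,
# no gradients.
result = minimize(total_energy, x0=np.zeros(3), method="Nelder-Mead")
print(result.x)  # approximate minimizer
\end{verbatim}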

