

Prior normalization

In Chapter 4 parameterizations of $\phi $ have been studied. This section now discusses parameterizations of the prior density $p(\phi\vert D_0)$. For Gaussian prior densities this means a parameterization of the mean and/or the covariance. The parameters of the prior functional, which we will denote by $\theta$, are in a Bayesian context also known as hyperparameters. Hyperparameters $\theta$ can be considered part of the hidden variables.

In a full Bayesian approach the ${h}$-integral therefore has to be completed by an integral over the additional hidden variables $\theta$. Analogously, the prior densities can be supplemented by priors for $\theta$, also called hyperpriors, with corresponding energies $E_\theta$.

In saddle point approximation an additional stationarity equation thus appears, resulting from the derivative with respect to $\theta$. The saddle point approximation of the $\theta$-integration (in the case of a uniform hyperprior $p(\theta)$ and with the ${h}$-integral calculated exactly or in approximation) is also known as ML-II prior [16] or evidence framework [85,86,213,147,148,149,24].

There are some cases where it is convenient to let the likelihood $p(y\vert x,h)$ depend, besides on a function $\phi $, on a few additional parameters. In regression such a parameter can be the variance of the likelihood. Another example is the inverse temperature $\beta $ introduced in Section 6.3, which, like $\phi $, also appears in the prior. Such parameters may formally be added to the ``direct'' hidden variables $\phi $, yielding an enlarged $\tilde \phi$. As those ``additional likelihood parameters'' are, like other hyperparameters, typically just real numbers and not functions like $\phi $, they can often be treated analogously to hyperparameters. For example, they may also be determined by cross-validation (see below) or by a low dimensional integration. In contrast to pure prior parameters, however, the derivatives with respect to such ``additional likelihood parameters'' contain terms arising from the derivative of the likelihood.
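For instance, for a Gaussian regression likelihood with variance $\sigma^2$ (written here, as an illustration, for one-dimensional $y$),
\begin{displaymath}
p(y\vert x,h) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-h(x))^2}{2\sigma^2}}
,
\end{displaymath}
the likelihood part of the energy acquires, besides the quadratic data term, the $\sigma$-dependent contribution $\frac{n}{2}\ln (2\pi \sigma^2)$ for $n$ training data, which enters the stationarity equation for $\sigma^2$ but not that for $\phi $.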

Within the Frequentist interpretation of error minimization as empirical risk minimization, hyperparameters $\theta$ can be determined by minimizing the empirical generalization error on a new set of test or validation data $D_T$, independent of the training data $D$. Here the empirical generalization error is meant to be the pure data term $E_D(\theta)$ = $E_D(\phi^*(\theta))$ of the error functional, where $\phi^*$ is the optimal $\phi $ for the full regularized $E_\phi(\theta)$ at $\theta$ and for given training data $D$. More elaborate techniques include cross-validation and bootstrap methods, which have been mentioned in Sections 2.5 and 4.9.
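As a minimal sketch of such a validation-based choice of $\theta$ (using, for illustration only, a ridge-regression stand-in for the regularized functional and a plain grid search over a scalar hyperparameter):
\begin{verbatim}
import numpy as np

def fit_phi(X, y, theta):
    # Optimal phi for fixed hyperparameter theta; here a ridge-regression
    # stand-in: phi* = argmin ||y - X phi||^2 + theta ||phi||^2.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + theta * np.eye(d), X.T @ y)

def validation_error(X_val, y_val, phi):
    # Pure data term E_D(phi*(theta)) on validation data independent of D.
    return np.sum((y_val - X_val @ phi) ** 2)

def select_theta(X_train, y_train, X_val, y_val, theta_grid):
    # Choose theta minimizing the empirical generalization error.
    errors = [validation_error(X_val, y_val, fit_phi(X_train, y_train, t))
              for t in theta_grid]
    return theta_grid[int(np.argmin(errors))]
\end{verbatim}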

Within the Bayesian interpretation of error minimization as posterior maximization the introduction of hyperparameters leads to a new difficulty. The problem arises from the fact that it is usually desirable to interpret the error term $E_\theta$ as prior energy for $\theta$, meaning that

\begin{displaymath}
p(\theta) = \frac{e^{-E_\theta}}{Z_\theta}
,
\end{displaymath} (419)

with normalization
\begin{displaymath}
Z_\theta
=
{\int\!d\theta\, e^{-E_\theta}}
,
\end{displaymath} (420)

represents the prior density for $\theta$. Because the joint prior factor for $\phi $ and $\theta$ is given by the product
\begin{displaymath}
p(\phi,\theta) = p(\phi\vert\theta) p(\theta)
,
\end{displaymath} (421)

one finds
\begin{displaymath}
p(\phi\vert\theta) = \frac{e^{-E (\phi\vert\theta)}}{Z_\phi(\theta)}
.
\end{displaymath} (422)

Hence, the $\phi $-dependent part of the energy represents a conditional prior energy denoted here $E(\phi\vert\theta)$. As this conditional normalization
\begin{displaymath}
Z_\phi(\theta)
=
\int\!d\phi\, e^{-E(\phi\vert\theta)}
,
\end{displaymath} (423)

is in general $\theta$-dependent, a normalization term
\begin{displaymath}
E_N (\theta) = \ln Z_\phi(\theta)
\end{displaymath} (424)

must be included in the error functional when minimizing with respect to $\theta$.
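For a Gaussian conditional prior with energy
\begin{displaymath}
E(\phi\vert\theta) = \frac{1}{2}\Big(\phi-t(\theta),\, {{\bf K}}(\theta)\, (\phi-t(\theta))\Big)
,
\end{displaymath}
for example, a (formal) Gaussian integration yields
\begin{displaymath}
E_N(\theta) = \ln Z_\phi(\theta) = -\frac{1}{2} \ln\det {{\bf K}}(\theta) + {\rm const}
,
\end{displaymath}
so the normalization term depends on $\theta$ only through the determinant of the inverse covariance ${{\bf K}}(\theta)$ and not through a parameterized mean $t(\theta)$.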

It is interesting to look at what happens if $p(\phi,\theta)$ of Eq. (421) is expressed in terms of a joint energy $E(\phi,\theta)$ as follows

\begin{displaymath}
p(\phi,\theta) = \frac{e^{-E (\phi,\theta)}}{Z_{\phi,\theta}}
.
\end{displaymath} (425)

Then the joint normalization
\begin{displaymath}
Z_{\phi,\theta}
=
\int\!d\phi\, d\theta\, e^{-E(\phi,\theta)}
,
\end{displaymath} (426)

is independent of $\phi $ and $\theta$ and could be omitted from the functional. However, in that case the term $E_\theta$ cannot easily be related to the prior $p(\theta)$.

Notice especially that this discussion also applies to the case where $E_\theta$ is assumed to be uniform, so that it does not have to appear explicitly in the error functional. The two ways of expressing $p(\phi,\theta)$, by a joint or a conditional energy, respectively, are equivalent if the joint density factorizes. In that case, however, $\theta$ and $\phi $ are independent, so $\theta$ cannot be used to parameterize the density of $\phi $.

Numerically, the need to calculate $Z_\phi(\theta)$ can be disastrous because the normalization factors $Z_\phi(\theta)$ often represent an extremely high dimensional (functional) integral and are, in contrast to the normalization of $P$ over $y$, very difficult to calculate.

There are, however, situations in which $Z_\phi(\theta)$ remains $\theta$-independent. Let $p(\phi,\theta)$ stand, for example, for a Gaussian specific prior $p(\phi,\theta\vert\tilde D_0)$ (with the normalization condition factored out as in Eq. (90)). Then, because the normalization of a Gaussian is independent of its mean, parameterizing the mean $t$ = $t(\theta)$ results in a $\theta$-independent $Z_\phi(\theta)$.

Besides their mean, Gaussian processes are characterized by their covariance operators ${{\bf K}}^{-1}$. Because the normalization only depends on $\det {{\bf K}}$, a second possibility yielding a $\theta$-independent $Z_\phi(\theta)$ is given by parameterized transformations of the form ${{\bf K}} \rightarrow {\bf O}{{\bf K}}{\bf O}^{-1}$ with orthogonal ${\bf O}$ = ${\bf O}(\theta)$. Indeed, such transformations do not change the determinant $\det {{\bf K}}$. They are only non-trivial for multi-dimensional Gaussians.
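Explicitly,
\begin{displaymath}
\det\!\left({\bf O}{{\bf K}}{\bf O}^{-1}\right)
= \det {\bf O}\, \det {{\bf K}}\, \det {\bf O}^{-1}
= \det {{\bf K}}
,
\end{displaymath}
so $Z_\phi(\theta)$, and with it $E_N(\theta)$, is unchanged under such $\theta$-dependent rotations of the inverse covariance.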

For general parameterizations of density estimation problems, however, the normalization term $\ln Z_\phi(\theta)$ must be included. The only way to get rid of that normalization term would be to assume a compensating hyperprior

\begin{displaymath}
p(\theta)\propto Z_\phi (\theta)
,
\end{displaymath} (427)

resulting in an error term $E(\theta)$ = $-\ln Z_\phi (\theta)$ compensating $E_N (\theta)$.

Thus, in the general case we have to consider the functional

\begin{displaymath}
E_{\theta,\phi} =
-(\ln P(\phi ),\,N)
+ (P(\phi),\, \Lambda_X )
+ E_\phi (\theta)
+ E_\theta
+\ln Z_\phi (\theta)
.
\end{displaymath} (428)

Here we write $E(\phi\vert\theta)$ = $E_\phi(\theta)$ and $E(\theta)$ = $E_\theta$. The stationarity conditions have the form
\begin{displaymath}
\frac{\delta E_\phi}{\delta \phi}
=
{\bf P}^\prime(\phi)\, {\bf P}^{-1}(\phi)\, N
- {\bf P}^\prime (\phi)\, \Lambda_X
,
\end{displaymath} (429)

\begin{displaymath}
\frac{\partial E_\phi}{\partial \theta}
=
-{\bf Z}^\prime\, Z_\phi^{-1}(\theta)
- E_{\theta}^\prime
,
\end{displaymath} (430)

with
\begin{displaymath}
{\bf Z}^\prime (l,k)
= \delta (l-k)\, \frac{\partial Z_\phi(\theta)}{\partial \theta_l}
,\qquad
E_{\theta}^\prime (l)
=
\frac{\partial E_\theta}{\partial \theta_l}
.
\end{displaymath} (431)

For a compensating hyperprior $ E_\theta = -\ln Z_\phi (\theta)$, i.e., $E_{\theta}^\prime (l)$ = $-Z_\phi^{-1}(\theta)\, \partial Z_\phi(\theta)/\partial \theta_l$, the right hand side of Eq. (430) vanishes.

Finally, we want to remark that in case function evaluation of $p(\phi,\theta)$ is much cheaper than calculation of the gradient (430), minimization methods that do not use the gradient should be considered, for example the downhill simplex method [196].
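A minimal sketch of such a gradient-free minimization (the quadratic toy objective merely stands in for the true, expensive hyperparameter energy):
\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

def total_energy(theta):
    # Placeholder for the hyperparameter objective, i.e. E_{theta,phi}
    # evaluated at the optimal phi for the given theta, including the
    # normalization term ln Z_phi(theta).
    return np.sum((theta - 1.0) ** 2)

# The downhill simplex (Nelder-Mead) method needs only function values,
# no gradients.
result = minimize(total_energy, x0=np.zeros(3), method="Nelder-Mead")
print(result.x)  # approximate minimizer
\end{verbatim}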

