

Local hyperfields

Most, but not all, hyperparameters $\theta$ considered so far have been real or integer numbers, or vectors with real or integer components $\theta_i$. With the unrestricted template functions of Section 5.2.3 or the functions parameterizing the inverse covariance in Section 5.3.3, however, we have already encountered function hyperparameters or hyperfields. In this section we discuss function hyperparameters in more detail.

Functions can be seen as continuous vectors, the function values $\theta(u)$ being the (continuous) analogue of vector components $\theta_i$. In numerical calculations, in particular, functions usually have to be discretized, so that, numerically, functions stand for high-dimensional vectors.

Typical arguments of function hyperparameters are the independent variables $x$ and, for general density estimation, also the dependent variables $y$. Such functions $\theta(x)$ or $\theta(x,y)$ will be called local hyperparameters or local hyperfields. Local hyperfields $\theta(x)$ can be used, for example, to adapt templates or inverse covariances locally. (For general density estimation problems replace here and in the following $x$ by $(x,y)$.)

The price to be paid for the additional flexibility of function hyperparameters is a large number of additional degrees of freedom. This can considerably complicate calculations and requires a sufficient number of training data and/or a sufficiently restrictive hyperprior to determine the hyperfield without rendering the prior useless.

To introduce local hyperparameters $\theta(x)$ we express real symmetric, positive (semi-)definite inverse covariances by square roots or ``filter operators'' ${\bf W}$, ${\bf K}$ = ${\bf W}^T {\bf W}$ = $\int \!dx\; {W}_x{W}^T_x$ where ${W}_x$ represents the vector ${\bf W}(x,\cdot)$. Thus, in components

\begin{displaymath}
{\bf K}(x,x^\prime)
= \int \!dx^{\prime\prime}\;
{\bf W}^T(x,x^{\prime\prime}){\bf W}(x^{\prime\prime},x^{\prime})
,
\end{displaymath} (498)

and therefore
\begin{displaymath}
\Big( \phi-t\, , \,{\bf K} (\phi-t) \Big)
=
\int\! dx\,dx^\prime\,
[\phi(x)-t(x)]
\, {\bf K}(x,x^{\prime}) \,
[\phi(x^{\prime})-t(x^{\prime})]
=
\int\! dx\,dx^\prime\, dx^{\prime\prime}\,
[\phi(x)-t(x)]\,
{\bf W}^T(x,x^{\prime})\,
{\bf W}(x^{\prime},x^{\prime\prime})\,
[\phi(x^{\prime\prime})-t(x^{\prime\prime})]
=
\int \! dx\, \vert\omega (x)\vert^2
,
\end{displaymath} (499)

where we defined the ``filtered differences''
\begin{displaymath}
\omega (x)
=
\big( \, W_x\, ,\, \phi-t \, \big)
=
\int \!dx^\prime \, {\bf W}(x,x^\prime)
[\phi(x^\prime)-t(x^\prime )]
.
\end{displaymath} (500)
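On a discretized grid these definitions are easy to check numerically. The following sketch (the forward-difference matrix standing in for the filter operator ${\bf W}$ is an illustrative assumption) verifies that the quadratic form of Eq. (499) equals the summed squared filtered differences of Eq. (500):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
W = np.eye(n, k=1) - np.eye(n)    # assumed filter operator: forward differences
K = W.T @ W                       # inverse covariance K = W^T W, Eq. (498)
phi = rng.standard_normal(n)      # discretized field phi
t = rng.standard_normal(n)        # discretized template t
omega = W @ (phi - t)             # filtered differences, Eq. (500)
quad = (phi - t) @ K @ (phi - t)  # quadratic form (phi-t, K(phi-t))
assert np.isclose(quad, np.sum(omega**2))  # Eq. (499)
print(True)
```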

Thus, for a Gaussian prior for $\phi $ we have
\begin{displaymath}
p(\phi) \propto
e^{- \frac{1}{2} \big( \phi-t\, , \,{\bf K} (\phi-t) \big)}
=
e^{- \frac{1}{2}\int \!dx \, \vert\omega(x)\vert^2}
.
\end{displaymath} (501)

A real local hyperfield $\theta(x)$ mixing, for instance, locally two alternative filtered differences may now be introduced as follows
\begin{displaymath}
p(\phi\vert\theta) =
e^{-\frac{1}{2}
\int \!dx \left\vert
[1-\theta(x)]\, \omega_1(x)
+\theta(x)\, \omega_2(x)
\right\vert^2
-\ln Z_\phi(\theta)}
,
\end{displaymath} (502)

where
\begin{displaymath}
\omega (x;\theta)
= [1-\theta(x)] \, \omega_1(x) + \theta(x) \,\omega_2(x)
,
\end{displaymath} (503)

and, say, $\theta(x)\in [0,1]$. For unrestricted real $\theta(x)$ an arbitrary real $\omega (x;\theta)$ can be obtained. For a binary local hyperfield with $\theta(x)\in \{0,1\}$ we have $\theta^2$ = $\theta$, $(1-\theta)^2$ = $(1-\theta)$, and $\theta(1-\theta)$ = $0$, so Eq. (502) becomes
\begin{displaymath}
p(\phi\vert\theta) =
e^{-\frac{1}{2}
\int \!dx \left(
[1-\theta(x)] \vert\omega_1(x)\vert^2
+\theta(x) \vert\omega_2(x)\vert^2
\right)
-\ln Z_\phi(\theta)}
.
\end{displaymath} (504)

For real $\theta(x)$ in Eq. (503) terms with $\theta^2(x)$, $[1-\theta(x)]^2$, and $[1-\theta(x)]\theta(x)$ would appear in Eq. (504). A binary $\theta$ variable can be obtained from a real $\theta$ with the help of a step function $\Theta(x)$ and a threshold $\vartheta$ by replacing
\begin{displaymath}
B_\theta(x) = \Theta(\theta(x)-\vartheta)
\rightarrow \theta(x)
.
\end{displaymath} (505)

Clearly, if both prior and hyperprior are formulated in terms of such $B_\theta(x)$ this is equivalent to using directly a binary hyperfield.
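The binary identities behind the step from Eq. (502) to Eq. (504) can be checked numerically. The sketch below (with random assumed vectors standing for $\omega_1$ and $\omega_2$ on a grid) binarizes a real field as in Eq. (505) and compares the two integrands:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
w1 = rng.standard_normal(n)          # stands for omega_1(x) on a grid
w2 = rng.standard_normal(n)          # stands for omega_2(x)
theta = (rng.standard_normal(n) > 0).astype(float)  # binarized field, Eq. (505)
lhs = np.abs((1 - theta) * w1 + theta * w2) ** 2    # integrand of Eq. (502)
rhs = (1 - theta) * w1**2 + theta * w2**2           # integrand of Eq. (504)
# the cross terms vanish because theta(1-theta) = 0 for binary theta
assert np.allclose(lhs, rhs)
print(True)
```

For a real-valued $\theta$ the two integrands would differ by the nonvanishing cross terms, which is exactly the point made in the text.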

For a local hyperfield $\theta(x)$ a local adaption of the functions $\omega (x;\theta)$ as in Eq. (503) can be achieved by switching locally between alternative templates or alternative filter operators ${\bf W}$

\begin{displaymath}
t_x(x^\prime;\theta)
=
[1-\theta(x)] \, t_{1,x}(x^\prime)
+ \theta(x)\, t_{2,x}(x^\prime),
\end{displaymath} (506)
\begin{displaymath}
{\bf W}(x,x^\prime; \theta)
=
[1-\theta(x)] \, {\bf W}_{1}(x,x^\prime)
+ \theta(x) \,{\bf W}_{2}(x,x^\prime).
\end{displaymath} (507)

In Eq. (506) it is important to notice that ``local'' templates $t_x(x^\prime; \theta)$ for fixed $x$ are still functions of an $x^\prime$ variable. Indeed, to obtain $\omega (x;\theta)$, the function $t_x$ is needed for all $x^\prime$ for which ${\bf W}$ has nonzero entries,
\begin{displaymath}
\omega(x;\theta )
=
\int\!dx^\prime\,
{\bf W}(x,x^\prime)
[\phi(x^\prime)-t_x(x^\prime;\theta)]
.
\end{displaymath} (508)

That means that the template is adapted individually for every local filtered difference. Thus, Eq. (506) has to be distinguished from the choice
\begin{displaymath}
t(x^\prime;\theta)
=
[1-\theta(x^\prime )] \, t_{1}(x^\prime)
+ \theta(x^\prime )\, t_{2}(x^\prime)
.
\end{displaymath} (509)

The unrestricted adaption of templates discussed in Sect. 5.2.3, for instance, can be seen as an approach of the form of Eq. (509) with an unbounded real hyperfield $\theta(x)$.
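The distinction between Eqs. (506) and (509) becomes concrete as soon as ${\bf W}$ is nondiagonal. The toy discretization below (all concrete choices are illustrative assumptions) computes the filtered differences both ways and shows that they disagree wherever ${\bf W}$ couples points with different $\theta$ values:

```python
import numpy as np

n = 6
W = np.eye(n, k=1) - np.eye(n)               # nondiagonal filter: couples x and x+1
phi = np.linspace(0.0, 1.0, n)
t1 = np.zeros(n)                             # first template
t2 = np.ones(n)                              # second template
theta = np.array([0, 0, 0, 1, 1, 1], float)  # switch templates in the right half

# Eq. (508) with Eq. (506): for each x the full function t1 or t2 is used.
omega_local = np.array([W[x] @ (phi - (t2 if theta[x] else t1)) for x in range(n)])

# Eq. (509): the template is mixed pointwise in x' before filtering.
t_mix = (1 - theta) * t1 + theta * t2
omega_pointwise = W @ (phi - t_mix)

# The two prescriptions differ where W couples points with different theta.
assert not np.allclose(omega_local, omega_pointwise)
print(True)
```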

Eq. (507) corresponds for binary $\theta$ to an inverse covariance

\begin{displaymath}
{\bf K}(\theta)
= \int \!dx\; {\bf K}_x(\theta)
= \int \!dx \left(
[1-\theta(x)]\, {W}_{1,x}{W}^T_{1,x}
+ \theta(x)\, {W}_{2,x}{W}^T_{2,x}
\right)
,
\end{displaymath} (510)

where ${\bf K}_x(\theta)$ = ${W}_{x}(\theta){W}^T_{x}(\theta)$ and $W_{i,x}$ = ${\bf W}_i(x,\cdot)$, $W_{x}(\theta )$ = ${\bf W}(x,\cdot\,;\theta)$. We remark that $\theta$-dependent inverse covariances require including the normalization factors when integrating over $\theta$ or solving for the optimal $\theta$ in MAP. If we consider two binary hyperfields $\theta$, $\theta^\prime$, one for $t$ and one for ${\bf W}$, we get a prior
\begin{displaymath}
p(\phi\vert\theta,\theta^\prime)
\propto
e^{-\frac{1}{2}
\int\!dx\,
\big( \phi-t_x(\theta)\, , \,{\bf K}_x(\theta^\prime)\,[\phi-t_x(\theta)]\big)
}
\propto
e^{-E(\phi\vert\theta,\theta^\prime)}
.
\end{displaymath} (511)

Up to a $\phi $-independent constant (which still depends on $\theta$, $\theta^\prime$) the corresponding prior energy can again be written in the form
\begin{displaymath}
E(\phi\vert\theta,\theta^\prime)
=
\frac{1}{2}
\Big( \phi-t(\theta,\theta^\prime)\, , \,
{\bf K}(\theta^\prime ) [\phi-t(\theta,\theta^\prime)] \Big)
.
\end{displaymath} (512)

Indeed, the corresponding effective template $t(\theta,\theta^\prime)$ and effective inverse covariance ${\bf K}(\theta^\prime)$ are according to Eqs. (247,252) given by
\begin{displaymath}
t (\theta ,\theta^\prime)
=
{\bf K}(\theta^\prime)^{-1}
\int\!dx\, {\bf K}_x(\theta^\prime ) \, t_x(\theta)
,
\end{displaymath} (513)
\begin{displaymath}
{\bf K}(\theta^\prime)
=
\int \! dx\, {\bf K}_x(\theta^\prime)
.
\end{displaymath} (514)

Hence, one may rewrite
\begin{displaymath}
\int\! dx \, \vert\omega(x;\theta,\theta^\prime)\vert^2
=
\Big( \phi-t(\theta,\theta^\prime),\;
{\bf K}(\theta^\prime) \, [\phi-t(\theta,\theta^\prime)] \Big)
+
\sum_x \Big( t_x(\theta),\;
{\bf K}_x(\theta^\prime) \, t_x(\theta) \Big)
-
\Big( t(\theta,\theta^\prime),\;
{\bf K}(\theta^\prime) \,t(\theta,\theta^\prime) \Big)
.
\end{displaymath} (515)

The MAP solution of Gaussian regression for a prior corresponding to (515) at optimal $\theta^*$, ${\theta^\prime}^*$ is according to Section 3.7 therefore given by
\begin{displaymath}
\phi^* (\theta^*,{\theta^\prime}^*)
=
[{\bf K}_D+{\bf K}({\theta^\prime}^*)]^{-1}
\left({\bf K}_D\, t_D
+ {\bf K}({\theta^\prime}^*)\,
t(\theta^*,{\theta^\prime}^*)\right)
.
\end{displaymath} (516)

One may avoid dealing with ``local'' templates $t_x(\theta)$ by adapting templates in prior terms where ${\bf K}$ is equal to the identity ${\bf I}$. In that case $(t_{0})_{x}(x^\prime;\theta)$ is only needed for $x$ = $x^\prime$, so we may directly write $(t_{0})_{x}(x^\prime;\theta)$ = $t_{0}(x^\prime;\theta)$. As an example, consider the following prior energy, where the $\theta$-dependent template appears in a term with ${\bf K}$ = ${\bf I}$ and another, say smoothness, prior is added with zero template

\begin{displaymath}
E(\phi\vert\theta) =
\frac{1}{2}
\Big( \phi-t_0(\theta),\, \phi-t_0(\theta) \Big)
+\frac{1}{2}
\Big( \phi, \, {\bf K}_0\, \phi \Big)
.
\end{displaymath} (517)

Combining both terms yields
\begin{displaymath}
E(\phi\vert\theta)
=
\frac{1}{2}
\bigg(
\Big(\phi-t(\theta),\, {\bf K}\, [\phi-t(\theta)] \Big)
+
\Big( t_0(\theta),\,
\left({\bf I}-{\bf K}^{-1}\right) t_0(\theta) \Big)
\bigg)
,
\end{displaymath} (518)

with effective template and effective inverse covariance
\begin{displaymath}
t(\theta) = {\bf K}^{-1} t_0(\theta)
,\quad
{\bf K} = {\bf I}+ {\bf K}_0
.
\end{displaymath} (519)

For differential operators ${\bf W}$ the effective $t(\theta)$ is thus a smoothed version of $t_0(\theta)$.
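The algebra behind Eqs. (517)-(519) is quickly verified on a random discretization (an assumed toy setup): completing the square in Eq. (517) indeed yields the effective template $t = {\bf K}^{-1} t_0$ with ${\bf K} = {\bf I} + {\bf K}_0$:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
W0 = rng.standard_normal((n, n))
K0 = W0.T @ W0                    # smoothness inverse covariance K_0
t0 = rng.standard_normal(n)       # theta-dependent template (theta held fixed here)
K = np.eye(n) + K0                # effective inverse covariance, Eq. (519)
t = np.linalg.solve(K, t0)        # effective (smoothed) template, Eq. (519)

phi = rng.standard_normal(n)
E517 = 0.5 * (phi - t0) @ (phi - t0) + 0.5 * phi @ K0 @ phi
E518 = 0.5 * ((phi - t) @ K @ (phi - t)
              + t0 @ (np.eye(n) - np.linalg.inv(K)) @ t0)
assert np.isclose(E517, E518)     # Eqs. (517) and (518) agree for any phi
print(True)
```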

The extreme case would be to treat $t$ and ${\bf W}$ themselves as unrestricted hyperparameters. Notice, however, that increasing flexibility tends to lower the influence of the corresponding prior term. That means that using completely free templates and covariances without introducing additional restricting hyperpriors just eliminates the corresponding prior term (see Section 5.2.3).

Hence, to restrict the flexibility, a smoothness hyperprior is typically imposed to prevent highly oscillating functions $\theta(x)$. For real $\theta(x)$, for example, a smoothness prior like $(\theta, -\Delta \theta)$ can be used in regions where it is defined. (The space of $\phi $-functions for which a smoothness prior $(\phi-t,\,{\bf K}(\phi-t))$ with discontinuous $t(\theta)$ is defined depends on the locations of the discontinuities.) An example of a non-Gaussian hyperprior is

\begin{displaymath}
p(\theta) \propto
e^{-\frac{\kappa}{2} \int\!dx \, C_\theta(x)}
,
\end{displaymath} (520)

where $\kappa$ is some constant and
\begin{displaymath}
C_\theta(x) =
\Theta \left( \left(\frac{\partial \theta}{\partial x}\right)^2
- \vartheta_\theta\right)
\end{displaymath} (521)

is zero at locations where the square of the first derivative is smaller than a certain threshold $0\le \vartheta_\theta < \infty$, and one otherwise. (The step function $\Theta$ is defined by $\Theta(x)$ = 0 for $x\le 0$ and $\Theta(x)$ = 1 for $x>0$.) To enable differentiation, the step function $\Theta$ could be replaced by a sigmoidal function. For discrete $x$ one can analogously count the number of jumps larger than a given threshold. Similarly, one may penalize the number $N_d(\theta)$ of discontinuities where $\left(\frac{\partial \theta}{\partial x}\right)^2$ = $\infty $ and use
\begin{displaymath}
p(\theta) \propto e^{-\frac{\kappa}{2} N_d(\theta)}
.
\end{displaymath} (522)

In the case of a binary field this corresponds to counting the number of times the field changes its value.
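For a discretized binary hyperfield, $N_d$ of Eq. (522) thus reduces to counting the positions where the field changes its value, as in this minimal sketch:

```python
import numpy as np

def n_discontinuities(theta):
    """Count positions where a discretized (binary) field changes its value."""
    theta = np.asarray(theta)
    return int(np.sum(theta[1:] != theta[:-1]))

theta = np.array([0, 0, 1, 1, 1, 0, 0, 1])
print(n_discontinuities(theta))  # 3 changes: 0->1, 1->0, 0->1
```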

The expression $C_\theta$ of Eq. (521) can be generalized to

\begin{displaymath}
C_\theta(x)
= \Theta\left( \vert\omega_\theta(x)\vert^2-\vartheta_\theta\right)
,
\end{displaymath} (523)

where, analogously to Eq. (500),
\begin{displaymath}
\omega_\theta(x)
=
\int \!dx^\prime \,
{\bf W}_\theta(x,x^\prime)
[\theta(x^\prime)-t_\theta(x^\prime)]
,
\end{displaymath} (524)

with ${\bf W}_\theta$ some filter operator acting on the hyperfield and $t_\theta(x^\prime)$ a template for the hyperfield.

Discontinuous functions $\phi $ can either be approximated by using discontinuous templates $t(x;\theta)$ or by eliminating matrix elements of the inverse covariance which connect the two sides of the discontinuity. For example, consider the discrete version of a negative Laplacian with periodic boundary conditions,

\begin{displaymath}
{\bf K} = {\bf W}^T {\bf W} =
\left(
\begin{array}{cccccc}
2 & -1 & 0 & 0 & 0 & -1 \\
-1 & 2 & -1 & 0 & 0 & 0 \\
0 & -1 & 2 & -1 & 0 & 0 \\
0 & 0 & -1 & 2 & -1 & 0 \\
0 & 0 & 0 & -1 & 2 & -1 \\
-1 & 0 & 0 & 0 & -1 & 2 \\
\end{array}
\right),
\end{displaymath} (525)

and a possible square root,
\begin{displaymath}
{\bf W} =
\left(
\begin{array}{cccccc}
-1 & 1 & 0 & 0 & 0 & 0 \\
0 & -1 & 1 & 0 & 0 & 0 \\
0 & 0 & -1 & 1 & 0 & 0 \\
0 & 0 & 0 & -1 & 1 & 0 \\
0 & 0 & 0 & 0 & -1 & 1 \\
1 & 0 & 0 & 0 & 0 & -1 \\
\end{array}
\right)
.
\end{displaymath} (526)

The first three points can be disconnected from the last three points by setting ${\bf W}(3,\cdot)$ and ${\bf W}(6,\cdot)$ to zero, namely,
\begin{displaymath}
\tilde {\bf W} =
\left(
\begin{array}{ccc|ccc}
-1 & 1 & 0 & 0 & 0 & 0 \\
0 & -1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & 1 & 0 \\
0 & 0 & 0 & 0 & -1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 \\
\end{array}
\right)
,
\end{displaymath} (527)

so that the smoothness prior with inverse covariance
\begin{displaymath}
\tilde {\bf K} = \tilde {\bf W}^T \tilde {\bf W} =
\left(
\begin{array}{cccccc}
1 & -1 & 0 & 0 & 0 & 0 \\
-1 & 2 & -1 & 0 & 0 & 0 \\
0 & -1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & -1 & 0 \\
0 & 0 & 0 & -1 & 2 & -1 \\
0 & 0 & 0 & 0 & -1 & 1 \\
\end{array}
\right)
,
\end{displaymath} (528)

is ineffective between points from different regions. In contrast to using discontinuous templates, the height of the jump at the discontinuity does not have to be given in advance when using such disconnected Laplacians (or other inverse covariances). On the other hand, training data are then required for all separated regions to determine the free constants which correspond to the zero modes of the Laplacian.
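The disconnection construction of Eqs. (525)-(528) can be checked directly on the $6\times 6$ grid: zeroing rows 3 and 6 of the periodic difference operator decouples the two halves, so a jump between the regions costs no prior energy.

```python
import numpy as np

n = 6
# Periodic forward differences: row i has -1 at i and +1 at i+1 (mod n).
W = np.roll(np.eye(n), 1, axis=1) - np.eye(n)
K = W.T @ W                              # periodic negative Laplacian, Eq. (525)
assert np.allclose(np.diag(K), 2.0)

W_tilde = W.copy()
W_tilde[[2, 5], :] = 0.0                 # zero W(3,.) and W(6,.), Eq. (527)
K_tilde = W_tilde.T @ W_tilde            # block-diagonal K-tilde, Eq. (528)
assert np.allclose(K_tilde[:3, 3:], 0.0) # the two regions are decoupled

# A field with a jump between the regions is not penalized by K-tilde ...
phi = np.array([0.0, 0.0, 0.0, 5.0, 5.0, 5.0])
assert np.isclose(phi @ K_tilde @ phi, 0.0)
# ... while the original periodic Laplacian does penalize it.
assert phi @ K @ phi > 0.0
print(True)
```

This also illustrates the zero modes mentioned above: each disconnected block annihilates constant fields, so a free constant per region remains to be fixed by training data.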

Non-Gaussian priors, which will be discussed in more detail in the next Section, often provide an alternative to the use of function hyperparameters. Similarly to Eq. (521) one may for example define a binary function $B(x)$ in terms of $\phi $,

\begin{displaymath}
B(x) =
\Theta \left(\vert\omega_1(x)\vert^2-\vert\omega_2(x)\vert^2 - \vartheta \right)
,
\end{displaymath} (529)

like, for a negative Laplacian prior,
\begin{displaymath}
B(x) =
\Theta \left(
\left\vert\frac{\partial (\phi-t_1)}{\partial x}\right\vert^2
- \left\vert\frac{\partial (\phi-t_2)}{\partial x}\right\vert^2
- \vartheta
\right)
.
\end{displaymath} (530)

Here $B(x)$ is directly determined by $\phi $ and is not considered as an independent hyperfield. Notice also that the functions $\omega_i(x)$ and $B(x)$ may be nonlocal with respect to $\phi(x)$, meaning they may depend on more than one value of $\phi$. The threshold $\vartheta$ has to be related to the prior expectations on the $\omega_i$. A possible non-Gaussian prior for $\phi $ formulated in terms of $B$ is
\begin{displaymath}
p(\phi) \propto
e^{-\frac{1}{2} \int\!dx\,
\left(
\vert\omega_1(x)\vert^2 [1-B(x)]
+\vert\omega_2(x)\vert^2 B(x)
\right)
-\frac{\kappa}{2} N_d(B)
}
,
\end{displaymath} (531)

with $N_d(B)$ counting the number of discontinuities of $B(x)$. Alternatively to $N_d$ one may for a real $B$ define, similarly to (523),
\begin{displaymath}
C(x)
= \Theta\left( \vert\omega_B(x)\vert^2-\vartheta_B\right)
,
\end{displaymath} (532)

with
\begin{displaymath}
\omega_B(x)
=
\int \!dx^\prime \,
{\bf W}_B(x,x^\prime)
[B(x^\prime)-t_B(x^\prime)]
,
\end{displaymath} (533)

and some filter operator ${\bf W}_B$ and template $t_B$. Similarly to the introduction of hyperparameters, one can treat $B(x)$ formally as an independent function by including a term $\lambda \left( B(x)-\Theta \left(\vert\omega_1(x)\vert^2-\vert\omega_2(x)\vert^2 - \vartheta \right) \right)$ in the prior energy and taking the limit $\lambda\rightarrow \infty$.

Eq. (531) looks similar to the combination of the prior (504) with the hyperprior (522),

\begin{displaymath}
p(\phi,\theta) \propto
e^{-\frac{1}{2} \int\!dx\,
\left(
\vert\omega_1(x)\vert^2 [1-B_\theta(x)]
+\vert\omega_2(x)\vert^2 B_\theta(x)
\right)
-\frac{\kappa}{2} N_d(B_\theta)
-\ln Z_\phi(\theta)
}
.
\end{displaymath} (534)

Notice, however, that the definition (505) of the hyperfield $B_\theta$ (and of $N_d(B_\theta)$ or $C_\theta$, respectively) differs from that of $B$ (and $N_d(B)$ or $C$), which are direct functions of $\phi $. If the $\omega_i$ differ only in their templates, the normalization term can be skipped. Then, identifying $B_\theta$ in (534) with a binary $\theta$ and assuming $\vartheta$ = $0$, $\vartheta_\theta$ = $\vartheta_B$, ${\bf W}_\theta$ = ${\bf W}_B$, the two equations are equivalent for $\theta(x)$ = $\Theta\left(\vert\omega_1(x)\vert^2-\vert\omega_2(x)\vert^2\right)$. In the absence of hyperpriors it is indeed easily seen that this is a self-consistent solution for $\theta$, given $\phi $. In general, however, when hyperpriors are included, another solution for $\theta$ may have a larger posterior. Non-Gaussian priors will be discussed in Section 6.5.

Hyperpriors or non-Gaussian prior terms are useful to enforce specific global constraints for $\theta(x)$ or $B(x)$. In images, for example, discontinuities are expected to form closed curves. Hyperpriors, organizing discontinuities along lines or closed curves, are thus important for image segmentation [70,153,66,67,238,247].


Joerg_Lemm 2001-01-21