next up previous contents
Next: Quadratic density estimation and Up: Gaussian prior factors Previous: Example: Approximate periodicity   Contents


Non-zero means

A prior energy term $(1/2)(\phi,\, {{\bf K}}\,\phi)$ measures the squared ${{\bf K}}$-distance of $\phi $ to the zero function $t\equiv 0$. Choosing a zero mean function for the prior process is computationally convenient for Gaussian priors, but by no means mandatory. In particular, a function $\phi $ is in practice often measured relative to some non-trivial base line. Without further a priori information that base line can in principle be an arbitrary function. If a zero mean function is chosen, that base line does not enter the formulae and remains hidden in the realization of the measurement process. On the other hand, explicitly including a non-zero mean function $t$, which plays the role of a function ${\it template}$ (or reference, target, prototype, base line), is technically relatively straightforward and can be a very powerful tool. It allows one, for example, to parameterize $t(\theta)$ by introducing hyperparameters (see Section 5) and to specify explicitly different maxima of multimodal functional priors (see Section 6 [132,133,134,135,136]). None of this can be done by referring to a single base line.

Hence, in this section we consider error terms of the form

\begin{displaymath}
\frac{1}{2} \Big(\phi - t,\, {{\bf K}}\,(\phi - t)\,\Big)
.
\end{displaymath} (225)

Mean or template functions $t$ allow an easy and straightforward implementation of prior information in the form of examples for $\phi $. They are the continuous analogue of standard training data. That template functions $t$ are most often chosen equal to zero, and thus do not appear explicitly in the error functional, should not obscure the fact that they are of key importance for any generalization. There are many situations where it can be very valuable to include non-zero prior means explicitly. Template functions for $\phi $ can, for example, result from learning done in the past for the same or for similar tasks. In particular, let $\tilde \phi(x)$ be the output of an empirical learning system (neural net, decision tree, nearest neighbor methods, $\ldots$) that has been trained on the same or a similar task. Such a $\tilde \phi(x)$ would be a natural candidate for a template function $t(x)$. Template functions can thus be used, for example, to transfer knowledge between similar tasks, or to include the results of earlier learning on the same task in case the original data are lost but the output of another learning system is still available.
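To make the role of a template concrete, the following is a minimal numerical sketch and not part of the formalism above. It assumes a simple Gaussian regression likelihood instead of the density estimation functional treated in this section, a grid discretization of $\phi$, and a hypothetical inverse covariance `K` (discrete negative Laplacian plus a small mass term). The array `t` stands for the output of some earlier learning system; all names and numbers are illustrative assumptions.

```python
import numpy as np

n = 100
x = np.linspace(0.0, 1.0, n)

# Hypothetical output of an earlier learning system, used as template t(x).
t = np.sin(2 * np.pi * x)

# Assumed prior operator K: discrete negative Laplacian plus a small mass
# term, which makes K symmetric and positive definite.
lap = -2.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
K = 10.0 * (-lap) + 0.1 * np.eye(n)

# A few noisy observations y_i of phi at the grid indices idx (assumed data).
idx = np.array([10, 40, 70])
y = np.array([1.0, 0.2, -0.5])
sigma2 = 0.01  # assumed noise variance

# D picks the observed grid points out of phi.
D = np.zeros((len(idx), n))
D[np.arange(len(idx)), idx] = 1.0

# Minimizer of (1/(2 sigma2)) |y - D phi|^2 + (1/2)(phi - t, K (phi - t)):
# (K + D^T D / sigma2) phi = K t + D^T y / sigma2.
A = K + D.T @ D / sigma2
phi_map = np.linalg.solve(A, K @ t + D.T @ y / sigma2)

# Same problem after the substitution phi' = phi - t (zero template),
# fitted to the shifted data y - t(x_i), then shifted back.
phi_shifted = t + np.linalg.solve(A, D.T @ (y - t[idx]) / sigma2)
```

Away from the three data points `phi_map` is pulled towards `t` rather than towards zero, and `phi_shifted` reproduces the same solution, which illustrates that for a single quadratic prior term the template can be absorbed into a shift of $\phi$.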

Including non-zero template functions generalizes functional $E_\phi$ of Eq. (187) to

\begin{displaymath}
E_\phi = -(\ln P(\phi),\,N)
+\frac{1}{2} \Big(\phi-t,\,{{\bf K}}\,(\phi-t)\Big)
+ (P(\phi),\, \Lambda_X )
\end{displaymath} (226)

\begin{displaymath}
\phantom{E_\phi} = -(\ln P(\phi),\,N)
+\frac{1}{2} (\phi,\,{{\bf K}}\,\phi)
-(J,\,\phi)
+(P(\phi),\, \Lambda_X )
+{\rm const}
.
\end{displaymath} (227)

In the language of physics, $J$ = ${{\bf K}} t$ represents an external field coupling to $\phi(x,y)$, similar, for example, to a magnetic field. A non-zero field leads to a non-zero expectation of $\phi $ in the no-data case. The $\phi $-independent constant stands for the term $\frac{1}{2} (t,\,{{\bf K}}\,t)$, or $\frac{1}{2} (J,\,{{\bf K}}^{-1}\,J)$ for invertible ${{\bf K}}$, which arises when the squared distance is expanded using the symmetry of ${{\bf K}}$. It can be dropped from the error/energy functional $E_\phi$.
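The expansion of the prior term and the two forms of the constant can be checked numerically. The following is only a sketch under stated assumptions: $\phi$ is discretized on a grid, and the operator `K` (discrete negative Laplacian plus a mass term) and the template `t` are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Assumed symmetric, positive definite K on a grid (illustrative choice).
lap = -2.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
K = -lap + 0.5 * np.eye(n)

x = np.linspace(0.0, 1.0, n)
t = np.sin(2 * np.pi * x)   # illustrative template
phi = rng.normal(size=n)    # arbitrary test function

# External field J = K t.
J = K @ t

# Squared K-distance to the template and its expanded form.
lhs = 0.5 * (phi - t) @ K @ (phi - t)
const = 0.5 * t @ K @ t
rhs = 0.5 * phi @ K @ phi - J @ phi + const

# For invertible K the constant can also be written as (1/2)(J, K^{-1} J).
const_via_J = 0.5 * J @ np.linalg.solve(K, J)

# No-data case: minimizing (1/2)(phi, K phi) - (J, phi) gives K phi = J,
# so the minimizer is the template itself.
phi_no_data = np.linalg.solve(K, J)
```

Here `lhs` and `rhs` agree up to rounding, and `phi_no_data` coincides with `t`, the discrete counterpart of the statement that a non-zero field yields a non-zero expectation of $\phi$ in the absence of data.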

The stationarity equation for an $E_\phi$ with non-zero template $t$ contains an inhomogeneous term ${{\bf K}} t$ = $J$:

\begin{displaymath}
0 =
{\bf P}^\prime (\phi) {\bf P}^{-1}(\phi) N
- {\bf P}^\prime (\phi) \Lambda_X
- {{\bf K}}\left( \phi - t\right)
,
\end{displaymath} (228)

with, for invertible ${\bf P} {{\bf P}^\prime}^{-1}$ and $\Lambda_X\ne 0$,
\begin{displaymath}
\Lambda_X =
{\bf I}_X \left( N - {\bf P} {{\bf P}^\prime}^{-1} {{\bf K}}\, (\phi-t) \right)
.
\end{displaymath} (229)

Notice that functional (226) can be rewritten as a functional with zero template $t\equiv 0$ in terms of $\widetilde \phi$ = $\phi - t$. That is the reason why we have not included non-zero templates in the previous sections. For general non-additive combinations of squared distances of the form (225), non-zero templates cannot be removed from the functional, as we will see in Section 6. Additive combinations of squared error terms, on the other hand, can again be written as one squared error term, using a generalized `bias-variance'-decomposition
\begin{displaymath}
\frac{1}{2}
\sum_{j=1}^N \Big( \phi - t_j,\, {{\bf K}}_j\,(\phi - t_j)\Big)
= \frac{1}{2}\Big(\phi - t,\, {{\bf K}}\, (\phi - t)\Big)
+ E_{\rm min}
\end{displaymath} (230)

with template average
\begin{displaymath}
t = {{\bf K}}^{-1} \sum_{j=1}^N {{\bf K}}_j t_j,
\end{displaymath} (231)

assuming the existence of the inverse of the operator
\begin{displaymath}
{{\bf K}} = \sum_{j=1}^N {{\bf K}}_j,
\end{displaymath} (232)

and minimal energy/error
\begin{displaymath}
E_{\rm min} = \frac{N}{2} V(t_1,\ldots, t_N)
= \frac{1}{2}
\Big( \sum_{j=1}^N (t_j,\, {{\bf K}}_j\,t_j)
- (t,\, {{\bf K}}\,t) \Big),
\end{displaymath} (233)

which, up to a factor $N/2$, represents a generalized template variance $V$. We end with the remark that adding error terms corresponds in its probabilistic Bayesian interpretation to ANDing independent events. For example, if we wish to implement that $\phi $ is likely to be smooth AND mirror symmetric, we may add two squared error terms, one related to smoothness and another to mirror symmetry. According to (230) the result will be a single squared error term of the form (225).
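The generalized `bias-variance'-decomposition (230) can be verified numerically in a finite-dimensional setting. The sketch below assumes randomly generated symmetric positive definite matrices `Ks[j]` and random templates `ts[j]`; these, as well as the helper `random_spd`, are hypothetical and serve only as an illustration. The minimal energy is taken as half the difference between $\sum_j (t_j,\,{{\bf K}}_j\,t_j)$ and $(t,\,{{\bf K}}\,t)$, which is the form for which (230) holds.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 6, 3  # grid size and number of squared error terms (illustrative)

def random_spd(rng, n):
    """Hypothetical helper: random symmetric positive definite matrix."""
    a = rng.normal(size=(n, n))
    return a @ a.T + n * np.eye(n)

Ks = [random_spd(rng, n) for _ in range(N)]
ts = [rng.normal(size=n) for _ in range(N)]
phi = rng.normal(size=n)

# Summed operator (232) and template average (231).
K = sum(Ks)
t = np.linalg.solve(K, sum(Kj @ tj for Kj, tj in zip(Ks, ts)))

# Minimal energy (233): half the difference of the two quadratic forms.
E_min = 0.5 * (sum(tj @ Kj @ tj for Kj, tj in zip(Ks, ts)) - t @ K @ t)

# Both sides of the decomposition (230) for an arbitrary phi.
lhs = 0.5 * sum((phi - tj) @ Kj @ (phi - tj) for Kj, tj in zip(Ks, ts))
rhs = 0.5 * (phi - t) @ K @ (phi - t) + E_min

# E_min is the value of the summed error at its minimizer phi = t.
lhs_at_t = 0.5 * sum((t - tj) @ Kj @ (t - tj) for Kj, tj in zip(Ks, ts))
```

Since `lhs_at_t` equals `E_min` and is a sum of non-negative terms, $E_{\rm min}$ is non-negative and vanishes only if all templates $t_j$ coincide, in line with its interpretation as a template variance.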

Summarizing, we have seen that there are many potentially useful applications of non-zero template functions. Technically, however, non-zero template functions can be removed from the formalism by a simple substitution $\phi^\prime = \phi-t$ if the error functional consists of an additive combination of quadratic prior terms. Most regularized error functionals used in practice have additive prior terms, which is probably the reason that they are formulated for $t\equiv 0$. Non-zero template functions (base lines) then have to be treated by a preprocessing step that switches from $\phi $ to $\phi^\prime$. We will see in Section 6 that for general error functionals templates cannot be removed by a simple substitution and do enter the error functionals explicitly.


Joerg_Lemm 2001-01-21