
An Example: Regression

Before applying the Bayesian framework to quantum theory, we briefly present one of its standard applications: the case of (Gaussian) regression. (For more details see, for example, [34].) This also provides an example of the relation between the Bayesian maximum posterior approximation and the minimization of regularized error functionals.

A regression model is a model with Gaussian likelihoods,

\begin{displaymath}
p(x\vert c,h) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-h(c))^2}{2\sigma^2}}
,
\end{displaymath} (6)

with fixed variance $\sigma^2$. The function $h(c)$ is known as the regression function. (In regression, one often writes $y$ for the dependent variable, which is $x$ in our notation, and $x$ for the ``condition'' $c$.) Our aim is to determine an approximation for $h(c)$ using observational data $D$ = $\{(x_i,c_i)\vert 1\le i\le n\}$. Within a parametric approach one searches for an optimal approximation in a space of parameterized regression functions $h(c;\xi)$. For example, in the simple cases of constant or linear regression such a parameterization would be $h(c;\xi)$ = $\xi$ or $h(c;\xi_1,\xi_2)$ = $\xi_2 c +\xi_1$, respectively. If the parameterization is restrictive enough, a prior term $p_0(h)$ is not needed, and maximizing the likelihood over all data $D$ is then equivalent to minimizing the squared error,
\begin{displaymath}
E_{\rm sq} (\xi) = \sum_i^n \Big( x_i-h(c_i;\xi) \Big)^2
.
\end{displaymath} (7)
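As an illustration of the parametric case, the following minimal Python sketch (not part of the original text; the data values are hypothetical) fits the constant and linear models above by minimizing the squared error (7), which for fixed $\sigma^2$ is equivalent to maximizing the Gaussian likelihood (6).
\begin{verbatim}
# Minimal sketch (hypothetical data): maximum-likelihood fitting of the
# parametric regression functions h(c;xi) = xi and h(c;xi1,xi2) = xi2*c + xi1
# by minimizing the squared error E_sq of Eq. (7).
import numpy as np

c = np.array([0.5, 1.0, 2.0])   # conditions c_i
x = np.array([1.1, 0.9, 2.1])   # observed values x_i

# Constant model h(c;xi) = xi: the least-squares optimum is the sample mean.
xi_const = x.mean()

# Linear model h(c;xi1,xi2) = xi2*c + xi1: solve the normal equations
# with a standard least-squares routine.
A = np.stack([np.ones_like(c), c], axis=1)        # design matrix [1, c]
(xi1, xi2), *_ = np.linalg.lstsq(A, x, rcond=None)

print("constant fit: h(c) =", xi_const)
print("linear fit:   h(c) =", xi2, "* c +", xi1)
\end{verbatim}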

There are, however, also very flexible parametric approaches, which usually do require additional a priori information. An example of such a nonlinear one-parameter family has been given by Vapnik and is shown in Fig. 1. Without additional a priori information, which may for example restrict the number of oscillations, such functions cannot in most cases be expected to lead to useful predictions. Nonparametric approaches, which treat the numbers $h(c)$ as single degrees of freedom, are even more flexible and always require a prior $p_0(h)$. For nonparametric approaches such a prior can be formulated in terms of the function values $h(c)$. A technically very convenient choice is a Gaussian process prior [48,49],
\begin{displaymath}
p_0(h)
=
\left(\det \frac{\lambda}{2\pi}{\bf K}_0 \right)^{\frac{1}{2}}
e^{-\frac{\lambda}{2}
<\!h-h_0\,\vert\,{\bf K}_0\,\vert\,h-h_0\!>}
,
\end{displaymath} (8)

with mean $h_0$, representing a reference or template for the regression function $h$, and inverse covariance $\lambda {\bf K}_0$, where ${\bf K}_0$ is a real symmetric, positive (semi-)definite operator acting on functions $h$ and $\lambda$ is a scale factor. The operator ${\bf K}_0$ defines the scalar product,
\begin{displaymath}
<\!h-h_0\,\vert\,{\bf K}_0\,\vert\,h-h_0\!>
=
\int \!dc \, dc^\prime \,
[h(c)-h_0(c)]
\, {\bf K}_0(c,c^\prime) \,
[h(c^\prime )-h_0(c^\prime)]
,
\end{displaymath} (9)

Typical priors require the regression function $h$ to be smooth. Such smoothness priors are implemented by choosing differential operators for ${\bf K}_0$. For example, taking for ${\bf K}_0$ the negative Laplacian and choosing a zero mean, $h_0 = 0$, yields
\begin{displaymath}
<\!h-h_0\,\vert\,{\bf K}_0\,\vert\,h-h_0\!>
=
- \int \!dc \,
h(c)\,\frac{\partial^2 h(c)}{\partial c^2}
=
\int \!dc \,
\left(\frac{\partial h(c)}{\partial c}
\right)^2
,
\end{displaymath} (10)

where we integrated by parts, assuming vanishing boundary terms. In statistics one often prefers inverse covariance operators with higher derivatives to obtain smoother regression functions [16,50-55]. An example of such a prior with higher derivatives is a ``Radial Basis Functions'' prior with the pseudo-differential operator ${\bf K}_0$ = $\exp{(-{\sigma_{\rm RBF}^2}\,\Delta/2)}$ as inverse covariance, $\Delta$ denoting the Laplacian.
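To make the smoothness prior concrete, here is a small Python sketch (my own illustration, assuming a one-dimensional $c$ on a uniform grid and $h_0 = 0$): with ${\bf K}_0$ discretized as a finite-difference negative Laplacian, the quadratic form of Eq. (10) reduces to a sum of squared first differences.
\begin{verbatim}
# Minimal sketch (assumptions: one-dimensional c, uniform grid, h_0 = 0):
# the negative Laplacian of Eq. (10) as a finite-difference matrix K_0,
# so that <h|K_0|h> becomes a sum over squared first differences.
import numpy as np

N, dc = 100, 0.1                  # grid size and spacing (hypothetical)
c = np.arange(N) * dc
h = np.sin(c)                     # some test function h(c)

# K_0 = -Laplacian: tridiagonal, 2/dc^2 on the diagonal, -1/dc^2 off-diagonal
K0 = (2*np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)) / dc**2

quad_form = dc * h @ K0 @ h             # discretized <h | K_0 | h>
grad_sq   = np.sum(np.diff(h)**2) / dc  # discretized integral of (dh/dc)^2

print(quad_form, grad_sq)               # agree up to boundary terms
\end{verbatim}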

Maximizing the posterior $p(h\vert D)$ for a Gaussian prior (8) is equivalent to minimizing the regularized error functional

\begin{displaymath}
E_{\rm reg} (h)
=
\frac{1}{2}\sum_i^n \Big( x_i-h(c_i) \Big)^2
+\frac{\lambda^\prime}{2}<\!h-h_0\,\vert\,{\bf K}_0\,\vert\,h-h_0\!>
,
\end{displaymath} (11)

The ``regularization'' parameter $\lambda^\prime$ = $\lambda \sigma^2$, representing a so-called hyperparameter, controls the balance between empirical data and a priori information. In a Bayesian framework one can include a hyperprior $p(\lambda)$ and either integrate over $\lambda$ or determine an optimal $\lambda$ in maximum posterior approximation [22,26]. Alternative ways to determine $\lambda$ are crossvalidation techniques [16], the discrepancy method, and the self-consistent method [56]. For example, in the case of a smoothness prior a larger $\lambda^\prime$ results in a smoother regression function $h^*$. It is typical for regression that the regularized error functional $E_{\rm reg} (h)$ is quadratic in $h$. It is therefore easily minimized by setting the functional derivative with respect to $h$ to zero, i.e., $\delta E_{\rm reg}/\delta h$ = $\delta_h E_{\rm reg}$ = 0. This stationarity equation is linear in $h$ and thus has a unique solution $h^*$. (This is equivalent to so-called kernel methods with kernel ${\bf K}_0^{-1}$. It is specific to regression with a Gaussian prior that, given ${\bf K}_0^{-1}$, only an $n$-dimensional equation has to be solved to obtain $h^*$; see the sketch following Eq. (12).) As the resulting maximum posterior solution $p(x\vert c,h^*)$ is Gaussian by definition, we find for its mean
\begin{displaymath}
\int \!dx \,x\, p(x\vert c,h^*)
= h^*(c)
.
\end{displaymath} (12)
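The $n$-dimensional equation mentioned above can be written down explicitly. Setting $\delta_h E_{\rm reg} = 0$ gives, for $h_0 = 0$, $h^*(c) = \sum_i {\bf K}_0^{-1}(c,c_i)\, b_i$ with coefficients $b$ obeying $({\bf K} + \lambda^\prime {\bf 1})\, b = x$, where ${\bf K}_{ij} = {\bf K}_0^{-1}(c_i,c_j)$. The following Python sketch (my own illustration, not the paper's code) solves this system, using a Gaussian (RBF) kernel as a stand-in for ${\bf K}_0^{-1}$ and hypothetical data.
\begin{verbatim}
# Minimal sketch (hypothetical data, RBF kernel assumed for K_0^{-1}, h_0 = 0):
# solve the n-dimensional system (K + lambda' 1) b = x, K_ij = K_0^{-1}(c_i,c_j),
# and evaluate the maximum posterior regression function
#   h*(c) = sum_i K_0^{-1}(c, c_i) b_i .
import numpy as np

def kernel(a, b, width=1.0):
    """Stand-in for K_0^{-1}(c, c'): a Gaussian (RBF) kernel."""
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * width**2))

c_data = np.array([0.5, 1.0, 2.0])   # training conditions c_i
x_data = np.array([1.1, 0.9, 2.1])   # training outcomes x_i
lam = 0.1                            # regularization parameter lambda'

K = kernel(c_data, c_data)
b = np.linalg.solve(K + lam * np.eye(len(c_data)), x_data)   # n-dimensional solve

c_test = np.linspace(0.0, 3.0, 7)    # conditions where predictions are wanted
h_star = kernel(c_test, c_data) @ b  # h*(c) at the test conditions
print(h_star)
\end{verbatim}
In this sketch a larger $\lambda^\prime$ pulls the solution towards the reference $h_0 = 0$, i.e., towards a more strongly regularized fit.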

It is not difficult to check that, for regression with a Gaussian prior, $h^*(c)$ is also equal to the mean $\int \!dx\, x\, p(x\vert c,D)$ of the exact predictive density (3). Furthermore it can be shown that, in order to minimize the expected squared error $[x-h(c)]^2$ for (future) test data, it is optimal to predict the outcome $x$ = $h^*(c)$ for situation $c$.
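The last statement follows from a standard argument, added here for completeness: the expected squared error of a prediction $a$ under the predictive density is stationary when
\begin{displaymath}
0
=
\frac{\partial}{\partial a}
\int \!dx \, (x-a)^2 \, p(x\vert c,D)
=
-2 \int \!dx \, (x-a) \, p(x\vert c,D)
\quad\Rightarrow\quad
a = \int \!dx \, x \, p(x\vert c,D)
,
\end{displaymath}
so the optimal prediction is the mean of the predictive density, which, as stated above, coincides with $h^*(c)$.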

Figure 1: Examples of parametric regression functions with increasing flexibility (with 3 data points). L.h.s.: A fitted constant $h(c)$ = $\xi$. Middle: A linear $h(c)$ = $\xi_2 c +\xi_1$. R.h.s.: The function $h(c)$ = $\sin (\xi c)$ can fit an arbitrary number of data points (with different $c_i\ne 0$ and $\vert x_i\vert\le 1$) well [17]. Additional a priori information becomes especially important for flexible approaches.
\begin{figure}\begin{center}
\epsfig{file=const.eps, width= 40mm}\epsfig{file=lin.eps, width= 40mm}\epsfig{file=sin1.eps, width= 40mm}\end{center}\end{figure}

In the following sections we apply the Bayesian formalism to quantum theory. There, the training data $x_i$ represent the results of measurements on quantum systems and the conditions $c_i$ describe the kind of measurement performed. Since we are interested in the determination of quantum potentials, the hypotheses $h$ will in the following represent potentials $v$.

