
Gaussian regression

An important special case of density estimation leading to quadratic data terms is regression for independent training data with Gaussian likelihoods

\begin{displaymath}
p(y_i\vert x_i,{h})
= \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y_i-{h}(x_i))^2}{2 \sigma^2}}
,
\end{displaymath} (245)

with fixed, but possibly $x_i$-dependent, variance $\sigma^2$. In that case $P(x,y)$ = $p(y_i\vert x_i,{h})$ is specified by $\phi $ = $h$ and the logarithmic term $\sum_i \ln P_i$ becomes quadratic in the regression function ${h}(x_i)$, i.e., of the form (225). In an interpretation as empirical risk minimization, quadratic error terms correspond to the choice of a squared error loss function $l(x,y,a)$ = $(y-a(x))^2$ for action $a(x)$. Similarly, the technical analog of Bayesian priors is given by additional (regularizing) cost terms.
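
Explicitly, taking the negative logarithm of (245) gives
\begin{displaymath}
-\ln p(y_i\vert x_i,{h})
= \frac{(y_i-{h}(x_i))^2}{2 \sigma^2} + \frac{1}{2}\ln (2\pi\sigma^2)
,
\end{displaymath}
so that, up to ${h}$-independent constants and for $\sigma$ = $1$ (or after absorbing $\sigma^2$ into the scale of the prior operator), the sum over the training data is exactly the quadratic data term appearing in (247) below.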

We have remarked in Section 2.3 that, for continuous $x$, measurement of ${h}(x)$ has to be understood as measurement of ${h} (\tilde x)$ = $\int \!dx\, \vartheta (x) {h} (x)$ for sharply peaked $\vartheta (x)$. We assume here that the discretization of ${h}$ used in numerical calculations takes care of that averaging. Divergent quantities like $\delta$-functionals, used here for convenience, will then not be present.

We now combine Gaussian data terms and a Gaussian (specific) prior with prior operator ${{\bf K}}_0 (x,x^\prime)$ and define for training data $x_i$, $y_i$ the operator

\begin{displaymath}
{{\bf K}}_i(x,x^\prime ) =
\delta(x-x_i)\delta(x-x^\prime)
,
\end{displaymath} (246)

and training data templates $t_i=y_i$. We also allow a general prior template $t_0$ but remark that it is often chosen identically zero. According to (230) the resulting functional can be written in the following forms, useful for different purposes,
\begin{eqnarray*}
E_{h} &=& \frac{1}{2}\sum_{i=1}^n ( {h} (x_i) -y_i)^2
+\frac{1}{2}(\,{h}-t_0,\, {{\bf K}}_0 \, ({h}-t_0) \,)_X \quad (247) \\
&=& \frac{1}{2}\sum_{i=1}^n (\,{h} -t_i,\, {{\bf K}}_i ({h}-t_i)\,)_X
+\frac{1}{2}(\,{h}-t_0,\, {{\bf K}}_0 \, ({h}-t_0) \,)_X \quad (248) \\
&=& \frac{1}{2}\sum_{i=0}^n (\,{h} -t_i,\, {{\bf K}}_i ({h}-t_i)\,)_X \quad (249) \\
&=& \frac{1}{2}({h} -t_D,\, {{\bf K}}_D ({h}-t_D))_X
+\frac{1}{2}({h}-t_0,\, {{\bf K}}_0 ({h}-t_0))_X
+E_{{\rm min}}^D \quad (250) \\
&=& \frac{1}{2}(\,{h},\, {{\bf K}} \, {h}\,)_X
- (\,{h},\, {{\bf K}} \, {t}\,)_X
+\frac{1}{2}\sum_{i=0}^n (\,t_i,\, {{\bf K}}_i t_i\,)_X \quad (251) \\
&=& \frac{1}{2}(\,{h} - t,\, {{\bf K}} \, ({h}-t)\,)_X + E_{\rm min}, \quad (252)
\end{eqnarray*}

with
\begin{displaymath}
{{\bf K}}_D = \sum_{i=1}^n {{\bf K}}_i
,\quad
t_D = {{\bf K}}_D^{-1} \sum_{i=1}^n {{\bf K}}_i t_i
,
\end{displaymath} (253)


\begin{displaymath}
{{\bf K}} = \sum_{i=0}^n {{\bf K}}_i
,\quad
t = {{\bf K}}^{-1} \sum_{i=0}^n {{\bf K}}_i t_i,
\end{displaymath} (254)

and ${h}$-independent minimal errors,
\begin{eqnarray*}
E_{{\rm min}}^D &=& \frac{1}{2}
\left(
\sum_{i=1}^n \left( t_i,\; {\bf K}_i t_i\right)_X
-\left( t_D,\; {\bf K}_D t_D\right)_X
\right)
, \quad (255) \\
E_{\rm min} &=& \frac{1}{2}
\left(
\sum_{i=0}^n \left( t_i,\; {\bf K}_i t_i\right)_X
-\left( t,\; {\bf K} t\right)_X
\right)
, \quad (256)
\end{eqnarray*}

being proportional to the ``generalized variances'' $V_D$ = $ 2 E_{{\rm min}}^D/n$ and $V$ = $2E_{{\rm min}}/(n+1)$. The scalar product $(\cdot,\cdot)_X$ stands for $x$-integration only; for the sake of simplicity, however, we will skip the subscript $X$ in the following. The data operator ${{\bf K}}_D$
\begin{displaymath}
{{\bf K}}_D (x,x^\prime )
= \sum_{i=1}^n \delta (x-x_i)
\delta (x-x^\prime)
= n_x\, \delta (x-x^\prime)
,
\end{displaymath} (257)

contains on its diagonal, for discrete $x$, the number of measurements at $x$,
\begin{displaymath}
n_x
=N_X(x)
=\sum_{i=1}^n \delta (x-x_i)
,
\end{displaymath} (258)

which is zero for $x$ not in the training data. As already mentioned, for continuous $x$ an integration over a neighborhood of $x_i$ is required. ${{\bf K}}_D^{-1}$ is a shorthand notation for the inverse within the space of training data
\begin{displaymath}
{{\bf K}}_D^{-1} =
\left({\bf I}_D {{\bf K}}_D{\bf I}_D\right)^{-1} =
\delta (x-x^\prime)/n_x
,
\end{displaymath} (259)

${\bf I}_D$ denoting the projector into the space of training data
\begin{displaymath}
{\bf I}_D = \delta ( x-x^\prime )
\sum_{i=1}^{\tilde n} \delta (x-x_i)
.
\end{displaymath} (260)

Notice that the sum is not over all $n$ training points $x_i$ but only over the $\tilde n\le n$ different $x_i$. (Again, for continuous $x$ an integration around $x_i$ is required to ensure ${\bf I}_D^2$ = ${\bf I}_D$.) Hence, the data template $t_D$ becomes the mean of the $y$-values measured at $x$,
\begin{displaymath}
t_D (x) = \frac{1}{n_x}\sum_{j:\, x_j=x} y_j
,
\end{displaymath} (261)

and $t_D (x)$ = $0$ for $n_x$ = $0$. Normalization of $P(x,y)$ is not influenced by a change in ${h}(x)$, so the Lagrange multiplier terms have been skipped.
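
For instance, in a hypothetical data set where the value $x$ has been measured $n_x$ = $3$ times with outcomes $y$ = $0,1,2$, the corresponding contribution to the data term is
\begin{displaymath}
\frac{1}{2}\left( {h}(x)^2 + ({h}(x)-1)^2 + ({h}(x)-2)^2 \right)
= \frac{3}{2}\left({h}(x)-1\right)^2 + {\rm const}
,
\end{displaymath}
i.e., $t_D(x)$ = $1$ is the mean of the measured $y$-values and $n_x$ = $3$ acts as their weight.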

The stationarity equation is most easily obtained from (252),

\begin{displaymath}
0={{\bf K}} ({h}-t)
.
\end{displaymath} (262)

It is linear and, on a space where ${{\bf K}}^{-1}$ exists, has the unique solution
\begin{displaymath}
{h}=t
.
\end{displaymath} (263)

We remark that ${{\bf K}}$ can be invertible (and usually is, so that the learning problem is well defined) even if ${{\bf K}}_0$ is not invertible. The inverse ${{\bf K}}^{-1}$, necessary to calculate $t$, is training data dependent and represents the covariance operator/matrix of a Gaussian posterior process. In many practical cases, however, the prior covariance ${{\bf K}}_0^{-1}$ (or, in the case of a null space, a pseudo inverse of ${{\bf K}}_0$) is directly given or can be calculated. Then an inversion of a finite dimensional matrix in data space is sufficient to find the minimum of the energy $E_{h}$ [228,76].

Invertible ${\bf K}_0$: Let us first study the case of an invertible ${\bf K}_0$ and consider the stationarity equation as obtained from (248) or (250)

\begin{eqnarray*}
0 &=& \sum_{i=1}^n {{\bf K}}_i ({h}-t_i) + {{\bf K}}_0 ({h} - t_0) \quad (264) \\
&=& {{\bf K}}_D ({h}-t_D) + {{\bf K}}_0 ({h} - t_0)
. \quad (265)
\end{eqnarray*}

For existing ${{\bf K}}_0^{-1}$ this is equivalent to
\begin{displaymath}
{h} = t_0 + {{\bf K}}_0^{-1} {{\bf K}}_D (t_D-{h} )
,
\end{displaymath} (266)

one can introduce
\begin{displaymath}
a = {{\bf K}}_D (t_D -{h})
,
\end{displaymath} (267)

to obtain
\begin{displaymath}
{h} = t_0 + {{\bf K}}_0^{-1} a
.
\end{displaymath} (268)

Inserting Eq. (268) into Eq. (267) one finds an equation for $a$
\begin{displaymath}
\left(I + {{\bf K}}_D {{\bf K}}_0^{-1}\right) a
=
{{\bf K}}_D (t_D - t_0)
.
\end{displaymath} (269)
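
Explicitly, inserting Eq. (268) into Eq. (267) gives $a = {{\bf K}}_D\left(t_D - t_0 - {{\bf K}}_0^{-1} a\right)$, and collecting the terms containing $a$ on the left-hand side yields Eq. (269).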

Multiplying Eq. (269) from the left by the projector ${\bf I}_D$ defined in (260) and using
\begin{displaymath}
{{\bf K}}_D {\bf I}_D = {\bf I}_D {{\bf K}}_D
,\quad
a = {\bf I}_D a
,\quad
t_D = {\bf I}_D t_D
,
\end{displaymath} (270)

one obtains an equation in data space
\begin{displaymath}
\left(I_D + {{\bf K}}_D {{\bf K}}_{0,DD}^{-1}\right) a
=
{{\bf K}}_D (t_D - t_{0,D})
,
\end{displaymath} (271)

where
\begin{displaymath}
{{\bf K}}_{0,DD}^{-1}
= ({{\bf K}}_0^{-1})_{DD}
= {\bf I}_D {{\bf K}}_0^{-1} {\bf I}_D
\ne ({{\bf K}}_{0,DD})^{-1}
,\quad
t_{0,D} = {\bf I}_D t_0
.
\end{displaymath} (272)
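
The distinction in (272) matters: ${{\bf K}}_{0,DD}^{-1}$ is the full inverse ${{\bf K}}_0^{-1}$ restricted to data space, not the inverse of the restricted ${{\bf K}}_{0,DD}$. As a hypothetical two-point illustration, for a two-dimensional ${\bf K}_0$ with diagonal entries $2$ and off-diagonal entries $1$, and a single training point corresponding to the first component, one finds $({{\bf K}}_0^{-1})_{DD}$ = $2/3$, whereas $({{\bf K}}_{0,DD})^{-1}$ = $1/2$.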

Thus,
\begin{displaymath}
a = {{\bf C}_{DD}}\; b
,
\end{displaymath} (273)

where
\begin{displaymath}
{\bf C}_{DD} =
\left({\bf I}_D + {{\bf K}}_D {{\bf K}}_{0,DD}^{-1}\right)^{-1}
,
\end{displaymath} (274)

and
\begin{displaymath}
b ={{\bf K}}_D (t_D - t_0)
.
\end{displaymath} (275)

In components, Eq. (271) reads
\begin{displaymath}
\sum_l \left( \delta_{kl} + n_{x_k} {{\bf K}}_{0}^{-1}(x_k,x_l)\right) a(x_l)
=
n_{x_k} \left( t_D(x_k) -t_0(x_k) \right)
.
\end{displaymath} (276)

Having calculated $a$, the solution ${h}$ is given by Eq. (268)
\begin{displaymath}
{h}
= t_0 + {{\bf K}}_0^{-1} {{\bf C}_{DD}}\, b
= t_0 + {{\bf K}}_0^{-1} \left( {\bf K}_D^{-1} + {{\bf K}}_{0,DD}^{-1} \right)^{-1} (t_D - t_0)
.
\end{displaymath} (277)

Eq. (277) can also be obtained directly from Eq. (263) and the definitions (254), without introducing the auxiliary variable $a$, using the decomposition ${\bf K}_0 t_0$ = $-{\bf K}_D t_0$ + $({\bf K}_0 + {\bf K}_D )t_0$ and
\begin{displaymath}
{\bf K}^{-1}{\bf K}_D
={\bf K}_0^{-1} \left({\bf I} + {\bf K}_D {\bf K}_0^{-1}\right)^{-1} {\bf K}_D
={\bf K}_0^{-1} \sum_{m=0}^\infty
\left(-{\bf K}_D {\bf K}_0^{-1}\right)^m {\bf K}_D
\end{displaymath} (278)


\begin{displaymath}
={\bf K}_0^{-1} \sum_{m=0}^\infty
\left(-{\bf K}_D {\bf I}_D {\bf K}_0^{-1} {\bf I}_D\right)^m {\bf K}_D
={\bf K}_0^{-1}\left({\bf I}_D
+ {\bf K}_D {\bf K}_{0,DD}^{-1}\right)^{-1}{\bf K}_D
.
\end{displaymath} (279)

${{\bf K}}_0^{-1} {{\bf C}_{DD}}$ is also known as the equivalent kernel, due to its relation to kernel smoothing techniques [210,94,90,76].
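
To make the data-space computation explicit, the following minimal numerical sketch implements Eq. (277) for a zero prior template and a hypothetical Gaussian (RBF) prior covariance ${{\bf K}}_0^{-1}$; the kernel, the training points, and all variable names are illustrative assumptions rather than part of the text.
\begin{verbatim}
import numpy as np

def prior_cov(x, xp, length=0.5):
    # hypothetical prior covariance K_0^{-1}(x,x'); any positive definite kernel works
    return np.exp(-0.5 * (x - xp) ** 2 / length ** 2)

# hypothetical training data; note the repeated point x = 0.5
x_train = np.array([0.0, 0.5, 0.5, 2.0])
y_train = np.array([1.0, 0.0, 0.4, -1.0])

# distinct x values, multiplicities n_x (diagonal of K_D), data templates t_D as in Eq. (261)
x_D, inv = np.unique(x_train, return_inverse=True)
n_x = np.bincount(inv).astype(float)
t_D = np.bincount(inv, weights=y_train) / n_x
t_0 = np.zeros_like(x_D)                     # prior template t_0, here chosen zero

# Eq. (277): h = t_0 + K_0^{-1}(.,D) (K_D^{-1} + K_{0,DD}^{-1})^{-1} (t_D - t_0)
K_D_inv = np.diag(1.0 / n_x)
K0_inv_DD = prior_cov(x_D[:, None], x_D[None, :])
a = np.linalg.solve(K_D_inv + K0_inv_DD, t_D - t_0)   # equals the vector a of Eq. (273)

def h(x):
    # regression function at arbitrary points x (the template t_0 is zero there as well)
    return prior_cov(np.atleast_1d(x)[:, None], x_D[None, :]) @ a

print(h(np.array([0.0, 0.5, 1.0, 2.0])))
\end{verbatim}
Only the $\tilde n\times\tilde n$ matrix ${\bf K}_D^{-1} + {{\bf K}}_{0,DD}^{-1}$ in data space is inverted, in line with the earlier remark that an inversion of a finite dimensional matrix in data space is sufficient.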

Interestingly, Eq. (268) still holds for non-quadratic data terms of the form $g_D(h)$ with any differentiable function fulfilling $g(h)$ = $g(h_D)$, where $h_D$ = ${\bf I}_D h$ is the restriction of $h$ to data space. Hence the functional derivative with respect to $h(x)$ is also restricted to data space, i.e., $g^\prime (h_D)$ = $g^\prime_D(h_D)$ with $g^\prime_D$ = ${\bf I}_D g^\prime$ and $g^\prime (h,x)=\delta g(h)/\delta h(x)$. An example is $g(h)$ = $\sum_{i=1}^nV({h} (x_i) -y_i)$ with $V$ a differentiable function. The finite dimensional vector $a$ is then found by solving a nonlinear equation instead of a linear one [73,75].
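
As a concrete illustration of such a non-quadratic data term, the following self-contained sketch (the loss $V(r)$ = $\sqrt{1+r^2}-1$, all names, and the data are illustrative assumptions) solves the resulting nonlinear data-space equation numerically; with the sign convention $a$ = ${{\bf K}}_D(t_D-{h})$ of Eq. (267), the quadratic-case relation is replaced by $a$ = $-g^\prime_D(h_D)$ evaluated at ${h}$ = $t_0 + {{\bf K}}_0^{-1} a$ as in Eq. (268).
\begin{verbatim}
import numpy as np
from scipy.optimize import fsolve

def prior_cov(x, xp, length=0.5):
    # hypothetical prior covariance K_0^{-1}(x,x')
    return np.exp(-0.5 * (x - xp) ** 2 / length ** 2)

def V_prime(r):
    # derivative of the robust loss V(r) = sqrt(1+r^2) - 1
    return r / np.sqrt(1.0 + r ** 2)

# hypothetical training data with distinct x values (n_x = 1) and zero prior template t_0
x_D = np.array([0.0, 0.5, 1.0, 2.0])
y_D = np.array([1.0, 0.0, 5.0, -1.0])     # the value 5.0 plays the role of an outlier
K0_inv_DD = prior_cov(x_D[:, None], x_D[None, :])

def stationarity(a):
    # root of a + g'_D(h_D) with h_D = K_{0,DD}^{-1} a, i.e. a = -g'_D(h_D)
    h_D = K0_inv_DD @ a
    return a + V_prime(h_D - y_D)

a = fsolve(stationarity, np.zeros_like(y_D))

def h(x):
    # Eq. (268) with t_0 = 0: h = K_0^{-1} a
    return prior_cov(np.atleast_1d(x)[:, None], x_D[None, :]) @ a

print(h(x_D))
\end{verbatim}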

Furthermore, one can study vector fields, i.e., the case where, besides possibly $x$, also $y$, and thus ${h}(x)$, is a vector for given $x$. (Considering the variable indicating the vector components of $y$ as part of the $x$-variable, this is a situation where a fixed number of one-dimensional $y$, corresponding to a subspace of $X$ with fixed dimension, is always measured simultaneously.) In that case the diagonal ${{\bf K}}_i$ of Eq. (246) can be replaced by a version with non-zero off-diagonal elements ${{\bf K}}_{\alpha,\alpha^\prime}$ between the vector components $\alpha$ of $y$. This corresponds to a multi-dimensional Gaussian data generating probability

\begin{displaymath}
p(y_i\vert x_i,{h})
= \frac{\left(\det{{\bf K}}_i\right)^\frac{1}{2}}{(2\pi)^\frac{k}{2}}\,
e^{-\frac{1}{2}\sum_{\alpha,\alpha^\prime}
(y_{i,\alpha}-{h}_{\alpha}(x_i))\,
{{\bf K}}_{\alpha\alpha^\prime}(x_i)\,
(y_{i,\alpha^\prime}-{h}_{\alpha^\prime}(x_i))}
,
\end{displaymath} (280)

for $k$-dimensional vector $y_i$ with components $y_{i,\alpha}$.

Non-invertible ${\bf K}_0$: For non-invertible ${{\bf K}}_0$ one can solve for ${h}$ using the Moore-Penrose inverse ${{\bf K}}_0^\char93 $. Let us first recall some basic facts [58,164,186,187,126,15,120]. A pseudo inverse ${\bf B}$ of (a possibly non-square) ${\bf A}$ is defined by the conditions

\begin{displaymath}
{\bf B} {\bf A} {\bf B} = {\bf B}
,\quad
{\bf A} {\bf B} {\bf A} = {\bf A}
,
\end{displaymath} (281)

and becomes for real ${\bf A}$ the unique Moore-Penrose inverse ${\bf A}^\char93 $ if
\begin{displaymath}
({\bf A} {\bf A}^\char93 )^T = {\bf A} {\bf A}^\char93
,\quad
({\bf A}^\char93  {\bf A})^T = {\bf A}^\char93  {\bf A}
.
\end{displaymath} (282)

A linear equation
\begin{displaymath}
{\bf A} x = b
\end{displaymath} (283)

is solvable if
\begin{displaymath}
{\bf A}{\bf A}^\char93  b = b
.
\end{displaymath} (284)

In that case the solution is
\begin{displaymath}
x
= {\bf A}^\char93 b + x^0
= {\bf A}^\char93 b + y-{\bf A}^\char93 {\bf A}y
,
\end{displaymath} (285)

where $x^0 = y-{\bf A}^\char93 {\bf A}y$ is a solution of the homogeneous equation ${\bf A}x^0 = 0$ and the vector $y$ is arbitrary. Hence, $x^0$ can be expanded in an orthonormalized basis $\psi_l$ of the null space of ${\bf A}$
\begin{displaymath}
x^0 = \sum_l c_l \psi_l
.
\end{displaymath} (286)

For a diagonal matrix

\begin{displaymath}
{\bf D} = \left(\begin{array}{cc}\widetilde {\bf D}&0\\ 0&0\end{array}\right)
,
\end{displaymath} (287)

where $\widetilde {\bf D}$ is invertible, the Moore-Penrose inverse is
\begin{displaymath}
{\bf D}^\char93  = \left(\begin{array}{cc}{\widetilde {\bf D}}^{-1}&0\\ 0&0\end{array}\right)
.
\end{displaymath} (288)

For a square matrix ${\bf D}$ the product
\begin{displaymath}
{\bf D}{\bf D}^\char93
=
{\bf D}^\char93 {\bf D}
= \left(\begin{array}{cc}1&0\\ 0&0\end{array}\right)
= {\bf I}^D_1
\end{displaymath} (289)

becomes the projector on the space where ${\bf D}$ is invertible. Similarly, ${\bf I}^D_0$ = ${\bf I} - {\bf I}^D_1$ is the projector on the null space of ${\bf D}$.

For a general (possibly non-square) ${\bf A}$ there always exist orthogonal ${\bf O_1}$ and ${\bf O_2}$ so that it can be written in orthogonal normal form ${\bf A}$ = ${\bf O_1 D O_2}$, where ${\bf D}$ is of the form of Eq. (287) [120]. The corresponding Moore-Penrose inverse is ${\bf A}^\char93 $ = ${\bf O_2^{-1} D^\char93  O_1^{-1}}$.

Similarly, for an ${\bf A}$ which can be diagonalized, i.e., for a square matrix ${\bf A}$ = ${\bf M}^{-1} {\bf D} {\bf M}$ with diagonal ${\bf D}$, the Moore-Penrose inverse is ${\bf A}^\char93 $ = ${\bf M}^{-1} {\bf D}^\char93  {\bf M}$. From Eq. (289) it follows then that

\begin{displaymath}
{\bf A}{\bf A}^\char93
= {\bf A}^\char93 {\bf A}
= {\bf I}_1 = {\bf I} - {\bf I}_0
,
\end{displaymath} (290)

where ${\bf I}_1$ = ${\bf M}^{-1} {\bf D} {\bf D}^\char93  {\bf M}$ = ${\bf M}^{-1} {\bf I}_1^D {\bf M}$ projects on the space where ${\bf A}$ is invertible and ${\bf I}_0$ = ${\bf I} - {\bf I}_1$ = $\sum_l \psi_l \psi_l^T$ is the projector into the null space of ${\bf A}$. Then from the definition (281) it follows
\begin{displaymath}
{\bf I}_1 {\bf A} {\bf I}_1 = {\bf A},
\qquad
{\bf I}_1 {\bf A}^\char93  {\bf I}_1 = {\bf A}^\char93
.\end{displaymath} (291)

In that case it is sometimes convenient to write
\begin{displaymath}
{\bf A}^\char93
=
({\bf I}_1 {\bf A} {\bf I}_1)^{-1}
\end{displaymath} (292)

where the inversion is understood to be performed in the subspace where ${\bf A}$ is invertible. Expressed by projectors the solvability condition Eq. (284) becomes
\begin{displaymath}
{\bf I}_1 b = b, {\quad \rm i.e., \quad}
{\bf I}_0 b = 0
,
\end{displaymath} (293)

or in terms of a basis $\psi_l$ of the null space
\begin{displaymath}
(\, \psi_l , \, b) = 0,\; \forall l
.
\end{displaymath} (294)

This means that the inhomogeneity $b$ must have no components within the null space of ${\bf A}$. Furthermore, we can then write
\begin{displaymath}
x^0 = {\bf I}_0 \, y
,
\end{displaymath} (295)

showing that an $x^0$ can be obtained by projecting an arbitrary $y$ on the null space of ${\bf A}$. Collecting the results for an ${\bf A}$ which can be diagonalized, Eq. (285) can be expressed in terms of projection operators as follows:
\begin{displaymath}
x
= {\bf I}_1 {\bf A}^\char93  b + {\bf I}_0 y
= ({\bf I}_1 {\bf A} {\bf I}_1)^{-1}{\bf I}_1 b + {\bf I}_0\, y
.
\end{displaymath} (296)

Thus a solution $x$ is obtained by inverting ${\bf A}$ in the subspace where it is invertible and adding an arbitrary component of the null space.
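
The following small numerical check of Eqs. (284)-(296) may be helpful; the matrix and the vectors are arbitrary illustrative choices.
\begin{verbatim}
import numpy as np

# hypothetical symmetric positive semi-definite A with a one-dimensional null space
A = np.array([[2.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])          # eigenvalues 2, 2, 0
A_pinv = np.linalg.pinv(A)               # Moore-Penrose inverse A^#, Eqs. (281), (282)

b = A @ np.array([1.0, 2.0, 3.0])        # b in the range of A, so Eq. (284) holds
assert np.allclose(A @ A_pinv @ b, b)    # solvability condition A A^# b = b

I0 = np.eye(3) - A_pinv @ A              # projector on the null space of A, Eq. (290)
y = np.array([0.3, -1.2, 0.7])           # arbitrary vector
x = A_pinv @ b + I0 @ y                  # general solution, Eq. (296)
assert np.allclose(A @ x, b)
\end{verbatim}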

Now we apply this to Eq. (265), where ${{\bf K}}_0$ is diagonalizable because it is positive semi-definite. (In this case ${\bf M}$ is an orthogonal matrix and the entries of ${\bf D}$ are real and non-negative.) Hence, one obtains under the condition

\begin{displaymath}
{\bf I}_0 \left( {{\bf K}}_0 t_0 + {{\bf K}}_D (t_D -{h}) \right) = 0
,
\end{displaymath} (297)

from Eq. (285)
\begin{displaymath}
{h} =
{{\bf K}}_0^\char93  \left(
{{\bf K}}_0 t_0 + {{\bf K}}_D (t_D -{h})
\right)
+{h}^0
,
\end{displaymath} (298)

where ${{\bf K}}_0 {h}^0$ = $0$, so that ${h}^0 = \sum_l c_l \psi_l$ can be expanded in an orthonormalized basis $\psi_l$ of the null space of ${{\bf K}}_0$, assumed here to be of finite dimension. To find an equation in data space, define the vector
\begin{displaymath}
a = {{\bf K}}_D (t_D -{h})
,
\end{displaymath} (299)

to get from Eqs. (297) and (298)
\begin{eqnarray*}
0 &=& (\, \psi_l,\, {{\bf K}}_0 t_0) + (\, \psi_l,\, a),\; \forall l, \quad (300) \\
{h} &=& {{\bf K}}_0^\char93  \left(
{{\bf K}}_0 t_0 + a
\right)
+\sum_l c_l \psi_l
. \quad (301)
\end{eqnarray*}

These equations have to be solved for $a$ and the coefficients $c_l$. Inserting Eq. (301) into the definition (299) gives
\begin{displaymath}
({\bf I} + {{\bf K}}_D{{\bf K}}_0^\char93 ) a =
{{\bf K}}_D t_D -
{{\bf K}}_D {\bf I}_1 t_0 - {{\bf K}}_D \sum_l c_l \psi_l
,
\end{displaymath} (302)

using ${{\bf K}}_0^\char93 {{\bf K}}_0 $ = ${\bf I}_1$ according to Eq. (290). Using $a$ = ${\bf I}_D\,a $ [where ${\bf I}_D$, the projector on the space of training data, is defined in (260)] the solvability condition (297) becomes
\begin{displaymath}
\sum_{i=1}^{\tilde n} \psi_l(x_i)\, a(x_i)
= -(\, \psi_l,\, {{\bf K}}_0 t_0\,),\; \forall l
,
\end{displaymath} (303)

the sum going over different $x_i$ only. Eq. (302) for $a$ and $c_l$ reads in data space, similar to Eq. (271),
\begin{displaymath}
a = \tilde {{\bf C}} \tilde b
,
\end{displaymath} (304)

where $\tilde {{\bf C}}^{-1}$ = ${\bf I} + {{\bf K}}_D{{\bf K}}_0^\char93 $ has been assumed to be invertible and $\tilde b$ is given by the right-hand side of Eq. (302). Inserting into Eq. (301), the solution can finally be written
\begin{displaymath}
{h} =
{\bf I}_1 t_0
+ {{\bf K}}_0^\char93  \tilde {{\bf C}} \tilde b
+\sum_l c_l \psi_l
.
\end{displaymath} (305)
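
To illustrate Eqs. (300)-(305), the following sketch treats a discretized $x$-grid with a hypothetical lattice-Laplacian smoothness prior ${\bf K}_0$, whose null space consists of the constant function; the grid size, data positions, and data values are illustrative assumptions, and the prior template $t_0$ is chosen zero. The coupled linear system below determines $a$ and the null-space coefficient $c$ simultaneously.
\begin{verbatim}
import numpy as np

N = 50
x_grid = np.linspace(0.0, 1.0, N)

# lattice Laplacian as prior operator K_0; positive semi-definite, null space = constants
L = 2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
L[0, 0] = L[-1, -1] = 1.0
K0 = L * N ** 2
K0_pinv = np.linalg.pinv(K0)                 # Moore-Penrose inverse K_0^#
psi = np.ones((N, 1)) / np.sqrt(N)           # orthonormal basis of the null space

# hypothetical training data on a few grid points; prior template t_0 chosen zero
idx = np.array([5, 20, 35, 44])
n_x = np.zeros(N); n_x[idx] = 1.0
t_D = np.zeros(N); t_D[idx] = np.sin(2 * np.pi * x_grid[idx])
K_D = np.diag(n_x)
t_0 = np.zeros(N)

# coupled linear system for a and c from Eqs. (300) and (302)
# (the terms containing t_0 vanish here because t_0 = 0)
top = np.hstack([np.eye(N) + K_D @ K0_pinv, K_D @ psi])
bottom = np.hstack([psi.T, np.zeros((1, 1))])
rhs = np.concatenate([K_D @ t_D, -psi.T @ (K0 @ t_0)])
sol = np.linalg.solve(np.vstack([top, bottom]), rhs)
a, c = sol[:N], sol[N:]

h = K0_pinv @ (K0 @ t_0 + a) + psi @ c       # Eq. (301) / Eq. (305)
print(h[idx], t_D[idx])
\end{verbatim}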

Again, general non-quadratic data terms $g(h_D)$ can be allowed. In that case $\delta g(h_D)/\delta h(x)$ = $g^\prime(h_D,x)$ = $({\bf I}_D g^\prime)(h_D,x)$ and Eq. (299) becomes the nonlinear equation

\begin{displaymath}
a = g^\prime (h_D) =
g^\prime \!\left(
{\bf I}_D\left({{\bf K}}_0^\char93 \left(
{{\bf K}}_0 t_0 + {{\bf K}}_D (t_D -{h})
\right)+{h}^0\right)
\right)
.
\end{displaymath} (306)

The solution(s) $a$ of that equation then have to be inserted into Eq. (301).

