
Gaussian regression

An important special case of density estimation leading to quadratic data terms is regression for independent training data with Gaussian likelihoods

\begin{displaymath}
p(y_i\vert x_i,{h})
= \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y_i-{h}(x_i))^2}{2 \sigma^2}}
,
\end{displaymath} (245)

with fixed, but possibly $x_i$-dependent, variance $\sigma^2$. In that case $P(x,y)$ = $p(y_i\vert x_i,{h})$ is specified by $\phi $ = $h$ and the logarithmic term $\sum_i \ln P_i$ becomes quadratic in the regression function ${h}(x_i)$, i.e., of the form (225). In an interpretation as empirical risk minimization, quadratic error terms correspond to the choice of a squared error loss function $l(x,y,a)$ = $(y-a(x))^2$ for action $a(x)$. Similarly, the technical analog of Bayesian priors is given by additional (regularizing) cost terms.
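
Explicitly, taking the negative logarithm of (245) gives
\begin{displaymath}
-\ln p(y_i\vert x_i,{h})
= \frac{(y_i-{h}(x_i))^2}{2 \sigma^2} + \frac{1}{2}\ln (2\pi\sigma^2)
,
\end{displaymath}
so that, up to ${h}$-independent constants and for $\sigma$ = $1$ (or after absorbing $\sigma^2$ into the scale of the prior operator), the sum over the training data is exactly the quadratic data term appearing in (247) below.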

We have remarked in Section 2.3 that, for continuous $x$, measurement of ${h}(x)$ has to be understood as measurement of ${h} (\tilde x)$ = $\int \!dx\, \vartheta (x) {h} (x)$ for sharply peaked $\vartheta (x)$. We assume here that the discretization of ${h}$ used in numerical calculations takes care of that averaging. Divergent quantities like $\delta$-functionals, used here for convenience, will then not be present.

We now combine Gaussian data terms and a Gaussian (specific) prior with prior operator ${{\bf K}}_0 (x,x^\prime)$ and define for training data $x_i$, $y_i$ the operator

\begin{displaymath}
{{\bf K}}_i(x,x^\prime ) =
\delta(x-x_i)\delta(x-x^\prime)
,
\end{displaymath} (246)

and training data templates $t_i=y_i$. We also allow a general prior template $t_0$ but remark that it is often chosen identically zero. According to (230) the resulting functional can be written in the following forms, useful for different purposes,
\begin{eqnarray*}
E_{h} &=& \frac{1}{2}\sum_{i=1}^n ( {h} (x_i) -y_i)^2
+\frac{1}{2}(\,{h}-t_0,\, {{\bf K}}_0 \, ({h}-t_0) \,)_X \quad (247) \\
&=& \frac{1}{2}\sum_{i=1}^n (\,{h} -t_i,\, {{\bf K}}_i ({h}-t_i)\,)_X
+\frac{1}{2}(\,{h}-t_0,\, {{\bf K}}_0 \, ({h}-t_0) \,)_X \quad (248) \\
&=& \frac{1}{2}\sum_{i=0}^n (\,{h} -t_i,\, {{\bf K}}_i ({h}-t_i)\,)_X \quad (249) \\
&=& \frac{1}{2}({h} -t_D,\, {{\bf K}}_D ({h}-t_D))_X
+\frac{1}{2}({h}-t_0,\, {{\bf K}}_0 ({h}-t_0))_X
+E_{{\rm min}}^D \quad (250) \\
&=& \frac{1}{2}(\,{h},\, {{\bf K}} \, {h}\,)_X
- (\,{h},\, {{\bf K}} \, {t}\,)_X
+\frac{1}{2}\sum_{i=0}^n (\,t_i,\, {{\bf K}}_i t_i\,)_X \quad (251) \\
&=& \frac{1}{2}(\,{h} - t,\, {{\bf K}} \, ({h}-t)\,)_X + E_{\rm min}, \quad (252)
\end{eqnarray*}

with
\begin{displaymath}
{{\bf K}}_D = \sum_{i=1}^n {{\bf K}}_i
,\quad
t_D = {{\bf K}}_D^{-1} \sum_{i=1}^n {{\bf K}}_i t_i
,
\end{displaymath} (253)


\begin{displaymath}
{{\bf K}} = \sum_{i=0}^n {{\bf K}}_i
,\quad
t = {{\bf K}}^{-1} \sum_{i=0}^n {{\bf K}}_i t_i,
\end{displaymath} (254)

and ${h}$-independent minimal errors,
\begin{eqnarray*}
E_{{\rm min}}^D &=& \frac{1}{2}
\left(
\sum_{i=1}^n \left( t_i,\; {\bf K}_i t_i\right)_X
-\left( t_D,\; {\bf K}_D t_D\right)_X
\right)
, \quad (255) \\
E_{\rm min} &=& \frac{1}{2}
\left(
\sum_{i=0}^n \left( t_i,\; {\bf K}_i t_i\right)_X
-\left( t,\; {\bf K} t\right)_X
\right)
, \quad (256)
\end{eqnarray*}

being proportional to the ``generalized variances'' $V_D$ = $ 2 E_{{\rm min}}^D/n$ and $V$ = $2E_{{\rm min}}/(n+1)$. The scalar product $(\cdot,\cdot)_X$ stands for $x$-integration only; for the sake of simplicity, however, we will skip the subscript $X$ in the following. The data operator ${{\bf K}}_D$
\begin{displaymath}
{{\bf K}}_D (x,x^\prime )
= \sum_{i=1}^n \delta (x-x_i)
\delta (x-x^\prime)
= n_x\, \delta (x-x^\prime)
,
\end{displaymath} (257)

contains on its diagonal, for discrete $x$, the number of measurements at $x$,
\begin{displaymath}
n_x
=N_X(x)
=\sum_{i=1}^n \delta (x-x_i)
,
\end{displaymath} (258)

which is zero for $x$ not in the training data. As already mentioned, for continuous $x$ an integration over a neighborhood of $x_i$ is required. ${{\bf K}}_D^{-1}$ is a shorthand notation for the inverse within the space of training data
\begin{displaymath}
{{\bf K}}_D^{-1} =
\left({\bf I}_D {{\bf K}}_D{\bf I}_D\right)^{-1} =
\delta (x-x^\prime)/n_x
,
\end{displaymath} (259)

${\bf I}_D$ denoting the projector into the space of training data
\begin{displaymath}
{\bf I}_D = \delta ( x-x^\prime )
\sum_{i=1}^{\tilde n} \delta (x-x_i)
.
\end{displaymath} (260)

Notice that the sum is not over all $n$ training points $x_i$ but only over the $\tilde n\le n$ different $x_i$. (Again, for continuous $x$ an integration around $x_i$ is required to ensure ${\bf I}_D^2$ = ${\bf I}_D$.) Hence, the data template $t_D$ becomes the mean of the $y$-values measured at $x$,
\begin{displaymath}
t_D (x) = \frac{1}{n_x}\sum_{j:\, x_j=x} y_j
,
\end{displaymath} (261)

and $t_D (x)$ = $0$ for $n_x$ = $0$. Normalization of $P(x,y)$ is not influenced by a change in ${h}(x)$, so the Lagrange multiplier terms have been skipped.
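
For instance, in a hypothetical data set where the value $x$ has been measured $n_x$ = $3$ times with outcomes $y$ = $0,1,2$, the corresponding contribution to the data term is
\begin{displaymath}
\frac{1}{2}\left( {h}(x)^2 + ({h}(x)-1)^2 + ({h}(x)-2)^2 \right)
= \frac{3}{2}\left({h}(x)-1\right)^2 + {\rm const}
,
\end{displaymath}
i.e., $t_D(x)$ = $1$ is the mean of the measured $y$-values and $n_x$ = $3$ acts as their weight.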

The stationarity equation is most easily obtained from (252),

\begin{displaymath}
0={{\bf K}} ({h}-t)
.
\end{displaymath} (262)

It is linear and, on a space where ${{\bf K}}^{-1}$ exists, has the unique solution
\begin{displaymath}
{h}=t
.
\end{displaymath} (263)

We remark that ${{\bf K}}$ can be invertible (and usually is, so that the learning problem is well defined) even if ${{\bf K}}_0$ is not invertible. The inverse ${{\bf K}}^{-1}$, necessary to calculate $t$, is training data dependent and represents the covariance operator/matrix of a Gaussian posterior process. In many practical cases, however, the prior covariance ${{\bf K}}_0^{-1}$ (or, in the case of a null space, a pseudo inverse of ${{\bf K}}_0$) is directly given or can be calculated. Then an inversion of a finite dimensional matrix in data space is sufficient to find the minimum of the energy $E_{h}$ [228,76].

Invertible ${\bf K}_0$: Let us first study the case of an invertible ${\bf K}_0$ and consider the stationarity equation as obtained from (248) or (250)

\begin{eqnarray*}
0 &=& \sum_{i=1}^n {{\bf K}}_i ({h}-t_i) + {{\bf K}}_0 ({h} - t_0) \quad (264) \\
&=& {{\bf K}}_D ({h}-t_D) + {{\bf K}}_0 ({h} - t_0)
. \quad (265)
\end{eqnarray*}

For existing ${{\bf K}}_0^{-1}$ this is equivalent to
\begin{displaymath}
{h} = t_0 + {{\bf K}}_0^{-1} {{\bf K}}_D (t_D-{h} )
,
\end{displaymath} (266)

one can introduce
\begin{displaymath}
a = {{\bf K}}_D (t_D -{h})
,
\end{displaymath} (267)

to obtain
\begin{displaymath}
{h} = t_0 + {{\bf K}}_0^{-1} a
.
\end{displaymath} (268)

Inserting Eq. (268) into Eq. (267) one finds an equation for $a$
\begin{displaymath}
\left(I + {{\bf K}}_D {{\bf K}}_0^{-1}\right) a
=
{{\bf K}}_D (t_D - t_0)
.
\end{displaymath} (269)
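
Explicitly, inserting Eq. (268) into Eq. (267) gives $a = {{\bf K}}_D\left(t_D - t_0 - {{\bf K}}_0^{-1} a\right)$, and collecting the terms containing $a$ on the left-hand side yields Eq. (269).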

Multiplying Eq. (269) from the left by the projector ${\bf I}_D$ defined in (260) and using
\begin{displaymath}
{{\bf K}}_D {\bf I}_D = {\bf I}_D {{\bf K}}_D
,\quad
a = {\bf I}_D a
,\quad
t_D = {\bf I}_D t_D
,
\end{displaymath} (270)

one obtains an equation in data space
\begin{displaymath}
\left(I_D + {{\bf K}}_D {{\bf K}}_{0,DD}^{-1}\right) a
=
{{\bf K}}_D (t_D - t_{0,D})
,
\end{displaymath} (271)

where
\begin{displaymath}
{{\bf K}}_{0,DD}^{-1}
= ({{\bf K}}_0^{-1})_{DD}
= {\bf I}_D {{\bf K}}_0^{-1} {\bf I}_D
\ne ({{\bf K}}_{0,DD})^{-1}
,\quad
t_{0,D} = {\bf I}_D t_0
.
\end{displaymath} (272)
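
The distinction in (272) matters: ${{\bf K}}_{0,DD}^{-1}$ is the full inverse ${{\bf K}}_0^{-1}$ restricted to data space, not the inverse of the restricted ${{\bf K}}_{0,DD}$. As a hypothetical two-point illustration, for a two-dimensional ${\bf K}_0$ with diagonal entries $2$ and off-diagonal entries $1$, and a single training point corresponding to the first component, one finds $({{\bf K}}_0^{-1})_{DD}$ = $2/3$, whereas $({{\bf K}}_{0,DD})^{-1}$ = $1/2$.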

Thus,
\begin{displaymath}
a = {{\bf C}_{DD}}\; b
,
\end{displaymath} (273)

where
\begin{displaymath}
{\bf C}_{DD} =
\left({\bf I}_D + {{\bf K}}_D {{\bf K}}_{0,DD}^{-1}\right)^{-1}
,
\end{displaymath} (274)

and
\begin{displaymath}
b ={{\bf K}}_D (t_D - t_0)
.
\end{displaymath} (275)

In components, Eq. (271) reads
\begin{displaymath}
\sum_l \left( \delta_{kl} + n_{x_k} {{\bf K}}_{0}^{-1}(x_k,x_l)\right) a(x_l)
=
n_{x_k} \left( t_D(x_k) -t_0(x_k) \right)
.
\end{displaymath} (276)

Having calculated $a$, the solution ${h}$ is given by Eq. (268)
\begin{displaymath}
{h}
= t_0 + {{\bf K}}_0^{-1} {{\bf C}_{DD}}\, b
= t_0 + {{\bf K}}_0^{-1} \left( {\bf K}_D^{-1} + {{\bf K}}_{0,DD}^{-1} \right)^{-1} (t_D - t_0)
.
\end{displaymath} (277)

Eq. (277) can also be obtained directly from Eq. (263) and the definitions (254), without introducing the auxiliary variable $a$, using the decomposition ${\bf K}_0 t_0$ = $-{\bf K}_D t_0$ + $({\bf K}_0 + {\bf K}_D )t_0$ and
\begin{displaymath}
{\bf K}^{-1}{\bf K}_D
={\bf K}_0^{-1} \left({\bf I} + {\bf K}_D {\bf K}_0^{-1}\right)^{-1} {\bf K}_D
={\bf K}_0^{-1} \sum_{m=0}^\infty
\left(-{\bf K}_D {\bf K}_0^{-1}\right)^m {\bf K}_D
\end{displaymath} (278)


\begin{displaymath}
={\bf K}_0^{-1} \sum_{m=0}^\infty
\left(-{\bf K}_D {\bf I}_D {\bf K}_0^{-1} {\bf I}_D\right)^m {\bf K}_D
={\bf K}_0^{-1}\left({\bf I}_D
+ {\bf K}_D {\bf K}_{0,DD}^{-1}\right)^{-1}{\bf K}_D
.
\end{displaymath} (279)

${{\bf K}}_0^{-1} {{\bf C}_{DD}}$ is also known as the equivalent kernel, due to its relation to kernel smoothing techniques [210,94,90,76].
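
To make the data-space computation explicit, the following minimal numerical sketch implements Eq. (277) for a zero prior template and a hypothetical Gaussian (RBF) prior covariance ${{\bf K}}_0^{-1}$; the kernel, the training points, and all variable names are illustrative assumptions rather than part of the text.
\begin{verbatim}
import numpy as np

def prior_cov(x, xp, length=0.5):
    # hypothetical prior covariance K_0^{-1}(x,x'); any positive definite kernel works
    return np.exp(-0.5 * (x - xp) ** 2 / length ** 2)

# hypothetical training data; note the repeated point x = 0.5
x_train = np.array([0.0, 0.5, 0.5, 2.0])
y_train = np.array([1.0, 0.0, 0.4, -1.0])

# distinct x values, multiplicities n_x (diagonal of K_D), data templates t_D as in Eq. (261)
x_D, inv = np.unique(x_train, return_inverse=True)
n_x = np.bincount(inv).astype(float)
t_D = np.bincount(inv, weights=y_train) / n_x
t_0 = np.zeros_like(x_D)                     # prior template t_0, here chosen zero

# Eq. (277): h = t_0 + K_0^{-1}(.,D) (K_D^{-1} + K_{0,DD}^{-1})^{-1} (t_D - t_0)
K_D_inv = np.diag(1.0 / n_x)
K0_inv_DD = prior_cov(x_D[:, None], x_D[None, :])
a = np.linalg.solve(K_D_inv + K0_inv_DD, t_D - t_0)   # equals the vector a of Eq. (273)

def h(x):
    # regression function at arbitrary points x (the template t_0 is zero there as well)
    return prior_cov(np.atleast_1d(x)[:, None], x_D[None, :]) @ a

print(h(np.array([0.0, 0.5, 1.0, 2.0])))
\end{verbatim}
Only the $\tilde n\times\tilde n$ matrix ${\bf K}_D^{-1} + {{\bf K}}_{0,DD}^{-1}$ in data space is inverted, in line with the earlier remark that an inversion of a finite dimensional matrix in data space is sufficient.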

Interestingly, Eq. (268) still holds for non-quadratic data terms of the form $g_D(h)$ with any differentiable function fulfilling $g(h)$ = $g(h_D)$, where $h_D$ = ${\bf I}_D h$ is the restriction of $h$ to data space. Hence the functional derivative with respect to $h(x)$ is also restricted to data space, i.e., $g^\prime (h_D)$ = $g^\prime_D(h_D)$ with $g^\prime_D$ = ${\bf I}_D g^\prime$ and $g^\prime (h,x)=\delta g(h)/\delta h(x)$. An example is $g(h)$ = $\sum_{i=1}^nV({h} (x_i) -y_i)$ with $V$ a differentiable function. The finite dimensional vector $a$ is then found by solving a nonlinear equation instead of a linear one [73,75].
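
As a concrete illustration of such a non-quadratic data term, the following self-contained sketch (the loss $V(r)$ = $\sqrt{1+r^2}-1$, all names, and the data are illustrative assumptions) solves the resulting nonlinear data-space equation numerically; with the sign convention $a$ = ${{\bf K}}_D(t_D-{h})$ of Eq. (267), the quadratic-case relation is replaced by $a$ = $-g^\prime_D(h_D)$ evaluated at ${h}$ = $t_0 + {{\bf K}}_0^{-1} a$ as in Eq. (268).
\begin{verbatim}
import numpy as np
from scipy.optimize import fsolve

def prior_cov(x, xp, length=0.5):
    # hypothetical prior covariance K_0^{-1}(x,x')
    return np.exp(-0.5 * (x - xp) ** 2 / length ** 2)

def V_prime(r):
    # derivative of the robust loss V(r) = sqrt(1+r^2) - 1
    return r / np.sqrt(1.0 + r ** 2)

# hypothetical training data with distinct x values (n_x = 1) and zero prior template t_0
x_D = np.array([0.0, 0.5, 1.0, 2.0])
y_D = np.array([1.0, 0.0, 5.0, -1.0])     # the value 5.0 plays the role of an outlier
K0_inv_DD = prior_cov(x_D[:, None], x_D[None, :])

def stationarity(a):
    # root of a + g'_D(h_D) with h_D = K_{0,DD}^{-1} a, i.e. a = -g'_D(h_D)
    h_D = K0_inv_DD @ a
    return a + V_prime(h_D - y_D)

a = fsolve(stationarity, np.zeros_like(y_D))

def h(x):
    # Eq. (268) with t_0 = 0: h = K_0^{-1} a
    return prior_cov(np.atleast_1d(x)[:, None], x_D[None, :]) @ a

print(h(x_D))
\end{verbatim}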

Furthermore, one can study vector fields, i.e., the case where, besides possibly $x$, also $y$, and thus ${h}(x)$, is a vector for given $x$. (Considering the variable indicating the vector components of $y$ as part of the $x$-variable, this is a situation where a fixed number of one-dimensional $y$, corresponding to a subspace of $X$ with fixed dimension, is always measured simultaneously.) In that case the diagonal ${{\bf K}}_i$ of Eq. (246) can be replaced by a version with non-zero off-diagonal elements ${{\bf K}}_{\alpha,\alpha^\prime}$ between the vector components $\alpha$ of $y$. This corresponds to a multi-dimensional Gaussian data generating probability

\begin{displaymath}
p(y_i\vert x_i,{h})
= \frac{\left(\det{{\bf K}}_i\right)^\frac{1}{2}}{(2\pi)^\frac{k}{2}}\,
e^{-\frac{1}{2}\sum_{\alpha,\alpha^\prime}
(y_{i,\alpha}-{h}_{\alpha}(x_i))\,
{{\bf K}}_{\alpha\alpha^\prime}(x_i)\,
(y_{i,\alpha^\prime}-{h}_{\alpha^\prime}(x_i))}
,
\end{displaymath} (280)

for $k$-dimensional vector $y_i$ with components $y_{i,\alpha}$.

Non-invertible ${\bf K}_0$: For non-invertible ${{\bf K}}_0$ one can solve for ${h}$ using the Moore-Penrose inverse ${{\bf K}}_0^\char93 $. Let us first recall some basic facts [58,164,186,187,126,15,120]. A pseudo inverse ${\bf B}$ of (a possibly non-square) ${\bf A}$ is defined by the conditions

\begin{displaymath}
{\bf B} {\bf A} {\bf B} = {\bf B}
,\quad
{\bf A} {\bf B} {\bf A} = {\bf A}
,
\end{displaymath} (281)

and becomes for real ${\bf A}$ the unique Moore-Penrose inverse ${\bf A}^\char93 $ if
\begin{displaymath}
({\bf A} {\bf A}^\char93 )^T = {\bf A} {\bf A}^\char93
,\quad
({\bf A}^\char93  {\bf A})^T = {\bf A}^\char93  {\bf A}
.
\end{displaymath} (282)

A linear equation
\begin{displaymath}
{\bf A} x = b
\end{displaymath} (283)

is solvable if
\begin{displaymath}
{\bf A}{\bf A}^\char93  b = b
.
\end{displaymath} (284)

In that case the solution is
\begin{displaymath}
x
= {\bf A}^\char93 b + x^0
= {\bf A}^\char93 b + y-{\bf A}^\char93 {\bf A}y
,
\end{displaymath} (285)

where $x^0 = y-{\bf A}^\char93 {\bf A}y$ is a solution of the homogeneous equation ${\bf A}x^0 = 0$ and the vector $y$ is arbitrary. Hence, $x^0$ can be expanded in an orthonormalized basis $\psi_l$ of the null space of ${\bf A}$
\begin{displaymath}
x^0 = \sum_l c_l \psi_l
.
\end{displaymath} (286)

For a diagonal matrix

\begin{displaymath}
{\bf D} = \left(\begin{array}{cc}\widetilde {\bf D}&0\\ 0&0\end{array}\right)
,
\end{displaymath} (287)

where $\widetilde {\bf D}$ is invertible, the Moore-Penrose inverse is
\begin{displaymath}
{\bf D}^\char93  = \left(\begin{array}{cc}{\widetilde {\bf D}}^{-1}&0\\ 0&0\end{array}\right)
.
\end{displaymath} (288)

For a square matrix ${\bf D}$ the product
\begin{displaymath}
{\bf D}{\bf D}^\char93
=
{\bf D}^\char93 {\bf D}
= \left(\begin{array}{cc}1&0\\ 0&0\end{array}\right)
= {\bf I}^D_1
\end{displaymath} (289)

becomes the projector on the space where ${\bf D}$ is invertible. Similarly, ${\bf I}^D_0$ = ${\bf I} - {\bf I}^D_1$ is the projector on the null space of ${\bf D}$.

For a general (possibly non-square) ${\bf A}$ there always exist orthogonal ${\bf O_1}$ and ${\bf O_2}$ so that it can be written in orthogonal normal form ${\bf A}$ = ${\bf O_1 D O_2}$, where ${\bf D}$ is of the form of Eq. (287) [120]. The corresponding Moore-Penrose inverse is ${\bf A}^\char93 $ = ${\bf O_2^{-1} D^\char93  O_1^{-1}}$.

Similarly, for an ${\bf A}$ which can be diagonalized, i.e., for a square matrix ${\bf A}$ = ${\bf M}^{-1} {\bf D} {\bf M}$ with diagonal ${\bf D}$, the Moore-Penrose inverse is ${\bf A}^\char93 $ = ${\bf M}^{-1} {\bf D}^\char93  {\bf M}$. From Eq. (289) it follows then that

\begin{displaymath}
{\bf A}{\bf A}^\char93
= {\bf A}^\char93 {\bf A}
= {\bf I}_1 = {\bf I} - {\bf I}_0
,
\end{displaymath} (290)

where ${\bf I}_1$ = ${\bf M}^{-1} {\bf D} {\bf D}^\char93  {\bf M}$ = ${\bf M}^{-1} {\bf I}_1^D {\bf M}$ projects on the space where ${\bf A}$ is invertible and ${\bf I}_0$ = ${\bf I} - {\bf I}_1$ = $\sum_l \psi_l \psi_l^T$ is the projector into the null space of ${\bf A}$. Then from the definition (281) it follows
\begin{displaymath}
{\bf I}_1 {\bf A} {\bf I}_1 = {\bf A},
\qquad
{\bf I}_1 {\bf A}^\char93  {\bf I}_1 = {\bf A}^\char93
.\end{displaymath} (291)

In that case it is sometimes convenient to write
\begin{displaymath}
{\bf A}^\char93
=
({\bf I}_1 {\bf A} {\bf I}_1)^{-1}
\end{displaymath} (292)

where the inversion is understood to be performed in the subspace where ${\bf A}$ is invertible. Expressed by projectors the solvability condition Eq. (284) becomes
\begin{displaymath}
{\bf I}_1 b = b, {\quad \rm i.e., \quad}
{\bf I}_0 b = 0
,
\end{displaymath} (293)

or in terms of a basis $\psi_l$ of the null space
\begin{displaymath}
(\, \psi_l , \, b) = 0,\; \forall l
.
\end{displaymath} (294)

This means that the inhomogeneity $b$ must have no components within the null space of ${\bf A}$. Furthermore, we can then write
\begin{displaymath}
x^0 = {\bf I}_0 \, y
,
\end{displaymath} (295)

showing that an $x^0$ can be obtained by projecting an arbitrary $y$ on the null space of ${\bf A}$. Collecting the results for an ${\bf A}$ which can be diagonalized, Eq. (285) can be expressed in terms of projection operators as follows:
\begin{displaymath}
x
= {\bf I}_1 {\bf A}^\char93  b + {\bf I}_0 y
= ({\bf I}_1 {\bf A} {\bf I}_1)^{-1}{\bf I}_1 b + {\bf I}_0\, y
.
\end{displaymath} (296)

Thus a solution $x$ is obtained by inverting ${\bf A}$ in the subspace where it is invertible and adding an arbitrary component of the null space.
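
The following small numerical check of Eqs. (284)-(296) may be helpful; the matrix and the vectors are arbitrary illustrative choices.
\begin{verbatim}
import numpy as np

# hypothetical symmetric positive semi-definite A with a one-dimensional null space
A = np.array([[2.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])          # eigenvalues 2, 2, 0
A_pinv = np.linalg.pinv(A)               # Moore-Penrose inverse A^#, Eqs. (281), (282)

b = A @ np.array([1.0, 2.0, 3.0])        # b in the range of A, so Eq. (284) holds
assert np.allclose(A @ A_pinv @ b, b)    # solvability condition A A^# b = b

I0 = np.eye(3) - A_pinv @ A              # projector on the null space of A, Eq. (290)
y = np.array([0.3, -1.2, 0.7])           # arbitrary vector
x = A_pinv @ b + I0 @ y                  # general solution, Eq. (296)
assert np.allclose(A @ x, b)
\end{verbatim}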

Now we apply this to Eq. (265), where ${{\bf K}}_0$ is diagonalizable because it is positive semi-definite. (In this case ${\bf M}$ is an orthogonal matrix and the entries of ${\bf D}$ are real and non-negative.) Hence, one obtains under the condition

\begin{displaymath}
{\bf I}_0 \left( {{\bf K}}_0 t_0 + {{\bf K}}_D (t_D -{h}) \right) = 0
,
\end{displaymath} (297)

from Eq. (285)
\begin{displaymath}
{h} =
{{\bf K}}_0^\char93  \left(
{{\bf K}}_0 t_0 + {{\bf K}}_D (t_D -{h})
\right)
+{h}^0
,
\end{displaymath} (298)

where ${{\bf K}}_0 {h}^0$ = $0$, so that ${h}^0 = \sum_l c_l \psi_l$ can be expanded in an orthonormalized basis $\psi_l$ of the null space of ${{\bf K}}_0$, assumed here to be of finite dimension. To find an equation in data space, define the vector
\begin{displaymath}
a = {{\bf K}}_D (t_D -{h})
,
\end{displaymath} (299)

to get from Eqs. (297) and (298)
\begin{eqnarray*}
0 &=& (\, \psi_l,\, {{\bf K}}_0 t_0) + (\, \psi_l,\, a),\; \forall l, \quad (300) \\
{h} &=& {{\bf K}}_0^\char93  \left(
{{\bf K}}_0 t_0 + a
\right)
+\sum_l c_l \psi_l
. \quad (301)
\end{eqnarray*}

These equations have to be solved for $a$ and the coefficients $c_l$. Inserting Eq. (301) into the definition (299) gives
\begin{displaymath}
({\bf I} + {{\bf K}}_D{{\bf K}}_0^\char93 ) a =
{{\bf K}}_D t_D -
{{\bf K}}_D {\bf I}_1 t_0 - {{\bf K}}_D \sum_l c_l \psi_l
,
\end{displaymath} (302)

using ${{\bf K}}_0^\char93 {{\bf K}}_0 $ = ${\bf I}_1$ according to Eq. (290). Using $a$ = ${\bf I}_D\,a $ [where ${\bf I}_D$, the projector on the space of training data, is defined in (260)] the solvability condition (297) becomes
\begin{displaymath}
\sum_{i=1}^{\tilde n} \psi_l(x_i)\, a(x_i)
= -(\, \psi_l,\, {{\bf K}}_0 t_0\,),\; \forall l
,
\end{displaymath} (303)

the sum going over different $x_i$ only. Eq. (302) for $a$ and $c_l$ reads in data space, similar to Eq. (271),
\begin{displaymath}
a = \tilde {{\bf C}} \tilde b
,
\end{displaymath} (304)

where $\tilde {{\bf C}}^{-1}$ = ${\bf I} + {{\bf K}}_D{{\bf K}}_0^\char93 $ has been assumed to be invertible and $\tilde b$ is given by the right-hand side of Eq. (302). Inserting into Eq. (301), the solution can finally be written
\begin{displaymath}
{h} =
{\bf I}_1 t_0
+ {{\bf K}}_0^\char93  \tilde {{\bf C}} \tilde b
+\sum_l c_l \psi_l
.
\end{displaymath} (305)
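
To illustrate Eqs. (300)-(305), the following sketch treats a discretized $x$-grid with a hypothetical lattice-Laplacian smoothness prior ${\bf K}_0$, whose null space consists of the constant function; the grid size, data positions, and data values are illustrative assumptions, and the prior template $t_0$ is chosen zero. The coupled linear system below determines $a$ and the null-space coefficient $c$ simultaneously.
\begin{verbatim}
import numpy as np

N = 50
x_grid = np.linspace(0.0, 1.0, N)

# lattice Laplacian as prior operator K_0; positive semi-definite, null space = constants
L = 2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
L[0, 0] = L[-1, -1] = 1.0
K0 = L * N ** 2
K0_pinv = np.linalg.pinv(K0)                 # Moore-Penrose inverse K_0^#
psi = np.ones((N, 1)) / np.sqrt(N)           # orthonormal basis of the null space

# hypothetical training data on a few grid points; prior template t_0 chosen zero
idx = np.array([5, 20, 35, 44])
n_x = np.zeros(N); n_x[idx] = 1.0
t_D = np.zeros(N); t_D[idx] = np.sin(2 * np.pi * x_grid[idx])
K_D = np.diag(n_x)
t_0 = np.zeros(N)

# coupled linear system for a and c from Eqs. (300) and (302)
# (the terms containing t_0 vanish here because t_0 = 0)
top = np.hstack([np.eye(N) + K_D @ K0_pinv, K_D @ psi])
bottom = np.hstack([psi.T, np.zeros((1, 1))])
rhs = np.concatenate([K_D @ t_D, -psi.T @ (K0 @ t_0)])
sol = np.linalg.solve(np.vstack([top, bottom]), rhs)
a, c = sol[:N], sol[N:]

h = K0_pinv @ (K0 @ t_0 + a) + psi @ c       # Eq. (301) / Eq. (305)
print(h[idx], t_D[idx])
\end{verbatim}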

Again, general non-quadratic data terms $g(h_D)$ can be allowed. In that case $\delta g(h_D)/\delta h(x)$ = $g^\prime(h_D,x)$ = $({\bf I}_D g^\prime)(h_D,x)$ and Eq. (299) becomes the nonlinear equation

\begin{displaymath}
a = g^\prime (h_D) =
g^\prime \!\left(
{\bf I}_D\left({{\bf K}}_0^\char93 \left(
{{\bf K}}_0 t_0 + {{\bf K}}_D (t_D -{h})
\right)+{h}^0\right)
\right)
.
\end{displaymath} (306)

The solution(s) $a$ of that equation then have to be inserted into Eq. (301).

