

Loss functions for approximation



Log-loss: A typical loss function for density estimation problems is the log-loss

\begin{displaymath}
l(x,y,a) = -b_1(x)\ln p(y\vert x,a)+b_2(x,y)
\end{displaymath} (46)

with some $a$-independent $b_1(x)>0$ and $b_2(x,y)$, and actions $a$ describing probability densities
\begin{displaymath}
\int \!dy\, p(y\vert x,a) = 1, \,\,\forall x\in X, \forall a\in A
.
\end{displaymath} (47)

Choosing $b_2(x,y)=\ln p(y\vert x,f)$ and $b_1(x)=1$ gives
\begin{eqnarray}
r(a,f)
&=& \int\! dx\, dy\, p(x)\, p(y\vert x,f)\, \ln \frac{p(y\vert x,f)}{p(y\vert x,a)}
\qquad (48) \\
&=& < \ln \frac{p(y\vert x,f)}{p(y\vert x,a)} >_{X,Y\vert f}
\qquad (49) \\
&=& < {\rm KL}(\, {p(y\vert x,f)},\, {p(y\vert x,a)}\, ) >_{X}
,
\qquad (50)
\end{eqnarray}

which shows that minimizing log-loss is equivalent to minimizing the ($x$-averaged) Kullback-Leibler entropy ${\rm KL}( \, {p(y\vert x,f)},\, {p(y\vert x,a)}\, )$ [122,123,13,46,53].
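
This equivalence is easy to check numerically. The following is a minimal sketch (not taken from the paper), assuming discrete $x$ and $y$ and randomly generated conditional probability tables: the expected log-loss and the $x$-averaged Kullback-Leibler distance differ only by an $a$-independent constant, so both are minimized by the same action.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

n_x, n_y = 3, 4
p_x = rng.dirichlet(np.ones(n_x))                # p(x)
p_true = rng.dirichlet(np.ones(n_y), size=n_x)   # p(y|x,f), each row sums to 1

def log_loss_risk(p_a):
    # r(a,f) for b_1 = 1, b_2 = 0:  -sum_{x,y} p(x) p(y|x,f) ln p(y|x,a)
    return -np.sum(p_x[:, None] * p_true * np.log(p_a))

def averaged_kl(p_a):
    # <KL( p(y|x,f), p(y|x,a) )>_X
    return np.sum(p_x[:, None] * p_true * np.log(p_true / p_a))

# candidate actions a (conditional densities), with the true density among them
candidates = [p_true] + [rng.dirichlet(np.ones(n_y), size=n_x) for _ in range(5)]

risks = np.array([log_loss_risk(a) for a in candidates])
kls   = np.array([averaged_kl(a) for a in candidates])

# risk and averaged KL differ by the a-independent term -<ln p(y|x,f)>_{X,Y|f},
# hence both are minimized by the same action (here: the true density itself)
print(np.allclose(risks - kls, risks[0]))
print(risks.argmin() == kls.argmin() == 0)
\end{verbatim}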

While the paper will concentrate on log-loss, we also give a short summary of loss functions for regression problems. (See for example [16,201] for details.) Regression problems are special density estimation problems in which the possible actions are restricted to $y$-independent functions $a(x)$.



Squared-error loss: The most common loss function for regression problems (see Sections 3.7, 3.7.2) is the squared-error loss. For one-dimensional $y$ it reads

\begin{displaymath}
l(x,y,a) = b_1(x) \left( y-a(x) \right)^2 +b_2(x,y)
,
\end{displaymath} (51)

with arbitrary $b_1(x)>0$ and $b_2(x,y)$. In that case the optimal function $a(x)$ is the regression function of the posterior, which is the mean of the predictive density
\begin{displaymath}
a^*(x)
= \int \!dy \, y \, p(y\vert x,f)
= \,\, <y>_{Y\vert x,f}
.
\end{displaymath} (52)

This can be easily seen by writing
\begin{eqnarray}
\left( y-a(x) \right)^2
&=& \left( y\;-\!<y>_{Y\vert x,f}+<y>_{Y\vert x,f}-\;a(x) \right)^2
\qquad (53) \\
&=& \left( y\;-\!<y>_{Y\vert x,f} \right)^2
+\left( a(x)\;- <y>_{Y\vert x,f} \right)^2
\nonumber \\
&& -\, 2 \left( y\,-<y>_{Y\vert x,f} \right)
\left( a(x)\;- <y>_{Y\vert x,f} \right)
,
\qquad (54)
\end{eqnarray}

where the first term in (54) is independent of $a$ and the last term vanishes after integration over $y$ according to the definition of $<y>_{Y\vert x,f}$. Hence,
\begin{displaymath}
r(a,f)=\int\!dx\,b_1(x) p(x)\left( a(x)\,-\!<y>_{Y\vert x,f} \right)^2+{\rm const.}
\end{displaymath} (55)

This is minimized by $a(x) = <y>_{Y\vert x,f}$. Notice that for Gaussian $p(y\vert x,a)$ with fixed variance, log-loss and squared-error loss are equivalent. For multi-dimensional $y$, one-dimensional loss functions like Eq. (51) can be used when the component index of $y$ is considered part of the $x$-variables. Alternatively, loss functions depending explicitly on a multidimensional $y$ can be defined. For instance, a general quadratic loss function would be
\begin{displaymath}
l(x,y,a)=
\sum_{k,k^\prime }(y_k-a_k(x))\,{\bf K}(k,k^\prime)\,(y_{k^\prime }
-a_{k^\prime }(x)),
\end{displaymath} (56)

with symmetric, positive definite kernel ${\bf K}(k,k^\prime )$.
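
As a numerical illustration of Eq. (52), the following minimal sketch (not from the paper, using an arbitrary Gaussian predictive density discretized on a grid) searches over constant actions $a$ at a single $x$ and recovers the mean of $p(y\vert x,f)$ as the minimizer of the expected squared-error loss.

\begin{verbatim}
import numpy as np

y = np.linspace(-3.0, 3.0, 601)              # grid of y values for a single x
w = np.exp(-0.5 * ((y - 0.7) / 0.5)**2)      # unnormalized Gaussian p(y|x,f)
w /= w.sum()                                 # discretized predictive density

def squared_error_risk(a):
    # <(y - a)^2>_{Y|x,f} on the grid, with b_1 = 1 and b_2 = 0
    return np.sum(w * (y - a)**2)

a_grid = np.linspace(-2.0, 2.0, 4001)
a_star = a_grid[np.argmin([squared_error_risk(a) for a in a_grid])]
mean_y = np.sum(w * y)

print(a_star, mean_y)                        # both close to 0.7
\end{verbatim}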



Absolute loss: For absolute loss

\begin{displaymath}
l(x,y,a) = b_1(x) \vert y-a(x)\vert +b_2(x,y)
,
\end{displaymath} (57)

with arbitrary $b_1(x)>0$ and $b_2(x,y)$, the risk becomes
\begin{eqnarray}
r(a,f)
&=& \int \!dx\, b_1(x)\, p(x) \int_{-\infty}^{a(x)}\!dy \left( a(x)-y \right) p(y\vert x,f)
\nonumber \\
&& +\int \!dx\, b_1(x)\, p(x)\int_{a(x)}^\infty\!dy \left( y-a(x) \right) p(y\vert x,f)
+{\rm const.}
\qquad (58) \\
&=& 2\int \!dx\, b_1(x)\, p(x) \int_{m(x)}^{a(x)}\!dy \left( a(x)-y \right) p(y\vert x,f)
+{\rm const.}^\prime
,
\qquad (59)
\end{eqnarray}

where the integrals have been rewritten as $\int_{-\infty}^{a(x)} = \int_{-\infty}^{m(x)} + \int_{m(x)}^{a(x)}$ and $\int_{a(x)}^\infty = \int_{a(x)}^{m(x)} + \int_{m(x)}^{\infty}$, introducing a median function $m(x)$ which satisfies
\begin{displaymath}
\int_{-\infty}^{m(x)} \!dy\, p(y\vert x,f) = \frac{1}{2},\, \forall x\in X
,
\end{displaymath} (60)

so that
\begin{displaymath}
a(x) \left( \int_{-\infty}^{m(x)} \!dy\, p(y\vert x,f)
-\int_{m(x)}^\infty \!dy\, p(y\vert x,f)
\right)
= 0, \,\, \forall x\in X
.
\end{displaymath} (61)

Thus the risk is minimized by any median function $m(x)$.
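
The median property can likewise be illustrated numerically. The sketch below (not from the paper, using an arbitrary skewed predictive density discretized on a grid) minimizes the expected absolute loss at a single $x$ over a grid of actions and finds the median of $p(y\vert x,f)$, which differs from the mean.

\begin{verbatim}
import numpy as np

y = np.linspace(0.0, 10.0, 1001)       # grid of y values for a single x
w = y * np.exp(-y)                     # skewed (Gamma-like) predictive density
w /= w.sum()                           # normalize on the grid

def absolute_risk(a):
    # <|y - a|>_{Y|x,f} on the grid, with b_1 = 1 and b_2 = 0
    return np.sum(w * np.abs(y - a))

a_grid = np.linspace(0.0, 10.0, 5001)
a_star = a_grid[np.argmin([absolute_risk(a) for a in a_grid])]
median = y[np.searchsorted(np.cumsum(w), 0.5)]   # m(x) from Eq. (60)
mean   = np.sum(w * y)

print(a_star, median, mean)   # a_star matches the median (~1.68), not the mean (~2.0)
\end{verbatim}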



$\delta$-loss and $0$-$1$ loss: Another possible loss function, typical for classification tasks (see Section 3.8) such as image segmentation [153], is the $\delta$-loss for continuous $y$ or the $0$-$1$ loss for discrete $y$

\begin{displaymath}
l(x,y,a) = - b_1(x) \delta \left( y-a(x) \right) +b_2(x,y),
\end{displaymath} (62)

with arbitrary $b_1(x)>0$ and $b_2(x,y)$. Here $\delta$ denotes the Dirac $\delta$-functional for continuous $y$ and the Kronecker $\delta$ for discrete $y$. Then,
\begin{displaymath}
r(a,f) = -\int\!dx\, b_1(x) p(x) \, p(\,y\!=\!a(x)\,\vert x,f)
+{\rm const.}
,
\end{displaymath} (63)

so the optimal $a$ corresponds to any mode function of the predictive density. For Gaussians, mode and median are unique and coincide with the mean.
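
For discrete $y$ the mode property of Eq. (63) is easy to verify by exhaustive search. The following minimal sketch (not from the paper, with randomly generated probability tables) compares all deterministic actions $a\!: X \to Y$ and confirms that the risk-minimizing action picks the mode of the predictive density at every $x$.

\begin{verbatim}
from itertools import product
import numpy as np

rng = np.random.default_rng(2)

n_x, n_y = 4, 5
p_x = rng.dirichlet(np.ones(n_x))                # p(x)
p_pred = rng.dirichlet(np.ones(n_y), size=n_x)   # predictive density p(y|x,f)

def zero_one_risk(a):
    # r(a,f) = -sum_x p(x) p(y=a(x)|x,f) for a deterministic action a: X -> Y
    return -np.sum(p_x * p_pred[np.arange(n_x), a])

# exhaustive search over all |Y|^|X| deterministic actions
best = min(product(range(n_y), repeat=n_x),
           key=lambda a: zero_one_risk(np.array(a)))

# the minimizer picks the mode of p(y|x,f) at every x
print(np.array_equal(np.array(best), p_pred.argmax(axis=1)))   # True
\end{verbatim}

The exhaustive search is of course only feasible for tiny $X$ and $Y$; it merely serves to confirm that no action improves on the per-$x$ mode.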

