

Loss functions for approximation



Log-loss: A typical loss function for density estimation problems is the log-loss

\begin{displaymath}
l(x,y,a) = -b_1(x)\ln p(y\vert x,a)+b_2(x,y)
\end{displaymath} (46)

with some $a$-independent $b_1(x)>0$ and $b_2(x,y)$, and actions $a$ describing probability densities
\begin{displaymath}
\int \!dy\, p(y\vert x,a) = 1, \,\,\forall x\in X, \forall a\in A
.
\end{displaymath} (47)

Choosing $b_2(x,y)=\ln p(y\vert x,f)$ and $b_1(x)=1$ gives
\begin{eqnarray}
r(a,f)
&=& \int\! dx\, dy\, p(x)\, p(y\vert x,f)\, \ln \frac{p(y\vert x,f)}{p(y\vert x,a)}
\qquad (48) \\
&=& < \ln \frac{p(y\vert x,f)}{p(y\vert x,a)} >_{X,Y\vert f}
\qquad (49) \\
&=& < {\rm KL}(\, {p(y\vert x,f)},\, {p(y\vert x,a)}\, ) >_{X}
,
\qquad (50)
\end{eqnarray}

which shows that minimizing log-loss is equivalent to minimizing the ($x$-averaged) Kullback-Leibler entropy ${\rm KL}( \, {p(y\vert x,f)},\, {p(y\vert x,a)}\, )$ [122,123,13,46,53].
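
This equivalence is easy to check numerically. The following is a minimal sketch (not taken from the paper), assuming discrete $x$ and $y$ and randomly generated conditional probability tables: the expected log-loss and the $x$-averaged Kullback-Leibler distance differ only by an $a$-independent constant, so both are minimized by the same action.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

n_x, n_y = 3, 4
p_x = rng.dirichlet(np.ones(n_x))                # p(x)
p_true = rng.dirichlet(np.ones(n_y), size=n_x)   # p(y|x,f), each row sums to 1

def log_loss_risk(p_a):
    # r(a,f) for b_1 = 1, b_2 = 0:  -sum_{x,y} p(x) p(y|x,f) ln p(y|x,a)
    return -np.sum(p_x[:, None] * p_true * np.log(p_a))

def averaged_kl(p_a):
    # <KL( p(y|x,f), p(y|x,a) )>_X
    return np.sum(p_x[:, None] * p_true * np.log(p_true / p_a))

# candidate actions a (conditional densities), with the true density among them
candidates = [p_true] + [rng.dirichlet(np.ones(n_y), size=n_x) for _ in range(5)]

risks = np.array([log_loss_risk(a) for a in candidates])
kls   = np.array([averaged_kl(a) for a in candidates])

# risk and averaged KL differ by the a-independent term -<ln p(y|x,f)>_{X,Y|f},
# hence both are minimized by the same action (here: the true density itself)
print(np.allclose(risks - kls, risks[0]))
print(risks.argmin() == kls.argmin() == 0)
\end{verbatim}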

While the paper will concentrate on log-loss, we also give a short summary of loss functions for regression problems. (See for example [16,201] for details.) Regression problems are special density estimation problems in which the possible actions are restricted to $y$-independent functions $a(x)$.



Squared-error loss: The most common loss function for regression problems (see Sections 3.7, 3.7.2) is the squared-error loss. For one-dimensional $y$ it reads

\begin{displaymath}
l(x,y,a) = b_1(x) \left( y-a(x) \right)^2 +b_2(x,y)
,
\end{displaymath} (51)

with arbitrary $b_1(x)>0$ and $b_2(x,y)$. In that case the optimal function $a(x)$ is the regression function of the posterior, which is the mean of the predictive density
\begin{displaymath}
a^*(x)
= \int \!dy \, y \, p(y\vert x,f)
= \,\, <y>_{Y\vert x,f}
.
\end{displaymath} (52)

This can be easily seen by writing
\begin{eqnarray}
\left( y-a(x) \right)^2
&=& \left( y\;-\!<y>_{Y\vert x,f}+<y>_{Y\vert x,f}-\;a(x) \right)^2
\qquad (53) \\
&=& \left( y\;-\!<y>_{Y\vert x,f} \right)^2
+\left( a(x)\;- <y>_{Y\vert x,f} \right)^2
\nonumber \\
&& -\, 2 \left( y\,-<y>_{Y\vert x,f} \right)
\left( a(x)\;- <y>_{Y\vert x,f} \right)
,
\qquad (54)
\end{eqnarray}

where the first term in (54) is independent of $a$ and the last term vanishes after integration over $y$ according to the definition of $<y>_{Y\vert x,f}$. Hence,
\begin{displaymath}
r(a,f)=\int\!dx\,b_1(x) p(x)\left( a(x)\,-\!<y>_{Y\vert x,f} \right)^2+{\rm const.}
\end{displaymath} (55)

This is minimized by $a(x) = <y>_{Y\vert x,f}$. Notice that for Gaussian $p(y\vert x,a)$ with fixed variance, log-loss and squared-error loss are equivalent. For multi-dimensional $y$, one-dimensional loss functions like Eq. (51) can be used when the component index of $y$ is considered part of the $x$-variables. Alternatively, loss functions depending explicitly on a multidimensional $y$ can be defined. For instance, a general quadratic loss function would be
\begin{displaymath}
l(x,y,a)=
\sum_{k,k^\prime }(y_k-a_k(x))\,{\bf K}(k,k^\prime)\,(y_{k^\prime }
-a_{k^\prime }(x)),
\end{displaymath} (56)

with symmetric, positive definite kernel ${\bf K}(k,k^\prime )$.
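
As a numerical illustration of Eq. (52), the following minimal sketch (not from the paper, using an arbitrary Gaussian predictive density discretized on a grid) searches over constant actions $a$ at a single $x$ and recovers the mean of $p(y\vert x,f)$ as the minimizer of the expected squared-error loss.

\begin{verbatim}
import numpy as np

y = np.linspace(-3.0, 3.0, 601)              # grid of y values for a single x
w = np.exp(-0.5 * ((y - 0.7) / 0.5)**2)      # unnormalized Gaussian p(y|x,f)
w /= w.sum()                                 # discretized predictive density

def squared_error_risk(a):
    # <(y - a)^2>_{Y|x,f} on the grid, with b_1 = 1 and b_2 = 0
    return np.sum(w * (y - a)**2)

a_grid = np.linspace(-2.0, 2.0, 4001)
a_star = a_grid[np.argmin([squared_error_risk(a) for a in a_grid])]
mean_y = np.sum(w * y)

print(a_star, mean_y)                        # both close to 0.7
\end{verbatim}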



Absolute loss: For absolute loss

\begin{displaymath}
l(x,y,a) = b_1(x) \vert y-a(x)\vert +b_2(x,y)
,
\end{displaymath} (57)

with arbitrary $b_1(x)>0$ and $b_2(x,y)$, the risk becomes
\begin{eqnarray}
r(a,f)
&=& \int \!dx\, b_1(x)\, p(x) \int_{-\infty}^{a(x)}\!dy \left( a(x)-y \right) p(y\vert x,f)
\nonumber \\
&& +\int \!dx\, b_1(x)\, p(x)\int_{a(x)}^\infty\!dy \left( y-a(x) \right) p(y\vert x,f)
+{\rm const.}
\qquad (58) \\
&=& 2\int \!dx\, b_1(x)\, p(x) \int_{m(x)}^{a(x)}\!dy \left( a(x)-y \right) p(y\vert x,f)
+{\rm const.}^\prime
,
\qquad (59)
\end{eqnarray}

where the integrals have been rewritten as $\int_{-\infty}^{a(x)} = \int_{-\infty}^{m(x)} + \int_{m(x)}^{a(x)}$ and $\int_{a(x)}^\infty = \int_{a(x)}^{m(x)} + \int_{m(x)}^{\infty}$, introducing a median function $m(x)$ which satisfies
\begin{displaymath}
\int_{-\infty}^{m(x)} \!dy\, p(y\vert x,f) = \frac{1}{2},\, \forall x\in X
,
\end{displaymath} (60)

so that
\begin{displaymath}
a(x) \left( \int_{-\infty}^{m(x)} \!dy\, p(y\vert x,f)
-\int_{m(x)}^\infty \!dy\, p(y\vert x,f)
\right)
= 0, \,\, \forall x\in X
.
\end{displaymath} (61)

Thus the risk is minimized by any median function $m(x)$.
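
The median property can likewise be illustrated numerically. The sketch below (not from the paper, using an arbitrary skewed predictive density discretized on a grid) minimizes the expected absolute loss at a single $x$ over a grid of actions and finds the median of $p(y\vert x,f)$, which differs from the mean.

\begin{verbatim}
import numpy as np

y = np.linspace(0.0, 10.0, 1001)       # grid of y values for a single x
w = y * np.exp(-y)                     # skewed (Gamma-like) predictive density
w /= w.sum()                           # normalize on the grid

def absolute_risk(a):
    # <|y - a|>_{Y|x,f} on the grid, with b_1 = 1 and b_2 = 0
    return np.sum(w * np.abs(y - a))

a_grid = np.linspace(0.0, 10.0, 5001)
a_star = a_grid[np.argmin([absolute_risk(a) for a in a_grid])]
median = y[np.searchsorted(np.cumsum(w), 0.5)]   # m(x) from Eq. (60)
mean   = np.sum(w * y)

print(a_star, median, mean)   # a_star matches the median (~1.68), not the mean (~2.0)
\end{verbatim}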



$\delta$-loss and $0$-$1$ loss: Another possible loss function, typical for classification tasks (see Section 3.8) such as image segmentation [153], is the $\delta$-loss for continuous $y$ or the $0$-$1$ loss for discrete $y$

\begin{displaymath}
l(x,y,a) = - b_1(x) \delta \left( y-a(x) \right) +b_2(x,y),
\end{displaymath} (62)

with arbitrary $b_1(x)>0$ and $b_2(x,y)$. Here $\delta$ denotes the Dirac $\delta$-functional for continuous $y$ and the Kronecker $\delta$ for discrete $y$. Then,
\begin{displaymath}
r(a,f) = -\int\!dx\, b_1(x) p(x) \, p(\,y\!=\!a(x)\,\vert x,f)
+{\rm const.}
,
\end{displaymath} (63)

so the optimal $a$ corresponds to any mode function of the predictive density. For Gaussians, mode and median are unique and coincide with the mean.
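
For discrete $y$ the mode property of Eq. (63) is easy to verify by exhaustive search. The following minimal sketch (not from the paper, with randomly generated probability tables) compares all deterministic actions $a\!: X \to Y$ and confirms that the risk-minimizing action picks the mode of the predictive density at every $x$.

\begin{verbatim}
from itertools import product
import numpy as np

rng = np.random.default_rng(2)

n_x, n_y = 4, 5
p_x = rng.dirichlet(np.ones(n_x))                # p(x)
p_pred = rng.dirichlet(np.ones(n_y), size=n_x)   # predictive density p(y|x,f)

def zero_one_risk(a):
    # r(a,f) = -sum_x p(x) p(y=a(x)|x,f) for a deterministic action a: X -> Y
    return -np.sum(p_x * p_pred[np.arange(n_x), a])

# exhaustive search over all |Y|^|X| deterministic actions
best = min(product(range(n_y), repeat=n_x),
           key=lambda a: zero_one_risk(np.array(a)))

# the minimizer picks the mode of p(y|x,f) at every x
print(np.array_equal(np.array(best), p_pred.argmax(axis=1)))   # True
\end{verbatim}

The exhaustive search is of course only feasible for tiny $X$ and $Y$; it merely serves to confirm that no action improves on the per-$x$ mode.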

