

Classification

In classification (or pattern recognition) tasks the dependent visible variable $y$ takes discrete values (group, cluster or pattern labels) [16,61,24,47]. We write $y$ = $k$ and $p(y\vert x,h)$ = $P_k(x,h)$, i.e., $\sum_k P_k(x,h)$ = $1$. Having received classification data $D$ = $\{(x_i,k_i)\vert 1\le i\le n\}$, the density estimation error functional for a prior on the function $\phi$ (with components $\phi_k$ and $P$ = $P(\phi)$) reads

\begin{displaymath}
E_{\rm cl.}
=
-\sum_i^n \ln P_{k_i}(x_i;\phi)
+\frac{1}{2}\Big(\phi-t,\, {\bf K}\,(\phi-t) \Big)
+(P(\phi), \Lambda_X)
.
\end{displaymath} (335)

In classification the scalar product corresponds to an integral over $x$ and a summation over $k$, e.g.,
\begin{displaymath}
\Big(\phi-t,\, {\bf K}\,(\phi-t) \Big)
=
\sum_{k,k^\prime} \int\!dx\,dx^\prime\,
(\phi_{k}(x)-t_{k}(x))\,
{\bf K}_{k,k^\prime}(x,x^\prime)\,
(\phi_{k^\prime}(x^\prime)-t_{k^\prime}(x^\prime))
,
\end{displaymath} (336)

and $(P,\Lambda_X)$ = $\int\!dx\,\Lambda_X(x)\sum_k P_k(x)$.
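
To make this notation concrete, the following minimal Python sketch evaluates the discretized quadratic form of Eq. (336); the grid sizes, the template $t$, and the random positive semi-definite operator ${\bf K}$ are purely illustrative assumptions, not taken from the text.

import numpy as np

# Illustrative discretization: n_x grid points in x, n_classes values of k.
n_classes, n_x = 3, 50
dx = 1.0 / n_x                           # grid spacing for the x-integrals

rng = np.random.default_rng(0)
phi = rng.normal(size=(n_classes, n_x))  # phi_k(x) on the grid
t = np.zeros((n_classes, n_x))           # template function t_k(x)

# Generic symmetric positive semi-definite K_{k,k'}(x,x'), represented as a
# matrix over the stacked index (k, x).
A = rng.normal(size=(n_classes * n_x, n_classes * n_x))
K = A @ A.T

d = (phi - t).reshape(-1)                # stack (k, x) into one index
quad_form = d @ K @ d * dx * dx          # sum over k,k' and integrals over x,x'
print(0.5 * quad_form)                   # regularizer term in Eq. (335)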

For zero-one loss $l(x,k,a)$ = $1-\delta_{k,a(x)}$ -- a typical loss function for classification problems -- the optimal decision (or Bayes classifier) is given by the mode of the predictive density (see Section 2.2.2), i.e.,

\begin{displaymath}
a(x) = {\rm argmax}_k \, p(k\vert x,D,D_0)
.
\end{displaymath} (337)
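
This can be checked directly: under the zero-one loss above, the expected loss (risk) of a deterministic decision rule $a$ at $x$ is
\begin{displaymath}
r(a,x) = \sum_k p(k\vert x,D,D_0)\,\big(1-\delta_{k,a(x)}\big)
= 1 - p(a(x)\vert x,D,D_0)
,
\end{displaymath}
which is minimal when $a(x)$ selects the class with the largest predictive probability, i.e., the mode.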

In the saddle point approximation $p(k\vert x,D,D_0)\approx p(k\vert x,\phi^*)$, where the minimizer $\phi^*$ of $E_{\rm cl.}(\phi)$ can be found by solving the stationarity equation (228).

For the choice $\phi_k$ = $P_k$, non-negativity and normalization must be ensured explicitly. For $\phi$ = $L$ with $P$ = $e^L$, non-negativity is automatically fulfilled, but the Lagrange multiplier term must be included to ensure normalization.

Normalization is guaranteed automatically by using unnormalized probabilities $\phi_k$ = $z_k$ with $P_k$ = $z_k/\sum_l z_l$ (for which non-negativity still has to be checked), or shifted log-likelihoods $\phi_k$ = $g_k$ with $g_k$ = $L_k +\ln \sum_l e^{g_l}$, i.e., $P_k$ = $e^{g_k}/\sum_l e^{g_l}$. In that case the nonlocal normalization terms are part of the likelihood and no Lagrange multiplier has to be used [236]. The resulting equation can be solved in the space defined by the $X$-data (see Eq. (153)). The restriction of $\phi_k$ = $g_k$ to linear functions $\phi_k(x) = w_k x +b_k$ yields log-linear models [154]. Recently a mean field theory for Gaussian process classification has been developed [177,179].
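
As a minimal illustration of this parameterization (the weights $w_k$, $b_k$ and the test point below are assumed values, not taken from the text), the following Python sketch computes softmax probabilities from shifted log-likelihoods $g_k$ restricted to linear functions and returns the plug-in Bayes decision of Eq. (337), with $p(k\vert x,D,D_0)$ approximated by $P_k(x;\phi^*)$ as in the saddle point approximation.

import numpy as np

def softmax(g):
    # P_k = exp(g_k) / sum_l exp(g_l): non-negative and normalized by construction
    g = g - g.max()                      # subtract the maximum for numerical stability
    e = np.exp(g)
    return e / e.sum()

# Log-linear model: g_k(x) = w_k x + b_k (illustrative parameter values)
rng = np.random.default_rng(0)
n_classes, n_features = 3, 2
w = rng.normal(size=(n_classes, n_features))
b = rng.normal(size=n_classes)

def predict_proba(x):
    return softmax(w @ x + b)            # P_k(x) from shifted log-likelihoods g_k

def bayes_classifier(x):
    return int(np.argmax(predict_proba(x)))  # mode of the approximate predictive density

x = np.array([0.5, -1.0])
print(predict_proba(x), bayes_classifier(x))

Subtracting the maximum of $g$ before exponentiating leaves the ratios, and hence $P_k$, unchanged while avoiding numerical overflow.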

Table 3 lists some special cases of density estimation. The last line of the table, referring to inverse quantum mechanics, will be discussed in the next section.


Table 3: Special cases of density estimation

likelihood $p(y\vert x,h)$     | problem type
-------------------------------|---------------------------
of general form                | density estimation
discrete $y$                   | classification
Gaussian with fixed variance   | regression
mixture of Gaussians           | clustering
quantum mechanical likelihood  | inverse quantum mechanics


