
Density estimation with Gaussian specific prior

In this section we look at some numerical examples and discuss implementations of the nonparametric learning algorithms for density estimation that have been developed in this paper.

As an example, consider a problem with a one-dimensional $X$-space and a one-dimensional $Y$-space, and a smoothness prior with inverse covariance

\begin{displaymath}
{{\bf K}} = \lambda_x \left( {{\bf K}}_X\otimes 1_Y \right)
+ \lambda_y \left( 1_X\otimes {{\bf K}}_Y \right) ,
\end{displaymath} (705)

where
\begin{displaymath}
{{\bf K}}_X = \lambda_0 {\bf I}_X- \lambda_2 {\Delta}_x
+ \lambda_4 {\Delta}_x^2 - \lambda_6 {\Delta}_x^3
\end{displaymath} (706)

\begin{displaymath}
{{\bf K}}_Y = \lambda_0 {\bf I}_Y- \lambda_2 {\Delta}_y
+ \lambda_4 {\Delta}_y^2 - \lambda_6 {\Delta}_y^3,
\end{displaymath} (707)

and Laplacian
\begin{displaymath}
{\Delta}_x (x,x^\prime )
= \delta^{\prime\prime} (x-x^\prime)
= \delta (x-x^\prime)\frac{d^2}{dx^2},
\end{displaymath} (708)

and analogously for ${\Delta}_y$. For $\lambda_2\ne 0$ and $\lambda _0$ = $\lambda _4$ = $\lambda _6$ = $0$ this corresponds to the two Laplacian prior factors $\Delta_x$ for $x$ and $\Delta_y$ for $y$. (Notice that even for $\lambda _x$ = $\lambda _y$ the $\lambda _4$- and $\lambda _6$-terms do not include all terms of an iterated two-dimensional Laplacian, like $\Delta^2$ = $(\Delta_x+\Delta_y)^2$ or $\Delta^4$, as the mixed derivatives $\Delta_x\Delta_y$ are missing.)
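
For readers who want to reproduce the setup numerically, the following sketch (not the author's original code, which relied on Mathematica) shows one way to discretize the operators of Eqs. (705)-(707) on a regular mesh. It assumes a standard second-order finite-difference Laplacian with unit grid spacing and periodic boundary conditions; the function and parameter names are illustrative only.

\begin{verbatim}
import numpy as np

def laplacian_1d(n, periodic=True):
    # Discrete 1-d Laplacian (unit grid spacing).
    D = -2.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
    if periodic:                       # wrap-around couplings
        D[0, -1] = D[-1, 0] = 1.0
    return D

def k_1d(n, lam0=0.0, lam2=1.0, lam4=0.0, lam6=0.0, periodic=True):
    # K_X (or K_Y) = lam0 I - lam2 Delta + lam4 Delta^2 - lam6 Delta^3,
    # Eqs. (706), (707).
    D = laplacian_1d(n, periodic)
    I = np.eye(n)
    return lam0 * I - lam2 * D + lam4 * D @ D - lam6 * D @ D @ D

def k_2d(nx, ny, lam_x=1.0, lam_y=1.0, **lams):
    # K = lam_x (K_X kron 1_Y) + lam_y (1_X kron K_Y), Eq. (705),
    # acting on fields flattened with index i = x*ny + y.
    return (lam_x * np.kron(k_1d(nx, **lams), np.eye(ny))
            + lam_y * np.kron(np.eye(nx), k_1d(ny, **lams)))

# Example: pure Laplacian prior (lam2 = 1) on the 10 x 15 mesh used below.
K = k_2d(10, 15, lam2=1.0)             # shape (150, 150)
\end{verbatim}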

We will now study nonparametric density estimation with prior factors being Gaussian with respect to $L$ as well as being Gaussian with respect to $P$.

The error or energy functional for a Gaussian prior factor in $L$ is given by Eq. (108). The corresponding iteration procedure is

\begin{displaymath}
L^{(i+1)} =
L^{(i)} + \eta {\bf A}^{-1}
\left(N - {{\bf K}} L^{(i)}
- e^{L^{(i)}} \left[
N_X-{\bf I}_X {{\bf K}} L^{(i)}
\right]\right)
.
\end{displaymath} (709)

Written explicitly for $\lambda_2=1$ and $\lambda _0$ = $\lambda _4$ = $\lambda _6$ = $0$, Eq. (709) reads
\begin{displaymath}
L^{(i+1)}(x,y) = L^{(i)}(x,y) +
\eta \sum_j {\bf A}^{-1} (x,y;x_j ,y_j )
+\eta \int \! dx^\prime dy^\prime \, {\bf A}^{-1} (x,y;x^\prime ,y^\prime )
\left[
\left( \frac{d^2}{d(x^\prime )^2} + \frac{d^2}{d(y^\prime )^2} \right)
L^{(i)} (x^\prime ,y^\prime )
-
\left(
\sum_j \delta(x^\prime -x_j)
+\int \!dy^{\prime\prime}
\left( \frac{d^2}{d(x^\prime )^2} + \frac{d^2}{d(y^{\prime\prime} )^2} \right)
L^{(i)} (x^\prime ,y^{\prime\prime} )
\right)
e^{L^{(i)}(x^\prime ,y^\prime )}
\right].
\end{displaymath} (710)

Here

\begin{displaymath}
\int_{y_A}^{y_B} \!dy^{\prime\prime}\,
\frac{d^2}{d(y^{\prime\prime} )^2}
L^{(i)} (x^\prime ,y^{\prime\prime} )
=
\frac{d}{dy^{\prime\prime}}
L^{(i)} (x^\prime ,y^{\prime\prime} )\Bigg\vert _{y_A}^{y_B}
\end{displaymath}

vanishes if the first derivative $\frac{d}{dy} L^{(i)} (x,y)$ vanishes at the boundaries or if $L^{(i)}$ is periodic in $y$.
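
The update (709), (710) is straightforward to implement on the mesh. The following sketch is a minimal illustration (not the code used for the figures), assuming unit grid spacing so that integrals over $y$ become sums along the $y$-axis, the flattened index convention $i = x\,n_y + y$ from the sketch above, and a generic positive definite iteration matrix ${\bf A}$ supplied through its inverse.

\begin{verbatim}
import numpy as np

def step_L(L, N, K, A_inv, eta, nx, ny):
    # One step L -> L + eta A^{-1} (N - K L - e^L (N_X - I_X K L)), Eq. (709).
    KL   = K @ L                                  # (K L)(x,y), flattened
    NX   = N.reshape(nx, ny).sum(axis=1)          # N_X(x) = sum_y N(x,y)
    IXKL = KL.reshape(nx, ny).sum(axis=1)         # (I_X K L)(x) = sum_y (K L)(x,y)
    lam  = np.repeat(NX - IXKL, ny)               # broadcast over y
    grad = N - KL - np.exp(L) * lam
    return L + eta * (A_inv @ grad)
\end{verbatim}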

Analogously, for the error functional $E_P$ of Eq. (164) the iteration procedure

\begin{displaymath}
P^{(i+1)} =
P^{(i)} + \eta {\bf A}^{-1}
\left[
({\bf P}^{(i)})^{-1} N - N_X
+ {\bf I}_X {\bf P}^{(i)} {{\bf K}} P^{(i)}
- {{\bf K}} P^{(i)}
\right]
\end{displaymath} (711)

becomes, for $\lambda_2=1$ and $\lambda _0$ = $\lambda _4$ = $\lambda _6$ = $0$,
\begin{displaymath}
P^{(i+1)}(x,y) = P^{(i)}(x,y) +
\eta \sum_j \frac{{\bf A}^{-1} (x,y;x_j ,y_j )}
{P^{(i)} (x_j,y_j)}
+\eta \int \! dx^\prime dy^\prime \, {\bf A}^{-1} (x,y;x^\prime ,y^\prime )
\left[
\left( \frac{d^2}{d(x^\prime )^2} + \frac{d^2}{d(y^\prime )^2} \right)
P^{(i)} (x^\prime ,y^\prime )
-\left(
\sum_j \delta(x^\prime -x_j)
+\int \!dy^{\prime\prime} P^{(i)} (x^\prime ,y^{\prime\prime} )
\frac{d^2 P^{(i)} (x^\prime ,y^{\prime\prime} )}{d(x^\prime )^2}
+\int \!dy^{\prime\prime} P^{(i)} (x^\prime ,y^{\prime\prime} )
\frac{d^2 P^{(i)} (x^\prime ,y^{\prime\prime} )}{d(y^{\prime\prime} )^2}
\right)
\right].
\end{displaymath} (712)

Here

\begin{displaymath}
\int_{y_A}^{y_B} \!\!\!dy^{\prime\prime}\, P^{(i)} (x^\prime ,y^{\prime\prime} )
\frac{d^2 P^{(i)} (x^\prime ,y^{\prime\prime} )}{d(y^{\prime\prime} )^2}
=
P^{(i)} (x^\prime ,y^{\prime\prime} )
\frac{d P^{(i)} (x^\prime ,y^{\prime\prime} )}{dy^{\prime\prime}}
\Bigg\vert _{y_A}^{y_B}
-\int_{y_A}^{y_B} \!\!\!dy^{\prime\prime}
\left(
\frac{d P^{(i)} (x^\prime ,y^{\prime\prime} )}{dy^{\prime\prime}}
\right)^2 ,
\end{displaymath} (713)

where the first term vanishes for $P^{(i)}$ periodic or vanishing at the boundaries. (This has to be the case for $i\, d/dy$ to be hermitian.)
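
A corresponding sketch for the update (711), (712) with a Gaussian prior factor in $P$, under the same mesh conventions as above, might look as follows; the small constant guarding the division by $P^{(i)}$ is an implementation detail added here and not part of the original scheme.

\begin{verbatim}
import numpy as np

def step_P(P, N, K, A_inv, eta, nx, ny, eps=1e-12):
    # One step P -> P + eta A^{-1} (P^{-1} N - N_X + I_X(P K P) - K P), Eq. (711).
    KP    = K @ P
    NX    = N.reshape(nx, ny).sum(axis=1)               # N_X(x)
    IXPKP = (P * KP).reshape(nx, ny).sum(axis=1)        # sum_y P(x,y) (K P)(x,y)
    lam   = np.repeat(NX - IXPKP, ny)                   # Lagrange-multiplier term
    grad  = N / np.maximum(P, eps) - lam - KP
    return P + eta * (A_inv @ grad)
\end{verbatim}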

We now study density estimation problems numerically. In particular, we want to check the influence of the nonlinear normalization constraint. Furthermore, we want to compare models with Gaussian prior factors for $L$ with models with Gaussian prior factors for $P$.

The following numerical calculations have been performed on a mesh of dimension 10$\times$15, i.e., $x\in [1,10]$ and $y\in [1,15]$, with periodic boundary conditions in $y$ and sometimes also in $x$. A variety of different iteration and initialization methods have been used.

Figs. 14-17 summarize results for density estimation problems with only two data points, where differences in the effects of varying smoothness priors are particularly easy to see. A density estimation based on more data points can be found in Fig. 21.

For Fig. 14 a Laplacian smoothness prior on $L$ has been implemented. The solution has been obtained by iterating with the negative Hessian as long as it is positive definite; otherwise the gradient algorithm has been used. One iteration step means one iteration according to Eq. (617) with the optimal $\eta $. Thus, each iteration step includes the optimization of $\eta $ by a line search algorithm. (For the figures the Mathematica function FindMinimum has been used to optimize $\eta $.)
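
As an illustration of such a line search (a sketch only; the figures themselves were produced with Mathematica's FindMinimum), one could optimize $\eta $ along a given update direction as follows, where energy is assumed to evaluate $E_L$ on the flattened mesh.

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize_scalar

def line_search_step(L, direction, energy):
    # Choose eta minimizing E_L(L + eta * direction), then take the step.
    res = minimize_scalar(lambda eta: energy(L + eta * direction))
    return L + res.x * direction
\end{verbatim}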

As initial guess in Fig. 14 the kernel estimate $L^{(0)}$ = $\ln (\tilde {\bf C} \tilde P_{\rm emp})$ has been employed, with $\tilde {\bf C}$ defined in Eq. (698) and ${\bf C}$ = $({\bf K}+m_C^2{\bf I})$ with squared mass $m_C^2$ = $0.1$. The fast drop-off of the energy $E_L$ within the first two iterations shows the quality of this initial guess. Indeed, this fast convergence seems to indicate that the problem is nearly linear, i.e., that the influence of the only nonlinear term in the stationarity equation, the normalization constraint, is not too strong. Notice also that the reconstructed regression function shows the typical piecewise linear approximation well known from one-dimensional (normalization-constraint-free) regression problems with a Laplacian prior.
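
Such a kernel-type initial guess can be sketched as below; this assumes that applying $\tilde {\bf C}$ amounts to smoothing the empirical density with ${\bf C}^{-1}$ = $({\bf K}+m_C^2{\bf I})^{-1}$ followed by a conditional normalization over $y$ (the precise definition of $\tilde {\bf C}$ is given in Eq. (698)).

\begin{verbatim}
import numpy as np

def kernel_initial_L(P_emp, K, m_C2, nx, ny, eps=1e-12):
    # L^(0) = ln(C~ P~_emp): smooth the empirical density,
    # normalize over y, take the logarithm.
    C_inv = np.linalg.inv(K + m_C2 * np.eye(K.shape[0]))
    P0 = np.clip(C_inv @ P_emp, eps, None).reshape(nx, ny)
    P0 /= P0.sum(axis=1, keepdims=True)      # conditional normalization over y
    return np.log(P0).ravel()
\end{verbatim}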

Fig. 15 shows a density estimation similar to Fig. 14, but for a Gaussian prior factor in $P$, and thus also with a different $\lambda _2$, a different initialization, and a slightly different iteration procedure. For Fig. 15 a kernel estimate $P^{(0)}$ = $(\tilde {\bf C} \tilde P_{\rm emp})$ has again been used as initial guess, with $\tilde {\bf C}$ as defined in Eq. (698) and ${\bf C}$ = $({\bf K}+m_C^2{\bf I})$, but with squared mass $m_C^2$ = $1.0$. The solution has been obtained by prior relaxation with ${\bf A}$ = ${\bf K}+m^2{\bf I}$, where the mass term with $m^2$ = 1.0 makes ${\bf A}$ invertible for a Laplacian ${\bf K}$ = $-\Delta$ with periodic boundary conditions. This iteration scheme does not require calculating the Hessian ${\bf H}_P$ at each iteration step. Again the quality of the initial guess (and of the iteration scheme) is indicated by the fast drop-off of the energy $E_P$ during the first iteration.
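
Since ${\bf A}$ = ${\bf K}+m^2{\bf I}$ is fixed during such a prior relaxation, its (here only $150\times 150$) inverse can be computed once and reused in every step, which is what makes the scheme cheaper than recomputing the Hessian. A sketch, reusing step_P from above and using, for simplicity, a fixed step width $\eta $ instead of a line search:

\begin{verbatim}
import numpy as np

def relax_P(P0, N, K, m2, eta, nx, ny, n_iter=50):
    # Massive prior relaxation: A = K + m^2 I is invertible for m^2 > 0.
    A_inv = np.linalg.inv(K + m2 * np.eye(K.shape[0]))
    P = P0.copy()
    for _ in range(n_iter):
        P = step_P(P, N, K, A_inv, eta, nx, ny)
    return P
\end{verbatim}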

Because the range of $P$-values, lying between zero and one, is smaller than that of $L$-values, lying between minus infinity and zero, a larger Laplacian smoothness factor $\lambda _2$ is needed in Fig. 15 to obtain results similar to those of Fig. 14. In particular, the $\lambda _2$ values of the two figures have been chosen such that the maximal values of the two reconstructed probability densities $P$ turn out to be nearly equal.

Because the logarithm particularly expands the distances between small probabilities, one would expect a Gaussian prior for $L$ to be especially effective for small probabilities. Comparing Fig. 14 and Fig. 15, this effect can indeed be seen. The deep valleys appearing in the $L$-landscape of Fig. 15 show that small values of $L$ are not smoothed out as effectively as in Fig. 14. Notice that, as a consequence, the variance of the solution $p(y\vert x,h)$ is also much smaller for a Gaussian prior in $P$ at those $x$ which are in the training set.

Fig. 16 summarizes results for a model similar to that presented in Fig. 14, but with a $(-\Delta ^3)$-prior replacing the Laplacian $(-\Delta)$-prior. As all quadratic functions have zero third derivative, such a prior, applied to $L$, favors quadratic log-likelihoods, corresponding to Gaussian probabilities $P$. Indeed, this is indicated by the striking difference between the regression functions in Fig. 16 and in Fig. 14: the $(-\Delta ^3)$-prior produces a much rounder regression function, especially at the $x$ values which appear in the data. Note, however, that in contrast to a pure Gaussian regression problem, in density estimation an additional non-quadratic normalization constraint is present.

In Fig. 17 a similar prior has been applied, but this time Gaussian in $P$ instead of $L$. In contrast to a $(-\Delta ^3)$-prior for $L$, a $(-\Delta ^3)$-prior for $P$ implements a tendency towards quadratic $P$. Similarly to the difference between Fig. 14 and Fig. 16, the regression function in Fig. 17 is also rounder than that in Fig. 15. Furthermore, smoothing in Fig. 17 is again less effective for smaller probabilities than it is in Fig. 16, the same result we have found when comparing the two Laplacian priors of Figs. 14 and 15. This leads to deeper valleys in the $L$-landscape and to a smaller variance, especially at those $x$ which appear in the training data.

Fig. 21 depicts the results of a density estimation based on more than two data points. In particular, fifty training data points have been obtained by sampling with uniform $p(x)$ from the ``true'' density

\begin{displaymath}
P_{\rm true}(x,y)
= p(y\vert x,h_{\rm true})
= \frac{1}{2\sqrt{2\pi \sigma_0^2}}
\left(
e^{-\frac{(y-h_a(x))^2}{2\sigma_0^2}}
+
e^{-\frac{(y-h_b(x))^2}{2\sigma_0^2}}
\right)
,
\end{displaymath} (714)

with $\sigma_0$ = $1.5$, $h_a(x)$ = $125/18+(5/9)x$, $h_b(x)$ = $145/18-(5/9)x$, shown in the top row of Fig. 18. The sampling process has been implemented using the transformation method (see for example [196]). The corresponding empirical density $N/n$ (235) and conditional empirical density $P_{\rm emp}$ of Eq. (236), in this case equal to the extended $\tilde P_{\rm emp}$ defined in Eq. (238), can be found in Fig. 20.
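
For completeness, sampling from the density (714) with uniform $p(x)$ can be sketched as follows; the paper uses the transformation method, while this illustration simply relies on numpy's Gaussian sampler and leaves the binning of the continuous $y$ values onto the mesh as a separate step.

\begin{verbatim}
import numpy as np

def sample_true(n=50, sigma0=1.5, seed=0):
    # Draw (x, y) pairs from Eq. (714): uniform x on the mesh points 1..10,
    # y from an equal mixture of Gaussians centred at h_a(x) and h_b(x).
    rng = np.random.default_rng(seed)
    x = rng.integers(1, 11, size=n)                 # uniform p(x)
    h_a = 125.0 / 18.0 + (5.0 / 9.0) * x
    h_b = 145.0 / 18.0 - (5.0 / 9.0) * x
    mean = np.where(rng.random(n) < 0.5, h_a, h_b)  # pick a mixture branch
    y = rng.normal(mean, sigma0)                    # continuous y
    return x, y
\end{verbatim}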

Fig. 21 shows the maximum posterior solution $p(y\vert x,h^*)$ and its logarithm, the energy $E_L$ during iteration, the regression function

\begin{displaymath}
h(x)
= \int \!dy \, y \, p(y\vert x,h_{\rm true})
= \int \!dy \, y \, P_{\rm true}(x,y)
,
\end{displaymath} (715)

(as a reference, the regression function of the true likelihood $p(y\vert x,h_{\rm true})$ is shown in Fig. 19), the average training error (or empirical (conditional) log-loss)
\begin{displaymath}
<-\ln p(y\vert x,h) >_D
= -\frac{1}{n}\sum_{i=1}^n\ln p(y_i\vert x_i,h)
,
\end{displaymath} (716)

and the average test error (or true expectation of (conditional) log-loss) for uniform $p(x)$
\begin{displaymath}
<-\ln p(y\vert x,h) >_{P_{\rm true}}
=
-\int \!dy\,dx\, p(x)p(y\vert x,h_{\rm true}) \ln p(y\vert x,h)
,
\end{displaymath} (717)

which is, up to a constant, equal to the expected Kullback-Leibler distance between the actual solution and the true likelihood,
\begin{displaymath}
{\rm KL}\Big(p(x,y\vert h_{\rm true}),p(y\vert x,h)\Big)
=
-\int \!dx\,dy\, p(x,y\vert h_{\rm true})
\ln \frac{p(y\vert x,h)}{p(y\vert x,h_{\rm true}) }
.
\end{displaymath} (718)

The test error measures the quality of the achieved solution. In contrast to the energy and the training error, it has of course not been available to the learning algorithm.
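
On the discrete mesh (unit grid spacing, so integrals become sums) these quantities can be evaluated as in the following sketch, where p and p_true are $(n_x,n_y)$ arrays of conditional probabilities normalized over the $y$-axis; the array layout is an assumption of this illustration.

\begin{verbatim}
import numpy as np

def regression_function(p, y_values):
    # h(x) = sum_y y p(y|x), Eq. (715) on the mesh.
    return p @ y_values

def training_error(p, x_idx, y_idx):
    # Average empirical log-loss, Eq. (716); data given as mesh indices.
    return -np.mean(np.log(p[x_idx, y_idx]))

def test_error(p, p_true):
    # Expected log-loss under p_true with uniform p(x), Eq. (717).
    nx = p.shape[0]
    return -np.sum(p_true * np.log(p)) / nx
\end{verbatim}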

The maximum posterior solution of Fig. 21 has been calculated by minimizing $E_L$ using massive prior iteration with ${\bf A}$ = ${\bf K}+m^2{\bf I}$, a squared mass $m^2$ = $0.01$, and a (conditionally) normalized, constant $L^{(0)}$ as initial guess. Convergence has been fast, and the resulting regression function is similar to the true one (see Fig. 19).

Fig. 22 compares some iteration procedures and initialization methods. Clearly, all methods do what they should do: they decrease the energy functional. Iterating with the negative Hessian yields the fastest convergence. Massive prior iteration is nearly as fast, even for a uniform initialization, and does not require calculation of the Hessian at each iteration step. Finally, the slowest iteration method, but the easiest to implement, is the gradient algorithm.

Looking at Fig. 22 one can distinguish data-oriented from prior-oriented initializations. We understand data-oriented initial guesses to be those for which the training error is smaller at the beginning of the iteration than for the final solution, and prior-oriented initial guesses to be those for which the opposite is true. For good initial guesses the difference is small. Clearly, the uniform initialization is prior-oriented, while an empirical log-density $\ln(N/n+\epsilon)$ and the shown kernel initializations are data-oriented.

The case where the test error grows while the energy is decreasing indicates a misspecified prior and is typical for overfitting. For example, in the fifth row of Fig. 22 the test error (and in this case also the average training error) grows again after having reached a minimum while the energy is steadily decreasing.

