
Density estimation with Gaussian specific prior

In this section we look at some numerical examples and discuss implementations of the nonparametric learning algorithms for density estimation that have been developed in this paper.

As an example, consider a problem with a one-dimensional $X$-space and a one-dimensional $Y$-space, and a smoothness prior with inverse covariance

\begin{displaymath}
{{\bf K}} = \lambda_x \left( {{\bf K}}_X\otimes 1_Y \right)
+ \lambda_y \left( 1_X\otimes {{\bf K}}_Y \right) ,
\end{displaymath} (705)

where
\begin{displaymath}
{{\bf K}}_X = \lambda_0 {\bf I}_X- \lambda_2 {\Delta}_x
+ \lambda_4 {\Delta}_x^2 - \lambda_6 {\Delta}_x^3
\end{displaymath} (706)

\begin{displaymath}
{{\bf K}}_Y = \lambda_0 {\bf I}_Y- \lambda_2 {\Delta}_y
+ \lambda_4 {\Delta}_y^2 - \lambda_6 {\Delta}_y^3,
\end{displaymath} (707)

and Laplacian
\begin{displaymath}
{\Delta}_x (x,x^\prime )
= \delta^{\prime\prime} (x-x^\prime)
= \delta (x-x^\prime)\frac{d^2}{dx^2},
\end{displaymath} (708)

and analogously for ${\Delta}_y$. For $\lambda_2\ne 0$ and $\lambda _0$ = $\lambda _4$ = $\lambda _6$ = $0$ this corresponds to the two Laplacian prior factors $\Delta_x$ for $x$ and $\Delta_y$ for $y$. (Notice that even for $\lambda _x$ = $\lambda _y$ the $\lambda _4$- and $\lambda _6$-terms do not include all terms of an iterated two-dimensional Laplacian, like $\Delta^2$ = $(\Delta_x+\Delta_y)^2$ or $\Delta^4$, as the mixed derivatives $\Delta_x\Delta_y$ are missing.)
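
For readers who want to reproduce the setup numerically, the following sketch (not the author's original code, which relied on Mathematica) shows one way to discretize the operators of Eqs. (705)-(707) on a regular mesh. It assumes a standard second-order finite-difference Laplacian with unit grid spacing and periodic boundary conditions; the function and parameter names are illustrative only.

\begin{verbatim}
import numpy as np

def laplacian_1d(n, periodic=True):
    # Discrete 1-d Laplacian (unit grid spacing).
    D = -2.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
    if periodic:                       # wrap-around couplings
        D[0, -1] = D[-1, 0] = 1.0
    return D

def k_1d(n, lam0=0.0, lam2=1.0, lam4=0.0, lam6=0.0, periodic=True):
    # K_X (or K_Y) = lam0 I - lam2 Delta + lam4 Delta^2 - lam6 Delta^3,
    # Eqs. (706), (707).
    D = laplacian_1d(n, periodic)
    I = np.eye(n)
    return lam0 * I - lam2 * D + lam4 * D @ D - lam6 * D @ D @ D

def k_2d(nx, ny, lam_x=1.0, lam_y=1.0, **lams):
    # K = lam_x (K_X kron 1_Y) + lam_y (1_X kron K_Y), Eq. (705),
    # acting on fields flattened with index i = x*ny + y.
    return (lam_x * np.kron(k_1d(nx, **lams), np.eye(ny))
            + lam_y * np.kron(np.eye(nx), k_1d(ny, **lams)))

# Example: pure Laplacian prior (lam2 = 1) on the 10 x 15 mesh used below.
K = k_2d(10, 15, lam2=1.0)             # shape (150, 150)
\end{verbatim}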

We will now study nonparametric density estimation with prior factors being Gaussian with respect to $L$ as well as being Gaussian with respect to $P$.

The error or energy functional for a Gaussian prior factor in $L$ is given by Eq. (108). The corresponding iteration procedure is

\begin{displaymath}
L^{(i+1)} =
L^{(i)} + \eta {\bf A}^{-1}
\left(N - {{\bf K}} L^{(i)}
- e^{L^{(i)}} \left[
N_X-{\bf I}_X {{\bf K}} L^{(i)}
\right]\right)
.
\end{displaymath} (709)

Written explicitly for $\lambda_2=1$ and $\lambda _0$ = $\lambda _4$ = $\lambda _6$ = $0$, Eq. (709) reads
\begin{displaymath}
L^{(i+1)}(x,y) = L^{(i)}(x,y) +
\eta \sum_j {\bf A}^{-1} (x,y;x_j ,y_j )
+\eta \int \! dx^\prime dy^\prime \, {\bf A}^{-1} (x,y;x^\prime ,y^\prime )
\left[
\left( \frac{d^2}{d(x^\prime )^2} + \frac{d^2}{d(y^\prime )^2} \right)
L^{(i)} (x^\prime ,y^\prime )
-
\left(
\sum_j \delta(x^\prime -x_j)
+\int \!dy^{\prime\prime}
\left( \frac{d^2}{d(x^\prime )^2} + \frac{d^2}{d(y^{\prime\prime} )^2} \right)
L^{(i)} (x^\prime ,y^{\prime\prime} )
\right)
e^{L^{(i)}(x^\prime ,y^\prime )}
\right].
\end{displaymath} (710)

Here

\begin{displaymath}
\int_{y_A}^{y_B} \!dy^{\prime\prime}\,
\frac{d^2}{d(y^{\prime\prime} )^2}
L^{(i)} (x^\prime ,y^{\prime\prime} )
=
\frac{d}{dy^{\prime\prime}}
L^{(i)} (x^\prime ,y^{\prime\prime} )\Bigg\vert _{y_A}^{y_B}
\end{displaymath}

vanishes if the first derivative $\frac{d}{dy} L^{(i)} (x,y)$ vanishes at the boundaries or if $L^{(i)}$ is periodic in $y$.
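
The update (709), (710) is straightforward to implement on the mesh. The following sketch is a minimal illustration (not the code used for the figures), assuming unit grid spacing so that integrals over $y$ become sums along the $y$-axis, the flattened index convention $i = x\,n_y + y$ from the sketch above, and a generic positive definite iteration matrix ${\bf A}$ supplied through its inverse.

\begin{verbatim}
import numpy as np

def step_L(L, N, K, A_inv, eta, nx, ny):
    # One step L -> L + eta A^{-1} (N - K L - e^L (N_X - I_X K L)), Eq. (709).
    KL   = K @ L                                  # (K L)(x,y), flattened
    NX   = N.reshape(nx, ny).sum(axis=1)          # N_X(x) = sum_y N(x,y)
    IXKL = KL.reshape(nx, ny).sum(axis=1)         # (I_X K L)(x) = sum_y (K L)(x,y)
    lam  = np.repeat(NX - IXKL, ny)               # broadcast over y
    grad = N - KL - np.exp(L) * lam
    return L + eta * (A_inv @ grad)
\end{verbatim}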

Analogously, for the error functional $E_P$ of Eq. (164) the iteration procedure

\begin{displaymath}
P^{(i+1)} =
P^{(i)} + \eta {\bf A}^{-1}
\left[
({\bf P}^{(i)})^{-1} N - N_X
+ {\bf I}_X {\bf P}^{(i)} {{\bf K}} P^{(i)}
- {{\bf K}} P^{(i)}
\right]
\end{displaymath} (711)

becomes, for $\lambda_2=1$ and $\lambda _0$ = $\lambda _4$ = $\lambda _6$ = $0$,
\begin{displaymath}
P^{(i+1)}(x,y) = P^{(i)}(x,y) +
\eta \sum_j \frac{{\bf A}^{-1} (x,y;x_j ,y_j )}
{P^{(i)} (x_j,y_j)}
+\eta \int \! dx^\prime dy^\prime \, {\bf A}^{-1} (x,y;x^\prime ,y^\prime )
\left[
\left( \frac{d^2}{d(x^\prime )^2} + \frac{d^2}{d(y^\prime )^2} \right)
P^{(i)} (x^\prime ,y^\prime )
-\left(
\sum_j \delta(x^\prime -x_j)
+\int \!dy^{\prime\prime} P^{(i)} (x^\prime ,y^{\prime\prime} )
\frac{d^2 P^{(i)} (x^\prime ,y^{\prime\prime} )}{d(x^\prime )^2}
+\int \!dy^{\prime\prime} P^{(i)} (x^\prime ,y^{\prime\prime} )
\frac{d^2 P^{(i)} (x^\prime ,y^{\prime\prime} )}{d(y^{\prime\prime} )^2}
\right)
\right].
\end{displaymath} (712)

Here

\begin{displaymath}
\int_{y_A}^{y_B} \!\!\!dy^{\prime\prime}\, P^{(i)} (x^\prime ,y^{\prime\prime} )
\frac{d^2 P^{(i)} (x^\prime ,y^{\prime\prime} )}{d(y^{\prime\prime} )^2}
=
P^{(i)} (x^\prime ,y^{\prime\prime} )
\frac{d P^{(i)} (x^\prime ,y^{\prime\prime} )}{dy^{\prime\prime}}
\Bigg\vert _{y_A}^{y_B}
-\int_{y_A}^{y_B} \!\!\!dy^{\prime\prime}
\left(
\frac{d P^{(i)} (x^\prime ,y^{\prime\prime} )}{dy^{\prime\prime}}
\right)^2 ,
\end{displaymath} (713)

where the first term vanishes for $P^{(i)}$ periodic or vanishing at the boundaries. (This has to be the case for $i\, d/dy$ to be hermitian.)
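
A corresponding sketch for the update (711), (712) with a Gaussian prior factor in $P$, under the same mesh conventions as above, might look as follows; the small constant guarding the division by $P^{(i)}$ is an implementation detail added here and not part of the original scheme.

\begin{verbatim}
import numpy as np

def step_P(P, N, K, A_inv, eta, nx, ny, eps=1e-12):
    # One step P -> P + eta A^{-1} (P^{-1} N - N_X + I_X(P K P) - K P), Eq. (711).
    KP    = K @ P
    NX    = N.reshape(nx, ny).sum(axis=1)               # N_X(x)
    IXPKP = (P * KP).reshape(nx, ny).sum(axis=1)        # sum_y P(x,y) (K P)(x,y)
    lam   = np.repeat(NX - IXPKP, ny)                   # Lagrange-multiplier term
    grad  = N / np.maximum(P, eps) - lam - KP
    return P + eta * (A_inv @ grad)
\end{verbatim}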

We now study density estimation problems numerically. In particular, we want to check the influence of the nonlinear normalization constraint. Furthermore, we want to compare models with Gaussian prior factors for $L$ with models with Gaussian prior factors for $P$.

The following numerical calculations have been performed on a mesh of dimension 10$\times$15, i.e., $x\in [1,10]$ and $y\in [1,15]$, with periodic boundary conditions in $y$ and sometimes also in $x$. A variety of different iteration and initialization methods have been used.

Figs. 14-17 summarize results for density estimation problems with only two data points, where differences in the effects of varying smoothness priors are particularly easy to see. A density estimation based on more data points can be found in Fig. 21.

For Fig. 14 a Laplacian smoothness prior on $L$ has been implemented. The solution has been obtained by iterating with the negative Hessian as long as it is positive definite; otherwise the gradient algorithm has been used. One iteration step means one iteration according to Eq. (617) with the optimal $\eta $. Thus, each iteration step includes the optimization of $\eta $ by a line search algorithm. (For the figures the Mathematica function FindMinimum has been used to optimize $\eta $.)
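
As an illustration of such a line search (a sketch only; the figures themselves were produced with Mathematica's FindMinimum), one could optimize $\eta $ along a given update direction as follows, where energy is assumed to evaluate $E_L$ on the flattened mesh.

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize_scalar

def line_search_step(L, direction, energy):
    # Choose eta minimizing E_L(L + eta * direction), then take the step.
    res = minimize_scalar(lambda eta: energy(L + eta * direction))
    return L + res.x * direction
\end{verbatim}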

As initial guess in Fig. 14 the kernel estimate $L^{(0)}$ = $\ln (\tilde {\bf C} \tilde P_{\rm emp})$ has been employed, with $\tilde {\bf C}$ defined in Eq. (698) and ${\bf C}$ = $({\bf K}+m_C^2{\bf I})$ with squared mass $m_C^2$ = $0.1$. The fast drop-off of the energy $E_L$ within the first two iterations shows the quality of this initial guess. Indeed, this fast convergence seems to indicate that the problem is nearly linear, i.e., that the influence of the only nonlinear term in the stationarity equation, the normalization constraint, is not too strong. Notice also that the reconstructed regression function shows the typical piecewise linear approximation well known from one-dimensional (normalization-constraint-free) regression problems with a Laplacian prior.
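
Such a kernel-type initial guess can be sketched as below; this assumes that applying $\tilde {\bf C}$ amounts to smoothing the empirical density with ${\bf C}^{-1}$ = $({\bf K}+m_C^2{\bf I})^{-1}$ followed by a conditional normalization over $y$ (the precise definition of $\tilde {\bf C}$ is given in Eq. (698)).

\begin{verbatim}
import numpy as np

def kernel_initial_L(P_emp, K, m_C2, nx, ny, eps=1e-12):
    # L^(0) = ln(C~ P~_emp): smooth the empirical density,
    # normalize over y, take the logarithm.
    C_inv = np.linalg.inv(K + m_C2 * np.eye(K.shape[0]))
    P0 = np.clip(C_inv @ P_emp, eps, None).reshape(nx, ny)
    P0 /= P0.sum(axis=1, keepdims=True)      # conditional normalization over y
    return np.log(P0).ravel()
\end{verbatim}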

Fig. 15 shows a density estimation similar to Fig. 14, but for a Gaussian prior factor in $P$, and thus also with a different $\lambda _2$, a different initialization, and a slightly different iteration procedure. For Fig. 15 a kernel estimate $P^{(0)}$ = $(\tilde {\bf C} \tilde P_{\rm emp})$ has again been used as initial guess, with $\tilde {\bf C}$ as defined in Eq. (698) and ${\bf C}$ = $({\bf K}+m_C^2{\bf I})$, but with squared mass $m_C^2$ = $1.0$. The solution has been obtained by prior relaxation with ${\bf A}$ = ${\bf K}+m^2{\bf I}$, where the mass term with $m^2$ = 1.0 makes ${\bf A}$ invertible for a Laplacian ${\bf K}$ = $-\Delta$ with periodic boundary conditions. This iteration scheme does not require calculating the Hessian ${\bf H}_P$ at each iteration step. Again the quality of the initial guess (and of the iteration scheme) is indicated by the fast drop-off of the energy $E_P$ during the first iteration.
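
Since ${\bf A}$ = ${\bf K}+m^2{\bf I}$ is fixed during such a prior relaxation, its (here only $150\times 150$) inverse can be computed once and reused in every step, which is what makes the scheme cheaper than recomputing the Hessian. A sketch, reusing step_P from above and using, for simplicity, a fixed step width $\eta $ instead of a line search:

\begin{verbatim}
import numpy as np

def relax_P(P0, N, K, m2, eta, nx, ny, n_iter=50):
    # Massive prior relaxation: A = K + m^2 I is invertible for m^2 > 0.
    A_inv = np.linalg.inv(K + m2 * np.eye(K.shape[0]))
    P = P0.copy()
    for _ in range(n_iter):
        P = step_P(P, N, K, A_inv, eta, nx, ny)
    return P
\end{verbatim}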

Because the range of $P$-values, lying between zero and one, is smaller than that of $L$-values, lying between minus infinity and zero, a larger Laplacian smoothness factor $\lambda _2$ is needed in Fig. 15 to obtain results similar to those of Fig. 14. In particular, the $\lambda _2$ values of the two figures have been chosen such that the maximal values of the two reconstructed probability densities $P$ turn out to be nearly equal.

Because the logarithm particularly expands the distances between small probabilities, one would expect a Gaussian prior for $L$ to be especially effective for small probabilities. Comparing Fig. 14 and Fig. 15, this effect can indeed be seen. The deep valleys appearing in the $L$-landscape of Fig. 15 show that small values of $L$ are not smoothed out as effectively as in Fig. 14. Notice that, as a consequence, the variance of the solution $p(y\vert x,h)$ is also much smaller for a Gaussian prior in $P$ at those $x$ which are in the training set.

Fig. 16 summarizes results for a model similar to that presented in Fig. 14, but with a $(-\Delta ^3)$-prior replacing the Laplacian $(-\Delta)$-prior. As all quadratic functions have zero third derivative, such a prior, applied to $L$, favors quadratic log-likelihoods, corresponding to Gaussian probabilities $P$. Indeed, this is indicated by the striking difference between the regression functions in Fig. 16 and in Fig. 14: the $(-\Delta ^3)$-prior produces a much rounder regression function, especially at the $x$ values which appear in the data. Note, however, that in contrast to a pure Gaussian regression problem, in density estimation an additional non-quadratic normalization constraint is present.

In Fig. 17 a similar prior has been applied, but this time Gaussian in $P$ instead of $L$. In contrast to a $(-\Delta ^3)$-prior for $L$, a $(-\Delta ^3)$-prior for $P$ implements a tendency towards quadratic $P$. Similarly to the difference between Fig. 14 and Fig. 16, the regression function in Fig. 17 is also rounder than that in Fig. 15. Furthermore, smoothing in Fig. 17 is again less effective for smaller probabilities than it is in Fig. 16, the same result we have found when comparing the two Laplacian priors of Figs. 14 and 15. This leads to deeper valleys in the $L$-landscape and to a smaller variance, especially at those $x$ which appear in the training data.

Fig. 21 depicts the results of a density estimation based on more than two data points. In particular, fifty training data points have been obtained by sampling with uniform $p(x)$ from the ``true'' density

\begin{displaymath}
P_{\rm true}(x,y)
= p(y\vert x,h_{\rm true})
= \frac{1}{2\sqrt{2\pi \sigma_0^2}}
\left(
e^{-\frac{(y-h_a(x))^2}{2\sigma_0^2}}
+
e^{-\frac{(y-h_b(x))^2}{2\sigma_0^2}}
\right)
,
\end{displaymath} (714)

with $\sigma_0$ = $1.5$, $h_a(x)$ = $125/18+(5/9)x$, $h_b(x)$ = $145/18-(5/9)x$, shown in the top row of Fig. 18. The sampling process has been implemented using the transformation method (see for example [196]). The corresponding empirical density $N/n$ (235) and conditional empirical density $P_{\rm emp}$ of Eq. (236), in this case equal to the extended $\tilde P_{\rm emp}$ defined in Eq. (238), can be found in Fig. 20.
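
For completeness, sampling from the density (714) with uniform $p(x)$ can be sketched as follows; the paper uses the transformation method, while this illustration simply relies on numpy's Gaussian sampler and leaves the binning of the continuous $y$ values onto the mesh as a separate step.

\begin{verbatim}
import numpy as np

def sample_true(n=50, sigma0=1.5, seed=0):
    # Draw (x, y) pairs from Eq. (714): uniform x on the mesh points 1..10,
    # y from an equal mixture of Gaussians centred at h_a(x) and h_b(x).
    rng = np.random.default_rng(seed)
    x = rng.integers(1, 11, size=n)                 # uniform p(x)
    h_a = 125.0 / 18.0 + (5.0 / 9.0) * x
    h_b = 145.0 / 18.0 - (5.0 / 9.0) * x
    mean = np.where(rng.random(n) < 0.5, h_a, h_b)  # pick a mixture branch
    y = rng.normal(mean, sigma0)                    # continuous y
    return x, y
\end{verbatim}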

Fig. 21 shows the maximum posterior solution $p(y\vert x,h^*)$ and its logarithm, the energy $E_L$ during iteration, the regression function

\begin{displaymath}
h(x)
= \int \!dy \, y \, p(y\vert x,h_{\rm true})
= \int \!dy \, y \, P_{\rm true}(x,y)
,
\end{displaymath} (715)

(as a reference, the regression function of the true likelihood $p(y\vert x,h_{\rm true})$ is shown in Fig. 19), the average training error (or empirical (conditional) log-loss)
\begin{displaymath}
<-\ln p(y\vert x,h) >_D
= -\frac{1}{n}\sum_{i=1}^n\ln p(y_i\vert x_i,h)
,
\end{displaymath} (716)

and the average test error (or true expectation of (conditional) log-loss) for uniform $p(x)$
\begin{displaymath}
<-\ln p(y\vert x,h) >_{P_{\rm true}}
=
-\int \!dy\,dx\, p(x)p(y\vert x,h_{\rm true}) \ln p(y\vert x,h)
,
\end{displaymath} (717)

which is, up to a constant, equal to the expected Kullback-Leibler distance between the actual solution and the true likelihood,
\begin{displaymath}
{\rm KL}\Big(p(x,y\vert h_{\rm true}),p(y\vert x,h)\Big)
=
-\int \!dx\,dy\, p(x,y\vert h_{\rm true})
\ln \frac{p(y\vert x,h)}{p(y\vert x,h_{\rm true}) }
.
\end{displaymath} (718)

The test error measures the quality of the achieved solution. In contrast to the energy and the training error, it has of course not been available to the learning algorithm.
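
On the discrete mesh (unit grid spacing, so integrals become sums) these quantities can be evaluated as in the following sketch, where p and p_true are $(n_x,n_y)$ arrays of conditional probabilities normalized over the $y$-axis; the array layout is an assumption of this illustration.

\begin{verbatim}
import numpy as np

def regression_function(p, y_values):
    # h(x) = sum_y y p(y|x), Eq. (715) on the mesh.
    return p @ y_values

def training_error(p, x_idx, y_idx):
    # Average empirical log-loss, Eq. (716); data given as mesh indices.
    return -np.mean(np.log(p[x_idx, y_idx]))

def test_error(p, p_true):
    # Expected log-loss under p_true with uniform p(x), Eq. (717).
    nx = p.shape[0]
    return -np.sum(p_true * np.log(p)) / nx
\end{verbatim}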

The maximum posterior solution of Fig. 21 has been calculated by minimizing $E_L$ using massive prior iteration with ${\bf A}$ = ${\bf K}+m^2{\bf I}$, a squared mass $m^2$ = $0.01$, and a (conditionally) normalized, constant $L^{(0)}$ as initial guess. Convergence has been fast, and the resulting regression function is similar to the true one (see Fig. 19).

Fig. 22 compares some iteration procedures and initialization methods. Clearly, all methods do what they should do: they decrease the energy functional. Iterating with the negative Hessian yields the fastest convergence. Massive prior iteration is nearly as fast, even for a uniform initialization, and does not require calculation of the Hessian at each iteration step. Finally, the slowest iteration method, but the easiest to implement, is the gradient algorithm.

Looking at Fig. 22 one can distinguish data-oriented from prior-oriented initializations. We understand data-oriented initial guesses to be those for which the training error is smaller at the beginning of the iteration than for the final solution, and prior-oriented initial guesses to be those for which the opposite is true. For good initial guesses the difference is small. Clearly, the uniform initialization is prior-oriented, while an empirical log-density $\ln(N/n+\epsilon)$ and the shown kernel initializations are data-oriented.

The case where the test error grows while the energy is decreasing indicates a misspecified prior and is typical for overfitting. For example, in the fifth row of Fig. 22 the test error (and in this case also the average training error) grows again after having reached a minimum while the energy is steadily decreasing.

