
Linearization and Newton algorithm

For a linear equation ${{\bf K}} \phi = T$, where $T$ and ${{\bf K}}$ do not depend on $\phi$, a spectral radius $\rho ({\bf M})<1$ (the largest modulus of the eigenvalues) of the iteration matrix

\begin{displaymath}
{\bf M}
= - \eta {\bf A}^{-1} {\bf B}_\eta
= (1-\eta) {\bf I} - \eta {\bf A}^{-1} {\bf B}
= {\bf I} - \eta {\bf A}^{-1} {{\bf K}}
\end{displaymath} (631)

guarantees convergence of the iteration scheme. This is easily seen by solving the linear equation by iteration according to (615),
$\displaystyle \phi^{(i+1)}$ $\textstyle =$ $\displaystyle \eta {\bf A}^{-1}T + {\bf M}\phi^{(i)}$ (632)
  $\textstyle =$ $\displaystyle \eta {\bf A}^{-1}T + \eta {\bf M} {\bf A}^{-1}T + {\bf M}^2 \phi^{(i-1)}$ (633)
  $\textstyle =$ $\displaystyle \eta \sum_{n=0}^{i} {\bf M}^n {\bf A}^{-1} T + {\bf M}^{i+1} \phi^{(0)}
\;\stackrel{i\to\infty}{\longrightarrow}\;
\eta \sum_{n=0}^\infty {\bf M}^n {\bf A}^{-1} T
.$ (634)
For $\rho ({\bf M})<1$ the geometric series converges and its limit $\eta\, ({\bf I}-{\bf M})^{-1} {\bf A}^{-1} T = {\bf K}^{-1} T$ solves the original equation.
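To check the convergence criterion numerically, the following minimal Python sketch (not part of the original text; ${{\bf K}}$, ${\bf A}$, $T$, and $\eta$ are toy assumptions) builds the iteration matrix (631), verifies $\rho ({\bf M})<1$, iterates (632), and compares the result with ${\bf K}^{-1} T$:
\begin{verbatim}
import numpy as np

# Toy problem (assumption): K positive definite, A = identity (plain gradient algorithm)
n = 5
K = np.diag(np.arange(1.0, n + 1.0))            # invertible toy operator K
A = np.eye(n)                                    # learning matrix A
T = np.ones(n)                                   # inhomogeneity T
eta = 0.3                                        # learning rate

M = np.eye(n) - eta * np.linalg.solve(A, K)      # iteration matrix, Eq. (631)
rho = max(abs(np.linalg.eigvals(M)))             # spectral radius
print("spectral radius:", rho)                   # here < 1, so the scheme converges

phi = np.zeros(n)
for _ in range(200):                             # iteration scheme, Eq. (632)
    phi = eta * np.linalg.solve(A, T) + M @ phi

print(np.allclose(phi, np.linalg.solve(K, T)))   # True: the limit is K^{-1} T
\end{verbatim}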

A zero mode of ${{\bf K}}$, for example a constant function for differential operators without boundary conditions, corresponds to an eigenvalue $1$ of ${\bf M}$ and would lead to divergence of the sequence $\phi^{(i)}$. However, a nonlinear $T(\phi)$ or ${{\bf K}}(\phi)$, like the nonlinear normalization constraint contained in $T(\phi)$, can then still lead to a unique solution.
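To make the zero-mode statement explicit, let $\psi_0$ be such a mode, ${{\bf K}} \psi_0 = 0$ (for instance the constant function in the differential-operator example). Then
\begin{displaymath}
{\bf M} \psi_0
= \left( {\bf I} - \eta {\bf A}^{-1} {{\bf K}} \right) \psi_0
= \psi_0 ,
\end{displaymath}
so $\psi_0$ is an eigenvector of ${\bf M}$ with eigenvalue $1$, and the geometric series (634) does not converge along this direction.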

A convergence analysis for nonlinear equations can be done in a linear approximation around a fixed point $\phi^*$. Expanding the gradient at $\phi^*$

\begin{displaymath}
G(\phi)
=
\frac{\delta (-E_\phi) }{\delta \phi}\Bigg\vert _{\phi^*}
+ (\phi-\phi^*) \,\, {\bf H} (\phi^*)
+ \cdots
\end{displaymath} (635)

shows that the coefficient of the linear term is the Hessian. Thus in the vicinity of $\phi^*$ the spectral radius of the iteration matrix
\begin{displaymath}
{\bf M} = {\bf I} + \eta {\bf A}^{-1} {\bf H},
\end{displaymath} (636)

determines the rate of convergence. The Newton algorithm uses the negative Hessian $-{\bf H}$ as learning matrix, provided it exists and is positive definite; otherwise one must resort to other methods. In the linear approximation (i.e., for a quadratic energy) the Newton algorithm
$\displaystyle {\bf A} = -{\bf H}$ $\textstyle :$ $\displaystyle {\rm Newton}$ (637)

is optimal, as the short calculation below shows. We have already seen in Sections 3.1.3 and 3.2.3 that, in addition to ${{\bf K}}$, the inhomogeneities generate a diagonal contribution to the Hessian which can remove the zero modes of ${{\bf K}}$.
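Indeed, inserting ${\bf A} = -{\bf H}$ into (636) gives
\begin{displaymath}
{\bf M} = {\bf I} + \eta\, (-{\bf H})^{-1} {\bf H} = (1-\eta)\, {\bf I} ,
\end{displaymath}
with spectral radius $\vert 1-\eta \vert$, which vanishes for $\eta = 1$: for a quadratic energy the Newton step reaches the stationary point in a single iteration.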

