

Empirical risk minimization

In the previous sections, the error functionals to be minimized in the following were given a Bayesian interpretation in terms of the log-posterior density. There is, however, an alternative justification of error functionals, based on the Frequentist approach of empirical risk minimization [224,225,226].

Common to both approaches is the aim of minimizing the expected risk for an action $a$

\begin{displaymath}
r(a,f) = \int\!dx\,dy\,p(x,y\vert f(D,D_0)) \, l(x,y,a)
.
\end{displaymath} (104)

The expected risk, however, cannot be calculated without knowledge of the true $p(x,y\vert f)$. Whereas the Bayesian approach models $p(x,y\vert f)$, the Frequentist approach approximates the expected risk by the empirical risk
\begin{displaymath}
E(a) = \hat r(a,f) = \sum_i l(x_i,y_i,a)
,
\end{displaymath} (105)

i.e., by replacing the unknown true probability by an observable empirical probability. Here it is essential for obtaining asymptotic convergence results to assume that the training data are sampled according to the true $p(x,y\vert f)$ [224,52,194,128,226]. Notice that, in contrast, in a Bayesian approach the density $p(x_i)$ of the training data $D$ does not enter the formalism, according to Eq. (16), because $D$ enters as a conditional variable. For a detailed discussion of the relation between quadratic error functionals and Gaussian processes see, for example, [181,183,184,112,113,153,228,144].
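
To make the relation between Eqs. (104) and (105) concrete, the following is a minimal numerical sketch in Python. The true model, the constant predictor playing the role of the action $a$, and the quadratic loss $l$ are purely illustrative assumptions; a $1/n$ normalization is added to the empirical risk for comparability with the Monte Carlo estimate, which does not change its minimizer.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def sample_true_model(n):
    # hypothetical true model p(x,y|f): x ~ U(0,1), y = sin(2 pi x) + noise
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)
    return x, y

def loss(x, y, a):
    # quadratic loss l(x,y,a) for a constant predictor a
    return (y - a) ** 2

a = 0.0
x_train, y_train = sample_true_model(20)
# empirical risk, Eq. (105), normalized by the sample size
empirical_risk = np.mean(loss(x_train, y_train, a))

# Monte Carlo estimate of the expected risk, Eq. (104)
x_mc, y_mc = sample_true_model(100_000)
expected_risk = np.mean(loss(x_mc, y_mc, a))

print(f"empirical risk (n=20): {empirical_risk:.3f}")
print(f"expected risk (MC):    {expected_risk:.3f}")
\end{verbatim}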

From the Frequentist point of view one is not restricted to logarithmic data terms as they arise from the posterior-related Bayesian interpretation. However, as in the Bayesian approach, training data terms alone are not enough to make the minimization problem well defined. Indeed, this is a typical inverse problem [224,115,226] which, according to the classical regularization approach [220,221,165], can be treated by including additional regularization (stabilizer) terms in the loss function $l$. Those regularization terms, which correspond to the prior terms in a Bayesian approach, are thus, from the point of view of empirical risk minimization, a technical tool to make the minimization problem well defined.
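
As an illustration, the following sketch implements regularized empirical risk minimization for the special case of a quadratic data term with a quadratic (Tikhonov) stabilizer, i.e., ridge regression; the polynomial features, the data-generating model, and the value of the trade-off factor are illustrative assumptions. Minimizing $\sum_i (y_i - w\cdot\phi(x_i))^2 + \lambda \Vert w\Vert^2$ has the closed-form solution $w = (\Phi^T\Phi + \lambda I)^{-1}\Phi^T y$, which for $\lambda>0$ is well defined even when $\Phi^T\Phi$ is singular.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)

def features(x, degree=9):
    # deliberately over-parameterized polynomial features:
    # for lam -> 0 the normal equations become ill-conditioned
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(Phi, y, lam):
    # w = argmin |y - Phi w|^2 + lam |w|^2  (Tikhonov regularization)
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

x = rng.uniform(0.0, 1.0, 15)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=15)

# the stabilizer term (weighted by lam) makes the solution unique and stable
w = ridge_fit(features(x), y, lam=1e-3)
\end{verbatim}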

The empirical generalization error for a test or validation data set independent of the training data $D$, on the other hand, is measured using only the data terms of the error functional, without regularization terms. In empirical risk minimization this empirical generalization error is used, for example, to determine adaptive (hyper-)parameters of regularization terms. A typical example is a factor multiplying the regularization terms, which controls the trade-off between data and regularization terms. Common techniques that use the empirical generalization error to determine such parameters are cross-validation and bootstrap-like methods [166,6,230,216,217,81,39,228,54]. From a strict Bayesian point of view, such parameters would have to be integrated out after defining an appropriate prior [16,147,149,24].
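
Continuing the ridge-regression sketch above (it reuses the features, ridge_fit, x, and y defined there), the following sketch selects the trade-off factor $\lambda$ by $k$-fold cross-validation; as described in the text, the empirical generalization error on each held-out fold is computed from the data term alone, without the regularization term.

\begin{verbatim}
import numpy as np

def cv_error(x, y, lam, k=5, degree=9):
    # mean held-out squared error over k folds for stabilizer weight lam
    idx = np.arange(len(x))
    errs = []
    for test in np.array_split(idx, k):
        train = np.setdiff1d(idx, test)
        w = ridge_fit(features(x[train], degree), y[train], lam)
        resid = y[test] - features(x[test], degree) @ w
        errs.append(np.mean(resid ** 2))  # data term only, no lam |w|^2
    return float(np.mean(errs))

# pick the lambda with the smallest cross-validated data term
lams = [10.0 ** p for p in range(-8, 1)]
best_lam = min(lams, key=lambda l: cv_error(x, y, l))
print(f"lambda selected by 5-fold CV: {best_lam:g}")
\end{verbatim}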

