In statistical learning theory (a branch of computer science), a representer theorem is any of several related results stating that a minimizer $f^{*}$ of a regularized empirical risk functional defined over a reproducing kernel Hilbert space can be represented as a finite linear combination of kernel functions evaluated at the input points of the training set.

Formal statement

The following Representer Theorem and its proof are due to Schölkopf, Herbrich, and Smola: [1]

Theorem: Consider a positive-definite real-valued kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ on a non-empty set $\mathcal{X}$ with a corresponding reproducing kernel Hilbert space $H_k$. Let there be given

  • a training sample $(x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \mathbb{R}$,
  • a strictly increasing real-valued function $g : [0, \infty) \to \mathbb{R}$, and
  • an arbitrary error function $E : (\mathcal{X} \times \mathbb{R}^2)^n \to \mathbb{R} \cup \{\infty\}$,

which together define the following regularized empirical risk functional on $H_k$:

    $f \mapsto E\big((x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n))\big) + g(\lVert f \rVert). \qquad (*)$

Then, any minimizer of the empirical risk

    $f^{*} = \operatorname*{argmin}_{f \in H_k} \Big\{ E\big((x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n))\big) + g(\lVert f \rVert) \Big\}$

admits a representation of the form:

    $f^{*}(\cdot) = \sum_{i=1}^{n} \alpha_i \, k(\cdot, x_i),$

where $\alpha_i \in \mathbb{R}$ for all $1 \le i \le n$.
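
For instance (an illustration not given in the statement above), with the Gaussian kernel $k(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$ the theorem says that the minimizer is a weighted sum of bumps centered at the training inputs:

    $f^{*}(x) = \sum_{i=1}^{n} \alpha_i \exp\big(-\gamma \lVert x - x_i \rVert^2\big),$

so fitting $f^{*}$ amounts to choosing the $n$ weights $\alpha_1, \ldots, \alpha_n$.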

Proof: Define a mapping

    $\varphi : \mathcal{X} \to \mathbb{R}^{\mathcal{X}}, \qquad \varphi(x) = k(\cdot, x)$

(so that $\varphi(x) = k(\cdot, x)$ is itself a map $\mathcal{X} \to \mathbb{R}$). Since $k$ is a reproducing kernel, then

    $f(x) = \langle f, \varphi(x) \rangle \quad \text{for all } f \in H_k,$

where $\langle \cdot, \cdot \rangle$ is the inner product on $H_k$.

Given any $x_1, \ldots, x_n$, one can use orthogonal projection to decompose any $f \in H_k$ into a sum of two functions, one lying in $\operatorname{span}\{\varphi(x_1), \ldots, \varphi(x_n)\}$, and the other lying in the orthogonal complement:

    $f = \sum_{i=1}^{n} \alpha_i \varphi(x_i) + v,$

where $\langle v, \varphi(x_i) \rangle = 0$ for all $i$.

The above orthogonal decomposition and the reproducing property together show that applying $f$ to any training point $x_j$ produces

    $f(x_j) = \Big\langle \sum_{i=1}^{n} \alpha_i \varphi(x_i) + v, \, \varphi(x_j) \Big\rangle = \sum_{i=1}^{n} \alpha_i \, \langle \varphi(x_i), \varphi(x_j) \rangle,$

which we observe is independent of $v$. Consequently, the value of the error function $E$ in (*) is likewise independent of $v$. For the second term (the regularization term), since $v$ is orthogonal to $\sum_{i=1}^{n} \alpha_i \varphi(x_i)$ and $g$ is strictly monotonic, we have

    $g(\lVert f \rVert) = g\Big( \big\lVert \sum_{i=1}^{n} \alpha_i \varphi(x_i) + v \big\rVert \Big) = g\Big( \sqrt{ \big\lVert \sum_{i=1}^{n} \alpha_i \varphi(x_i) \big\rVert^2 + \lVert v \rVert^2 } \Big) \ge g\Big( \big\lVert \sum_{i=1}^{n} \alpha_i \varphi(x_i) \big\rVert \Big).$

Therefore setting $v = 0$ does not affect the first term of (*), while it strictly decreases the second term whenever $v \neq 0$. Consequently, any minimizer $f^{*}$ in (*) must have $v = 0$, i.e., it must be of the form

    $f^{*}(\cdot) = \sum_{i=1}^{n} \alpha_i \varphi(x_i) = \sum_{i=1}^{n} \alpha_i \, k(\cdot, x_i),$

which is the desired result.
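
As a small worked instance of the argument (not part of the original proof), take $n = 1$ and write $f = \alpha_1 \varphi(x_1) + v$ with $\langle v, \varphi(x_1) \rangle = 0$. Then

    $f(x_1) = \alpha_1 \, k(x_1, x_1), \qquad \lVert f \rVert^2 = \alpha_1^2 \, k(x_1, x_1) + \lVert v \rVert^2,$

so the data-fit term depends only on $\alpha_1$, while the regularization term only grows with $\lVert v \rVert$ and is therefore minimized by taking $v = 0$.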

Generalizations

The Theorem stated above is a particular example of a family of results that are collectively referred to as "representer theorems"; here we describe several such.

The first statement of a representer theorem was due to Kimeldorf and Wahba for the special case in which

    $E\big((x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n))\big) = \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - y_i)^2, \qquad g(\lVert f \rVert) = \lambda \lVert f \rVert^2$

for $\lambda > 0$. Schölkopf, Herbrich, and Smola generalized this result by relaxing the assumption of the squared-loss cost and allowing the regularizer to be any strictly monotonically increasing function $g(\cdot)$ of the Hilbert space norm.

It is possible to generalize further by augmenting the regularized empirical risk functional through the addition of unpenalized offset terms. For example, Schölkopf, Herbrich, and Smola also consider the minimization

    $\tilde{f}^{*} = \operatorname*{argmin} \Big\{ E\big((x_1, y_1, \tilde{f}(x_1)), \ldots, (x_n, y_n, \tilde{f}(x_n))\big) + g(\lVert f \rVert) \;\Big|\; \tilde{f} = f + h \in H_k \oplus \operatorname{span}\{\psi_p \mid 1 \le p \le M\} \Big\}, \qquad (\dagger)$

i.e., we consider functions of the form $\tilde{f} = f + h$, where $f \in H_k$ and $h$ is an unpenalized function lying in the span of a finite set of real-valued functions $\{\psi_p : \mathcal{X} \to \mathbb{R} \mid 1 \le p \le M\}$. Under the assumption that the $n \times M$ matrix $\big(\psi_p(x_i)\big)_{ip}$ has rank $M$, they show that the minimizer $\tilde{f}^{*}$ in $(\dagger)$ admits a representation of the form

    $\tilde{f}^{*}(\cdot) = \sum_{i=1}^{n} \alpha_i \, k(\cdot, x_i) + \sum_{p=1}^{M} \beta_p \psi_p(\cdot)$

where $\alpha_i, \beta_p \in \mathbb{R}$ and the $\beta_p$ are all uniquely determined.
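
To make the role of the offset terms concrete, the following is a minimal Python sketch (not from the source) of how the coefficients $\alpha_i$ and $\beta_p$ could be computed in the special case of a squared-error loss with $g(\lVert f \rVert) = \lambda \lVert f \rVert^2$; the function name semiparametric_krr and the choice of loss are assumptions made for illustration.

    import numpy as np

    def semiparametric_krr(K, Psi, y, lam):
        """Solve min over (alpha, beta) of
           (1/n) * ||y - K @ alpha - Psi @ beta||^2 + lam * alpha^T K alpha,
        assuming K (n x n) is positive definite and Psi (n x M) has rank M."""
        n, M = Psi.shape
        # Stationarity conditions of the (convex) objective:
        #   (K + n*lam*I) alpha + Psi beta      = y
        #   Psi^T K alpha       + Psi^T Psi beta = Psi^T y
        A = np.block([[K + n * lam * np.eye(n), Psi],
                      [Psi.T @ K, Psi.T @ Psi]])
        b = np.concatenate([y, Psi.T @ y])
        sol = np.linalg.solve(A, b)
        return sol[:n], sol[n:]  # alpha (length n), beta (length M)

The common special case $M = 1$ with $\psi_1 \equiv 1$ recovers an unpenalized intercept (bias) term, as used for instance in support vector machines.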

The conditions under which a representer theorem exists were investigated by Argyriou, Micchelli, and Pontil, who proved the following:

Theorem: Let $\mathcal{X}$ be a nonempty set, $k$ a positive-definite real-valued kernel on $\mathcal{X} \times \mathcal{X}$ with corresponding reproducing kernel Hilbert space $H_k$, and let $R : H_k \to \mathbb{R}$ be a differentiable regularization function. Then given a training sample $(x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \mathbb{R}$ and an arbitrary error function $E : (\mathcal{X} \times \mathbb{R}^2)^n \to \mathbb{R} \cup \{\infty\}$, a minimizer

    $f^{*} = \operatorname*{argmin}_{f \in H_k} \Big\{ E\big((x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n))\big) + R(f) \Big\} \qquad (\ddagger)$

of the regularized empirical risk admits a representation of the form

    $f^{*}(\cdot) = \sum_{i=1}^{n} \alpha_i \, k(\cdot, x_i),$

where $\alpha_i \in \mathbb{R}$ for all $1 \le i \le n$, if and only if there exists a nondecreasing function $h : [0, \infty) \to \mathbb{R}$ for which

    $R(f) = h(\lVert f \rVert).$

Effectively, this result provides a necessary and sufficient condition on a differentiable regularizer $R(\cdot)$ under which the corresponding regularized empirical risk minimization $(\ddagger)$ will have a representer theorem. In particular, this shows that a broad class of regularized risk minimizations (much broader than those originally considered by Kimeldorf and Wahba) have representer theorems.
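
For example (an illustration not taken from the source), the standard ridge regularizer satisfies this condition with

    $R(f) = \lambda \lVert f \rVert^2 = h(\lVert f \rVert), \qquad h(t) = \lambda t^2,$

since $h$ is nondecreasing on $[0, \infty)$ and $R$ is differentiable on $H_k$.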

Applications

Representer theorems are useful from a practical standpoint because they dramatically simplify the regularized empirical risk minimization problem $(\ddagger)$. In most interesting applications, the search domain $H_k$ for the minimization will be an infinite-dimensional subspace of $L^2(\mathcal{X})$, and therefore the search (as written) does not admit implementation on finite-memory and finite-precision computers. In contrast, the representation of $f^{*}(\cdot)$ afforded by a representer theorem reduces the original (infinite-dimensional) minimization problem to a search for the optimal $n$-dimensional vector of coefficients $\alpha = (\alpha_1, \ldots, \alpha_n) \in \mathbb{R}^n$; $\alpha$ can then be obtained by applying any standard function minimization algorithm. Consequently, representer theorems provide the theoretical basis for the reduction of the general machine learning problem to algorithms that can actually be implemented on computers in practice.

The following provides an example of how to solve for the minimizer whose existence is guaranteed by the representer theorem. This method works for any positive-definite kernel $k$, and allows us to transform a complicated (possibly infinite-dimensional) optimization problem into a simple linear system that can be solved numerically.

Assume that we are using a least squares error function

    $E\big((x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n))\big) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2$

and a regularization function $g(\lVert f \rVert) = \lambda \lVert f \rVert^2$ for some $\lambda > 0$. By the representer theorem, the minimizer

    $f^{*} = \operatorname*{argmin}_{f \in H_k} \Big\{ \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \lVert f \rVert^2 \Big\}$

has the form

    $f^{*}(x) = \sum_{i=1}^{n} \alpha_i \, k(x, x_i)$

for some $\alpha = (\alpha_1, \ldots, \alpha_n)^{\top} \in \mathbb{R}^n$. Noting that

    $\lVert f^{*} \rVert^2 = \Big\langle \sum_{i=1}^{n} \alpha_i \, k(\cdot, x_i), \sum_{j=1}^{n} \alpha_j \, k(\cdot, x_j) \Big\rangle = \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j \, k(x_i, x_j) = \alpha^{\top} K \alpha,$

we see that $\alpha$ has the form

    $\alpha = \operatorname*{argmin}_{\alpha \in \mathbb{R}^n} \Big\{ \frac{1}{n} \lVert y - K\alpha \rVert^2 + \lambda \, \alpha^{\top} K \alpha \Big\}$


where $K$ is the $n \times n$ kernel matrix with $K_{ij} = k(x_i, x_j)$ and $y = (y_1, \ldots, y_n)^{\top}$. Expanding the squared norm and discarding the constant term $\frac{1}{n}\lVert y \rVert^2$ (which does not depend on $\alpha$), this can be simplified to

    $\alpha = \operatorname*{argmin}_{\alpha \in \mathbb{R}^n} \Big\{ \frac{1}{n} \alpha^{\top} K^{\top} K \alpha - \frac{2}{n} \alpha^{\top} K^{\top} y + \lambda \, \alpha^{\top} K \alpha \Big\}$


Since $K$ is positive definite, there is indeed a single global minimum for this expression. Let $F(\alpha) = \frac{1}{n} \alpha^{\top} K^{\top} K \alpha - \frac{2}{n} \alpha^{\top} K^{\top} y + \lambda \, \alpha^{\top} K \alpha$ and note that $F$ is convex. Then $\alpha^{*}$, the global minimum, can be found by setting $\nabla_{\alpha} F(\alpha^{*}) = 0$. Recalling that all positive definite matrices are invertible, we see that

    $\nabla_{\alpha} F(\alpha^{*}) = \frac{2}{n} K \big( (K + \lambda n I)\alpha^{*} - y \big) = 0 \quad \Longrightarrow \quad \alpha^{*} = (K + \lambda n I)^{-1} y,$

so the minimizer may be found via a linear solve.
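
The following is a minimal sketch of this computation in Python, assuming a Gaussian (RBF) kernel for concreteness; the kernel choice, the helper names, and the synthetic data are illustrative assumptions rather than part of the result above.

    import numpy as np

    def rbf_kernel(X1, X2, gamma=1.0):
        """Gaussian kernel matrix with entries k(x, x') = exp(-gamma * ||x - x'||^2)."""
        sq_dists = (np.sum(X1**2, axis=1)[:, None]
                    + np.sum(X2**2, axis=1)[None, :]
                    - 2.0 * X1 @ X2.T)
        return np.exp(-gamma * sq_dists)

    def fit(X, y, lam, gamma):
        """Coefficients of the representer-theorem solution: (K + lam*n*I) alpha = y."""
        n = X.shape[0]
        K = rbf_kernel(X, X, gamma)
        return np.linalg.solve(K + lam * n * np.eye(n), y)

    def predict(X_train, alpha, X_new, gamma):
        """Evaluate f*(x) = sum_i alpha_i k(x, x_i) at the new points."""
        return rbf_kernel(X_new, X_train, gamma) @ alpha

    # Tiny usage example on synthetic data.
    rng = np.random.default_rng(0)
    X = rng.uniform(-3.0, 3.0, size=(50, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
    alpha = fit(X, y, lam=0.01, gamma=0.5)
    y_hat = predict(X, alpha, X, gamma=0.5)

The only step whose cost grows quickly with the sample size is the $n \times n$ linear solve, which is exactly the reduction promised by the representer theorem.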


References

  1. Schölkopf, Bernhard; Herbrich, Ralf; Smola, Alex J. (2001). "A Generalized Representer Theorem". In Helmbold, David; Williamson, Bob (eds.). Computational Learning Theory. Lecture Notes in Computer Science. Vol. 2111. Berlin, Heidelberg: Springer. pp. 416–426. doi:10.1007/3-540-44581-1_27. ISBN 978-3-540-44581-4.