In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once. Online learning is a common technique used in areas of machine learning where it is computationally infeasible to train over the entire dataset, requiring the need of out-of-core algorithms. It is also used in situations where it is necessary for the algorithm to dynamically adapt to new patterns in the data, or when the data itself is generated as a function of time, e.g., stock price prediction. Online learning algorithms may be prone to catastrophic interference, a problem that can be addressed by incremental learning approaches.
Introduction edit
In the setting of supervised learning, a function of is to be learned, where is thought of as a space of inputs and as a space of outputs, that predicts well on instances that are drawn from a joint probability distribution on . In reality, the learner never knows the true distribution over instances. Instead, the learner usually has access to a training set of examples . In this setting, the loss function is given as , such that measures the difference between the predicted value and the true value . The ideal goal is to select a function , where is a space of functions called a hypothesis space, so that some notion of total loss is minimized. Depending on the type of model (statistical or adversarial), one can devise different notions of loss, which lead to different learning algorithms.
Statistical view of online learning edit
In statistical learning models, the training sample are assumed to have been drawn from the true distribution and the objective is to minimize the expected "risk"
A common strategy to overcome the above issues is to learn using mini-batches, which process a small batch of data points at a time, this can be considered as pseudo-online learning for much smaller than the total number of training points. Mini-batch techniques are used with repeated passing over the training data to obtain optimized out-of-core versions of machine learning algorithms, for example, stochastic gradient descent. When combined with backpropagation, this is currently the de facto training method for training artificial neural networks.
Example: linear least squares edit
The simple example of linear least squares is used to explain a variety of ideas in online learning. The ideas are general enough to be applied to other settings, for example, with other convex loss functions.
Batch learning edit
Consider the setting of supervised learning with being a linear function to be learned:
Let be the data matrix and is the column vector of target values after the arrival of the first data points. Assuming that the covariance matrix is invertible (otherwise it is preferential to proceed in a similar fashion with Tikhonov regularization), the best solution to the linear least squares problem is given by
Now, calculating the covariance matrix takes time , inverting the matrix takes time , while the rest of the multiplication takes time , giving a total time of . When there are total points in the dataset, to recompute the solution after the arrival of every datapoint , the naive approach will have a total complexity . Note that when storing the matrix , then updating it at each step needs only adding , which takes time, reducing the total time to , but with an additional storage space of to store .[1]
Online learning: recursive least squares edit
The recursive least squares (RLS) algorithm considers an online approach to the least squares problem. It can be shown that by initialising and , the solution of the linear least squares problem given in the previous section can be computed by the following iteration:
The complexity for steps of this algorithm is , which is an order of magnitude faster than the corresponding batch learning complexity. The storage requirements at every step here are to store the matrix , which is constant at . For the case when is not invertible, consider the regularised version of the problem loss function . Then, it's easy to show that the same algorithm works with , and the iterations proceed to give .[1]
Stochastic gradient descent edit
When this
However, the stepsize needs to be chosen carefully to solve the expected risk minimization problem, as detailed above. By choosing a decaying step size one can prove the convergence of the average iterate . This setting is a special case of stochastic optimization, a well known problem in optimization.[1]
Incremental stochastic gradient descent edit
In practice, one can perform multiple stochastic gradient passes (also called cycles or epochs) over the data. The algorithm thus obtained is called incremental gradient method and corresponds to an iteration
Kernel methods edit
Kernels can be used to extend the above algorithms to non-parametric models (or models where the parameters form an infinite dimensional space). The corresponding procedure will no longer be truly online and instead involve storing all the data points, but is still faster than the brute force method. This discussion is restricted to the case of the square loss, though it can be extended to any convex loss. It can be shown by an easy induction [1] that if is the data matrix and is the output after steps of the SGD algorithm, then,
Now, if a general kernel is introduced instead and let the predictor be
Online convex optimization edit
Online convex optimization (OCO) [4] is a general framework for decision making which leverages convex optimization to allow for efficient algorithms. The framework is that of repeated game playing as follows:
For
- Learner receives input
- Learner outputs from a fixed convex set
- Nature sends back a convex loss function .
- Learner suffers loss and updates its model
The goal is to minimize regret, or the difference between cumulative loss and the loss of the best fixed point in hindsight. As an example, consider the case of online least squares linear regression. Here, the weight vectors come from the convex set , and nature sends back the convex loss function . Note here that is implicitly sent with .
Some online prediction problems however cannot fit in the framework of OCO. For example, in online classification, the prediction domain and the loss functions are not convex. In such scenarios, two simple techniques for convexification are used: randomisation and surrogate loss functions.[citation needed]
Some simple online convex optimisation algorithms are:
Follow the leader (FTL) edit
The simplest learning rule to try is to select (at the current step) the hypothesis that has the least loss over all past rounds. This algorithm is called Follow the leader, and round is simply given by:
Follow the regularised leader (FTRL) edit
This is a natural modification of FTL that is used to stabilise the FTL solutions and obtain better regret bounds. A regularisation function is chosen and learning performed in round t as follows:
If S is instead some convex subspace of , S would need to be projected onto, leading to the modified update rule
Online subgradient descent (OSD) edit
The above proved a regret bound for linear loss functions . To generalise the algorithm to any convex loss function, the subgradient of is used as a linear approximation to near , leading to the online subgradient descent algorithm:
Initialise parameter
For
- Predict using , receive from nature.
- Choose
- If , update as
- If , project cumulative gradients onto i.e.
One can use the OSD algorithm to derive regret bounds for the online version of SVM's for classification, which use the hinge loss
Other algorithms edit
Quadratically regularised FTRL algorithms lead to lazily projected gradient algorithms as described above. To use the above for arbitrary convex functions and regularisers, one uses online mirror descent. The optimal regularization in hindsight can be derived for linear loss functions, this leads to the AdaGrad algorithm. For the Euclidean regularisation, one can show a regret bound of , which can be improved further to a for strongly convex and exp-concave loss functions.
Continual learning edit
Continual learning means constantly improving the learned model by processing continuous streams of information.[5] Continual learning capabilities are essential for software systems and autonomous agents interacting in an ever changing real world. However, continual learning is a challenge for machine learning and neural network models since the continual acquisition of incrementally available information from non-stationary data distributions generally leads to catastrophic forgetting.
Interpretations of online learning edit
The paradigm of online learning has different interpretations depending on the choice of the learning model, each of which has distinct implications about the predictive quality of the sequence of functions . The prototypical stochastic gradient descent algorithm is used for this discussion. As noted above, its recursion is given by
The first interpretation consider the stochastic gradient descent method as applied to the problem of minimizing the expected risk defined above.[6] Indeed, in the case of an infinite stream of data, since the examples are assumed to be drawn i.i.d. from the distribution , the sequence of gradients of in the above iteration are an i.i.d. sample of stochastic estimates of the gradient of the expected risk and therefore one can apply complexity results for the stochastic gradient descent method to bound the deviation , where is the minimizer of .[7] This interpretation is also valid in the case of a finite training set; although with multiple passes through the data the gradients are no longer independent, still complexity results can be obtained in special cases.
The second interpretation applies to the case of a finite training set and considers the SGD algorithm as an instance of incremental gradient descent method.[3] In this case, one instead looks at the empirical risk:
Implementations edit
- Vowpal Wabbit: Open-source fast out-of-core online learning system which is notable for supporting a number of machine learning reductions, importance weighting and a selection of different loss functions and optimisation algorithms. It uses the hashing trick for bounding the size of the set of features independent of the amount of training data.
- scikit-learn: Provides out-of-core implementations of algorithms for
- Classification: Perceptron, SGD classifier, Naive bayes classifier.
- Regression: SGD Regressor, Passive Aggressive regressor.
- Clustering: Mini-batch k-means.
- Feature extraction: Mini-batch dictionary learning, Incremental PCA.
See also edit
Learning paradigms
- Incremental learning
- Lazy learning
- Offline learning, the opposite model
- Reinforcement learning
- Multi-armed bandit
- Supervised learning
General algorithms
Learning models
References edit
- ^ a b c d e f g L. Rosasco, T. Poggio, Machine Learning: a Regularization Approach, MIT-9.520 Lectures Notes, Manuscript, Dec. 2015. Chapter 7 - Online Learning
- ^ Kushner, Harold J.; Yin, G. George (2003). Stochastic Approximation and Recursive Algorithms with Applications (Second ed.). New York: Springer. pp. 8–12. ISBN 978-0-387-21769-7.
- ^ a b Bertsekas, D. P. (2011). Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. Optimization for Machine Learning, 85.
- ^ Hazan, Elad (2015). Introduction to Online Convex Optimization (PDF). Foundations and Trends in Optimization.
- ^ Parisi, German I.; Kemker, Ronald; Part, Jose L.; Kanan, Christopher; Wermter, Stefan (2019). "Continual lifelong learning with neural networks: A review". Neural Networks. 113: 54–71. arXiv:1802.07569. doi:10.1016/j.neunet.2019.01.012. ISSN 0893-6080.
- ^ Bottou, Léon (1998). "Online Algorithms and Stochastic Approximations". Online Learning and Neural Networks. Cambridge University Press. ISBN 978-0-521-65263-6.
- ^ Stochastic Approximation Algorithms and Applications, Harold J. Kushner and G. George Yin, New York: Springer-Verlag, 1997. ISBN 0-387-94916-X; 2nd ed., titled Stochastic Approximation and Recursive Algorithms and Applications, 2003, ISBN 0-387-00894-2.