The same kind of machine learning model can require different constraints, weights or learning rates to generalize to different data patterns. These settings are called hyperparameters, and they have to be tuned so that the model can optimally solve the machine learning problem. Hyperparameter optimization finds a tuple of hyperparameters that yields an optimal model, one that minimizes a predefined loss function on given independent data. The objective function takes a tuple of hyperparameters and returns the associated loss. Cross-validation is often used to estimate this generalization performance.
The traditional way of performing hyperparameter optimization has been grid search, or a parameter sweep, which is an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some performance metric, typically measured by cross-validation on the training set or by evaluation on a held-out validation set.
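As a minimal sketch of such an objective function, the code below estimates the loss of one hyperparameter value by k-fold cross-validation. The model (a one-weight ridge regression with regularization strength `lam`), the synthetic data, and the helper names `fit_ridge_1d` and `cv_objective` are assumptions chosen for illustration, not part of any particular library.

```python
import random

def fit_ridge_1d(xs, ys, lam):
    """Closed-form minimizer of sum((w*x - y)^2) + lam * w^2 for a scalar weight w."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + lam)

def cv_objective(lam, xs, ys, k=5):
    """Objective function: hyperparameter in, cross-validated mean squared error out."""
    n = len(xs)
    fold_losses = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th sample is held out
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        w = fit_ridge_1d(train_x, train_y, lam)
        errors = [(w * xs[i] - ys[i]) ** 2 for i in test_idx]
        fold_losses.append(sum(errors) / len(errors))
    return sum(fold_losses) / k

# Synthetic data (illustrative assumption): y = 3x plus a little Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(50)]
ys = [3.0 * x + rng.gauss(0.0, 0.1) for x in xs]

loss_small = cv_objective(0.01, xs, ys)   # light regularization
loss_large = cv_objective(100.0, xs, ys)  # heavy regularization shrinks w toward 0
```

Any of the search strategies described in this article can be seen as a way of deciding which values to pass to a function like `cv_objective`.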
Since the parameter space of a machine learner may include real-valued or unbounded value spaces for certain parameters, manually set bounds and discretization may be necessary before applying grid search.
For example, a typical soft-margin SVM classifier equipped with an RBF kernel has at least two hyperparameters that need to be tuned for good performance on unseen data: a regularization constant C and a kernel hyperparameter γ. Both parameters are continuous, so to perform grid search, one selects a finite set of "reasonable" values for each.
Grid search then trains an SVM with each pair (C, γ) in the Cartesian product of these two sets and evaluates their performance on a held-out validation set (or by internal cross-validation on the training set, in which case multiple SVMs are trained per pair). Finally, the grid search algorithm outputs the settings that achieved the highest score in the validation procedure.
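The procedure can be sketched in a few lines. In the sketch below, `validation_loss` is a hypothetical stand-in for "train an SVM with this (C, γ) pair and return its validation error", and the candidate values are illustrative assumptions. The search minimizes a loss, which is equivalent to picking the highest validation score.

```python
import itertools
import math

def validation_loss(C, gamma):
    # Hypothetical stand-in for training an SVM and measuring validation error.
    # It is smallest at C = 100, gamma = 0.1 by construction.
    return (math.log10(C) - 2.0) ** 2 + (math.log10(gamma) + 1.0) ** 2

# Illustrative finite sets of candidate values (assumed, not prescribed).
C_values = [10, 100, 1000]
gamma_values = [0.1, 0.2, 0.5, 1.0]

def grid_search(objective, *value_lists):
    """Evaluate the objective on every tuple in the Cartesian product."""
    best_params, best_loss = None, float("inf")
    for params in itertools.product(*value_lists):
        loss = objective(*params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

best_params, best_loss = grid_search(validation_loss, C_values, gamma_values)
```

The number of evaluations is the product of the set sizes (12 here), which is why the cost grows quickly as more hyperparameters are added.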
Since grid search is exhaustive and therefore potentially expensive, several alternatives have been proposed. In particular, a randomized search that simply samples parameter settings a fixed number of times has been found to be more effective in high-dimensional spaces than exhaustive search. The reason is that some hyperparameters often turn out not to affect the loss significantly. A grid spends many evaluations varying those unimportant hyperparameters while revisiting the same few values of the important ones, whereas randomly dispersed samples try a different value of every hyperparameter on each trial and so cover the important dimensions more finely.
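A minimal sketch of random search follows, under the assumption that both hyperparameters are sampled log-uniformly from hand-picked ranges. The function `validation_loss` is a hypothetical stand-in in which only C matters, so that the effect of an irrelevant hyperparameter is visible: with the same budget of 16 evaluations, a 4 x 4 grid tries 4 distinct values of C, while random search tries 16.

```python
import math
import random

def validation_loss(C, gamma):
    # Hypothetical stand-in: only C affects the loss; gamma is irrelevant here.
    return (math.log10(C) - 2.3) ** 2

def random_search(objective, n_trials, rng):
    """Sample each hyperparameter independently and keep every trial."""
    trials = []
    for _ in range(n_trials):
        C = 10 ** rng.uniform(0.0, 4.0)       # log-uniform on [1, 10^4] (assumed range)
        gamma = 10 ** rng.uniform(-3.0, 1.0)  # log-uniform on [10^-3, 10] (assumed range)
        trials.append(((C, gamma), objective(C, gamma)))
    return trials

rng = random.Random(0)
trials = random_search(validation_loss, 16, rng)
best_params, best_loss = min(trials, key=lambda t: t[1])

# A grid with the same budget of 16 evaluations, for comparison.
grid = [(C, g) for C in (1, 10, 100, 1000) for g in (0.001, 0.01, 0.1, 1.0)]
distinct_C_random = len({params[0] for params, _ in trials})
distinct_C_grid = len({C for C, _ in grid})
```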
Bayesian optimization is a methodology for the global optimization of noisy black-box functions. Applied to hyperparameter optimization, it builds a statistical model of the function that maps hyperparameter values to the objective evaluated on a validation set. Intuitively, the methodology assumes that some smooth but noisy function maps hyperparameters to the objective. The aim is to evaluate the machine learning model as few times as possible while revealing as much information as possible about this function and, in particular, about the location of its optimum. Bayesian optimization places a very general prior over functions which, when combined with the observed hyperparameter values and their corresponding outputs, yields a distribution over functions.
The methodology proceeds by iteratively picking hyperparameters to observe (experiments to run) in a manner that trades off exploration (hyperparameters for which the outcome is most uncertain) against exploitation (hyperparameters that are expected to have a good outcome). In practice, Bayesian optimization has been shown to obtain better results in fewer experiments than grid search and random search, because it can reason about the quality of experiments before they are run.
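The loop can be illustrated with a deliberately small sketch. It assumes one common set of choices that the text above does not prescribe: a zero-mean Gaussian process with an RBF kernel as the prior over functions, and expected improvement as the rule that balances exploration and exploitation. The one-dimensional `objective` is a hypothetical stand-in for an expensive validation loss, and all helper names are illustrative.

```python
import math

def rbf(a, b, length_scale=1.0):
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def gp_posterior(x_obs, y_obs, x_new, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(x_obs)]
         for i, a in enumerate(x_obs)]
    k_star = [rbf(a, x_new) for a in x_obs]
    alpha = solve(K, y_obs)   # K^-1 y
    v = solve(K, k_star)      # K^-1 k*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_new, x_new) - sum(k * vi for k, vi in zip(k_star, v))
    return mean, math.sqrt(max(var, 1e-12))

def expected_improvement(mean, std, best):
    """Expected improvement over the best loss seen so far (minimization)."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mean) * cdf + std * pdf

def objective(x):
    # Hypothetical stand-in for an expensive validation loss of one hyperparameter.
    return (x - 2.0) ** 2

candidates = [i * 0.05 for i in range(101)]  # candidate values on [0, 5]
x_obs = [0.0, 2.5, 5.0]                      # initial experiments
y_obs = [objective(x) for x in x_obs]

for _ in range(10):
    best = min(y_obs)
    scores = []
    for x in candidates:
        mean, std = gp_posterior(x_obs, y_obs, x)
        scores.append(expected_improvement(mean, std, best))
    x_next = candidates[scores.index(max(scores))]  # most promising experiment
    x_obs.append(x_next)
    y_obs.append(objective(x_next))

best_loss = min(y_obs)
best_x = x_obs[y_obs.index(best_loss)]
```

The sketch recomputes the Gaussian process from scratch for every candidate, which is wasteful but keeps the code short; dedicated libraries factorize the kernel matrix once per iteration and also fit the kernel parameters.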
For specific learning algorithms, it is possible to compute the gradient with respect to hyperparameters and then optimize the hyperparameters using gradient descent. The first usage of these techniques was focused on neural networks. Since then, these methods have been extended to other models such as support vector machines or logistic regression.
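As a small worked case, assume a ridge regression with a single weight w and regularization strength λ. The trained weight then has the closed form w(λ) = Sxy / (Sxx + λ), so the gradient of the validation loss with respect to λ follows from the chain rule. The sketch below, with illustrative data and helper names, runs gradient descent directly on λ.

```python
# Illustrative training and validation data (assumed); the validation set is
# fit exactly by w = 2.
train_x, train_y = [1.0, 2.0, 3.0], [2.2, 3.9, 6.3]
val_x, val_y = [1.5, 2.5], [3.0, 5.0]

sxx = sum(x * x for x in train_x)
sxy = sum(x * y for x, y in zip(train_x, train_y))

def trained_weight(lam):
    """Minimizer of sum((w*x - y)^2) + lam * w^2 on the training set."""
    return sxy / (sxx + lam)

def validation_loss(lam):
    w = trained_weight(lam)
    return sum((w * x - y) ** 2 for x, y in zip(val_x, val_y))

def hypergradient(lam):
    """d(validation loss)/d(lam) = dL/dw * dw/dlam, by the chain rule."""
    w = trained_weight(lam)
    dloss_dw = sum(2.0 * (w * x - y) * x for x, y in zip(val_x, val_y))
    dw_dlam = -sxy / (sxx + lam) ** 2
    return dloss_dw * dw_dlam

lam = 5.0            # initial guess for the hyperparameter
initial_loss = validation_loss(lam)
step_size = 2.0      # hand-picked for this toy problem
for _ in range(50):
    lam = max(0.0, lam - step_size * hypergradient(lam))  # keep lam non-negative

final_loss = validation_loss(lam)
```

For models without a closed-form solution, the same derivative is typically obtained by implicit differentiation of the training optimality conditions.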
A different approach to obtaining a gradient with respect to hyperparameters is to differentiate through the steps of an iterative optimization algorithm using automatic differentiation.
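The idea can be sketched by hand for a toy problem: the derivative of the weight with respect to the hyperparameter is propagated alongside the weight through every gradient-descent step, which is what forward-mode automatic differentiation would do mechanically. The model (a one-weight ridge regression with regularization strength `lam`), the data, and the function names are assumptions for illustration.

```python
# Illustrative training and validation data (assumed).
train_x, train_y = [1.0, 2.0, 3.0], [2.2, 3.9, 6.3]
val_x, val_y = [1.5, 2.5], [3.0, 5.0]

sxx = sum(x * x for x in train_x)
sxy = sum(x * y for x, y in zip(train_x, train_y))

def train_unrolled(lam, n_steps=100, lr=0.02):
    """Run gradient descent on the training loss and track dw/dlam at each step."""
    w, dw_dlam = 0.0, 0.0
    for _ in range(n_steps):
        grad_w = 2.0 * (sxx + lam) * w - 2.0 * sxy  # d(train loss)/dw
        # Derivative of the update w - lr * grad_w with respect to lam,
        # computed from the values before the update.
        dw_dlam = dw_dlam - lr * (2.0 * (sxx + lam) * dw_dlam + 2.0 * w)
        w = w - lr * grad_w
    return w, dw_dlam

def validation_loss(lam):
    w, _ = train_unrolled(lam)
    return sum((w * x - y) ** 2 for x, y in zip(val_x, val_y))

def hypergradient(lam):
    """Gradient of the validation loss with respect to lam through the unrolled steps."""
    w, dw_dlam = train_unrolled(lam)
    dloss_dw = sum(2.0 * (w * x - y) * x for x, y in zip(val_x, val_y))
    return dloss_dw * dw_dlam

g = hypergradient(1.0)
```

Unlike a closed-form derivative, this gradient describes the training procedure as actually run, including its finite number of steps and its learning rate.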
Evolutionary optimization is a methodology for the global optimization of noisy black-box functions. In hyperparameter optimization, evolutionary optimization uses evolutionary algorithms to search the space of hyperparameters for a given algorithm. Evolutionary hyperparameter optimization follows a process inspired by the biological concept of evolution:
1. Create an initial population of random solutions (i.e., randomly generate tuples of hyperparameters, typically 100+)
2. Evaluate the hyperparameter tuples and compute their fitness (e.g., 10-fold cross-validation accuracy of the machine learning algorithm with those hyperparameters)
3. Rank the hyperparameter tuples by their relative fitness
4. Replace the worst-performing hyperparameter tuples with new hyperparameter tuples generated through crossover and mutation
5. Repeat steps 2-4 until satisfactory algorithm performance is reached or algorithm performance is no longer improving
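The steps above can be sketched as follows. The fitness function is a hypothetical stand-in for cross-validated accuracy, the two hyperparameters are searched on a log scale, and the crossover and mutation operators are simple illustrative choices. For brevity the sketch runs a fixed number of generations instead of testing for convergence.

```python
import random

def fitness(params):
    # Hypothetical stand-in for cross-validated accuracy; higher is better.
    log_C, log_gamma = params
    return -((log_C - 2.0) ** 2 + (log_gamma + 1.0) ** 2)

def random_individual(rng):
    return (rng.uniform(0.0, 4.0), rng.uniform(-3.0, 1.0))

def crossover(a, b, rng):
    """Take each hyperparameter from one of the two parents at random."""
    return tuple(rng.choice(pair) for pair in zip(a, b))

def mutate(individual, rng, scale=0.2):
    """Perturb every hyperparameter with Gaussian noise."""
    return tuple(v + rng.gauss(0.0, scale) for v in individual)

def evolve(pop_size=100, n_generations=30, keep_fraction=0.5, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]  # step 1
    history = []
    for _ in range(n_generations):                                  # step 5
        ranked = sorted(population, key=fitness, reverse=True)      # steps 2 and 3
        history.append(fitness(ranked[0]))
        survivors = ranked[: int(pop_size * keep_fraction)]
        children = []
        while len(survivors) + len(children) < pop_size:            # step 4
            a, b = rng.sample(survivors, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = survivors + children
    return max(population, key=fitness), history

best, history = evolve()
```

Because the best-ranked tuples survive unchanged into the next generation, the best fitness recorded in `history` never decreases.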
Evolutionary optimization has been used in hyperparameter optimization for statistical machine learning algorithms, automated machine learning, deep neural network architecture search, as well as training of the weights in deep neural networks.
- LIBSVM comes with scripts for performing grid search.
- scikit-learn is a Python package which includes grid search.
- hyperopt and hyperopt-sklearn are Python packages which include random search.
- scikit-learn is a Python package which includes random search.
- H2O AutoML provides automated data preparation, hyperparameter tuning via random search, and stacked ensembles in a distributed machine learning platform.
- Spearmint is a package to perform Bayesian optimization of machine learning algorithms.
- Bayesopt, an efficient implementation of Bayesian optimization in C/C++ with support for Python, Matlab and Octave.
- MOE is a Python/C++/CUDA library implementing Bayesian Global Optimization using Gaussian Processes.
- Auto-WEKA is a Bayesian hyperparameter optimization layer on top of WEKA.
- Auto-sklearn is a Bayesian hyperparameter optimization layer on top of scikit-learn.
- TPOT is a Python package that automatically creates and optimizes full machine learning pipelines using genetic programming.
- devol is a Python package that performs Deep Neural Network architecture search using genetic programming.
- hyperopt and hyperopt-sklearn are Python packages which include Tree of Parzen Estimators based distributed hyperparameter optimization.
- pycma is a Python implementation of Covariance Matrix Adaptation Evolution Strategy.
- SUMO-Toolbox is a MATLAB toolbox for surrogate modeling supporting a wide collection of hyperparameter optimization algorithms for many model types.
- rbfopt is a Python package that uses a radial basis function model.
- Harmonica is a Python package for spectral hyperparameter optimization.