# Q-learning

Q-learning is a model-free reinforcement learning technique. It is able to compare the expected utility of the available actions (for a given state) without requiring a model of the environment. Additionally, Q-learning can handle problems with stochastic transitions and rewards, without requiring adaptations.

It has been proven that for any finite Markov decision process (MDP), Q-learning eventually finds an optimal policy, in the sense that the expected value of the total reward over all successive steps, starting from the current state, is the maximum achievable.[1]

Specifically, Q-learning can identify an optimal action-selection policy for any given (finite) MDP.[citation needed] It works by learning an action-value function ${\displaystyle Q(s,a)}$, which ultimately gives the expected utility of taking a given action ${\displaystyle a}$ while in a given state ${\displaystyle s}$ and following an optimal policy thereafter. A policy ${\displaystyle \pi }$ is a rule that the agent follows in selecting actions, given the state it is in. Once such an action-value function has been learned, the optimal policy can be constructed by selecting the action with the highest value in each state.
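The last step, building a policy from a learned action-value function, can be sketched in a few lines of Python. The table `Q`, the state names and the action names below are hypothetical values chosen only for illustration.

```python
def greedy_policy(Q, states, actions):
    """Map each state to the action with the highest learned value."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}


# Hypothetical learned values for two states and two actions.
Q = {
    ("s0", "left"): 0.2, ("s0", "right"): 0.7,
    ("s1", "left"): 0.9, ("s1", "right"): 0.4,
}

policy = greedy_policy(Q, states=["s0", "s1"], actions=["left", "right"])
print(policy)  # {'s0': 'right', 's1': 'left'}
```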

## Algorithm

The problem space consists of an agent, a set of states ${\displaystyle S}$, and a set of actions per state ${\displaystyle A}$. By performing an action ${\displaystyle a\in A}$, the agent can move from state to state. Executing an action in a specific state provides the agent with a reward (a numerical score). The goal of the agent is to maximize its total (future) reward. It does this by learning which action is optimal for each state, that is, the action with the highest long-term reward. This reward is a weighted sum of the expected values of the rewards of all future steps starting from the current state, where the weight for a step ${\displaystyle \Delta t}$ steps into the future is ${\displaystyle \gamma ^{\Delta t}}$. Here, ${\displaystyle \gamma }$ is a number between 0 and 1 (${\displaystyle 0\leq \gamma \leq 1}$) called the discount factor, which trades off the importance of earlier versus later rewards. ${\displaystyle \gamma }$ may also be interpreted as the probability of succeeding (or surviving) at every step ${\displaystyle \Delta t}$.
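As a small worked example of this weighting, the following Python sketch computes the discounted sum for a fixed reward sequence. The reward values are made up for illustration, and the function name `discounted_return` is an assumption, not a term from the text.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each weighted by gamma ** k for a reward k steps ahead."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


# Hypothetical reward sequence: three steps with reward 1 each.
rewards = [1.0, 1.0, 1.0]

print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
print(discounted_return(rewards, gamma=0.0))  # only the immediate reward counts: 1.0
```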

The algorithm, therefore, has a function that calculates the quality of a state-action combination:

${\displaystyle Q:S\times A\to \mathbb {R} }$.

Before learning begins, ${\displaystyle Q}$ is initialized to a possibly arbitrary fixed value (chosen by the programmer). Then, at each time ${\displaystyle t}$ the agent selects an action ${\displaystyle a_{t}}$, observes a reward ${\displaystyle r_{t}}$, enters a new state ${\displaystyle s_{t+1}}$ (that may depend on both the previous state ${\displaystyle s_{t}}$ and the selected action), and ${\displaystyle Q}$ is updated. The core of the algorithm is a simple value iteration update, using the weighted average of the old value and the new information:

${\displaystyle Q(s_{t},a_{t})\leftarrow (1-\alpha )\cdot \underbrace {Q(s_{t},a_{t})} _{\rm {old~value}}+\underbrace {\alpha } _{\rm {learning~rate}}\cdot \overbrace {{\bigg (}\underbrace {r_{t}} _{\rm {reward}}+\underbrace {\gamma } _{\rm {discount~factor}}\cdot \underbrace {\max _{a}Q(s_{t+1},a)} _{\rm {estimate~of~optimal~future~value}}{\bigg )}} ^{\rm {learned~value}}}$

where ${\displaystyle r_{t}}$ is the reward received when moving from state ${\displaystyle s_{t}}$ to state ${\displaystyle s_{t+1}}$, and ${\displaystyle \alpha }$ is the learning rate (${\displaystyle 0<\alpha \leq 1}$).
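The update rule translates almost term by term into code. The following is a minimal Python sketch, assuming the table is stored as a dictionary keyed by `(state, action)` pairs; the function name `q_update` and the state and action labels are hypothetical.

```python
from collections import defaultdict


def q_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update to the table Q in place and return the new value."""
    # Estimate of optimal future value: max over actions in the next state.
    best_next = max(Q[(next_state, a)] for a in actions)
    learned_value = reward + gamma * best_next
    # Weighted average of the old value and the learned value.
    Q[(state, action)] = (1 - alpha) * Q[(state, action)] + alpha * learned_value
    return Q[(state, action)]


Q = defaultdict(float)            # every entry starts at 0.0
actions = ["left", "right"]
Q[("s1", "right")] = 2.0          # hypothetical value already learned for the next state

new_value = q_update(Q, "s0", "right", reward=1.0, next_state="s1",
                     actions=actions, alpha=0.5, gamma=0.9)
# learned value = 1.0 + 0.9 * 2.0 = 2.8; new Q = 0.5 * 0.0 + 0.5 * 2.8 = 1.4
print(new_value)
```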

An episode of the algorithm ends when state ${\displaystyle s_{t+1}}$ is a final or terminal state. However, Q-learning can also learn in non-episodic tasks.[citation needed] If the discount factor is lower than 1, the action values are finite even if the problem can contain infinite loops.

For all final states ${\displaystyle s_{f}}$, ${\displaystyle Q(s_{f},a)}$ is never updated, but is set to the reward value ${\displaystyle r}$ observed for state ${\displaystyle s_{f}}$. In most cases, ${\displaystyle Q(s_{f},a)}$ can be taken to be equal to zero.
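Putting the pieces together, the sketch below runs tabular Q-learning in Python on a small, made-up chain environment. Everything specific here is an assumption for illustration: the five-state chain, the reward of 1 for reaching the last state, the epsilon-greedy exploration scheme, and the parameter values. The value of the terminal state is taken to be zero, as described above.

```python
import random

N_STATES = 5          # states 0..4; state 4 is terminal (hypothetical chain environment)
ACTIONS = (-1, +1)    # move left or move right


def step(state, action):
    """Deterministic chain: reward 1 on reaching the terminal state, 0 otherwise."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy selection (one common exploration scheme, assumed here).
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(Q[(state, a)] for a in ACTIONS)
                action = rng.choice([a for a in ACTIONS if Q[(state, a)] == best])
            next_state, reward, done = step(state, action)
            # Terminal states are never updated, so their value is taken as 0.
            future = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
            # Same update as the weighted-average form, written as an increment.
            Q[(state, action)] += alpha * (reward + gamma * future - Q[(state, action)])
            state = next_state
    return Q


Q = train()
policy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)]
print(policy)  # [1, 1, 1, 1]: always move right
```

The episode ends as soon as the terminal state is entered, so no entry for state 4 is ever written.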

## Influence of variables on the algorithm

### Learning rate

The learning rate or step size determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing, while a factor of 1 makes the agent consider only the most recent information. In fully deterministic environments, a learning rate of ${\displaystyle \alpha _{t}=1}$ is optimal. When the problem is stochastic, the algorithm converges under some technical conditions on the learning rate that require it to decrease to zero. In practice, a constant learning rate is often used, such as ${\displaystyle \alpha _{t}=0.1}$ for all ${\displaystyle t}$.[2]
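The effect of the three regimes can be seen on a single value that is repeatedly updated toward noisy targets. In this Python sketch the sample values are made up, and the schedule ${\displaystyle \alpha _{n}=1/n}$ is just one example of a learning rate that decreases to zero; it is not claimed to be the schedule used in any particular convergence proof.

```python
def run_updates(samples, alpha_schedule, initial=0.0):
    """Apply estimate <- (1 - alpha) * estimate + alpha * sample for each sample in order."""
    estimate = initial
    for n, sample in enumerate(samples, start=1):
        alpha = alpha_schedule(n)
        estimate = (1 - alpha) * estimate + alpha * sample
    return estimate


samples = [4.0, 0.0, 2.0, 6.0]  # hypothetical noisy targets with mean 3.0

print(run_updates(samples, lambda n: 0.0))      # 0.0: nothing is learned
print(run_updates(samples, lambda n: 1.0))      # 6.0: only the last sample survives
print(run_updates(samples, lambda n: 1.0 / n))  # approximately 3.0: the running average
```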

### Discount factor

The discount factor ${\displaystyle \gamma }$ determines the importance of future rewards. A factor of 0 will make the agent "myopic" (or short-sighted) by only considering current rewards, while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the action values may diverge. For ${\displaystyle \gamma =1}$, without a terminal state, or if the agent never reaches one, all environment histories become infinitely long, and utilities with additive, undiscounted rewards generally become infinite.[3] Even with a discount factor only slightly lower than 1, Q-function learning leads to propagation of errors and instabilities when the value function is approximated with an artificial neural network.[4] In that case, it is known that starting with a lower discount factor and increasing it towards its final value accelerates learning.[5]
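Why a discount factor below 1 keeps the values finite can be checked with a geometric series. Assuming, for illustration, that every reward is bounded in magnitude by some constant ${\displaystyle r_{\max }}$ (a symbol introduced here, not used elsewhere in this article), and that ${\displaystyle 0\leq \gamma <1}$:

${\displaystyle \left|\sum _{k=0}^{\infty }\gamma ^{k}r_{t+k}\right|\leq \sum _{k=0}^{\infty }\gamma ^{k}r_{\max }={\frac {r_{\max }}{1-\gamma }}}$

As ${\displaystyle \gamma }$ approaches 1 this bound grows without limit, which is consistent with the possible divergence for ${\displaystyle \gamma =1}$.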

### Initial conditions (Q0)

Since Q-learning is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. High initial values, also known as "optimistic initial conditions",[6] can encourage exploration: no matter which action is selected, the update rule will cause it to have a lower value than the alternatives, thus increasing their choice probability. The first reward ${\displaystyle r}$ can be used to reset the initial conditions (RIC).[citation needed] According to this idea, the first time an action is taken, the reward is used to set the value of ${\displaystyle Q}$. This allows immediate learning in the case of fixed deterministic rewards. RIC seems to be consistent with human behaviour in repeated binary choice experiments.[7]
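The exploration effect of optimistic initial values can be illustrated with a purely greedy agent in a one-state problem. The Python sketch below is an assumption-laden toy: the three fixed rewards, the initial values 0.0 and 5.0, and the function name `greedy_run` are all made up for the example.

```python
def greedy_run(initial_q, rewards, steps=20, alpha=0.5):
    """Purely greedy agent in a one-state problem with a fixed reward per action."""
    q = [initial_q] * len(rewards)
    tried = set()
    for _ in range(steps):
        action = max(range(len(q)), key=lambda a: q[a])  # ties go to the lowest index
        tried.add(action)
        # One-state special case of the update rule (no future term).
        q[action] += alpha * (rewards[action] - q[action])
    return q, tried


rewards = [0.2, 0.5, 1.0]  # hypothetical fixed rewards; action 2 is best

q_zero, tried_zero = greedy_run(initial_q=0.0, rewards=rewards)
q_opt, tried_optimistic = greedy_run(initial_q=5.0, rewards=rewards)

print(sorted(tried_zero))        # [0]: the greedy agent never leaves the first action
print(sorted(tried_optimistic))  # [0, 1, 2]: every action is tried at least once
```

With the optimistic start, each tried action drops below the untried ones, so the greedy choice moves on to the alternatives without any explicit exploration rule.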

## Implementation

Q-learning at its simplest uses tables to store data. This approach falters with increasing sizes of state/action space of the system.
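A rough Python sketch of the tabular representation, and of how quickly it grows, is given below. The sizes are arbitrary, and the helper `table_entries` is a hypothetical function that simply counts entries under the assumption that a state is described by several variables with the same number of possible values each.

```python
n_states, n_actions = 6, 2
# One row per state, one column per action, all values initialised to 0.
q_table = [[0.0] * n_actions for _ in range(n_states)]


def greedy_action(q_table, state):
    """Index of the highest-valued action in the given state's row."""
    row = q_table[state]
    return max(range(len(row)), key=lambda a: row[a])


def table_entries(values_per_variable, n_variables, n_actions):
    """Number of table entries when a state consists of n_variables discrete variables."""
    return (values_per_variable ** n_variables) * n_actions


q_table[3][1] = 0.5
print(greedy_action(q_table, 3))  # 1
print(table_entries(10, 3, 2))    # 2000
print(table_entries(10, 6, 2))    # 2000000
```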

### Function approximation

One solution is to use an (adapted) artificial neural network as a function approximator.[8]

More generally, Q-learning can be combined with function approximation.[9] This makes it possible to apply the algorithm to larger problems, even when the state space is continuous. Additionally, it may speed up learning in finite problems, due to the fact that the algorithm can generalize earlier experiences to previously unseen states.
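As one concrete instance, the Python sketch below uses a linear approximator, ${\displaystyle Q(s,a)\approx w\cdot \phi (s,a)}$, updated with a semi-gradient step. This is a common textbook construction rather than the method of any cited work; the feature vectors, the names `q_value` and `linear_q_update`, and the parameter values are assumptions for illustration.

```python
def q_value(weights, features):
    """Linear action value: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))


def linear_q_update(weights, phi, reward, next_phis, alpha=0.1, gamma=0.9):
    """One semi-gradient Q-learning step for Q(s, a) = weights . phi(s, a).

    phi: feature vector of the (state, action) pair just taken.
    next_phis: feature vectors of every action in the next state (empty if terminal).
    """
    best_next = max((q_value(weights, p) for p in next_phis), default=0.0)
    td_error = reward + gamma * best_next - q_value(weights, phi)
    # The gradient of a linear Q with respect to the weights is the feature vector.
    return [w + alpha * td_error * x for w, x in zip(weights, phi)]


weights = [0.0, 0.0]
visited = [1.0, 0.5]   # hypothetical features of the pair that was experienced
unseen = [1.0, 0.0]    # hypothetical features of a pair that was never experienced

weights = linear_q_update(weights, visited, reward=1.0, next_phis=[])
print(q_value(weights, unseen))  # non-zero: the update generalised to the unseen pair
```

Because the two feature vectors share a component, a single update changes the estimate for the pair that was never visited, which is the generalization effect described above.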

### Quantization

Another technique to decrease the state/action space is to quantize the possible values. Consider the example of learning to balance a stick on a finger. Describing a state at a certain point in time involves the position of the finger in space, its velocity and the angle of the stick. This yields a three-element vector that describes one state, i.e. a snapshot of one state encoded into three values. The problem is that infinitely many possible states exist. To shrink the space of possible states, multiple values are assigned to a bucket. The exact distance of the finger from its starting position (-Infinity to Infinity) is not recorded, only whether it is far away or not (Near, Far).
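The bucketing idea can be sketched in Python with the standard `bisect` module. The bucket edges below are invented for the stick-balancing example and carry no physical meaning; `bucket` and `discretize` are hypothetical helper names.

```python
import bisect


def bucket(value, edges):
    """Index of the interval that contains value, given sorted bucket edges."""
    return bisect.bisect_right(edges, value)


# Hypothetical bucket edges for each of the three state variables.
POSITION_EDGES = [-1.0, 1.0]   # far left / near / far right
VELOCITY_EDGES = [0.0]         # moving left / moving right
ANGLE_EDGES = [-0.1, 0.1]      # tilted left / upright / tilted right


def discretize(position, velocity, angle):
    """Map a continuous three-element state to a tuple of bucket indices."""
    return (
        bucket(position, POSITION_EDGES),
        bucket(velocity, VELOCITY_EDGES),
        bucket(angle, ANGLE_EDGES),
    )


print(discretize(0.3, -2.0, 0.05))   # (1, 0, 1)
print(discretize(250.0, 4.0, -0.4))  # (2, 1, 0)
```

With these edges the infinite state space collapses to 3 × 2 × 3 = 18 discrete states, each of which can index a row of a Q-table.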

## History

Q-learning was introduced by Watkins[10] in 1989. The convergence proof was presented by Watkins and Dayan[11] in 1992.

Watkins was addressing “Learning from delayed rewards”, the title of his PhD Thesis. Eight years earlier in 1981 the same problem under the name of “Delayed reinforcement learning” was solved by a system named Crossbar Adaptive Array (CAA).[12] It was initially published in 1982.[13] The memory matrix W(a,s) of the presented CAA architecture is the same as the Q-table of Q-learning. The architecture shown in the Figure introduced the term “state evaluation” in reinforcement learning. The crossbar learning algorithm, written in mathematical pseudocode in the paper, in each iteration performs the following computation:

• In state s perform action a;
• Compute state evaluation v(s’);
• Update crossbar value W’(a,s) = W(a,s) + v(s’).

The term “secondary reinforcement” is borrowed from animal learning theory, to model state values via backpropagation: the state value v(s’) of the consequence situation is backpropagated to the previously encountered situation s. In a crossbar fashion, CAA computes state values vertically and actions horizontally. Demonstration graphs showing delayed reinforcement learning contained states (desirable, undesirable, and neutral states), which were computed by the state evaluation function. In 1997 this learning system was recognized as a forerunner of the Q-learning algorithm.[14]

## Variants

### Deep Q-learning

An application of Q-learning to deep learning, by Google DeepMind, titled "deep reinforcement learning" or "deep Q-learning", was successful at playing Atari 2600 games at expert human levels. Preliminary results were presented in 2014. The system used a deep convolutional neural network, which used hierarchical layers of tiled convolutional filters to mimic the effects of receptive fields. Reinforcement learning can be unstable or divergent when a nonlinear function approximator such as a neural network is used to represent Q. This instability comes from the correlations present in the sequence of observations, the fact that small updates to Q may significantly change the policy and the data distribution, and the correlations between Q and the target values. This variant of Q-learning used a biologically inspired mechanism termed experience replay that randomizes over the data, thereby removing correlations in the observation sequence and smoothing over changes in the data distribution. It also used an iterative update that adjusts Q towards target values that are only periodically updated, further reducing correlations with the target.[15]
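The two stabilising mechanisms can be sketched independently of any neural network library. The Python code below is a simplified illustration and not the published implementation: the class `ReplayBuffer`, the function `sync_if_due`, the capacity, and the synchronisation interval are all hypothetical, and network parameters are represented by a plain list.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive observations.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def sync_if_due(step, online_params, target_params, sync_every):
    """Return a fresh copy of the online parameters every sync_every steps."""
    if step % sync_every == 0:
        return list(online_params)
    return target_params


buffer = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buffer.add(state=i, action=0, reward=0.0, next_state=i + 1, done=False)

batch = buffer.sample(2)
print(len(buffer), len(batch))  # 3 2
```

In a full system, each sampled batch would be used for a gradient step on the online network, with targets computed from the periodically synchronised copy.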

### Double Q-learning

Because the maximum approximated action value is used in Q-learning, in noisy environments Q-learning can sometimes overestimate the action values, slowing the learning. A variant called Double Q-learning was proposed to correct this.[16] This algorithm was later combined with deep learning, as in the DQN algorithm, resulting in Double DQN, which outperforms the original DQN algorithm.[17]
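The core idea of Double Q-learning is to keep two value tables, selecting the best next action with one and evaluating it with the other. The Python sketch below shows only the target computation and a single update, with terminal-state handling omitted; the table contents are made-up noisy estimates, and the function names are hypothetical.

```python
import random


def double_q_target(select, evaluate, reward, next_state, actions, gamma):
    """Choose the best next action with one table, evaluate it with the other."""
    best_action = max(actions, key=lambda a: select[(next_state, a)])
    return reward + gamma * evaluate[(next_state, best_action)]


def double_q_update(QA, QB, state, action, reward, next_state, actions,
                    alpha=0.1, gamma=0.9, rng=random):
    """Update one of the two tables, chosen at random, using the other for evaluation."""
    update, evaluate = (QA, QB) if rng.random() < 0.5 else (QB, QA)
    target = double_q_target(update, evaluate, reward, next_state, actions, gamma)
    update[(state, action)] += alpha * (target - update[(state, action)])


actions = ["a0", "a1"]
# Hypothetical noisy estimates for a state whose true action values are all 0.
QA = {("s0", "a0"): 0.0, ("s0", "a1"): 0.0, ("s1", "a0"): 1.0, ("s1", "a1"): -1.0}
QB = {("s0", "a0"): 0.0, ("s0", "a1"): 0.0, ("s1", "a0"): -0.5, ("s1", "a1"): 0.5}

single_target = 0.0 + 0.9 * max(QA[("s1", a)] for a in actions)
double_target = double_q_target(QA, QB, 0.0, "s1", actions, gamma=0.9)
print(single_target, double_target)  # the single-table target is the larger of the two
```

In this example the single-table target inherits the positive noise in `QA`, while the double target does not, because the noise in the two tables is assumed to be independent.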

### Others

Delayed Q-learning is an alternative implementation of the online Q-learning algorithm, with probably approximately correct (PAC) learning.[18]

Greedy GQ is a variant of Q-learning for use in combination with (linear) function approximation.[19] The advantage of Greedy GQ is that convergence guarantees can be given even when function approximation is used to estimate the action values.

## References

1. ^ Francisco S. Melo, "Convergence of Q-learning: a simple proof"
2. ^ Reinforcement Learning: An Introduction. Richard Sutton and Andrew Barto. MIT Press, 1998.
3. ^ Stuart J. Russell; Peter Norvig (2010). Artificial Intelligence: A Modern Approach (Third ed.). Prentice Hall. p. 649. ISBN 978-0136042594.
4. ^ Baird, Leemon (1995). "Residual algorithms: Reinforcement learning with function approximation" (PDF). ICML: 30–37.
5. ^ François-Lavet, Vincent; Fonteneau, Raphael; Ernst, Damien (2015-12-07). "How to Discount Deep Reinforcement Learning: Towards New Dynamic Strategies". arXiv: [cs.LG].
6. ^ http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node21.html
7. ^ Shteingart, H; Neiman, T; Loewenstein, Y (May 2013). "The Role of First Impression in Operant Learning". J Exp Psychol Gen. 142 (2): 476–88. doi:10.1037/a0029550. PMID 22924882.
8. ^ Tesauro, Gerald (March 1995). "Temporal Difference Learning and TD-Gammon". Communications of the ACM. 38 (3): 58. doi:10.1145/203330.203343. Retrieved 2010-02-08.
9. ^ Hasselt, Hado van (5 March 2012). "Reinforcement Learning in Continuous State and Action Spaces". In Wiering, Marco; Otterlo, Martijn van. Reinforcement Learning: State-of-the-Art. Springer Science & Business Media. pp. 207–251. ISBN 978-3-642-27645-3.
10. ^ Watkins, C.J.C.H. (1989), Learning from Delayed Rewards (PDF) (Ph.D. thesis), Cambridge University
11. ^ Watkins, C.J.C.H.; Dayan, P. (1992). "Q-learning". Machine Learning.
12. ^ Bozinovski, S. (15 July 1999). "Crossbar Adaptive Array: The first connectionist network that solved the delayed reinforcement learning problem". In Dobnikar, Andrej; Steele, Nigel C.; Pearson, David W.; Albrecht, Rudolf F. Artificial Neural Nets and Genetic Algorithms: Proceedings of the International Conference in Portorož, Slovenia, 1999. Springer Science & Business Media. pp. 320–325. ISBN 978-3-211-83364-3.
13. ^ Bozinovski, S. (1982). "A self learning system using secondary reinforcement". In Trappl, Robert. Cybernetics and Systems Research: Proceedings of the Sixth European Meeting on Cybernetics and Systems Research. North Holland. pp. 397–402. ISBN 978-0-444-86488-8.
14. ^ Barto, A. (24 February 1997). "Reinforcement learning". In Omidvar, Omid; Elliott, David L. Neural Systems for Control. Elsevier. ISBN 978-0-08-053739-9.
15. ^ Mnih, Volodymyr; et al. (2015). "Human-level control through deep reinforcement learning" (PDF). Nature. 518: 529–533.
16. ^ van Hasselt, Hado (2011). "Double Q-learning" (PDF). Advances in Neural Information Processing Systems. 23: 2613–2622.
17. ^ van Hasselt, Hado; Guez, Arthur; Silver, David (2015). "Deep reinforcement learning with double Q-learning" (PDF). AAAI Conference on Artificial Intelligence: 2094–2100.
18. ^ Strehl, Alexander L.; Li, Lihong; Wiewiora, Eric; Langford, John; Littman, Michael L. (2006). "PAC model-free reinforcement learning" (PDF). Proc. 22nd ICML: 881–888.
19. ^ Maei, Hamid; Szepesvári, Csaba; Bhatnagar, Shalabh; Sutton, Richard (2010). "Toward off-policy learning control with function approximation" (PDF). Proceedings of the 27th International Conference on Machine Learning. pp. 719–726.