# Q-learning


Q-learning is a reinforcement learning technique that works by learning an action-value function giving the expected utility of taking a given action in a given state and following a fixed policy thereafter. A strength of Q-learning is that it is able to compare the expected utility of the available actions without requiring a model of the environment. A recent variation called delayed Q-learning has shown substantial improvements, bringing PAC (probably approximately correct) bounds to Markov decision processes.

## Algorithm

The problem model consists of an agent, a set of states S, and a set of actions per state A. By performing an action a, the agent can move from state to state. Each state provides the agent with a reward (a real or natural number) or a punishment (a negative reward). The goal of the agent is to maximize its total reward. It does this by learning which action is optimal in each state.

The algorithm therefore has a function which calculates the Quality of a state-action combination:

$Q: S \times A \to \mathbb{R}$

Before learning has started, Q returns a fixed value, chosen by the designer. Then, each time the agent is given a reward (the state has changed), a new value is calculated for the combination of the state s from S and the action a from A that was taken. The core of the algorithm is a simple value-iteration update: it takes the old value and makes a correction based on the new information.

$Q(s_t,a_t) \leftarrow \underbrace{Q(s_t,a_t)}_{old~value} + \underbrace{\alpha_t(s_t,a_t)}_{learning~rate} \times [\overbrace{\underbrace{r_{t+1}}_{reward} + \underbrace{\gamma}_{discount~factor} \underbrace{\max_{a}Q(s_{t+1}, a)}_{max~future~value}}^{expected~discounted~reward} - \overbrace{Q(s_t,a_t)}^{old~value}]$

Where $r_{t+1}$ is the reward observed after performing action $a_t$ in state $s_t$, and $\alpha_t(s, a)$ ($0 < \alpha \le 1$) is the learning rate, which may be the same value for all pairs. The discount factor $\gamma$ is such that $0 \le \gamma < 1$.

The above formula is equivalent to:

$Q(s_t,a_t) \leftarrow Q(s_t,a_t)(1-\alpha_t(s_t,a_t)) + \alpha_t(s_t,a_t) [r_{t+1} + \gamma \max_{a}Q(s_{t+1}, a)]$
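The update rule above can be sketched in a few lines of code. This is a minimal tabular implementation; the dictionary-based Q-table, the `q_update` helper, and the default `alpha`/`gamma` values are illustrative assumptions, not part of the original description.

```python
def q_update(Q, s, a, reward, s_next, actions, alpha=0.5, gamma=0.9):
    """Apply one Q-learning update to the table Q, a dict keyed by (state, action).

    Unvisited pairs default to 0.0, the "fixed value chosen by the designer".
    """
    # max over next-state actions: the "max future value" term
    max_future = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    Q[(s, a)] = old + alpha * (reward + gamma * max_future - old)
    return Q[(s, a)]
```

For example, starting from an empty table, taking a hypothetical action `a0` in state `s0` with reward 1.0 yields a new value of `0.5 * 1.0 = 0.5`, matching the equivalent form $Q(1-\alpha) + \alpha[r + \gamma \max_a Q]$.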

## Influence of variables on the algorithm

### Learning rate

The learning rate determines to what extent the newly acquired information will override the old information. A factor of 0 will make the agent not learn anything, while a factor of 1 would make the agent consider only the most recent information.

### Discount factor

The discount factor determines the importance of future rewards. A factor of 0 will make the agent "opportunistic" by only considering current rewards, while a factor of 1 will make it strive for a long-term high reward.

## Implementation

Q-learning at its simplest uses tables to store data. This very quickly loses viability as the complexity of the system being monitored/controlled increases. One answer to this problem is to use an (adapted) artificial neural network as a function approximator, as demonstrated by Tesauro in his backgammon-playing temporal difference learning research. An adaptation of the standard neural network is required because the target value (from which the error signal is generated) is itself generated at run-time.
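To make the function-approximation idea concrete, here is a hedged sketch using a simple linear approximator in place of both the table and Tesauro's network; the feature representation and helper names are assumptions for illustration only. The key point from the text survives: the target $r + \gamma \max_{a'} Q(s', a')$ is computed at run-time from the current weights.

```python
def q_value(weights, features):
    """Approximate Q(s, a) as a dot product of weights and a feature vector
    for the state-action pair (a stand-in for a neural network's output)."""
    return sum(w * f for w, f in zip(weights, features))

def approx_update(weights, feats_sa, reward, next_feats, alpha=0.1, gamma=0.9):
    """One semi-gradient Q-learning step on a linear approximator.

    next_feats is a list of feature vectors, one per available next action;
    the target is generated at run-time from the current weights.
    """
    max_next = max((q_value(weights, f) for f in next_feats), default=0.0)
    td_error = reward + gamma * max_next - q_value(weights, feats_sa)
    # gradient of a linear Q w.r.t. its weights is just the feature vector
    return [w + alpha * td_error * f for w, f in zip(weights, feats_sa)]
```

With two features, zero initial weights, reward 1.0, and one next action, a single step moves each weight by `alpha * td_error * feature`, so a unit feature gains 0.1.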