Markov Decision Processes (MDPs) for Reinforcement Learning (RL)
The Markov Decision Process (MDP) is one of the most important models in reinforcement learning. It lets researchers model an environment with a finite set of states, a finite set of controls, a reward function that tells the agent which controls are good and which are bad in any given state, a transition matrix that stores the probability of moving from each state to every other state under each control, and a discount factor "gamma" that tells the agent how much weight to give future rewards when choosing a control in the present.
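To make these parts concrete, here is a minimal sketch of a finite discounted MDP in NumPy. The state and control counts, the arrays P and R, and the value of gamma are all illustrative assumptions chosen for this example, not part of any standard library API.

```python
import numpy as np

# Illustrative sizes: 3 states, 2 controls (assumptions for this sketch).
n_states, n_controls = 3, 2

# P[a, s, s'] = probability of landing in state s' after taking
# control a in state s. Each row P[a, s, :] must sum to 1.
P = np.array([
    [[0.9, 0.1, 0.0],   # control 0
     [0.0, 0.8, 0.2],
     [0.1, 0.0, 0.9]],
    [[0.5, 0.5, 0.0],   # control 1
     [0.3, 0.0, 0.7],
     [0.0, 0.2, 0.8]],
])

# R[s, a] = immediate reward for taking control a in state s.
R = np.array([
    [ 0.0,  1.0],
    [-1.0,  2.0],
    [ 0.5,  0.0],
])

gamma = 0.95  # discount factor: how much future rewards matter today

# Sanity check: every transition distribution is a valid probability vector.
assert np.allclose(P.sum(axis=2), 1.0)
```

Storing the transition probabilities as one array indexed by (control, state, next state) keeps the whole model in a single object, which is convenient when later computing expected values over next states.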
Finite Discounted MDP Model
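The components above are conventionally collected into a single tuple; one standard way to write it (with symbol names chosen here for illustration) is

$$\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle,$$

where $\mathcal{S}$ is the finite state set, $\mathcal{A}$ the finite control set, $\mathcal{P}(s' \mid s, a)$ the transition probabilities, $\mathcal{R}(s, a)$ the reward function, and $\gamma \in [0, 1)$ the discount factor.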
The MDP model presented above provides a mathematically clear picture of all of the parts required to model an RL problem. The agent's task is to maximize its expected long-term reward by learning an optimal policy for this MDP. The expected return is a finite sum when the horizon is finite, but in general it is an infinite summation of rewards.
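In the infinite-horizon discounted case, the return the agent seeks to maximize is conventionally written as (notation chosen here for illustration)

$$G_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}, \qquad 0 \le \gamma < 1,$$

where the discount factor $\gamma < 1$ keeps the infinite sum finite whenever the rewards are bounded, which is what makes the infinite-horizon objective well defined.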