Markov Decision Processes (MDPs) for Reinforcement Learning (RL)

Caleb M. Bowyer, Ph.D. Candidate
3 min read · Aug 18, 2022

The Markov Decision Process (MDP) is one of the most important models in all of reinforcement learning. It lets researchers model an environment with a finite set of states, a finite set of controls, a reward function that tells the agent which controls are good and which are bad in each state, a transition matrix that stores the probabilities of moving from one state to another, and a discount factor “gamma” that tells the agent how much weight to give future rewards when choosing a control in the present.
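As a rough sketch of how these pieces fit together in code (the array names, the 3-state/2-control setup, and every number below are my own illustration, not something from the article), a finite MDP can be stored in a few NumPy arrays:

```python
import numpy as np

# Hypothetical finite MDP with 3 states and 2 controls (purely illustrative numbers).
n_states, n_controls = 3, 2

# P[u, s, s_next]: probability of moving from state s to s_next under control u.
P = np.array([
    [[0.8, 0.2, 0.0],
     [0.1, 0.7, 0.2],
     [0.0, 0.3, 0.7]],  # control u = 0
    [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.2, 0.0, 0.8]],  # control u = 1
])

# r[s, u]: immediate reward for applying control u in state s.
r = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
    [5.0, 0.5],
])

gamma = 0.95  # discount factor: weight given to future rewards

# Each row of every transition matrix must be a valid probability distribution.
assert np.allclose(P.sum(axis=-1), 1.0)
```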

Finite Discounted MDP Model
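The equation image for the model is not reproduced here; in standard notation (which may differ from the symbols in the original figure), a finite discounted MDP is the tuple

```latex
% Finite discounted MDP, standard notation (the original figure's symbols may differ)
\mathcal{M} = \langle \mathcal{S},\, \mathcal{U},\, P,\, r,\, \gamma \rangle,
\qquad
P(s' \mid s, u) = \Pr\{\, s_{k+1} = s' \mid s_k = s,\ u_k = u \,\},
\qquad
\gamma \in [0, 1),
```

where S is the finite state set, U the finite control set, and r(s, u) the reward for applying control u in state s.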

The MDP model presented above gives a mathematically clear picture of all the parts needed to model an RL problem. The agent’s task is to maximize its expected long-term reward by learning an optimal policy for the given MDP. The return is a finite sum of rewards when the horizon is finite, but in general it is an infinite, discounted sum of rewards.

Expected Discounted Return
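The formula itself sits behind the figure above; in a standard form (again, the exact notation of the original may differ), the expected discounted return the agent maximizes under a policy π is

```latex
% Expected discounted return under a policy pi (standard form)
J_{\pi}(s_0) \;=\;
\mathbb{E}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r(s_k, u_k)
\,\middle|\, s_0,\; u_k = \pi(s_k)\right],
\qquad 0 \le \gamma < 1 .
```

With bounded rewards and gamma strictly less than 1, this infinite sum converges, which is what keeps the infinite-horizon objective well defined; for a finite horizon N the sum simply stops at k = N − 1.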
