What is dynamic programming?
Behind this strange and mysterious name hides a pretty straightforward concept. Dynamic programming, or DP for short, is a collection of methods used to calculate optimal policies – to solve the Bellman equations. Before you get any more hyped up, know that there are severe limitations which make DP of very limited use. Here are the main ones:
- It needs a perfect model of the environment in the form of a Markov Decision Process – and that's a hard requirement to satisfy. It's fine for simpler problems, but try to model a game of chess with a description of all its states. Have fun 😉
- It's computationally expensive – as I mentioned before, solving the Bellman equations is a brute-force job. Try to do it for a chess game. You would need a few orders of magnitude more computing power than all of humanity already has.
So why even bother checking out dynamic programming? 🤔 Well, it's an important step towards understanding the methods that come later in the book. And dynamic programming provides us with the optimal solutions. Other Reinforcement Learning methods try to do pretty much the same, only with fewer resources and an imperfect environment model.
Before we jump into the theory and code let’s see what “game” we will try to beat this time. Here is the board:
I coded the game to be exactly the same as the one in the book. You can find the code to print the board and all other accompanying functions in the notebook I prepared.
- X is the terminal state, where our game ends. It's the place we want to reach as fast as possible.
- A is the agent position.
The agent starts in a random state which is not a terminal state. Every step it takes has a reward of -1, to optimize the number of moves needed to reach the finish line. The agent can move in any direction (north, south, east, west). If a move would take the agent off the board, it stays on the same field (`s' == s`).
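The movement rules above can be sketched roughly like this. This is a reconstruction, not the notebook's code: the 4x4 board with states numbered 0–15 is my assumption based on the book's example, and the helper name `step` is mine.

```python
# A minimal sketch of the movement rules, assuming a 4x4 board with states
# numbered 0-15 left-to-right, top-to-bottom. Every move yields a reward of -1.

SIZE = 4

def step(s, move):
    """Return the resulting state s'; a move off the board keeps s' == s."""
    row, col = divmod(s, SIZE)
    row += {"N": -1, "S": 1}.get(move, 0)
    col += {"W": -1, "E": 1}.get(move, 0)
    if 0 <= row < SIZE and 0 <= col < SIZE:
        return row * SIZE + col
    return s
```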
Let’s try it out! 🏁
To debug the board and agent code, and to benchmark it later on, I tested the agent with a random policy, which means that on every move it has a 25% chance of going in each direction. An agent with such a policy is pretty much clueless. Here is the code for it:
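Since the notebook's snippet isn't reproduced here, a hedged sketch of the two helpers described next could look like this. The function names `create_random_policy` and `create_state_to_state_prime_map` follow the article's text; the 4x4 board layout is my assumption.

```python
# Sketch of the helpers that build the random policy and the s -> s' map.

SIZE = 4
MOVES = ["N", "S", "E", "W"]

def create_random_policy(n_states=SIZE * SIZE):
    # 25% chance of moving in each direction, in every state
    return {s: {m: 1 / len(MOVES) for m in MOVES} for s in range(n_states)}

def create_state_to_state_prime_map(n_states=SIZE * SIZE):
    def s_prime(s, move):
        row, col = divmod(s, SIZE)
        row += {"N": -1, "S": 1}.get(move, 0)
        col += {"W": -1, "E": 1}.get(move, 0)
        # a move off the board leaves the agent where it was (s' == s)
        return row * SIZE + col if 0 <= row < SIZE and 0 <= col < SIZE else s
    return {s: {m: s_prime(s, m) for m in MOVES} for s in range(n_states)}
```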
`policy` is a dictionary containing the states as keys and the move chances as values, in the form of another dictionary. The inner keys are move directions and the float number is the chance of the agent performing that move.
`state_to_state_prime_map` is a map similar to `policy`, but as values we have dictionaries with move directions and the resulting states (`s'`) after performing those moves.
The dictionaries look like that:
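Since the notebook's printout isn't reproduced here, their rough shape (shown for state `1` of a 4x4 board, an assumption of mine) would be:

```python
# Illustrative shapes only; the exact printout comes from the notebook.
policy = {
    1: {"N": 0.25, "S": 0.25, "E": 0.25, "W": 0.25},
    # ... one entry per state
}

state_to_state_prime_map = {
    # moving north from state 1 would leave the board, so s' == s
    1: {"N": 1, "S": 5, "E": 2, "W": 0},
    # ... one entry per state
}
```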
What the agent function does is, until the terminal state (`15`) is reached, it creates a random float between 0 and 1, compares it against the current state's policy to decide on a move, and then checks which state is `s'` for that action.
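That loop could be sketched like this. This is a reconstruction, not the notebook's code: the 4x4 board with terminal state `15` and all helper names are assumptions of mine.

```python
import random

SIZE, TERMINAL = 4, 15
MOVES = ["N", "S", "E", "W"]

def state_prime(s, move):
    """Resulting state s'; a move off the board keeps the agent in place."""
    row, col = divmod(s, SIZE)
    row += {"N": -1, "S": 1}.get(move, 0)
    col += {"W": -1, "E": 1}.get(move, 0)
    return row * SIZE + col if 0 <= row < SIZE and 0 <= col < SIZE else s

def agent(policy, start):
    """Follow the policy from `start` until the terminal state; count steps."""
    s, steps = start, 0
    while s != TERMINAL:
        roll, cumulative = random.random(), 0.0
        for move in MOVES:
            cumulative += policy[s][move]
            if roll < cumulative:  # pick the move whose chance band contains roll
                break
        s = state_prime(s, move)
        steps += 1
    return steps

random_policy = {s: {m: 0.25 for m in MOVES} for s in range(SIZE * SIZE)}
```

Averaging the returned step counts over many runs and random starting states gives the benchmark number quoted in the article.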
Let’s see how an agent performs with the random policy:
The average number of steps an agent with the random policy needs to take to complete the task is 19.843. Pretty bad, right?
Back to the Finite Markov Decision Process
We need to get back to the finite MDP for a while. From this moment on, it will always be with us when solving Reinforcement Learning problems. Finite MDP means we can describe it with the probabilities `p(s', r | s, a)`. Quick reminder:
- `s'` – resulting state
- `s` – current state
- `r` – reward
- `a` – action
In plain English, `p(s', r | s, a)` means: the probability of ending up in the resulting state with the reward, given the current state and action. The set is exhaustive, which means it contains all possibilities, even those not allowed by our game. For our simple problem it contains 1024 values, and our reward is always -1! Tell me about brute-force algorithms. 😎 I decided to include this section as this term will appear often in Reinforcement Learning.
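Under the 4x4-board assumption, the exhaustive table could be built like this (a sketch with names of my own, not the notebook's code): 16 states × 4 actions × 16 resulting states, with the single reward of -1, gives exactly the 1024 values mentioned above.

```python
# Build the exhaustive p(s', r | s, a) table for a deterministic 4x4 gridworld.

SIZE = 4
MOVES = ["N", "S", "E", "W"]

def state_prime(s, move):
    row, col = divmod(s, SIZE)
    row += {"N": -1, "S": 1}.get(move, 0)
    col += {"W": -1, "E": 1}.get(move, 0)
    return row * SIZE + col if 0 <= row < SIZE and 0 <= col < SIZE else s

p = {}
for s in range(SIZE * SIZE):
    for a in MOVES:
        for sp in range(SIZE * SIZE):
            # deterministic moves: probability 1 if sp is the actual result, else 0
            p[(sp, -1, s, a)] = 1.0 if sp == state_prime(s, a) else 0.0
```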
Iterative policy evaluation
This is the first method I am going to describe. Both of them will use the iterative approach: we will solve the Bellman equations by iterating over and over.
The two hyperparameters here are `theta` and `discount_rate`. Theta is a parameter controlling the degree of approximation (smaller is more precise). I described the discount rate last time; it diminishes rewards received in the future.
Let me walk you through it step by step.
- Initialization of the `V_s` dictionary. It contains the value of each state, and we will use it to create a better policy that chooses the next state based on the maximum value. Remember to set the terminal states' values to `0`; the rest is up to you.
- Creation of the probability map described in the previous section. You can use a global variable or anything else. It doesn't change, so you don't have to create it fresh each time.
- Initialization of the `delta` parameter to some high value (higher than `0`) for later use.
- Start of the `while` loop. We will repeat everything until `delta` becomes smaller than `theta`.
- Start of the `for` loop iterating through all states.
- Assignment of the current state's value to a local variable.
- Start of the summation. The heart of the algorithm is here. There are 2 sums here, hence 2 additional `for` loops. `action_total` is the sum of probabilities for each action, while `total` is the sum of all action sums. You can find the math in Reinforcement Learning: An Introduction (Book site | Amazon).
- Assignment of `total`, which is the new estimated value for the current state, to the `V_s` dictionary containing all values.
- Update of `delta`, which can stop the `while` loop.
- Return of the `V_s` dictionary to create a new policy later on.
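The steps above could be sketched as follows. This is a reconstruction under my earlier assumptions (4x4 board, terminal state `15`, deterministic moves with reward -1); `V_s`, `theta` and `discount_rate` follow the article's names, while the chosen `discount_rate` of 0.5 is mine.

```python
SIZE, TERMINAL = 4, 15
MOVES = ["N", "S", "E", "W"]

def state_prime(s, move):
    row, col = divmod(s, SIZE)
    row += {"N": -1, "S": 1}.get(move, 0)
    col += {"W": -1, "E": 1}.get(move, 0)
    return row * SIZE + col if 0 <= row < SIZE and 0 <= col < SIZE else s

def policy_evaluation(policy, theta=0.01, discount_rate=0.5):
    # 1. initialize V_s; terminal state must stay 0
    V_s = {s: 0.0 for s in range(SIZE * SIZE)}
    delta = theta + 1          # start higher than theta so the loop runs
    while delta > theta:       # repeat until the values stop changing much
        delta = 0.0
        for s in range(SIZE * SIZE):
            if s == TERMINAL:
                continue       # keep the terminal value at 0
            v = V_s[s]         # remember the old value
            total = 0.0
            for a in MOVES:
                # deterministic moves, reward always -1, so the inner
                # sum over (s', r) collapses to a single term
                action_total = 1.0 * (-1 + discount_rate * V_s[state_prime(s, a)])
                total += policy[s][a] * action_total
            V_s[s] = total     # new estimated value for the current state
            delta = max(delta, abs(v - V_s[s]))  # track the biggest change
    return V_s

random_policy = {s: {m: 0.25 for m in MOVES} for s in range(SIZE * SIZE)}
V_s = policy_evaluation(random_policy)
```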
The algorithm managed to create the optimal solution after 2 iterations; more is just value tuning. The agent averages around 3 steps per solution. That's quite an improvement over the random policy!
Value iteration

Value iteration is quite similar to policy evaluation, but the approach is different. First of all, we don't judge the policy; instead, we create perfect values. Let's tackle the code:
Steps #1 - #6 and #9 - #10 are the same as #2 - #7 and #10 - #11 in the previous section. The only difference is that we don't have to create `V_s` from scratch, as it's passed as a parameter to the function. The `for` loop iterates through all states except the terminal states. The reason is that we don't want to mess with the terminal states, which have a value of `0` – we don't have any other way (like a positive reward) to make these states distinguishable.
- Start of the summation. Here we calculate the value of each `action` without summing them together like in the previous point. We have to store all the values.
- Assignment of the value for the `state`. Here the value of the `state` is the value of the maximum action.
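A sketch of value iteration under the same assumptions as before (4x4 board, terminal state `15`, deterministic moves with reward -1, `discount_rate` of 0.5 chosen by me): the same sweep as policy evaluation, but we store the value of each action and keep the maximum.

```python
SIZE, TERMINAL = 4, 15
MOVES = ["N", "S", "E", "W"]

def state_prime(s, move):
    row, col = divmod(s, SIZE)
    row += {"N": -1, "S": 1}.get(move, 0)
    col += {"W": -1, "E": 1}.get(move, 0)
    return row * SIZE + col if 0 <= row < SIZE and 0 <= col < SIZE else s

def value_iteration(V_s, theta=0.01, discount_rate=0.5):
    # V_s is passed in as a parameter instead of being created from scratch
    delta = theta + 1
    while delta > theta:
        delta = 0.0
        for s in range(SIZE * SIZE):
            if s == TERMINAL:
                continue  # the terminal value must stay 0
            v = V_s[s]
            # value of each action, stored instead of summed together
            action_values = [-1 + discount_rate * V_s[state_prime(s, a)]
                             for a in MOVES]
            V_s[s] = max(action_values)  # state value = maximum action value
            delta = max(delta, abs(v - V_s[s]))
    return V_s

V_s = value_iteration({s: 0.0 for s in range(SIZE * SIZE)})
```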
I won't show you the test runs of the algorithm, as they are the same as for the policy evaluation one. Dynamic Programming methods are guaranteed to find an optimal solution if we have the computing power and the model.
Dynamic Programming is a cool area with an even cooler name. It shows how Reinforcement Learning would look if we had superpowers like unlimited computing power and a full understanding of each problem as a Markov Decision Process. I found it a nice way to boost my understanding of various parts of MDPs, as the last post was a mainly theoretical one. I hope you enjoyed it. Coming up next is the Monte Carlo method. 👋
Notebook with complete code