## What is dynamic programming?

Behind this strange and mysterious name hides pretty straightforward concept. Dynamic programming or DP, in short, is a collection of methods used calculate the optimal policies – solve the Bellman equations. Before you get any more hyped up there are severe limitations to it which makes DP use very limited. Here are main ones:

**It needs perfect environment model**in form of the Markov Decision Process – that’s a hard one to comply. It’s fine for the simpler problems but try to model game of chess with a description of all states. Have fun π**Computationally expensive**– as I mentioned before solving Bellman equations is a brute force job. Try to do it for a chess game. You will need few orders of magnitudes more computing power then all humanity already has.

So why even bothering checking out the dynamic programming? π€ Well, it’s an important step to understand methods which comes later in a book. And the dynamic programming provides us with the optimal solutions. Other Reinforcement Learning methods try to do pretty much the same. Only with fewer resources and the imperfect environment model.

## Testbed

Before we jump into the theory and code let’s see what “game” we will try to beat this time. Here is the board:

The game I coded to be exactly the same as the one in the book. The code to print the board and all other accompanying functions you can find in the notebook I prepared.

**X**is the terminal state, where our game ends. It’s the place we want to be as fast as possible.**A**is the agent position.

The agent starts in a random state which is not a terminal state. Every step it needs to take has a reward of `-1`

to optimize the number of moves needed to reach the finish line. The agent can move in any direction (north, south, east, west). If the move would take the agent out of the board it stays on the same field (`s' == s`

).

Let’s try it out! π

## Random policy

To debug the board, agent code and to benchmark it, later on, I tested agent out with random policy. Which means that on every move it has a 25% of going in any direction. An agent with such policy it’s pretty much clueless. Here is the code for it:

`policy`

is a dictionary containing the states as the keys and move chances as a value in form of another dictionary.`N`

,`S`

,`E`

,`W`

moves directions and the float number is a chance of performing it by the agent.`state_to_state_prime_map`

is a map similar to the`policy`

but as a value, we have a dictionary with move directions and resulting states (`s'`

) after performing that moves.

The dictionaries look like that:

What the agent function does is until the terminal state is reached (`0`

or `15`

) it creates random float between 0 and 1. Then compares it against current state policy to decide on move and checks which is being’` for that action.

Let’s see how an agent performs with the random policy:

An average number of steps an agent with random policy needs to take to complete the task in `19.843`

. Pretty bad, right?

## Back to the Finite Markov Decision Process

We need to get back for a while to the finite-MDP. From this moment it will be always with us when solving the Reinforcement Learning problems. Finite-MDP means we can describe it with a probabilities `p(s', r | s, a)`

. Quick reminder:

`p`

– probability`s'`

– resulting state`r`

– reward`s`

– current state`a`

– action

In plain English `p(s', r | s, a)`

means: *probability of being in resulting state with the reward given current state and action*. The set is exhaustive that means it contains all possibilities even those not allowed by our game. For our simple problem, it contains `1024`

values and our reward is always `-1`

! Tell me about the brute force algorithms. π I decided to include this section as this term will appear often in Reinforcement Learning.

## Iterative policy evaluation

This is the first method I am going to describe. Both of theme will use the iterative approach. We will solve Bellman equations by iterating over and over.

Two hyperparameters here are `theta`

and `discount_rate`

. Theta is a parameter controlling a degree of approximation (smaller is more precise). Discount rate I described [last time](before and it diminishes a reward received in future.

Let me go with you step by step here.

- Initialization of
`V_s`

dictionary. It contains the value of each state and we will use it to create better policy which will choose next state based on maximum value. Remember to set terminal states values to`0`

, rest is up to you. - Creation of probability map described in the previous section. You can use a global variable or anything. It doesn’t change so you don’t have to create fresh each time.
- Initialization of
`delta`

parameter to some high value (higher than`theta`

). `delta`

goes to`0`

for later use.`while`

loop. We will repeat everything until the`delta`

will be smaller than`theta`

.`for`

loop iterating through all states.- Value assignment of the current state to local variable
`v`

for later. - Start of summation. The heart of the algorithm is here. There are 2 sums here hence 2 additional
`for`

loops.`action_total`

is a sum of probabilities for each action while`total`

is a sum of all action sums. You can find math in a Reinforcement Learning: An Introduction (Book site | Amazon). - Assignment of
`total`

which is a new estimated value for current`state`

to`V_s`

dictionary containing all values. - New
`delta`

which can stop the`while`

loop. - Return of
`V_s`

dictionary to create new policy later on.

The algorithm managed to create optimal solution after 2 iterations. More is just a value tuning. It averages around `3`

steps per solution. That’s quite an improvement from the random policy!

## Value iteration

Value iteration is quite similar to the policy evaluation one. But the approach is different. First of all, we don’t judge the policy instead we create perfect values. Let’s tackle the code:

Points `#1 - #6`

and `#9 - #10`

are the same as `#2 - #7`

and `#10 - #11`

in previous section. The only difference is that we don’t have to create the `V_s`

from scratch as it’s passed as a parameter to the function. The `for`

loop iterates through all states except the terminal states. The reason is that we don’t want to mess with terminal states having a value of `0`

. We don’t have any other way (like a positive reward) to make this states distinguished.

- Start of summation. Here we calculate values for each
`action`

without summing it together like in the previous point. We have to store all values. - Assignment of value for
`state`

. Here the value of the`state`

is a value of maximum action.

I won’s show you the test runs of the algorithm as it’s the same as the policy evaluation one. Dynamic Programming methods are guaranteed to find an optimal solution if we managed to have the power and the model.

## Conclusion

The Dynamic Programming is a cool area with an even cooler name. It shows how Reinforcement Learning would look if we had superpowers like unlimited computing power and full understanding of each problem as Markov Decision Process. I found it a nice way to boost my understanding of various parts of MDP as the last post was mainly theoretical one. I hope you enjoyed. Coming up next is a Monte Carlo method. π

Notebook with complete code

Reinforcement Learning: An Introduction (Book site | Amazon)

Finite Markov Decision Process

**K-armed bandit series**: