So here I am, after quite a long delay, with another post. A lot has happened during that time, as I came back from our half-year South East Asia journey. I resumed my work at Tooploox, but it takes a while to readjust to living back home. Anyway, here comes some more Reinforcement Learning for your enjoyment.

Temporal-Difference Learning (or TD Learning) is quite an important and novel idea. It's the first time you can really see some patterns emerging, with everything building upon the previous chapters.

## DP + MC = TD

The pseudo-equation above is not scientific at all, but it gives an idea of what Temporal-Difference Learning is: basically some ideas from Monte Carlo mixed with some from Dynamic Programming.

**Monte Carlo** – because TD learns from experience, without a model of any kind

**Dynamic Programming** – as TD doesn’t wait for episode completion. TD *bootstraps*

It seems that these three approaches are the foundation of Reinforcement Learning. They will appear in different combinations over and over from now on.

We will evaluate everything on the Frozen Lake game from the gym package. If you don't know about gym, see my previous post.

## Sarsa

Our first TD Learning algorithm will be `Sarsa`. As usual, it follows generalized policy iteration and, like TD Learning in general, it bootstraps. It updates the policy, or rather a value dictionary of state-action pairs on which we base our policy. The name SARSA comes from the quantities the algorithm uses on each step:

**S**tate -> **A**ction -> **R**eward -> **S**tate’ -> **A**ction’

…and so on.

Here is the code. Again, you will find all utility functions in the repo.

As always, I created the algorithm as a function. The parameters will be the same for all TD Learning algorithms in this post: `env` is the gym environment, which contains the game data, rules and playing functionality; `episodes` is just the number of times we allow the algorithm to play through and learn; `step_size` defines how fast it will absorb new data (and forget the old); finally, `exploration_rate` is an incentive to explore options other than the greedy one.

1. First we have to create the value dictionary (`Q`) on which we will work. It's actually a dictionary of dictionaries where the keys are the states; the internal maps contain the actual values with actions as keys. So it's formatted the following way: `{ S1: { A1: 0.0, A2: 0.0, A3: 0.0, A4: 0.0 }, S2: { ... }, ... }`. All values are initialised to `0.0` to save the hassle, as terminal states have to be all zeros and the rest can be arbitrary values. Plus, we might not know which states are terminal.
2. A `for` loop through the desired number of episodes. Nothing special.
3. Get the first state of the environment for the current episode.
4. Choose an action based on the greedy policy created from the current value dictionary. It's just the maximum-valued action for the given state. If this value is identical for many actions, a random one of the maximum actions is taken.
5. A `while` loop until the episode is finished, i.e. a terminal state is reached.
6. The step is taken inside the loop. The gym function returns a tuple: `(S_prime, reward, finished, info)`. We only need the first 3 of those.
7. Based on `S_prime` we take another greedy action.
8. Here is the crux of the `Sarsa` approach. We update the value dictionary, and so learn, at the same time, unlike Monte Carlo, where the actual learning happens at the end of the episode. The value for the current state-action pair is updated using the reward and the value of the **resulting** state-action pair. That's the major difference between `Sarsa` and the other TD Learning algorithms. Here we also use the `step_size` and `exploration_rate` hyperparameters.
9. Update the initial state and action for the next step (they won't be needed once a terminal state is reached).
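The Sarsa procedure described above can be sketched roughly as follows. This is my own minimal sketch, not the repo code; I assume `env` follows the classic gym interface, with `reset()` returning a state and `step(action)` returning the `(S_prime, reward, finished, info)` tuple, and for simplicity there is no discount factor:

```python
import random
from collections import defaultdict

def sarsa(env, episodes, step_size=0.5, exploration_rate=0.1):
    # 1. Q: dict of dicts, every state-action value starts at 0.0,
    #    so terminal states are automatically all zeros
    Q = defaultdict(lambda: {a: 0.0 for a in range(env.action_space.n)})

    def epsilon_greedy(state):
        # explore with probability `exploration_rate`, otherwise pick
        # a maximum-valued action (random choice among ties)
        if random.random() < exploration_rate:
            return random.randrange(env.action_space.n)
        best = max(Q[state].values())
        return random.choice([a for a, v in Q[state].items() if v == best])

    for _ in range(episodes):                         # 2. episode loop
        S = env.reset()                               # 3. first state
        A = epsilon_greedy(S)                         # 4. first action
        finished = False
        while not finished:                           # 5. until terminal state
            S_prime, reward, finished, _ = env.step(A)  # 6. take the step
            A_prime = epsilon_greedy(S_prime)         # 7. next action
            # 8. Sarsa update: learn from the *actual* next pair (S', A')
            Q[S][A] += step_size * (reward + Q[S_prime][A_prime] - Q[S][A])
            S, A = S_prime, A_prime                   # 9. move on
    return Q
```

The update in step 8 is what makes Sarsa *on-policy*: it bootstraps from the action it will really take next, exploration included.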

## Q-Learning

Q-Learning was one of the early breakthroughs in Reinforcement Learning. It is **very** similar to `Sarsa`; the only major difference is the update rule. I like that when `Q-Learning` finds a good way through the Frozen Lake, it sticks to it. I know this might be an issue in problems with some randomness, but here it went directly from `0.0` accuracy to `1.0`. Very fun to watch.

1. Same as `Sarsa` 1.
2. Same as `Sarsa` 2.
3. Same as `Sarsa` 3.
4. Same as `Sarsa` 5.
5. Same as `Sarsa` 4. or 7. As we have to take one action per step, I decided to put it inside the `while` loop rather than outside.
6. Same as `Sarsa` 6.
7. The update rule. The principles are identical, but **instead** of looking at the **actual** next action, we use the **maximum-valued** action for `S_prime`.
8. Same as `Sarsa` 9. Only the state.
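Since the whole difference from `Sarsa` sits in the update rule, here is a hedged sketch of just that step (the helper name is mine, not from the repo):

```python
def q_learning_update(Q, S, A, reward, S_prime, step_size):
    # Q-Learning bootstraps from the maximum-valued action in S_prime,
    # not from the action actually taken next (off-policy vs on-policy)
    target = reward + max(Q[S_prime].values())
    Q[S][A] += step_size * (target - Q[S][A])
```

Compare with Sarsa's target of `reward + Q[S_prime][A_prime]`: here the exploration behaviour no longer leaks into the learned values.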

## Double Q-Learning

Double learning is actually a small modification applicable to Q-Learning or any other Reinforcement Learning algorithm. It is used to avoid the **maximization bias**. In Temporal-Difference Learning we don't learn after the episode is finished, so we don't know the resulting reward initially, and the math is set up in such a way that the algorithms prefer a longer road to failure over failing quickly and moving on. That's maximization bias in a nutshell. The "double" part is just using 2 value dictionaries with a little dance between them. Having two totally independent dictionaries lets us avoid being biased this way. The resulting value is just the sum of `Q_1` and `Q_2`.

Here is the code:

1. Same as `Q-Learning` 1, but we need to initialize 2 dictionaries: `Q_1` and `Q_2`.
2. Same as `Q-Learning` 2.
3. Same as `Q-Learning` 3.
4. Same as `Q-Learning` 4.
5. Add `Q_1` and `Q_2` together to create the greedy policy.
6. Same as `Q-Learning` 5.
7. Same as `Q-Learning` 6.
8. 50% chance to follow one update rule or the other.
9. Same as `Q-Learning` 7, but instead of evaluating the maximum action in `S_prime` with the currently updated dictionary, we use the other one. So `Q_1` is updated with `Q_2` and the other way around.
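Steps 8 and 9 combined might look roughly like this. Again, this is my own sketch under the assumptions above, not the repo code:

```python
import random

def double_q_update(Q_1, Q_2, S, A, reward, S_prime, step_size):
    # step 8: flip a coin to decide which dictionary gets updated
    if random.random() < 0.5:
        Q_a, Q_b = Q_1, Q_2
    else:
        Q_a, Q_b = Q_2, Q_1
    # step 9: pick the maximizing action with the updated dictionary,
    # but *evaluate* it with the other one - that's the anti-bias dance
    best = max(Q_a[S_prime], key=Q_a[S_prime].get)
    Q_a[S][A] += step_size * (reward + Q_b[S_prime][best] - Q_a[S][A])
```

For action selection (step 5), the behaviour policy would be greedy over `Q_1[S][a] + Q_2[S][a]` rather than over either dictionary alone.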

## Results

All algorithms managed to learn both the 4 x 4 and the 8 x 8 Frozen Lake in the non-slippery version. `Q-Learning` was faster than `Sarsa` and `Double Q-Learning`, as it learned the 4 x 4 version in just `40` episodes. That's way faster than Monte Carlo. I failed to obtain any predictable results on the slippery version of Frozen Lake. It's a totally unscientific conclusion, but as the slippery version is very random, the algorithm quickly forgets what it learned before. It might be worth trying a very small step size over a large number of episodes. With a `0.01` step size and around `100 000` episodes, it managed in a small percentage of tests to reach around `0.5` accuracy. `0.2`–`0.1` accuracy was more common, but most of the time it was lower than `0.1`.

This chapter was way more interesting than the previous one. The book is doing a great job building on foundations, and I find it really nice how every chapter adds something on top of the previous ones. With Temporal-Difference Learning we are past the basics of Reinforcement Learning. From now on we will learn fewer new methods, but more ways to improve them significantly and combine different ideas. Monte Carlo TD Learning, anyone?

### Btw the 2nd edition of RL book is available to pre-order now on Amazon!

Links:

– Notebook with complete code

– Reinforcement Learning: An Introduction (Book site | Amazon)

– OpenAI Gym

– Frozen Lake

– Hub for all my reinforcement learning posts