Temporal Difference Learning in Python

Temporal-Difference Learning (or TD Learning) is quite an important and novel idea. It's the first place where you can really see some patterns emerging, with everything building upon previous knowledge. Hop in for some theory and Python code.

So here I am, after quite a long delay, with another post. A lot has happened during that time: I came back from our half-year South East Asia journey and resumed my work at Tooploox, but it takes a while to readjust to living back home. Anyway, here comes some more Reinforcement Learning for your enjoyment.


DP + MC = TD

The pseudo-equation above is not scientific at all, but it gives an idea of what Temporal-Difference Learning is: it combines some ideas from Monte Carlo methods with some from Dynamic Programming.

Monte Carlo – because TD learns directly from experience, without a model of any kind.

Dynamic Programming – because TD doesn't wait for episode completion; it bootstraps, updating estimates based on other estimates.

It seems that these three ideas are the foundation of Reinforcement Learning. They will appear in different combinations over and over from now on.

We will evaluate everything on the Frozen Lake game from the gym package. If you don't know about gym, see my previous post.


Sarsa

Our first TD Learning algorithm will be Sarsa. As usual, it follows generalized policy iteration and, like TD Learning in general, it bootstraps. It updates the value dictionary of state-action pairs on which we base our policy. The name SARSA comes from the quintuple the algorithm uses on each step:

State -> Action -> Reward -> State’ -> Action’

…and so on.

Here is the code. As usual, you will find all the utility functions in the repo.

As always, I wrote the algorithm as a function. The parameters will be the same for all TD Learning algorithms in this post. env is the gym environment, which contains the game data, rules, and playing functionality. episodes is simply the number of times we allow the algorithm to play through and learn. step_size defines how fast it absorbs new data (and forgets the old). Finally, exploration_rate is an incentive to explore options other than the greedy one.

  1. First we have to create the value (Q) dictionary on which we will work. It's actually a dictionary of dictionaries where the keys are the states. The internal maps contain the actual values, with actions as keys. So it's formatted the following way: { S1 : { A1: 0.0, A2: 0.0, A3: 0.0, A4: 0.0 }, S2: { ... }, ... }. All values are initialised to 0.0 to save the hassle, as terminal states have to be all zeros while the rest can take arbitrary values. Plus, we might not know in advance which states are terminal.
  2. A for loop through the desired number of episodes. Nothing special.
  3. Getting the first state of the environment for the current episode.
  4. We choose an action based on a greedy policy created from the current value dictionary. It's just the maximum-valued action for the given state. If this value is identical for several actions, one of the maximum actions is picked at random.
  5. A while loop that runs until the episode is finished, i.e. a terminal state is reached.
  6. The step is taken inside the loop. The gym function returns a tuple – (S_prime, reward, finished, info). We only need the first 3 of those.
  7. Based on S_prime we choose another greedy action.
  8. Here is the crux of the Sarsa approach. We update the value dictionary and learn at the same time, unlike Monte Carlo, where the actual learning happens at the end of the episode. The value dictionary entry for the current state-action pair is updated using the reward and the value of the resulting state-action pair. That's the major difference between Sarsa and the other TD Learning algorithms. This is also where the step_size hyperparameter comes in (exploration_rate affects the action choice in steps 4 and 7).
  9. Update the state and action for the next step (they won't be needed once a terminal state is reached).

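Since the original snippet didn't survive, here is a minimal sketch of the steps above. The names sarsa and epsilon_greedy are mine, not necessarily the repo's; I assume the classic gym API where env.reset() returns a state and env.step() returns a 4-tuple, and I leave the discount at 1 since Frozen Lake episodes are short and undiscounted here.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, n_actions, exploration_rate):
    # With probability exploration_rate take a random action; otherwise
    # pick (a random one of) the maximum-valued actions for this state.
    if random.random() < exploration_rate:
        return random.randrange(n_actions)
    best = max(Q[state].values())
    return random.choice([a for a, v in Q[state].items() if v == best])

def sarsa(env, episodes, step_size, exploration_rate):
    n_actions = env.action_space.n
    # 1. Q as a dictionary of dictionaries, all values initialised to 0.0.
    Q = defaultdict(lambda: {a: 0.0 for a in range(n_actions)})
    for _ in range(episodes):                                   # 2.
        S = env.reset()                                         # 3.
        A = epsilon_greedy(Q, S, n_actions, exploration_rate)   # 4.
        finished = False
        while not finished:                                     # 5.
            S_prime, reward, finished, _ = env.step(A)          # 6.
            A_prime = epsilon_greedy(Q, S_prime, n_actions,
                                     exploration_rate)          # 7.
            # 8. Learn on the fly from the actual next state-action pair.
            Q[S][A] += step_size * (reward + Q[S_prime][A_prime] - Q[S][A])
            S, A = S_prime, A_prime                             # 9.
    return Q
```

Terminal states never get updated, so their entries stay at 0.0 and the bootstrap term vanishes there automatically.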

Q-Learning

Q-Learning was one of the early breakthroughs in Reinforcement Learning. It's very similar to Sarsa; the only major difference is the update rule. I like that when Q-Learning finds a good way through the Frozen Lake, it sticks to it. I know this might be a problem in tasks with some randomness, but here it went directly from 0.0 accuracy to 1.0. Very fun to watch.

  1. Same as Sarsa 1.
  2. Same as Sarsa 2.
  3. Same as Sarsa 3.
  4. Same as Sarsa 5.
  5. Same as Sarsa 4 and 7. As we take one action per step, I decided to put it inside the while loop rather than outside.
  6. Same as Sarsa 6.
  7. The update rule. The principle is the same, but instead of looking at the actual next action we use the maximum-valued action for S_prime.
  8. Same as Sarsa 9, but only the state is updated.
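A hedged sketch matching the list above, under the same assumptions and with the same hypothetical helper names as the Sarsa example (classic gym API, discount left at 1):

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, n_actions, exploration_rate):
    # Random action with probability exploration_rate, otherwise a
    # random maximum-valued action for the state.
    if random.random() < exploration_rate:
        return random.randrange(n_actions)
    best = max(Q[state].values())
    return random.choice([a for a, v in Q[state].items() if v == best])

def q_learning(env, episodes, step_size, exploration_rate):
    n_actions = env.action_space.n
    Q = defaultdict(lambda: {a: 0.0 for a in range(n_actions)})     # 1.
    for _ in range(episodes):                                       # 2.
        S = env.reset()                                             # 3.
        finished = False
        while not finished:                                         # 4.
            A = epsilon_greedy(Q, S, n_actions, exploration_rate)   # 5.
            S_prime, reward, finished, _ = env.step(A)              # 6.
            # 7. Off-policy update: bootstrap from the best action in
            # S_prime, not the action that will actually be taken next.
            Q[S][A] += step_size * (reward
                                    + max(Q[S_prime].values())
                                    - Q[S][A])
            S = S_prime                                             # 8.
    return Q
```

The only behavioural change versus Sarsa is that one max() in the update, which is what makes Q-Learning off-policy.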

Double Q-Learning

Double learning is actually a small modification that can be applied to Q-Learning or other Reinforcement Learning algorithms. It is used to avoid maximization bias. In Temporal-Difference Learning we don't wait until the episode is finished, so we don't know the resulting reward initially. The math is set up in such a way that the algorithms will prefer a longer road to failure over failing quickly and moving on. That's maximization bias in a nutshell. The "double" part just means using 2 value dictionaries with a little dance between them. Having two totally independent dictionaries lets us avoid being biased in this way. The resulting value is simply the sum of Q_1 and Q_2.

Here is the code:

  1. Same as Q-Learning 1, but we need to initialise 2 dictionaries, Q_1 and Q_2.
  2. Same as Q-Learning 2.
  3. Same as Q-Learning 3.
  4. Same as Q-Learning 4.
  5. Adding Q_1 and Q_2 to create the greedy policy.
  6. Same as Q-Learning 5.
  7. Same as Q-Learning 6.
  8. A 50% chance to follow one update rule or the other.
  9. Same as Q-Learning 7, but the maximum action in S_prime is picked with the dictionary being updated, while its value is taken from the other one. So Q_1 is updated using values from Q_2, and the other way around.
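The dance between the two dictionaries can be sketched like this (same assumptions as the earlier examples; the function name and structure are mine):

```python
import random
from collections import defaultdict

def double_q_learning(env, episodes, step_size, exploration_rate):
    n_actions = env.action_space.n
    # 1. Two independent value dictionaries.
    Q_1 = defaultdict(lambda: {a: 0.0 for a in range(n_actions)})
    Q_2 = defaultdict(lambda: {a: 0.0 for a in range(n_actions)})
    for _ in range(episodes):                                    # 2.
        S = env.reset()                                          # 3.
        finished = False
        while not finished:                                      # 4.
            # 5./6. Epsilon-greedy policy built on the sum Q_1 + Q_2.
            if random.random() < exploration_rate:
                A = random.randrange(n_actions)
            else:
                sums = [Q_1[S][a] + Q_2[S][a] for a in range(n_actions)]
                best = max(sums)
                A = random.choice([a for a, v in enumerate(sums) if v == best])
            S_prime, reward, finished, _ = env.step(A)           # 7.
            if random.random() < 0.5:                            # 8.
                # 9. Pick the best action with Q_1, evaluate it with Q_2.
                best_a = max(Q_1[S_prime], key=Q_1[S_prime].get)
                Q_1[S][A] += step_size * (reward
                                          + Q_2[S_prime][best_a]
                                          - Q_1[S][A])
            else:
                # ...and the mirror image for Q_2.
                best_a = max(Q_2[S_prime], key=Q_2[S_prime].get)
                Q_2[S][A] += step_size * (reward
                                          + Q_1[S_prime][best_a]
                                          - Q_2[S][A])
            S = S_prime
    return Q_1, Q_2
```

Decoupling action selection from action evaluation is exactly what removes the maximization bias: a dictionary never gets to grade its own homework.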



All algorithms managed to learn both the 4 x 4 and 8 x 8 Frozen Lake in the non-slippery version. Q-Learning was faster than Sarsa and Double Q-Learning, as it learned the 4 x 4 version in just 40 episodes. That is way faster than Monte Carlo. I failed to obtain any predictable results on the slippery version of Frozen Lake. It's a totally unscientific conclusion, but as the slippery version is very random, the algorithm quickly forgets what it learned before. It might be worth trying a very small step size over a large number of episodes. With a 0.01 step size and around 100 000 episodes it managed, in a small percentage of runs, to reach around 0.5 accuracy. Accuracy of 0.1–0.2 was more common, but most of the time it was lower than 0.1.

This chapter was way more interesting than the previous one. The book does a great job of building on foundations, and I really like how every chapter adds something on top of the previous one. With Temporal-Difference Learning we are past the basics of Reinforcement Learning. From now on we will learn fewer new methods and more ways to improve them significantly and combine different ideas. Monte Carlo TD Learning, anyone?

Btw the 2nd edition of RL book is available to pre-order now on Amazon!

– Notebook with complete code
– Reinforcement Learning: An Introduction (Book site | Amazon)
– OpenAI Gym
– Frozen Lake
– Hub for all my reinforcement learning posts
