In this post, we will explore our first reinforcement learning methods for estimating value. It’s the first taste of real RL in this series. I bet you’ve heard the term *Monte Carlo method* before.

Monte Carlo methods are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results.

–Wikipedia

In other words, it’s all about using randomness to solve a problem. The approach proved very useful in mathematics and physics, especially where other approaches are infeasible. The modern version of the Monte Carlo method was invented in the late 1940s by Stanislaw Ulam while he was working on nuclear weapons at Los Alamos. Soon after, the famous John von Neumann saw its potential and programmed ENIAC to carry out the calculations. As every super-secret project needs a codename, they settled on *Monte Carlo*: Ulam’s uncle was losing tons of money in the Monte Carlo Casino in Monaco, hence the name.

Although crude, the Monte Carlo method was crucial to the Manhattan Project, and in the 1950s it was used in the early development of the hydrogen bomb. Since then it has become an established method for solving real-world problems in mathematics, physics, engineering, finance and, of course, **artificial intelligence**.
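
As a toy illustration of the idea (my example, not one from the Manhattan Project): we can estimate π by sampling random points in the unit square and counting how many land inside the quarter circle:

```python
import random

def estimate_pi(samples):
    # the quarter unit circle has area pi/4, so pi ~= 4 * (hits / samples)
    hits = 0
    for _ in range(samples):
        x, y = random.random(), random.random()
        hits += x * x + y * y <= 1.0
    return 4 * hits / samples
```

With enough samples the estimate converges on π. That is the whole trick: randomness plus averaging yields a numerical answer.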

# Testbed

I will describe and run everything on the Frozen Lake game. The first version of this post was based on blackjack, for which I wrote my own game. To keep the focus on the reinforcement learning part, I decided to throw it away, which finally let me move to an OpenAI Gym environment. What’s the game about? Here is the quote from the OpenAI website.

The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water. Additionally, the movement direction is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile.

The grid consists of the following tiles:

- **S** – starting point, safe
- **F** – frozen tile, safe
- **H** – hole, not safe at all
- **G** – goal

I will check my algorithm on the 4×4 and 8×8 versions. I will start with non-slippery **F** tiles and then try out the slippery ones.

# OpenAI Gym

This is not a tutorial on OpenAI Gym, but I will include some basics so it is easier to follow along. So let’s create a Gym environment. The environment is everything we need to run and have fun with our reinforcement learning algorithms.

```
>>> import gym
>>> env = gym.make('FrozenLake-v0')
```

Let’s see some parameters of our environment:

```
>>> env.observation_space
Discrete(16)
>>> env.action_space
Discrete(4)
```

The `observation_space` parameter is, in other words, the range of possible states. `Discrete` is a data object from Gym. It is 0-indexed, so it contains values from 0 to 15. Same for the `action_space`. So our 4×4 Frozen Lake game has 16 possible states and 4 possible actions. To get the upper bound to use in our code, there is the property `n`.

```
>>> env.observation_space.n
16
```

To create environments with custom parameters we can use the `register` function. The Gym documentation is not very broad, so you have to look at the code to find the available parameters. That way we can create a simpler version of the game to quickly test our algorithm with less complexity.

```
>>> from gym.envs.registration import register
>>> register(
        id='FrozenLakeNotSlippery-v0',
        entry_point='gym.envs.toy_text:FrozenLakeEnv',
        kwargs={'map_name': '4x4', 'is_slippery': False},
    )
>>> env = gym.make('FrozenLakeNotSlippery-v0')
```

Finally, you can run everything like this:

```
env.reset()
done = False
while not done:
    env.render()
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)
```

Here is what it looks like (for the 8×8 version):

# Monte Carlo time

Of course (as always) you can find all the needed utilities in the notebook. I tried to make every function do exactly what it looks like it should do; I hope I succeeded to some degree. I also decided against a maximum-performance `numpy` approach. I am learning as I write, and at this point clarity and understanding are more valuable than performance. Besides, performance-wise there are already some cutting-edge implementations out there.

The algorithm I chose from the book is **On-policy first-visit MC control**. Let me explain what all those terms mean.

- **On-policy** – the algorithm improves the very policy it uses to make decisions. The opposite is *off-policy*.
- **first-visit** – the algorithm averages returns only for the first visit to a given state–action pair in each episode. The opposite is *every-visit*.
- **MC control** – Monte Carlo control. This type of algorithm first estimates values for states, then builds a greedy (or near-greedy) policy on them, just to use the new policy to create another estimate. It loops forever (or rather until reaching a certain number of episodes).

Here is the algorithm tested on:

- Frozen Lake 4×4 not slippery
- Frozen Lake 8×8 not slippery
- Frozen Lake 4×4 slippery
- Frozen Lake 8×8 slippery

Slippery ones are definitely harder to learn (or rather, to estimate).

Function parameters:

- `env` – the Gym environment. I decided on the convention of passing it as a parameter to every function that needs one.
- `episodes` – the number of episodes we will base our estimates on; an episode is generated play-through data we can learn from.
- `policy` – we can pass a policy to improve an existing one instead of generating a new random one.
- `epsilon` – the exploration rate: the chance to take a random action instead of the greedy one.

Step by step:

- If no policy is passed, create a random one. I keep the policy in the form of a dictionary where the keys are states and the values are another map of action – action-probability pairs.

```
>>> policy = {
1: {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25},
2: {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25},
...
15: {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
}
```

- Create an empty value-estimates dictionary `Q`. Same format as the policy, but instead of probabilities we keep value estimates inside, all starting at `0.0`.
- Create an empty dictionary containing all encountered returns for each state–action pair. The keys are tuples `(state, action)` and the values are arrays of all returns following the given key.
- Good old `for` loop over the number of episodes.
- Initialize `G` to store the cumulative reward for the current iteration.
- Create an episode. I represent an episode as a t-length array of 3-value arrays: `episode[t][0]` contains the **state**, `episode[t][1]` the **action** and `episode[t][2]` the **reward**.
- `for` loop through the reversed indices of the `episode` array. The logic behind reversing is that the eventual reward sits at the end, so we have to go back from the last timestep to the first, propagating the result from the future.
- Extract the needed variables for verbosity: `s_t` is the state, `a_t` the action and `r_t` the reward, all at the given timestep. The `state_action` variable in the line below is only for convenient indexing of the dictionaries.
- Increment the total reward `G` by the reward at the current timestep. In practice, a reward different than `0` is only possible at the last timestep.
- We don’t want to update the same pair twice, so we update it only when it is not present at previous timesteps.
- Update the `returns` dictionary with the cumulative reward. If there is nothing under a key, we create a new array; otherwise we just append `G`.
- Compute the value of the current state–action pair: it’s simply the average of the returns gained over multiple episodes.
- My implementation of `argmax(Q[s_t])`, which picks a maximum action at random when several actions share the same value estimate.
- Save the `A*` (or `A_star`) variable to set up the proper probabilities in the policy dictionary.
- Update the action probabilities for `s_t` in the policy. Here we create a near-greedy policy: the `epsilon` parameter decides how much probability we give to the other actions for exploration purposes.
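
The full implementation lives in the notebook; here is a rough, self-contained sketch of the steps above (helper names like `random_policy` and `generate_episode` are my own, and I use the classic 4-tuple `env.step` API):

```python
import random
from collections import defaultdict

def random_policy(env):
    # uniform action probabilities for every state (my helper, not from the notebook)
    n = env.action_space.n
    return {s: {a: 1.0 / n for a in range(n)}
            for s in range(env.observation_space.n)}

def generate_episode(env, policy):
    # play one episode, recording (state, action, reward) triples
    episode, state, done = [], env.reset(), False
    while not done:
        actions = list(policy[state])
        weights = [policy[state][a] for a in actions]
        action = random.choices(actions, weights=weights)[0]
        next_state, reward, done, _ = env.step(action)
        episode.append((state, action, reward))
        state = next_state
    return episode

def monte_carlo_e_soft(env, episodes=100, policy=None, epsilon=0.01):
    if policy is None:
        policy = random_policy(env)
    # Q starts at 0.0 for every state-action pair
    Q = {s: {a: 0.0 for a in range(env.action_space.n)}
         for s in range(env.observation_space.n)}
    returns = defaultdict(list)  # (state, action) -> list of returns
    for _ in range(episodes):
        G = 0
        episode = generate_episode(env, policy)
        visited = [(s, a) for s, a, _ in episode]
        for t in reversed(range(len(episode))):
            s_t, a_t, r_t = episode[t]
            G += r_t
            # first-visit check: skip if the pair appears earlier in the episode
            if (s_t, a_t) in visited[:t]:
                continue
            returns[(s_t, a_t)].append(G)
            Q[s_t][a_t] = sum(returns[(s_t, a_t)]) / len(returns[(s_t, a_t)])
            # argmax with random tie-breaking
            max_q = max(Q[s_t].values())
            a_star = random.choice([a for a, v in Q[s_t].items() if v == max_q])
            # near-greedy update: epsilon spread over all actions
            n = len(policy[s_t])
            for a in policy[s_t]:
                policy[s_t][a] = 1 - epsilon + epsilon / n if a == a_star else epsilon / n
    return policy
```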

So here it is. What makes this algorithm a *Monte Carlo* algorithm? Mainly the generation of episodes: we can do it endlessly, and by doing so create better and better models of the environment.

Let’s see how it works and performs.

# Results

#### 4×4 not slippery

```
>>> env = create_environment(slippery=False, big=False)
>>> policy = monte_carlo_e_soft(env, episodes=200)
>>> test_policy(policy, env)
0.96
```

Pretty decent performance here. We can blame the 0.04 loss on the epsilon parameter. What we should probably do is create a greedy policy out of the near-greedy one.
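
For example, a small helper (my own, not from the notebook) could collapse the near-greedy policy into a fully greedy one before testing:

```python
def greedify(policy):
    # put all probability mass on the highest-probability action in each state
    greedy = {}
    for state, actions in policy.items():
        best = max(actions, key=actions.get)
        greedy[state] = {a: 1.0 if a == best else 0.0 for a in actions}
    return greedy
```

Testing `greedify(policy)` instead of `policy` removes the epsilon-driven random moves, which only matter during learning.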

#### 8×8 not slippery

```
>>> env = create_environment(slippery=False, big=True)
>>> policy = monte_carlo_e_soft(env, episodes=10000)
>>> test_policy(policy, env)
0.92
```

Same as 4×4. It does not always find an optimal solution after a small number of episodes.

#### 4×4 slippery

```
>>> env = create_environment(slippery=True, big=False)
>>> policy = monte_carlo_e_soft(env, episodes=50000)
>>> test_policy(policy, env)
0.46
```

Here it gets more interesting. As intended, more episodes give (more or less) better performance. But after 50000 episodes it stops being fun on my machine, and the problem is so simple that I feel I shouldn’t let it run for 10 minutes. As we can see, the low performance is not always caused by falling into `H` fields but also by exceeding the time limit. I included a successful play-through; it runs for a loooong time before finishing.

#### 8×8 slippery

```
>>> env = create_environment(slippery=True, big=True)
>>> policy = monte_carlo_e_soft(env, episodes=50000)
>>> test_policy(policy, env)
0.55
```

Same as the 4×4 slippery version: although the completion rate is higher, it couldn’t learn to solve the game in a reasonable time. The problem is that Monte Carlo is a random algorithm and the slippery moves are random too, so it’s hard to generate enough episodes to cover all the possibilities within the given episode budget.

# Improvements

Can we do better? Well, although it has not been proven mathematically, the Monte Carlo method should eventually find the optimal solution. But as the episodes are random, you can get 100 episodes that perfectly describe the problem, or 1000 useless ones. I wrote a little helper function that iterates Monte Carlo over a smaller number of episodes and replaces the policy only when the new one is an improvement. It’s a way to cut the dead ends and speed up the learning.

I ran it on both the 4×4 and 8×8 slippery Frozen Lake games:

#### 4×4 slippery

```
>>> env = create_environment(slippery=True, big=False)
>>> policy, score = policy_iterator(env, 50, 10000, epsilon=0.01)
>>> score
0.7
```

#### 8×8 slippery

```
>>> env = create_environment(slippery=True, big=True)
>>> policy, score = policy_iterator(env, 50, 10000, epsilon=0.01)
>>> score
0.37
```

# Conclusions

Clearly, the score is better, but still far from the conventional *success* score for this game, which is 0.78. I was limited by the MacBook Air I am working on and by the inefficient algorithm used in my verbose learning implementation. But as I increased the number of episodes the score got better, and I am sure the Monte Carlo method would eventually find the optimal policy.

I hope you enjoyed my post. It took me a while to create a working and compelling implementation, as I had many struggles along the way. The next one in the series will be about Temporal-Difference Learning.

Links:

- Notebook with complete code
- Reinforcement Learning: An Introduction (Book site | Amazon)
- OpenAI Gym
- Frozen Lake
- Hub for all my reinforcement learning posts


# Comments

Hi Jeremy,

Why do you use `epsilon / abs(sum(policy[s_t].values()))`?

To my mind it should be `epsilon / len(policy[s_t])`. That’s what Sutton means, and it has higher performance when testing with your `test_policy()`.

Thanks and take care.

Gilles


Hi Gilles,

I did this post a while ago and really don’t remember what I was thinking then 🙂 I have to check it out, but what you propose seems right.

Thanks!

OK, thanks. If I am not wrong, |A(St)| means the cardinality of A(St), and in the Frozen Lake case it’s always 4 – four actions.