Our topic for today is taking a random policy and enhancing it with genetic/evolutionary algorithms to score in different versions of FrozenLake.

About FrozenLake, OpenAI gym:

The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water. Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile.
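For reference, the default 4×4 map (as defined in the Gym source) looks like this, where `S` is the start tile, `F` is frozen (safe), `H` is a hole, and `G` is the goal:

```
SFFF
FHFH
FFFH
HFFG
```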

Let’s get started!

I guess I should start by letting you know that I will try to use Julia for everything I develop. In the first few posts I will also include the full code so we can get started; in the future I will most probably just link back here.

Let’s start by simply loading the packages and creating the environment:

```julia
using OpenAIGym
import Reinforce.action

env = GymEnv("FrozenLake8x8-v0") # FrozenLake-v0 for the 4x4 map
env_actions = env.pyenv[:action_space][:n]
env_space = env.pyenv[:observation_space][:n]
env_space_actions = 0:(env_actions - 1)
```

It is important to note that we are trying to generalize our code. That’s the reason behind finding `env_actions` and `env_space`, and creating `env_space_actions` – the set of actions available in the game.

Now let’s get to defining our Policy.

```julia
struct StateActionPolicy <: AbstractPolicy
    states::Vector{Int}
end

# Look up the mapped action for the current state; Gym states are
# 0-based while Julia arrays are 1-based, hence the +1. Non-Int
# states (before the first step) fall back to state 0.
function action(policy::StateActionPolicy, r, s′, A′)
    index = (typeof(s′) != Int64 ? 0 : s′) + 1
    policy.states[index]
end
```

So what did I do above? I defined a new policy that keeps a state-to-action mapping and accepts it via its constructor. Then I defined the `action` function, which selects the mapped action for the current state.
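To see the indexing in isolation, here is a small self-contained sketch (no Gym required); `ToyPolicy` and `choose` are illustrative names of mine, not part of the post's code:

```julia
# Toy stand-in for StateActionPolicy (illustrative names only).
struct ToyPolicy
    actions::Vector{Int}   # actions[s + 1] is the action for state s
end

# Mirror the lookup in `action` above: non-Int states fall back to state 0.
choose(p::ToyPolicy, s) = p.actions[(isa(s, Int) ? s : 0) + 1]

p = ToyPolicy([2, 0, 1, 3])   # a 4-state policy
choose(p, 0)   # → 2 (the action stored for state 0)
choose(p, 3)   # → 3
```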

So now we have an environment and a policy. Let’s generate a set of random state-to-action mappings and evaluate them.

```julia
play_episode_number = 10^3
rewards = zeros(play_episode_number)
# One random policy per column: an action for every state
random_policy_values = rand(env_space_actions, env_space, play_episode_number)

function global_episode(env, policy)
    reward = run_episode(env, policy) do
        # nothing to do per step
    end
    return reward
end

# Total reward of policy `policy_id` over `episode_count` episodes
function play_global_policy(env, policy_id = 1, episode_count = 100)
    return sum(map(x -> global_episode(env, StateActionPolicy(random_policy_values[:, policy_id])),
                   zeros(episode_count)))
end

policy_scores = map(x -> play_global_policy(env, x[1]), enumerate(rewards));
```

A few things we did above:

- We generated 10^3 random policies
- We evaluated each policy 100 times and collected the results in `policy_scores`
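Since FrozenLake gives a reward of 1.0 only when the goal tile is reached (and 0 otherwise), a policy's score is simply its number of successful episodes, and dividing by the episode count estimates its success rate. A minimal illustration (the Bernoulli draw here is just a stand-in for `run_episode`):

```julia
# Stand-in for 100 evaluation episodes of a policy that succeeds ~30% of
# the time; in the real code each entry comes from run_episode.
episode_rewards = [rand() < 0.3 ? 1.0 : 0.0 for _ in 1:100]

score = sum(episode_rewards)   # what play_global_policy returns
success_rate = score / 100     # estimated probability of reaching the goal
```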

Now we proceed with genetic algorithms. In short, we do the following:

- Crossover: take 2 random policies and, for each state, select the action from one or the other
- Mutation: take a random policy and update its actions randomly
- Evaluate the results and keep the best policies
- Repeat
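The training loop below calls `crossover` and `mutate` without their definitions ever being shown, so here is one plausible sketch. These versions take the policy pool explicitly as a matrix (one column per policy); the names, signatures, and the mutation rate are my assumptions, not the post's actual code:

```julia
# Hypothetical implementations -- the original post never shows these.
# `pool` holds one policy per column: pool[s + 1, i] is policy i's action
# for state s; `actions` is the valid action range (0:3 in FrozenLake).

# Uniform crossover: for every state, take the action from parent i or j.
function crossover(pool::Matrix{Int}, i::Int, j::Int)
    mask = rand(Bool, size(pool, 1))
    [mask[k] ? pool[k, i] : pool[k, j] for k in 1:size(pool, 1)]
end

# Mutation: resample each of policy i's actions with probability `rate`.
function mutate(pool::Matrix{Int}, i::Int; rate = 0.1, actions = 0:3)
    q = pool[:, i]
    for k in eachindex(q)
        rand() < rate && (q[k] = rand(actions))
    end
    return q
end
```

In the loop one would pass `random_policy_values` as `pool`, then `hcat` the returned child/mutant columns onto it.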

```julia
EPOCHS = 20
IMPUTATIONS = 50
KEEP_RECORDS = 100

for i = 1:EPOCHS
    CURRENT_SIZE = length(policy_scores)
    for j = 1:IMPUTATIONS
        policy_selector = rand(1:CURRENT_SIZE, 3, 1)
        # Crossover: combine two random parents into a new policy
        random_policy_values = hcat(random_policy_values,
                                    crossover(policy_selector[1], policy_selector[2]))
        policy_scores = vcat(policy_scores,
                             play_global_policy(env, CURRENT_SIZE + j*2 - 1))
        # Mutation: randomly perturb a third random policy
        random_policy_values = hcat(random_policy_values, mutate(policy_selector[3]))
        policy_scores = vcat(policy_scores,
                             play_global_policy(env, CURRENT_SIZE + j*2))
    end
    # Selection: keep only the KEEP_RECORDS best-scoring policies
    # (the same slice must be used for both arrays to keep them in sync)
    indices = sortperm(policy_scores)
    random_policy_values = random_policy_values[:, indices][:, end-(KEEP_RECORDS - 1):end]
    policy_scores = policy_scores[indices][end-(KEEP_RECORDS - 1):end]
end
```

I have trained 2 different models – one for FrozenLake-v0, which has a 4×4 map, and one for FrozenLake8x8-v0, which has an 8×8 map.

My models achieve the following results, which are considered good according to the playground:

- FrozenLake4x4: 0.86 in 10+ epochs
- FrozenLake8x8: 0.97 in 30+ epochs