Hello hello! I am currently working on implementing Q-learning to solve Taxi-v2 in Julia.
In its simplest implementation, Q-learning is a table of values with one row per state and one column per action possible in the environment.
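A table like that takes only a few lines of Julia to sketch. The 500 × 6 shape below matches Taxi-v2's state and action counts; the `update!` helper and the alpha/gamma values are illustrative choices, not part of any gym API:

```julia
n_states, n_actions = 500, 6   # Taxi-v2 has 500 states and 6 actions
Q = zeros(n_states, n_actions)

# one Q-learning update for a transition (s, a) -> reward r, next state s2
function update!(Q, s, a, r, s2; alpha = 0.1, gamma = 0.9)
    Q[s, a] += alpha * (r + gamma * maximum(Q[s2, :]) - Q[s, a])
end

update!(Q, 1, 2, -1.0, 3)   # e.g. the -1 per-timestep penalty
```

Each update nudges one cell of the table toward the observed reward plus the discounted value of the best action in the next state.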
Taxi-v2 rules: there are 4 locations (labeled by different letters), and your job is to pick up the passenger at one location and drop them off at another. You receive +20 points for a successful dropoff and lose 1 point for every timestep it takes. There is also a 10 point penalty for illegal pick-up and drop-off actions.
Unfortunately, my code did not reach an average of 9.7 over 100 consecutive rounds; it plateaued at around 6.9 to 7.1. I compared it with other examples, and it seems that my epsilon implementation differs from those that converged with a higher score.
I use a fixed epsilon across all rounds, which could be the issue: epsilon should probably start high, encouraging exploration, and decay over time as the agent learns.
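For illustration, a decaying schedule plus epsilon-greedy selection might look like this (all the constants here are assumptions, not values from my code):

```julia
# Hypothetical schedule: start fully exploratory, decay each round,
# and never fall below a floor.
eps_start, eps_min, decay = 1.0, 0.01, 0.995

epsilon(round) = max(eps_min, eps_start * decay^round)

# epsilon-greedy action selection over a row of the Q-table
choose(Q, s, eps) = rand() < eps ? rand(1:size(Q, 2)) : argmax(Q[s, :])

epsilon(0)     # fully random at the start
epsilon(1000)  # clamped at the floor by then
```

With a schedule like this, early rounds explore broadly while later rounds mostly exploit the learned values.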
Today I have been re-running my code to confirm that I reach 195 points while playing CartPole. According to the OpenAI website, CartPole-v0 defines “solving” as getting an average reward of 195.0 over 100 consecutive trials.
It was actually quite challenging to figure out why my model was not training at all and kept outputting 50% probabilities over multiple iterations.
Initially I suspected issues in my solver (and I did find one), but in the end I realized that the learning rate was too small.
I was using the Adam optimiser with its default learning rate of 0.001, which is perfectly fine for most tasks. I suspect I had to increase it to 0.01 because of the relatively small batch size.
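A toy example of the effect, unrelated to CartPole itself: plain gradient descent on f(x) = x² with the two learning rates. With the smaller rate the parameter barely moves in the same number of steps, which from the outside looks like "not training at all":

```julia
# minimize f(x) = x^2, whose gradient is 2x, from x = 10 for 200 steps
grad(x) = 2x

function descend(lr; x = 10.0, steps = 200)
    for _ in 1:steps
        x -= lr * grad(x)
    end
    x
end

descend(0.001)  # still far from the minimum at 0
descend(0.01)   # an order of magnitude closer
```

The same number of updates, a 10x difference in progress; with small batches there are also fewer updates per epoch, compounding the problem.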
Anyway, the problem is solved and CartPole is running great!
Recently I’ve been playing FrozenLake using the cross-entropy method implemented in Julia, but this time I made the task more challenging and implemented the deep cross-entropy method using MXNet to play CartPole.
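The heart of the cross-entropy method is simple: play a batch of episodes, keep only the elite fraction by reward, and retrain the policy on the elite episodes' (state, action) pairs. A sketch of the elite-selection step on dummy data (`Episode` and `elite` are hypothetical names, and the actual network training is omitted):

```julia
using Statistics   # for quantile

struct Episode
    reward::Float64
    steps::Vector{Tuple{Int,Int}}   # (state, action) pairs
end

# keep episodes whose reward is at or above the given percentile
function elite(batch::Vector{Episode}, percentile = 70)
    bound = quantile([e.reward for e in batch], percentile / 100)
    filter(e -> e.reward >= bound, batch)
end

batch = [Episode(r, [(1, 1)]) for r in [1.0, 5.0, 9.0, 10.0]]
elite(batch)   # only the top episodes survive; fit the network to
               # their actions and repeat with a fresh batch
```

The deep variant just swaps the tabular policy for a neural network trained on the elite pairs.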
In parallel with taking the Practical RL course, I am also reading a great book on reinforcement learning. I found the following quote worth noting down on the blog; it explains the difference between evolutionary methods and methods that learn value functions.
For example, if the player wins, then all of its behavior in the game is given credit, independently of how specific moves might have been critical to the win. Credit is even given to moves that never occurred! Value function methods, in contrast, allow individual states to be evaluated. In the end, evolutionary and value function methods both search the space of policies, but learning a value function takes advantage of information available during the course of play.
Our topic for today is taking a random policy and enhancing it with genetic/evolutionary algorithms to score in different versions of FrozenLake.
About FrozenLake, OpenAI gym:
The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water. Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile.
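Putting the two ideas together, here is a sketch of random policy search on a deterministic (non-slippery) 4x4 version of the map. The map layout and reward follow the gym description above; the action encoding, step cap, and helper names are my own, and the `mutate` one-liner hints at the genetic step:

```julia
# deterministic 4x4 FrozenLake: S start, F frozen, H hole, G goal
const LAKE = ["SFFF", "FHFH", "FFFH", "HFFG"]

# my own action encoding: 1=left, 2=down, 3=right, 4=up
const MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]

# a policy is one action per state (16 entries); reward is 1.0 on
# reaching the goal, 0.0 on falling into a hole or timing out
function run_episode(policy::Vector{Int})
    r, c = 1, 1
    for _ in 1:100                       # step cap so looping policies end
        dr, dc = MOVES[policy[(r - 1) * 4 + c]]
        r, c = clamp(r + dr, 1, 4), clamp(c + dc, 1, 4)
        LAKE[r][c] == 'H' && return 0.0  # fell into a hole
        LAKE[r][c] == 'G' && return 1.0  # reached the goal
    end
    return 0.0
end

# the smallest possible "genetic" operator: resample ~10% of the actions
mutate(p) = [rand() < 0.1 ? rand(1:4) : a for a in p]

# plain random search: keep the best policy seen so far
function random_search(tries = 50_000)
    best, best_r = rand(1:4, 16), 0.0
    for _ in 1:tries
        policy = rand(1:4, 16)
        r = run_episode(policy)
        if r > best_r
            best, best_r = policy, r
        end
    end
    best
end
```

On the deterministic map, random search alone reliably finds a winning policy. On the slippery version you would average `run_episode` over many rollouts and let `mutate` refine the current best instead of sampling each candidate from scratch.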