Ohhh! It has been so long since I wrote my last post on Q-learning, and I wouldn’t say I have progressed much since then. I have managed to learn SARSA, but today’s topic stays Q-learning and the epsilon discount factor, which I missed last time.
First of all, let’s clear a few things up. Last time I was trying to reach 9.5 in Taxi-v2, which is wrong: the game is considered solved if you receive an average score of 8.5 over 100 games. Unfortunately, OpenAI did not document this on their website.
Let’s get back to the topic and assume we have an `epsilon` value of 0.25. It means we explore other opportunities 25% of the time, in other words, we take a random action on 25% of the steps. That is good when you start training and want to try different scenarios, but bad when you already have a trained model.
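To make the idea concrete, here is a minimal sketch of epsilon-greedy action selection over a Q-table. The function name `choose_action` and the use of a NumPy array as the Q-table are my own illustrative assumptions, not necessarily how the repository implements it:

```python
import numpy as np

rng = np.random.default_rng(0)

def choose_action(q_table, state, epsilon):
    # Explore: with probability `epsilon`, take a uniformly random action
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[1]))
    # Exploit: otherwise take the best known action for this state
    return int(np.argmax(q_table[state]))
```

With `epsilon = 0.25`, roughly one step in four is random; with `epsilon = 0` the agent always exploits its current Q-values.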
Last time, when solving Taxi-v2, I had to set it extremely low, something like 0.01; otherwise I would not have been able to reach the target score of 9.5.
This time I have introduced `epsilon_discount`, a float variable that is part of my solver and reduces epsilon after every batch. Depending on the game it can take different values, but usually we fix it at ~0.99X. As the number of iterations I am running is relatively high, I set the epsilon discount factor to 0.999 and increased the learning rate to 0.15.
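The decay schedule described above can be sketched like this. The `epsilon_min` floor is my own assumption (a common safeguard so a little exploration always remains); the post itself only mentions the 0.999 factor:

```python
epsilon = 1.0            # start fully exploratory
epsilon_min = 0.01       # assumed floor; keeps a bit of exploration alive
epsilon_discount = 0.999 # decay factor applied after every batch

for batch in range(4000):
    # ... run one batch of 100 games here, then shrink epsilon ...
    epsilon = max(epsilon_min, epsilon * epsilon_discount)
```

After 4,000 batches, 0.999**4000 is roughly 0.018, so exploration has dropped from 100% of steps to under 2%.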
That helped me reach 8.5 in approximately 4,000 epochs of 100 games each.
UPDATE: I found an issue in my Q-learning implementation. I had been updating the weights at the end of the episode, which is wrong. I changed the code to make the update after each frame, and the game reached a positive score in fewer than 2,000 epochs.
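For reference, the standard per-step Q-learning update looks like the sketch below, called once per frame rather than once per episode. The function name `q_update` and the `gamma = 0.99` discount are illustrative assumptions; the learning rate 0.15 matches the value mentioned above:

```python
import numpy as np

def q_update(q_table, state, action, reward, next_state,
             alpha=0.15, gamma=0.99):
    # TD target: immediate reward plus discounted best next-state value
    td_target = reward + gamma * np.max(q_table[next_state])
    # Move the current estimate a fraction `alpha` toward the target
    q_table[state, action] += alpha * (td_target - q_table[state, action])
```

Updating after every frame lets each step’s reward propagate immediately, which is likely why the episode-end version converged so much more slowly.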
As always, everything is on Github: https://github.com/dmitrijsc/practical-rl