I have finally implemented three different algorithms that are based on a Q-value. They all use a state-action value table and an ε-greedy policy when choosing the next action. The only thing that differs is how the Q-value is updated.
I have prepared a simple summary for those who have trouble understanding the differences:
- Q-learning: use the maximum action value of the next state as `next_value`.
- SARSA: pick the next action using the existing policy and use its value from the table as `next_value`.
- Expected Value SARSA: use the sum of all next-state action values divided by the number of actions as `next_value`.
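The three variants of `next_value` can be sketched in a few lines. This is a minimal Python sketch, not the Julia code from the repository. All function names, `alpha` (learning rate) and `gamma` (reward discount) are assumptions made for illustration. Note that the plain mean used for Expected Value SARSA matches the description above and equals the expectation under a uniformly random policy. A version weighted by the ε-greedy action probabilities would give a different number.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])


def next_value_q_learning(q_next):
    # Q-learning: value of the best action available in the next state.
    return max(q_next)


def next_value_sarsa(q_next, epsilon, rng=random):
    # SARSA: sample the next action from the same policy and use its value.
    # The sampled action is returned too, because SARSA then actually takes it.
    next_action = epsilon_greedy(q_next, epsilon, rng)
    return q_next[next_action], next_action


def next_value_expected_sarsa(q_next):
    # Expected Value SARSA as described above: plain mean over all actions
    # (the expectation under a uniformly random policy).
    return sum(q_next) / len(q_next)


def updated_q(old_q, reward, next_value, alpha=0.1, gamma=0.99):
    # Shared update step: move the old estimate towards the new target.
    return old_q + alpha * (reward + gamma * next_value - old_q)
```

Everything except the choice of `next_value` is shared, so switching between the three algorithms only means swapping one function call before `updated_q`.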
Ohhh! It has been so long since I wrote my last post on Q-learning, and I wouldn’t say I have progressed much since then. I have managed to learn SARSA, but today's topic is still Q-learning, plus the epsilon discount factor which I missed last time.
Yes, so first of all, let's clear up a few things. Last time I was trying to reach 9.5 in Taxi-v2, which is wrong. The game is considered solved if you receive an average reward of 8.5 over 100 games. Unfortunately, OpenAI did not document this on their website.
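A check for that criterion could look like the following minimal Python sketch. The function name `is_solved` and its parameters are hypothetical. The threshold of 8.5 and the window of 100 games are taken from the paragraph above, not from any official documentation.

```python
from collections import deque


def is_solved(episode_rewards, threshold=8.5, window=100):
    """Return True once the average of the last `window` rewards reaches `threshold`."""
    recent = deque(maxlen=window)  # keeps only the most recent `window` rewards
    for reward in episode_rewards:
        recent.append(reward)
        if len(recent) == window and sum(recent) / window >= threshold:
            return True
    return False
```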
Hello hello! I am currently working on implementing Q-learning to solve Taxi-v2 in Julia.
In its simplest implementation, Q-learning is a table of values for every state (row) and action (column) possible in the environment.
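To make the table idea concrete, here is a minimal Python sketch (the actual implementation is in Julia). The helper name `make_q_table` is hypothetical, and the sizes assume the Taxi environment's 500 discrete states and 6 actions.

```python
def make_q_table(n_states, n_actions, initial=0.0):
    # One row per state, one column per action, every value starts at `initial`.
    return [[initial] * n_actions for _ in range(n_states)]


q = make_q_table(500, 6)

# Store a value for taking action 3 in state 42, then look up the best action.
q[42][3] = 1.5
best_action = max(range(6), key=lambda a: q[42][a])
```

Reading the greedy action for a state is just a scan over one row, which is why the table form is so convenient for small environments.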
Taxi-v2 rules: there are 4 locations (labeled by different letters) and your job is to pick up the passenger at one location and drop him off at another. You receive +20 points for a successful drop-off, and lose 1 point for every timestep it takes. There is also a 10-point penalty for illegal pick-up and drop-off actions.
Unfortunately, my code did not reach 9.7 over 100 consecutive rounds and stalled at around 6.9 to 7.1. I have compared it with other examples, and it seems that my epsilon implementation differs from the ones that converged to a higher score.
I have a fixed epsilon over all rounds, which could be an issue. It could be that I need a higher value when the game starts and a lower one as training goes on.
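One common way to do that is an exponential decay with a floor. The Python sketch below is only an illustration of the idea, not the code from the repository. The name `decayed_epsilon` and the values for `start`, `end` and `decay` are assumptions that would need tuning.

```python
def decayed_epsilon(episode, start=1.0, end=0.01, decay=0.99):
    """Epsilon for a given episode: shrinks geometrically, never below `end`."""
    return max(end, start * decay ** episode)


# Explore a lot in early episodes, almost never later on.
schedule = [decayed_epsilon(episode) for episode in (0, 100, 1000)]
```

The floor keeps a little exploration alive late in training, so the agent does not lock in a poor policy too early.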
Will test it out later. As always, you can find my code here: https://github.com/dmitrijsc/practical-rl
P. S. When I reduced epsilon to 0.01 I could reach 8.9! I guess I need to play more with the other parameters.