Difference in Q-learning, Sarsa, Expected Value Sarsa for dummies

I have finally implemented three different algorithms that are based on a Q-value. They all use a state-action value table and an epsilon-greedy policy when choosing the next action. The only thing that differs is how the Q-value is updated.

I have prepared a simple mapping for those who have trouble understanding the differences (a short code sketch follows the list):

  • Q-learning: use the maximum value of the next state's actions as the `next_value`.
  • SARSA: generate another action using the existing policy and use its value from the table as the `next_value`.
  • Expected Value SARSA: use the expected value of the next state's actions under the current policy (with epsilon-greedy, a probability-weighted average of the action values) as the `next_value`.
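
A minimal Julia sketch of how I think about the three updates (the names `Q`, `s`, `a`, `s_next`, `α`, `γ` and `ε` are illustrative here, not taken from my repository):

```julia
# Illustrative sketch of the three update rules on a Q-table.
# Q is a states × actions matrix; s, a are the current state and action,
# r is the reward, s_next the next state, α the learning rate, γ the discount.

# ε-greedy action selection, shared by all three algorithms
function select_action(Q, s, ε)
    rand() < ε ? rand(1:size(Q, 2)) : argmax(Q[s, :])
end

# Q-learning: next_value is the maximum over the next state's actions
function q_learning_update!(Q, s, a, r, s_next, α, γ)
    next_value = maximum(Q[s_next, :])
    Q[s, a] += α * (r + γ * next_value - Q[s, a])
end

# SARSA: next_value is the value of the action the policy actually picks next
function sarsa_update!(Q, s, a, r, s_next, a_next, α, γ)
    next_value = Q[s_next, a_next]
    Q[s, a] += α * (r + γ * next_value - Q[s, a])
end

# Expected SARSA: next_value is the expectation under the ε-greedy policy
function expected_sarsa_update!(Q, s, a, r, s_next, α, γ, ε)
    n = size(Q, 2)
    probs = fill(ε / n, n)                # exploration mass spread over all actions
    probs[argmax(Q[s_next, :])] += 1 - ε  # remaining mass on the greedy action
    next_value = sum(probs .* Q[s_next, :])
    Q[s, a] += α * (r + γ * next_value - Q[s, a])
end
```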

Julia: Q-learning and epsilon discount factor

Ohhh! It has been so long since I wrote my last post on Q-learning, and I wouldn't say I have progressed much since then. I have managed to learn SARSA, but today's topic will stay Q-learning and the epsilon discount factor that I missed last time.

Yes, so first of all, let's clear a few things up. Last time I was trying to reach 9.5 in Taxi-v2, which is wrong. The game is considered solved if you receive an average reward of 8.5 over 100 games. Unfortunately, OpenAI did not document this on their website.
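
For example, a check along these lines is what I have in mind (assuming `episode_rewards` holds the per-episode returns; the name and helper are hypothetical, not from my code):

```julia
# Illustrative success check: the task counts as solved once the average
# return over the last 100 episodes reaches 8.5.
function solved(episode_rewards; window = 100, target = 8.5)
    length(episode_rewards) < window && return false
    sum(episode_rewards[end-window+1:end]) / window >= target
end
```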


Julia: Q-learning – using value table to solve Taxi-v2

Hello hello! I am currently working on implementing Q-learning to solve Taxi-v2 in Julia.

In its simplest implementation, Q-learning is a table of values for every state (row) and action (column) possible in the environment.

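A sketch of what that table looks like for Taxi-v2, which has 500 discrete states and 6 actions (the variable names here are just for illustration):

```julia
# Q-table for Taxi-v2: one row per state, one column per action, initialized to zero.
n_states, n_actions = 500, 6
Q = zeros(n_states, n_actions)

# Reading and writing a single entry (Julia indices are 1-based):
s, a = 1, 3
Q[s, a] += 0.5                 # updates overwrite entries like this one
best_action = argmax(Q[s, :])  # greedy action for state s
```
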
Taxi-v2 rules: there are 4 locations (labeled with different letters), and your job is to pick up the passenger at one location and drop them off at another. You receive +20 points for a successful drop-off and lose 1 point for every timestep it takes. There is also a 10-point penalty for illegal pick-up and drop-off actions.

Unfortunately, my code did not reach 9.7 over 100 consecutive rounds and plateaued at around 6.9-7.1. I have compared it with other examples, and it seems that my epsilon implementation is different from those that converged to a higher score.

I use a fixed epsilon across all rounds, which could be an issue. It may be that I need a higher value when the game starts and should lower it over time.
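
Something along these lines is what I mean by decaying epsilon (the constants are placeholders, not tuned values):

```julia
# Illustrative epsilon schedule: start exploratory and decay towards a small floor.
function decayed_epsilon(episode; ε_start = 1.0, ε_min = 0.01, decay = 0.999)
    max(ε_min, ε_start * decay^(episode - 1))
end

decayed_epsilon(1)      # 1.0 on the first episode
decayed_epsilon(5_000)  # already clamped to ε_min
```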

I will test this out later. As always, you can find my code here: https://github.com/dmitrijsc/practical-rl

P.S. When I reduced epsilon to 0.01, I reached 8.9! I guess I need to play more with the other parameters.
