Hello hello! I am currently working on implementing Q-learning to solve Taxi-v2 in Julia.
In its simplest implementation, Q-learning is a table of values with one row per state and one column per action possible in the environment.
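The table is filled in with the standard Q-learning update rule. Here is a minimal sketch in Python (my actual code is in Julia); the state/action indices and the hyperparameter values are made up for illustration:

```python
import numpy as np

n_states, n_actions = 500, 6   # Taxi-v2 has 500 discrete states and 6 actions
Q = np.zeros((n_states, n_actions))

alpha, gamma = 0.1, 0.99       # learning rate and discount, illustrative values

def q_update(state, action, reward, next_state):
    """One Q-learning step: nudge Q(s, a) toward the bootstrapped target."""
    target = reward + gamma * Q[next_state].max()
    Q[state, action] += alpha * (target - Q[state, action])

# Example: took action 3 in state 0, got reward -1, landed in state 42.
q_update(0, 3, -1, 42)
```

After enough episodes, acting greedily with respect to the table (picking the column with the highest value in the current row) approximates the optimal policy.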
Taxi-v2 rules: there are 4 locations (labeled by different letters), and your job is to pick up the passenger at one location and drop them off at another. You receive +20 points for a successful drop-off and lose 1 point for every timestep it takes. There is also a 10-point penalty for illegal pick-up and drop-off actions.
Unfortunately, my code did not reach an average reward of 9.7 over 100 consecutive rounds and plateaued at around 6.9–7.1. I compared it with other examples, and it seems my epsilon implementation differs from the ones that converged to a higher score.
I use a fixed epsilon over all rounds, which could be the issue. It might be better to start with a higher value and lower it over time, so the agent explores a lot early on and exploits its learned values later.
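A decaying schedule could look something like this exponential-decay sketch (again in Python for illustration; the starting value, floor, and decay rate are assumptions, not values from my repo):

```python
def epsilon_schedule(episode, eps_start=1.0, eps_min=0.01, decay=0.995):
    """Exponential decay: mostly random actions early, mostly greedy later.

    The epsilon floor (eps_min) keeps a little exploration forever.
    """
    return max(eps_min, eps_start * decay ** episode)

print(epsilon_schedule(0))     # 1.0  -> fully exploratory at the start
print(epsilon_schedule(2000))  # 0.01 -> clipped at the floor
```

At each step, the agent would pick a random action with probability `epsilon_schedule(episode)` and the greedy action otherwise.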
Will test it out later. As always, you can find my code here: https://github.com/dmitrijsc/practical-rl
P.S. When I reduced epsilon to 0.01, I could reach 8.9! I guess I need to play more with the other parameters.