A few updates before I move on to Space Invaders. I have updated my GitHub repo, extended DQN to support multiple layers, and fixed some bugs.
I am a very active member of the Oslo Data Science meetup Hacking group, where we actively discuss Reinforcement Learning. Lately, we have been thinking about solving Space Invaders, and some of the members were struggling with implementing the solution. I decided to join their army and see if I can handle it in Julia.
I started by reviewing my existing code, then created an additional solver based on a standard DQN and made it depend on the previous frame. So now, to predict our next action, we first compute the difference between the last two frames.
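To make the frame-difference idea concrete, here is a minimal sketch in Python (not the Julia code from my repo). It assumes a frame is a plain list of rows of grayscale pixel values; the `Frame` alias and the `frame_difference` name are hypothetical and exist only for this illustration.

```python
from typing import List

# Assumption: one grayscale frame is a list of rows of pixel intensities.
Frame = List[List[int]]


def frame_difference(previous: Frame, current: Frame) -> Frame:
    """Return current - previous, pixel by pixel."""
    if len(previous) != len(current):
        raise ValueError("frames must have the same height")
    diff = []
    for prev_row, cur_row in zip(previous, current):
        if len(prev_row) != len(cur_row):
            raise ValueError("frames must have the same width")
        diff.append([c - p for p, c in zip(prev_row, cur_row)])
    return diff


# A single bright pixel moves one column to the right between two frames.
previous_frame = [[0, 255, 0], [0, 0, 0]]
current_frame = [[0, 0, 255], [0, 0, 0]]
motion = frame_difference(previous_frame, current_frame)
```

Pixels that did not change become zero, so the network input is dominated by whatever moved between the two frames.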
My current score varies a lot, with an average of around 210 and a maximum of 630 points. Most probably I need to train the model a little bit longer so it can explore more of the field.
I am also planning to introduce a Replay Buffer, which should “help” the model to remember “old” frames and keep gradients alive.
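For reference, a replay buffer can be as small as the Python sketch below. This is only an illustration of the general technique, not the buffer I will end up using; the `ReplayBuffer` class, the `Transition` fields, and the capacity are all assumptions.

```python
import random
from collections import deque, namedtuple

# Hypothetical record for one environment step.
Transition = namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size store of past transitions with uniform random sampling."""

    def __init__(self, capacity: int):
        # A deque with maxlen silently drops the oldest item when full.
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done) -> None:
        self.memory.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Copy to a list so sampling works on any Python 3 version.
        return random.sample(list(self.memory), batch_size)

    def __len__(self) -> int:
        return len(self.memory)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.push(state=step, action=0, reward=1.0, next_state=step + 1, done=False)
batch = buffer.sample(2)
```

Training on random batches of old and new transitions, instead of only the latest frame, is what lets the model "remember" earlier parts of the game.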
I haven’t shared any code yet; it will come as soon as a working solution is available.
Finally! I did it! I’ve been struggling for some time trying to make DQN work and could not succeed.
Today I managed to make it work and solve CartPole from OpenAI gym using DQN. You know what the problem was? The size of my neural network!
Continue reading “Playing CartPole using Julia and MXNet implementation of DQN”
I have finally implemented three different algorithms which are based on a Q-value. They all use a state-action-value table and an epsilon-greedy policy when choosing the next action. The only thing that differs is how the Q-value is updated.
I have prepared a simple mapping, for those who are experiencing problems understanding the differences:
- Q-learning: use the maximum value of the next state's actions as `next_value`.
- SARSA: generate another action using the existing policy and use its value from the mapping table as `next_value`.
- Expected Value SARSA: use the sum of all next-state action values divided by the number of actions as `next_value`.
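The three rules above can be sketched side by side. This is Python rather than my Julia code, and every function name here, as well as the `alpha` and `gamma` defaults, is a hypothetical choice for illustration. Note that the plain average used for Expected Value SARSA corresponds to a uniformly random policy; the textbook version weights each action value by the policy's probability of picking it.

```python
from typing import List


def q_learning_next_value(next_q: List[float]) -> float:
    """Q-learning: value of the best action in the next state."""
    return max(next_q)


def sarsa_next_value(next_q: List[float], next_action: int) -> float:
    """SARSA: value of the action the current policy actually picked."""
    return next_q[next_action]


def expected_sarsa_next_value(next_q: List[float]) -> float:
    """Expected Value SARSA as described above: plain average of the values."""
    return sum(next_q) / len(next_q)


def td_update(q_value: float, reward: float, next_value: float,
              alpha: float = 0.1, gamma: float = 0.99) -> float:
    """Shared update step; only next_value differs between the algorithms."""
    return q_value + alpha * (reward + gamma * next_value - q_value)


# One row of a Q-table for the next state (three actions).
next_state_q = [1.0, 3.0, 2.0]
q_max = q_learning_next_value(next_state_q)
q_sarsa = sarsa_next_value(next_state_q, next_action=2)
q_expected = expected_sarsa_next_value(next_state_q)
updated = td_update(0.0, reward=-1.0, next_value=q_max, alpha=0.5, gamma=1.0)
```

Everything outside `next_value` is identical, which is why switching between the three algorithms is usually a one-line change.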
So, I’ve been very active these two days and managed to implement the SARSA algorithm for solving Taxi-v2.
SARSA and Q-learning look mostly the same, except that in Q-learning we update our Q-function by assuming we are taking the action `a` that maximises our post-state Q-function.
In SARSA, we use the same policy that generated the previous action `a` to generate the next action, `a-prim`, which we run through our Q-function for updates.
It all might sound very complicated, but it results in a very small change to the Q-learning algorithm. You can compare my implementation of SARSA and Q-learning to see the difference.
I have managed to reach the same score as with Q-learning in about the same time. I guess the Taxi-v2 problem is solved once and for all.
You should also be able to launch Taxi-v2 using the code from my GitHub repository.
Ohhh! It has been so long since I wrote my last post on Q-learning and I wouldn’t say I have progressed much since then. I have managed to learn SARSA, but our topic for today will stay Q-learning and the epsilon discount factor, which I missed last time.
Yes, so first of all, let’s clear a few things up. Last time I was trying to reach 9.5 in Taxi-v2, which is wrong. The game is considered solved if you receive an average of 8.5 over 100 games. Unfortunately, OpenAI did not document it on their website.
Continue reading “Julia: Q-learning and epsilon discount factor”
Hello hello! I am currently working on implementing Q-learning to solve Taxi-v2 in Julia.
In its simplest implementation, Q-Learning is a table of values for every state (row) and action (column) possible in the environment.
Taxi-v2 rules: there are 4 locations (labeled by different letters) and your job is to pick up the passenger at one location and drop him off at another. You receive +20 points for a successful drop-off, and lose 1 point for every timestep it takes. There is also a 10-point penalty for illegal pick-up and drop-off actions.
Unfortunately, my code did not reach 9.7 over 100 consecutive rounds and plateaued at around 6.9 to 7.1. I have compared it with other examples, and it seems that my epsilon implementation is different from those that converged with a higher score.
I use a fixed epsilon over all rounds, which could be an issue. It could be that I need a higher value when the game starts and should lower it over time.
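One common way to do that is an exponential decay with a floor. The Python sketch below shows the idea; the `decayed_epsilon` name and the `start`, `end`, and `decay` values are assumptions picked for illustration, not tuned settings.

```python
def decayed_epsilon(episode: int, start: float = 1.0, end: float = 0.01,
                    decay: float = 0.995) -> float:
    """Exponentially shrink epsilon per episode, never going below `end`."""
    return max(end, start * decay ** episode)


# Explore a lot early on, then act almost greedily later.
schedule = [decayed_epsilon(episode) for episode in (0, 100, 500, 10_000)]
```

With this schedule the agent acts randomly at the very beginning and settles at the floor value once the decay has run its course.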
Will test it out later. As always, you can find my code here: https://github.com/dmitrijsc/practical-rl
P. S. When I reduced epsilon to 0.01 I could reach 8.9! I guess I need to play more with the other parameters.
Today I have been trying to re-run my code and confirm I am getting 195 points while playing CartPole. According to the OpenAI website, CartPole-v0 defines “solving” as getting an average reward of 195.0 over 100 consecutive trials.
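That criterion is easy to check in code. Here is a small Python sketch, assuming the per-episode rewards are collected in a list; the `is_solved` helper is hypothetical and not part of OpenAI gym.

```python
from typing import List


def is_solved(episode_rewards: List[float], threshold: float = 195.0,
              window: int = 100) -> bool:
    """True if the mean reward of the last `window` episodes meets the threshold."""
    if len(episode_rewards) < window:
        return False  # not enough consecutive trials yet
    recent = episode_rewards[-window:]
    return sum(recent) / window >= threshold


# 50 poor episodes followed by 100 perfect ones counts as solved.
rewards = [20.0] * 50 + [200.0] * 100
solved = is_solved(rewards)
```

Only the most recent 100 episodes matter, so a slow start does not prevent the run from counting as solved later on.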
It was actually a very challenging task to figure out why my model was not training at all and kept giving me 50% probabilities over multiple iterations.
Initially I suspected issues in my solver (and I actually had one), but in the end, I realized that the learning rate was too small.
I was using the Adam optimiser with a default learning rate of 0.001. It is totally OK to use it for most tasks. I suspect I had to increase the learning rate to 0.01 because of a relatively small batch.
Anyways, the problem is solved and cart pole is running great!