It has been a long time since I actually developed anything useful, but today I have finally finished implementing the cross-entropy method for playing OpenAI text games in Julia.
Cross-entropy method (CEM) in a nutshell
So how do we solve the policy optimisation problem of maximising the total reward given some parametrised policy? At any point in time you maintain a distribution over parameter vectors and move that distribution towards parameters that yield a higher reward. This works surprisingly well, even though it becomes less effective when theta is a high-dimensional vector.
The idea is to initialise the mean and sigma of a Gaussian and then, for a given number of epochs:
- collect a batch of thetas from a Gaussian with the current mean and sigma
- perform a noisy evaluation to get the total rewards with these thetas
- select the best-performing combinations of states and actions
- count how many times each action was taken in each state and convert the counts to probabilities
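The loop above can be sketched roughly as follows. This is a minimal tabular variant written for illustration, not the code from my repository: `run_cem`, `sample_session`, and all parameter names are my own placeholders, and a session is assumed to be a `(states, actions, total_reward)` tuple.

```julia
using Statistics  # for quantile

# Sketch of the tabular cross-entropy method for a discrete MDP.
# `sample_session(policy)` is a hypothetical function that plays one
# episode and returns (states, actions, total_reward).
function run_cem(n_states, n_actions, sample_session;
                 epochs=100, batch=250, elite_frac=0.2)
    # Start from a uniform stochastic policy: policy[s, a] = P(a | s)
    policy = fill(1.0 / n_actions, n_states, n_actions)
    for epoch in 1:epochs
        # Noisy evaluation: play a batch of sessions with the current policy
        sessions = [sample_session(policy) for _ in 1:batch]
        rewards = [s[3] for s in sessions]
        # Keep only the best-performing sessions (the "elites")
        threshold = quantile(rewards, 1 - elite_frac)
        elites = [s for s in sessions if s[3] >= threshold]
        # Count how often each action was taken in each state among elites
        counts = zeros(n_states, n_actions)
        for (states, actions, _) in elites, (s, a) in zip(states, actions)
            counts[s, a] += 1
        end
        # Convert counts to probabilities; unvisited states keep the old policy
        for s in 1:n_states
            total = sum(counts[s, :])
            total > 0 && (policy[s, :] = counts[s, :] ./ total)
        end
    end
    return policy
end
```

Note that this discrete version fits the counting step described above; the Gaussian-over-theta formulation is the continuous analogue of the same select-the-elites idea.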
So in order to complete the task I had to implement or reuse the following functionality:
- ToyTextMDP – an implementation of the Toy Text problems from OpenAI
- Cross-entropy Policy – a mapping from every state the agent might visit to an action
- Cross-entropy Policy Solver – a function responsible for minimising the error / maximising the Q value
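As a rough illustration of the policy component (the type and function names below are mine, not the repository's), a tabular policy can be stored as a matrix of action probabilities, and acting means sampling an action for the current state:

```julia
# Hypothetical sketch: a tabular policy mapping states to action probabilities.
struct CrossEntropyPolicy
    probs::Matrix{Float64}  # probs[s, a] = P(take action a in state s)
end

# Uniform initial policy over n_states × n_actions
CrossEntropyPolicy(n_states::Int, n_actions::Int) =
    CrossEntropyPolicy(fill(1.0 / n_actions, n_states, n_actions))

# Sample an action for state `s` according to the policy's probabilities
function act(p::CrossEntropyPolicy, s::Int)
    r, cum = rand(), 0.0
    for a in 1:size(p.probs, 2)
        cum += p.probs[s, a]
        r <= cum && return a
    end
    return size(p.probs, 2)  # guard against floating-point round-off
end
```

Storing the policy this way makes the CEM update trivial: the solver just overwrites rows of `probs` with the normalised elite counts.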
The model and implementation are extremely simple. You can find an even more detailed explanation on Wikipedia.
You can find my source code on GitHub. I execute week_1.jl to run the CEM.