Hello, hello!

It has been a long time since I last built anything useful, but today I finally finished implementing the cross-entropy method for playing OpenAI text games in Julia.

**Cross-entropy method (CEM) in a nutshell**

So how do we solve the policy optimisation problem of maximising the total reward given some parametrised policy? At every step you maintain a distribution over parameter vectors and move that distribution towards parameters that yield a higher reward. This works surprisingly well, even though it becomes less effective when theta is a high-dimensional vector.

**Algorithm**

The idea is to initialise the mean and sigma of a Gaussian and then, for `epochs` iterations:

- collect a batch of `n_samples` values of `theta` from a Gaussian with the current mean and sigma
- perform a noisy evaluation to get the total rewards with these thetas
- select the best-performing combinations of states and actions
- count the number of times each action was taken in each state and convert the counts to probabilities
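For the tabular case the last two steps can be sketched roughly as follows. This is a minimal illustration in Julia, not the post's actual code; `episodes` (each with `states`, `actions`, and a total `reward` field) and the `policy` matrix layout are assumptions for the example.

```julia
using Statistics  # for quantile

# One CEM update for a tabular policy (states × actions matrix of probabilities).
function cem_step(policy, episodes, percentile)
    rewards = [ep.reward for ep in episodes]
    threshold = quantile(rewards, percentile / 100)
    # keep only the elite episodes
    elites = [ep for ep in episodes if ep.reward >= threshold]

    # count how often each action was taken in each state among the elites
    counts = zeros(size(policy))
    for ep in elites, (s, a) in zip(ep.states, ep.actions)
        counts[s, a] += 1
    end

    # normalise counts into probabilities; unvisited states fall back to uniform
    new_policy = similar(policy)
    n_actions = size(policy, 2)
    for s in 1:size(policy, 1)
        total = sum(counts[s, :])
        new_policy[s, :] = total > 0 ? counts[s, :] ./ total :
                                       fill(1 / n_actions, n_actions)
    end
    return new_policy
end
```

The uniform fallback for unvisited states is one common design choice; it keeps the agent exploring states that no elite episode has reached yet.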

So, in order to complete the task, I had to implement or reuse the following functionality:

- **ToyTextMDP** – an implementation of the Toy Text problems from OpenAI
- **Cross-entropy Policy** – a mapping from every state an agent might encounter to an action
- **Cross-entropy Policy Solver** – a function responsible for minimising the error / maximising the Q value
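The policy component can be pictured as a probability table sampled per state. Here is a small, self-contained sketch in Julia (the `n_states`/`n_actions` names and the matrix layout are illustrative assumptions, not the post's API):

```julia
n_states, n_actions = 16, 4          # e.g. a small grid-world-sized toy problem

# start from a uniform policy: every action equally likely in every state
policy = fill(1 / n_actions, n_states, n_actions)

# draw an action for state s according to the policy's probability row
function sample_action(policy, s)
    r, cum = rand(), 0.0
    for a in 1:size(policy, 2)
        cum += policy[s, a]
        r <= cum && return a
    end
    return size(policy, 2)           # guard against floating-point rounding
end

a = sample_action(policy, 1)         # an action index in 1:n_actions
```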

The model and implementation are extremely simple. You can find an even more detailed explanation on Wikipedia.

You can find my source code on GitHub; run `week_1.jl` to execute the CEM.