Training a reinforcement learning agent using the derivative of a generative recurrent neural network that jointly models the environment and the reward. Run "example.py" to see it working. This code requires Chainer and NumPy to be installed.
At a high level, the code works as follows:

1. The agent RNN is initialized; the probability of the agent outputting a random action is set to 1.0.
2. The agent acts in the environment, generating data about the environment.
3. The collected environment data is split evenly into training and validation parts.
4. Two separate generative RNNs are trained, one on the training part and one on the validation part of the data. Each such generative RNN can be viewed as a differentiable model of the environment.
5. The agent is trained to optimize average reward on the training environment using gradient descent over the outputs of the training environment GAN. Agent training stops when performance on the validation GAN starts to decrease.
6. The probability of the agent outputting a random action is decreased. Repeat from step 2 until a terminating criterion is met (a fixed number of iterations).
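The loop above can be sketched on a toy one-dimensional problem. Everything here is illustrative rather than the repo's actual code: a quadratic bandit stands in for the environment, a polynomial fit stands in for the generative RNN (what matters is only that it is differentiable in the action), and the "policy" is a single constant action:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_reward(a):
    # Hidden environment: reward peaks at a = 1 (unknown to the agent).
    return -(a - 1.0) ** 2 + 0.01 * rng.standard_normal(np.shape(a))

def fit_model(actions, rewards):
    # Stand-in for the generative RNN: a quadratic reward model,
    # differentiable with respect to the action.
    w2, w1, w0 = np.polyfit(actions, rewards, 2)
    predict = lambda a: w0 + w1 * a + w2 * a ** 2
    grad = lambda a: w1 + 2.0 * w2 * a
    return predict, grad

theta = -2.0  # step 1: the agent's (constant-action) policy parameter
eps = 1.0     # step 1: probability of acting randomly, starts at 1.0
for iteration in range(10):
    # step 2: act in the environment, mixing in random exploration
    actions = np.where(rng.random(200) < eps,
                       rng.uniform(-3.0, 3.0, 200),
                       theta)
    rewards = true_reward(actions)
    # step 3: split the collected data evenly into train / validation parts
    a_tr, a_va = actions[::2], actions[1::2]
    r_tr, r_va = rewards[::2], rewards[1::2]
    # step 4: fit one differentiable environment model per split
    pred_tr, grad_tr = fit_model(a_tr, r_tr)
    pred_va, _ = fit_model(a_va, r_va)
    # step 5: gradient ascent on the train model's predicted reward,
    # early-stopping when the validation model's prediction decreases
    best_va = pred_va(theta)
    for _ in range(100):
        candidate = theta + 0.05 * grad_tr(theta)
        if pred_va(candidate) < best_va:
            break
        theta, best_va = candidate, pred_va(candidate)
    # step 6: decay the exploration probability and repeat
    eps = max(0.05, eps * 0.7)

print(theta)  # should end up near the true optimum a = 1
```

The key idea survives even in this toy form: the agent never differentiates through the real environment, only through a learned model of it, and the second model trained on held-out data guards against the agent exploiting errors in the first.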
Important: this is a work in progress, so expect bugs and breaking changes.