The `DQNSolver` allows an agent to take actions in an environment and to adapt its decision making over time through environmental feedback.
The following UML class diagram shows the composition of the solver:
The construction of a solver requires instances of the classes `DQNEnv` and `DQNOpt`. These instances provide all necessary information about the environment and about the decision making and learning of the agent (a construction sketch follows the parameter list below).
- `DQNEnv` holds information about the environment.
  - `numberOfStates`: the length of the vector (s) of observations
  - `numberOfActions`: the number of potential actions to choose from
- `DQNOpt` holds all the hyperparameters for learning and decision making.
  - `numberOfHiddenUnits`: width and depth of the neural network; the array length describes the number of hidden layers (depth) and the numerical values describe the length of each layer (width)
  - `trainingMode`: mode of training, defines the explorative behaviour of the solver (decaying or stable)
  - `epsilon`: static exploration rate (during non-training mode)
  - `epsilonMax`, `epsilonMin`, `epsilonDecayPeriod`: linear decay of exploration during training mode
  - `alpha`: learning rate for the network update
  - `gamma`: discount factor for the Bellman equation
  - `doLossClipping`: control flag for the clipping of the loss function during learning
  - `lossClamp`: the size of the loss clipping
  - `doRewardClipping`: control flag for the clipping of the injected reward
  - `rewardClamp`: the size of the reward clipping
  - `experienceSize`: size of the replay memory
  - `replayInterval`: interval of memory replay
  - `replaySteps`: size of the replay sample
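A minimal construction sketch in TypeScript might look as follows. It assumes the library is imported as `reinforce-js`; the `DQNEnv` constructor arguments (width, height, numberOfStates, numberOfActions) and the individual `DQNOpt` setter names are assumptions chosen to mirror the options listed above and may differ in detail from the actual API.

```typescript
import { DQNEnv, DQNOpt, DQNSolver } from 'reinforce-js';

// Environment description: 20 observed state values, 4 selectable actions.
// The width/height arguments and the constructor signature are assumptions.
const width = 400;
const height = 400;
const env = new DQNEnv(width, height, 20, 4);

// Hyperparameters for decision making and learning
// (setter names assumed to mirror the option names above).
const opt = new DQNOpt();
opt.setTrainingMode(true);           // decaying exploration during training
opt.setNumberOfHiddenUnits([100]);   // one hidden layer with 100 units
opt.setEpsilonDecay(1.0, 0.1, 1e6);  // epsilonMax, epsilonMin, epsilonDecayPeriod
opt.setEpsilon(0.05);                // static exploration rate outside training mode
opt.setGamma(0.9);                   // discount factor of the Bellman equation
opt.setAlpha(0.005);                 // learning rate of the network update
opt.setLossClipping(true);           // doLossClipping
opt.setLossClamp(1.0);               // lossClamp
opt.setRewardClipping(true);         // doRewardClipping
opt.setRewardClamp(1.0);             // rewardClamp
opt.setExperienceSize(1e6);          // size of the replay memory
opt.setReplayInterval(25);           // replay every 25 learning steps
opt.setReplaySteps(10);              // sample 10 experiences per replay

// The solver combines the environment description with the hyperparameters.
const solver = new DQNSolver(env, opt);
```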
The solver primarily provides two methods: the `decide` method performs the decision making and the `learn` method enables the adaptive learning of the solver. To decide upon an action, the solver takes a vector (`s: Array<number>`) of current observations (states) and returns the index of the action (`a: number`) to be executed. After the action has been executed, the solver takes a numerical reward (`r: number`) to `learn` from the effectiveness of the action.
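A single interaction step could then look like this; `getState`, `executeAction` and `computeReward` are hypothetical helpers standing in for your own environment code:

```typescript
const state: Array<number> = getState();      // s: vector of `numberOfStates` observations
const action: number = solver.decide(state);  // a: index of the action to be executed
executeAction(action);                        // apply the chosen action to the environment
const reward: number = computeReward();       // r: reward for the executed action
solver.learn(reward);                         // adapt the solver's decision making
```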
The appended scheme gives a brief overview of the concept of the solver. As shown, the `DQNSolver` is constructed with a set of hyperparameters defined via `DQNOpt` and a description of the environment defined via `DQNEnv`. The solver is then able to make decisions based on a set of observations (s), and its decision making can be influenced by providing a reward value.
Assuming the solver has decided and learned at least once, the state and action of the current iteration, together with the reward, state and action of the last iteration, resolve into an experience (a so-called SARSA tuple) for the replay memory. When aiming for stable learning, clipping of the reward and the loss is recommended.
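Conceptually, such an experience can be pictured as the following tuple; the interface and field names are illustrative, not the library's internal types:

```typescript
// Illustrative shape of one SARSA experience in the replay memory (hypothetical names):
interface SarsaExperience {
  lastState: Array<number>;    // s:  state of the last iteration
  lastAction: number;          // a:  action taken in the last iteration
  reward: number;              // r:  reward received for that action
  currentState: Array<number>; // s': state of the current iteration
  currentAction: number;       // a': action taken in the current iteration
}
```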
For further information, please consult the research paper by Mnih et al. (2015), "Human-level control through deep reinforcement learning".