4.8 Miscellaneous model-free algorithms

4.8.1 Stochastic Value Gradient (SVG)

Heess et al. (2015)

4.8.2 Q-Prop

Gu et al. (2016a)

4.8.3 Normalized Advantage Function (NAF)

Gu et al. (2016b)

4.8.4 Fictitious Self-Play (FSP)

Heinrich et al. (2015); Heinrich and Silver (2016)

4.9 Comparison between value-based and policy gradient methods

Having now reviewed both value-based methods (DQN and its variants) and policy gradient methods (A3C, DDPG, PPO), the question arises of which method to choose. While research on value-based methods has slowed down, policy gradient methods are attracting a lot of attention, as they are able to learn policies in continuous action spaces, which is very important in robotics. The blog post at https://flyyufelix.github.io/2017/10/12/dqn-vs-pg.html summarizes the advantages and drawbacks of policy gradient methods.

Advantages of PG:

- They can directly learn policies in high-dimensional or continuous action spaces, where value-based methods would need a costly maximization over actions.
- They can learn stochastic policies, which is useful under partial observability and in adversarial settings.
- They tend to have better convergence properties: the policy changes smoothly with its parameters, whereas a small change in the Q-values can completely switch the greedy policy.

Disadvantages of PG:

- They typically converge to a local rather than a global optimum.
- The gradient estimates have a high variance, making policy evaluation sample inefficient, especially for on-policy methods that cannot reuse past experience.
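To make the continuous-action point concrete, the sketch below (plain numpy, with made-up state and action dimensions and a placeholder return) contrasts value-based action selection, which takes an argmax over a finite set of Q-values, with a Gaussian policy whose mean is produced by the network and from which continuous actions are sampled. The REINFORCE-style update of the mean parameters is only meant to illustrate the mechanism, not any specific algorithm from this section.

```python
import numpy as np

rng = np.random.default_rng(0)
state = rng.normal(size=4)          # toy 4-dimensional state

# Value-based selection: Q-values for a finite set of actions, pick the argmax.
W_q = rng.normal(size=(3, 4))       # 3 discrete actions
q_values = W_q @ state
discrete_action = int(np.argmax(q_values))

# Policy gradient with a Gaussian policy: the network outputs the mean of a
# continuous action distribution, and actions are sampled from it.
W_mu = rng.normal(size=(2, 4))      # 2-dimensional continuous action
mu, sigma = W_mu @ state, 0.1
continuous_action = rng.normal(mu, sigma)

# REINFORCE-style update direction for the mean parameters:
# grad of log pi(a|s) w.r.t. mu is (a - mu) / sigma**2, scaled by the return.
episode_return = 1.0                 # placeholder return for illustration
grad_W_mu = episode_return * np.outer((continuous_action - mu) / sigma**2, state)
W_mu += 1e-3 * grad_W_mu             # gradient ascent on the objective
```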

4.10 Gradient-free optimization

The policy gradient methods presented above rely on backpropagation and gradient descent/ascent to update the parameters of the policy and maximize the objective function. Gradient descent is generally slow, sample inefficient and prone to getting stuck in local minima, but it is nevertheless the go-to optimization method for neural networks. It is, however, not the only optimization technique that can be used in deep RL. This section briefly presents some of the alternatives.

4.10.1 Cross-entropy Method (CEM)

Szita and Lörincz (2006)
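Independently of the Tetris application studied by Szita and Lörincz (2006), the principle of the cross-entropy method is to maintain a Gaussian search distribution over the parameters, sample a population of candidates, keep the best-scoring fraction (the elites) and refit the distribution on them. The sketch below applies this loop to a toy objective; the objective function and all hyperparameters are illustrative, and in RL the score would be the return obtained by running the policy with the sampled parameters.

```python
import numpy as np

def objective(theta):
    # Toy score to maximize; in RL this would be the episodic return of the
    # policy parameterized by theta.
    return -np.sum((theta - 3.0) ** 2)

rng = np.random.default_rng(0)
dim, pop_size, n_elite = 5, 50, 10
mu, sigma = np.zeros(dim), np.ones(dim)

for iteration in range(50):
    # Sample a population of candidate parameters from the current Gaussian.
    samples = rng.normal(mu, sigma, size=(pop_size, dim))
    scores = np.array([objective(s) for s in samples])
    # Keep the elite samples with the highest scores.
    elite = samples[np.argsort(scores)[-n_elite:]]
    # Refit the Gaussian on the elites (small floor to avoid premature collapse).
    mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3

print(mu)   # should approach the optimum at 3.0 in every dimension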

4.10.2 Evolutionary Search (ES)

Salimans et al. (2017)

Explanations from OpenAI: https://blog.openai.com/evolution-strategies/

Deep neuroevolution at Uber: https://eng.uber.com/deep-neuroevolution/
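The core update described in Salimans et al. (2017) and in the OpenAI blog post perturbs the parameters with Gaussian noise, evaluates each perturbation, and moves the parameters in the direction of the perturbations weighted by the returns they obtained. The sketch below shows this update on a toy fitness function; the fitness function, the hyperparameters and the simple mean-return baseline are illustrative simplifications of the full method, which additionally uses tricks such as rank-based fitness shaping, mirrored sampling and massive parallelization.

```python
import numpy as np

def fitness(theta):
    # Toy fitness; in deep RL this is the episodic return of the policy
    # parameterized by theta.
    return -np.sum(theta ** 2)

rng = np.random.default_rng(0)
dim, n_perturbations = 10, 100
alpha, sigma = 0.05, 0.1
theta = rng.normal(size=dim)

for iteration in range(200):
    # Sample Gaussian perturbations and evaluate the perturbed parameters.
    epsilon = rng.normal(size=(n_perturbations, dim))
    returns = np.array([fitness(theta + sigma * e) for e in epsilon])
    # Subtracting the mean return is a simple baseline to reduce variance.
    centered = returns - returns.mean()
    # Estimated gradient of the expected fitness: each perturbation is
    # weighted by the (centered) return it obtained.
    grad = (centered[:, None] * epsilon).mean(axis=0) / sigma
    theta += alpha * grad

print(np.round(theta, 3))   # should approach the optimum at 0
```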