11 Miscellaneous model-free algorithms

11.0.1 Stochastic Value Gradient (SVG)

Heess et al. (2015)

11.0.2 Q-Prop

Gu et al. (2016b)

11.0.3 Normalized Advantage Function (NAF)

Gu et al. (2016a)

11.0.4 Fictitious Self-Play (FSP)

Heinrich et al. (2015); Heinrich and Silver (2016)

11.1 Comparison between value-based and policy gradient methods

Having now reviewed both value-based methods (DQN and its variants) and policy gradient methods (A3C, DDPG, PPO), the question is which method to choose. While value-based methods see little activity right now, policy gradient methods are attracting a lot of attention, as they are able to learn policies in continuous action spaces, which is very important in robotics. https://flyyufelix.github.io/2017/10/12/dqn-vs-pg.html summarizes the advantages and drawbacks of policy gradient methods.

Advantages of PG:

Better convergence properties, more stable (Duan et al., 2016).
Effective in high-dimensional or continuous action spaces.
Can learn stochastic policies.

Disadvantages of PG:

Typically converge to a local rather than global optimum.
Evaluating a policy is often inefficient and has a high variance.
Worse sample efficiency (but it is getting better).

11.2 Gradient-free policy search

The policy gradient methods presented above rely on backpropagation and gradient descent/ascent to update the parameters of the policy and maximize the objective function. Gradient descent is generally slow, sample-inefficient and subject to local minima, but is nevertheless the go-to method in neural networks. However, it is not the only optimization method that can be used in deep RL. This section briefly presents some of the alternatives.

11.2.1 Cross-entropy Method (CEM)

Szita and Lörincz (2006)
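As a rough illustration of the idea, here is a minimal sketch of one common form of CEM for policy search. It assumes a diagonal Gaussian distribution over the policy parameters, which is repeatedly refitted to the best-scoring samples. The function names `cem` and `toy_return`, the hyperparameters and the noise floor are illustrative assumptions, not taken from Szita and Lörincz (2006). In a real RL setting, `score` would run one or more episodes with the sampled parameters and return the (average) return.

```python
import random
import statistics


def cem(score, dim, iterations=50, population=50, elite_frac=0.2,
        init_std=2.0, seed=0):
    """Maximize `score` by refitting a diagonal Gaussian to elite samples."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [init_std] * dim
    n_elite = max(2, int(population * elite_frac))
    for _ in range(iterations):
        # Sample candidate parameter vectors around the current mean.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(population)]
        # Keep the best-scoring candidates (the "elite" set).
        elite = sorted(samples, key=score, reverse=True)[:n_elite]
        # Refit the Gaussian to the elite set, one dimension at a time.
        columns = list(zip(*elite))
        mean = [statistics.fmean(c) for c in columns]
        # Small noise floor (an assumption) so the search never fully collapses.
        std = [statistics.pstdev(c) + 1e-3 for c in columns]
    return mean


def toy_return(theta):
    """Hypothetical stand-in for an episodic return, maximal at (1, -2)."""
    return -((theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2)


best = cem(toy_return, dim=2)
print(best)
```

No gradient of the objective is ever computed: only the ranking of the sampled parameters matters, so the return does not have to be differentiable.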

11.2.2 Evolution Strategies (ES)

Salimans et al. (2017)

Explanations from OpenAI: https://blog.openai.com/evolution-strategies/

Deep neuroevolution at Uber: https://eng.uber.com/deep-neuroevolution/
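The core update of this family of methods can be sketched as follows. The sketch uses the general form of the estimator in Salimans et al. (2017): the parameters are perturbed with Gaussian noise, each perturbation is evaluated, and the parameters move along the return-weighted average of the noise vectors. Standardizing the returns is a simplification chosen here. The names `evolution_strategies` and `toy_return` and all hyperparameter values are assumptions made for illustration, and the toy objective stands in for an episodic return.

```python
import random
import statistics


def evolution_strategies(score, dim, iterations=300, population=50,
                         sigma=0.1, lr=0.005, seed=0):
    """Maximize `score` with a basic Gaussian-perturbation ES update."""
    rng = random.Random(seed)
    theta = [0.0] * dim
    for _ in range(iterations):
        # Draw one isotropic Gaussian noise vector per population member.
        noises = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(population)]
        # Evaluate the perturbed parameters theta + sigma * eps.
        returns = [score([t + sigma * e for t, e in zip(theta, eps)])
                   for eps in noises]
        # Standardize returns (an assumption) so the step size does not
        # depend on the scale of the objective.
        mu = statistics.fmean(returns)
        sd = statistics.pstdev(returns) + 1e-8
        weights = [(r - mu) / sd for r in returns]
        # Monte Carlo estimate of the search gradient:
        # (1 / (n * sigma)) * sum_i weight_i * eps_i
        grad = [sum(w * eps[i] for w, eps in zip(weights, noises))
                / (population * sigma) for i in range(dim)]
        # Gradient ascent step on the parameters.
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta


def toy_return(theta):
    """Hypothetical stand-in for an episodic return, maximal at (1, -2)."""
    return -((theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2)


theta = evolution_strategies(toy_return, dim=2)
print(theta)
```

Each population member only needs to report a scalar return, which is what makes this scheme easy to parallelize across many workers.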