4.8 Miscellaneous model-free algorithms
4.8.1 Stochastic Value Gradient (SVG)
Heess et al. (2015)
Gu et al. (2016a)
4.8.3 Normalized Advantage Function (NAF)
Gu et al. (2016b)
4.8.4 Fictitious Self-Play (FSP)
Heinrich et al. (2015); Heinrich and Silver (2016)
4.9 Comparison between value-based and policy gradient methods
Having now reviewed both value-based methods (DQN and its variants) and policy gradient methods (A3C, DDPG, PPO), the question is which method to choose. While value-based methods currently see little new development, policy gradient methods are attracting a lot of attention, as they are able to learn policies in continuous action spaces, which is very important in robotics. https://flyyufelix.github.io/2017/10/12/dqn-vs-pg.html summarizes the advantages and drawbacks of policy gradient methods.
Advantages of PG:
- Better convergence properties, more stable (Duan et al., 2016).
- Effective in high-dimensional or continuous action spaces.
- Can learn stochastic policies.
Disadvantages of PG:
- Typically converge to a local rather than global optimum.
- Evaluating a policy is often inefficient and has a high variance.
- Worse sample efficiency (but it is getting better).
4.10 Gradient-free policy search
The policy gradient methods presented above rely on backpropagation and gradient descent/ascent to update the parameters of the policy and maximize the objective function. Gradient descent is generally slow, sample inefficient and subject to local minima, but is nevertheless the go-to method in neural networks. However, it is not the only optimization method that can be used in deep RL. This section briefly presents some of the alternatives.
4.10.1 Cross-entropy Method (CEM)
Szita and Lörincz (2006)
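As a rough illustration of the general idea, the sketch below applies a basic Gaussian cross-entropy method to a toy objective. It is not taken from Szita and Lörincz (2006): the function names (`cross_entropy_method`, `toy_return`), the hyperparameters and the quadratic "return" are assumptions made for this example. In an RL setting, `score_fn` would instead run one or more episodes with the policy parameters `theta` and report the collected return.

```python
import numpy as np


def cross_entropy_method(score_fn, dim, n_iters=50, pop_size=50,
                         elite_frac=0.2, init_std=1.0, seed=0):
    """Gradient-free maximization of score_fn over a parameter vector (sketch)."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)
    std = np.full(dim, init_std)
    n_elite = max(1, int(pop_size * elite_frac))
    for _ in range(n_iters):
        # Sample candidate parameter vectors from the current diagonal Gaussian.
        candidates = mean + std * rng.standard_normal((pop_size, dim))
        scores = np.array([score_fn(c) for c in candidates])
        # Keep the candidates with the highest scores (the "elite").
        elite = candidates[np.argsort(scores)[-n_elite:]]
        # Refit the sampling distribution to the elite set.
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-3  # small floor (assumption) to avoid collapse
    return mean


# Hypothetical stand-in for an episodic return: highest when theta == target.
target = np.array([0.5, -0.3, 0.8])


def toy_return(theta):
    return -np.sum((theta - target) ** 2)


theta_cem = cross_entropy_method(toy_return, dim=3)
print(theta_cem)
```

No gradient of `score_fn` is ever computed: only the ranking of the sampled candidates matters, which is why the method also applies to non-differentiable objectives.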
4.10.2 Evolutionary Search (ES)
Salimans et al. (2017)
Explanations from OpenAI: https://blog.openai.com/evolution-strategies/
Deep neuroevolution at Uber: https://eng.uber.com/deep-neuroevolution/
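The core update of an evolution strategy can be sketched in a few lines: perturb the parameters with Gaussian noise, score each perturbed copy, and move the parameters along the score-weighted average of the perturbations. The code below is a minimal single-process sketch in that spirit, not the distributed implementation of Salimans et al. (2017). The function names, the hyperparameters, the standardization of the scores and the toy quadratic objective are all assumptions made for this example.

```python
import numpy as np


def evolution_strategies(score_fn, dim, n_iters=300, pop_size=50,
                         sigma=0.1, lr=0.01, seed=0):
    """Gradient-free maximization of score_fn with Gaussian perturbations (sketch)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)
    for _ in range(n_iters):
        # One Gaussian perturbation per population member.
        noise = rng.standard_normal((pop_size, dim))
        scores = np.array([score_fn(theta + sigma * eps) for eps in noise])
        # Standardize the scores (assumption of this sketch) to reduce variance.
        weights = (scores - scores.mean()) / (scores.std() + 1e-8)
        # Score-weighted average of the perturbations: a search direction
        # obtained without backpropagating through score_fn.
        direction = noise.T @ weights / (pop_size * sigma)
        theta = theta + lr * direction
    return theta


# Hypothetical stand-in for an episodic return: highest when theta == target.
target = np.array([0.5, -0.3, 0.8])


def toy_return(theta):
    return -np.sum((theta - target) ** 2)


theta_es = evolution_strategies(toy_return, dim=3)
print(theta_es)
```

Unlike the cross-entropy method, which refits a sampling distribution to the best candidates, this update keeps a single parameter vector and uses every sample, weighted by its score.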