11  Miscellaneous model-free algorithms

11.0.1 Stochastic Value Gradient (SVG)

Heess et al. (2015) introduced Stochastic Value Gradients (SVG), a family of methods for continuous control that learn stochastic policies by reparameterizing the action noise and backpropagating value gradients through a learned critic (SVG(0)) or through a learned model of the dynamics (SVG(1), SVG(∞)).
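
As a rough illustration of the model-free SVG(0) variant, the sketch below (in PyTorch, with assumed network shapes and a hypothetical helper name svg0_policy_update) reparameterizes the Gaussian action as mu + sigma * eps and backpropagates the critic's value gradient through the sampled action into the policy parameters; the full algorithm additionally trains the critic off-policy from a replay buffer.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2                                    # assumed dimensions

policy = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                       nn.Linear(64, 2 * action_dim))           # outputs [mu, log_sigma]
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(),
                       nn.Linear(64, 1))                        # Q(s, a)
policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def svg0_policy_update(states):
    """One SVG(0)-style policy step on a batch of states (illustrative helper)."""
    mu, log_sigma = policy(states).chunk(2, dim=-1)
    eps = torch.randn_like(mu)               # noise drawn independently of the parameters
    actions = mu + log_sigma.exp() * eps     # reparameterized stochastic action
    q = critic(torch.cat([states, actions], dim=-1))
    loss = -q.mean()                         # ascend the value gradient through dQ/da
    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()

svg0_policy_update(torch.randn(32, state_dim))
```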

11.0.2 Q-Prop

Gu et al. (2016b) proposed Q-Prop, which combines the unbiased Monte Carlo policy gradient with an off-policy critic used as a control variate: a first-order Taylor expansion of the critic around the mean action is subtracted from the sampled advantages, and its analytic gradient is added back, reducing the variance of the estimator without introducing bias.
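
As a sketch (assuming a Gaussian policy with mean action mu_theta(s), Monte Carlo advantage estimates and an off-policy critic Q_w; the adaptive variants of the paper additionally scale the control variate per state), the vanilla Q-Prop estimator has the form:

```latex
% Vanilla Q-Prop estimator (sketch): REINFORCE term with the control variate
% \bar{A}_w subtracted, plus the analytic gradient of the critic at the mean action.
\nabla_\theta J(\theta) \approx
  \mathbb{E}\!\left[ \nabla_\theta \log \pi_\theta(a|s)\,
      \big(\hat{A}(s,a) - \bar{A}_w(s,a)\big) \right]
  + \mathbb{E}\!\left[ \nabla_a Q_w(s,a)\big|_{a=\mu_\theta(s)}\,
      \nabla_\theta \mu_\theta(s) \right],
\qquad
\bar{A}_w(s,a) = \nabla_a Q_w(s,a)\big|_{a=\mu_\theta(s)}\,\big(a - \mu_\theta(s)\big).
```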

11.0.3 Normalized Advantage Function (NAF)

Gu et al. (2016a) introduced Normalized Advantage Functions (NAF), which extend Q-learning to continuous action spaces by restricting the advantage to be a quadratic function of the action, so that the greedy action maximizing the Q-value is available in closed form.
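
A minimal NumPy sketch of this parameterization (the shapes, the helper name naf_q_value and the way the positive diagonal of L is enforced are illustrative choices, not the exact architecture of the paper): the network outputs V(s), the greedy action mu(s) and a lower-triangular matrix L(s), and the advantage is the negative quadratic form built from them.

```python
import numpy as np

def naf_q_value(a, mu, L, v):
    """Q(s,a) = V(s) - 1/2 (a - mu)^T L L^T (a - mu); the maximum over a is V(s), at a = mu."""
    P = L @ L.T                          # positive-definite matrix built from L(s)
    diff = a - mu
    advantage = -0.5 * diff @ P @ diff   # quadratic advantage, always <= 0
    return v + advantage

action_dim = 2
mu = np.array([0.3, -0.1])                        # network output: greedy action
L = np.tril(np.random.randn(action_dim, action_dim))
np.fill_diagonal(L, np.exp(np.diag(L)))           # positive diagonal via exponential
v = 1.5                                           # network output: state value
print(naf_q_value(np.zeros(action_dim), mu, L, v))  # <= v, equality only at a = mu
```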

11.0.4 Fictitious Self-Play (FSP)

Heinrich et al. (2015) introduced Fictitious Self-Play (FSP), and Heinrich and Silver (2016) its neural variant NFSP. In imperfect-information multi-agent games such as poker, each agent learns an approximate best response to its opponents with reinforcement learning and an average of its own past strategies with supervised learning, so that the average strategies converge towards a Nash equilibrium.
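
The underlying principle is classical fictitious play: repeatedly best-respond to the opponent's empirical average strategy. FSP and NFSP scale this to extensive-form games by learning the best responses with RL and the average strategies with supervised learning. The toy sketch below (plain fictitious play on rock-paper-scissors, not the actual FSP algorithm) shows the average strategies approaching the uniform Nash equilibrium.

```python
import numpy as np

# payoff[i, j] = payoff of player 1 playing i against player 2 playing j (zero-sum game)
payoff = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]], dtype=float)

counts1 = np.ones(3)   # empirical action counts, initialized uniformly
counts2 = np.ones(3)

for _ in range(50000):
    avg1 = counts1 / counts1.sum()
    avg2 = counts2 / counts2.sum()
    br1 = np.argmax(payoff @ avg2)       # best response of player 1 to player 2's average
    br2 = np.argmax(-payoff.T @ avg1)    # best response of player 2 (payoffs are negated)
    counts1[br1] += 1
    counts2[br2] += 1

# both average strategies approach [1/3, 1/3, 1/3], the Nash equilibrium
print(counts1 / counts1.sum(), counts2 / counts2.sum())
```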

11.1 Comparison between value-based and policy gradient methods

Having now reviewed both value-based methods (DQN and its variants) and policy gradient methods (A3C, DDPG, PPO), which family should one choose? While research on value-based methods is currently less active, policy gradient methods attract a lot of attention, as they can learn policies in continuous action spaces, which is essential in robotics (a short sketch contrasting the two forms of action selection is given after the lists below). https://flyyufelix.github.io/2017/10/12/dqn-vs-pg.html summarizes the advantages and drawbacks of policy gradient methods.

Advantages of PG:

  • Better convergence properties, more stable (Duan et al., 2016).
  • Effective in high-dimensional or continuous action spaces.
  • Can learn stochastic policies.

Disadvantages of PG:

  • Typically converge to a local rather than a global optimum.
  • Evaluating a policy is often inefficient and has a high variance.
  • Worse sample efficiency (although this is improving).
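
To make the points about continuous action spaces and stochastic policies concrete, the sketch below (PyTorch, with illustrative dimensions) contrasts greedy action selection from a Q-network over a small discrete action set with sampling from a Gaussian policy over a continuous action space; the argmax in the first case is exactly what becomes intractable when the action space is continuous.

```python
import torch

state = torch.randn(1, 8)

# Value-based: Q-network over 4 discrete actions, greedy selection by argmax.
q_net = torch.nn.Linear(8, 4)
discrete_action = q_net(state).argmax(dim=-1)        # exact maximization is cheap for few actions

# Policy gradient: Gaussian policy over a 2-D continuous action, stochastic by construction.
policy_net = torch.nn.Linear(8, 2 * 2)               # outputs [mu, log_sigma]
mu, log_sigma = policy_net(state).chunk(2, dim=-1)
continuous_action = torch.distributions.Normal(mu, log_sigma.exp()).sample()

print(discrete_action, continuous_action)
```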