Misc.
Averaged-DQN (Anschel et al., 2016) increases the stability and performance of DQN by replacing the single target network (a copy of the trained network) with an average of the Q-value estimates produced by the K most recently learned networks, in other words an average over many past target networks, which reduces the variance of the target.
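A minimal sketch of how such an averaged target could be computed (assuming a hypothetical `past_q_nets` list holding frozen copies of the K most recently learned networks, e.g. kept in a ring buffer):

```python
import torch

def averaged_dqn_target(next_states, rewards, dones, past_q_nets, gamma=0.99):
    """Sketch of an Averaged-DQN-style target: the bootstrap value uses the
    mean of the Q-value estimates of the K most recent networks.
    `past_q_nets` is an assumed list of frozen copies of the online network."""
    with torch.no_grad():
        # Average the Q-value estimates of the K past networks.
        q_values = torch.stack([net(next_states) for net in past_q_nets], dim=0)
        avg_q = q_values.mean(dim=0)            # shape: (batch, n_actions)
        max_avg_q = avg_q.max(dim=1).values     # greedy value of the averaged estimate
        return rewards + gamma * (1.0 - dones) * max_avg_q
```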
He et al. (2016) proposed fast reward propagation through optimality tightening to speed up learning: when rewards are sparse, many episodes are required to propagate these rare rewards to all the actions leading to them. Their method combines immediate rewards (single steps) with actual returns (as in Monte Carlo) via a constrained optimization approach.
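The sketch below illustrates the lower-bound side of the idea under simplifying assumptions: k-step returns along a short trajectory slice define lower bounds on Q(s_t, a_t), and violations of the tightest bound are penalized alongside the usual one-step TD loss. The `traj` dictionary, the function names, and the penalty weight are hypothetical, and the symmetric upper bounds derived from past transitions in the paper are omitted.

```python
import torch
import torch.nn.functional as F

def optimality_tightening_loss(q_net, target_net, traj, gamma=0.99,
                               k_max=4, penalty=4.0):
    """One-sample sketch of optimality tightening: the one-step TD loss is
    augmented with a penalty whenever Q(s_t, a_t) falls below a lower bound
    built from k-step returns. `traj` is an assumed dict of tensors for a
    short trajectory slice: states, actions, rewards (length >= k_max + 1)."""
    s, a, r = traj["states"], traj["actions"], traj["rewards"]

    q_sa = q_net(s[0:1]).squeeze(0)[a[0]]       # Q(s_t, a_t)

    with torch.no_grad():
        # Standard one-step target.
        target = r[0] + gamma * target_net(s[1:2]).max()
        # Lower bounds from k-step returns:
        # Q*(s_t, a_t) >= sum_i gamma^i r_{t+i} + gamma^k max_a Q(s_{t+k}, a)
        bounds = []
        for k in range(2, k_max + 1):
            ret = sum(gamma ** i * r[i] for i in range(k))
            bounds.append(ret + gamma ** k * target_net(s[k:k+1]).max())
        lower_bound = torch.stack(bounds).max()

    td_loss = F.smooth_l1_loss(q_sa, target)
    # Quadratic penalty when the estimate violates the tightest lower bound.
    violation = torch.clamp(lower_bound - q_sa, min=0.0)
    return td_loss + penalty * violation ** 2
```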
Never Give Up: Learning Directed Exploration Strategies (Badia et al., 2020b)
Agent57: Outperforming the Atari Human Benchmark (Badia et al., 2020a)
Human-level Atari 200x faster (Kapturowski et al., 2022)