References

Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation 10, 251–276.

Anschel, O., Baram, N., and Shimkin, N. (2016). Averaged-DQN: Variance Reduction and Stabilization for Deep Reinforcement Learning. Available at: http://arxiv.org/abs/1611.01929.

Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein GAN. Available at: http://arxiv.org/abs/1701.07875.

Arulkumaran, K., Deisenroth, M. P., Brundage, M., and Bharath, A. A. (2017). A Brief Survey of Deep Reinforcement Learning. Available at: https://arxiv.org/pdf/1708.05866.pdf.

Baird, L. C. (1993). Advantage updating. Wright-Patterson Air Force Base Available at: http://leemon.com/papers/1993b.pdf.

Bakker, B. (2001). Reinforcement Learning with Long Short-Term Memory. in Advances in Neural Information Processing Systems 14 (NIPS 2001), 1475–1482. Available at: https://papers.nips.cc/paper/1953-reinforcement-learning-with-long-short-term-memory.

Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., TB, D., et al. (2018). Distributed Distributional Deterministic Policy Gradients. Available at: http://arxiv.org/abs/1804.08617.

Bellemare, M. G., Dabney, W., and Munos, R. (2017). A Distributional Perspective on Reinforcement Learning. Available at: http://arxiv.org/abs/1707.06887.

Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., et al. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Available at: http://arxiv.org/abs/1406.1078.

Chou, P.-W., Maturana, D., and Scherer, S. (2017). Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution. in International Conference on Machine Learning Available at: http://proceedings.mlr.press/v70/chou17a/chou17a.pdf.

Clavera, I., Nagabandi, A., Fearing, R. S., Abbeel, P., Levine, S., and Finn, C. (2018). Learning to Adapt: Meta-Learning for Model-Based Control. Available at: http://arxiv.org/abs/1803.11347.

Corneil, D., Gerstner, W., and Brea, J. (2018). Efficient Model-Based Deep Reinforcement Learning with Variational State Tabulation. Available at: http://arxiv.org/abs/1802.04325.

Dabney, W., Rowland, M., Bellemare, M. G., and Munos, R. (2017). Distributional Reinforcement Learning with Quantile Regression. arXiv:1710.10044 [cs, stat]. Available at: http://arxiv.org/abs/1710.10044 [Accessed June 28, 2019].

Degris, T., White, M., and Sutton, R. S. (2012). Linear Off-Policy Actor-Critic. in Proceedings of the 2012 International Conference on Machine Learning Available at: http://arxiv.org/abs/1205.4839.

Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. (2016). Benchmarking Deep Reinforcement Learning for Continuous Control. Available at: http://arxiv.org/abs/1604.06778.

Feinberg, V., Wan, A., Stoica, I., Jordan, M. I., Gonzalez, J. E., and Levine, S. (2018). Model-Based Value Estimation for Efficient Model-Free Reinforcement Learning. Available at: http://arxiv.org/abs/1803.00101.

Gers, F. (2001). Long Short-Term Memory in Recurrent Neural Networks. Available at: http://www.felixgers.de/papers/phd.pdf.

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press Available at: http://www.deeplearningbook.org.

Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. (2014). Generative Adversarial Networks. Available at: http://arxiv.org/abs/1406.2661.

Goyal, A., Brakel, P., Fedus, W., Lillicrap, T., Levine, S., Larochelle, H., et al. (2018). Recall Traces: Backtracking Models for Efficient Reinforcement Learning. Available at: http://arxiv.org/abs/1804.00379.

Gruslys, A., Dabney, W., Azar, M. G., Piot, B., Bellemare, M., and Munos, R. (2017). The Reactor: A fast and sample-efficient Actor-Critic agent for Reinforcement Learning. Available at: http://arxiv.org/abs/1704.04651.

Gu, S., Holly, E., Lillicrap, T., and Levine, S. (2017). Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Off-Policy Updates. in Proc. ICRA Available at: http://arxiv.org/abs/1610.00633.

Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S. (2016a). Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic. Available at: http://arxiv.org/abs/1611.02247.

Gu, S., Lillicrap, T., Sutskever, I., and Levine, S. (2016b). Continuous Deep Q-Learning with Model-based Acceleration. Available at: http://arxiv.org/abs/1603.00748.

Ha, D., and Schmidhuber, J. (2018). World Models. doi:10.5281/zenodo.1207631.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017). Reinforcement Learning with Deep Energy-Based Policies. arXiv:1702.08165 [cs]. Available at: http://arxiv.org/abs/1702.08165 [Accessed February 13, 2019].

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. Available at: http://arxiv.org/abs/1801.01290.

Hafner, R., and Riedmiller, M. (2011). Reinforcement learning in feedback control. Machine Learning 84, 137–169. doi:10.1007/s10994-011-5235-x.

Harutyunyan, A., Bellemare, M. G., Stepleton, T., and Munos, R. (2016). Q(\(\lambda\)) with off-policy corrections. Available at: http://arxiv.org/abs/1602.04951.

Hausknecht, M., and Stone, P. (2015). Deep Recurrent Q-Learning for Partially Observable MDPs. Available at: http://arxiv.org/abs/1507.06527.

He, F. S., Liu, Y., Schwing, A. G., and Peng, J. (2016). Learning to Play in a Day: Faster Deep Reinforcement Learning by Optimality Tightening. Available at: http://arxiv.org/abs/1611.01606.

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep Residual Learning for Image Recognition. Available at: http://arxiv.org/abs/1512.03385.

Heess, N., Wayne, G., Silver, D., Lillicrap, T., Tassa, Y., and Erez, T. (2015). Learning continuous control policies by stochastic value gradients. Proc. International Conference on Neural Information Processing Systems, 2944–2952. Available at: http://dl.acm.org/citation.cfm?id=2969569.

Heinrich, J., Lanctot, M., and Silver, D. (2015). Fictitious Self-Play in Extensive-Form Games. 805–813. Available at: http://proceedings.mlr.press/v37/heinrich15.html.

Heinrich, J., and Silver, D. (2016). Deep Reinforcement Learning from Self-Play in Imperfect-Information Games. Available at: http://arxiv.org/abs/1603.01121.

Henaff, M., Whitney, W. F., and LeCun, Y. (2017). Model-Based Planning with Discrete and Continuous Actions. Available at: http://arxiv.org/abs/1705.07177.

Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., et al. (2017). Rainbow: Combining Improvements in Deep Reinforcement Learning. Available at: http://arxiv.org/abs/1710.02298.

Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Available at: http://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf.

Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural computation 9, 1735–80. Available at: http://www.ncbi.nlm.nih.gov/pubmed/9377276.

Ioffe, S., and Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Available at: http://arxiv.org/abs/1502.03167.

Kakade, S. (2001). A Natural Policy Gradient. in Advances in Neural Information Processing Systems 14 Available at: https://papers.nips.cc/paper/2073-a-natural-policy-gradient.pdf.

Kakade, S., and Langford, J. (2002). Approximately Optimal Approximate Reinforcement Learning. Proc. 19th International Conference on Machine Learning, 267–274. Available at: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.7.7601.

Kansky, K., Silver, T., Mély, D. A., Eldawy, M., Lázaro-Gredilla, M., Lou, X., et al. (2017). Schema Networks: Zero-shot Transfer with a Generative Causal Model of Intuitive Physics. arXiv:1706.04317 [cs]. Available at: http://arxiv.org/abs/1706.04317 [Accessed January 10, 2019].

Kingma, D. P., and Welling, M. (2013). Auto-Encoding Variational Bayes. Available at: http://arxiv.org/abs/1312.6114.

Knight, E., and Lerner, O. (2018). Natural Gradient Deep Q-learning. Available at: http://arxiv.org/abs/1803.07482.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. in Advances in Neural Information Processing Systems (NIPS) Available at: https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.

Levine, S., and Koltun, V. (2013). Guided Policy Search. in Proceedings of Machine Learning Research, 1–9. Available at: http://proceedings.mlr.press/v28/levine13.html.

Li, Y. (2017). Deep Reinforcement Learning: An Overview. Available at: http://arxiv.org/abs/1701.07274.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., et al. (2015). Continuous control with deep reinforcement learning. CoRR. Available at: http://arxiv.org/abs/1509.02971.

Lötzsch, W., Vitay, J., and Hamker, F. H. (2017). Training a deep policy gradient-based neural network with asynchronous learners on a simulated robotic problem. in INFORMATIK 2017. Gesellschaft für Informatik, eds. M. Eibl and M. Gaedke (Gesellschaft für Informatik, Bonn), 2143–2154. Available at: https://dl.gi.de/handle/20.500.12116/3986.

Machado, M. C., Bellemare, M. G., and Bowling, M. (2018). Count-Based Exploration with the Successor Representation. arXiv:1807.11622 [cs, stat]. Available at: http://arxiv.org/abs/1807.11622 [Accessed February 23, 2019].

Meuleau, N., Peshkin, L., Kaelbling, L. P., and Kim, K.-e. (2000). Off-Policy Policy Search. Available at: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.894.

Mirowski, P., Pascanu, R., Viola, F., Soyer, H., Ballard, A. J., Banino, A., et al. (2016). Learning to Navigate in Complex Environments. Available at: http://arxiv.org/abs/1611.03673.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., et al. (2016). Asynchronous Methods for Deep Reinforcement Learning. in Proc. ICML Available at: http://arxiv.org/abs/1602.01783.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., et al. (2013). Playing Atari with Deep Reinforcement Learning. Available at: http://arxiv.org/abs/1312.5602.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature 518, 529–533. doi:10.1038/nature14236.

Mousavi, S. S., Schukat, M., and Howley, E. (2018). “Deep Reinforcement Learning: An Overview,” in (Springer, Cham), 426–440. doi:10.1007/978-3-319-56991-8_32.

Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. G. (2016). Safe and Efficient Off-Policy Reinforcement Learning. Available at: http://arxiv.org/abs/1606.02647.

Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017). Bridging the Gap Between Value and Policy Based Reinforcement Learning. arXiv:1702.08892 [cs, stat]. Available at: http://arxiv.org/abs/1702.08892 [Accessed June 12, 2019].

Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., De Maria, A., et al. (2015). Massively Parallel Methods for Deep Reinforcement Learning. Available at: https://arxiv.org/pdf/1507.04296.pdf.

Nielsen, M. A. (2015). Neural Networks and Deep Learning. Determination Press Available at: http://neuralnetworksanddeeplearning.com/.

Niu, F., Recht, B., Re, C., and Wright, S. J. (2011). HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. in Proc. Advances in Neural Information Processing Systems, 21–21. Available at: http://arxiv.org/abs/1106.5730.

O’Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. (2016). Combining policy gradient and Q-learning. arXiv:1611.01626 [cs, math, stat]. Available at: http://arxiv.org/abs/1611.01626 [Accessed February 13, 2019].

Oh, J., Guo, Y., Singh, S., and Lee, H. (2018). Self-Imitation Learning. Available at: http://arxiv.org/abs/1806.05635.

Pardo, F., Levdik, V., and Kormushev, P. (2018). Q-map: A Convolutional Approach for Goal-Oriented Reinforcement Learning. Available at: http://arxiv.org/abs/1810.02927.

Pascanu, R., and Bengio, Y. (2013). Revisiting Natural Gradient for Deep Networks. Available at: https://arxiv.org/abs/1301.3584.

Peng, B., Li, X., Gao, J., Liu, J., Wong, K.-F., and Su, S.-Y. (2018). Deep Dyna-Q: Integrating Planning for Task-Completion Dialogue Policy Learning. Available at: http://arxiv.org/abs/1801.06176.

Peshkin, L., and Shelton, C. R. (2002). Learning from Scarce Experience. Available at: http://arxiv.org/abs/cs/0204043.

Peters, J., and Schaal, S. (2008). Reinforcement learning of motor skills with policy gradients. Neural Networks 21, 682–697. doi:10.1016/j.neunet.2008.02.003.

Pong, V., Gu, S., Dalal, M., and Levine, S. (2018). Temporal Difference Models: Model-Free Deep RL for Model-Based Control. Available at: http://arxiv.org/abs/1802.09081.

Popov, I., Heess, N., Lillicrap, T., Hafner, R., Barth-Maron, G., Vecerik, M., et al. (2017). Data-efficient Deep Reinforcement Learning for Dexterous Manipulation. Available at: http://arxiv.org/abs/1704.03073.

Precup, D., Sutton, R. S., and Singh, S. (2000). Eligibility traces for off-policy policy evaluation. in Proceedings of the Seventeenth International Conference on Machine Learning.

Ruder, S. (2016). An overview of gradient descent optimization algorithms. Available at: http://arxiv.org/abs/1609.04747.

Salimans, T., Ho, J., Chen, X., Sidor, S., and Sutskever, I. (2017). Evolution Strategies as a Scalable Alternative to Reinforcement Learning. Available at: http://arxiv.org/abs/1703.03864.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2015). Prioritized Experience Replay. Available at: http://arxiv.org/abs/1511.05952.

Schulman, J., Chen, X., and Abbeel, P. (2017a). Equivalence Between Policy Gradients and Soft Q-Learning. arXiv:1704.06440 [cs]. Available at: http://arxiv.org/abs/1704.06440 [Accessed June 12, 2019].

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015a). Trust Region Policy Optimization. 1889–1897. Available at: http://proceedings.mlr.press/v37/schulman15.html.

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2015b). High-Dimensional Continuous Control Using Generalized Advantage Estimation. Available at: http://arxiv.org/abs/1506.02438.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017b). Proximal Policy Optimization Algorithms. Available at: http://arxiv.org/abs/1707.06347.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., et al. (2016a). Mastering the game of Go with deep neural networks and tree search. Nature. doi:10.1038/nature16961.

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic Policy Gradient Algorithms. in Proc. ICML Proceedings of Machine Learning Research., eds. E. P. Xing and T. Jebara (PMLR), 387–395. Available at: http://proceedings.mlr.press/v32/silver14.html.

Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., et al. (2016b). The Predictron: End-To-End Learning and Planning. Available at: http://arxiv.org/abs/1612.08810.

Simonyan, K., and Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. International Conference on Learning Representations (ICRL), 1–14. doi:10.1016/j.infsof.2008.09.005.

Srinivas, A., Jabri, A., Abbeel, P., Levine, S., and Finn, C. (2018). Universal Planning Networks. Available at: http://arxiv.org/abs/1804.00645.

Sutton, R. S. (1990). Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming. Machine Learning Proceedings 1990, 216–224. doi:10.1016/B978-1-55860-141-3.50030-4.

Sutton, R. S., and Barto, A. G. (1998). Reinforcement Learning: An introduction. Cambridge, MA: MIT press.

Sutton, R. S., and Barto, A. G. (2017). Reinforcement Learning: An Introduction. 2nd ed. Cambridge, MA: MIT Press Available at: http://incompleteideas.net/book/the-book-2nd.html.

Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. in Proceedings of the 12th International Conference on Neural Information Processing Systems (MIT Press), 1057–1063. Available at: https://dl.acm.org/citation.cfm?id=3009806.

Szita, I., and Lörincz, A. (2006). Learning Tetris Using the Noisy Cross-Entropy Method. Neural Computation 18, 2936–2941. doi:10.1162/neco.2006.18.12.2936.

Tang, J., and Abbeel, P. (2010). On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient. in Adv. Neural inf. Process. Syst. Available at: http://rll.berkeley.edu/~jietang/pubs/nips10_Tang.pdf.

Todorov, E. (2008). General duality between optimal control and estimation. in 2008 47th IEEE Conference on Decision and Control, 4286–4292. doi:10.1109/CDC.2008.4739438.

Toussaint, M. (2009). Robot Trajectory Optimization Using Approximate Inference. in Proceedings of the 26th Annual International Conference on Machine Learning ICML ’09. (Montreal, Quebec, Canada: ACM), 1049–1056. doi:10.1145/1553374.1553508.

Uhlenbeck, G. E., and Ornstein, L. S. (1930). On the Theory of the Brownian Motion. Physical Review 36. doi:10.1103/PhysRev.36.823.

van Hasselt, H. (2010). Double Q-learning. in Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 2 (Curran Associates Inc.), 2613–2621. Available at: https://dl.acm.org/citation.cfm?id=2997187.

van Hasselt, H., Guez, A., and Silver, D. (2015). Deep Reinforcement Learning with Double Q-learning. Available at: http://arxiv.org/abs/1509.06461.

Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., et al. (2017). Sample Efficient Actor-Critic with Experience Replay. Available at: http://arxiv.org/abs/1611.01224.

Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and de Freitas, N. (2016). Dueling Network Architectures for Deep Reinforcement Learning. arXiv:1511.06581 [cs]. Available at: http://arxiv.org/abs/1511.06581 [Accessed November 21, 2019].

Watkins, C. J. (1989). Learning from delayed rewards.

Watter, M., Springenberg, J. T., Boedecker, J., and Riedmiller, M. (2015). Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images. Available at: https://arxiv.org/pdf/1506.07365.pdf.

Weber, T., Racanière, S., Reichert, D. P., Buesing, L., Guez, A., Rezende, D. J., et al. (2017). Imagination-Augmented Agents for Deep Reinforcement Learning. Available at: http://arxiv.org/abs/1707.06203.

Wierstra, D., Foerster, A., Peters, J., and Schmidhuber, J. (2007). “Solving Deep Memory POMDPs with Recurrent Policy Gradients,” in (Springer, Berlin, Heidelberg), 697–706. doi:10.1007/978-3-540-74690-4_71.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 229–256.

Williams, R. J., and Peng, J. (1991). Function optimization using connectionist reinforcement learning algorithms. Connection Science 3, 241–268.

Ziebart, B. D., Maas, A., Bagnell, J. A., and Dey, A. K. (2008). Maximum Entropy Inverse Reinforcement Learning. in AAAI Conference on Artificial Intelligence, 6.