References

Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. E. (2014). Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. in Proceedings of the 31st International Conference on Machine Learning (Beijing, China). https://arxiv.org/abs/1402.0555.
Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., et al. (2017). Hindsight Experience Replay. http://arxiv.org/abs/1707.01495.
Arora, S., and Doshi, P. (2019). A Survey of Inverse Reinforcement Learning: Challenges, Methods and Progress. http://arxiv.org/abs/1806.06877.
Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., van Hasselt, H., et al. (2016). Successor Features for Transfer in Reinforcement Learning. http://arxiv.org/abs/1606.05312.
Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., TB, D., et al. (2018). Distributed Distributional Deterministic Policy Gradients. http://arxiv.org/abs/1804.08617.
Barto, A. G. (2013). “Intrinsic Motivation and Reinforcement Learning,” in Intrinsically Motivated Learning in Natural and Artificial Systems, eds. G. Baldassarre and M. Mirolli (Berlin, Heidelberg: Springer), 17–47. doi:10.1007/978-3-642-32375-1_2.
Belkhale, S., Li, R., Kahn, G., McAllister, R., Calandra, R., and Levine, S. (2021). Model-Based Meta-Reinforcement Learning for Flight with Suspended Payloads. IEEE Robot. Autom. Lett., 1–1. doi:10.1109/LRA.2021.3057046.
Bellemare, M. G., Dabney, W., and Munos, R. (2017). A Distributional Perspective on Reinforcement Learning. http://arxiv.org/abs/1707.06887.
Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., et al. (2016). End to End Learning for Self-Driving Cars. http://arxiv.org/abs/1604.07316.
Burda, Y., Edwards, H., Pathak, D., Storkey, A., Darrell, T., and Efros, A. A. (2018). Large-Scale Study of Curiosity-Driven Learning. http://arxiv.org/abs/1808.04355.
Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., et al. (2021). Decision Transformer: Reinforcement Learning via Sequence Modeling. http://arxiv.org/abs/2106.01345.
Chou, P.-W., Maturana, D., and Scherer, S. (2017). Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution. in International Conference on Machine Learning http://proceedings.mlr.press/v70/chou17a/chou17a.pdf.
Dabney, W., Ostrovski, G., Silver, D., and Munos, R. (2018). Implicit Quantile Networks for Distributional Reinforcement Learning. http://arxiv.org/abs/1806.06923.
Dabney, W., Rowland, M., Bellemare, M. G., and Munos, R. (2017). Distributional Reinforcement Learning with Quantile Regression. http://arxiv.org/abs/1710.10044.
Dayan, P. (1993). Improving Generalization for Temporal Difference Learning: The Successor Representation. Neural Computation 5, 613–624. doi:10.1162/neco.1993.5.4.613.
Dayan, P., and Niv, Y. (2008). Reinforcement learning: The Good, The Bad and The Ugly. Current Opinion in Neurobiology 18, 185–196. doi:10.1016/j.conb.2008.08.003.
Degris, T., White, M., and Sutton, R. S. (2012). Linear Off-Policy Actor-Critic. in Proceedings of the 2012 International Conference on Machine Learning http://arxiv.org/abs/1205.4839.
Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., and Abbeel, P. (2016). RL$^2$: Fast Reinforcement Learning via Slow Reinforcement Learning. http://arxiv.org/abs/1611.02779.
Feinberg, V., Wan, A., Stoica, I., Jordan, M. I., Gonzalez, J. E., and Levine, S. (2018). Model-Based Value Estimation for Efficient Model-Free Reinforcement Learning. http://arxiv.org/abs/1803.00101.
Fortunato, M., Azar, M. G., Piot, B., Menick, J., Osband, I., Graves, A., et al. (2017). Noisy Networks for Exploration. http://arxiv.org/abs/1706.10295.
Frans, K., Ho, J., Chen, X., Abbeel, P., and Schulman, J. (2017). Meta Learning Shared Hierarchies. http://arxiv.org/abs/1710.09767.
Fujimoto, S., Meger, D., and Precup, D. (2019). Off-Policy Deep Reinforcement Learning without Exploration. in Proceedings of the 36th International Conference on Machine Learning (PMLR), 2052–2062. https://proceedings.mlr.press/v97/fujimoto19a.html.
Fujimoto, S., van Hoof, H., and Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. http://arxiv.org/abs/1802.09477.
Gehring, C. A. (2015). Approximate Linear Successor Representation. in The Multidisciplinary Conference on Reinforcement Learning and Decision Making (RLDM 2015). http://people.csail.mit.edu/gehring/publications/clement-gehring-rldm-2015.pdf.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press http://www.deeplearningbook.org.
Gruslys, A., Dabney, W., Azar, M. G., Piot, B., Bellemare, M., and Munos, R. (2017). The Reactor: A fast and sample-efficient Actor-Critic agent for Reinforcement Learning. http://arxiv.org/abs/1704.04651.
Gu, S., Lillicrap, T., Sutskever, I., and Levine, S. (2016). Continuous Deep Q-Learning with Model-based Acceleration. http://arxiv.org/abs/1603.00748.
Ha, D., and Eck, D. (2017). A Neural Representation of Sketch Drawings. http://arxiv.org/abs/1704.03477.
Ha, D., and Schmidhuber, J. (2018). World Models. doi:10.5281/zenodo.1207631.
Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017). Reinforcement Learning with Deep Energy-Based Policies. http://arxiv.org/abs/1702.08165.
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. http://arxiv.org/abs/1801.01290.
Hafner, D., Lillicrap, T., Ba, J., and Norouzi, M. (2020). Dream to Control: Learning Behaviors by Latent Imagination. http://arxiv.org/abs/1912.01603.
Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., et al. (2019). Learning Latent Dynamics for Planning from Pixels. http://arxiv.org/abs/1811.04551.
Hausknecht, M., and Stone, P. (2015). Deep Recurrent Q-Learning for Partially Observable MDPs. http://arxiv.org/abs/1507.06527.
Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., et al. (2017). Rainbow: Combining Improvements in Deep Reinforcement Learning. http://arxiv.org/abs/1710.02298.
Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., van Hasselt, H., et al. (2018). Distributed Prioritized Experience Replay. http://arxiv.org/abs/1803.00933.
Kakade, S., and Langford, J. (2002). Approximately Optimal Approximate Reinforcement Learning. in Proceedings of the 19th International Conference on Machine Learning, 267–274. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.7.7601.
Kapturowski, S., Ostrovski, G., Quan, J., Munos, R., and Dabney, W. (2019). Recurrent Experience Replay in Distributed Reinforcement Learning. in International Conference on Learning Representations (ICLR). https://openreview.net/pdf?id=r1lyTjAqYX.
Kendall, A., Hawke, J., Janz, D., Mazur, P., Reda, D., Allen, J.-M., et al. (2018). Learning to Drive in a Day. http://arxiv.org/abs/1807.00412.
Kidambi, R., Rajeswaran, A., Netrapalli, P., and Joachims, T. (2021). MOReL: Model-Based Offline Reinforcement Learning. http://arxiv.org/abs/2005.05951.
Kulkarni, T. D., Saeedi, A., Gautam, S., and Gershman, S. J. (2016). Deep Successor Reinforcement Learning. http://arxiv.org/abs/1606.02396.
Kurutach, T., Clavera, I., Duan, Y., Tamar, A., and Abbeel, P. (2018). Model-Ensemble Trust-Region Policy Optimization. http://arxiv.org/abs/1802.10592.
Levine, S., Kumar, A., Tucker, G., and Fu, J. (2020). Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. http://arxiv.org/abs/2005.01643.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., et al. (2015). Continuous control with deep reinforcement learning. CoRR. http://arxiv.org/abs/1509.02971.
Mao, H., Alizadeh, M., Menache, I., and Kandula, S. (2016). Resource Management with Deep Reinforcement Learning. in Proceedings of the 15th ACM Workshop on Hot Topics in Networks - HotNets ’16 (Atlanta, GA, USA: ACM Press), 50–56. doi:10.1145/3005745.3005750.
Micheli, V., Alonso, E., and Fleuret, F. (2022). Transformers are Sample-Efficient World Models. doi:10.48550/arXiv.2209.00588.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., et al. (2016). Asynchronous Methods for Deep Reinforcement Learning. in Proc. ICML http://arxiv.org/abs/1602.01783.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., et al. (2013). Playing Atari with Deep Reinforcement Learning. http://arxiv.org/abs/1312.5602.
Momennejad, I., Russek, E. M., Cheong, J. H., Botvinick, M. M., Daw, N. D., and Gershman, S. J. (2017). The successor representation in human reinforcement learning. Nature Human Behaviour 1, 680–692. doi:10.1038/s41562-017-0180-8.
Moore, A. W., and Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less time. Mach Learn 13, 103–130. doi:10.1007/BF00993104.
Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. (2017). Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning. http://arxiv.org/abs/1708.02596.
Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., De Maria, A., et al. (2015). Massively Parallel Methods for Deep Reinforcement Learning. https://arxiv.org/pdf/1507.04296.pdf.
Niu, F., Recht, B., Ré, C., and Wright, S. J. (2011). HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. in Proc. Advances in Neural Information Processing Systems. http://arxiv.org/abs/1106.5730.
Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. (2017). Curiosity-driven Exploration by Self-supervised Prediction. http://arxiv.org/abs/1705.05363.
Plappert, M., Houthooft, R., Dhariwal, P., Sidor, S., Chen, R. Y., Chen, X., et al. (2018). Parameter Space Noise for Exploration. http://arxiv.org/abs/1706.01905.
Pong, V., Gu, S., Dalal, M., and Levine, S. (2018). Temporal Difference Models: Model-Free Deep RL for Model-Based Control. http://arxiv.org/abs/1802.09081.
Popov, I., Heess, N., Lillicrap, T., Hafner, R., Barth-Maron, G., Vecerik, M., et al. (2017). Data-efficient Deep Reinforcement Learning for Dexterous Manipulation. http://arxiv.org/abs/1704.03073.
Russek, E. M., Momennejad, I., Botvinick, M. M., Gershman, S. J., and Daw, N. D. (2017). Predictive representations can link model-based reinforcement learning to model-free mechanisms. PLOS Computational Biology 13, e1005768. doi:10.1371/journal.pcbi.1005768.
Rusu, A. A., Colmenarejo, S. G., Gulcehre, C., Desjardins, G., Kirkpatrick, J., Pascanu, R., et al. (2016). Policy Distillation. http://arxiv.org/abs/1511.06295.
Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2015). Prioritized Experience Replay. http://arxiv.org/abs/1511.05952.
Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., et al. (2019). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. http://arxiv.org/abs/1911.08265.
Schulman, J., Chen, X., and Abbeel, P. (2017). Equivalence Between Policy Gradients and Soft Q-Learning. http://arxiv.org/abs/1704.06440.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015a). Trust Region Policy Optimization. in Proceedings of the 32nd International Conference on Machine Learning, 1889–1897. http://proceedings.mlr.press/v37/schulman15.html.
Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2015b). High-Dimensional Continuous Control Using Generalized Advantage Estimation. http://arxiv.org/abs/1506.02438.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489. doi:10.1038/nature16961.
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, 1140–1144. doi:10.1126/science.aar6404.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic Policy Gradient Algorithms. in Proc. ICML, Proceedings of Machine Learning Research, eds. E. P. Xing and T. Jebara (PMLR), 387–395. http://proceedings.mlr.press/v32/silver14.html.
Stachenfeld, K. L., Botvinick, M. M., and Gershman, S. J. (2017). The hippocampus as a predictive map. Nature Neuroscience 20, 1643–1653. doi:10.1038/nn.4650.
Sutton, R. S. (1990). Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming. Machine Learning Proceedings 1990, 216–224. doi:10.1016/B978-1-55860-141-3.50030-4.
Sutton, R. S., and Barto, A. G. (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.
Sutton, R. S., and Barto, A. G. (2017). Reinforcement Learning: An Introduction. 2nd ed. Cambridge, MA: MIT Press http://incompleteideas.net/book/the-book-2nd.html.
Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. in Proceedings of the 12th International Conference on Neural Information Processing Systems (MIT Press), 1057–1063. https://dl.acm.org/citation.cfm?id=3009806.
Teh, Y. W., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J., Hadsell, R., et al. (2017). Distral: Robust Multitask Reinforcement Learning. http://arxiv.org/abs/1707.04175.
Tesauro, G. (1995). “TD-Gammon: A Self-Teaching Backgammon Program,” in Applications of Neural Networks, ed. A. F. Murray (Boston, MA: Springer US), 267–285. doi:10.1007/978-1-4757-2379-3_11.
Todorov, E. (2008). General duality between optimal control and estimation. in 2008 47th IEEE Conference on Decision and Control, 4286–4292. doi:10.1109/CDC.2008.4739438.
Toussaint, M. (2009). Robot Trajectory Optimization Using Approximate Inference. in Proceedings of the 26th Annual International Conference on Machine Learning ICML ’09. (New York, NY, USA: ACM), 1049–1056. doi:10.1145/1553374.1553508.
Uhlenbeck, G. E., and Ornstein, L. S. (1930). On the Theory of the Brownian Motion. Physical Review 36. doi:10.1103/PhysRev.36.823.
van Hasselt, H., Guez, A., and Silver, D. (2015). Deep Reinforcement Learning with Double Q-learning. http://arxiv.org/abs/1509.06461.
Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., et al. (2017a). Learning to reinforcement learn. http://arxiv.org/abs/1611.05763.
Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., et al. (2017b). Sample Efficient Actor-Critic with Experience Replay. http://arxiv.org/abs/1611.01224.
Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and de Freitas, N. (2016). Dueling Network Architectures for Deep Reinforcement Learning. http://arxiv.org/abs/1511.06581.
Weber, T., Racanière, S., Reichert, D. P., Buesing, L., Guez, A., Rezende, D. J., et al. (2017). Imagination-Augmented Agents for Deep Reinforcement Learning. http://arxiv.org/abs/1707.06203.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 229–256.
Williams, R. J., and Peng, J. (1991). Function optimization using connectionist reinforcement learning algorithms. Connection Science 3, 241–268.
Ye, W., Liu, S., Kurutach, T., Abbeel, P., and Gao, Y. (2021). Mastering Atari Games with Limited Data. doi:10.48550/arXiv.2111.00210.
Zhang, J., Springenberg, J. T., Boedecker, J., and Burgard, W. (2016). Deep Reinforcement Learning with Successor Features for Navigation across Similar Environments. http://arxiv.org/abs/1612.05533.
Zhu, Y., Gordon, D., Kolve, E., Fox, D., Fei-Fei, L., Gupta, A., et al. (2017). Visual Semantic Planning using Deep Successor Representations. http://arxiv.org/abs/1705.08080.