References
Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R.
E. (2014). Taming the Monster: A Fast and
Simple Algorithm for Contextual Bandits. in
Proceedings of the 31st International Conference on
Machine Learning (Beijing, China). https://arxiv.org/abs/1402.0555.
Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R.,
Welinder, P., et al. (2017). Hindsight Experience Replay.
http://arxiv.org/abs/1707.01495.
Arora, S., and Doshi, P. (2019). A Survey of Inverse
Reinforcement Learning: Challenges,
Methods and Progress. http://arxiv.org/abs/1806.06877.
Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., van
Hasselt, H., et al. (2016). Successor Features for
Transfer in Reinforcement Learning. http://arxiv.org/abs/1606.05312.
Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., TB,
D., et al. (2018). Distributed Distributional Deterministic Policy
Gradients. http://arxiv.org/abs/1804.08617.
Barto, A. G. (2013). “Intrinsic Motivation and
Reinforcement Learning,” in Intrinsically
Motivated Learning in Natural and
Artificial Systems, eds. G. Baldassarre and M. Mirolli
(Berlin, Heidelberg: Springer), 17–47. doi:10.1007/978-3-642-32375-1_2.
Belkhale, S., Li, R., Kahn, G., McAllister, R., Calandra, R., and
Levine, S. (2021). Model-Based Meta-Reinforcement Learning
for Flight with Suspended Payloads. IEEE
Robot. Autom. Lett. doi:10.1109/LRA.2021.3057046.
Bellemare, M. G., Dabney, W., and Munos, R. (2017). A
Distributional Perspective on Reinforcement
Learning. http://arxiv.org/abs/1707.06887.
Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B.,
Goyal, P., et al. (2016). End to End Learning for
Self-Driving Cars. http://arxiv.org/abs/1604.07316.
Burda, Y., Edwards, H., Pathak, D., Storkey, A., Darrell, T., and Efros,
A. A. (2018). Large-Scale Study of Curiosity-Driven
Learning. http://arxiv.org/abs/1808.04355.
Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., et
al. (2021). Decision Transformer: Reinforcement
Learning via Sequence Modeling. http://arxiv.org/abs/2106.01345.
Chou, P.-W., Maturana, D., and Scherer, S. (2017). Improving
Stochastic Policy Gradients in Continuous
Control with Deep Reinforcement Learning using the
Beta Distribution. in International
Conference on Machine Learning http://proceedings.mlr.press/v70/chou17a/chou17a.pdf.
Dabney, W., Ostrovski, G., Silver, D., and Munos, R. (2018). Implicit
Quantile Networks for Distributional Reinforcement
Learning. http://arxiv.org/abs/1806.06923.
Dabney, W., Rowland, M., Bellemare, M. G., and Munos, R. (2017).
Distributional Reinforcement Learning with Quantile
Regression. http://arxiv.org/abs/1710.10044.
Dayan, P. (1993). Improving Generalization for
Temporal Difference Learning: The Successor
Representation. Neural Computation 5, 613–624. doi:10.1162/neco.1993.5.4.613.
Dayan, P., and Niv, Y. (2008). Reinforcement learning: The
Good, The Bad and The Ugly. Current
Opinion in Neurobiology 18, 185–196. doi:10.1016/j.conb.2008.08.003.
Degris, T., White, M., and Sutton, R. S. (2012). Linear Off-Policy
Actor-Critic. in Proceedings of the 2012 International
Conference on Machine Learning http://arxiv.org/abs/1205.4839.
Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., and
Abbeel, P. (2016). RL²: Fast Reinforcement
Learning via Slow Reinforcement Learning. http://arxiv.org/abs/1611.02779.
Feinberg, V., Wan, A., Stoica, I., Jordan, M. I., Gonzalez, J. E., and
Levine, S. (2018). Model-Based Value Estimation for
Efficient Model-Free Reinforcement Learning. http://arxiv.org/abs/1803.00101.
Fortunato, M., Azar, M. G., Piot, B., Menick, J., Osband, I., Graves,
A., et al. (2017). Noisy Networks for
Exploration. http://arxiv.org/abs/1706.10295.
Frans, K., Ho, J., Chen, X., Abbeel, P., and Schulman, J. (2017). Meta
Learning Shared Hierarchies. http://arxiv.org/abs/1710.09767.
Fujimoto, S., Meger, D., and Precup, D. (2019). Off-Policy Deep
Reinforcement Learning without Exploration. in
Proceedings of the 36th International Conference on
Machine Learning (PMLR), 2052–2062. https://proceedings.mlr.press/v97/fujimoto19a.html.
Fujimoto, S., van Hoof, H., and Meger, D. (2018). Addressing
Function Approximation Error in Actor-Critic
Methods. http://arxiv.org/abs/1802.09477.
Gehring, C. A. (2015). Approximate Linear Successor
Representation. in Reinforcement Learning and Decision Making (RLDM) http://people.csail.mit.edu/gehring/publications/clement-gehring-rldm-2015.pdf.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep
Learning. MIT Press http://www.deeplearningbook.org.
Gruslys, A., Dabney, W., Azar, M. G., Piot, B., Bellemare, M., and
Munos, R. (2017). The Reactor: A fast and
sample-efficient Actor-Critic agent for Reinforcement
Learning. http://arxiv.org/abs/1704.04651.
Gu, S., Lillicrap, T., Sutskever, I., and Levine, S. (2016). Continuous
Deep Q-Learning with Model-based
Acceleration. http://arxiv.org/abs/1603.00748.
Ha, D., and Eck, D. (2017). A Neural Representation of
Sketch Drawings. http://arxiv.org/abs/1704.03477.
Ha, D., and Schmidhuber, J. (2018). World Models. doi:10.5281/zenodo.1207631.
Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017). Reinforcement
Learning with Deep Energy-Based Policies. http://arxiv.org/abs/1702.08165.
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft
Actor-Critic: Off-Policy Maximum Entropy Deep
Reinforcement Learning with a Stochastic Actor. http://arxiv.org/abs/1801.01290.
Hafner, D., Lillicrap, T., Ba, J., and Norouzi, M. (2020). Dream to
Control: Learning Behaviors by Latent
Imagination. http://arxiv.org/abs/1912.01603.
Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H.,
et al. (2019). Learning Latent Dynamics for
Planning from Pixels. http://arxiv.org/abs/1811.04551.
Hausknecht, M., and Stone, P. (2015). Deep Recurrent
Q-Learning for Partially Observable MDPs. http://arxiv.org/abs/1507.06527.
Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G.,
Dabney, W., et al. (2017). Rainbow: Combining Improvements
in Deep Reinforcement Learning. http://arxiv.org/abs/1710.02298.
Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., van
Hasselt, H., et al. (2018). Distributed Prioritized Experience
Replay. http://arxiv.org/abs/1803.00933.
Kakade, S., and Langford, J. (2002). Approximately Optimal
Approximate Reinforcement Learning. Proc. 19th International
Conference on Machine Learning, 267–274. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.7.7601.
Kapturowski, S., Ostrovski, G., Quan, J., Munos, R., and Dabney, W.
(2019). Recurrent Experience Replay in Distributed Reinforcement
Learning. in International Conference on Learning Representations https://openreview.net/pdf?id=r1lyTjAqYX.
Kendall, A., Hawke, J., Janz, D., Mazur, P., Reda, D., Allen, J.-M., et
al. (2018). Learning to Drive in a Day. http://arxiv.org/abs/1807.00412.
Kidambi, R., Rajeswaran, A., Netrapalli, P., and Joachims, T. (2021).
MOReL: Model-Based Offline Reinforcement
Learning. http://arxiv.org/abs/2005.05951.
Kulkarni, T. D., Saeedi, A., Gautam, S., and Gershman, S. J. (2016).
Deep Successor Reinforcement Learning. http://arxiv.org/abs/1606.02396.
Kurutach, T., Clavera, I., Duan, Y., Tamar, A., and Abbeel, P. (2018).
Model-Ensemble Trust-Region Policy Optimization. http://arxiv.org/abs/1802.10592.
Levine, S., Kumar, A., Tucker, G., and Fu, J. (2020). Offline
Reinforcement Learning: Tutorial,
Review, and Perspectives on Open
Problems. http://arxiv.org/abs/2005.01643.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa,
Y., et al. (2015). Continuous control with deep reinforcement learning.
CoRR. http://arxiv.org/abs/1509.02971.
Mao, H., Alizadeh, M., Menache, I., and Kandula, S. (2016). Resource
Management with Deep Reinforcement Learning.
in Proceedings of the 15th ACM Workshop on Hot
Topics in Networks - HotNets ’16
(Atlanta, GA, USA: ACM Press), 50–56. doi:10.1145/3005745.3005750.
Micheli, V., Alonso, E., and Fleuret, F. (2022). Transformers are
Sample-Efficient World Models. doi:10.48550/arXiv.2209.00588.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley,
T., et al. (2016). Asynchronous Methods for Deep
Reinforcement Learning. in Proc. ICML http://arxiv.org/abs/1602.01783.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I.,
Wierstra, D., et al. (2013). Playing Atari with Deep
Reinforcement Learning. http://arxiv.org/abs/1312.5602.
Momennejad, I., Russek, E. M., Cheong, J. H., Botvinick, M. M., Daw, N.
D., and Gershman, S. J. (2017). The successor representation in human
reinforcement learning. Nature Human Behaviour 1, 680–692.
doi:10.1038/s41562-017-0180-8.
Moore, A. W., and Atkeson, C. G. (1993). Prioritized sweeping:
Reinforcement learning with less data and less time.
Mach Learn 13, 103–130. doi:10.1007/BF00993104.
Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. (2017). Neural
Network Dynamics for Model-Based Deep Reinforcement
Learning with Model-Free Fine-Tuning. http://arxiv.org/abs/1708.02596.
Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., De
Maria, A., et al. (2015). Massively Parallel Methods for
Deep Reinforcement Learning. http://arxiv.org/abs/1507.04296.
Niu, F., Recht, B., Ré, C., and Wright, S. J. (2011).
HOGWILD!: A Lock-Free Approach to
Parallelizing Stochastic Gradient Descent. in Proc.
Advances in Neural Information Processing
Systems http://arxiv.org/abs/1106.5730.
Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. (2017).
Curiosity-driven Exploration by Self-supervised Prediction. http://arxiv.org/abs/1705.05363.
Plappert, M., Houthooft, R., Dhariwal, P., Sidor, S., Chen, R. Y., Chen,
X., et al. (2018). Parameter Space Noise for
Exploration. http://arxiv.org/abs/1706.01905.
Pong, V., Gu, S., Dalal, M., and Levine, S. (2018). Temporal
Difference Models: Model-Free Deep RL for
Model-Based Control. http://arxiv.org/abs/1802.09081.
Popov, I., Heess, N., Lillicrap, T., Hafner, R., Barth-Maron, G.,
Vecerik, M., et al. (2017). Data-efficient Deep Reinforcement
Learning for Dexterous Manipulation. http://arxiv.org/abs/1704.03073.
Russek, E. M., Momennejad, I., Botvinick, M. M., Gershman, S. J., and
Daw, N. D. (2017). Predictive representations can link model-based
reinforcement learning to model-free mechanisms. PLOS Computational
Biology 13, e1005768. doi:10.1371/journal.pcbi.1005768.
Rusu, A. A., Colmenarejo, S. G., Gulcehre, C., Desjardins, G.,
Kirkpatrick, J., Pascanu, R., et al. (2016). Policy
Distillation. http://arxiv.org/abs/1511.06295.
Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2015). Prioritized
Experience Replay. http://arxiv.org/abs/1511.05952.
Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L.,
Schmitt, S., et al. (2019). Mastering Atari,
Go, Chess and Shogi by
Planning with a Learned Model. http://arxiv.org/abs/1911.08265.
Schulman, J., Chen, X., and Abbeel, P. (2017). Equivalence Between
Policy Gradients and Soft Q-Learning. http://arxiv.org/abs/1704.06440.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P.
(2015a). Trust Region Policy Optimization. in
Proceedings of the 32nd International Conference on
Machine Learning, 1889–1897. http://proceedings.mlr.press/v37/schulman15.html.
Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P.
(2015b). High-Dimensional Continuous Control Using Generalized
Advantage Estimation. http://arxiv.org/abs/1506.02438.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den
Driessche, G., et al. (2016). Mastering the game of Go with
deep neural networks and tree search. Nature 529, 484–489.
doi:10.1038/nature16961.
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M.,
Guez, A., et al. (2018). A general reinforcement learning algorithm that
masters chess, shogi, and Go through self-play.
Science 362, 1140–1144. doi:10.1126/science.aar6404.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and
Riedmiller, M. (2014). Deterministic Policy Gradient
Algorithms. in Proc. ICML Proceedings of
Machine Learning Research., eds. E. P. Xing and T. Jebara
(PMLR), 387–395. http://proceedings.mlr.press/v32/silver14.html.
Stachenfeld, K. L., Botvinick, M. M., and Gershman, S. J. (2017). The
hippocampus as a predictive map. Nature Neuroscience 20,
1643–1653. doi:10.1038/nn.4650.
Sutton, R. S. (1990). Integrated Architectures for
Learning, Planning, and Reacting
Based on Approximating Dynamic Programming.
Machine Learning Proceedings 1990, 216–224. doi:10.1016/B978-1-55860-141-3.50030-4.
Sutton, R. S., and Barto, A. G. (1998). Reinforcement
Learning: An Introduction.
Cambridge, MA: MIT press.
Sutton, R. S., and Barto, A. G. (2017). Reinforcement
Learning: An Introduction. 2nd ed.
Cambridge, MA: MIT Press http://incompleteideas.net/book/the-book-2nd.html.
Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (1999). Policy
gradient methods for reinforcement learning with function approximation.
in Proceedings of the 12th International Conference on
Neural Information Processing Systems (MIT
Press), 1057–1063. https://dl.acm.org/citation.cfm?id=3009806.
Teh, Y. W., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J.,
Hadsell, R., et al. (2017). Distral: Robust Multitask
Reinforcement Learning. http://arxiv.org/abs/1707.04175.
Tesauro, G. (1995). “TD-Gammon: A Self-Teaching
Backgammon Program,” in Applications of Neural
Networks, ed. A. F. Murray (Boston, MA:
Springer US), 267–285. doi:10.1007/978-1-4757-2379-3_11.
Todorov, E. (2008). General duality between optimal control and
estimation. in 2008 47th IEEE Conference on
Decision and Control, 4286–4292. doi:10.1109/CDC.2008.4739438.
Toussaint, M. (2009). Robot Trajectory Optimization Using
Approximate Inference. in Proceedings of the 26th
Annual International Conference on Machine
Learning ICML ’09. (New York, NY,
USA: ACM), 1049–1056. doi:10.1145/1553374.1553508.
Uhlenbeck, G. E., and Ornstein, L. S. (1930). On the Theory
of the Brownian Motion. Physical Review 36, 823–841. doi:10.1103/PhysRev.36.823.
van Hasselt, H., Guez, A., and Silver, D. (2015). Deep
Reinforcement Learning with Double
Q-learning. http://arxiv.org/abs/1509.06461.
Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z.,
Munos, R., et al. (2017a). Learning to reinforcement learn. http://arxiv.org/abs/1611.05763.
Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., et
al. (2017b). Sample Efficient Actor-Critic with
Experience Replay. http://arxiv.org/abs/1611.01224.
Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and de
Freitas, N. (2016). Dueling Network Architectures for
Deep Reinforcement Learning. http://arxiv.org/abs/1511.06581.
Weber, T., Racanière, S., Reichert, D. P., Buesing, L., Guez, A.,
Rezende, D. J., et al. (2017). Imagination-Augmented Agents
for Deep Reinforcement Learning. http://arxiv.org/abs/1707.06203.
Williams, R. J. (1992). Simple statistical gradient-following algorithms
for connectionist reinforcement learning. Machine Learning 8,
229–256.
Williams, R. J., and Peng, J. (1991). Function optimization using
connectionist reinforcement learning algorithms. Connection
Science 3, 241–268.
Ye, W., Liu, S., Kurutach, T., Abbeel, P., and Gao, Y. (2021). Mastering
Atari Games with Limited Data. doi:10.48550/arXiv.2111.00210.
Zhang, J., Springenberg, J. T., Boedecker, J., and Burgard, W. (2016).
Deep Reinforcement Learning with Successor
Features for Navigation across Similar
Environments. http://arxiv.org/abs/1612.05533.
Zhu, Y., Gordon, D., Kolve, E., Fox, D., Fei-Fei, L., Gupta, A., et al.
(2017). Visual Semantic Planning using Deep Successor
Representations. http://arxiv.org/abs/1705.08080.