References
Amari, S.-I. (1998). Natural gradient works efficiently in learning.
Neural Computation 10, 251–276.
Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R.,
Welinder, P., et al. (2017). Hindsight Experience Replay.
Available at: http://arxiv.org/abs/1707.01495.
Anschel, O., Baram, N., and Shimkin, N. (2016).
Averaged-DQN: Variance Reduction and
Stabilization for Deep Reinforcement Learning.
Available at: http://arxiv.org/abs/1611.01929.
Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein
GAN. Available at: http://arxiv.org/abs/1701.07875.
Badia, A. P., Piot, B., Kapturowski, S., Sprechmann, P., Vitvitskyi, A.,
Guo, D., et al. (2020a). Agent57: Outperforming the
Atari Human Benchmark. Available at: http://arxiv.org/abs/2003.13350
[Accessed January 17, 2022].
Badia, A. P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B.,
Kapturowski, S., et al. (2020b). Never Give Up:
Learning Directed Exploration Strategies. Available at: http://arxiv.org/abs/2002.06038
[Accessed January 17, 2022].
Badrinarayanan, V., Kendall, A., and Cipolla, R. (2016).
SegNet: A Deep Convolutional Encoder-Decoder
Architecture for Image Segmentation. Available at:
http://arxiv.org/abs/1511.00561
[Accessed November 29, 2020].
Baird, L. C. (1993). Advantage updating. Wright-Patterson Air Force Base
Available at: http://leemon.com/papers/1993b.pdf.
Bakker, B. (2001). Reinforcement Learning with Long
Short-Term Memory. in Advances in Neural Information
Processing Systems 14 (NIPS 2001), 1475–1482.
Available at: https://papers.nips.cc/paper/1953-reinforcement-learning-with-long-short-term-memory.
Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., TB,
D., et al. (2018). Distributed Distributional Deterministic Policy
Gradients. Available at: http://arxiv.org/abs/1804.08617.
Bellemare, M. G., Dabney, W., and Munos, R. (2017). A
Distributional Perspective on Reinforcement
Learning. Available at: http://arxiv.org/abs/1707.06887.
Bishop, C. M. (1994). Mixture Density Networks. Birmingham,
UK: Neural Computing Research Group, Aston University Available at: https://publications.aston.ac.uk/id/eprint/373/1/NCRG_94_004.pdf
[Accessed November 12, 2024].
Burda, Y., Edwards, H., Storkey, A., and Klimov, O. (2018). Exploration
by Random Network Distillation. doi:10.48550/arXiv.1810.12894.
Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., et al.
(2023). Diffusion Policy: Visuomotor Policy
Learning via Action Diffusion. Available at: https://arxiv.org/abs/2303.04137v5
[Accessed October 9, 2024].
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F.,
Schwenk, H., et al. (2014). Learning Phrase Representations
using RNN Encoder-Decoder for Statistical Machine
Translation. Available at: http://arxiv.org/abs/1406.1078.
Chou, P.-W., Maturana, D., and Scherer, S. (2017). Improving
Stochastic Policy Gradients in Continuous
Control with Deep Reinforcement Learning using the
Beta Distribution. in International
Conference on Machine Learning Available
at: http://proceedings.mlr.press/v70/chou17a/chou17a.pdf.
Clavera, I., Nagabandi, A., Fearing, R. S., Abbeel, P., Levine, S., and
Finn, C. (2018). Learning to Adapt:
Meta-Learning for Model-Based Control.
Available at: http://arxiv.org/abs/1803.11347.
Co-Reyes, J. D., Liu, Y., Gupta, A., Eysenbach, B., Abbeel, P., and
Levine, S. (2018). Self-Consistent Trajectory Autoencoder:
Hierarchical Reinforcement Learning with Trajectory
Embeddings.
Corneil, D., Gerstner, W., and Brea, J. (2018). Efficient
Model-Based Deep Reinforcement Learning with
Variational State Tabulation. Available at: http://arxiv.org/abs/1802.04325.
Dabney, W., Rowland, M., Bellemare, M. G., and Munos, R. (2017).
Distributional Reinforcement Learning with Quantile
Regression. Available at: http://arxiv.org/abs/1710.10044
[Accessed June 28, 2019].
Dayan, P., and Niv, Y. (2008). Reinforcement learning: The
Good, The Bad and The Ugly. Current
Opinion in Neurobiology 18, 185–196. doi:10.1016/j.conb.2008.08.003.
Degrave, J., Felici, F., Buchli, J., Neunert, M., Tracey, B., Carpanese,
F., et al. (2022). Magnetic control of tokamak plasmas through deep
reinforcement learning. Nature 602, 414–419. doi:10.1038/s41586-021-04301-9.
Degris, T., White, M., and Sutton, R. S. (2012). Linear Off-Policy
Actor-Critic. in Proceedings of the 2012 International
Conference on Machine Learning Available at: http://arxiv.org/abs/1205.4839.
Ding, Y., Florensa, C., Phielipp, M., and Abbeel, P. (2019).
Goal-conditioned Imitation Learning. in (Long Beach,
California: PMLR), 8. Available at: https://openreview.net/pdf?id=HkglHcSj2N.
Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., et
al. (2018). IMPALA: Scalable Distributed
Deep-RL with Importance Weighted Actor-Learner
Architectures. doi:10.48550/arXiv.1802.01561.
Feinberg, V., Wan, A., Stoica, I., Jordan, M. I., Gonzalez, J. E., and
Levine, S. (2018). Model-Based Value Estimation for
Efficient Model-Free Reinforcement Learning. Available at:
http://arxiv.org/abs/1803.00101.
Fortunato, M., Azar, M. G., Piot, B., Menick, J., Osband, I., Graves,
A., et al. (2017). Noisy Networks for
Exploration. Available at: http://arxiv.org/abs/1706.10295
[Accessed March 2, 2020].
Fujimoto, S., van Hoof, H., and Meger, D. (2018). Addressing
Function Approximation Error in Actor-Critic
Methods. Available at: http://arxiv.org/abs/1802.09477
[Accessed March 1, 2020].
Gers, F. (2001). Long Short-Term Memory in Recurrent
Neural Networks. PhD thesis, École Polytechnique Fédérale de Lausanne (EPFL). Available at: http://www.felixgers.de/papers/phd.pdf.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley,
D., Ozair, S., et al. (2014). Generative Adversarial
Networks. Available at: http://arxiv.org/abs/1406.2661.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep
Learning. MIT Press Available at: http://www.deeplearningbook.org.
Goyal, A., Brakel, P., Fedus, W., Lillicrap, T., Levine, S., Larochelle,
H., et al. (2018). Recall Traces: Backtracking
Models for Efficient Reinforcement Learning.
Available at: http://arxiv.org/abs/1804.00379.
Gruslys, A., Dabney, W., Azar, M. G., Piot, B., Bellemare, M., and
Munos, R. (2017). The Reactor: A fast and
sample-efficient Actor-Critic agent for Reinforcement
Learning. Available at: http://arxiv.org/abs/1704.04651.
Gu, S., Holly, E., Lillicrap, T., and Levine, S. (2017). Deep
Reinforcement Learning for Robotic
Manipulation with Asynchronous Off-Policy Updates.
in Proc. ICRA Available at: http://arxiv.org/abs/1610.00633.
Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S.
(2016a). Q-Prop: Sample-Efficient Policy
Gradient with An Off-Policy Critic. Available at: http://arxiv.org/abs/1611.02247.
Gu, S., Lillicrap, T., Sutskever, I., and Levine, S. (2016b). Continuous
Deep Q-Learning with Model-based
Acceleration. Available at: http://arxiv.org/abs/1603.00748.
Ha, D., and Eck, D. (2017). A Neural Representation of
Sketch Drawings. Available at: http://arxiv.org/abs/1704.03477
[Accessed January 17, 2021].
Ha, D., and Schmidhuber, J. (2018). World Models. doi:10.5281/zenodo.1207631.
Haarnoja, T., Hartikainen, K., Abbeel, P., and Levine, S. (2018a).
Latent Space Policies for Hierarchical Reinforcement
Learning. Available at: http://arxiv.org/abs/1804.02808.
Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017). Reinforcement
Learning with Deep Energy-Based Policies.
Available at: http://arxiv.org/abs/1702.08165
[Accessed February 13, 2019].
Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., et
al. (2018b). Soft Actor-Critic Algorithms and
Applications. Available at: http://arxiv.org/abs/1812.05905
[Accessed February 5, 2019].
Hafner, D., Lillicrap, T., Ba, J., and Norouzi, M. (2020). Dream to
Control: Learning Behaviors by Latent
Imagination. Available at: http://arxiv.org/abs/1912.01603
[Accessed March 24, 2020].
Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H.,
et al. (2019). Learning Latent Dynamics for
Planning from Pixels. Available at: http://arxiv.org/abs/1811.04551
[Accessed January 24, 2020].
Hafner, R., and Riedmiller, M. (2011). Reinforcement learning in
feedback control. Machine Learning 84, 137–169. doi:10.1007/s10994-011-5235-x.
Hansen, N., and Ostermeier, A. (2001). Completely Derandomized
Self-Adaptation in Evolution Strategies.
Evolutionary Computation 9, 159–195. doi:10.1162/106365601750190398.
Harutyunyan, A., Bellemare, M. G., Stepleton, T., and Munos, R. (2016).
Q(λ) with off-policy corrections. Available at: http://arxiv.org/abs/1602.04951.
Hausknecht, M., and Stone, P. (2015). Deep Recurrent
Q-Learning for Partially Observable MDPs. Available
at: http://arxiv.org/abs/1507.06527.
He, F. S., Liu, Y., Schwing, A. G., and Peng, J. (2016). Learning to
Play in a Day: Faster Deep Reinforcement
Learning by Optimality Tightening. Available at: http://arxiv.org/abs/1611.01606.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep Residual
Learning for Image Recognition. Available at: http://arxiv.org/abs/1512.03385.
Heess, N., Wayne, G., Silver, D., Lillicrap, T., Tassa, Y., and Erez, T.
(2015). Learning continuous control policies by stochastic value
gradients. Proc. International Conference on Neural Information
Processing Systems, 2944–2952. Available at: http://dl.acm.org/citation.cfm?id=2969569.
Heinrich, J., Lanctot, M., and Silver, D. (2015). Fictitious
Self-Play in Extensive-Form Games. in Proceedings of the 32nd International
Conference on Machine Learning (PMLR), 805–813.
Available at: http://proceedings.mlr.press/v37/heinrich15.html.
Heinrich, J., and Silver, D. (2016). Deep Reinforcement
Learning from Self-Play in
Imperfect-Information Games. Available at: http://arxiv.org/abs/1603.01121.
Henaff, M., Whitney, W. F., and LeCun, Y. (2017). Model-Based
Planning with Discrete and Continuous
Actions. Available at: http://arxiv.org/abs/1705.07177.
Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G.,
Dabney, W., et al. (2017). Rainbow: Combining Improvements
in Deep Reinforcement Learning. Available at: http://arxiv.org/abs/1710.02298.
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen
Netzen. Diploma thesis, Technische Universität München. Available at: http://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf.
Hochreiter, S., and Schmidhuber, J. (1997). Long Short-Term
Memory. Neural Computation 9, 1735–1780. doi:10.1162/neco.1997.9.8.1735.
Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., van
Hasselt, H., et al. (2018). Distributed Prioritized Experience
Replay. Available at: http://arxiv.org/abs/1803.00933
[Accessed December 14, 2019].
Ioffe, S., and Szegedy, C. (2015). Batch Normalization:
Accelerating Deep Network Training by Reducing
Internal Covariate Shift. Available at: http://arxiv.org/abs/1502.03167.
Kakade, S. (2001). A Natural Policy Gradient. in
Advances in Neural Information Processing Systems
14 Available at: https://papers.nips.cc/paper/2073-a-natural-policy-gradient.pdf.
Kakade, S., and Langford, J. (2002). Approximately Optimal
Approximate Reinforcement Learning. Proc. 19th International
Conference on Machine Learning, 267–274. Available at: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.7.7601.
Kansky, K., Silver, T., Mély, D. A., Eldawy, M., Lázaro-Gredilla, M.,
Lou, X., et al. (2017). Schema Networks: Zero-shot Transfer with a Generative Causal
Model of Intuitive Physics. Available at: http://arxiv.org/abs/1706.04317
[Accessed January 10, 2019].
Kapturowski, S., Campos, V., Jiang, R., Rakićević, N., van Hasselt, H.,
Blundell, C., et al. (2022). Human-level Atari 200x faster.
doi:10.48550/arXiv.2209.07550.
Kapturowski, S., Ostrovski, G., Quan, J., Munos, R., and Dabney, W.
(2019). Recurrent experience replay in distributed reinforcement
learning. in International Conference on Learning Representations (ICLR). Available at: https://openreview.net/pdf?id=r1lyTjAqYX.
Kaufmann, E., Bauersfeld, L., Loquercio, A., Müller, M., Koltun, V., and
Scaramuzza, D. (2023). Champion-level drone racing using deep
reinforcement learning. Nature 620, 982–987. doi:10.1038/s41586-023-06419-4.
Kendall, A., Hawke, J., Janz, D., Mazur, P., Reda, D., Allen, J.-M., et
al. (2018). Learning to Drive in a Day.
Available at: http://arxiv.org/abs/1807.00412
[Accessed December 19, 2018].
Kingma, D. P., and Welling, M. (2013). Auto-Encoding Variational
Bayes. Available at: http://arxiv.org/abs/1312.6114.
Knight, E., and Lerner, O. (2018). Natural Gradient
Deep Q-learning. Available at: http://arxiv.org/abs/1803.07482.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet
Classification with Deep Convolutional Neural
Networks. in Advances in Neural Information Processing
Systems (NIPS) Available at: https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
Levine, S., and Koltun, V. (2013). Guided Policy Search. in
Proceedings of Machine Learning Research, 1–9.
Available at: http://proceedings.mlr.press/v28/levine13.html.
Li, W., Zhu, Y., and Zhao, D. (2022). Missile guidance with assisted
deep reinforcement learning for head-on interception of maneuvering
target. Complex Intell. Syst. 8, 1205–1216. doi:10.1007/s40747-021-00577-6.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa,
Y., et al. (2015). Continuous control with deep reinforcement learning.
CoRR. Available at: http://arxiv.org/abs/1509.02971.
Lötzsch, W., Vitay, J., and Hamker, F. H. (2017). Training a deep policy
gradient-based neural network with asynchronous learners on a simulated
robotic problem. in INFORMATIK 2017, eds. M. Eibl
and M. Gaedke (Gesellschaft für Informatik, Bonn), 2143–2154. Available
at: https://dl.gi.de/handle/20.500.12116/3986.
Luo, J., Paduraru, C., Voicu, O., Chervonyi, Y., Munns, S., Li, J., et
al. (2022). Controlling Commercial Cooling Systems Using
Reinforcement Learning. doi:10.48550/arXiv.2211.07357.
Machado, M. C., Bellemare, M. G., and Bowling, M. (2018).
Count-Based Exploration with the Successor
Representation. Available at: http://arxiv.org/abs/1807.11622
[Accessed February 23, 2019].
Madeka, D., Torkkola, K., Eisenach, C., Luo, A., Foster, D. P., and
Kakade, S. M. (2022). Deep Inventory Management. doi:10.48550/arXiv.2210.03137.
Malibari, N., Katib, I., and Mehmood, R. (2023). Systematic
Review on Reinforcement Learning in the
Field of Fintech. doi:10.48550/arXiv.2305.07466.
Meuleau, N., Peshkin, L., Kaelbling, L. P., and Kim, K. (2000).
Off-Policy Policy Search. Available at: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.894.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley,
T., et al. (2016). Asynchronous Methods for Deep
Reinforcement Learning. in Proc. ICML
Available at: http://arxiv.org/abs/1602.01783.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I.,
Wierstra, D., et al. (2013). Playing Atari with Deep
Reinforcement Learning. Available at: http://arxiv.org/abs/1312.5602.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J.,
Bellemare, M. G., et al. (2015). Human-level control through deep
reinforcement learning. Nature 518, 529–533. doi:10.1038/nature14236.
Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. G. (2016).
Safe and Efficient Off-Policy Reinforcement Learning.
Available at: http://arxiv.org/abs/1606.02647.
Nachum, O., Gu, S., Lee, H., and Levine, S. (2018). Data-Efficient
Hierarchical Reinforcement Learning. Available at: http://arxiv.org/abs/1805.08296.
Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017). Bridging the
Gap Between Value and Policy Based Reinforcement
Learning. Available at: http://arxiv.org/abs/1702.08892
[Accessed June 12, 2019].
Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. (2017). Neural
Network Dynamics for Model-Based Deep Reinforcement
Learning with Model-Free Fine-Tuning. Available at:
http://arxiv.org/abs/1708.02596
[Accessed March 3, 2019].
Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., De
Maria, A., et al. (2015). Massively Parallel Methods for
Deep Reinforcement Learning. Available at: https://arxiv.org/pdf/1507.04296.pdf.
Nielsen, M. A. (2015). Neural Networks and Deep
Learning. Determination Press Available at: http://neuralnetworksanddeeplearning.com/.
Niu, F., Recht, B., Re, C., and Wright, S. J. (2011).
HOGWILD!: A Lock-Free Approach to
Parallelizing Stochastic Gradient Descent. in Proc.
Advances in Neural Information Processing
Systems. Available at: http://arxiv.org/abs/1106.5730.
O’Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. (2016).
Combining policy gradient and Q-learning.
Available at: http://arxiv.org/abs/1611.01626
[Accessed February 13, 2019].
Oh, J., Guo, Y., Singh, S., and Lee, H. (2018). Self-Imitation
Learning. Available at: http://arxiv.org/abs/1806.05635.
Pardo, F., Levdik, V., and Kormushev, P. (2018). Q-map: A
Convolutional Approach for Goal-Oriented
Reinforcement Learning. Available at: http://arxiv.org/abs/1810.02927.
Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. (2017).
Curiosity-driven Exploration by Self-supervised Prediction. Available at: http://arxiv.org/abs/1705.05363
[Accessed February 6, 2021].
Peng, B., Li, X., Gao, J., Liu, J., Wong, K.-F., and Su, S.-Y. (2018).
Deep Dyna-Q: Integrating Planning for
Task-Completion Dialogue Policy Learning. Available at: http://arxiv.org/abs/1801.06176.
Peshkin, L., and Shelton, C. R. (2002). Learning from Scarce
Experience. Available at: http://arxiv.org/abs/cs/0204043.
Peters, J., and Schaal, S. (2008). Reinforcement learning of motor
skills with policy gradients. Neural Networks 21, 682–697.
doi:10.1016/j.neunet.2008.02.003.
Pong, V., Gu, S., Dalal, M., and Levine, S. (2018). Temporal
Difference Models: Model-Free Deep RL for
Model-Based Control. Available at: http://arxiv.org/abs/1802.09081.
Popov, I., Heess, N., Lillicrap, T., Hafner, R., Barth-Maron, G.,
Vecerik, M., et al. (2017). Data-efficient Deep Reinforcement
Learning for Dexterous Manipulation. Available at:
http://arxiv.org/abs/1704.03073.
Precup, D., Sutton, R. S., and Singh, S. (2000). Eligibility traces for
off-policy policy evaluation. in Proceedings of the
Seventeenth International Conference on Machine
Learning.
Ronneberger, O., Fischer, P., and Brox, T. (2015). U-Net:
Convolutional Networks for Biomedical Image
Segmentation. Available at: http://arxiv.org/abs/1505.04597
[Accessed November 29, 2020].
Roy, R., Raiman, J., Kant, N., Elkin, I., Kirby, R., Siu, M., et al.
(2022). PrefixRL: Optimization of
Parallel Prefix Circuits using Deep Reinforcement
Learning. doi:10.1109/DAC18074.2021.9586094.
Ruder, S. (2016). An overview of gradient descent optimization
algorithms. Available at: http://arxiv.org/abs/1609.04747.
Rusu, A. A., Colmenarejo, S. G., Gulcehre, C., Desjardins, G.,
Kirkpatrick, J., Pascanu, R., et al. (2016). Policy
Distillation. Available at: http://arxiv.org/abs/1511.06295
[Accessed January 26, 2020].
Salimans, T., Ho, J., Chen, X., Sidor, S., and Sutskever, I. (2017).
Evolution Strategies as a Scalable Alternative
to Reinforcement Learning. Available at: http://arxiv.org/abs/1703.03864.
Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2015). Prioritized
Experience Replay. Available at: http://arxiv.org/abs/1511.05952.
Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L.,
Schmitt, S., et al. (2019). Mastering Atari,
Go, Chess and Shogi by
Planning with a Learned Model. Available at:
http://arxiv.org/abs/1911.08265
[Accessed November 24, 2019].
Schulman, J., Chen, X., and Abbeel, P. (2017a). Equivalence
Between Policy Gradients and Soft Q-Learning.
Available at: http://arxiv.org/abs/1704.06440
[Accessed June 12, 2019].
Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P.
(2015a). Trust Region Policy Optimization. in
Proceedings of the 31st International Conference on
Machine Learning, 1889–1897. Available at: http://proceedings.mlr.press/v37/schulman15.html.
Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P.
(2015b). High-Dimensional Continuous Control Using Generalized
Advantage Estimation. Available at: http://arxiv.org/abs/1506.02438.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O.
(2017b). Proximal Policy Optimization Algorithms. Available
at: http://arxiv.org/abs/1707.06347.
Sekar, R., Rybkin, O., Daniilidis, K., Abbeel, P., Hafner, D., and
Pathak, D. (2020). Planning to Explore via
Self-Supervised World Models. Available at: http://arxiv.org/abs/2005.05960
[Accessed January 29, 2024].
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den
Driessche, G., et al. (2016a). Mastering the game of Go
with deep neural networks and tree search. Nature 529, 484–489.
doi:10.1038/nature16961.
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M.,
Guez, A., et al. (2018). A general reinforcement learning algorithm that
masters chess, shogi, and Go through self-play.
Science 362, 1140–1144. doi:10.1126/science.aar6404.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and
Riedmiller, M. (2014). Deterministic Policy Gradient
Algorithms. in Proc. ICML, Proceedings of
Machine Learning Research, eds. E. P. Xing and T. Jebara
(PMLR), 387–395. Available at: http://proceedings.mlr.press/v32/silver14.html.
Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley,
T., et al. (2016b). The Predictron: End-To-End
Learning and Planning. Available at: http://arxiv.org/abs/1612.08810.
Simonyan, K., and Zisserman, A. (2015). Very Deep Convolutional
Networks for Large-Scale Image Recognition.
in International Conference on Learning Representations (ICLR),
1–14. Available at: http://arxiv.org/abs/1409.1556.
Srinivas, A., Jabri, A., Abbeel, P., Levine, S., and Finn, C. (2018).
Universal Planning Networks. Available at: http://arxiv.org/abs/1804.00645.
Sutton, R. S. (1990). Integrated Architectures for
Learning, Planning, and Reacting
Based on Approximating Dynamic Programming.
Machine Learning Proceedings 1990, 216–224. doi:10.1016/B978-1-55860-141-3.50030-4.
Sutton, R. S., and Barto, A. G. (1998). Reinforcement
Learning: An introduction. Cambridge, MA:
MIT press.
Sutton, R. S., and Barto, A. G. (2017). Reinforcement
Learning: An Introduction. 2nd ed.
Cambridge, MA: MIT Press Available at: http://incompleteideas.net/book/the-book-2nd.html.
Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (1999). Policy
gradient methods for reinforcement learning with function approximation.
in Proceedings of the 12th International Conference on
Neural Information Processing Systems (MIT Press),
1057–1063. Available at: https://dl.acm.org/citation.cfm?id=3009806.
Szita, I., and Lőrincz, A. (2006). Learning Tetris Using
the Noisy Cross-Entropy Method. Neural Computation
18, 2936–2941. doi:10.1162/neco.2006.18.12.2936.
Tang, J., and Abbeel, P. (2010). On a Connection between
Importance Sampling and the Likelihood Ratio Policy
Gradient. in Adv. Neural inf.
Process. Syst. Available at: http://rll.berkeley.edu/~jietang/pubs/nips10_Tang.pdf.
Teh, Y. W., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J.,
Hadsell, R., et al. (2017). Distral: Robust Multitask
Reinforcement Learning. Available at: http://arxiv.org/abs/1707.04175
[Accessed January 26, 2020].
Todorov, E. (2008). General duality between optimal control and
estimation. in 2008 47th IEEE Conference on
Decision and Control, 4286–4292. doi:10.1109/CDC.2008.4739438.
Toussaint, M. (2009). Robot Trajectory Optimization Using
Approximate Inference. in Proceedings of the 26th
Annual International Conference on Machine
Learning ICML ’09. (New York, NY, USA: ACM),
1049–1056. doi:10.1145/1553374.1553508.
Uhlenbeck, G. E., and Ornstein, L. S. (1930). On the Theory
of the Brownian Motion. Physical Review 36, 823–841. doi:10.1103/PhysRev.36.823.
van Hasselt, H. (2010). Double Q-learning.
in Proceedings of the 23rd International Conference on
Neural Information Processing Systems - Volume
2 (Curran Associates Inc.), 2613–2621. Available at: https://dl.acm.org/citation.cfm?id=2997187.
van Hasselt, H., Guez, A., and Silver, D. (2015). Deep
Reinforcement Learning with Double
Q-learning. Available at: http://arxiv.org/abs/1509.06461.
Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z.,
Munos, R., et al. (2017). Learning to reinforcement learn. Available at:
http://arxiv.org/abs/1611.05763
[Accessed February 5, 2021].
Wang, Z., Li, Z., Mandlekar, A., Xu, Z., Fan, J., Narang, Y., et al.
(2024). One-Step Diffusion Policy: Fast Visuomotor
Policies via Diffusion Distillation. doi:10.48550/arXiv.2410.21257.
Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and de
Freitas, N. (2016). Dueling Network Architectures for
Deep Reinforcement Learning. Available at: http://arxiv.org/abs/1511.06581
[Accessed November 21, 2019].
Watkins, C. J. (1989). Learning from delayed rewards. PhD thesis, King's College, University of Cambridge.
Watter, M., Springenberg, J. T., Boedecker, J., and Riedmiller, M.
(2015). Embed to Control: A Locally Linear Latent
Dynamics Model for Control from Raw
Images. Available at: https://arxiv.org/pdf/1506.07365.pdf.
Weber, T., Racanière, S., Reichert, D. P., Buesing, L., Guez, A.,
Rezende, D. J., et al. (2017). Imagination-Augmented Agents
for Deep Reinforcement Learning. Available at: http://arxiv.org/abs/1707.06203.
Wierstra, D., Foerster, A., Peters, J., and Schmidhuber, J. (2007).
Solving Deep Memory POMDPs with Recurrent
Policy Gradients. in International Conference on Artificial Neural Networks
(ICANN) (Springer, Berlin, Heidelberg), 697–706. doi:10.1007/978-3-540-74690-4_71.
Williams, R. J. (1992). Simple statistical gradient-following algorithms
for connectionist reinforcement learning. Machine Learning 8,
229–256.
Williams, R. J., and Peng, J. (1991). Function optimization using
connectionist reinforcement learning algorithms. Connection
Science 3, 241–268.
Wu, P., Escontrela, A., Hafner, D., Goldberg, K., and Abbeel, P. (2022).
DayDreamer: World Models for Physical
Robot Learning. doi:10.48550/arXiv.2206.14176.
Ye, W., Liu, S., Kurutach, T., Abbeel, P., and Gao, Y. (2021). Mastering
Atari Games with Limited Data. doi:10.48550/arXiv.2111.00210.
Yu, C., Liu, J., and Nemati, S. (2020). Reinforcement
Learning in Healthcare: A Survey.
doi:10.48550/arXiv.1908.08796.
Ziebart, B. D., Maas, A., Bagnell, J. A., and Dey, A. K. (2008). Maximum
Entropy Inverse Reinforcement Learning. in Proc. AAAI Conference on Artificial Intelligence.