References
Amari, S.-I. (1998). Natural gradient works efficiently in learning.
Neural Computation 10, 251–276.
Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R.,
Welinder, P., et al. (2017). Hindsight Experience Replay.
Available at: http://arxiv.org/abs/1707.01495.
Anschel, O., Baram, N., and Shimkin, N. (2016).
Averaged-DQN: Variance Reduction and
Stabilization for Deep Reinforcement Learning.
Available at: http://arxiv.org/abs/1611.01929.
Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein
GAN. Available at: http://arxiv.org/abs/1701.07875.
Badia, A. P., Piot, B., Kapturowski, S., Sprechmann, P., Vitvitskyi, A.,
Guo, D., et al. (2020a). Agent57: Outperforming the
Atari Human Benchmark. Available at: http://arxiv.org/abs/2003.13350
[Accessed January 17, 2022].
Badia, A. P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B.,
Kapturowski, S., et al. (2020b). Never Give Up:
Learning Directed Exploration Strategies. Available at: http://arxiv.org/abs/2002.06038
[Accessed January 17, 2022].
Badrinarayanan, V., Kendall, A., and Cipolla, R. (2016).
SegNet: A Deep Convolutional Encoder-Decoder
Architecture for Image Segmentation. Available at:
http://arxiv.org/abs/1511.00561
[Accessed November 29, 2020].
Baird, L. C. (1993). Advantage updating. Wright-Patterson Air Force Base
Available at: http://leemon.com/papers/1993b.pdf.
Bakker, B. (2001). Reinforcement Learning with Long
Short-Term Memory. in Advances in Neural Information
Processing Systems 14 (NIPS 2001), 1475–1482.
Available at: https://papers.nips.cc/paper/1953-reinforcement-learning-with-long-short-term-memory.
Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., TB,
D., et al. (2018). Distributed Distributional Deterministic Policy
Gradients. Available at: http://arxiv.org/abs/1804.08617.
Bellemare, M. G., Dabney, W., and Munos, R. (2017). A
Distributional Perspective on Reinforcement
Learning. Available at: http://arxiv.org/abs/1707.06887.
Bishop, C. M. (1994). Mixture Density Networks. Birmingham,
UK: Neural Computing Research Group, Aston University Available at: https://publications.aston.ac.uk/id/eprint/373/1/NCRG_94_004.pdf
[Accessed November 12, 2024].
Burda, Y., Edwards, H., Storkey, A., and Klimov, O. (2018). Exploration
by Random Network Distillation. doi:10.48550/arXiv.1810.12894.
Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., et al.
(2023). Diffusion Policy: Visuomotor Policy
Learning via Action Diffusion. Available at: https://arxiv.org/abs/2303.04137v5
[Accessed October 9, 2024].
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F.,
Schwenk, H., et al. (2014). Learning Phrase Representations
using RNN Encoder-Decoder for Statistical Machine
Translation. Available at: http://arxiv.org/abs/1406.1078.
Chou, P.-W., Maturana, D., and Scherer, S. (2017). Improving
Stochastic Policy Gradients in Continuous
Control with Deep Reinforcement Learning using the
Beta Distribution. in International
Conference on Machine Learning Available
at: http://proceedings.mlr.press/v70/chou17a/chou17a.pdf.
Clavera, I., Nagabandi, A., Fearing, R. S., Abbeel, P., Levine, S., and
Finn, C. (2018). Learning to Adapt:
Meta-Learning for Model-Based Control.
Available at: http://arxiv.org/abs/1803.11347.
Co-Reyes, J. D., Liu, Y., Gupta, A., Eysenbach, B., Abbeel, P., and
Levine, S. (2018). Self-Consistent Trajectory Autoencoder:
Hierarchical Reinforcement Learning with Trajectory
Embeddings.
Corneil, D., Gerstner, W., and Brea, J. (2018). Efficient
Model-Based Deep Reinforcement Learning with
Variational State Tabulation. Available at: http://arxiv.org/abs/1802.04325.
Dabney, W., Rowland, M., Bellemare, M. G., and Munos, R. (2017).
Distributional Reinforcement Learning with Quantile
Regression. Available at: http://arxiv.org/abs/1710.10044
[Accessed June 28, 2019].
Dayan, P., and Niv, Y. (2008). Reinforcement learning: The
Good, The Bad and The Ugly. Current
Opinion in Neurobiology 18, 185–196. doi:10.1016/j.conb.2008.08.003.
Degrave, J., Felici, F., Buchli, J., Neunert, M., Tracey, B., Carpanese,
F., et al. (2022). Magnetic control of tokamak plasmas through deep
reinforcement learning. Nature 602, 414–419. doi:10.1038/s41586-021-04301-9.
Degris, T., White, M., and Sutton, R. S. (2012). Linear Off-Policy
Actor-Critic. in Proceedings of the 2012 International
Conference on Machine Learning Available at: http://arxiv.org/abs/1205.4839.
Ding, Y., Florensa, C., Phielipp, M., and Abbeel, P. (2019).
Goal-conditioned Imitation Learning. in (Long Beach,
California: PMLR), 8. Available at: https://openreview.net/pdf?id=HkglHcSj2N.
Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., et
al. (2018). IMPALA: Scalable Distributed
Deep-RL with Importance Weighted Actor-Learner
Architectures. doi:10.48550/arXiv.1802.01561.
Feinberg, V., Wan, A., Stoica, I., Jordan, M. I., Gonzalez, J. E., and
Levine, S. (2018). Model-Based Value Estimation for
Efficient Model-Free Reinforcement Learning. Available at:
http://arxiv.org/abs/1803.00101.
Fortunato, M., Azar, M. G., Piot, B., Menick, J., Osband, I., Graves,
A., et al. (2017). Noisy Networks for
Exploration. Available at: http://arxiv.org/abs/1706.10295
[Accessed March 2, 2020].
Fujimoto, S., van Hoof, H., and Meger, D. (2018). Addressing
Function Approximation Error in Actor-Critic
Methods. Available at: http://arxiv.org/abs/1802.09477
[Accessed March 1, 2020].
Gers, F. (2001). Long Short-Term Memory in Recurrent
Neural Networks. PhD thesis, École Polytechnique Fédérale de Lausanne (EPFL). Available at: http://www.felixgers.de/papers/phd.pdf.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley,
D., Ozair, S., et al. (2014). Generative Adversarial
Networks. Available at: http://arxiv.org/abs/1406.2661.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep
Learning. MIT Press Available at: http://www.deeplearningbook.org.
Goyal, A., Brakel, P., Fedus, W., Lillicrap, T., Levine, S., Larochelle,
H., et al. (2018). Recall Traces: Backtracking
Models for Efficient Reinforcement Learning.
Available at: http://arxiv.org/abs/1804.00379.
Gruslys, A., Dabney, W., Azar, M. G., Piot, B., Bellemare, M., and
Munos, R. (2017). The Reactor: A fast and
sample-efficient Actor-Critic agent for Reinforcement
Learning. Available at: http://arxiv.org/abs/1704.04651.
Gu, S., Holly, E., Lillicrap, T., and Levine, S. (2017). Deep
Reinforcement Learning for Robotic
Manipulation with Asynchronous Off-Policy Updates.
in Proc. ICRA Available at: http://arxiv.org/abs/1610.00633.
Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S.
(2016a). Q-Prop: Sample-Efficient Policy
Gradient with An Off-Policy Critic. Available at: http://arxiv.org/abs/1611.02247.
Gu, S., Lillicrap, T., Sutskever, I., and Levine, S. (2016b). Continuous
Deep Q-Learning with Model-based
Acceleration. Available at: http://arxiv.org/abs/1603.00748.
Ha, D., and Eck, D. (2017). A Neural Representation of
Sketch Drawings. Available at: http://arxiv.org/abs/1704.03477
[Accessed January 17, 2021].
Ha, D., and Schmidhuber, J. (2018). World Models. doi:10.5281/zenodo.1207631.
Haarnoja, T., Hartikainen, K., Abbeel, P., and Levine, S. (2018a).
Latent Space Policies for Hierarchical Reinforcement
Learning. Available at: http://arxiv.org/abs/1804.02808.
Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017). Reinforcement
Learning with Deep Energy-Based Policies.
Available at: http://arxiv.org/abs/1702.08165
[Accessed February 13, 2019].
Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., et
al. (2018b). Soft Actor-Critic Algorithms and
Applications. Available at: http://arxiv.org/abs/1812.05905
[Accessed February 5, 2019].
Hafner, D., Lillicrap, T., Ba, J., and Norouzi, M. (2020). Dream to
Control: Learning Behaviors by Latent
Imagination. Available at: http://arxiv.org/abs/1912.01603
[Accessed March 24, 2020].
Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H.,
et al. (2019). Learning Latent Dynamics for
Planning from Pixels. Available at: http://arxiv.org/abs/1811.04551
[Accessed January 24, 2020].
Hafner, R., and Riedmiller, M. (2011). Reinforcement learning in
feedback control. Machine Learning 84, 137–169. doi:10.1007/s10994-011-5235-x.
Hansen, N., and Ostermeier, A. (2001). Completely Derandomized
Self-Adaptation in Evolution Strategies.
Evolutionary Computation 9, 159–195. doi:10.1162/106365601750190398.
Harutyunyan, A., Bellemare, M. G., Stepleton, T., and Munos, R. (2016).
Q(λ) with off-policy corrections. Available at: http://arxiv.org/abs/1602.04951.
Hausknecht, M., and Stone, P. (2015). Deep Recurrent
Q-Learning for Partially Observable MDPs. Available
at: http://arxiv.org/abs/1507.06527.
He, F. S., Liu, Y., Schwing, A. G., and Peng, J. (2016). Learning to
Play in a Day: Faster Deep Reinforcement
Learning by Optimality Tightening. Available at: http://arxiv.org/abs/1611.01606.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep Residual
Learning for Image Recognition. Available at: http://arxiv.org/abs/1512.03385.
Heess, N., Wayne, G., Silver, D., Lillicrap, T., Tassa, Y., and Erez, T.
(2015). Learning continuous control policies by stochastic value
gradients. Proc. International Conference on Neural Information
Processing Systems, 2944–2952. Available at: http://dl.acm.org/citation.cfm?id=2969569.
Heinrich, J., Lanctot, M., and Silver, D. (2015). Fictitious
Self-Play in Extensive-Form Games. in Proceedings of the 32nd International
Conference on Machine Learning (PMLR), 805–813.
Available at: http://proceedings.mlr.press/v37/heinrich15.html.
Heinrich, J., and Silver, D. (2016). Deep Reinforcement
Learning from Self-Play in
Imperfect-Information Games. Available at: http://arxiv.org/abs/1603.01121.
Henaff, M., Whitney, W. F., and LeCun, Y. (2017). Model-Based
Planning with Discrete and Continuous
Actions. Available at: http://arxiv.org/abs/1705.07177.
Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G.,
Dabney, W., et al. (2017). Rainbow: Combining Improvements
in Deep Reinforcement Learning. Available at: http://arxiv.org/abs/1710.02298.
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen
Netzen. Diploma thesis, Technische Universität München. Available at: http://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf.
Hochreiter, S., and Schmidhuber, J. (1997). Long Short-Term
Memory. Neural Computation 9, 1735–1780. doi:10.1162/neco.1997.9.8.1735.
Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., van
Hasselt, H., et al. (2018). Distributed Prioritized Experience
Replay. Available at: http://arxiv.org/abs/1803.00933
[Accessed December 14, 2019].
Ioffe, S., and Szegedy, C. (2015). Batch Normalization:
Accelerating Deep Network Training by Reducing
Internal Covariate Shift. Available at: http://arxiv.org/abs/1502.03167.
Kakade, S. (2001). A Natural Policy Gradient. in
Advances in Neural Information Processing Systems
14 Available at: https://papers.nips.cc/paper/2073-a-natural-policy-gradient.pdf.
Kakade, S., and Langford, J. (2002). Approximately Optimal
Approximate Reinforcement Learning. Proc. 19th International
Conference on Machine Learning, 267–274. Available at: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.7.7601.
Kansky, K., Silver, T., Mély, D. A., Eldawy, M., Lázaro-Gredilla, M.,
Lou, X., et al. (2017). Schema Networks: Zero-shot Transfer with a Generative Causal
Model of Intuitive Physics. Available at: http://arxiv.org/abs/1706.04317
[Accessed January 10, 2019].
Kapturowski, S., Campos, V., Jiang, R., Rakićević, N., van Hasselt, H.,
Blundell, C., et al. (2022). Human-level Atari 200x faster.
doi:10.48550/arXiv.2209.07550.
Kapturowski, S., Ostrovski, G., Quan, J., Munos, R., and Dabney, W.
(2019). Recurrent experience replay in distributed reinforcement
learning. in International Conference on Learning Representations (ICLR). Available at: https://openreview.net/pdf?id=r1lyTjAqYX.
Kaufmann, E., Bauersfeld, L., Loquercio, A., Müller, M., Koltun, V., and
Scaramuzza, D. (2023). Champion-level drone racing using deep
reinforcement learning. Nature 620, 982–987. doi:10.1038/s41586-023-06419-4.
Kendall, A., Hawke, J., Janz, D., Mazur, P., Reda, D., Allen, J.-M., et
al. (2018). Learning to Drive in a Day.
Available at: http://arxiv.org/abs/1807.00412
[Accessed December 19, 2018].
Kingma, D. P., and Welling, M. (2013). Auto-Encoding Variational
Bayes. Available at: http://arxiv.org/abs/1312.6114.
Knight, E., and Lerner, O. (2018). Natural Gradient
Deep Q-learning. Available at: http://arxiv.org/abs/1803.07482.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet
Classification with Deep Convolutional Neural
Networks. in Advances in Neural Information Processing
Systems (NIPS) Available at: https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
Levine, S., and Koltun, V. (2013). Guided Policy Search. in
Proceedings of Machine Learning Research, 1–9.
Available at: http://proceedings.mlr.press/v28/levine13.html.
Li, W., Zhu, Y., and Zhao, D. (2022). Missile guidance with assisted
deep reinforcement learning for head-on interception of maneuvering
target. Complex Intell. Syst. 8, 1205–1216. doi:10.1007/s40747-021-00577-6.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa,
Y., et al. (2015). Continuous control with deep reinforcement learning.
CoRR. Available at: http://arxiv.org/abs/1509.02971.
Lötzsch, W., Vitay, J., and Hamker, F. H. (2017). Training a deep policy
gradient-based neural network with asynchronous learners on a simulated
robotic problem. in INFORMATIK 2017, eds. M. Eibl
and M. Gaedke (Gesellschaft für Informatik, Bonn), 2143–2154. Available
at: https://dl.gi.de/handle/20.500.12116/3986.
Luo, J., Paduraru, C., Voicu, O., Chervonyi, Y., Munns, S., Li, J., et
al. (2022). Controlling Commercial Cooling Systems Using
Reinforcement Learning. doi:10.48550/arXiv.2211.07357.
Machado, M. C., Bellemare, M. G., and Bowling, M. (2018).
Count-Based Exploration with the Successor
Representation. Available at: http://arxiv.org/abs/1807.11622
[Accessed February 23, 2019].
Madeka, D., Torkkola, K., Eisenach, C., Luo, A., Foster, D. P., and
Kakade, S. M. (2022). Deep Inventory Management. doi:10.48550/arXiv.2210.03137.
Malibari, N., Katib, I., and Mehmood, R. (2023). Systematic
Review on Reinforcement Learning in the
Field of Fintech. doi:10.48550/arXiv.2305.07466.
Meuleau, N., Peshkin, L., Kaelbling, L. P., and Kim, K. (2000).
Off-Policy Policy Search. Available at: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.894.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley,
T., et al. (2016). Asynchronous Methods for Deep
Reinforcement Learning. in Proc. ICML
Available at: http://arxiv.org/abs/1602.01783.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I.,
Wierstra, D., et al. (2013). Playing Atari with Deep
Reinforcement Learning. Available at: http://arxiv.org/abs/1312.5602.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J.,
Bellemare, M. G., et al. (2015). Human-level control through deep
reinforcement learning. Nature 518, 529–533. doi:10.1038/nature14236.
Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. G. (2016).
Safe and Efficient Off-Policy Reinforcement Learning.
Available at: http://arxiv.org/abs/1606.02647.
Nachum, O., Gu, S., Lee, H., and Levine, S. (2018). Data-Efficient
Hierarchical Reinforcement Learning. Available at: http://arxiv.org/abs/1805.08296.
Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017). Bridging the
Gap Between Value and Policy Based Reinforcement
Learning. Available at: http://arxiv.org/abs/1702.08892
[Accessed June 12, 2019].
Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. (2017). Neural
Network Dynamics for Model-Based Deep Reinforcement
Learning with Model-Free Fine-Tuning. Available at:
http://arxiv.org/abs/1708.02596
[Accessed March 3, 2019].
Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., De
Maria, A., et al. (2015). Massively Parallel Methods for
Deep Reinforcement Learning. Available at: https://arxiv.org/pdf/1507.04296.pdf.
Nielsen, M. A. (2015). Neural Networks and Deep
Learning. Determination Press Available at: http://neuralnetworksanddeeplearning.com/.
Niu, F., Recht, B., Re, C., and Wright, S. J. (2011).
HOGWILD!: A Lock-Free Approach to
Parallelizing Stochastic Gradient Descent. in Proc.
Advances in Neural Information Processing
Systems. Available at: http://arxiv.org/abs/1106.5730.
O’Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. (2016).
Combining policy gradient and Q-learning.
Available at: http://arxiv.org/abs/1611.01626
[Accessed February 13, 2019].
Oh, J., Guo, Y., Singh, S., and Lee, H. (2018). Self-Imitation
Learning. Available at: http://arxiv.org/abs/1806.05635.
Pardo, F., Levdik, V., and Kormushev, P. (2018). Q-map: A
Convolutional Approach for Goal-Oriented
Reinforcement Learning. Available at: http://arxiv.org/abs/1810.02927.
Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. (2017).
Curiosity-driven Exploration by Self-supervised Prediction. Available at: http://arxiv.org/abs/1705.05363
[Accessed February 6, 2021].
Peng, B., Li, X., Gao, J., Liu, J., Wong, K.-F., and Su, S.-Y. (2018).
Deep Dyna-Q: Integrating Planning for
Task-Completion Dialogue Policy Learning. Available at: http://arxiv.org/abs/1801.06176.
Peshkin, L., and Shelton, C. R. (2002). Learning from Scarce
Experience. Available at: http://arxiv.org/abs/cs/0204043.
Peters, J., and Schaal, S. (2008). Reinforcement learning of motor
skills with policy gradients. Neural Networks 21, 682–697.
doi:10.1016/j.neunet.2008.02.003.
Pong, V., Gu, S., Dalal, M., and Levine, S. (2018). Temporal
Difference Models: Model-Free Deep RL for
Model-Based Control. Available at: http://arxiv.org/abs/1802.09081.
Popov, I., Heess, N., Lillicrap, T., Hafner, R., Barth-Maron, G.,
Vecerik, M., et al. (2017). Data-efficient Deep Reinforcement
Learning for Dexterous Manipulation. Available at:
http://arxiv.org/abs/1704.03073.
Precup, D., Sutton, R. S., and Singh, S. (2000). Eligibility traces for
off-policy policy evaluation. in Proceedings of the
Seventeenth International Conference on Machine
Learning.
Ronneberger, O., Fischer, P., and Brox, T. (2015). U-Net:
Convolutional Networks for Biomedical Image
Segmentation. Available at: http://arxiv.org/abs/1505.04597
[Accessed November 29, 2020].
Roy, R., Raiman, J., Kant, N., Elkin, I., Kirby, R., Siu, M., et al.
(2022). PrefixRL: Optimization of
Parallel Prefix Circuits using Deep Reinforcement
Learning. doi:10.1109/DAC18074.2021.9586094.
Ruder, S. (2016). An overview of gradient descent optimization
algorithms. Available at: http://arxiv.org/abs/1609.04747.
Rusu, A. A., Colmenarejo, S. G., Gulcehre, C., Desjardins, G.,
Kirkpatrick, J., Pascanu, R., et al. (2016). Policy
Distillation. Available at: http://arxiv.org/abs/1511.06295
[Accessed January 26, 2020].
Salimans, T., Ho, J., Chen, X., Sidor, S., and Sutskever, I. (2017).
Evolution Strategies as a Scalable Alternative
to Reinforcement Learning. Available at: http://arxiv.org/abs/1703.03864.
Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2015). Prioritized
Experience Replay. Available at: http://arxiv.org/abs/1511.05952.
Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L.,
Schmitt, S., et al. (2019). Mastering Atari,
Go, Chess and Shogi by
Planning with a Learned Model. Available at:
http://arxiv.org/abs/1911.08265
[Accessed November 24, 2019].
Schulman, J., Chen, X., and Abbeel, P. (2017a). Equivalence
Between Policy Gradients and Soft Q-Learning.
Available at: http://arxiv.org/abs/1704.06440
[Accessed June 12, 2019].
Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P.
(2015a). Trust Region Policy Optimization. in
Proceedings of the 31st International Conference on
Machine Learning, 1889–1897. Available at: http://proceedings.mlr.press/v37/schulman15.html.
Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P.
(2015b). High-Dimensional Continuous Control Using Generalized
Advantage Estimation. Available at: http://arxiv.org/abs/1506.02438.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O.
(2017b). Proximal Policy Optimization Algorithms. Available
at: http://arxiv.org/abs/1707.06347.
Sekar, R., Rybkin, O., Daniilidis, K., Abbeel, P., Hafner, D., and
Pathak, D. (2020). Planning to Explore via
Self-Supervised World Models. Available at: http://arxiv.org/abs/2005.05960
[Accessed January 29, 2024].
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den
Driessche, G., et al. (2016a). Mastering the game of Go
with deep neural networks and tree search. Nature 529, 484–489.
doi:10.1038/nature16961.
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M.,
Guez, A., et al. (2018). A general reinforcement learning algorithm that
masters chess, shogi, and Go through self-play.
Science 362, 1140–1144. doi:10.1126/science.aar6404.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and
Riedmiller, M. (2014). Deterministic Policy Gradient
Algorithms. in Proc. ICML, Proceedings of
Machine Learning Research, eds. E. P. Xing and T. Jebara
(PMLR), 387–395. Available at: http://proceedings.mlr.press/v32/silver14.html.
Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley,
T., et al. (2016b). The Predictron: End-To-End
Learning and Planning. Available at: http://arxiv.org/abs/1612.08810.
Simonyan, K., and Zisserman, A. (2015). Very Deep Convolutional
Networks for Large-Scale Image Recognition.
in International Conference on Learning Representations (ICLR),
1–14. Available at: http://arxiv.org/abs/1409.1556.
Srinivas, A., Jabri, A., Abbeel, P., Levine, S., and Finn, C. (2018).
Universal Planning Networks. Available at: http://arxiv.org/abs/1804.00645.
Sutton, R. S. (1990). Integrated Architectures for
Learning, Planning, and Reacting
Based on Approximating Dynamic Programming.
Machine Learning Proceedings 1990, 216–224. doi:10.1016/B978-1-55860-141-3.50030-4.
Sutton, R. S., and Barto, A. G. (1998). Reinforcement
Learning: An introduction. Cambridge, MA:
MIT press.
Sutton, R. S., and Barto, A. G. (2017). Reinforcement
Learning: An Introduction. 2nd ed.
Cambridge, MA: MIT Press Available at: http://incompleteideas.net/book/the-book-2nd.html.
Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (1999). Policy
gradient methods for reinforcement learning with function approximation.
in Proceedings of the 12th International Conference on
Neural Information Processing Systems (MIT Press),
1057–1063. Available at: https://dl.acm.org/citation.cfm?id=3009806.
Szita, I., and Lőrincz, A. (2006). Learning Tetris Using
the Noisy Cross-Entropy Method. Neural Computation
18, 2936–2941. doi:10.1162/neco.2006.18.12.2936.
Tang, J., and Abbeel, P. (2010). On a Connection between
Importance Sampling and the Likelihood Ratio Policy
Gradient. in Adv. Neural inf.
Process. Syst. Available at: http://rll.berkeley.edu/~jietang/pubs/nips10_Tang.pdf.
Teh, Y. W., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J.,
Hadsell, R., et al. (2017). Distral: Robust Multitask
Reinforcement Learning. Available at: http://arxiv.org/abs/1707.04175
[Accessed January 26, 2020].
Todorov, E. (2008). General duality between optimal control and
estimation. in 2008 47th IEEE Conference on
Decision and Control, 4286–4292. doi:10.1109/CDC.2008.4739438.
Toussaint, M. (2009). Robot Trajectory Optimization Using
Approximate Inference. in Proceedings of the 26th
Annual International Conference on Machine
Learning ICML ’09. (New York, NY, USA: ACM),
1049–1056. doi:10.1145/1553374.1553508.
Uhlenbeck, G. E., and Ornstein, L. S. (1930). On the Theory
of the Brownian Motion. Physical Review 36, 823–841. doi:10.1103/PhysRev.36.823.
van Hasselt, H. (2010). Double Q-learning.
in Proceedings of the 23rd International Conference on
Neural Information Processing Systems - Volume
2 (Curran Associates Inc.), 2613–2621. Available at: https://dl.acm.org/citation.cfm?id=2997187.
van Hasselt, H., Guez, A., and Silver, D. (2015). Deep
Reinforcement Learning with Double
Q-learning. Available at: http://arxiv.org/abs/1509.06461.
Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z.,
Munos, R., et al. (2017). Learning to reinforcement learn. Available at:
http://arxiv.org/abs/1611.05763
[Accessed February 5, 2021].
Wang, Z., Li, Z., Mandlekar, A., Xu, Z., Fan, J., Narang, Y., et al.
(2024). One-Step Diffusion Policy: Fast Visuomotor
Policies via Diffusion Distillation. doi:10.48550/arXiv.2410.21257.
Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and de
Freitas, N. (2016). Dueling Network Architectures for
Deep Reinforcement Learning. Available at: http://arxiv.org/abs/1511.06581
[Accessed November 21, 2019].
Watkins, C. J. (1989). Learning from delayed rewards. PhD thesis, King's College, University of Cambridge.
Watter, M., Springenberg, J. T., Boedecker, J., and Riedmiller, M.
(2015). Embed to Control: A Locally Linear Latent
Dynamics Model for Control from Raw
Images. Available at: https://arxiv.org/pdf/1506.07365.pdf.
Weber, T., Racanière, S., Reichert, D. P., Buesing, L., Guez, A.,
Rezende, D. J., et al. (2017). Imagination-Augmented Agents
for Deep Reinforcement Learning. Available at: http://arxiv.org/abs/1707.06203.
Wierstra, D., Foerster, A., Peters, J., and Schmidhuber, J. (2007).
Solving Deep Memory POMDPs with Recurrent
Policy Gradients. in International Conference on Artificial Neural Networks
(ICANN) (Springer, Berlin, Heidelberg), 697–706. doi:10.1007/978-3-540-74690-4_71.
Williams, R. J. (1992). Simple statistical gradient-following algorithms
for connectionist reinforcement learning. Machine Learning 8,
229–256.
Williams, R. J., and Peng, J. (1991). Function optimization using
connectionist reinforcement learning algorithms. Connection
Science 3, 241–268.
Wu, P., Escontrela, A., Hafner, D., Goldberg, K., and Abbeel, P. (2022).
DayDreamer: World Models for Physical
Robot Learning. doi:10.48550/arXiv.2206.14176.
Ye, W., Liu, S., Kurutach, T., Abbeel, P., and Gao, Y. (2021). Mastering
Atari Games with Limited Data. doi:10.48550/arXiv.2111.00210.
Yu, C., Liu, J., and Nemati, S. (2020). Reinforcement
Learning in Healthcare: A Survey.
doi:10.48550/arXiv.1908.08796.
Ziebart, B. D., Maas, A., Bagnell, J. A., and Dey, A. K. (2008). Maximum
Entropy Inverse Reinforcement Learning. in Proc. AAAI Conference on Artificial Intelligence.