References
Agrawal, P., Nair, A., Abbeel, P., Malik, J., and Levine, S. (2016). Learning to Poke by Poking: Experiential Learning of Intuitive Physics. Available at: http://arxiv.org/abs/1606.07419.
Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation 10, 251–276.
Amarjyoti, S. (2017). Deep Reinforcement Learning for Robotic Manipulation - The state of the art. Available at: http://arxiv.org/abs/1701.08878.
Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., et al. (2017). Hindsight Experience Replay. Available at: http://arxiv.org/abs/1707.01495.
Anschel, O., Baram, N., and Shimkin, N. (2016). Averaged-DQN: Variance Reduction and Stabilization for Deep Reinforcement Learning. Available at: http://arxiv.org/abs/1611.01929.
Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein GAN. Available at: http://arxiv.org/abs/1701.07875.
Arulkumaran, K., Deisenroth, M. P., Brundage, M., and Bharath, A. A. (2017). A Brief Survey of Deep Reinforcement Learning. Available at: https://arxiv.org/pdf/1708.05866.pdf.
Ba, J., Mnih, V., and Kavukcuoglu, K. (2014). Multiple Object Recognition with Visual Attention. Available at: http://arxiv.org/abs/1412.7755.
Baird, L. C. (1993). Advantage updating. Technical report WL-TR-93-1146, Wright-Patterson Air Force Base. Available at: http://leemon.com/papers/1993b.pdf.
Bakker, B. (2001). Reinforcement Learning with Long Short-Term Memory. in Advances in Neural Information Processing Systems 14 (NIPS 2001), 1475–1482. Available at: https://papers.nips.cc/paper/1953-reinforcement-learning-with-long-short-term-memory.
Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., TB, D., et al. (2018). Distributed Distributional Deterministic Policy Gradients. Available at: http://arxiv.org/abs/1804.08617.
Bellemare, M. G., Dabney, W., and Munos, R. (2017). A Distributional Perspective on Reinforcement Learning. Available at: http://arxiv.org/abs/1707.06887.
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., et al. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Available at: http://arxiv.org/abs/1406.1078.
Chou, P.-W., Maturana, D., and Scherer, S. (2017). Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution. in International Conference on Machine Learning. Available at: http://proceedings.mlr.press/v70/chou17a/chou17a.pdf.
Clavera, I., Nagabandi, A., Fearing, R. S., Abbeel, P., Levine, S., and Finn, C. (2018). Learning to Adapt: Meta-Learning for Model-Based Control. Available at: http://arxiv.org/abs/1803.11347.
Co-Reyes, J. D., Liu, Y., Gupta, A., Eysenbach, B., Abbeel, P., and Levine, S. (2018). Self-Consistent Trajectory Autoencoder: Hierarchical Reinforcement Learning with Trajectory Embeddings. in Proceedings of the 35th International Conference on Machine Learning (ICML).
Corneil, D., Gerstner, W., and Brea, J. (2018). Efficient Model-Based Deep Reinforcement Learning with Variational State Tabulation. Available at: http://arxiv.org/abs/1802.04325.
Dabney, W., Rowland, M., Bellemare, M. G., and Munos, R. (2017). Distributional Reinforcement Learning with Quantile Regression. Available at: http://arxiv.org/abs/1710.10044.
Degris, T., White, M., and Sutton, R. S. (2012). Linear Off-Policy Actor-Critic. in Proceedings of the 2012 International Conference on Machine Learning. Available at: http://arxiv.org/abs/1205.4839.
Ding, Y., Florensa, C., Phielipp, M., and Abbeel, P. (2019). Goal-conditioned Imitation Learning. in Proceedings of Machine Learning Research (Long Beach, CA: PMLR). Available at: https://openreview.net/pdf?id=HkglHcSj2N.
Dosovitskiy, A., and Koltun, V. (2016). Learning to Act by Predicting the Future. Available at: http://arxiv.org/abs/1611.01779.
Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. (2016). Benchmarking Deep Reinforcement Learning for Continuous Control. Available at: http://arxiv.org/abs/1604.06778.
Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., et al. (2018). IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. doi:10.48550/arXiv.1802.01561.
Feinberg, V., Wan, A., Stoica, I., Jordan, M. I., Gonzalez, J. E., and Levine, S. (2018). Model-Based Value Estimation for Efficient Model-Free Reinforcement Learning. Available at: http://arxiv.org/abs/1803.00101.
Fortunato, M., Azar, M. G., Piot, B., Menick, J., Osband, I., Graves, A., et al. (2017). Noisy Networks for Exploration. Available at: http://arxiv.org/abs/1706.10295.
Gers, F. (2001). Long Short-Term Memory in Recurrent Neural Networks. PhD thesis, École Polytechnique Fédérale de Lausanne. Available at: http://www.felixgers.de/papers/phd.pdf.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. (2014). Generative Adversarial Networks. Available at: http://arxiv.org/abs/1406.2661.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. Available at: http://www.deeplearningbook.org.
Goyal, A., Brakel, P., Fedus, W., Lillicrap, T., Levine, S., Larochelle, H., et al. (2018). Recall Traces: Backtracking Models for Efficient Reinforcement Learning. Available at: http://arxiv.org/abs/1804.00379.
Gruslys, A., Dabney, W., Azar, M. G., Piot, B., Bellemare, M., and Munos, R. (2017). The Reactor: A fast and sample-efficient Actor-Critic agent for Reinforcement Learning. Available at: http://arxiv.org/abs/1704.04651.
Gu, S., Holly, E., Lillicrap, T., and Levine, S. (2017). Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Off-Policy Updates. in Proc. ICRA. Available at: http://arxiv.org/abs/1610.00633.
Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S. (2016a). Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic. Available at: http://arxiv.org/abs/1611.02247.
Gu, S., Lillicrap, T., Sutskever, I., and Levine, S. (2016b). Continuous Deep Q-Learning with Model-based Acceleration. Available at: http://arxiv.org/abs/1603.00748.
Ha, D., and Schmidhuber, J. (2018). World Models. doi:10.5281/zenodo.1207631.
Haarnoja, T., Hartikainen, K., Abbeel, P., and Levine, S. (2018a). Latent Space Policies for Hierarchical Reinforcement Learning. Available at: http://arxiv.org/abs/1804.02808.
Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017). Reinforcement Learning with Deep Energy-Based Policies. Available at: http://arxiv.org/abs/1702.08165.
Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., et al. (2018b). Soft Actor-Critic Algorithms and Applications. Available at: http://arxiv.org/abs/1812.05905.
Hafner, R., and Riedmiller, M. (2011). Reinforcement learning in feedback control. Machine Learning 84, 137–169. doi:10.1007/s10994-011-5235-x.
Harutyunyan, A., Bellemare, M. G., Stepleton, T., and Munos, R. (2016). Q(λ) with off-policy corrections. Available at: http://arxiv.org/abs/1602.04951.
Hausknecht, M., and Stone, P. (2015). Deep Recurrent Q-Learning for Partially Observable MDPs. Available at: http://arxiv.org/abs/1507.06527.
He, F. S., Liu, Y., Schwing, A. G., and Peng, J. (2016). Learning to Play in a Day: Faster Deep Reinforcement Learning by Optimality Tightening. Available at: http://arxiv.org/abs/1611.01606.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep Residual Learning for Image Recognition. Available at: http://arxiv.org/abs/1512.03385.
Heess, N., Wayne, G., Silver, D., Lillicrap, T., Tassa, Y., and Erez, T. (2015). Learning continuous control policies by stochastic value gradients. in Proc. International Conference on Neural Information Processing Systems, 2944–2952. Available at: http://dl.acm.org/citation.cfm?id=2969569.
Heinrich, J., Lanctot, M., and Silver, D. (2015). Fictitious Self-Play in Extensive-Form Games. in Proceedings of the 32nd International Conference on Machine Learning, 805–813. Available at: http://proceedings.mlr.press/v37/heinrich15.html.
Heinrich, J., and Silver, D. (2016). Deep Reinforcement Learning from Self-Play in Imperfect-Information Games. Available at: http://arxiv.org/abs/1603.01121.
Henaff, M., Whitney, W. F., and LeCun, Y. (2017). Model-Based Planning with Discrete and Continuous Actions. Available at: http://arxiv.org/abs/1705.07177.
Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., et al. (2017). Rainbow: Combining Improvements in Deep Reinforcement Learning. Available at: http://arxiv.org/abs/1710.02298.
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen [Investigations of dynamic neural networks]. Diploma thesis, Technische Universität München. Available at: http://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf.
Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Computation 9, 1735–1780. Available at: https://www.ncbi.nlm.nih.gov/pubmed/9377276.
Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., van Hasselt, H., et al. (2018). Distributed Prioritized Experience Replay. Available at: http://arxiv.org/abs/1803.00933.
Ioffe, S., and Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Available at: http://arxiv.org/abs/1502.03167.
Kakade, S. (2001). A Natural Policy Gradient. in Advances in Neural Information Processing Systems 14. Available at: https://papers.nips.cc/paper/2073-a-natural-policy-gradient.pdf.
Kakade, S., and Langford, J. (2002). Approximately Optimal Approximate Reinforcement Learning. in Proc. 19th International Conference on Machine Learning, 267–274. Available at: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.7.7601.
Kansky, K., Silver, T., Mély, D. A., Eldawy, M., Lázaro-Gredilla, M., Lou, X., et al. (2017). Schema Networks: Zero-shot Transfer with a Generative Causal Model of Intuitive Physics. Available at: http://arxiv.org/abs/1706.04317.
Kapturowski, S., Ostrovski, G., Quan, J., Munos, R., and Dabney, W. (2019). Recurrent experience replay in distributed reinforcement learning. in International Conference on Learning Representations (ICLR). Available at: https://openreview.net/pdf?id=r1lyTjAqYX.
Kingma, D. P., and Welling, M. (2013). Auto-Encoding Variational Bayes. Available at: http://arxiv.org/abs/1312.6114.
Knight, E., and Lerner, O. (2018). Natural Gradient Deep Q-learning. Available at: http://arxiv.org/abs/1803.07482.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. in Advances in Neural Information Processing Systems (NIPS). Available at: https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
Levine, S., Finn, C., Darrell, T., and Abbeel, P. (2016a). End-to-End Training of Deep Visuomotor Policies. JMLR 17. Available at: http://arxiv.org/abs/1504.00702.
Levine, S., and Koltun, V. (2013). Guided Policy Search. in Proceedings of Machine Learning Research, 1–9. Available at: http://proceedings.mlr.press/v28/levine13.html.
Levine, S., Pastor, P., Krizhevsky, A., and Quillen, D. (2016b). Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection. in Proc. ISER. Available at: http://arxiv.org/abs/1603.02199.
Li, Y. (2017). Deep Reinforcement Learning: An Overview. Available at: http://arxiv.org/abs/1701.07274.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., et al. (2015). Continuous control with deep reinforcement learning. CoRR. Available at: http://arxiv.org/abs/1509.02971.
Lötzsch, W., Vitay, J., and Hamker, F. H. (2017). Training a deep policy gradient-based neural network with asynchronous learners on a simulated robotic problem. in INFORMATIK 2017, eds. M. Eibl and M. Gaedke (Bonn: Gesellschaft für Informatik), 2143–2154. Available at: https://dl.gi.de/handle/20.500.12116/3986.
Machado, M. C., Bellemare, M. G., and Bowling, M. (2018). Count-Based Exploration with the Successor Representation. Available at: http://arxiv.org/abs/1807.11622.
Meuleau, N., Peshkin, L., Kaelbling, L. P., and Kim, K. (2000). Off-Policy Policy Search. Available at: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.894.
Mirowski, P., Pascanu, R., Viola, F., Soyer, H., Ballard, A. J., Banino, A., et al. (2016). Learning to Navigate in Complex Environments. Available at: http://arxiv.org/abs/1611.03673.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., et al. (2016). Asynchronous Methods for Deep Reinforcement Learning. in Proc. ICML. Available at: http://arxiv.org/abs/1602.01783.
Mnih, V., Heess, N., Graves, A., and Kavukcuoglu, K. (2014). Recurrent Models of Visual Attention. Available at: http://arxiv.org/abs/1406.6247.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., et al. (2013). Playing Atari with Deep Reinforcement Learning. Available at: http://arxiv.org/abs/1312.5602.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature 518, 529–533. doi:10.1038/nature14236.
Mousavi, S. S., Schukat, M., and Howley, E. (2018). “Deep Reinforcement Learning: An Overview,” in Proceedings of SAI Intelligent Systems Conference (IntelliSys) 2016 (Cham: Springer), 426–440. doi:10.1007/978-3-319-56991-8_32.
Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. G. (2016). Safe and Efficient Off-Policy Reinforcement Learning. Available at: http://arxiv.org/abs/1606.02647.
Nachum, O., Gu, S., Lee, H., and Levine, S. (2018). Data-Efficient Hierarchical Reinforcement Learning. Available at: http://arxiv.org/abs/1805.08296.
Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017). Bridging the Gap Between Value and Policy Based Reinforcement Learning. Available at: http://arxiv.org/abs/1702.08892.
Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., De Maria, A., et al. (2015). Massively Parallel Methods for Deep Reinforcement Learning. Available at: https://arxiv.org/pdf/1507.04296.pdf.
Nielsen, M. A. (2015). Neural Networks and Deep Learning. Determination Press. Available at: http://neuralnetworksanddeeplearning.com/.
Niu, F., Recht, B., Re, C., and Wright, S. J. (2011). HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. in Proc. Advances in Neural Information Processing Systems. Available at: http://arxiv.org/abs/1106.5730.
O’Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. (2016). Combining policy gradient and Q-learning. Available at: http://arxiv.org/abs/1611.01626.
Oh, J., Guo, Y., Singh, S., and Lee, H. (2018). Self-Imitation Learning. Available at: http://arxiv.org/abs/1806.05635.
Pardo, F., Levdik, V., and Kormushev, P. (2018). Q-map: A Convolutional Approach for Goal-Oriented Reinforcement Learning. Available at: http://arxiv.org/abs/1810.02927.
Peng, B., Li, X., Gao, J., Liu, J., Wong, K.-F., and Su, S.-Y. (2018). Deep Dyna-Q: Integrating Planning for Task-Completion Dialogue Policy Learning. Available at: http://arxiv.org/abs/1801.06176.
Peshkin, L., and Shelton, C. R. (2002). Learning from Scarce Experience. Available at: http://arxiv.org/abs/cs/0204043.
Peters, J., and Schaal, S. (2008). Reinforcement learning of motor skills with policy gradients. Neural Networks 21, 682–697. doi:10.1016/j.neunet.2008.02.003.
Pong, V., Gu, S., Dalal, M., and Levine, S. (2018). Temporal Difference Models: Model-Free Deep RL for Model-Based Control. Available at: http://arxiv.org/abs/1802.09081.
Popov, I., Heess, N., Lillicrap, T., Hafner, R., Barth-Maron, G., Vecerik, M., et al. (2017). Data-efficient Deep Reinforcement Learning for Dexterous Manipulation. Available at: http://arxiv.org/abs/1704.03073.
Precup, D., Sutton, R. S., and Singh, S. (2000). Eligibility traces for off-policy policy evaluation. in Proceedings of the Seventeenth International Conference on Machine Learning.
Ruder, S. (2016). An overview of gradient descent optimization algorithms. Available at: http://arxiv.org/abs/1609.04747.
Salimans, T., Ho, J., Chen, X., Sidor, S., and Sutskever, I. (2017). Evolution Strategies as a Scalable Alternative to Reinforcement Learning. Available at: http://arxiv.org/abs/1703.03864.
Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2015). Prioritized Experience Replay. Available at: http://arxiv.org/abs/1511.05952.
Schoettler, G., Nair, A., Luo, J., Bahl, S., Ojea, J. A., Solowjow, E., et al. (2019). Deep Reinforcement Learning for Industrial Insertion Tasks with Visual Inputs and Natural Rewards. Available at: http://arxiv.org/abs/1906.05841.
Schulman, J., Chen, X., and Abbeel, P. (2017a). Equivalence Between Policy Gradients and Soft Q-Learning. Available at: http://arxiv.org/abs/1704.06440.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015a). Trust Region Policy Optimization. in Proceedings of the 32nd International Conference on Machine Learning, 1889–1897. Available at: http://proceedings.mlr.press/v37/schulman15.html.
Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2015b). High-Dimensional Continuous Control Using Generalized Advantage Estimation. Available at: http://arxiv.org/abs/1506.02438.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017b). Proximal Policy Optimization Algorithms. Available at: http://arxiv.org/abs/1707.06347.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., et al. (2016a). Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489. doi:10.1038/nature16961.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic Policy Gradient Algorithms. in Proc. ICML, Proceedings of Machine Learning Research, eds. E. P. Xing and T. Jebara (PMLR), 387–395. Available at: http://proceedings.mlr.press/v32/silver14.html.
Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., et al. (2016b). The Predictron: End-To-End Learning and Planning. Available at: http://arxiv.org/abs/1612.08810.
Simonyan, K., and Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. in International Conference on Learning Representations (ICLR), 1–14.
Srinivas, A., Jabri, A., Abbeel, P., Levine, S., and Finn, C. (2018). Universal Planning Networks. Available at: http://arxiv.org/abs/1804.00645.
Stollenga, M., Masci, J., Gomez, F., and Schmidhuber, J. (2014). Deep Networks with Internal Selective Attention through Feedback Connections. Available at: http://arxiv.org/abs/1407.3068.
Sutton, R. S., and Barto, A. G. (1990). “Time-derivative models of Pavlovian reinforcement,” in Learning and Computational Neuroscience: Foundations of Adaptive Networks (MIT Press), 497–537. Available at: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.81.98.
Sutton, R. S., and Barto, A. G. (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.
Sutton, R. S., and Barto, A. G. (2017). Reinforcement Learning: An Introduction. 2nd ed. Cambridge, MA: MIT Press. Available at: http://incompleteideas.net/book/the-book-2nd.html.
Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. in Proceedings of the 12th International Conference on Neural Information Processing Systems (MIT Press), 1057–1063. Available at: https://dl.acm.org/citation.cfm?id=3009806.
Szita, I., and Lőrincz, A. (2006). Learning Tetris Using the Noisy Cross-Entropy Method. Neural Computation 18, 2936–2941. doi:10.1162/neco.2006.18.12.2936.
Tang, J., and Abbeel, P. (2010). On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient. in Advances in Neural Information Processing Systems. Available at: http://rll.berkeley.edu/~jietang/pubs/nips10_Tang.pdf.
Todorov, E. (2008). General duality between optimal control and estimation. in 2008 47th IEEE Conference on Decision and Control, 4286–4292. doi:10.1109/CDC.2008.4739438.
Toussaint, M. (2009). Robot Trajectory Optimization Using Approximate Inference. in Proceedings of the 26th Annual International Conference on Machine Learning (ICML ’09) (New York, NY, USA: ACM), 1049–1056. doi:10.1145/1553374.1553508.
Uhlenbeck, G. E., and Ornstein, L. S. (1930). On the Theory of the Brownian Motion. Physical Review 36, 823–841. doi:10.1103/PhysRev.36.823.
van Hasselt, H. (2010). Double Q-learning. in Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 2 (Curran Associates Inc.), 2613–2621. Available at: https://dl.acm.org/citation.cfm?id=2997187.
van Hasselt, H., Guez, A., and Silver, D. (2015). Deep Reinforcement Learning with Double Q-learning. Available at: http://arxiv.org/abs/1509.06461.
Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., et al. (2017). Learning to reinforcement learn. Available at: http://arxiv.org/abs/1611.05763.
Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and de Freitas, N. (2016). Dueling Network Architectures for Deep Reinforcement Learning. Available at: http://arxiv.org/abs/1511.06581.
Watkins, C. J. (1989). Learning from delayed rewards. PhD thesis, King’s College, University of Cambridge.
Watter, M., Springenberg, J. T., Boedecker, J., and Riedmiller, M. (2015). Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images. Available at: https://arxiv.org/pdf/1506.07365.pdf.
Weber, T., Racanière, S., Reichert, D. P., Buesing, L., Guez, A., Rezende, D. J., et al. (2017). Imagination-Augmented Agents for Deep Reinforcement Learning. Available at: http://arxiv.org/abs/1707.06203.
Wierstra, D., Foerster, A., Peters, J., and Schmidhuber, J. (2007). “Solving Deep Memory POMDPs with Recurrent Policy Gradients,” in Artificial Neural Networks – ICANN 2007 (Berlin, Heidelberg: Springer), 697–706. doi:10.1007/978-3-540-74690-4_71.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 229–256.
Williams, R. J., and Peng, J. (1991). Function optimization using connectionist reinforcement learning algorithms. Connection Science 3, 241–268.
Zhang, F., Leitner, J., Milford, M., Upcroft, B., and Corke, P. (2015). Towards Vision-Based Deep Reinforcement Learning for Robotic Motion Control. in Proc. ACRA. Available at: http://arxiv.org/abs/1511.03791.
Ziebart, B. D., Maas, A., Bagnell, J. A., and Dey, A. K. (2008). Maximum Entropy Inverse Reinforcement Learning. in Proc. AAAI Conference on Artificial Intelligence, 1433–1438.