References

Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein GAN. http://arxiv.org/abs/1701.07875.
Atito, S., Awais, M., and Kittler, J. (2021). SiT: Self-supervised vIsion Transformer. http://arxiv.org/abs/2104.03602.
Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer Normalization. http://arxiv.org/abs/1607.06450.
Badrinarayanan, V., Kendall, A., and Cipolla, R. (2016). SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. http://arxiv.org/abs/1511.00561.
Bahdanau, D., Cho, K., and Bengio, Y. (2016). Neural Machine Translation by Jointly Learning to Align and Translate. http://arxiv.org/abs/1409.0473.
Binder, A., Montavon, G., Bach, S., Müller, K.-R., and Samek, W. (2016). Layer-wise Relevance Propagation for Neural Networks with Local Renormalization Layers. http://arxiv.org/abs/1604.00825.
Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., et al. (2016). End to End Learning for Self-Driving Cars. http://arxiv.org/abs/1604.07316.
Brette, R., and Gerstner, W. (2005). Adaptive Exponential Integrate-and-Fire Model as an Effective Description of Neuronal Activity. Journal of Neurophysiology 94, 3637–3642. doi:10.1152/jn.00686.2005.
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., et al. (2021). Emerging Properties in Self-Supervised Vision Transformers. http://arxiv.org/abs/2104.14294.
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). A Simple Framework for Contrastive Learning of Visual Representations. http://arxiv.org/abs/2002.05709.
Chollet, F. (2017a). Deep Learning with Python. Manning Publications https://www.manning.com/books/deep-learning-with-python.
Chollet, F. (2017b). Xception: Deep Learning with Depthwise Separable Convolutions. http://arxiv.org/abs/1610.02357.
Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. http://arxiv.org/abs/1412.3555.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. http://arxiv.org/abs/1810.04805.
Doersch, C., Gupta, A., and Efros, A. A. (2016). Unsupervised Visual Representation Learning by Context Prediction. http://arxiv.org/abs/1505.05192.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., et al. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. http://arxiv.org/abs/2010.11929.
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics 36, 193–202. doi:10.1007/BF00344251.
Gers, F. A., and Schmidhuber, J. (2000). Recurrent nets that time and count. in Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), Neural Computing: New Challenges and Perspectives for the New Millennium, vol. 3, 189–194. doi:10.1109/IJCNN.2000.861302.
Gidaris, S., Singh, P., and Komodakis, N. (2018). Unsupervised Representation Learning by Predicting Image Rotations. http://arxiv.org/abs/1803.07728.
Girshick, R. (2015). Fast R-CNN. http://arxiv.org/abs/1504.08083.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. http://arxiv.org/abs/1311.2524.
Glorot, X., and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 249–256.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. (2014). Generative Adversarial Networks. http://arxiv.org/abs/1406.2661.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press http://www.deeplearningbook.org.
Guo, X., Liu, X., Zhu, E., and Yin, J. (2017). Deep Clustering with Convolutional Autoencoders. in Neural Information Processing, Lecture Notes in Computer Science, eds. D. Liu, S. Xie, Y. Li, D. Zhao, and E.-S. M. El-Alfy (Cham: Springer International Publishing), 373–382. doi:10.1007/978-3-319-70096-0_39.
Gupta, S., Girshick, R., Arbeláez, P., and Malik, J. (2014). Learning Rich Features from RGB-D Images for Object Detection and Segmentation. http://arxiv.org/abs/1407.5736.
Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., et al. (2014). Deep Speech: Scaling up end-to-end speech recognition. http://arxiv.org/abs/1412.5567.
Haykin, S. S. (2009). Neural Networks and Learning Machines, 3rd Edition. Pearson http://dai.fmph.uniba.sk/courses/NN/haykin.neural-networks.3ed.2009.pdf.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2018). Mask R-CNN. http://arxiv.org/abs/1703.06870.
He, K., Zhang, X., Ren, S., and Sun, J. (2015a). Deep Residual Learning for Image Recognition. http://arxiv.org/abs/1512.03385.
He, K., Zhang, X., Ren, S., and Sun, J. (2015b). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. http://arxiv.org/abs/1502.01852.
Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., et al. (2017). beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. in ICLR 2017. https://openreview.net/forum?id=Sy2fzU9gl.
Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation 18, 1527–1554. doi:10.1162/neco.2006.18.7.1527.
Hinton, G. E., and Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science 313, 504–507. doi:10.1126/science.1127647.
Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the Knowledge in a Neural Network. http://arxiv.org/abs/1503.02531.
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen [Investigations on dynamic neural networks]. Diploma thesis, Technische Universität München. http://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf.
Hochreiter, S., and Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation 9, 1735–1780. doi:10.1162/neco.1997.9.8.1735.
Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q. (2018). Densely Connected Convolutional Networks. http://arxiv.org/abs/1608.06993.
Ioffe, S., and Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. http://arxiv.org/abs/1502.03167.
Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. (2018). Image-to-Image Translation with Conditional Adversarial Networks. http://arxiv.org/abs/1611.07004.
Izhikevich, E. M. (2003). Simple model of spiking neurons. IEEE Transactions on Neural Networks 14, 1569–1572. doi:10.1109/TNN.2003.820440.
Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. (2020). Analyzing and Improving the Image Quality of StyleGAN. http://arxiv.org/abs/1912.04958.
Kendall, A., Grimes, M., and Cipolla, R. (2016). PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization. http://arxiv.org/abs/1505.07427.
Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. http://arxiv.org/abs/1408.5882.
Kingma, D. P., and Welling, M. (2013). Auto-Encoding Variational Bayes. http://arxiv.org/abs/1312.6114.
Kingma, D. P., and Ba, J. (2014). Adam: A Method for Stochastic Optimization. http://arxiv.org/abs/1412.6980.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. in Advances in Neural Information Processing Systems (NIPS) https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
Lapuschkin, S., Wäldchen, S., Binder, A., Montavon, G., Samek, W., and Müller, K.-R. (2019). Unmasking Clever Hans predictors and assessing what machines really learn. Nature Communications 10, 1096. doi:10.1038/s41467-019-08987-4.
Le, Q. V. (2013). Building high-level features using large scale unsupervised learning. in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (Vancouver, BC, Canada: IEEE), 8595–8598. doi:10.1109/ICASSP.2013.6639343.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE 86, 2278–2324. doi:10.1109/5.726791.
Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. (2018). Visualizing the Loss Landscape of Neural Nets. http://arxiv.org/abs/1712.09913.
Lillicrap, T. P., Cownden, D., Tweed, D. B., and Akerman, C. J. (2016). Random synaptic feedback weights support error backpropagation for deep learning. Nat Commun 7, 1–10. doi:10.1038/ncomms13276.
Linnainmaa, S. (1970). The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's thesis, University of Helsinki.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., et al. (2016). SSD: Single Shot MultiBox Detector. in Computer Vision – ECCV 2016, Lecture Notes in Computer Science, vol. 9905, 21–37. doi:10.1007/978-3-319-46448-0_2.
Maas, A. L., Hannun, A. Y., and Ng, A. Y. (2013). Rectifier Nonlinearities Improve Neural Network Acoustic Models. in Proceedings of the 30th International Conference on Machine Learning (ICML).
Malinowski, M., Rohrbach, M., and Fritz, M. (2015). Ask Your Neurons: A Neural-based Approach to Answering Questions about Images. http://arxiv.org/abs/1505.01121.
McInnes, L., Healy, J., and Melville, J. (2020). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. http://arxiv.org/abs/1802.03426.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. http://arxiv.org/abs/1301.3781.
Mirza, M., and Osindero, S. (2014). Conditional Generative Adversarial Nets. http://arxiv.org/abs/1411.1784.
Misra, I., Zitnick, C. L., and Hebert, M. (2016). Shuffle and Learn: Unsupervised Learning using Temporal Order Verification. http://arxiv.org/abs/1603.08561.
Murphy, K. P. (2022). Probabilistic Machine Learning: An introduction. MIT Press https://probml.github.io/pml-book/book1.html.
Noroozi, M., and Favaro, P. (2017). Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. http://arxiv.org/abs/1603.09246.
Nowozin, S., Cseke, B., and Tomioka, R. (2016). f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. http://arxiv.org/abs/1606.00709.
Olshausen, B. A., and Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research 37, 3311–3325. doi:10.1016/S0042-6989(97)00169-7.
Oord, A. van den, Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., et al. (2016). WaveNet: A Generative Model for Raw Audio. http://arxiv.org/abs/1609.03499.
Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., and Efros, A. A. (2016). Context Encoders: Feature Learning by Inpainting. http://arxiv.org/abs/1604.07379.
Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. http://arxiv.org/abs/1511.06434.
Razavi, A., Oord, A. van den, and Vinyals, O. (2019). Generating Diverse High-Fidelity Images with VQ-VAE-2. http://arxiv.org/abs/1906.00446.
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016). You Only Look Once: Unified, Real-Time Object Detection. http://arxiv.org/abs/1506.02640.
Redmon, J., and Farhadi, A. (2016). YOLO9000: Better, Faster, Stronger. http://arxiv.org/abs/1612.08242.
Redmon, J., and Farhadi, A. (2018). YOLOv3: An Incremental Improvement. http://arxiv.org/abs/1804.02767.
Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., and Lee, H. (2016). Generative Adversarial Text to Image Synthesis. http://arxiv.org/abs/1605.05396.
Ren, S., He, K., Girshick, R., and Sun, J. (2016). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. http://arxiv.org/abs/1506.01497.
Ronneberger, O., Fischer, P., and Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. http://arxiv.org/abs/1505.04597.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature 323, 533–536. doi:10.1038/323533a0.
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. (2016). Improved Techniques for Training GANs. http://arxiv.org/abs/1606.03498.
Sathe, S., Shinde, S., Chorge, S., Thakare, S., and Kulkarni, L. (2022). Overview of Image Caption Generators and Its Applications. in Proceedings of the International Conference on Computational Science and Applications, Algorithms for Intelligent Systems, eds. S. Bhalla, M. Bedekar, R. Phalnikar, and S. Sirsikar (Singapore: Springer Nature), 105–110. doi:10.1007/978-981-19-0863-7_8.
Simonyan, K., and Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. in International Conference on Learning Representations (ICLR). http://arxiv.org/abs/1409.1556.
Sohn, K., Lee, H., and Yan, X. (2015). Learning Structured Output Representation using Deep Conditional Generative Models. in Advances in Neural Information Processing Systems 28, eds. C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Curran Associates, Inc.), 3483–3491. http://papers.nips.cc/paper/5775-learning-structured-output-representation-using-deep-conditional-generative-models.pdf.
Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. (2015). Striving for Simplicity: The All Convolutional Net. http://arxiv.org/abs/1412.6806.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15, 1929–1958. http://jmlr.org/papers/v15/srivastava14a.html.
Srivastava, R. K., Greff, K., and Schmidhuber, J. (2015). Highway Networks. http://arxiv.org/abs/1505.00387.
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. http://arxiv.org/abs/1409.3215.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2015). Rethinking the Inception Architecture for Computer Vision. http://arxiv.org/abs/1512.00567.
Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014). DeepFace: Closing the Gap to Human-Level Performance in Face Verification. in 2014 IEEE Conference on Computer Vision and Pattern Recognition (Columbus, OH, USA: IEEE), 1701–1708. doi:10.1109/CVPR.2014.220.
Van Etten, A. (2019). Satellite Imagery Multiscale Rapid Detection with Windowed Networks. in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), 735–743. doi:10.1109/WACV.2019.00083.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). Attention Is All You Need. http://arxiv.org/abs/1706.03762.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. Journal of Machine Learning Research 11, 3371–3408.
Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015). Show and Tell: A Neural Image Caption Generator. http://arxiv.org/abs/1411.4555.
Wang, B., Zheng, H., Liang, X., Chen, Y., Lin, L., and Yang, M. (2018). Toward Characteristic-Preserving Image-based Virtual Try-On Network. http://arxiv.org/abs/1807.07688.
Werbos, P. J. (1982). Applications of advances in nonlinear sensitivity analysis. in System Modeling and Optimization: Proc. IFIP (Springer).
Wu, N., Green, B., Ben, X., and O’Banion, S. (2020). Deep Transformer Models for Time Series Forecasting: The Influenza Prevalence Case. http://arxiv.org/abs/2001.08317.
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., et al. (2016). Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. http://arxiv.org/abs/1609.08144.
Xu, K., Ba, J. L., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., et al. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. in Proceedings of the 32nd International Conference on Machine Learning (ICML’15), Vol. 37 (JMLR.org), 2048–2057. http://dl.acm.org/citation.cfm?id=3045118.3045336.
Zhou, Y., and Tuzel, O. (2017). VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. http://arxiv.org/abs/1711.06396.
Zhu, Y., Gao, T., Fan, L., Huang, S., Edmonds, M., Liu, H., et al. (2020). Dark, Beyond Deep: A Paradigm Shift to Cognitive AI with Humanlike Common Sense. Engineering. doi:10.1016/j.eng.2020.01.011.