References
Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein
GAN. http://arxiv.org/abs/1701.07875.
Atito, S., Awais, M., and Kittler, J. (2021). SiT: Self-supervised vIsion Transformer. http://arxiv.org/abs/2104.03602.
Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer
Normalization. http://arxiv.org/abs/1607.06450.
Badrinarayanan, V., Kendall, A., and Cipolla, R. (2016).
SegNet: A Deep Convolutional Encoder-Decoder
Architecture for Image Segmentation. http://arxiv.org/abs/1511.00561.
Bahdanau, D., Cho, K., and Bengio, Y. (2016). Neural Machine
Translation by Jointly Learning to
Align and Translate. http://arxiv.org/abs/1409.0473.
Binder, A., Montavon, G., Bach, S., Müller, K.-R., and Samek, W. (2016).
Layer-wise Relevance Propagation for Neural
Networks with Local Renormalization Layers. http://arxiv.org/abs/1604.00825.
Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B.,
Goyal, P., et al. (2016). End to End Learning for
Self-Driving Cars. http://arxiv.org/abs/1604.07316.
Brette, R., and Gerstner, W. (2005). Adaptive Exponential Integrate-and-Fire Model as an
Effective Description of Neuronal Activity.
Journal of Neurophysiology 94, 3637–3642. doi:10.1152/jn.00686.2005.
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski,
P., et al. (2021). Emerging Properties in
Self-Supervised Vision Transformers. http://arxiv.org/abs/2104.14294.
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). A
Simple Framework for Contrastive Learning of
Visual Representations. http://arxiv.org/abs/2002.05709.
Chollet, F. (2017a). Deep Learning with
Python. Manning Publications https://www.manning.com/books/deep-learning-with-python.
Chollet, F. (2017b). Xception: Deep Learning with
Depthwise Separable Convolutions. http://arxiv.org/abs/1610.02357.
Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical
Evaluation of Gated Recurrent Neural Networks
on Sequence Modeling. http://arxiv.org/abs/1412.3555.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019).
BERT: Pre-training of
Deep Bidirectional Transformers for Language
Understanding. http://arxiv.org/abs/1810.04805.
Doersch, C., Gupta, A., and Efros, A. A. (2016). Unsupervised
Visual Representation Learning by Context
Prediction. http://arxiv.org/abs/1505.05192.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X.,
Unterthiner, T., et al. (2021). An Image is
Worth 16x16 Words: Transformers
for Image Recognition at Scale. http://arxiv.org/abs/2010.11929.
Fukushima, K. (1980). Neocognitron: A self-organizing
neural network model for a mechanism of pattern recognition unaffected
by shift in position. Biol. Cybernetics 36, 193–202. doi:10.1007/BF00344251.
Gers, F. A., and Schmidhuber, J. (2000). Recurrent nets that time and
count. in Proceedings of the IEEE-INNS-ENNS International
Joint Conference on Neural Networks.
IJCNN 2000. Neural Computing: New
Challenges and Perspectives for the New
Millennium, vol. 3, 189–194. doi:10.1109/IJCNN.2000.861302.
Gidaris, S., Singh, P., and Komodakis, N. (2018). Unsupervised
Representation Learning by Predicting Image
Rotations. http://arxiv.org/abs/1803.07728.
Girshick, R. (2015). Fast R-CNN. http://arxiv.org/abs/1504.08083.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014). Rich
feature hierarchies for accurate object detection and semantic
segmentation. http://arxiv.org/abs/1311.2524.
Glorot, X., and Bengio, Y. (2010). Understanding the difficulty of
training deep feedforward neural networks. in Proceedings
of the Thirteenth International Conference on Artificial
Intelligence and Statistics (AISTATS), 249–256.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley,
D., Ozair, S., et al. (2014). Generative Adversarial
Networks. http://arxiv.org/abs/1406.2661.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep
Learning. MIT Press http://www.deeplearningbook.org.
Guo, X., Liu, X., Zhu, E., and Yin, J. (2017). Deep
Clustering with Convolutional Autoencoders. in
Neural Information Processing Lecture
Notes in Computer Science., eds. D. Liu, S.
Xie, Y. Li, D. Zhao, and E.-S. M. El-Alfy (Cham:
Springer International Publishing), 373–382. doi:10.1007/978-3-319-70096-0_39.
Gupta, S., Girshick, R., Arbeláez, P., and Malik, J. (2014). Learning
Rich Features from RGB-D Images for
Object Detection and Segmentation. http://arxiv.org/abs/1407.5736.
Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E.,
et al. (2014). Deep Speech: Scaling up
end-to-end speech recognition. http://arxiv.org/abs/1412.5567.
Haykin, S. S. (2009). Neural Networks and
Learning Machines, 3rd Edition.
Pearson http://dai.fmph.uniba.sk/courses/NN/haykin.neural-networks.3ed.2009.pdf.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2018). Mask
R-CNN. http://arxiv.org/abs/1703.06870.
He, K., Zhang, X., Ren, S., and Sun, J. (2015a). Deep Residual
Learning for Image Recognition. http://arxiv.org/abs/1512.03385.
He, K., Zhang, X., Ren, S., and Sun, J. (2015b). Delving
Deep into Rectifiers: Surpassing
Human-Level Performance on ImageNet Classification.
http://arxiv.org/abs/1502.01852.
Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick,
M., et al. (2017). β-VAE: Learning Basic Visual
Concepts with a Constrained Variational Framework.
in ICLR 2017 https://openreview.net/forum?id=Sy2fzU9gl.
Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning
algorithm for deep belief nets. Neural Comput. 18, 1527–1554.
doi:10.1162/neco.2006.18.7.1527.
Hinton, G. E., and Salakhutdinov, R. R. (2006). Reducing the
Dimensionality of Data with Neural
Networks. Science 313, 504–507. doi:10.1126/science.1127647.
Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the
Knowledge in a Neural Network. http://arxiv.org/abs/1503.02531.
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen
Netzen. Diploma thesis, Technische Universität
München. http://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf.
Hochreiter, S., and Schmidhuber, J. (1997). Long Short-Term
Memory. Neural Computation 9, 1735–1780. doi:10.1162/neco.1997.9.8.1735.
Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q. (2018).
Densely Connected Convolutional Networks. http://arxiv.org/abs/1608.06993.
Ioffe, S., and Szegedy, C. (2015). Batch Normalization:
Accelerating Deep Network Training by Reducing
Internal Covariate Shift. http://arxiv.org/abs/1502.03167.
Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. (2018).
Image-to-Image Translation with Conditional
Adversarial Networks. http://arxiv.org/abs/1611.07004.
Izhikevich, E. M. (2003). Simple model of spiking neurons. IEEE
Transactions on Neural Networks 14, 1569–1572. doi:10.1109/TNN.2003.820440.
Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and
Aila, T. (2020). Analyzing and Improving the Image
Quality of StyleGAN. http://arxiv.org/abs/1912.04958.
Kendall, A., Grimes, M., and Cipolla, R. (2016). PoseNet:
A Convolutional Network for Real-Time
6-DOF Camera Relocalization. http://arxiv.org/abs/1505.07427.
Kim, Y. (2014). Convolutional Neural Networks for
Sentence Classification. http://arxiv.org/abs/1408.5882.
Kingma, D. P., and Welling, M. (2013). Auto-Encoding Variational
Bayes. http://arxiv.org/abs/1312.6114.
Kingma, D. P., and Ba, J. (2014). Adam: A Method for
Stochastic Optimization. http://arxiv.org/abs/1412.6980.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet
Classification with Deep Convolutional Neural
Networks. in Advances in Neural Information Processing
Systems (NIPS) https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
Lapuschkin, S., Wäldchen, S., Binder, A., Montavon, G., Samek, W., and
Müller, K.-R. (2019). Unmasking Clever Hans predictors and
assessing what machines really learn. Nature Communications 10,
1096. doi:10.1038/s41467-019-08987-4.
Le, Q. V. (2013). Building high-level features using large scale
unsupervised learning. in 2013 IEEE International
Conference on Acoustics, Speech and
Signal Processing (Vancouver, BC, Canada:
IEEE), 8595–8598. doi:10.1109/ICASSP.2013.6639343.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient
Based Learning Applied to Document
Recognition. Proceedings of the IEEE 86, 2278–2324.
doi:10.1109/5.726791.
Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. (2018).
Visualizing the Loss Landscape of Neural Nets.
http://arxiv.org/abs/1712.09913.
Lillicrap, T. P., Cownden, D., Tweed, D. B., and Akerman, C. J. (2016).
Random synaptic feedback weights support error backpropagation for deep
learning. Nat Commun 7, 1–10. doi:10.1038/ncomms13276.
Linnainmaa, S. (1970). The representation of the cumulative rounding
error of an algorithm as a Taylor expansion of the local
rounding errors. Master's thesis, University of Helsinki.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., et
al. (2016). SSD: Single Shot MultiBox
Detector. in Computer Vision – ECCV 2016, Lecture Notes in
Computer Science 9905, 21–37. doi:10.1007/978-3-319-46448-0_2.
Maas, A. L., Hannun, A. Y., and Ng, A. Y. (2013). Rectifier
Nonlinearities Improve Neural Network Acoustic Models. in
ICML Workshop on Deep Learning for Audio, Speech
and Language Processing.
Malinowski, M., Rohrbach, M., and Fritz, M. (2015). Ask Your
Neurons: A Neural-based Approach to
Answering Questions about Images. http://arxiv.org/abs/1505.01121.
McInnes, L., Healy, J., and Melville, J. (2020). UMAP:
Uniform Manifold Approximation and Projection
for Dimension Reduction. http://arxiv.org/abs/1802.03426.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient
Estimation of Word Representations in
Vector Space. http://arxiv.org/abs/1301.3781.
Mirza, M., and Osindero, S. (2014). Conditional Generative
Adversarial Nets. http://arxiv.org/abs/1411.1784.
Misra, I., Zitnick, C. L., and Hebert, M. (2016). Shuffle and
Learn: Unsupervised Learning using
Temporal Order Verification. http://arxiv.org/abs/1603.08561.
Murphy, K. P. (2022). Probabilistic Machine Learning:
An introduction. MIT Press https://probml.github.io/pml-book/book1.html.
Noroozi, M., and Favaro, P. (2017). Unsupervised Learning
of Visual Representations by Solving Jigsaw
Puzzles. http://arxiv.org/abs/1603.09246.
Nowozin, S., Cseke, B., and Tomioka, R. (2016). f-GAN:
Training Generative Neural Samplers using Variational
Divergence Minimization. http://arxiv.org/abs/1606.00709.
Olshausen, B. A., and Field, D. J. (1997). Sparse coding with an
overcomplete basis set: A strategy employed by
V1? Vision Research 37, 3311–3325. doi:10.1016/S0042-6989(97)00169-7.
Oord, A. van den, Dieleman, S., Zen, H., Simonyan, K., Vinyals, O.,
Graves, A., et al. (2016). WaveNet: A Generative
Model for Raw Audio. http://arxiv.org/abs/1609.03499.
Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., and Efros, A. A.
(2016). Context Encoders: Feature Learning by
Inpainting. http://arxiv.org/abs/1604.07379.
Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised
Representation Learning with Deep Convolutional
Generative Adversarial Networks. http://arxiv.org/abs/1511.06434.
Razavi, A., Oord, A. van den, and Vinyals, O. (2019). Generating
Diverse High-Fidelity Images with VQ-VAE-2. http://arxiv.org/abs/1906.00446.
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016). You
Only Look Once: Unified, Real-Time
Object Detection. http://arxiv.org/abs/1506.02640.
Redmon, J., and Farhadi, A. (2016). YOLO9000:
Better, Faster, Stronger. http://arxiv.org/abs/1612.08242.
Redmon, J., and Farhadi, A. (2018). YOLOv3: An
Incremental Improvement. http://arxiv.org/abs/1804.02767.
Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., and Lee, H.
(2016). Generative Adversarial Text to Image
Synthesis. http://arxiv.org/abs/1605.05396.
Ren, S., He, K., Girshick, R., and Sun, J. (2016). Faster
R-CNN: Towards Real-Time Object Detection with
Region Proposal Networks. http://arxiv.org/abs/1506.01497.
Ronneberger, O., Fischer, P., and Brox, T. (2015). U-Net:
Convolutional Networks for Biomedical Image
Segmentation. http://arxiv.org/abs/1505.04597.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning
representations by back-propagating errors. Nature 323,
533–536. doi:10.1038/323533a0.
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and
Chen, X. (2016). Improved Techniques for Training
GANs. http://arxiv.org/abs/1606.03498.
Sathe, S., Shinde, S., Chorge, S., Thakare, S., and Kulkarni, L. (2022).
Overview of Image Caption Generators and Its
Applications. in Proceeding of International
Conference on Computational Science and
Applications Algorithms for Intelligent
Systems., eds. S. Bhalla, M. Bedekar, R. Phalnikar, and S.
Sirsikar (Singapore: Springer Nature),
105–110. doi:10.1007/978-981-19-0863-7_8.
Simonyan, K., and Zisserman, A. (2015). Very Deep Convolutional
Networks for Large-Scale Image Recognition. in
International Conference on Learning Representations
(ICLR). http://arxiv.org/abs/1409.1556.
Sohn, K., Lee, H., and Yan, X. (2015). Learning Structured
Output Representation using Deep Conditional Generative
Models. in Advances in Neural Information
Processing Systems 28, eds. C. Cortes, N. D. Lawrence, D. D.
Lee, M. Sugiyama, and R. Garnett (Curran Associates, Inc.),
3483–3491. http://papers.nips.cc/paper/5775-learning-structured-output-representation-using-deep-conditional-generative-models.pdf.
Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M.
(2015). Striving for Simplicity: The All
Convolutional Net. http://arxiv.org/abs/1412.6806.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and
Salakhutdinov, R. (2014). Dropout: A Simple Way to
Prevent Neural Networks from Overfitting.
Journal of Machine Learning Research 15, 1929–1958. http://jmlr.org/papers/v15/srivastava14a.html.
Srivastava, R. K., Greff, K., and Schmidhuber, J. (2015). Highway
Networks. http://arxiv.org/abs/1505.00387.
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to
Sequence Learning with Neural Networks. http://arxiv.org/abs/1409.3215.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2015).
Rethinking the Inception Architecture for Computer
Vision. http://arxiv.org/abs/1512.00567.
Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014).
DeepFace: Closing the Gap to
Human-Level Performance in Face Verification.
in 2014 IEEE Conference on Computer Vision
and Pattern Recognition (Columbus, OH,
USA: IEEE), 1701–1708. doi:10.1109/CVPR.2014.220.
Van Etten, A. (2019). Satellite Imagery Multiscale Rapid
Detection with Windowed Networks. in 2019
IEEE Winter Conference on Applications of
Computer Vision (WACV), 735–743. doi:10.1109/WACV.2019.00083.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez,
A. N., et al. (2017). Attention Is All You Need. http://arxiv.org/abs/1706.03762.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A.
(2010). Stacked Denoising Autoencoders: Learning
Useful Representations in a Deep Network with a
Local Denoising Criterion. J. Mach. Learn. Res.
11, 3371–3408.
Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015). Show and
Tell: A Neural Image Caption Generator. http://arxiv.org/abs/1411.4555.
Wang, B., Zheng, H., Liang, X., Chen, Y., Lin, L., and Yang, M. (2018).
Toward Characteristic-Preserving Image-based
Virtual Try-On Network. http://arxiv.org/abs/1807.07688.
Werbos, P. J. (1982). Applications of advances in nonlinear sensitivity
analysis. in System Modeling and
Optimization: Proc. IFIP
(Springer).
Wu, N., Green, B., Ben, X., and O’Banion, S. (2020). Deep
Transformer Models for Time Series
Forecasting: The Influenza Prevalence Case. http://arxiv.org/abs/2001.08317.
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., et
al. (2016). Google’s Neural Machine Translation System:
Bridging the Gap between Human
and Machine Translation. https://arxiv.org/abs/1609.08144v2.
Xu, K., Ba, J. L., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R.,
et al. (2015). Show, Attend and Tell:
Neural Image Caption Generation with Visual
Attention. in Proceedings of the 32nd International
Conference on Machine Learning - Volume
37 ICML’15. (JMLR.org), 2048–2057. http://dl.acm.org/citation.cfm?id=3045118.3045336.
Zhou, Y., and Tuzel, O. (2017). VoxelNet: End-to-End Learning for Point Cloud Based 3D
Object Detection. http://arxiv.org/abs/1711.06396.
Zhu, Y., Gao, T., Fan, L., Huang, S., Edmonds, M., Liu, H., et al.
(2020). Dark, Beyond Deep: A Paradigm Shift to
Cognitive AI with Humanlike Common Sense.
Engineering. doi:10.1016/j.eng.2020.01.011.