Deep Learning Architectures - 2023
Updated 2/2023
MP4 | Video: h264, 1280x720 | Audio: AAC, 44.1 KHz, 2 Ch
Genre: eLearning | Language: English | Duration: 68 lectures • 42h 36m | Size: 25 GB
Machine Learning, Natural Language Processing, Computer Vision, Adversarial Learning, Neural Networks, NLP
What you'll learn
Modern Deep Learning Architectures, Computer Vision, Neural Networks, Adversarial Learning
Neural Ordinary Differential Equations, Detecting Adversarial Examples, Batch Normalization: Accelerating Deep Network Training
Challenging Common Assumptions in Unsupervised Learning, Better Representations by Interpolating Hidden States, Processing Megapixel Images with Deep Attention-Sampling Models
Gauge Equivariant Convolutional Networks and the Icosahedral CNN, Dynamic Routing Between Capsules, Learning a Generative Model from a Single Natural Image
Reformer: The Efficient Transformer, Growing Neural Cellular Automata, Deep Learning for Symbolic Mathematics, Neural Weather Model for Precipitation Forecast
POET: Endlessly Generating Increasingly Complex and Diverse Learning Environments
Evolving Normalization-Activation Layers, Finding Sparse & Trainable Neural Networks, FixMatch: Simplifying Semi-Supervised Learning with Consistency
Longformer: Long-Document Transformer, Jukebox: A Generative Model for Music, Big Transfer (BiT): General Visual Representation Learning, Group Normalization
DETR: End-to-End Object Detection with Transformers
Weight Standardization
Synthesizer: Rethinking Self-Attention in Transformer Models
Learning to Classify Images without Labels
Rapid Architecture Search
End-to-end Adversarial Text-to-Speech
Linformer: Self-Attention with Linear Complexity
VirTex: Learning Visual Representations from Textual Annotations
A bio-inspired bistable recurrent cell allows for long-lasting memory
SIREN: Implicit Neural Representations with Periodic Activation Functions
Discovering Symbolic Models from Deep Learning with Inductive Biases
RepNet: Counting Out Time - Class Agnostic Video Repetition Counting in the Wild
Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures
Set Distribution Networks: A Generative Model for Sets of Images
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
SupSup: SuperMasks in Superposition
NVAE: A Deep Hierarchical Variational Autoencoder
Gradient Origin Networks
ImageNet Classification with Deep Convolutional Neural Networks
Generative Adversarial Networks
Deep Residual Learning for Image Recognition
Neural Architecture Search without Training
Big Bird: Transformers for Longer Sequences
Hopfield Networks
Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation
An image is worth 16x16 words: Transformers for Image Recognition at Scale
Lambda Networks: Modeling long-range Interactions without Attention
Rethinking Attention with Performers
Fourier Neural Operator for Parametric Partial Differential Equations
Switch Transformers: Scaling to Trillion Parameter Models
Feedback Transformers
NFNets: High-performance large-scale image recognition without normalization
TransGAN: Two transformers can make one strong GAN
GLOM: How to represent part-whole hierarchies in a neural network
Perceiver: General Perception with Iterative Attention (Google DeepMind)
MLP-Mixer: An All-MLP Architecture for Vision
Involution: Inverting the Inherence of Convolution for Visual Recognition
Expire-Span: Not all memories are created equal
Autoregressive Diffusion Models
Sparse is enough in scaling transformers
Requirements
Basic Mathematics
Basic understanding of AI concepts
Description
Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised, or unsupervised. In this course, we will go through modern deep-learning concepts and papers.
All the papers discussed in this course are available for download. Please refer to the resources under each lecture.
All the codes elaborated on in this course are available on GitHub. Please refer to the description of each lecture to find the GitHub links.
You will learn:
Neural Ordinary Differential Equations
We introduce a new family of deep neural network models. Instead of specifying a discrete sequence of hidden layers, we parameterize the derivative of the hidden state using a neural network. The output of the network is computed using a black-box differential equation solver. These continuous-depth models have constant memory cost, adapt their evaluation strategy to each input, and can explicitly trade numerical precision for speed. We demonstrate these properties in continuous-depth residual networks and continuous-time latent variable models. We also construct continuous normalizing flows, a generative model that can train by maximum likelihood, without partitioning or ordering the data dimensions. For training, we show how to scalably backpropagate through any ODE solver, without access to its internal operations. This allows end-to-end training of ODEs within larger models.
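As a minimal PyTorch sketch of the idea (not the paper's or the course's code, which uses adaptive black-box solvers and the adjoint method for backpropagation), the hidden-state derivative is a small network and a fixed-step Euler loop stands in for the solver; all sizes and names are illustrative.
```python
# Toy continuous-depth layer: dh/dt is parameterized by a network and
# integrated with fixed-step Euler. The real method uses adaptive solvers
# and the adjoint trick; this version simply backprops through the loop.
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, h, t):
        # concatenate time so the dynamics can depend on t
        t_col = torch.full((h.shape[0], 1), float(t))
        return self.net(torch.cat([h, t_col], dim=1))

def odeint_euler(func, h0, t0=0.0, t1=1.0, steps=20):
    """Integrate dh/dt = func(h, t) from t0 to t1 with fixed-step Euler."""
    h, dt = h0, (t1 - t0) / steps
    for i in range(steps):
        h = h + dt * func(h, t0 + i * dt)
    return h

func = ODEFunc(dim=8)
h0 = torch.randn(4, 8)          # batch of 4 initial hidden states
h1 = odeint_euler(func, h0)     # "output layer" = state at t = 1
print(h1.shape)                 # torch.Size([4, 8])
```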
The Odds are Odd: A Statistical Test for Detecting Adversarial Examples
We investigate conditions under which test statistics exist that can reliably detect examples that have been adversarially manipulated in a white-box attack. These statistics can be easily computed and calibrated by randomly corrupting inputs. They exploit certain anomalies that adversarial attacks introduce, in particular if they follow the paradigm of choosing perturbations optimally under p-norm constraints. Access to the log-odds is the only requirement to defend models. We justify our approach empirically, but also provide conditions under which detectability via the suggested test statistics is guaranteed to be effective. In our experiments, we show that it is even possible to correct test time predictions for adversarial attacks with high accuracy.
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
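A bare-bones sketch of the training-time normalization step described above (running statistics, inference mode, convolutional inputs, and the backward pass are omitted; this is not the paper's reference implementation).
```python
# Minimal batch-norm forward pass: normalize each feature over the
# mini-batch, then apply a learned scale (gamma) and shift (beta).
import torch

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: (N, D) mini-batch of feature vectors."""
    mean = x.mean(dim=0)
    var = x.var(dim=0, unbiased=False)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_hat + beta

x = torch.randn(32, 10) * 3.0 + 5.0            # mini-batch with shifted, scaled inputs
gamma, beta = torch.ones(10), torch.zeros(10)
y = batch_norm_train(x, gamma, beta)
print(y.mean(dim=0).abs().max(), y.std(dim=0).mean())   # ~0 mean, ~1 std per feature
```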
Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations
In recent years, the interest in unsupervised learning of disentangled representations has significantly increased. The key assumption is that real-world data is generated by a few explanatory factors of variation and that these factors can be recovered by unsupervised learning algorithms. A large number of unsupervised learning approaches based on auto-encoding and quantitative evaluation metrics of disentanglement have been proposed; yet, the efficacy of the proposed approaches and utility of proposed notions of disentanglement has not been challenged in prior work. In this paper, we provide a sober look on recent progress in the field and challenge some common assumptions. We first theoretically show that the unsupervised learning of disentangled representations is fundamentally impossible without inductive biases on both the models and the data. Then, we train more than 12000 models covering the six most prominent methods, and evaluate them across six disentanglement metrics in a reproducible large-scale experimental study on seven different data sets. On the positive side, we observe that different methods successfully enforce properties "encouraged" by the corresponding losses. On the negative side, we observe in our study that well-disentangled models seemingly cannot be identified without access to ground-truth labels even if we are allowed to transfer hyperparameters across data sets. Furthermore, increased disentanglement does not seem to lead to a decreased sample complexity of learning for downstream tasks. These results suggest that future work on disentanglement learning should be explicit about the role of inductive biases and (implicit) supervision, investigate concrete benefits of enforcing disentanglement of the learned representations, and consider a reproducible experimental setup covering several data sets.
Manifold Mixup: Better Representations by Interpolating Hidden States
Deep neural networks excel at learning the training data, but often provide incorrect and confident predictions when evaluated on slightly different test examples. This includes distribution shifts, outliers, and adversarial examples. To address these issues, we propose Manifold Mixup, a simple regularizer that encourages neural networks to predict less confidently on interpolations of hidden representations. Manifold Mixup leverages semantic interpolations as additional training signal, obtaining neural networks with smoother decision boundaries at multiple levels of representation. As a result, neural networks trained with Manifold Mixup learn class-representations with fewer directions of variance. We prove theory on why this flattening happens under ideal conditions, validate it on practical situations, and connect it to previous works on information theory and generalization. In spite of incurring no significant computation and being implemented in a few lines of code, Manifold Mixup improves strong baselines in supervised learning, robustness to single-step adversarial attacks, and test log-likelihood.
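A rough PyTorch sketch of the idea, assuming for simplicity that mixing always happens at the encoder output (the paper samples the mixing layer at random); the loss, Beta parameter, and model sizes are illustrative.
```python
# Manifold Mixup sketch: interpolate hidden states and labels of two
# examples and train on the mixed pair.
import torch
import torch.nn as nn

def manifold_mixup_step(encoder, head, x, y_onehot, alpha=2.0):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    h = encoder(x)                          # hidden representations
    perm = torch.randperm(x.size(0))
    h_mix = lam * h + (1 - lam) * h[perm]   # interpolate hidden states
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    logits = head(h_mix)
    return -(y_mix * torch.log_softmax(logits, dim=1)).sum(dim=1).mean()

encoder = nn.Sequential(nn.Linear(20, 64), nn.ReLU())
head = nn.Linear(64, 5)
x = torch.randn(16, 20)
y = torch.eye(5)[torch.randint(0, 5, (16,))]   # one-hot labels
loss = manifold_mixup_step(encoder, head, x, y)
loss.backward()
```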
Processing Megapixel Images with Deep Attention-Sampling Models
Current CNNs have to downsample large images before processing them, which can lose a lot of detail information. This paper proposes attention sampling, which learns to selectively process parts of any large image in full resolution, while discarding uninteresting bits. This leads to enormous gains in speed and memory consumption.
Gauge Equivariant Convolutional Networks and the Icosahedral CNN
Ever wanted to do a convolution on a Klein Bottle? This paper defines CNNs over manifolds such that they are independent of which coordinate frame you choose. Amazingly, this then results in an efficient practical method to achieve state-of-the-art in several tasks!
Dynamic Routing Between Capsules
Geoff Hinton's next big idea! Capsule Networks are an alternative way of implementing neural networks by dividing each layer into capsules. Each capsule is responsible for detecting the presence and properties of one particular entity in the input sample. This information is then allocated dynamically to higher-level capsules in a novel and unconventional routing scheme. While Capsule Networks are still in their infancy, they are an exciting and promising new direction.
SinGAN: Learning a Generative Model from a Single Natural Image
With just a single image as an input, this algorithm learns a generative model that matches the input image's patch distribution at multiple scales and resolutions. This enables sampling of extremely realistic looking variations on the original image and much more.
Reformer: The Efficient Transformer
The Transformer for the masses! Reformer solves the biggest problem with the famous Transformer model: Its huge resource requirements. By cleverly combining Locality Sensitive Hashing and ideas from Reversible Networks, the classically huge footprint of the Transformer is drastically reduced. Not only does that mean the model uses less memory, but it can process much longer input sequences, up to 16K tokens with just 16 GB of memory!
Growing Neural Cellular Automata
The Game of Life on steroids! This model learns to grow complex patterns in an entirely local way. Each cell is trained to listen to its neighbors and update itself in a way such that, collectively, an overall goal is reached. Fascinating and interactive!
Deep Learning for Symbolic Mathematics
This model solves integrals and ODEs by doing seq2seq!
Axial Attention & MetNet: A Neural Weather Model for Precipitation Forecasting
MetNet is a predictive neural network model for weather prediction. It uses axial attention to capture long-range dependencies. Axial attention decomposes attention layers over images into row-attention and column-attention in order to save memory and computation.
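To illustrate the row/column decomposition, here is a hypothetical PyTorch sketch that uses the built-in MultiheadAttention module as a stand-in; the MetNet and Axial-DeepLab papers add their own positional encodings and gating, which are omitted here.
```python
# Axial attention sketch: attend along rows, then along columns, instead of
# full 2-D attention over all H*W positions.
import torch
import torch.nn as nn

class AxialAttention2D(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, H, W, C)
        b, h, w, c = x.shape
        rows = x.reshape(b * h, w, c)          # attention along each row
        rows, _ = self.row_attn(rows, rows, rows)
        x = rows.reshape(b, h, w, c)
        cols = x.permute(0, 2, 1, 3).reshape(b * w, h, c)   # attention along each column
        cols, _ = self.col_attn(cols, cols, cols)
        return cols.reshape(b, w, h, c).permute(0, 2, 1, 3)

layer = AxialAttention2D(dim=32)
print(layer(torch.randn(2, 16, 16, 32)).shape)   # torch.Size([2, 16, 16, 32])
```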
POET: Endlessly Generating Increasingly Complex and Diverse Learning Environments
From the makers of Go-Explore, POET is a mixture of ideas from novelty search, evolutionary methods, open-ended learning and curriculum learning.
Evolving Normalization-Activation Layers
Normalization and activation layers have seen a long history of hand-crafted variants with various results. This paper proposes an evolutionary search to determine the ultimate, final and best combined normalization-activation layer… in a very specific setting.
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
Stunning evidence for the hypothesis that neural networks work so well because their random initialization almost certainly contains a nearly optimal sub-network that is responsible for most of the final performance.
FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence
FixMatch is a simple, yet surprisingly effective approach to semi-supervised learning. It combines two previous methods in a clever way and achieves state-of-the-art in regimes with few and very few labeled examples.
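A toy sketch of the unlabeled-data objective, with Gaussian noise standing in for the weak and strong augmentations (the paper uses flips/crops and RandAugment/CTAugment); the threshold and model are illustrative.
```python
# FixMatch-style loss sketch: pseudo-label the weakly augmented view, keep it
# only above a confidence threshold, and train the model to predict it on the
# strongly augmented view.
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(model, x_unlabeled, threshold=0.95):
    weak = x_unlabeled + 0.01 * torch.randn_like(x_unlabeled)     # weak augmentation (placeholder)
    strong = x_unlabeled + 0.10 * torch.randn_like(x_unlabeled)   # strong augmentation (placeholder)
    with torch.no_grad():
        probs = torch.softmax(model(weak), dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = (conf >= threshold).float()                        # keep confident pseudo-labels only
    loss = F.cross_entropy(model(strong), pseudo, reduction="none")
    return (mask * loss).mean()

model = torch.nn.Linear(20, 5)
print(fixmatch_unlabeled_loss(model, torch.randn(8, 20)))
```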
Longformer: The Long-Document Transformer
The Longformer extends the Transformer by introducing sliding window attention and sparse global attention. This allows for the processing of much longer documents than classic models like BERT.
Jukebox: A Generative Model for Music
This generative model for music can make entire songs with remarkable quality and consistency. It can be conditioned on genre, artist, and even lyrics.
Big Transfer (BiT): General Visual Representation Learning
One CNN to rule them all! BiT is a pre-trained ResNet that can be used as a starting point for any visual task. This paper explains what it takes to pre-train such a large model and details how fine-tuning on downstream tasks is done best.
Group Normalization
The dirty little secret of Batch Normalization is its intrinsic dependence on the training batch size. Group Normalization attempts to achieve the benefits of normalization without batch statistics and, most importantly, without sacrificing performance compared to Batch Normalization.
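A minimal sketch of the computation, checked against PyTorch's built-in group_norm; the group count and tensor shapes are illustrative.
```python
# Group Normalization sketch: statistics are computed per sample over groups
# of channels, so the result is independent of the batch size.
import torch

def group_norm(x, num_groups, gamma, beta, eps=1e-5):
    """x: (N, C, H, W); gamma, beta: (C,)"""
    n, c, h, w = x.shape
    xg = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = xg.mean(dim=(2, 3, 4), keepdim=True)
    var = xg.var(dim=(2, 3, 4), keepdim=True, unbiased=False)
    xg = (xg - mean) / torch.sqrt(var + eps)
    x_hat = xg.reshape(n, c, h, w)
    return gamma.view(1, c, 1, 1) * x_hat + beta.view(1, c, 1, 1)

x = torch.randn(2, 32, 8, 8)
y = group_norm(x, num_groups=8, gamma=torch.ones(32), beta=torch.zeros(32))
ref = torch.nn.functional.group_norm(x, 8, torch.ones(32), torch.zeros(32))
print(torch.allclose(y, ref, atol=1e-5))   # matches the built-in layer
```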
Weight Standardization
It's common for neural networks to include data normalization such as BatchNorm or GroupNorm. This paper extends the normalization to also include the weights of the network. This surprisingly simple change leads to a boost in performance and - combined with GroupNorm - new state-of-the-art results.
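A hypothetical sketch of a weight-standardized convolution layer; the epsilon and exact normalization details may differ from the official implementation.
```python
# Weight Standardization sketch: standardize the convolution weights
# (zero mean, unit variance per output filter) before every forward pass.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        return F.conv2d(x, (w - mean) / std, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

conv = WSConv2d(16, 32, kernel_size=3, padding=1)
print(conv(torch.randn(1, 16, 24, 24)).shape)   # torch.Size([1, 32, 24, 24])
```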
DETR: End-to-End Object Detection with Transformers
Object detection in images is a notoriously hard task! Objects can be of a wide variety of classes, can be numerous or absent, they can occlude each other or be out of frame. All of this makes it even more surprising that the architecture in this paper is so simple. Thanks to a clever loss function, a single Transformer stacked on a CNN is enough to handle the entire task!
Synthesizer: Rethinking Self-Attention in Transformer Models
Do we really need dot-product attention? The attention mechanism is a central part of modern Transformers, mainly due to the dot-product attention mechanism. This paper changes the mechanism to remove the quadratic interaction terms and comes up with a new model, the Synthesizer. As it turns out, you can do pretty well like that!
Learning to Classify Images without Labels
How do you learn labels without labels? How do you classify images when you don't know what to classify them into? This paper investigates a new combination of representation learning, clustering, and self-labeling in order to group visually similar images together - and achieves surprisingly high accuracy on benchmark datasets.
Synthetic Petri Dish: A Novel Surrogate Model for Rapid Architecture Search
Neural Architecture Search is usually too expensive in both time and resources to be practical. A search strategy has to keep evaluating new models, training each to convergence in an inner loop to find out if it is any good. This paper proposes to extract the essential part of the architecture to be optimized into a much smaller surrogate version, and to evaluate that version on a handful of learned, synthetic data points to predict its performance, which is much faster and cheaper than training the full model.
End-to-end Adversarial Text-to-Speech
Text-to-speech engines are usually multi-stage pipelines that transform the signal into many intermediate representations and require supervision at each step. When trying to train TTS end-to-end, the alignment problem arises: Which text corresponds to which piece of sound? This paper uses an alignment module to tackle this problem and produces astonishingly good sound.
Linformer: Self-Attention with Linear Complexity
Transformers are notoriously resource-intensive because their self-attention mechanism requires memory and computation quadratic in the length of the input sequence. The Linformer gets around this by exploiting the fact that the actual information in the attention matrix is often of low rank and can be approximated.
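A simplified sketch of the low-rank trick, with a single head and without the key/value projection sharing described in the paper; all dimensions are illustrative.
```python
# Linformer-style attention sketch: project the length dimension of keys and
# values down to a fixed size k, so attention costs O(n*k) instead of O(n^2).
import torch
import torch.nn as nn

class LinformerSelfAttention(nn.Module):
    def __init__(self, dim, seq_len, k=64):
        super().__init__()
        self.q, self.kv = nn.Linear(dim, dim), nn.Linear(dim, 2 * dim)
        self.E = nn.Parameter(torch.randn(k, seq_len) / seq_len ** 0.5)  # length projection
        self.scale = dim ** -0.5

    def forward(self, x):                       # x: (B, n, d)
        q = self.q(x)
        k, v = self.kv(x).chunk(2, dim=-1)
        k, v = self.E @ k, self.E @ v           # (B, k, d): compressed keys/values
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (B, n, k)
        return attn @ v                         # (B, n, d)

layer = LinformerSelfAttention(dim=64, seq_len=1024, k=64)
print(layer(torch.randn(2, 1024, 64)).shape)    # torch.Size([2, 1024, 64])
```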
VirTex: Learning Visual Representations from Textual Annotations
Pre-training a CNN backbone for visual transfer learning has recently seen a big push into the direction of incorporating more data, at the cost of less supervision. This paper investigates the opposite: Visual transfer learning by pre-training from very few, but very high-quality samples on an image captioning task.
A bio-inspired bistable recurrent cell allows for long-lasting memory
Even though LSTMs and GRUs solve the vanishing and exploding gradient problems, they have trouble learning to remember things over very long time spans. Inspired by bistability, a property of biological neurons, this paper constructs a recurrent cell with an inherent memory property, with only minimal modification to existing architectures.
TUNIT: Rethinking the Truly Unsupervised Image-to-Image Translation
Image-to-Image translation usually requires corresponding samples or at least domain labels of the dataset. This paper removes that restriction and allows for fully unsupervised image translation of a source image to the style of one or many reference images. This is achieved by jointly training a guiding network that provides style information and pseudo-labels.
SIREN: Implicit Neural Representations with Periodic Activation Functions
Implicit neural representations arise when a neural network is used to represent a signal as a function. SIRENs are a particular type of INR that can be applied to a variety of signals, such as images, sound, or 3D shapes. This is an interesting departure from regular machine learning and requires thinking about models in a different way.
Discovering Symbolic Models from Deep Learning with Inductive Biases
Neural networks are very good at predicting systems' numerical outputs, but not very good at deriving the discrete symbolic equations that govern many physical systems. This paper combines Graph Networks with symbolic regression and shows that the strong inductive biases of these models can be used to derive accurate symbolic equations from observation data.
RepNet: Counting Out Time - Class Agnostic Video Repetition Counting in the Wild
Counting repeated actions in a video is one of the easiest tasks for humans, yet remains incredibly hard for machines. RepNet achieves state-of-the-art by creating an information bottleneck in the form of a temporal self-similarity matrix, relating video frames to each other in a way that forces the model to surface the information relevant for counting. Along with that, the authors produce a new dataset for evaluating counting models.
Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures
Backpropagation is one of the central components of modern deep learning. However, it's not biologically plausible, which limits the applicability of deep learning to understand how the human brain works. Direct Feedback Alignment is a biologically plausible alternative and this paper shows that, contrary to previous research, it can be successfully applied to modern deep architectures and solve challenging tasks.
Set Distribution Networks: A Generative Model for Sets of Images
We've become very good at making generative models for images and classes of images, but not yet of sets of images, especially when the number of sets is unknown and can contain sets that have never been encountered during training. This paper builds a probabilistic framework and a practical implementation of a generative model for sets of images based on variational methods.
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Google builds a 600 billion parameter transformer to do massively multilingual, massive machine translation. Interestingly, the larger model scale does not come from increasing depth of the transformer, but from increasing width in the feedforward layers, combined with a hard routing to parallelize computations on up to 2048 TPUs. A very detailed engineering paper!
SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization
The high-level architecture of CNNs has not really changed over the years. We tend to build high-resolution low-dimensional layers first, followed by ever more coarse, but deep layers. This paper challenges this decades-old heuristic and uses neural architecture search to find an alternative, called SpineNet that employs multiple rounds of re-scaling and long-range skip connections.
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
Transformers are famous for two things: Their superior performance and their insane requirements of compute and memory. This paper reformulates the attention mechanism in terms of kernel functions and obtains a linear formulation, which reduces these requirements. Surprisingly, this formulation also surfaces an interesting connection between autoregressive transformers and RNNs.
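A sketch of the linearized attention computation using the elu(x)+1 feature map from the paper; causal masking and the recurrent (RNN) formulation are left out, and the shapes are illustrative.
```python
# Linear attention sketch: replace softmax(QK^T)V with phi(Q)(phi(K)^T V),
# so cost grows linearly rather than quadratically in the sequence length n.
import torch

def linear_attention(q, k, v, eps=1e-6):
    """q, k: (B, n, d); v: (B, n, d_v)."""
    phi_q = torch.nn.functional.elu(q) + 1
    phi_k = torch.nn.functional.elu(k) + 1
    kv = phi_k.transpose(-2, -1) @ v                                   # (B, d, d_v)
    z = phi_q @ phi_k.sum(dim=1, keepdim=True).transpose(-2, -1)       # (B, n, 1) normalizer
    return (phi_q @ kv) / (z + eps)

q = k = v = torch.randn(2, 4096, 64)
print(linear_attention(q, k, v).shape)   # torch.Size([2, 4096, 64])
```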
SupSup: SuperMasks in Superposition
Supermasks are binary masks of a randomly initialized neural network that result in the masked network performing well on a particular task. This paper considers the problem of (sequential) Lifelong Learning and trains one Supermask per Task, while keeping the randomly initialized base network constant. By minimizing the output entropy, the system can automatically derive the Task ID of a data point at inference time and distinguish up to 2500 tasks automatically.
NVAE: A Deep Hierarchical Variational Autoencoder
VAEs have been traditionally hard to train at high resolutions and unstable when going deep with many layers. In addition, VAE samples are often more blurry and less crisp than those from GANs. This paper details all the engineering choices necessary to successfully train a deep hierarchical VAE that exhibits global consistency and astounding sharpness at high resolutions.
Gradient Origin Networks
Neural networks for implicit representations, such as SIRENs, have been very successful at modeling natural signals. However, in the classical approach, each data point requires its own neural network to be fit. This paper extends implicit representations to an entire dataset by introducing latent vectors of data points to SIRENs. Interestingly, the paper shows that such latent vectors can be obtained without the need for an explicit encoder, by simply looking at the negative gradient of the zero-vector through the representation function.
ImageNet Classification with Deep Convolutional Neural Networks
AlexNet was the start of the deep learning revolution. Up until 2012, the best computer vision systems relied on hand-crafted features and highly specialized algorithms to perform object classification. This paper was the first to successfully train a deep convolutional neural network on not one, but two GPUs and managed to outperform the competition on ImageNet by an order of magnitude.
Generative Adversarial Networks
GANs are one of the main models in modern deep learning. This is the paper that started it all! While the task of image classification was making progress, the task of image generation was still cumbersome and prone to artifacts. The main idea behind GANs is to pit two competing networks against each other, thereby creating a generative model that only ever has implicit access to the data through a second, discriminative, model. The paper combines architecture, experiments, and theoretical analysis beautifully.
Deep Residual Learning for Image Recognition
ResNets are one of the cornerstones of modern Computer Vision. Before their invention, people were not able to scale deep neural networks beyond 20 or so layers, but with this paper's invention of residual connections, all of a sudden networks could be arbitrarily deep. This led to a big spike in the performance of convolutional neural networks and rapid adoption in the community. To this day, ResNets are the backbone of most vision models and residual connections appear all throughout deep learning.
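A minimal residual block sketch, showing only the skip-connection idea; the actual ResNet blocks use specific channel, stride, and bottleneck configurations.
```python
# Residual block sketch: the output is x + F(x), so gradients always have an
# identity path through the network, regardless of depth.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))   # identity skip connection

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])
```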
Neural Architecture Search without Training
Neural Architecture Search is typically very slow and resource-intensive. A meta-controller has to train many hundreds or thousands of different models to find a suitable building plan. This paper proposes to use statistics of the Jacobian around data points to estimate the performance of proposed architectures at initialization. This method does not require training and speeds up NAS by orders of magnitude.
Big Bird: Transformers for Longer Sequences
The quadratic resource requirements of the attention mechanism are the main roadblock in scaling up transformers to long sequences. This paper replaces the full quadratic attention mechanism by a combination of random attention, window attention, and global attention. Not only does this allow the processing of longer sequences, translating to state-of-the-art experimental results, but the paper also shows that BigBird comes with theoretical guarantees of universal approximation and Turing completeness.
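A toy illustration of how such a sparse pattern can be expressed as a boolean attention mask; the real implementation works on blocks for efficiency, and the window, global, and random sizes chosen here are arbitrary.
```python
# Big Bird-style sparse attention mask sketch: sliding window + a few global
# tokens + random links. The mask marks which query-key pairs may attend.
import torch

def bigbird_mask(n, window=3, num_global=2, num_random=2, seed=0):
    g = torch.Generator().manual_seed(seed)
    mask = torch.zeros(n, n, dtype=torch.bool)
    idx = torch.arange(n)
    mask |= (idx[:, None] - idx[None, :]).abs() <= window   # sliding window
    mask[:num_global, :] = True                             # global tokens attend everywhere
    mask[:, :num_global] = True                             # ...and are attended by everyone
    rand = torch.randint(0, n, (n, num_random), generator=g)
    rows = torch.arange(n).unsqueeze(1).expand(n, num_random)
    mask[rows, rand] = True                                 # random connections per query
    return mask

m = bigbird_mask(16)
print(m.float().mean())   # fraction of allowed pairs; much less than 1 for large n
```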
Hopfield Networks
Hopfield Networks are one of the classic models of biological memory networks. This paper generalizes modern Hopfield Networks to continuous states and shows that the corresponding update rule is equal to the attention mechanism used in modern Transformers. It further analyzes a pre-trained BERT model through the lens of Hopfield Networks and uses a Hopfield Attention Layer to perform Immune Repertoire Classification.
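The continuous retrieval update can be written in a few lines; this is a sketch of the retrieval step only (the inverse temperature beta and the stored patterns are illustrative), showing the softmax-attention form of the rule.
```python
# Modern Hopfield retrieval sketch: xi_new = X^T softmax(beta * X xi), i.e.
# the stored patterns act as keys/values and the query is the noisy state.
import torch

def hopfield_retrieve(stored, query, beta=8.0, steps=1):
    """stored: (num_patterns, dim); query: (dim,)"""
    xi = query
    for _ in range(steps):
        xi = stored.t() @ torch.softmax(beta * (stored @ xi), dim=0)
    return xi

patterns = torch.nn.functional.normalize(torch.randn(5, 32), dim=1)  # stored patterns
noisy = patterns[2] + 0.1 * torch.randn(32)                          # corrupted query
retrieved = hopfield_retrieve(patterns, noisy)
print(torch.cosine_similarity(retrieved, patterns[2], dim=0))        # close to 1
```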
Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation
Convolutional Neural Networks have dominated image processing for the last decade, but transformers are quickly replacing traditional models. This paper proposes a fully attentional model for images by combining learned Positional Embeddings with Axial Attention. This new model can compete with CNNs on image classification and achieve state-of-the-art in various image segmentation tasks.
An image is worth 16x16 words: Transformers for Image Recognition at Scale
Transformers are coming for convolutions. This paper shows that, given enough data, a standard Transformer can outperform Convolutional Neural Networks on image recognition tasks, which are classically tasks where CNNs excel. The lecture walks through the architecture of the Vision Transformer (ViT) and the reasons why it works so well.
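A sketch of the patch-embedding front end that turns an image into a token sequence for a standard Transformer encoder; the sizes follow the common 16x16-patch configuration but are otherwise illustrative, and the encoder itself is omitted.
```python
# ViT-style patch embedding sketch: split into 16x16 patches, project each
# patch to a token, prepend a class token, and add position embeddings.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch=16, in_ch=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)  # patchify + project
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                                   # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)    # (B, num_patches, dim)
        cls = self.cls.expand(x.size(0), -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos

embed = PatchEmbed()
print(embed(torch.randn(2, 3, 224, 224)).shape)   # torch.Size([2, 197, 768])
```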
Lambda Networks: Modeling long-range Interactions without Attention
Transformers, having already conquered NLP, have recently started to take over the field of Computer Vision. So far, image size has been a challenge, as the memory requirements of the Transformer's attention mechanism grow quadratically with the input size. LambdaNetworks offer a way around this requirement and capture long-range interactions without the need to build expensive attention maps. They reach a new state-of-the-art on ImageNet and compare favorably to both Transformers and CNNs in terms of efficiency.
Rethinking Attention with Performers
Transformers have huge memory and compute requirements because they construct an Attention matrix, which grows quadratically in the size of the input. The Performer is a model that uses random positive orthogonal features to construct an unbiased estimator to the Attention matrix and obtains an arbitrarily good approximation in linear time! The method generalizes beyond attention and opens the door to the next generation of deep learning architectures.
Fourier Neural Operator for Parametric Partial Differential Equations
Numerical solvers for Partial Differential Equations are notoriously slow. They need to evolve their state in tiny steps to stay accurate, and they need to repeat this for each new problem. The Fourier Neural Operator, the architecture proposed in this paper, can evolve a PDE in time with a single forward pass, and do so for an entire family of PDEs, as long as the training set covers them well. By performing crucial operations only in Fourier space, this new architecture is also independent of the discretization or sampling of the underlying signal and has the potential to speed up many scientific applications.
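A sketch of a 1-D spectral convolution, the core block of the architecture, assuming a fixed number of retained Fourier modes; the lifting/projection layers and the training setup from the paper are omitted.
```python
# Spectral convolution sketch: FFT the signal, multiply the lowest `modes`
# frequencies by learned complex weights, and transform back.
import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    def __init__(self, channels, modes=16):
        super().__init__()
        self.modes = modes
        scale = 1.0 / channels
        self.weights = nn.Parameter(scale * torch.randn(channels, channels, modes, dtype=torch.cfloat))

    def forward(self, x):                       # x: (B, C, N)
        x_ft = torch.fft.rfft(x)                # (B, C, N//2 + 1), complex
        out_ft = torch.zeros_like(x_ft)
        out_ft[:, :, :self.modes] = torch.einsum("bim,iom->bom",
                                                 x_ft[:, :, :self.modes], self.weights)
        return torch.fft.irfft(out_ft, n=x.size(-1))

layer = SpectralConv1d(channels=8, modes=16)
print(layer(torch.randn(4, 8, 128)).shape)      # torch.Size([4, 8, 128])
```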
Switch Transformers: Scaling to Trillion Parameter Models
Scale is the next frontier for AI. Google Brain uses sparsity and hard routing to massively increase a model's parameters, while keeping the FLOPs per forward pass constant. The Switch Transformer compares favorably to its dense counterparts in terms of speed and sample efficiency and breaks the next magic number: One Trillion Parameters.
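A toy sketch of top-1 ("switch") routing across a few feed-forward experts; the load-balancing loss, capacity factors, and sharding described in the paper are omitted, and all sizes are illustrative.
```python
# Switch-style routing sketch: a router picks exactly one expert per token, so
# parameters grow with the number of experts while per-token FLOPs stay flat.
import torch
import torch.nn as nn

class SwitchFFN(nn.Module):
    def __init__(self, dim, num_experts=4, hidden=128):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)])

    def forward(self, x):                                  # x: (tokens, dim)
        probs = torch.softmax(self.router(x), dim=-1)
        gate, expert_idx = probs.max(dim=-1)               # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = expert_idx == i
            if sel.any():
                out[sel] = gate[sel].unsqueeze(1) * expert(x[sel])  # scale by router prob
        return out

layer = SwitchFFN(dim=32)
print(layer(torch.randn(10, 32)).shape)    # torch.Size([10, 32])
```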
Feedback Transformers: Addressing Some Limitations of Transformers with Feedback Memory
Autoregressive Transformers have taken over the world of language modeling (GPT-3). However, to train them, people use causal masking and sample parallelism, which means computation only happens in a feedforward manner. As a result, higher-layer information that would otherwise be available is never used by the lower layers of subsequent tokens, which limits the computational capabilities of the overall model. Feedback Transformers trade off training speed for access to these representations and demonstrate remarkable improvements on complex reasoning and long-range dependency tasks.
NFNets: High-performance large-scale image recognition without normalization
Batch Normalization is a core component of modern deep learning. It enables training at higher batch sizes, prevents mean shift, provides implicit regularization, and allows networks to reach higher performance than without. However, BatchNorm also has disadvantages, such as its dependence on batch size and its computational overhead, especially in distributed settings. Normalizer-Free Networks, developed at Google DeepMind, are a class of CNNs that achieve state-of-the-art classification accuracy on ImageNet without batch normalization. This is achieved by using adaptive gradient clipping (AGC), combined with a number of improvements in general network architecture. The resulting networks train faster, are more accurate, and provide better transfer learning performance. Code is provided in Jax.
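A simplified sketch of adaptive gradient clipping; the official version (in Jax) clips per output unit rather than per whole tensor, so treat this PyTorch version as an approximation of the idea.
```python
# AGC sketch: clip a parameter's gradient when its norm is large relative to
# the norm of the parameter itself.
import torch

def adaptive_grad_clip_(parameters, clip=0.01, eps=1e-3):
    for p in parameters:
        if p.grad is None:
            continue
        p_norm = p.detach().norm().clamp(min=eps)
        g_norm = p.grad.detach().norm()
        max_norm = clip * p_norm
        if g_norm > max_norm:
            p.grad.mul_(max_norm / (g_norm + 1e-6))

model = torch.nn.Linear(10, 10)
loss = model(torch.randn(4, 10)).pow(2).mean()
loss.backward()
adaptive_grad_clip_(model.parameters())   # call before optimizer.step()
```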
TransGAN: Two transformers can make one strong GAN
Generative Adversarial Networks (GANs) hold the state of the art when it comes to image generation. However, while the rest of computer vision is slowly being taken over by transformers and other attention-based architectures, all working GANs to date contain some form of convolutional layers. This paper changes that and builds TransGAN, the first GAN in which both the generator and the discriminator are transformers. The discriminator is adapted from ViT (An Image is Worth 16x16 Words), and the generator uses PixelShuffle to up-sample the generated resolution. Three tricks make training work: data augmentation using DiffAug, an auxiliary super-resolution task, and a localized initialization of self-attention. Their largest model reaches competitive performance with the best convolutional GANs on CIFAR-10, STL-10, and CelebA.
GLOM: How to represent part-whole hierarchies in a neural network
Geoffrey Hinton describes GLOM, a Computer Vision model that combines transformers, neural fields, contrastive learning, capsule networks, denoising autoencoders and RNNs. GLOM decomposes an image into a parse tree of objects and their parts. However, unlike previous systems, the parse tree is constructed dynamically and differently for each input, without changing the underlying neural network. This is done by a multi-step consensus algorithm that runs over different levels of abstraction at each location of an image simultaneously. GLOM is just an idea for now but suggests a radically new approach to AI visual scene understanding.
Perceiver: General Perception with Iterative Attention (Google DeepMind)
Inspired by the fact that biological creatures attend to multiple modalities at the same time, DeepMind releases its new Perceiver model. Based on the Transformer architecture, the Perceiver makes no assumptions on the modality of the input data and also solves the long-standing quadratic bottleneck problem. This is achieved by having a latent low-dimensional Transformer, where the input data is fed multiple times via cross-attention. The Perceiver's weights can also be shared across layers, making it very similar to an RNN. Perceivers achieve competitive performance on ImageNet and state-of-the-art on other modalities, all while making no architectural adjustments to input data.
MLP-Mixer: An All-MLP Architecture for Vision
Convolutional Neural Networks have dominated computer vision for nearly 10 years, and that might finally come to an end. First, Vision Transformers (ViT) have shown remarkable performance, and now even simple MLP-based models reach competitive accuracy, as long as sufficient data is used for pre-training. This paper presents MLP-Mixer, using MLPs in a particular weight-sharing arrangement to achieve a competitive, high-throughput model and it raises some interesting questions about the nature of learning and inductive biases and their interaction with scale for future research.
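A sketch of a single Mixer block (a token-mixing MLP followed by a channel-mixing MLP, each with LayerNorm and a skip connection); the patch embedding and classifier head are omitted, and all sizes are illustrative.
```python
# MLP-Mixer block sketch: mix information across patches, then across channels,
# using only MLPs and transposes.
import torch
import torch.nn as nn

def mlp(dim, hidden):
    return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

class MixerBlock(nn.Module):
    def __init__(self, num_patches, channels, token_hidden=256, channel_hidden=512):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(channels), nn.LayerNorm(channels)
        self.token_mlp = mlp(num_patches, token_hidden)
        self.channel_mlp = mlp(channels, channel_hidden)

    def forward(self, x):                                  # x: (B, patches, channels)
        y = self.norm1(x).transpose(1, 2)                  # mix across patches
        x = x + self.token_mlp(y).transpose(1, 2)
        return x + self.channel_mlp(self.norm2(x))         # mix across channels

block = MixerBlock(num_patches=196, channels=512)
print(block(torch.randn(2, 196, 512)).shape)   # torch.Size([2, 196, 512])
```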
Involution: Inverting the Inherence of Convolution for Visual Recognition
Convolutional Neural Networks (CNNs) have dominated computer vision for almost a decade by applying two fundamental principles: Spatial agnosticism and channel-specific computations. Involution aims to invert these principles and presents a spatial-specific computation, which is also channel-agnostic. The resulting Involution Operator and RedNet architecture are a compromise between classic Convolutions and the newer Local Self-Attention architectures and perform favorably in terms of computation accuracy tradeoff when compared to either.
Expire-Span: Not all memories are created equal
Facebook AI (FAIR) researchers present Expire-Span, a variant of Transformer-XL that dynamically assigns expiration dates to previously encountered signals. Because of this, Expire-Span can handle sequences of many thousands of tokens while keeping memory and compute requirements at a manageable level. It matches or outperforms baseline systems while consuming far fewer resources. We discuss its architecture, advantages, and shortcomings.
Autoregressive Diffusion Models
Diffusion models have made large advances in recent months as a new class of generative model. This paper introduces Autoregressive Diffusion Models (ARDMs), a mix between autoregressive generative models and diffusion models. ARDMs are trained to be agnostic to the order of autoregressive decoding and give the user a dynamic tradeoff between speed and performance at decoding time. The paper applies ARDMs to both text and image data and, as an extension, shows that the models can also be used for lossless compression.
Sparse is enough in scaling transformers
Transformers keep pushing the state of the art in language and other domains, mainly thanks to their ability to scale to ever more parameters. However, this scaling has made it prohibitively expensive to run many inference requests against a Transformer, both in terms of compute and memory. Scaling Transformers are a new kind of architecture that leverages sparsity in the Transformer blocks to massively speed up inference; combined with additional ideas from other architectures, this yields the Terraformer, which is fast, accurate, and consumes very little memory.
Who this course is for:
Students
Those who want to develop their career in data science and AI