Hyperparameter (HP) tuning in deep learning is an expensive process, prohibitively so for neural networks (NNs) with billions of parameters.
It is time to recognize and build on the convergence of multiple diverse disciplines on a substantive common model of the intelligent agent.
Data is the driving force of machine learning: the amount and quality of the training data are often more important to a system's performance than the architecture and training details.
In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers.
All-MLP architectures have attracted increasing interest as an alternative to attention-based models.
Ranked #1 on Zero-Shot Learning on COPA.
We are currently witnessing the spectacular success of artificial intelligence in both science and public life.
It is merely a transformer layer: it uses self-attention and cross-attention to efficiently compute a recurrent function over a large set of state vectors and tokens.
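To make that mechanism concrete, here is a minimal PyTorch sketch, not taken from the excerpt above and with all names and sizes illustrative: a cell that updates a block of state vectors with self-attention, cross-attends from the states to the current tokens, and, when rolled over consecutive token blocks, computes a recurrent function over the state vectors.

```python
import torch
import torch.nn as nn

class RecurrentTransformerCell(nn.Module):
    """Illustrative cell: self-attention over a block of state vectors,
    cross-attention from the states to the current tokens, then a
    feed-forward update. Applied block by block, it acts as a recurrent
    function over the state vectors."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, state: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # state:  (batch, num_states, d_model) -- carried across blocks
        # tokens: (batch, block_len, d_model)  -- current block of inputs
        s = self.norm1(state)
        state = state + self.self_attn(s, s, s, need_weights=False)[0]
        s = self.norm2(state)
        state = state + self.cross_attn(s, tokens, tokens, need_weights=False)[0]
        state = state + self.ffn(self.norm3(state))
        return state  # next recurrent state


# Usage: roll the cell over consecutive blocks of tokens like an RNN.
cell = RecurrentTransformerCell()
state = torch.zeros(2, 16, 256)       # 16 state vectors per sequence
blocks = torch.randn(4, 2, 128, 256)  # 4 blocks of 128 tokens each
for block in blocks:
    state = cell(state, block)
```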
Generative Adversarial Networks (GANs) are very popular frameworks for generating high-quality data, and are widely used in both academia and industry across many domains.
In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin.
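One way to make the "single low error basin" observation concrete is weight-space averaging: if several fine-tuned models sit in the same basin, a uniform average of their parameters typically remains a usable model. The helper below is a hypothetical illustration under that assumption; the function and variable names are ours, not from the excerpt.

```python
import copy
import torch

def average_state_dicts(models):
    """Uniform average of the parameters of several fine-tuned models that
    share one architecture. Non-floating-point entries (e.g. integer
    buffers) are kept from the first model."""
    avg = copy.deepcopy(models[0].state_dict())
    for key, value in avg.items():
        if value.is_floating_point():
            stacked = torch.stack([m.state_dict()[key].float() for m in models])
            avg[key] = stacked.mean(dim=0).to(value.dtype)
        # else: keep the first model's value unchanged
    return avg

# Usage (illustrative): `models` would be several checkpoints fine-tuned
# from the same pre-trained weights with different hyperparameters.
# merged = copy.deepcopy(models[0])
# merged.load_state_dict(average_state_dicts(models))
```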
Ranked #1 on Image Classification on ImageNet (using extra training data).
We start by describing two conceptually different approaches to building embedding modules: the first one is based on a piecewise linear encoding of scalar values, and the second one utilizes periodic activations.
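A minimal PyTorch sketch of what those two embedding modules could look like, assuming quantile-style bin edges for the piecewise linear encoding and learnable frequencies for the periodic activations; all class names and hyperparameters are illustrative, not taken from the excerpt.

```python
import torch
import torch.nn as nn

class PiecewiseLinearEncoding(nn.Module):
    """Encodes a scalar feature as its fractional progress through a set of
    bins: 0 before a bin, 1 past it, and a linear fraction inside it.
    Bin edges would normally come from quantiles of the training data."""

    def __init__(self, bin_edges: torch.Tensor):  # sorted, shape (T + 1,)
        super().__init__()
        self.register_buffer("lower", bin_edges[:-1])
        self.register_buffer("upper", bin_edges[1:])

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch,)
        frac = (x.unsqueeze(-1) - self.lower) / (self.upper - self.lower)
        return frac.clamp(0.0, 1.0)  # (batch, T)


class PeriodicEncoding(nn.Module):
    """Multiplies the scalar by learnable frequencies and applies sin/cos,
    i.e. a Fourier-feature-style periodic embedding."""

    def __init__(self, n_frequencies: int = 8, sigma: float = 1.0):
        super().__init__()
        self.freq = nn.Parameter(torch.randn(n_frequencies) * sigma)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch,)
        v = 2 * torch.pi * x.unsqueeze(-1) * self.freq
        return torch.cat([torch.sin(v), torch.cos(v)], dim=-1)  # (batch, 2F)


# Usage (illustrative): embed one numerical column both ways.
x = torch.randn(32)
ple = PiecewiseLinearEncoding(torch.linspace(-3.0, 3.0, steps=9))  # 8 bins
per = PeriodicEncoding(n_frequencies=8)
print(ple(x).shape, per(x).shape)  # torch.Size([32, 8]) torch.Size([32, 16])
```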