Sobolev Training for Neural Networks

Wojciech Marian Czarnecki, Simon Osindero, Max Jaderberg
Grzegorz Swirszcz, and Razvan Pascanu
DeepMind, London, UK
{lejlot,osindero,jaderberg,swirszcz,razp}@google.com
Abstract

At the heart of deep learning we aim to use neural networks as function approximators – training them to produce outputs from inputs in emulation of a ground truth function or data creation process. In many cases we only have access to input-output pairs from the ground truth; however, it is becoming more common to have access to derivatives of the target output with respect to the input – for example when the ground truth function is itself a neural network such as in network compression or distillation. Generally these target derivatives are not computed, or are ignored. This paper introduces Sobolev Training for neural networks, which is a method for incorporating these target derivatives in addition to the target values while training. By optimising neural networks to not only approximate the function’s outputs but also the function’s derivatives we encode additional information about the target function within the parameters of the neural network. Thereby we can improve the quality of our predictors, as well as the data-efficiency and generalization capabilities of our learned function approximation. We provide theoretical justifications for such an approach as well as examples of empirical evidence on three distinct domains: regression on classical optimisation datasets, distilling policies of an agent playing Atari, and on large-scale applications of synthetic gradients. In all three domains the use of Sobolev Training, employing target derivatives in addition to target values, results in models with higher accuracy and stronger generalisation.

1 Introduction

Deep Neural Networks (DNNs) are one of the main tools of modern machine learning. They are consistently proven to be powerful function approximators, able to model a wide variety of functional forms – from image recognition [8, 24], through audio synthesis [27], to human-beating policies in the ancient game of Go [22]. In many applications the process of training a neural network consists of receiving a dataset of input-output pairs from a ground truth function, and minimising some loss with respect to the network’s parameters. This loss is usually designed to encourage the network to produce the same output, for a given input, as that from the target ground truth function. Many of the ground truth functions we care about in practice have an unknown analytic form, e.g. because they are the result of a natural physical process, and therefore we only have the observed input-output pairs for supervision. However, there are scenarios where we do know the analytic form and so are able to compute the ground truth gradients (or higher order derivatives); alternatively, these quantities may sometimes be directly observable. A common example is when the ground truth function is itself a neural network; for instance this is the case for distillation [9, 20], compressing neural networks [7], and the prediction of synthetic gradients [12]. Additionally, if we are dealing with an environment/data-generation process (vs. a pre-determined set of data points), then even though we may be dealing with a black box we can still approximate derivatives using finite differences. In this work, we consider how this additional information can be incorporated in the learning process, and what advantages it can provide in terms of data efficiency and performance.
We propose Sobolev Training (ST) for neural networks as a simple and efficient technique for leveraging derivative information about the desired function in a way that can easily be incorporated into any training pipeline using modern machine learning libraries.

The approach is inspired by the work of Hornik [10] which proved the universal approximation theorems for neural networks in Sobolev spaces – metric spaces where distances between functions are defined both in terms of their differences in values and differences in values of their derivatives.

In particular, it was shown that a sigmoid network can not only approximate a function’s value arbitrarily well, but that the network’s derivatives with respect to its inputs can approximate the corresponding derivatives of the ground truth function arbitrarily well too. Sobolev Training exploits this property, and tries to match not only the output of the function being trained but also its derivatives.

Figure 1: a) Sobolev Training of order 2. Diamond nodes $m$ and $f$ indicate parameterised functions, where $m$ is trained to approximate $f$. Green nodes receive supervision. Solid lines indicate connections through which error signal from loss $l$, $l_1$, and $l_2$ are backpropagated through to train $m$. b) Stochastic Sobolev Training of order 2. If $f$ and $m$ are multivariate functions, the gradients are Jacobian matrices. To avoid computing these high dimensional objects, we can efficiently compute and fit their projections on a random vector $v_j$ sampled from the unit sphere.

There are several related works which have also exploited derivative information for function approximation. For instance Wu et al. [30] and antecedents propose a technique for Bayesian optimisation with Gaussian Processes (GP), where it was demonstrated that the use of information about gradients and Hessians can improve the predictive power of GPs. In previous work on neural networks, derivatives of predictors have usually been used either to penalise model complexity (e.g. by pushing Jacobian norm to 0 [19]), or to encode additional, hand crafted invariances to some transformations (for instance, as in Tangentprop [23]), or estimated derivatives for dynamical systems [6] and very recently to provide additional learning signal during attention distillation [31] (see Supplementary Materials, Section 5, for details). Similar techniques have also been used in critic based Reinforcement Learning (RL), where a critic’s derivatives are trained to match its target’s derivatives [29, 15, 5, 4, 26] using small, sigmoid based models. Finally, Hyvärinen proposed Score Matching Networks [11], which are based on the somewhat surprising observation that one can model unknown derivatives of the function without actual access to its values – all that is needed is a sampling based strategy and specific penalty. However, such an estimator has a high variance [28], thus it is not really useful when true derivatives are given.

To the best of our knowledge and despite its simplicity, the proposal to directly match network derivatives to the true derivatives of the target function has been minimally explored for deep networks, especially modern ReLU based models. In our method, we show that by using the additional knowledge of derivatives with Sobolev Training we are able to train better models – models which achieve lower approximation errors and generalise to test data better – and reduce the sample complexity of learning. The contributions of our paper are therefore threefold: (1) We introduce Sobolev Training – a new paradigm for training neural networks. (2) We look formally at the implications of matching derivatives, extending previous results of Hornik [10] and showing that modern architectures are well suited for such training regimes. (3) We provide empirical evidence demonstrating that Sobolev Training leads to improved performance and generalisation, particularly in low data regimes. Example domains are: regression on classical optimisation problems; policy distillation from RL agents trained on the Atari domain; and training deep, complex models using synthetic gradients – we report the first successful attempt to train a large-scale ImageNet model using synthetic gradients.

2 Sobolev Training

We begin by introducing the idea of training using Sobolev spaces. When learning a function $f$, we may have access to not only the output values $f(x_i)$ for training points $x_i$, but also the values of its $j$-th order derivatives with respect to the input, $D^{j}_{\mathbf{x}}f(x_i)$. In other words, instead of the typical training set consisting of pairs $\{(x_i, f(x_i))\}_{i=1}^{N}$ we have access to $(K+2)$-tuples $\{(x_i, f(x_i), D_{\mathbf{x}}^{1}f(x_i), ..., D_{\mathbf{x}}^{K}f(x_i))\}_{i=1}^{N}$. In this situation, the derivative information can easily be incorporated into training a neural network model of $f$ by making derivatives of the neural network match the ones given by $f$.

Considering a neural network model $m$ parameterised with $\theta$, one typically seeks to minimise the empirical error in relation to $f$ according to some loss function $\ell$

$$\sum_{i=1}^{N}\ell(m(x_i|\theta), f(x_i)).$$

When learning in Sobolev spaces, this is replaced with:

$$\sum_{i=1}^{N}\left[\ell(m(x_i|\theta), f(x_i)) + \sum_{j=1}^{K}\ell_j\left(D_{\mathbf{x}}^{j}m(x_i|\theta), D_{\mathbf{x}}^{j}f(x_i)\right)\right], \tag{1}$$

where $\ell_j$ are loss functions measuring error on $j$-th order derivatives. This causes the neural network to encode derivatives of the target function in its own derivatives. Such a model can still be trained using backpropagation and off-the-shelf optimisers.

A potential concern is that this optimisation might be expensive when either the output dimensionality of $f$ or the order $K$ is high; however, one can reduce this cost through stochastic approximations. Specifically, if $f$ is a multivariate function, instead of a vector gradient, one ends up with a full Jacobian matrix which can be large. To avoid adding computational complexity to the training process, one can use an efficient, stochastic version of Sobolev Training: instead of computing a full Jacobian/Hessian, one just computes its projection onto a random vector (a direct application of a known estimation trick [19]). In practice, this means that during training we have a random variable $v$ sampled uniformly from the unit sphere, and we match these random projections instead:

$$\sum_{i=1}^{N}\left[\ell(m(x_i|\theta), f(x_i)) + \sum_{j=1}^{K}\mathbb{E}_{v^j}\left[\ell_j\left(\left\langle D_{\mathbf{x}}^{j}m(x_i|\theta), v^j\right\rangle, \left\langle D_{\mathbf{x}}^{j}f(x_i), v^j\right\rangle\right)\right]\right]. \tag{2}$$

Figure 1 illustrates compute graphs for non-stochastic and stochastic Sobolev Training of order 2.
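As a rough numerical illustration of the first-order term of Eq. (2), the sketch below compares the random-projection loss with the full Jacobian loss. It assumes a squared L2 loss for $\ell_1$ and takes $v$ in the output space, so that the projection is $J^{\top}v$; the Jacobians `jac_f` and `jac_m` are random stand-ins rather than derivatives of any real model, and all function names are hypothetical. In practice $J^{\top}v$ would come from a single backward pass without forming $J$; the explicit matrices here only serve the comparison.

```python
import numpy as np

def sample_unit_sphere(rng, dim, n_samples):
    """Draw n_samples vectors uniformly from the unit sphere in R^dim."""
    v = rng.standard_normal((n_samples, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def projected_jacobian_loss(jac_model, jac_target, vs):
    """Monte Carlo estimate of E_v || J_m^T v - J_f^T v ||^2.

    jac_model, jac_target: (n_out, n_in) Jacobians at one input point.
    vs: (n_samples, n_out) random directions in output space.
    """
    diff = vs @ (jac_model - jac_target)   # each row is ((J_m - J_f)^T v)^T
    return np.mean(np.sum(diff ** 2, axis=1))

rng = np.random.default_rng(0)
n_out, n_in = 6, 4
jac_f = rng.standard_normal((n_out, n_in))  # assumed target Jacobian (random stand-in)
jac_m = rng.standard_normal((n_out, n_in))  # assumed model Jacobian (random stand-in)

vs = sample_unit_sphere(rng, n_out, 200_000)
stochastic = projected_jacobian_loss(jac_m, jac_f, vs)
full = np.sum((jac_m - jac_f) ** 2)         # squared Frobenius norm of the difference

# For v uniform on the unit sphere, E[v v^T] = I / n_out, so the projected
# loss matches the full Jacobian loss divided by n_out in expectation.
print(stochastic, full / n_out)
```

Under these assumptions the projected objective is, in expectation, a rescaled version of the full Jacobian-matching objective, which is one way to see why matching random projections can stand in for matching the whole Jacobian.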

3 Theory and motivation

While in the previous section we defined Sobolev Training, it is not obvious that modeling the derivatives of the target function $f$ is beneficial to function approximation, or that optimising such an objective is even feasible. In this section we motivate and explore these questions theoretically, showing that the Sobolev Training objective is a well posed one, and that incorporating derivative information has the potential to drastically reduce the sample complexity of learning.

Hornik showed [10] that neural networks with non-constant, bounded, continuous activation functions, with continuous derivatives up to order $K$ are universal approximators in the Sobolev spaces of order $K$, thus showing that sigmoid-networks are indeed capable of approximating elements of these spaces arbitrarily well. However, nowadays we often use activation functions such as ReLU which are neither bounded nor have continuous derivatives. The following theorem shows that for $K=1$ we can use the ReLU function (or a similar one, like leaky ReLU) to create neural networks that are universal approximators in Sobolev spaces. We will use a standard symbol $\mathcal{C}^1(S)$ (or simply $\mathcal{C}^1$) to denote a space of functions which are continuous, differentiable, and have a continuous derivative on a space $S$ [14]. All proofs are given in the Supplementary Materials (SM).

Theorem 1.

Let $f$ be a $\mathcal{C}^1$ function on a compact set. Then, for every positive $\varepsilon$ there exists a single hidden layer neural network with a ReLU (or a leaky ReLU) activation which approximates $f$ in Sobolev space $\mathcal{S}_1$ up to $\varepsilon$ error.

This suggests that the Sobolev Training objective is achievable, and that we can seek to encode the values and derivatives of the target function in the values and derivatives of a ReLU neural network model. Interestingly, we can show that if we seek to encode an arbitrary function in the derivatives of the model then this is impossible not only for neural networks but also for any arbitrary differentiable predictor on compact sets.

Theorem 2.

Let $f$ be a $\mathcal{C}^1$ function. Let $g$ be a continuous function satisfying $\|g-\tfrac{\partial f}{\partial x}\|_{\infty}>0$. Then, there exists an $\eta>0$ such that for any $\mathcal{C}^1$ function $h$ either $\|f-h\|_{\infty}\geq\eta$ or $\left\|g-\frac{\partial h}{\partial x}\right\|_{\infty}\geq\eta$.

However, when we move to the regime of finite training data, we can encode any arbitrary function in the derivatives (as well as higher order signals if the resulting Sobolev spaces are not degenerate), as shown in the following Proposition.

Proposition 1.

Given any two functions $f:S\rightarrow\mathbb{R}$ and $g:S\rightarrow\mathbb{R}^{d}$ on $S\subseteq\mathbb{R}^{d}$ and a finite set $\Sigma\subset S$, there exists a neural network $h$ with a ReLU (or a leaky ReLU) activation such that $\forall x\in\Sigma: f(x)=h(x)$ and $g(x)=\tfrac{\partial h}{\partial x}(x)$ (it has 0 training loss).

Having shown that it is possible to train neural networks to encode both the values and derivatives of a target function, we now formalise one possible way of showing that Sobolev Training has lower sample complexity than regular training.

Figure 2: Left: From top: Example of the piece-wise linear function; Two (out of a continuum of) hypotheses consistent with 3 training points, showing that one needs two points to identify each linear segment; The only hypothesis consistent with 3 training points enriched with derivative information. Right: Logarithm of test error (MSE) for various optimisation benchmarks with varied training set size (20, 100 and 10000 points) sampled uniformly from the problem’s domain.

Let $\mathcal{F}$ denote the family of functions parametrised by $\omega$. We define $K_{reg}=K_{reg}(\mathcal{F})$ to be a measure of the amount of data needed to learn some target function $f$. That is, $K_{reg}$ is the smallest number for which the following holds: for every $f_{\omega}\in\mathcal{F}$ and every set of $K_{reg}$ distinct points $(x_1,...,x_{K_{reg}})$, $\forall_{i=1,...,K_{reg}}\, f(x_i)=f_{\omega}(x_i)\Rightarrow f=f_{\omega}$. $K_{sob}$ is defined analogously, but the final implication is of the form $f(x_i)=f_{\omega}(x_i)\wedge\frac{\partial f}{\partial x}(x_i)=\frac{\partial f_{\omega}}{\partial x}(x_i)\Rightarrow f=f_{\omega}$. Straight from the definition there follows:

Proposition 2.

For any $\mathcal{F}$, there holds $K_{sob}(\mathcal{F})\leq K_{reg}(\mathcal{F})$.

For many families, the above inequality becomes strict. For example, to determine the coefficients of a polynomial of degree $n$ one needs to compute its values at no fewer than $n+1$ distinct points. If we know both the values and the derivatives at each point, it is a well-known fact that only $\lceil\frac{n+1}{2}\rceil$ points suffice to determine all the coefficients, since each point then supplies two constraints on the $n+1$ coefficients. We present two more examples in a slightly more formal way. Let $\mathcal{F}_{\rm{G}}$ denote a family of Gaussian PDFs (parametrised by $\mu$, $\sigma$). Let $\mathbb{R}^{d}\supset D=D_1\cup\ldots\cup D_n$ and let $\mathcal{F}_{\rm{PL}}$ be a family of functions from $D_1\times...\times D_n$ (Cartesian product of sets $D_i$) to $\mathbb{R}^{n}$ of the form $f(x)=[A_1x_1+b_1,\ldots,A_nx_n+b_n]$ (linear element-wise) (Figure 2 Left).

Proposition 3.

There holds $K_{sob}(\mathcal{F}_{\rm{G}})<K_{reg}(\mathcal{F}_{\rm{G}})$ and $K_{sob}(\mathcal{F}_{\rm{PL}})<K_{reg}(\mathcal{F}_{\rm{PL}})$.

This result relates to Deep ReLU networks as they build a hyperplanes-based model of the target function. If those hyperplanes were parametrised independently one could expect a reduction of sample complexity by a factor of $d+1$, where $d$ is the dimension of the function domain. In practice the parameters of hyperplanes in such networks are not independent, and the positions of the hinges change during training, so the Proposition cannot be directly applied. It nevertheless gives an intuition for why the sample complexity drops significantly for Deep ReLU networks too.
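The polynomial example that precedes Proposition 3 can be checked numerically. The sketch below is only an illustration under stated assumptions: the cubic and the sample points are arbitrary choices, and the helper names are hypothetical. It recovers the four coefficients of a degree-3 polynomial once from values at four points, and once from values plus first derivatives at two points.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def fit_from_values(xs, ys, degree):
    """Recover coefficients (increasing order) from values only."""
    A = np.vander(xs, degree + 1, increasing=True)   # rows [1, x, x^2, ...]
    return np.linalg.solve(A, ys)

def fit_from_values_and_derivatives(xs, ys, dys, degree):
    """Recover coefficients from values and first derivatives at the same points."""
    powers = np.arange(degree + 1)
    value_rows = xs[:, None] ** powers                               # x^k
    deriv_rows = powers * xs[:, None] ** np.maximum(powers - 1, 0)   # k * x^(k-1)
    A = np.vstack([value_rows, deriv_rows])
    return np.linalg.solve(A, np.concatenate([ys, dys]))

# Arbitrary cubic chosen for illustration: 1 - 2x + 0.5x^2 + 3x^3.
coeffs = np.array([1.0, -2.0, 0.5, 3.0])
degree = 3

# Values only: degree + 1 = 4 points are needed.
xs_reg = np.array([-1.0, 0.0, 1.0, 2.0])
rec_reg = fit_from_values(xs_reg, P.polyval(xs_reg, coeffs), degree)

# Values and derivatives: 2 points give 4 constraints, which is enough.
xs_sob = np.array([-1.0, 1.0])
rec_sob = fit_from_values_and_derivatives(
    xs_sob,
    P.polyval(xs_sob, coeffs),
    P.polyval(xs_sob, P.polyder(coeffs)),
    degree,
)
print(rec_reg, rec_sob)
```

Both linear systems are square and invertible for distinct points, so each recovers the same coefficients; the derivative-enriched system simply needs half as many sample locations in this case.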

4 Experimental Results

We consider three domains where information about derivatives is available during training. All experiments were performed using TensorFlow [2] and the Sonnet neural network library [1].

4.1 Artificial Data

First, we consider the task of regression on a set of well known low-dimensional functions used for benchmarking optimisation methods.

We train two hidden layer neural networks with 256 hidden units per layer with ReLU activations to regress towards function values, and verify generalisation capabilities by evaluating the mean squared error on a hold-out test set. Since the task is standard regression, we choose all the losses of Sobolev Training to be L2 errors, and use a first order Sobolev method (second order derivatives of ReLU networks with a linear output layer are constant, zero). The optimisation is therefore:

$$\min_{\theta}\tfrac{1}{N}\sum_{i=1}^{N}\|f(x_i)-m(x_i|\theta)\|_{2}^{2}+\|\nabla_{x}f(x_i)-\nabla_{x}m(x_i|\theta)\|_{2}^{2}.$$

[Figure 3 panels: Dataset; 20 training samples (Regular, Sobolev); 100 training samples (Regular, Sobolev)]
Figure 3: Styblinski-Tang function (on the left) and its models using regular neural network training (left part of each plot) and Sobolev Training (right part). We also plot the vector field of the gradients of each predictor underneath the function plot.
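To make the first-order L2 objective above concrete, here is a minimal NumPy sketch that evaluates it for a randomly initialised two-hidden-layer ReLU network on the Styblinski-Tang function. Several things are assumptions rather than details from the experiments: the standard textbook form of the function, the sampling domain $[-5,5]^2$, the He-style initialisation, and all function names. The sketch only evaluates the loss; actually minimising it requires differentiating the gradient term with respect to $\theta$, which an automatic differentiation library would handle.

```python
import numpy as np

def styblinski_tang(x):
    """Standard Styblinski-Tang function; x has shape (N, d)."""
    return 0.5 * np.sum(x ** 4 - 16.0 * x ** 2 + 5.0 * x, axis=1)

def styblinski_tang_grad(x):
    """Analytic gradient of the function above, shape (N, d)."""
    return 2.0 * x ** 3 - 16.0 * x + 2.5

def init_params(rng, d, hidden=256):
    # He-style initialisation (an assumption, not taken from the paper).
    W1 = rng.standard_normal((d, hidden)) * np.sqrt(2.0 / d)
    W2 = rng.standard_normal((hidden, hidden)) * np.sqrt(2.0 / hidden)
    w3 = rng.standard_normal(hidden) * np.sqrt(1.0 / hidden)
    return W1, np.zeros(hidden), W2, np.zeros(hidden), w3, 0.0

def model_and_input_grad(params, x):
    """Network output m(x) and its gradient with respect to the input x."""
    W1, b1, W2, b2, w3, b3 = params
    z1 = x @ W1 + b1
    z2 = np.maximum(z1, 0.0) @ W2 + b2
    y = np.maximum(z2, 0.0) @ w3 + b3
    g2 = (z2 > 0) * w3               # dy/dz2, ReLU mask times output weights
    g1 = (g2 @ W2.T) * (z1 > 0)      # dy/dz1
    return y, g1 @ W1.T              # dy/dx, shape (N, d)

def sobolev_l2_loss(params, x):
    """Mean over points of squared value error plus squared gradient error."""
    y, gx = model_and_input_grad(params, x)
    value_term = (styblinski_tang(x) - y) ** 2
    grad_term = np.sum((styblinski_tang_grad(x) - gx) ** 2, axis=1)
    return np.mean(value_term + grad_term)

rng = np.random.default_rng(0)
x_train = rng.uniform(-5.0, 5.0, size=(20, 2))   # 20 points, assumed domain
params = init_params(rng, d=2)
loss = sobolev_l2_loss(params, x_train)
print(loss)
```

The input gradient is computed by a manual backward pass through the ReLU masks, which also makes visible why a second-order term would be uninformative here: the network is piecewise linear in its input.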

Figure 2 right shows the results for the optimisation benchmarks. As expected, Sobolev trained networks perform extremely well – for six out of seven benchmark problems they significantly reduce the testing error with the obtained errors orders of magnitude smaller than the corresponding errors of the regularly trained networks. The stark difference in approximation error is highlighted in Figure 3, where we show the Styblinski-Tang function and its approximations with both regular and Sobolev Training. It is clear that even in very low data regimes, the Sobolev trained networks can capture the functional shape.

Looking at the results, we make two important observations. First, the effect of Sobolev Training is stronger in low-data regimes; however, it does not disappear even in the high data regime, when one has 10,000 training examples for training a two-dimensional function. Second, the only case where regular regression performed better is the regression towards Ackley’s function. This particular example was chosen to show that one possible weak point of our approach might be approximating functions with a very high frequency signal component in the relatively low data regime. Ackley’s function is composed of exponentials of high frequency cosine waves, thus creating an extremely bumpy surface. Consequently, a method that tries to match the derivatives can behave badly during testing if one does not have enough data to capture this complexity. However, once we have enough training data points, Sobolev trained networks are able to approximate this function better.

4.2 Distillation

Another possible application of Sobolev Training is to perform model distillation. This technique has many applications, such as network compression [21], ensemble merging [9], or more recently policy distillation in reinforcement learning [20].

We focus here on a task of distilling a policy. We aim to distill a target policy $\pi^{*}(s)$ – a trained neural network which outputs a probability distribution over actions – into a smaller neural network $\pi(s|\theta)$, such that the two policies $\pi^{*}$ and $\pi$ have the same behaviour. In practice this is often done by minimising an expected divergence measure between $\pi^{*}$ and $\pi$, for example, the Kullback–Leibler divergence $D_{KL}(\pi(s)\|\pi^{*}(s))$, over states gathered while following $\pi^{*}$. Since policies are multivariate functions, direct application of Sobolev Training would mean producing full Jacobian matrices with respect to $s$, which for large action spaces is computationally expensive. To avoid this issue we employ a stochastic approximation described in Section 2, thus resulting in the objective

$$\min_{\theta}D_{KL}(\pi(s|\theta)\|\pi^{*}(s))+\alpha\,\mathbb{E}_{v}\left[\|\nabla_{s}\langle\log\pi^{*}(s),v\rangle-\nabla_{s}\langle\log\pi(s|\theta),v\rangle\|\right],$$

where the expectation is taken with respect to $v$ coming from a uniform distribution over the unit sphere, and Monte Carlo sampling is used to approximate it.
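The objective above can be sketched for a single state as follows. This is a toy illustration, not the convolutional setup used in the experiments: both policies are assumed to be linear-softmax maps of a low-dimensional state vector, so that $\nabla_{s}\langle\log\pi(s),v\rangle$ has the closed form $W^{\top}(v-\pi\sum_a v_a)$. The shapes, the value of `alpha`, the number of sampled directions, and all function names are hypothetical.

```python
import numpy as np

def log_policy(W, b, s):
    """Log-probabilities of a linear-softmax policy (toy stand-in)."""
    z = W @ s + b
    z = z - np.max(z)                       # numerical stability
    return z - np.log(np.sum(np.exp(z)))

def projected_grad(W, b, s, v):
    """Closed form of grad_s <log pi(s), v> for the linear-softmax policy."""
    pi = np.exp(log_policy(W, b, s))
    return W.T @ (v - pi * np.sum(v))

def sample_unit_sphere(rng, dim, n_samples):
    v = rng.standard_normal((n_samples, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def sobolev_distillation_loss(student, teacher, s, vs, alpha=1.0):
    """KL(student || teacher) plus alpha times the Monte Carlo gradient gap."""
    log_ps = log_policy(*student, s)
    log_pt = log_policy(*teacher, s)
    kl = np.sum(np.exp(log_ps) * (log_ps - log_pt))
    gaps = [
        np.linalg.norm(projected_grad(*teacher, s, v) - projected_grad(*student, s, v))
        for v in vs
    ]
    return kl + alpha * np.mean(gaps)

rng = np.random.default_rng(0)
n_actions, state_dim = 6, 8                 # assumed toy sizes
teacher = (rng.standard_normal((n_actions, state_dim)), rng.standard_normal(n_actions))
student = (rng.standard_normal((n_actions, state_dim)), rng.standard_normal(n_actions))
s = rng.standard_normal(state_dim)
vs = sample_unit_sphere(rng, n_actions, 16)  # Monte Carlo directions in action space

loss = sobolev_distillation_loss(student, teacher, s, vs, alpha=1.0)
self_loss = sobolev_distillation_loss(teacher, teacher, s, vs, alpha=1.0)
print(loss, self_loss)
```

Each sampled $v$ costs one vector-Jacobian product per policy, which is what keeps the penalty affordable when the action space is large; for a deep policy that product would come from backpropagation rather than a closed form.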

As target policies $\pi^{*}$, we use agents playing Atari games [17] that have been trained with A3C [16] on three well known games: Pong, Breakout and Space Invaders. The agent’s policy is a neural network consisting of 3 layers of convolutions followed by two fully-connected layers, which we distill to a smaller network with 2 convolutional layers and a single smaller fully-connected layer (see SM for details). Distillation is treated here as a purely supervised learning problem, as our aim is not to re-evaluate known distillation techniques, but rather to show that if the aim is to minimise a given divergence measure, we can improve distillation using Sobolev Training.

[Figure 4 panels: Test action prediction error (left); Test $D_{KL}$ (right). Legend: Regular distillation, Sobolev distillation.]
Figure 4: Test results of distillation of RL agents on three Atari games. Reported test action prediction error (left) is the error of the most probable action predicted between the distilled policy and target policy, and test $D_{KL}$ (right) is the Kullback–Leibler divergence between policies. Numbers in the column title represent the percentage of the 100K recorded states used for training (the remaining are used for testing). In all scenarios the Sobolev distilled networks are significantly more similar to the target policy.

Figure 4 shows test error during training with and without Sobolev Training. (Testing is performed on a held out set of episodes, thus there are no temporal nor causal relations between training and testing.) The introduction of Sobolev Training leads to similar effects as in the previous section – the network generalises much more effectively, and this is especially true in low data regimes. Note that the performance gap on Pong is small because the optimal policy is quite degenerate for this game: for the majority of the time the policy in Pong is uniform, since actions taken when the ball is far away from the player do not matter at all, and only in crucial situations does it peak so that the ball hits the paddle. In all remaining games one can see a significant performance increase from using our proposed method, as well as minor to no overfitting.

Despite looking like a regularisation effect, we stress that Sobolev Training is not trying to find the simplest models for data or suppress the expressivity of the model. This training method aims at matching the original function’s smoothness/complexity and so reduces overfitting by effectively extending the information content of the training set, rather than by imposing a data-independent prior as with regularisation.

4.3 Synthetic Gradients

Table 1: Various techniques for producing synthetic gradients. Green shaded nodes denote nodes that get supervision from the corresponding object from the main network (gradient or loss value). We report accuracy on the test set $\pm$ standard deviation. Backpropagation results are given in parentheses.

| Metric (backpropagation) | Noprop | Direct SG [12] | VFBN [25] | Critic | Sobolev |
|---|---|---|---|---|---|
| CIFAR-10 with 3 synthetic gradient modules | | | | | |
| Top 1 (94.3%) | 54.5% ± 1.15 | 79.2% ± 0.01 | 88.5% ± 2.70 | 93.2% ± 0.02 | 93.5% ± 0.01 |
| ImageNet with 1 synthetic gradient module | | | | | |
| Top 1 (75.0%) | 54.0% ± 0.29 | - | 57.9% ± 2.03 | 71.7% ± 0.23 | 72.0% ± 0.05 |
| Top 5 (92.3%) | 77.3% ± 0.06 | - | 81.5% ± 1.20 | 90.5% ± 0.15 | 90.8% ± 0.01 |
| ImageNet with 3 synthetic gradient modules | | | | | |
| Top 1 (75.0%) | 18.7% ± 0.18 | - | 28.3% ± 5.24 | 65.7% ± 0.56 | 66.5% ± 0.22 |
| Top 5 (92.3%) | 38.0% ± 0.34 | - | 52.9% ± 6.62 | 86.9% ± 0.33 | 87.4% ± 0.11 |

The previous experiments have shown how information about the derivatives can boost approximating function values. However, the core idea of Sobolev Training is broader than that, and can be employed in both directions. Namely, if one ultimately cares about approximating derivatives, then additionally approximating values can help this process too. One recent technique which requires a model of gradients is Synthetic Gradients (SG) [12] – a method for training complex neural networks in a decoupled, asynchronous fashion. In this section we show how we can use Sobolev Training for SG.

The principle behind SG is that instead of doing full backpropagation using the chain-rule, one splits a network into two (or more) parts, and approximates partial derivatives of the loss $L$ with respect to some hidden layer activations $h$ with a trainable function $SG(h,y|\theta)$. In other words, given that network parameters up to $h$ are denoted by $\Theta$

$$\frac{\partial L}{\partial\Theta}=\frac{\partial L}{\partial h}\frac{\partial h}{\partial\Theta}\approx SG(h,y|\theta)\frac{\partial h}{\partial\Theta}.$$

In the original SG paper, this module is trained to minimise $L_{SG}(\theta)=\left\|SG(h,y|\theta)-\tfrac{\partial L(p_h,y)}{\partial h}\right\|_{2}^{2}$, where $p_h$ is the final prediction of the main network for hidden activations $h$. For the case of learning a classifier, in order to apply Sobolev Training in this context we construct a loss predictor, composed of a class predictor $p(\cdot|\theta)$ followed by the log loss, which gets supervision from the true loss, and the gradient of the prediction gets supervision from the true gradient:

$$m(h,y|\theta):=L(p(h|\theta),y),\qquad SG(h,y|\theta):=\partial m(h,y|\theta)/\partial h,$$
$$L_{SG}^{sob}(\theta)=\ell(m(h,y|\theta),L(p_h,y))+\ell_{1}\left(\tfrac{\partial m(h,y|\theta)}{\partial h},\tfrac{\partial L(p_h,y)}{\partial h}\right).$$

In the Sobolev Training framework, the target function is the loss of the main network $L(p_h,y)$ for which we train a model $m(h,y|\theta)$ to approximate, and in addition ensure that the model’s derivatives $\partial m(h,y|\theta)/\partial h$ are matched to the true derivatives $\partial L(p_h,y)/\partial h$. The model’s derivatives $\partial m(h,y|\theta)/\partial h$ are used as the synthetic gradient to decouple the main network.
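A minimal sketch of this loss-predictor construction is given below. It rests on several assumptions that are not taken from the experiments: the class predictor $p(\cdot|\theta)$ is a single linear-softmax layer, both $\ell$ and $\ell_1$ are squared L2 errors, and the remainder of the main network is replaced by another linear-softmax layer (`rest`) purely to produce a "true" loss and gradient to regress towards. All names and sizes are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def loss_predictor(theta, h, y):
    """Return m(h, y | theta) and SG(h, y | theta) = dm/dh.

    theta = (W, b) parameterises a linear-softmax class predictor,
    and m is its log loss on label y.
    """
    W, b = theta
    p = softmax(W @ h + b)
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    m = -np.log(p[y])
    sg = W.T @ (p - onehot)      # gradient of the log loss w.r.t. h
    return m, sg

def sobolev_sg_loss(theta, h, y, true_loss, true_grad):
    """Squared error on the loss value plus squared error on its gradient."""
    m, sg = loss_predictor(theta, h, y)
    return (m - true_loss) ** 2 + np.sum((sg - true_grad) ** 2)

rng = np.random.default_rng(0)
n_classes, hidden_dim = 10, 32                      # assumed toy sizes
h = rng.standard_normal(hidden_dim)                 # hidden activations
y = 3                                               # class label

# Hypothetical stand-in for the rest of the main network above h.
rest = (rng.standard_normal((n_classes, hidden_dim)) * 0.1, np.zeros(n_classes))
true_loss, true_grad = loss_predictor(rest, h, y)

theta = (rng.standard_normal((n_classes, hidden_dim)) * 0.1, np.zeros(n_classes))
loss = sobolev_sg_loss(theta, h, y, true_loss, true_grad)
print(loss)
```

Dropping the second term of `sobolev_sg_loss` gives a pure loss critic, while dropping the first leaves only gradient supervision; the combined form is what supervises both the predicted loss and the synthetic gradient derived from it.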

This setting closely resembles what is known in reinforcement learning as critic methods [13]. In particular, if we do not provide supervision on the gradient part, we end up with a loss critic. Similarly, if we do not provide supervision at the loss level, but only on the gradient component, we end up with a method that resembles VFBN [25]. In light of these connections, our approach in this application setting can be seen as a generalisation and unification of several existing ones (see Table 1 for illustrations of these approaches).

We perform experiments on decoupling deep convolutional neural network image classifiers using synthetic gradients produced by loss critics that are trained with Sobolev Training, and compare to regular loss critic training, and regular synthetic gradient training. We report results on CIFAR-10 for three network splits (and therefore three synthetic gradient modules) and on ImageNet with one and three network splits. (N.b. the experiments presented use learning rates, annealing schedule, etc. optimised to maximise the backpropagation baseline, rather than the synthetic gradient decoupled result; details are in the SM.)

The results are shown in Table 1. With a naive SG model, we obtain 79.2% test accuracy on CIFAR-10. Using an SG architecture which resembles a small version of the rest of the model makes learning much easier and led to 88.5% accuracy, while Sobolev Training achieves 93.5% final performance. The regular critic also trains well, achieving 93.2%, as the critic forces the lower part of the network to provide a representation which it can use to reduce the classification (and not just prediction) error. Consequently it provides a learning signal which is well aligned with the main optimisation. However, this can lead to building representations which are suboptimal for the rest of the network. Adding additional gradient supervision by constructing our Sobolev SG module avoids this issue by making sure that synthetic gradients are truly aligned and gives an additional boost to the final accuracy.

For ImageNet [3] experiments based on ResNet50 [8], we obtain qualitatively similar results. Due to the complexity of the model and an almost 40% gap between no backpropagation and full backpropagation results, the difference between methods with and without loss supervision grows significantly. This suggests that, at least for ResNet-like architectures, loss supervision is a crucial component of an SG module. After splitting ResNet50 into four parts the Sobolev SG achieves 87.4% top-5 accuracy, while the regular critic SG achieves 86.9%, which is consistent with our claim about suboptimal representations being enforced by gradients from a regular critic. Sobolev Training results were also much more reliable in all experiments (significantly smaller standard deviation of the results).

5 Discussion and Conclusion

In this paper we have introduced Sobolev Training for neural networks – a simple and effective way of incorporating knowledge about derivatives of a target function into the training of a neural network function approximator. We provided theoretical justification that encoding both a target function’s value as well as its derivatives within a ReLU neural network is possible, and that this results in more data efficient learning. Additionally, we show that our proposal can be efficiently trained using stochastic approximations if computationally expensive Jacobians or Hessians are encountered.

In addition to toy experiments which validate our theoretical claims, we performed experiments to highlight two very promising areas of application for such models: one being distillation/compression of models; the other being the application to various meta-optimisation techniques that build models of other models' dynamics (such as synthetic gradients, learning-to-learn, etc.). In both cases we obtain significant improvement over classical techniques, and we believe there are many other application domains in which our proposal should give a solid performance boost.

In this work we focused on encoding true derivatives in the corresponding ones of the neural network. Another possibility for future work is to encode information which one believes to be highly correlated with derivatives. For example curvature [18] is believed to be connected to uncertainty. Therefore, given a problem with known uncertainty at training points, one could use Sobolev Training to match the second order signal to the provided uncertainty signal. Finite differences can also be used to approximate gradients for black box target functions, which could help when, for example, learning a generative temporal model. Another unexplored path would be to apply Sobolev Training to internal derivatives rather than just derivatives with respect to the inputs.
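The finite-difference idea mentioned above can be illustrated with a short sketch. It assumes a black-box scalar function that can be queried at arbitrary inputs; the helper name `finite_difference_grad`, the step size `h` and the toy function `black_box` are hypothetical choices made for illustration and do not come from the paper.

```python
import numpy as np

def finite_difference_grad(f, x, h=1e-5):
    """Central-difference estimate of the gradient of a black-box scalar f at x."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = h
        # Two extra queries of the black box per input dimension.
        grad[k] = (f(x + step) - f(x - step)) / (2.0 * h)
    return grad

def black_box(x):
    # Hypothetical target; in practice only input-output queries are available.
    return np.sin(x[0]) + x[0] * x[1] ** 2

x0 = np.array([0.3, -1.2])
approx = finite_difference_grad(black_box, x0)
# Analytic gradient, used here only to check the estimate.
exact = np.array([np.cos(x0[0]) + x0[1] ** 2, 2.0 * x0[0] * x0[1]])
```

Estimates of this kind could then serve as the derivative targets in a Sobolev loss, at the cost of two extra function queries per input dimension.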

References

  • [1] Sonnet. https://github.com/deepmind/sonnet. 2017.
  • [2] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
  • [3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
  • [4] Michael Fairbank and Eduardo Alonso. Value-gradient learning. In Neural Networks (IJCNN), The 2012 International Joint Conference on, pages 1–8. IEEE, 2012.
  • [5] Michael Fairbank, Eduardo Alonso, and Danil Prokhorov. Simple and fast calculation of the second-order gradients for globalized dual heuristic dynamic programming in neural networks. IEEE transactions on neural networks and learning systems, 23(10):1671–1676, 2012.
  • [6] A Ronald Gallant and Halbert White. On learning the derivatives of an unknown mapping with multilayer feedforward networks. Neural Networks, 5(1):129–138, 1992.
  • [7] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.
  • [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [9] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [10] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257, 1991.
  • [11] Aapo Hyvärinen. Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research, pages 695–709, 2005.
  • [12] Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Decoupled neural interfaces using synthetic gradients. arXiv preprint arXiv:1608.05343, 2016.
  • [13] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In NIPS, volume 13, pages 1008–1014, 1999.
  • [14] Steven G Krantz. Handbook of complex variables. Springer Science & Business Media, 2012.
  • [15] W Thomas Miller, Paul J Werbos, and Richard S Sutton. Neural networks for control. MIT press, 1995.
  • [16] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
  • [17] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • [18] Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.
  • [19] Salah Rifai, Grégoire Mesnil, Pascal Vincent, Xavier Muller, Yoshua Bengio, Yann Dauphin, and Xavier Glorot. Higher order contractive auto-encoder. Machine Learning and Knowledge Discovery in Databases, pages 645–660, 2011.
  • [20] Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. arXiv preprint arXiv:1511.06295, 2015.
  • [21] Bharat Bhusan Sau and Vineeth N Balasubramanian. Deep model compression: Distilling knowledge from noisy teachers. arXiv preprint arXiv:1610.09650, 2016.
  • [22] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • [23] Patrice Simard, Bernard Victorri, Yann LeCun, and John S Denker. Tangent prop-a formalism for specifying selected invariances in an adaptive network. In NIPS, volume 91, pages 895–903, 1991.
  • [24] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [25] Takeru Miyato, Daisuke Okanohara, Shin-ichi Maeda, and Masanori Koyama. Synthetic gradient methods with virtual forward-backward networks. ICLR workshop proceedings, 2017.
  • [26] Yuval Tassa and Tom Erez. Least squares solutions of the HJB equation with neural network value-function approximators. IEEE transactions on neural networks, 18(4):1031–1041, 2007.
  • [27] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. CoRR abs/1609.03499, 2016.
  • [28] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
  • [29] Paul J Werbos. Approximate dynamic programming for real-time control and neural modeling. Handbook of intelligent control, 1992.
  • [30] Anqi Wu, Mikio C Aoi, and Jonathan W Pillow. Exploiting gradients and Hessians in Bayesian optimization and Bayesian quadrature. arXiv preprint arXiv:1704.00060, 2017.
  • [31] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928, 2016.

Supplementary Materials for “Sobolev Training for Neural Networks”

1 Proofs

Theorem 1.

Let $f$ be a $\mathcal{C}^{1}$ function on a compact set. Then, for every positive $\varepsilon$ there exists a single hidden layer neural network with a ReLU (or a leaky ReLU) activation which approximates $f$ in Sobolev space $\mathcal{S}_{1}$ up to $\varepsilon$ error.

We start with a definition. We will say that a function $p$ on a set $D$ is piecewise-linear if there exist $D_{1},\ldots,D_{n}$ such that $D=D_{1}\cup\ldots\cup D_{n}$ and $p|_{D_{i}}$ is linear for every $i=1,\ldots,n$ (note that we assume finiteness in the definition).

Lemma 1.

Let $D$ be a compact subset of $\mathbb{R}$ and let $\varphi\in\mathcal{C}^{1}(D)$. Then, for every $\varepsilon>0$ there exists a piecewise-linear, continuous function $p:D\rightarrow\mathbb{R}$ such that $|\varphi(x)-p(x)|<\varepsilon$ for every $x\in D$ and $|\varphi^{\prime}(x)-p^{\prime}(x)|<\varepsilon$ for every $x\in D\setminus P$, where $P$ is the set of points of non-differentiability of $p$.

Proof.

By assumption, the function $\varphi^{\prime}$ is continuous on $D$. Every continuous function on a compact set has to be uniformly continuous. Therefore, there exists $\delta_{1}$ such that for every $x_{1}$, $x_{2}$ with $|x_{1}-x_{2}|<\delta_{1}$ there holds $|\varphi^{\prime}(x_{1})-\varphi^{\prime}(x_{2})|<\varepsilon$. Moreover, $\varphi^{\prime}$ has to be bounded. Let $M$ denote $\sup_{x}|\varphi^{\prime}(x)|$. By the Mean Value Theorem, if $|x_{1}-x_{2}|<\frac{\varepsilon}{2M}$ then $|\varphi(x_{1})-\varphi(x_{2})|<\frac{\varepsilon}{2}$. Let $\delta=\min\left\{\delta_{1},\frac{\varepsilon}{2M}\right\}$. Let $\xi_{i}$, $i=0,\ldots,N$ be a sequence satisfying: $\xi_{i}<\xi_{j}$ for $i<j$, $|\xi_{i}-\xi_{i-1}|<\delta$ for $i=1,\ldots,N$ and $\xi_{0}<x<\xi_{N}$ for all $x\in D$. Such a sequence obviously exists, because $D$ is a compact (and thus bounded) subset of $\mathbb{R}$. We define

$$p(x)=\varphi(\xi_{i-1})+\frac{\varphi(\xi_{i})-\varphi(\xi_{i-1})}{\xi_{i}-\xi_{i-1}}(x-\xi_{i-1})\quad\text{for}\quad x\in[\xi_{i-1},\xi_{i}]\cap D.$$

It can be easily verified that it has all the desired properties. Indeed, let $x\in D$. Let $i$ be such that $\xi_{i-1}\leq x\leq\xi_{i}$. Then $|\varphi(x)-p(x)|=|\varphi(x)-\varphi(\xi_{i})+p(\xi_{i})-p(x)|\leq|\varphi(x)-\varphi(\xi_{i})|+|p(\xi_{i})-p(x)|\leq\varepsilon$, as $\varphi(\xi_{i})=p(\xi_{i})$ and $|\xi_{i}-x|\leq|\xi_{i}-\xi_{i-1}|<\delta$ by definitions. Moreover, applying the Mean Value Theorem we get that there exists $\zeta\in[\xi_{i-1},\xi_{i}]$ such that $\varphi^{\prime}(\zeta)=\frac{\varphi(\xi_{i})-\varphi(\xi_{i-1})}{\xi_{i}-\xi_{i-1}}=p^{\prime}(\zeta)$. Thus, $|\varphi^{\prime}(x)-p^{\prime}(x)|=|\varphi^{\prime}(x)-\varphi^{\prime}(\zeta)+p^{\prime}(\zeta)-p^{\prime}(x)|\leq|\varphi^{\prime}(x)-\varphi^{\prime}(\zeta)|+|p^{\prime}(\zeta)-p^{\prime}(x)|\leq\varepsilon$, as $p^{\prime}(\zeta)=p^{\prime}(x)$ and $|\zeta-x|<\delta$.
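As a numerical illustration of Lemma 1 (a sketch, not part of the proof), the snippet below interpolates $\varphi=\sin$ on $[0,2\pi]$ with equally spaced knots and measures the largest error in both the value and the derivative. The choice of $\varphi$, the interval, the knot counts and the helper name `piecewise_linear_fit` are all assumptions made for this example.

```python
import numpy as np

def piecewise_linear_fit(phi, lo, hi, n_knots):
    """Return the knots, the linear interpolant p of phi, and its slope dp."""
    knots = np.linspace(lo, hi, n_knots)
    values = phi(knots)
    slopes = np.diff(values) / np.diff(knots)

    def p(x):
        return np.interp(x, knots, values)

    def dp(x):
        # Slope of the segment containing x (p is not differentiable at knots).
        idx = np.searchsorted(knots, x, side="right") - 1
        return slopes[np.clip(idx, 0, len(slopes) - 1)]

    return knots, p, dp

phi, dphi = np.sin, np.cos
lo, hi = 0.0, 2.0 * np.pi
xs = np.linspace(lo, hi, 10001)[1:-1]  # evaluation points inside the interval

errors = {}
for n_knots in (11, 101, 1001):
    _, p, dp = piecewise_linear_fit(phi, lo, hi, n_knots)
    value_err = np.max(np.abs(phi(xs) - p(xs)))
    slope_err = np.max(np.abs(dphi(xs) - dp(xs)))
    errors[n_knots] = (value_err, slope_err)
```

Both errors shrink as the knot spacing decreases, which is the behaviour the lemma guarantees once the spacing is below $\delta$.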

Lemma 2.

Let $\varphi\in\mathcal{C}^{1}(\mathbb{R})$ have finite limits $\lim_{x\rightarrow-\infty}\varphi(x)=\varphi_{-}$ and $\lim_{x\rightarrow\infty}\varphi(x)=\varphi_{+}$, and let $\lim_{x\rightarrow-\infty}\varphi^{\prime}(x)=\lim_{x\rightarrow\infty}\varphi^{\prime}(x)=0$. Then, for every $\varepsilon>0$ there exists a piecewise-linear, continuous function $p:\mathbb{R}\rightarrow\mathbb{R}$ such that $|\varphi(x)-p(x)|<\varepsilon$ for every $x\in\mathbb{R}$ and $|\varphi^{\prime}(x)-p^{\prime}(x)|<\varepsilon$ for every $x\in\mathbb{R}\setminus P$, where $P$ is the set of points of non-differentiability of $p$.

Proof.

By definition of a limit there exist numbers $K_{-}<K_{+}$ such that $x<K_{-}\Rightarrow|\varphi(x)-\varphi_{-}|\leq\frac{\varepsilon}{2}$ and $x>K_{+}\Rightarrow|\varphi(x)-\varphi_{+}|\leq\frac{\varepsilon}{2}$. We apply Lemma 1 to the function $\varphi$ and the set $D=[K_{-},K_{+}]$, which gives a function $\tilde{p}$ on $[K_{-},K_{+}]$. We define $p$ as

$$p(x)=\begin{cases}\varphi_{-}&\text{for }x\in[-\infty,K_{-}]\\ \tilde{p}(x)&\text{for }x\in[K_{-},K_{+}]\\ \varphi_{+}&\text{for }x\in[K_{+},\infty]\end{cases}.$$

It can be easily verified, that it has all the desired properties. ∎

Corollary 1.

For every $\varepsilon>0$ there exists a combination of ReLU functions which approximates a sigmoid function with accuracy $\varepsilon$ in the Sobolev space.

Proof.

It follows immediately from Lemma 2 and the fact that any piecewise-linear, continuous function on $\mathbb{R}$ can be expressed as a finite sum of ReLU activations. ∎
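The fact used in this proof can be made concrete with a small sketch. A continuous piecewise-linear function that is constant outside its knots equals a bias plus a sum of shifted ReLUs, where each coefficient is the change of slope at the corresponding knot. The cut-off `K`, the number of knots and all helper names below are assumptions chosen for illustration; the bias term plays the role of the output-layer bias of the network.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_expansion(knots, values):
    """Return (bias, coeffs) with p(x) = bias + sum_j coeffs[j] * relu(x - knots[j]).

    p interpolates (knots, values) linearly and is constant outside the knots.
    """
    slopes = np.diff(values) / np.diff(knots)
    # The slope is 0 to the left of the first knot and to the right of the last.
    padded = np.concatenate(([0.0], slopes, [0.0]))
    return values[0], np.diff(padded)  # coefficient = change of slope at a knot

def evaluate(bias, coeffs, knots, x):
    return bias + relu(x[:, None] - knots[None, :]) @ coeffs

def evaluate_slope(coeffs, knots, x):
    # Derivative of the ReLU sum away from the knots.
    return (x[:, None] > knots[None, :]).astype(float) @ coeffs

K = 12.0  # assumed cut-off beyond which the sigmoid is treated as constant
knots = np.linspace(-K, K, 401)
bias, coeffs = relu_expansion(knots, sigmoid(knots))

xs = np.linspace(-20.0, 20.0, 2000)
value_err = np.max(np.abs(sigmoid(xs) - evaluate(bias, coeffs, knots, xs)))
slope_err = np.max(np.abs(dsigmoid(xs) - evaluate_slope(coeffs, knots, xs)))
```

Increasing `K` and the number of knots drives both `value_err` and `slope_err` towards zero, in line with the corollary.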

Remark 1.

The authors decided, for the sake of clarity and better readability of the paper, not to treat the issue of non-differentiabilities of the piecewise-linear function at the junction points. It can be approached in various ways, either by noticing that they form a finite, and thus zero-Lebesgue-measure, set and invoking the formal definition of Sobolev spaces, or by extending the definition of a derivative, but it leads only to non-interesting technical complications.

Proof of Theorem 1.

By Hornik’s result (Hornik [10]) there exists a combination of $N$ sigmoids approximating the function $f$ in the Sobolev space with $\frac{\varepsilon}{2}$ accuracy. Each of those sigmoids can, in turn, be approximated up to $\frac{\varepsilon}{2N}$ accuracy by a finite combination of ReLU (or leaky ReLU) functions (Corollary 1), and the theorem follows. ∎

Theorem 2.

Let $f$ be a $\mathcal{C}^{1}(S)$ function. Let $g$ be a continuous function satisfying $\|g-\tfrac{\partial f}{\partial x}\|>0$. Then, there exists an $\varepsilon=\varepsilon(f,g)$ such that for any $\mathcal{C}^{1}$ function $h$ there holds either $\|f-h\|\geq\varepsilon$ or $\left\|g-\frac{\partial h}{\partial x}\right\|\geq\varepsilon$.

Proof.

Assume that the converse holds. This would imply that there exists a sequence of functions $h_{n}$ such that $\lim_{n\rightarrow\infty}\frac{\partial h_{n}}{\partial x}=g$ and $\lim_{n\rightarrow\infty}h_{n}=f$. A theorem about term-by-term differentiation then implies that the limit $\lim_{n\rightarrow\infty}h_{n}$ is differentiable, and that the equality $\frac{\partial}{\partial x}\left(\lim_{n\rightarrow\infty}h_{n}\right)=\tfrac{\partial f}{\partial x}$ holds. However, $\frac{\partial}{\partial x}\left(\lim_{n\rightarrow\infty}h_{n}\right)=\lim_{n\rightarrow\infty}\frac{\partial h_{n}}{\partial x}=g$, contradicting $\|g-\tfrac{\partial f}{\partial x}\|>0$. ∎

Proposition 1.

Given any two functions $f:S\rightarrow\mathbb{R}$ and $g:S\rightarrow\mathbb{R}^{d}$ on $S\subseteq\mathbb{R}^{d}$ and a finite set $\Sigma\subset S$, there exists a neural network $h$ with a ReLU (or a leaky ReLU) activation such that $\forall x\in\Sigma:f(x)=h(x)$ and $g(x)=\tfrac{\partial h}{\partial x}(x)$ (it has 0 training loss).

Proof.

We first prove the theorem in a special, 1-dimensional case (when $S$ is a subset of $\mathbb{R}$). From now on it will be assumed that $S$ is a subset of $\mathbb{R}$ and $\Sigma=\{\sigma_{1}<\ldots<\sigma_{n}\}$ is a finite subset of $S$. Let $\varepsilon$ be smaller than $\frac{1}{5}\min(\sigma_{i}-\sigma_{i-1})$, $i=2,\ldots,n$. We define a function $p_{i}$ as follows

$$p_{i}(x)=\begin{cases}\frac{f(\sigma_{i})-g(\sigma_{i})\varepsilon}{\varepsilon}(x-\sigma_{i}+2\varepsilon)&\text{for }x\in[\sigma_{i}-2\varepsilon,\sigma_{i}-\varepsilon]\\ f(\sigma_{i})+g(\sigma_{i})(x-\sigma_{i})&\text{for }x\in[\sigma_{i}-\varepsilon,\sigma_{i}+\varepsilon]\\ -\frac{f(\sigma_{i})+g(\sigma_{i})\varepsilon}{\varepsilon}(x-\sigma_{i}-2\varepsilon)&\text{for }x\in[\sigma_{i}+\varepsilon,\sigma_{i}+2\varepsilon]\\ 0&\text{otherwise}\end{cases}.$$

Note that the functions $p_{i}$ and $p_{j}$ have disjoint supports for $i\neq j$. We define $h(x)=\sum_{i=1}^{n}p_{i}(x)$. By construction, it has all the desired properties.
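The one-dimensional construction can be checked numerically with a short sketch. The points, target values and target slopes below are arbitrary illustrative numbers, and the helper names `bump` and `build_h` are hypothetical. The code only evaluates the piecewise-linear function $h$; writing it as a ReLU network relies on the fact used in the proof of Corollary 1.

```python
import numpy as np

def bump(x, sigma, f_val, g_val, eps):
    """The function p_i from the construction, evaluated at a 1-D array x."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    left = (x >= sigma - 2 * eps) & (x < sigma - eps)
    mid = (x >= sigma - eps) & (x <= sigma + eps)
    right = (x > sigma + eps) & (x <= sigma + 2 * eps)
    out[left] = (f_val - g_val * eps) / eps * (x[left] - sigma + 2 * eps)
    out[mid] = f_val + g_val * (x[mid] - sigma)
    out[right] = -(f_val + g_val * eps) / eps * (x[right] - sigma - 2 * eps)
    return out

def build_h(sigmas, f_vals, g_vals):
    """Sum of bumps with eps strictly below 1/5 of the smallest gap."""
    sigmas = np.asarray(sigmas, dtype=float)
    eps = 0.9 * np.min(np.diff(sigmas)) / 5.0

    def h(x):
        return sum(bump(x, s, fv, gv, eps)
                   for s, fv, gv in zip(sigmas, f_vals, g_vals))

    return h, eps

sigmas = np.array([-1.0, 0.0, 0.5, 2.0])   # sorted training points
f_vals = np.array([3.0, -1.0, 0.25, 7.0])  # arbitrary target values
g_vals = np.array([-2.0, 5.0, 0.0, 1.5])   # arbitrary target derivatives
h, eps = build_h(sigmas, f_vals, g_vals)

step = eps / 10.0  # stays inside the linear middle piece of each bump
h_at = h(sigmas)
dh_at = (h(sigmas + step) - h(sigmas - step)) / (2.0 * step)
```

Because $h$ is exactly linear on $[\sigma_{i}-\varepsilon,\sigma_{i}+\varepsilon]$, the central difference recovers the slope $g(\sigma_{i})$ up to rounding error.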

Now let us move to the general case, when $S$ is a subset of $\mathbb{R}^{d}$. We will denote by $\pi_{k}$ a projection of a $d$-dimensional point $\sigma$ onto the $k$-th coordinate. The obstacle to repeating the $1$-dimensional proof in a straightforward manner (coordinate-by-coordinate) is that two or more of the points $\sigma_{i}$ can have one or more coordinates equal. We will use a linear change of coordinates to get past this technical obstacle. Let $A\in GL(d,\mathbb{R})$ be a matrix such that $\pi_{k}(A\sigma_{i})\neq\pi_{k}(A\sigma_{j})$ holds for any $i\neq j$ and any $k=1,\ldots,d$. Such an $A$ exists, as every condition $\pi_{k}(A\sigma_{i})=\pi_{k}(A\sigma_{j})$ defines a codimension-one submanifold in the space $GL(d,\mathbb{R})$, thus the complement of the union of all such submanifolds is a full-dimension (and thus nonempty) subset of $GL(d,\mathbb{R})$. Using the one-dimensional construction we define functions $p^{k}(x)$, $k=1,\ldots,d$, such that $p^{k}(\pi_{k}(A\sigma_{i}))=\frac{1}{d}f(\sigma_{i})$ and $(p^{k})^{\prime}(\pi_{k}(A\sigma_{i}))=0$. Similarly, we construct $q^{k}(x)$ in such a manner that $q^{k}(\pi_{k}(A\sigma_{i}))=0$ and $(q^{k})^{\prime}(\pi_{k}(A\sigma_{i}))=\pi_{k}(A^{-1}g(\sigma_{i}))$. Note that those definitions are valid because $\pi_{k}(A\sigma_{i})\neq\pi_{k}(A\sigma_{j})$ for $i\neq j$, so the right sides are well-defined unique numbers.

It remains to put all the elements together. This is done as follows. First we extend $p^{k}$, $q^{k}$ to the whole space $\mathbb{R}^{d}$ “trivially”, i.e. for any $\mathbf{x}\in\mathbb{R}^{d}$, $\mathbf{x}=(x^{1},\ldots,x^{d})$ we define $P^{k}(\mathbf{x}):=p^{k}(x^{k})$. Similarly, $Q^{k}(\mathbf{x}):=q^{k}(x^{k})$. Finally, $h(\mathbf{x}):=\sum_{k=1}^{d}P^{k}(A\mathbf{x})+\sum_{k=1}^{d}Q^{k}(A\mathbf{x})$. This function has the desired properties. Indeed, for every $\sigma_{i}$ we have

$$h(\sigma_{i})=\sum_{k=1}^{d}P^{k}(A\sigma_{i})+\sum_{k=1}^{d}Q^{k}(A\sigma_{i})=\sum_{k=1}^{d}p^{k}(\pi_{k}(A\sigma_{i}))+\sum_{k=1}^{d}0=f(\sigma_{i})$$

and

$$\begin{aligned}\frac{\partial h}{\partial x}(\sigma_{i})&=\sum_{k=1}^{d}(P^{k})^{\prime}(A\sigma_{i})+\sum_{k=1}^{d}(Q^{k})^{\prime}(A\sigma_{i})=\sum_{k=1}^{d}0+\sum_{k=1}^{d}\frac{\partial Q^{k}}{\partial x}(\pi_{k}(A\sigma_{i}))\\&=A\sum_{k=1}^{d}(0,\ldots,\underbrace{(q^{k})^{\prime}(\pi_{k}(A\sigma_{i}))}_{k},\ldots,0)^{T}=A\cdot A^{-1}g(\sigma_{i})=g(\sigma_{i}).\end{aligned}$$

This completes the proof.

Proposition 3.

There holds $K_{sob}(\mathcal{F}_{\rm G})<K_{reg}(\mathcal{F}_{\rm G})$ and $K_{sob}(\mathcal{F}_{\rm PL})<K_{reg}(\mathcal{F}_{\rm PL})$.

Proof.

Gaussian PDF functions form a 2-parameter family $\frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$. Therefore, determining $f$ in that family is equivalent to determining the values of $\mu$ and $\sigma^{2}$. Given $\alpha=\frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$ and $\beta=-\frac{x-\mu}{\sigma^{2}\sqrt{2\pi\sigma^{2}}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$, we get $\frac{\beta}{\alpha}=-\frac{x-\mu}{\sigma^{2}}$ and $2\ln(\sqrt{2\pi}\alpha)=-\ln(\sigma^{2})-\frac{(x-\mu)^{2}}{\sigma^{2}}$. Thus $2\ln(\sqrt{2\pi}\alpha)=-\ln(\sigma^{2})-\frac{\beta^{2}}{\alpha^{2}}\sigma^{2}$. The right hand side is a strictly decreasing function of $\sigma^{2}$, so this equation has a unique solution $\sigma^{2}$. Substituting that solution into $\frac{\beta}{\alpha}=-\frac{x-\mu}{\sigma^{2}}$ we determine $\mu$. Thus $K_{sob}$ is equal to $1$ for the family of Gaussian PDF functions.

On the other hand, there holds $K_{reg}>2$ for the family of Gaussian PDF functions. For example, $N(2,1)$ and $N(2.847\ldots,1.641\ldots)$ have the same values at $x=0$ and $x=3$ (existence of a “real” solution near this approximate solution is an immediate consequence of the Implicit Function Theorem). This ends the proof for the $\mathcal{F}_{\rm G}$ family.
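The argument for the Gaussian family can be traced numerically. The sketch below recovers $(\mu,\sigma^{2})$ from one value $\alpha$ and one derivative $\beta$ by solving the monotone equation above with bisection. The bracket `[lo, hi]` for $\sigma^{2}$, the iteration count and the function names are assumptions made for this example. The second part evaluates the two-point example using the truncated constants quoted in the text, under the assumption that the second parameter of $N(\cdot,\cdot)$ there is the standard deviation.

```python
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gaussian_pdf_grad(x, mu, var):
    return -(x - mu) / var * gaussian_pdf(x, mu, var)

def recover_gaussian(x, alpha, beta, lo=1e-8, hi=1e8, iters=200):
    """Recover (mu, var) from the value alpha and derivative beta at one point x.

    Solves 2 ln(sqrt(2 pi) alpha) = -ln(var) - (beta / alpha)^2 * var by
    bisection; the right-hand side is strictly decreasing in var.
    """
    target = 2.0 * math.log(math.sqrt(2.0 * math.pi) * alpha)
    ratio = beta / alpha

    def rhs(var):
        return -math.log(var) - ratio ** 2 * var

    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint of the bracket
        if rhs(mid) > target:
            lo = mid  # rhs still too large, so var must be larger
        else:
            hi = mid
    var = math.sqrt(lo * hi)
    mu = x + ratio * var  # from beta / alpha = -(x - mu) / var
    return mu, var

# One (value, derivative) pair at x = 0.7 identifies the Gaussian.
x0, true_mu, true_var = 0.7, 2.0, 1.5
mu_hat, var_hat = recover_gaussian(
    x0, gaussian_pdf(x0, true_mu, true_var), gaussian_pdf_grad(x0, true_mu, true_var))

# Two values alone do not: both Gaussians nearly agree at x = 0 and x = 3.
first = (gaussian_pdf(0.0, 2.0, 1.0), gaussian_pdf(3.0, 2.0, 1.0))
second = (gaussian_pdf(0.0, 2.847, 1.641 ** 2), gaussian_pdf(3.0, 2.847, 1.641 ** 2))
```

The small residual difference between `first` and `second` comes from using the truncated constants.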

We will now discuss the family $\mathcal{F}_{\rm PL}$. Every linear function is uniquely determined by its value at a single point and its derivative. Thus, for any function $f\in\mathcal{F}_{\rm PL}$, as the partition $D=D_{1}\cup\ldots\cup D_{n}$ is fixed, it is sufficient to know the values and the values of the derivative of $f$ at $\sigma_{1}\in D_{1},\ldots,\sigma_{n}\in D_{n}$ to determine it uniquely. On the other hand, we need at least $d+1$ points (recall that $d$ is the dimension of the domain of $f$) in each of the domains $D_{i}$ to determine $f$ uniquely, if we are allowed to look only at the values.

2 Artificial Datasets

Figure 5: Ackley function (on the left) and its models using regular neural network training (left part of each plot) and Sobolev Training (right part). We also plot the vector field of the gradients of each predictor underneath the function plot. [Panels: Dataset; 20 training samples (Regular, Sobolev); 100 training samples (Regular, Sobolev).]
Figure 6: Beale function (on the left) and its models using regular neural network training (left part of each plot) and Sobolev Training (right part). We also plot the vector field of the gradients of each predictor underneath the function plot. [Panels: Dataset; 20 training samples (Regular, Sobolev); 100 training samples (Regular, Sobolev).]
Figure 7: Booth function (on the left) and its models using regular neural network training (left part of each plot) and Sobolev Training (right part). We also plot the vector field of the gradients of each predictor underneath the function plot. [Panels: Dataset; 20 training samples (Regular, Sobolev); 100 training samples (Regular, Sobolev).]
Figure 8: Bukin function (on the left) and its models using regular neural network training (left part of each plot) and Sobolev Training (right part). We also plot the vector field of the gradients of each predictor underneath the function plot. [Panels: Dataset; 20 training samples (Regular, Sobolev); 100 training samples (Regular, Sobolev).]
Figure 9: McCormick function (on the left) and its models using regular neural network training (left part of each plot) and Sobolev Training (right part). We also plot the vector field of the gradients of each predictor underneath the function plot. [Panels: Dataset; 20 training samples (Regular, Sobolev); 100 training samples (Regular, Sobolev).]
Figure 10: Rosenbrock function (on the left) and its models using regular neural network training (left part of each plot) and Sobolev Training (right part). We also plot the vector field of the gradients of each predictor underneath the function plot. [Panels: Dataset; 20 training samples (Regular, Sobolev); 100 training samples (Regular, Sobolev).]
Figure 11: Styblinski-Tang function (on the left) and its models using regular neural network training (left part of each plot) and Sobolev Training (right part). We also plot the vector field of the gradients of each predictor underneath the function plot. [Panels: Dataset; 20 training samples (Regular, Sobolev); 100 training samples (Regular, Sobolev).]

Functions used (visualised in Figures 5-11):

  • Ackley’s

    $f(x,y)=-20\exp\left(-0.2\sqrt{0.5(x^{2}+y^{2})}\right)-\exp\left(0.5(\cos(2\pi x)+\cos(2\pi y))\right)+e+20,$

    for $x,y\in[-5,5]\times[-5,5]$

  • Beale’s

    $f(x,y)=(1.5-x+xy)^{2}+(2.25-x+xy^{2})^{2}+(2.625-x+xy^{3})^{2},$

    for $x,y\in[-4.5,4.5]\times[-4.5,4.5]$

  • Booth

    $f(x,y)=(x+2y-7)^{2}+(2x+y-5)^{2},$

    for $x,y\in[-10,10]\times[-10,10]$

  • Bukin

    $f(x,y)=100\sqrt{|y-0.01x^{2}|}+0.01|x+10|,$

    for $x,y\in[-15,-5]\times[-3,3]$

  • McCormick

    $f(x,y)=\sin(x+y)+(x-y)^{2}-1.5x+2.5y+1,$

    for $x,y\in[-1.5,4]\times[-3,4]$

  • Rosenbrock

    $f(x,y)=100(y-x^{2})^{2}+(x-1)^{2},$

    for $x,y\in[-2,2]\times[-2,2]$

  • Styblinski-Tang

    $f(x,y)=0.5(x^{4}-16x^{2}+5x+y^{4}-16y^{2}+5y),$

    for $x,y\in[-5,5]\times[-5,5]$

Networks were trained using the Adam optimiser with learning rate $3e{-5}$. The training set was sampled uniformly from the domain provided. The test set always consists of 10,000 points sampled uniformly from the same domain.
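The data generation described above can be sketched for a subset of the benchmark functions. This is a minimal illustration and not the authors' code: the gradients are derived by hand from the formulas listed earlier, and the names `BENCHMARKS` and `sample_dataset`, as well as the random seed, are hypothetical.

```python
import numpy as np

def booth(x, y):
    return (x + 2 * y - 7) ** 2 + (2 * x + y - 5) ** 2

def booth_grad(x, y):
    a, b = x + 2 * y - 7, 2 * x + y - 5
    return np.stack([2 * a + 4 * b, 4 * a + 2 * b], axis=-1)

def rosenbrock(x, y):
    return 100 * (y - x ** 2) ** 2 + (x - 1) ** 2

def rosenbrock_grad(x, y):
    return np.stack([-400 * x * (y - x ** 2) + 2 * (x - 1),
                     200 * (y - x ** 2)], axis=-1)

def mccormick(x, y):
    return np.sin(x + y) + (x - y) ** 2 - 1.5 * x + 2.5 * y + 1

def mccormick_grad(x, y):
    c = np.cos(x + y)
    return np.stack([c + 2 * (x - y) - 1.5, c - 2 * (x - y) + 2.5], axis=-1)

# name -> (function, gradient, ((x_lo, x_hi), (y_lo, y_hi)))
BENCHMARKS = {
    "booth": (booth, booth_grad, ((-10.0, 10.0), (-10.0, 10.0))),
    "rosenbrock": (rosenbrock, rosenbrock_grad, ((-2.0, 2.0), (-2.0, 2.0))),
    "mccormick": (mccormick, mccormick_grad, ((-1.5, 4.0), (-3.0, 4.0))),
}

def sample_dataset(name, n, rng):
    """Uniformly sample n inputs; return (inputs, values, gradients)."""
    f, grad, ((x_lo, x_hi), (y_lo, y_hi)) = BENCHMARKS[name]
    x = rng.uniform(x_lo, x_hi, size=n)
    y = rng.uniform(y_lo, y_hi, size=n)
    return np.stack([x, y], axis=-1), f(x, y), grad(x, y)

rng = np.random.default_rng(0)  # arbitrary seed
train = {name: sample_dataset(name, 20, rng) for name in BENCHMARKS}
test = {name: sample_dataset(name, 10000, rng) for name in BENCHMARKS}
```

Each dataset entry holds the inputs, the target values and the target gradients, which is the supervision a first-order Sobolev loss would consume.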

3 Policy Distillation

Agents' policies are feed-forward networks consisting of:

  • 32 8x8 kernels with stride 4

  • ReLU nonlinearity

  • 64 4x4 kernels with stride 2

  • ReLU nonlinearity

  • 64 3x3 kernels with stride 1

  • ReLU nonlinearity

  • Linear layer with 512 units

  • ReLU nonlinearity

  • Linear layer with 3 (Pong), 4 (Breakout) or 6 outputs (Space Invaders)

  • Softmax

They were trained with A3C [16] over 80e6 steps, using history of length 4, greyscaled input, and action repeat 4. Observations were scaled down to 84x84 pixels.

Data was gathered by running the trained policy for 100K frames (thus for 400K actual steps). The split into train and test sets was done time-wise, ensuring that test frames come from different episodes than the training ones.

The distillation network consists of:

  • 16 8x8 kernels with stride 4

  • ReLU nonlinearity

  • 32 4x4 kernels with stride 2

  • ReLU nonlinearity

  • Linear layer with 256 units

  • ReLU nonlinearity

  • Linear layer with 3 (Pong), 4 (Breakout) or 6 outputs (Space Invaders)

  • Softmax

and was trained using the Adam optimiser with learning rate fitted independently per game and per approach between $10^{-3}$ and $10^{-5}$. Batch size is 200 frames, randomly selected from the training set.
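To make the difference in capacity between the agent network and the distillation network concrete, the sketch below computes the spatial size and flattened feature count entering the first linear layer of each. It assumes convolutions without padding ("valid"), which is not stated in the text; `conv_out`, `flattened_features`, `TEACHER` and `STUDENT` are hypothetical names.

```python
def conv_out(size, kernel, stride):
    """Spatial output size of an unpadded ('valid') convolution (assumed)."""
    return (size - kernel) // stride + 1


def flattened_features(input_size, conv_layers):
    """conv_layers is a list of (channels, kernel, stride) tuples.

    Returns (final spatial size, number of features after flattening).
    """
    size, channels = input_size, None
    for channels, kernel, stride in conv_layers:
        size = conv_out(size, kernel, stride)
    return size, channels * size * size


# (channels, kernel, stride) as listed for the two networks
TEACHER = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]
STUDENT = [(16, 8, 4), (32, 4, 2)]

print(flattened_features(84, TEACHER))  # (7, 3136)
print(flattened_features(84, STUDENT))  # (9, 2592)
```

Under this padding assumption, the 84x84 input is reduced to 7x7x64 features before the 512-unit layer of the agent, and to 9x9x32 features before the 256-unit layer of the distillation network.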

4 Synthetic Gradients

All models were trained using multi-GPU optimisation, with Sync main network updates and Hogwild SG module updates.

4.1 Meaning of Sobolev losses for synthetic gradients

In the setting considered, the true label $y$ is used only as conditioning; however, one could also provide supervision for $\partial m(h,y|\theta)/\partial y$. So what is the actual effect these Sobolev losses have on the SG estimator? For $L$ being the log loss, it is easy to show that they are additional penalties on matching $\log p(h|\theta)$ to $\log p_{h}$, namely:

$$\|\partial m(h,y|\theta)/\partial y-\partial L(h,y)/\partial y\|^{2}=\|\log p(h|\theta)-\log p_{h}\|^{2}$$
$$\|m(h,y|\theta)-L(h,y)\|^{2}=(\log p(h|\theta)_{\hat{y}}-\log {p_{h}}_{\hat{y}})^{2},$$

where $\hat{y}$ is the index of “1” in the one-hot encoded label vector $y$. Consequently, loss supervision makes sure that the internal prediction $\log p(h|\theta)$ for the true label $\hat{y}$ is close to the current prediction of the whole model, $\log p_{h}$. Matching partial derivatives with respect to the label, on the other hand, makes sure that the two predictions are close to each other for all the classes. Finally, if we use both, we get a weighted sum in which deviating from the prediction on the true label is penalised more heavily than deviating on any of the remaining ones (footnote 6).

Footnote 6: Adding $\partial L/\partial y$ supervision on toy MNIST experiments increased convergence speed and stability; however, because TensorFlow currently does not support differentiating cross entropy with respect to labels, it was omitted in our large-scale experiments.
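The two identities can be checked numerically. The sketch below assumes the log loss is written as $-\sum_k y_k \log p_k$, so that it is linear in $y$ and its derivative with respect to the label is simply $-\log p$; the probability vectors are made-up illustrative values, and all function and variable names are hypothetical.

```python
import math


def log_loss(log_probs, y):
    """Cross entropy -sum_k y_k log p_k, which is linear in the label vector y."""
    return -sum(yk * lp for yk, lp in zip(y, log_probs))


def grad_wrt_label(log_probs):
    """Derivative of the log loss with respect to y: the vector -log p."""
    return [-lp for lp in log_probs]


# Illustrative (made-up) predictions over three classes.
log_p_model = [math.log(p) for p in (0.7, 0.2, 0.1)]  # log p_h, whole model
log_p_sg = [math.log(p) for p in (0.5, 0.3, 0.2)]     # log p(h|theta), SG module
y = [0.0, 1.0, 0.0]                                   # one-hot label
y_hat = y.index(1.0)

# Squared difference of loss values: only the true-label entry contributes.
value_term = (log_loss(log_p_sg, y) - log_loss(log_p_model, y)) ** 2

# Squared norm of the difference of label derivatives: all classes contribute.
deriv_term = sum(
    (a - b) ** 2
    for a, b in zip(grad_wrt_label(log_p_sg), grad_wrt_label(log_p_model))
)
```

Here `value_term` equals the squared gap on class `y_hat` alone, while `deriv_term` sums the squared gaps over all classes, so adding the two penalises the true-label entry twice.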

4.2 Cifar10

All Cifar10 experiments use a deep convolutional network of the following structure:

  • 64 3x3 kernels with stride 1

  • BatchNorm and ReLU nonlinearity

  • 64 3x3 kernels with stride 1

  • BatchNorm and ReLU nonlinearity

  • 128 3x3 kernels with stride 2

  • BatchNorm and ReLU nonlinearity

  • 128 3x3 kernels with stride 1

  • BatchNorm and ReLU nonlinearity

  • 128 3x3 kernels with stride 1

  • BatchNorm and ReLU nonlinearity

  • 256 3x3 kernels with stride 2

  • BatchNorm and ReLU nonlinearity

  • 256 3x3 kernels with stride 1

  • BatchNorm and ReLU nonlinearity

  • 256 3x3 kernels with stride 1

  • BatchNorm and ReLU nonlinearity

  • 512 3x3 kernels with stride 2

  • BatchNorm and ReLU nonlinearity

  • 512 3x3 kernels with stride 1

  • BatchNorm and ReLU nonlinearity

  • 512 3x3 kernels with stride 1

  • BatchNorm and ReLU nonlinearity

  • Linear layer with 10 outputs

  • Softmax

with L2 regularisation of $10^{-4}$. The network is trained in an asynchronous manner, using 10 GPUs in parallel. Each worker uses a batch size of 32. The main optimiser is Stochastic Gradient Descent with momentum of 0.9. The learning rate is initialised to 0.1 and then dropped by an order of magnitude after 40K, 60K and finally after 80K updates.

Each of the three SG modules is a convolutional network consisting of:

  • 128 3x3 kernels with stride 1

  • ReLU nonlinearity

  • Linear layer with 10 outputs

  • Softmax

It is trained using the Adam optimiser with learning rate $10^{-4}$; no learning rate schedule is applied. Updates of the synthetic gradient module are performed in a Hogwild manner. Losses used for both loss prediction and gradient estimation are L1.

For the direct SG model we used the architecture described in the original paper: 3 resolution-preserving layers of 128 kernels of 3x3 convolutions with ReLU activations in between. The only difference is that we use an L1 penalty instead of L2, as we found empirically that it works better for the tasks considered.

4.3 Imagenet

All ImageNet experiments use a ResNet50 network with L2 regularisation of $10^{-4}$. The network is trained in an asynchronous manner, using 34 GPUs in parallel. Each worker uses a batch size of 32. The main optimiser is Stochastic Gradient Descent with momentum of 0.9. The learning rate is initialised to 0.1 and then dropped by an order of magnitude after 100K, 150K and finally after 175K updates.
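The piecewise-constant learning rate schedules used for the main networks on Cifar10 and ImageNet can be written as a single helper. This is a minimal sketch: `step_lr` and the two boundary tuples are hypothetical names, and treating the drop as taking effect exactly at the boundary update (rather than one update later) is an assumption.

```python
def step_lr(update, boundaries, base_lr=0.1, factor=0.1):
    """Learning rate after `update` updates.

    The rate starts at base_lr and is multiplied by `factor` once for every
    boundary that has been reached (update >= boundary is assumed).
    """
    drops = sum(1 for b in boundaries if update >= b)
    return base_lr * factor ** drops


CIFAR10_BOUNDARIES = (40_000, 60_000, 80_000)
IMAGENET_BOUNDARIES = (100_000, 150_000, 175_000)
```

For example, `step_lr(50_000, CIFAR10_BOUNDARIES)` gives a rate of about 0.01, while the same update count under `IMAGENET_BOUNDARIES` still uses 0.1.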

The SG module is a convolutional network, attached after the second ResNet block, consisting of:

  • 64 3x3 kernels with stride 1

  • ReLU nonlinearity

  • 64 3x3 kernels with stride 2

  • ReLU nonlinearity

  • Global averaging

  • 1000 1x1 kernels

  • Softmax

It is trained using the Adam optimiser with learning rate $10^{-4}$; no learning rate schedule is applied. Updates of the synthetic gradient module are performed in a Hogwild manner. Sobolev losses are set to L1.

Standard data augmentation, taken from the original Inception V1 paper, was applied during training.

5 Gradient-based attention transfer

Zagoruyko et al. [31] recently proposed the following cost for transferring attention from a model $f$ to a model $g$ parametrised by $\theta$, under the cost $L$:

$$L_{\text{transfer}}(\theta)=L(g(x|\theta))+\alpha\|\partial L(g(x|\theta))/\partial x-\partial L(f(x))/\partial x\|_{2}\tag{3}$$

where the first term is simply the original minimisation problem, and the second measures the loss sensitivity of the target ($f$) and tries to match the corresponding quantity in the model $g$. This can be seen as Sobolev training under four additional assumptions:

  1. one does not model $f$, but rather $L(f(x))$ (similarly to our Synthetic Gradient model, one constructs a loss predictor),

  2. $L(f(x))=0$ (the target model is perfect),

  3. the loss being estimated is non-negative ($L(\cdot)\geq 0$),

  4. the loss used to measure the difference in predictor values (loss estimates) is L1.

If we combine these four assumptions we get

$$\begin{aligned}
L_{\text{sobolev}}(\theta)&=\|L(g(x|\theta))-L(f(x))\|_{1}+\alpha\|\partial L(g(x|\theta))/\partial x-\partial L(f(x))/\partial x\|_{2}\\
&=\|L(g(x|\theta))\|_{1}+\alpha\|\partial L(g(x|\theta))/\partial x-\partial L(f(x))/\partial x\|_{2}\\
&=L(g(x|\theta))+\alpha\|\partial L(g(x|\theta))/\partial x-\partial L(f(x))/\partial x\|_{2}.
\end{aligned}$$

Note, however, that in general these approaches are not the same, but rather share the idea of matching gradients of a predictor and a target in order to build a better model.

In other words, Sobolev training exploits derivatives to find a closer fit to the target function, while the proposed transfer loss instead adds a sensitivity-matching term to the original minimisation problem. The following observation makes this distinction more formal.

Remark 2.

Let us assume that a target function $L\circ f$ belongs to the hypothesis space $\mathcal{H}$, meaning that there exists $\theta_{f}$ such that $L(g(\cdot|\theta_{f}))=L(f(\cdot))$. Then $\theta_{f}$ is a minimiser of the Sobolev loss, but does not have to be a minimiser of the transfer loss defined in Eq. (3).

Proof.

By definition the Sobolev loss is non-negative, thus it suffices to show that $L_{\text{sobolev}}(\theta_{f})=0$, but

$$\begin{aligned}
L_{\text{sobolev}}(\theta_{f})&=\|L(g(x|\theta_{f}))-L(f(x))\|+\alpha\|\partial L(g(x|\theta_{f}))/\partial x-\partial L(f(x))/\partial x\|\\
&=\|L(f(x))-L(f(x))\|+\alpha\|\partial L(f(x))/\partial x-\partial L(f(x))/\partial x\|=0.
\end{aligned}$$

By the same argument we get for the transfer loss

$$\begin{aligned}
L_{\text{transfer}}(\theta_{f})&=L(g(x|\theta_{f}))+\alpha\|\partial L(g(x|\theta_{f}))/\partial x-\partial L(f(x))/\partial x\|\\
&=L(g(x|\theta_{f}))+\alpha\|\partial L(f(x))/\partial x-\partial L(f(x))/\partial x\|=L(g(x|\theta_{f})).
\end{aligned}$$

Consequently, if there exists another $\theta_{h}$ such that $L(g(x|\theta_{h}))<L(g(x|\theta_{f}))-\alpha\|\partial L(g(x|\theta_{h}))/\partial x-\partial L(f(x))/\partial x\|$, then $\theta_{f}$ is not a minimiser of the loss considered.

To show that this final constraint does not lead to an empty set, let us consider the class of constant functions $g(x|\theta)=\theta$, and $L(p)=\|p\|^{2}$. Let us fix some $\theta_{f}>0$ that identifies $f$, and we get:

$$L_{\text{transfer}}(\theta_{f})=L(g(x|\theta_{f}))=\theta_{f}^{2}>0$$

and at the same time for any $|\theta_{h}|<\theta_{f}$ (e.g. $\theta_{h}=\theta_{f}/2$) we have:

$$\begin{aligned}
L_{\text{transfer}}(\theta_{h})&=L(g(x|\theta_{h}))+\alpha\|\partial L(g(x|\theta_{h}))/\partial x-\partial L(g(x|\theta_{f}))/\partial x\|\\
&=\theta_{h}^{2}+\alpha(0-0)=\theta_{h}^{2}<\theta_{f}^{2}=L_{\text{transfer}}(\theta_{f}).
\end{aligned}$$
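The counterexample can be replayed numerically. The sketch below instantiates the constant-function class $g(x|\theta)=\theta$ with scalar $\theta$ and $L(p)=p^{2}$, and evaluates both losses at $\theta_{f}$ and $\theta_{h}=\theta_{f}/2$; the concrete values $\theta_{f}=2$ and $\alpha=1$, as well as all function names, are illustrative assumptions.

```python
def L(p):
    """Loss L(p) = p^2 for scalar p."""
    return p * p


def dL_dx(theta):
    """Derivative of L(g(x|theta)) with respect to x.

    g(x|theta) = theta does not depend on x, so the derivative is zero.
    """
    return 0.0


def transfer_loss(theta, theta_f, alpha=1.0):
    """Transfer loss of Eq. (3) for the constant-function class."""
    return L(theta) + alpha * abs(dL_dx(theta) - dL_dx(theta_f))


def sobolev_loss(theta, theta_f, alpha=1.0):
    """Sobolev loss: value matching plus derivative matching."""
    return abs(L(theta) - L(theta_f)) + alpha * abs(dL_dx(theta) - dL_dx(theta_f))


theta_f = 2.0          # parameters identifying the target f (illustrative)
theta_h = theta_f / 2  # a competing parameter with |theta_h| < theta_f

print(transfer_loss(theta_f, theta_f), transfer_loss(theta_h, theta_f))  # 4.0 1.0
print(sobolev_loss(theta_f, theta_f), sobolev_loss(theta_h, theta_f))    # 0.0 3.0
```

The transfer loss prefers $\theta_{h}$ over $\theta_{f}$, whereas the Sobolev loss attains its minimum of zero at $\theta_{f}$, in line with Remark 2.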