Deep learning generalizes because the parameter-function map is biased towards simple functions

Guillermo Valle-Pérez, Chico Q. Camargo, Ard A. Louis

Departments of Physics, University of Oxford, UK


Why does deep learning generalize?
Supervised learning theory
Because it has an inductive bias
Ok, but why does it have an inductive bias?
VC theory
Limited expressivity maybe?

Zhang et al. (2017a)
No, neural networks (NNs) can fit randomly labelled data

D Soudry et al., Zhang et al. (2017b), Zhang et al. (2018)
Maybe SGD is what's biasing towards certain solutions
But many very different optimization algorithms generalize well
Wu et al.
Yeah!
Hmm, maybe it's an intrinsic property of the NN, like its parameter-function map?
What's that?

Let the space of functions that the model can express be $\mathcal{F}$. If the model has $p$ real-valued parameters, taking values within a set $\Theta \subseteq \mathbb{R}^p$, the parameter-function map, $\mathcal{M}$, is defined as:

$$\mathcal{M}: \Theta \to \mathcal{F}, \qquad \theta \mapsto f_\theta,$$

where $f_\theta$ is the function implemented by the model with choice of parameter vector $\theta$.
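To make the definition concrete, here is a minimal Python sketch (not the authors' code) of the parameter-function map for a small fully connected ReLU network with 7 Boolean inputs and one Boolean output; the layer widths (7, 40, 40, 1) and the truth-table encoding of $f_\theta$ are illustrative assumptions.

```python
# A minimal sketch of the parameter-function map M for a tiny fully connected
# ReLU network with 7 Boolean inputs and one Boolean output. The function
# f_theta is identified with its truth table over all 2^7 = 128 inputs.
import itertools
import numpy as np

INPUTS = np.array(list(itertools.product([0.0, 1.0], repeat=7)))  # all 128 Boolean inputs

def parameter_function_map(theta, widths=(7, 40, 40, 1)):
    """Map a flat parameter vector theta to the Boolean function it implements,
    represented as a 128-character string of 0s and 1s (the truth table)."""
    x = INPUTS
    idx = 0
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        W = theta[idx: idx + n_in * n_out].reshape(n_in, n_out)
        idx += n_in * n_out
        b = theta[idx: idx + n_out]
        idx += n_out
        x = x @ W + b
        if n_out != 1:                      # ReLU on hidden layers only
            x = np.maximum(x, 0.0)
    return "".join("1" if v > 0 else "0" for v in x.ravel())  # threshold the output
```

Many different parameter vectors map to the same truth table; the question below is how unevenly the parameter space is divided among functions.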

Result 1: The parameter-function map is hugely biased

For all the neural network architectures we tried:

The volumes of regions of parameter space producing particular functions span a huge range of orders of magnitude.
[Figure: log P(f) versus Lempel-Ziv complexity of f]
Oh, and how did you find that out?

For a family of fully connected feedforward neural networks with 7 Boolean inputs and one Boolean output, of varying depths and widths, we sampled parameters from several distributions. Figure 1 shows the empirical frequencies with which different functions are obtained.
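A minimal sketch of this sampling experiment, reusing the `parameter_function_map` sketch above; the Gaussian parameter distribution and the sample size are illustrative (the paper samples from several distributions).

```python
# A minimal sketch of the sampling experiment: draw parameter vectors from a
# prior (here i.i.d. Gaussians), apply the parameter-function map, and count
# how often each Boolean function appears.
from collections import Counter
import numpy as np

def empirical_function_frequencies(n_samples=10**5, n_params=2001, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    counts = Counter()
    for _ in range(n_samples):
        theta = rng.normal(0.0, sigma, size=n_params)  # n_params matches widths (7, 40, 40, 1)
        counts[parameter_function_map(theta)] += 1
    return counts  # counts[f] / n_samples estimates P(f)
```

`counts.most_common(10)` then shows the most frequently produced functions, and `counts[f] / n_samples` is the empirical estimate of P(f).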

For some larger neural networks with higher-dimensional input spaces, we used a Gaussian process approximation to calculate the probability of different functions. This can be seen in Figure 2a and the inset of Figure 3.

Ok, but do we have any way to characterize the bias? What kinds of functions are the networks biased towards?

Result 2: The bias is towards simple functions

We found that in all cases, the probability of a function is inversely correlated with its complexity (using a variety of complexity measures).

[Figure: probability versus rank of each function (rank plot)]
[Figure: log P(f) versus CSR complexity]

Why are the networks biased?

We do not yet have a deeper explanation of why the parameter-function map is biased. However, we do have a deeper reason, based on algorithmic information theory, for why it is biased towards simple functions, given that it is biased at all.

Dingle et al.

The probability $P(x)$ of obtaining output $x$ of a simple map $f$, upon sampling its inputs uniformly at random, depends only on the Kolmogorov complexity of the output, $K(x)$:

$$P(x) \lesssim 2^{-K(x) + O(1)}$$

The main condition on the map is that its Kolmogorov complexity is negligible relative to that of the output: $K(f) \ll K(x)$.

Kolmogorov complexity is uncomputable, so we use computable approximations to it, like Lempel-Ziv complexity
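For concreteness, here is a simple Lempel-Ziv (1976) phrase-counting estimator in Python; the complexity measure used in the paper is a closely related LZ-based quantity, so treat this as an illustrative proxy rather than the exact definition.

```python
# A minimal sketch of an LZ76-style complexity estimate for a binary string,
# e.g. the 2^7-bit truth table of a Boolean function of 7 inputs.
def lz76_complexity(s):
    """Number of phrases in an exhaustive Lempel-Ziv (1976) parsing of s:
    each new phrase is the shortest substring starting at the current position
    that has not occurred before (overlaps allowed)."""
    i, n, phrases = 0, len(s), 0
    while i < n:
        l = 1
        # grow the candidate phrase while it still occurs earlier in the string
        while i + l <= n and s[i:i + l] in s[:i + l - 1]:
            l += 1
        phrases += 1
        i += l
    return phrases

# Simple strings get low complexity, e.g. lz76_complexity("0" * 128) == 2,
# while a typical random 128-bit string yields a much larger phrase count.
```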

The parameter-function map satisfies $K(f) \ll K(x)$, and indeed we found that the bound works (red line in the figure).
Is this bias enough to explain the observed generalization?

Result 3: The bias is enough to explain "the bulk" of the generalization in our experiments

[Figure: generalization error bounds]
Interesting, and how did you determine that?

To explore this question:

  • We use the PAC-Bayesian framework to translate probabilistic biases into generalization guarantees
  • We make the assumption that the algorithm optimizing the parameters is unbiased, to isolate the effect of the parameter-function map. More precisely, we assume that the optimization algorithm samples the zero-error region close to uniformly (Assumption 1).

Can you provide more details on your method to obtain PAC-Bayes bounds?
Corollary 1 (of Langford and Seeger's version of the PAC-Bayesian theorem (Langford et al.)). For any distribution $P$ on any function space and any realizable distribution $D$ on a space of instances, we have, for $0 < \delta \leq 1$, that with probability at least $1 - \delta$ over the choice of sample $S$ of $m$ instances,

$$\ln\left(\frac{1}{1 - \epsilon(Q)}\right) \leq \frac{\ln\frac{1}{P(U)} + \ln\left(\frac{2m}{\delta}\right)}{m - 1}$$

where $\epsilon(Q)$ is the expected generalization error under the distribution over functions $Q(c) = \frac{P(c)}{\sum_{c' \in U} P(c')}$, $U$ is the set of functions in $\mathcal{H}$ consistent with the sample $S$, and $P(U) = \sum_{c \in U} P(c)$.
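Once $P(U)$ is known, the bound can be evaluated directly; this short sketch (with purely illustrative numbers) just inverts the inequality in Corollary 1.

```python
# A minimal sketch of evaluating Corollary 1: given ln P(U), the sample size m,
# and confidence delta, invert the inequality to bound the expected
# generalization error eps(Q).
import math

def pac_bayes_error_bound(log_PU, m, delta=0.05):
    """Upper bound on eps(Q) from
    ln(1/(1 - eps(Q))) <= (ln(1/P(U)) + ln(2m/delta)) / (m - 1),
    where log_PU is the natural log of P(U) (a large negative number)."""
    rhs = (-log_PU + math.log(2 * m / delta)) / (m - 1)
    return 1.0 - math.exp(-rhs)

# Example with made-up numbers: ln P(U) = -5000 on m = 10000 training points
# gives a bound of about 0.39 on the expected generalization error.
# print(pac_bayes_error_bound(-5000, 10000))
```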
Ah I see. So the bound depends on the data via $P(U)$, which is nothing but the marginal likelihood of the labels on the data under the prior $P(f)$. But how do you calculate $P(U)$ for neural networks? Isn't that intractable?

J Lee et al.
Yes. However, P(f) for deep fully connected neural networks approaches a Gaussian process as the width of the layers approaches infinity.
A Garriga-Alonso et al., R Novak et al.
also for convolutional networks, as the number of filters goes to infinity!
G Yang
actually most modern neural net architectures do too, in certain scaling limits!
AGG Mathews et al.
and it seems the networks don't need to be that wide for the approximation to be good (we independently checked this too)
Thanks everyone! The Gaussian process approximation is what allows us to compute $P(U)$ for realistically-sized NNs. However, the marginal likelihood for a Gaussian process with a Bernoulli likelihood (for binary classification, the setting of our PAC-Bayes analysis) is still intractable, so we explored several approximation techniques: variational, Laplace, and expectation propagation (EP). We found EP to work best for our purposes.
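For intuition about how $P(U)$ becomes computable, here is a sketch of the kernel recursion for the Gaussian process that a wide fully connected ReLU network converges to (the NNGP correspondence cited above); the depth and the weight/bias variances $\sigma_w$, $\sigma_b$ are illustrative assumptions, and the EP approximation of the marginal likelihood that follows (e.g. via a standard GP classification toolkit) is not shown here.

```python
# A minimal sketch, assuming a fully connected ReLU network in the
# infinite-width limit: compute the NNGP kernel matrix by iterating the
# arc-cosine kernel recursion. P(U) is then the GP marginal likelihood of the
# training labels, which the paper approximates with expectation propagation.
import numpy as np

def nngp_kernel(X, depth=3, sigma_w=1.0, sigma_b=0.1):
    """Kernel of the infinite-width limit of a `depth`-hidden-layer ReLU
    network on inputs X (one row per data point)."""
    # Input layer: scaled inner products of the raw inputs
    K = sigma_b**2 + sigma_w**2 * (X @ X.T) / X.shape[1]
    for _ in range(depth):
        diag = np.sqrt(np.diag(K))
        norm = np.outer(diag, diag)
        theta = np.arccos(np.clip(K / norm, -1.0, 1.0))
        # ReLU (arc-cosine) recursion for the next layer's covariance
        K = sigma_b**2 + (sigma_w**2 / (2 * np.pi)) * norm * (
            np.sin(theta) + (np.pi - theta) * np.cos(theta))
    return K

# e.g. K = nngp_kernel(np.random.randn(100, 784)) for 100 flattened 28x28 inputs
```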

What's the effect of the optimization algorithm?

After all, different optimization algorithms do show differences in generalization in practice

Yes, but the differences in generalization are typically only a few percent.

However, you raise an important point. Although we have shown that the bias is enough to explain the bulk of the generalization, whether it is the actual origin of the generalization in DNNs depends on the behaviour of the optimization algorithm.

A sufficient (though not necessary) condition for the parameter-function map to be the main origin of the generalization is that the optimization algorithm isn't too biased, i.e. that Assumption 1 is approximately valid.

[Figure: probabilities of functions found by advSGD versus GP probabilities]

We conjecture that this is the case for many common DNN optimization algorithms (note that for exact Bayesian sampling it holds by definition), and we show some empirical evidence supporting this.


Future work?
There are open questions regarding the validity of Assumption 1, the accuracy of EP and other approximations to $P(U)$, and the tightness of the PAC-Bayes bound itself.
Furthermore, one can dig deeper to better understand the origin of the bias, and to characterize it. In particular, there is the very important question of why the bias is helpful for real-world tasks.
We work under the standard assumption of supervised learning theory, that the test set is sampled i.i.d. from the same distribution as the training set. This implies a couple of important limitations of our work:
  • Our bound is not applicable in many real-world situations where this assumption doesn't hold. See "Do ImageNet Classifiers Generalize to ImageNet?" for an example of work showing that this assumption often holds only approximately (and in other cases not at all).
  • In the situations where it does apply, we think our bound offers important insight into the origin of the generalization performance of neural networks. However, if the objective is purely to predict generalization performance, other approaches, like using a test set, work better than our current bounds (see "A Comparison of Tight Generalization Error Bounds" from 2005 for a nice overview).