Let the space of functions that the model can express be $\mathcal{F}$. If the model has $p$ real-valued parameters, taking values within a set $\Theta \subseteq \mathbb{R}^p$, the parameter-function map, $\mathcal{M}$, is defined as:

$$\mathcal{M} : \Theta \to \mathcal{F}, \qquad \theta \mapsto f_\theta,$$

where $f_\theta$ is the function implemented by the model with choice of parameter vector $\theta$.
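To make the definition concrete, here is a minimal, hypothetical sketch (our own illustration, not the paper's code): a tiny one-hidden-layer network on Boolean inputs, where each function $f_\theta$ is represented as the bit string of its thresholded outputs over all $2^n$ inputs. The architecture, sizes, and the name `param_function_map` are illustrative choices.

```python
import itertools
import numpy as np

def param_function_map(theta, n_inputs=3, n_hidden=4):
    """Map a parameter vector theta to the function it implements, represented
    as a bit string of thresholded outputs over all 2**n_inputs Boolean inputs.
    Toy architecture: one tanh hidden layer, no biases, thresholded output."""
    w1 = theta[: n_inputs * n_hidden].reshape(n_inputs, n_hidden)
    w2 = theta[n_inputs * n_hidden :].reshape(n_hidden, 1)
    bits = []
    for x in itertools.product([-1.0, 1.0], repeat=n_inputs):
        h = np.tanh(np.array(x) @ w1)       # hidden layer activations
        y = (h @ w2).item()                 # scalar output
        bits.append("1" if y > 0 else "0")  # threshold to a Boolean label
    return "".join(bits)

# Example: one random parameter vector -> one function (an 8-bit string here).
rng = np.random.default_rng(0)
theta = rng.normal(size=3 * 4 + 4)
print(param_function_map(theta))
```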
For all the neural network architectures we tried:
The volumes of the regions of parameter space that produce particular functions span many orders of magnitude (see the sampling sketch after this list).
For some larger neural networks with higher-dimensional input spaces, we used a Gaussian process approximation to calculate the probability of different functions; this can be seen in Figure 2a and the inset of Figure 3.
We found that in all cases the probability of a function is inversely correlated with its complexity (using a variety of complexity measures).
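As a rough illustration of how strongly these volumes can vary, one can continue the toy example above (reusing `param_function_map` and `rng`; the sample size is arbitrary) and estimate $P(f)$ by sampling parameters at random and tallying how often each function appears:

```python
from collections import Counter

# Illustrative only: empirical P(f) under random (Gaussian) parameter sampling,
# reusing param_function_map and rng from the sketch above.
n_samples = 100_000
counts = Counter(
    param_function_map(rng.normal(size=3 * 4 + 4)) for _ in range(n_samples)
)
probs = {f: c / n_samples for f, c in counts.items()}
print(f"{len(probs)} distinct functions observed out of 2**8 = 256 possible")
print(f"highest / lowest empirical probability: "
      f"{max(probs.values()):.3g} / {min(probs.values()):.3g}")
```

Even in a toy setting like this, the empirical frequencies of different functions typically spread over orders of magnitude, which hints at the kind of bias described above.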
We do not yet have a deeper explanation of why the parameter-function map is biased at all. However, we do have a deeper reason, based on algorithmic information theory, for why the bias is towards simple functions, given that the map is biased.
The probability $P(x)$ of obtaining output $x$ from a simple map $f$, upon sampling its inputs uniformly at random, is bounded in terms of the Kolmogorov complexity $K(x)$ of the output:

$$P(x) \leq 2^{-a K(x) + b},$$

where $a$ and $b$ are constants that depend on the map but not on the output $x$.
The main condition on the map is that its own Kolmogorov complexity is negligible relative to that of the output.
Kolmogorov complexity is uncomputable, so in practice we use computable approximations to it, such as Lempel-Ziv complexity (a minimal sketch is given below).
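For instance, a simple Lempel-Ziv-style phrase count can serve as such a proxy. The sketch below is our own minimal variant (the paper's Lempel-Ziv-based measure differs in details); it counts phrases in an LZ78-style parsing:

```python
def lz_complexity(s: str) -> int:
    """Count phrases in an LZ78-style parsing: each phrase is the shortest
    prefix of the remaining string that has not appeared as a phrase before.
    A crude, computable proxy for the Kolmogorov complexity of s."""
    phrases, phrase, count = set(), "", 0
    for ch in s:
        phrase += ch
        if phrase not in phrases:
            phrases.add(phrase)
            count += 1
            phrase = ""
    if phrase:        # unfinished phrase at the end of the string
        count += 1
    return count

print(lz_complexity("0" * 32))                            # repetitive -> few phrases (8)
print(lz_complexity("01101110010111011110001001101010"))  # irregular -> more phrases (12)
```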
To explore this question:
After all, different optimization algorithms do show differences in generalization in practice.
Yes, but differences in generalization are typically of only a few percent.
However, you raise an important point. Although we have shown that the bias is enough to explain the bulk of the generalization, whether it is the actual origin of the generalization in DNNs depends on the behaviour of the optimization algorithm.
A sufficient (though not necessary) condition for the parameter-function map to be the main origin of the generalization is that the optimization algorithm isn't too biased, namely that Assumption 1 is approximately valid.
We conjecture that it is for many common DNN optimization algorithms (note that for exact Bayesian sampling it is true by definition), and we show some empirical evidence supporting this.
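To make this concrete, here is a rough, hypothetical probe in the spirit of that check, continuing the toy model from the sketches above. The training set, labels, learning rate, and sample counts are arbitrary choices of ours, plain full-batch gradient descent stands in for SGD, and rejection sampling from the Gaussian prior stands in for exact Bayesian sampling; the idea is simply to compare how often each function consistent with the training data is found by the two procedures.

```python
# Continues the toy example above (reuses param_function_map, rng, Counter,
# np, itertools). All numbers here are arbitrary, illustrative choices.
X_all = np.array(list(itertools.product([-1.0, 1.0], repeat=3)))  # all 8 inputs
train_idx = [0, 3, 5, 6]                      # a 4-point toy training set
y_train = np.array([1.0, -1.0, -1.0, 1.0])    # toy labels (consistent with x -> -x3)
X_train = X_all[train_idx]

def fits_training_data(theta):
    f = param_function_map(theta)
    return all((f[i] == "1") == (y > 0) for i, y in zip(train_idx, y_train))

def train_gd(theta, lr=0.05, steps=1000):
    """Plain full-batch gradient descent on MSE for the toy one-hidden-layer net."""
    w1 = theta[:12].reshape(3, 4).copy()
    w2 = theta[12:].reshape(4, 1).copy()
    for _ in range(steps):
        h = np.tanh(X_train @ w1)
        err = (h @ w2).ravel() - y_train
        grad_w2 = h.T @ err[:, None] / len(y_train)
        grad_w1 = X_train.T @ ((err[:, None] @ w2.T) * (1 - h**2)) / len(y_train)
        w1 -= lr * grad_w1
        w2 -= lr * grad_w2
    return np.concatenate([w1.ravel(), w2.ravel()])

# "Bayesian" side: rejection sampling from the Gaussian prior over parameters.
bayes = Counter()
while sum(bayes.values()) < 500:
    theta = rng.normal(size=16)
    if fits_training_data(theta):
        bayes[param_function_map(theta)] += 1

# Optimizer side: gradient descent from random inits, keeping runs that fit the data.
gd = Counter()
for _ in range(500):
    theta = train_gd(rng.normal(size=16))
    if fits_training_data(theta):
        gd[param_function_map(theta)] += 1

# If Assumption 1 roughly holds, the two frequency rankings should look similar.
print("rejection sampling:", bayes.most_common(3))
print("gradient descent:  ", gd.most_common(3))
```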
Starring:
J. Lee et al. Deep neural networks as Gaussian processes. arXiv preprint arXiv:1711.00165, 2017.
A. G. de G. Matthews et al. Gaussian process behaviour in wide deep neural networks. ICLR, 2018.
D. Soudry et al. The implicit bias of gradient descent on separable data. ICLR, 2018.
Zhang et al. Musings on deep learning: Properties of SGD. CBMM Memos, 04/2017.