Ard Louis: Kolmogorov complexity and explaining why neural networks generalise so well

Title: Kolmogorov complexity and explaining why neural networks generalise so well (using a function-based picture)

Abstract: One of the most surprising properties of deep neural networks (DNNs) is that they perform best in the overparameterized regime. We are all taught in a basic statistics class that having more parameters than data points is a terrible idea. This intuition can be formalised in standard learning-theory approaches, based for example on model capacity, which also predict that DNNs should heavily overfit in this regime and therefore not generalise at all. So why do DNNs work so well in a regime where theory says they should fail?
A popular strategy in the literature has been to look for some dynamical property of stochastic gradient descent (SGD) acting on a non-convex loss landscape that could explain the bias towards functions that generalise well. Here I will present a different argument, namely that DNNs are implicitly biased towards simple (low Kolmogorov complexity) solutions at initialisation [1]. This Occam's-razor-like effect arises fundamentally from a version of the coding theorem of algorithmic information theory, applied to input-output maps [2]. We also show that for DNNs in the chaotic regime this bias can be tuned away, and the good generalisation disappears with it.

For highly biased loss landscapes, SGD converges on functions with a probability that can, to first order, be approximated by their probability at initialisation [3]. Thus, even though tweaking optimisation hyperparameters can improve performance to second order, SGD itself does not explain why DNNs generalise well in the overparameterized regime. Instead, it is the intrinsic bias towards simple (low Kolmogorov complexity) functions that explains why they do not overfit. Finally, this function-based picture allows us to derive rigorous PAC-Bayes bounds that closely track DNN learning curves and can be used to rationalise differences in performance across architectures.
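To make the simplicity-bias claim a little more concrete: the coding-theorem-style bound of [2] says, roughly, that for a sufficiently generic input-output map the probability P(x) of obtaining output x from a randomly sampled input satisfies P(x) ≲ 2^(-a·K̃(x) - b), where K̃(x) is a computable approximation to the Kolmogorov complexity of x and a, b are map-dependent constants. The sketch below is not the authors' code; the network sizes, the Gaussian initialisation, and the zlib-based complexity proxy are all illustrative assumptions. It estimates the "prior over functions" of a small ReLU network by random sampling: it records which Boolean function the network implements on all 2^n inputs and compares each function's empirical probability at initialisation with a crude complexity proxy.

import itertools, zlib, collections
import numpy as np

n_inputs, hidden, n_samples = 5, 32, 20000
X = np.array(list(itertools.product([0.0, 1.0], repeat=n_inputs)))  # all 2^n binary inputs

counts = collections.Counter()
rng = np.random.default_rng(0)
for _ in range(n_samples):
    # i.i.d. Gaussian parameters for a 2-layer ReLU network (an assumed, illustrative setup)
    W1 = rng.normal(0.0, 1.0 / np.sqrt(n_inputs), (n_inputs, hidden))
    b1 = rng.normal(0.0, 1.0, hidden)
    w2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), hidden)
    b2 = rng.normal(0.0, 1.0)
    h = np.maximum(X @ W1 + b1, 0.0)
    f = (h @ w2 + b2 > 0).astype(int)  # the Boolean function, recorded as a truth table
    counts["".join(map(str, f))] += 1

def complexity_proxy(truth_table):
    # crude stand-in for the (uncomputable) Kolmogorov complexity: compressed length
    return len(zlib.compress(truth_table.encode()))

print("P(f) at initialisation vs complexity proxy (5 most frequent functions):")
for tt, c in counts.most_common(5):
    print(f"  P(f) ~ {c / n_samples:.4f}   complexity ~ {complexity_proxy(tt)}")

In experiments of this kind one typically sees a few very simple truth tables (the constant functions in particular) dominating the counts, in line with the bias reported in [1]; the zlib proxy is only a coarse ordering of complexity for such short strings, so it should be read qualitatively rather than quantitatively.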

[1] Guillermo Valle Pérez, Chico Q. Camargo and Ard A. Louis, "Deep learning generalizes because the parameter-function map is biased towards simple functions", arXiv:1805.08522.
[2] K. Dingle, C. Q. Camargo and A. A. Louis, "Input–output maps are strongly biased towards simple outputs", Nature Communications 9, 761 (2018).
[3] Chris Mingard, Guillermo Valle-Pérez, Joar Skalse and Ard A. Louis, "Is SGD a Bayesian sampler? Well, almost", arXiv:2006.15191.