Grokking: Generalization beyond Overfitting on small algorithmic datasets (Paper Explained)

#grokking #openai #deeplearning

Grokking is a phenomenon in which a neural network abruptly learns the pattern underlying a dataset, jumping from chance-level to perfect generalization long after it has already overfit the training data. This paper demonstrates grokking on small algorithmic datasets where a network has to fill in the table of a binary operation. Interestingly, the learned latent spaces reveal the structure of the underlying binary operations that generated the data.
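As a rough sketch of what such a dataset looks like (modular addition mod a prime is one of the operations in the paper; the modulus, the 50% split, and the tuple encoding below are just illustrative assumptions, not the paper's exact pipeline):

```python
import itertools
import random

p = 97  # modulus for the "(a + b) mod p" table; the paper studies several such binary operations

# Every cell of the operation table is one example: the pair (a, b) maps to (a + b) mod p.
examples = [((a, b), (a + b) % p) for a, b in itertools.product(range(p), repeat=2)]

# Train on a random fraction of the table and ask the network to fill in the held-out cells.
random.seed(0)
random.shuffle(examples)
split = int(0.5 * len(examples))
train_set, val_set = examples[:split], examples[split:]
```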

OUTLINE:
0:00 - Intro & Overview
1:40 - The Grokking Phenomenon
3:50 - Related: Double Descent
7:50 - Binary Operations Datasets
11:45 - What quantities influence grokking?
15:40 - Learned Emerging Structure
17:35 - The role of smoothness
21:30 - Simple explanations win
24:30 - Why does weight decay encourage simplicity?
26:40 - Appendix
28:55 - Conclusion & Comments

Abstract:
In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great detail. In some situations we show that neural networks learn through a process of “grokking” a pattern in the data, improving generalization performance from random chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting. We also study generalization as a function of dataset size and find that smaller datasets require increasing amounts of optimization for generalization. We argue that these datasets provide a fertile ground for studying a poorly understood aspect of deep learning: generalization of overparametrized neural networks beyond memorization of the finite training dataset.

Authors: Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin & Vedant Misra

Links:

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Comments

The verb 'to grok' has been in niche use in computer science since the 1960s, meaning to fully understand something (as opposed to, for example, just knowing how to apply the rules). It was coined in Robert A. Heinlein's classic sci-fi novel "Stranger in a Strange Land".

ruemeese

WAIT, WHAT?

In my Master's thesis, I did something similar and got the same result: a quick overfit and a subsequent snap into place. I never thought this would be interesting, as I was not an ML engineer but a mathematician looking for new ways to model structures.

That was in 2019. Could have been famous! Ah well...

martinsmolik

Thanks for accepting my suggestion to review this! I think this paper can land anywhere between "Wow, what strange behaviour" and "Is more compute all we need?"
If the findings of this paper are reproduced in a lot of other practical scenarios and we find ways to speed up grokking, it might change the entire way we think about neural networks: we just need a model that's big enough to express some solution and then keep training it to find simpler ones, leading to an algorithmic Occam's razor!

mgostIH

So our low-dimensional intuitions don't work in extremely high-dimensional spaces. For example, look at the sphere-bounding problem: if you pack four unit circles into the corners of a square in 2D, the circle you can draw in the gap between them is small, which matches common sense. But in 10-dimensional space, the sphere you place in the middle of the corner hyperspheres actually reaches outside the box that encloses them. I imagine something analogous is happening here: if you stood on a mountaintop and could only feel around with your feet, you of course wouldn't reach other mountaintops by stepping in the direction of higher incline, but maybe in higher dimensions these mountaintops are actually closer together than we think. In other words, a local minimum of the loss function that we associate with overfitting may actually not be so far from a good generalizing solution, or global minimum.
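(A quick numeric check of that sphere example, using the standard construction: 2^d unit spheres centered at the corners (±1, ..., ±1) all fit inside the box [-2, 2]^d, and the central sphere tangent to them has radius √d − 1.)

```python
import math

# 2^d unit spheres sit at the corners (+/-1, ..., +/-1), all inside the box [-2, 2]^d.
# The central sphere tangent to every corner sphere has radius sqrt(d) - 1.
for d in (2, 3, 4, 9, 10):
    r_inner = math.sqrt(d) - 1
    print(f"d={d:2d}  inner radius = {r_inner:.3f}  reaches outside the box: {r_inner > 2}")
```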

Bellenchia

Loved the context you brought into this about double descent! And, as always, really great explanations in general!

Phenix

I don’t see what needs explaining here.
The loss function is something like L = MSE(w, x, y) + λ·‖w‖².
The optimization tries to make the prediction error as small as possible and also to make the weights as small as possible.
A simple rule requires far less representation than memorizing every data point, so more weights can be 0.
The rule-based solution is exactly what we are optimizing for; it just takes a long time to find because the dominant term in the loss is the MSE. That is why training first moves toward minimizing the error by memorizing the training data and only then slowly moves toward minimizing the weights. The only way to minimize both terms is to learn the rule. How is this different from simple regularization?
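(Spelling that loss out as a minimal sketch, with λ written as lam for the decay coefficient; this is the generic MSE-plus-weight-decay picture, not the paper's actual transformer/optimizer setup:)

```python
import numpy as np

def loss(w, X, y, lam=1e-2):
    """Prediction error plus an L2 (weight decay) penalty."""
    err = np.mean((X @ w - y) ** 2)  # data-fit term: dominates early, driving memorization
    reg = lam * np.sum(w ** 2)       # simplicity pressure: the only term left once err is ~0
    return err + reg

def grad(w, X, y, lam=1e-2):
    # Gradient of both terms; plain gradient descent would update w -= lr * grad(w, X, y).
    return 2 * X.T @ (X @ w - y) / len(y) + 2 * lam * w
```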

As for why the generalization happens "suddenly": it is just a result of how gradient descent (here with the Adam optimizer) works. Once it has found a gradient direction toward the global minimum, it picks up speed and momentum moving toward that global minimum, but finding a path to it can take a long time, possibly forever if the optimizer gets stuck in a local minimum without any way out. Training 1000x longer to achieve generalization feels like a bad/inefficient approach to me.

beaconofwierd

I actually expected this kind of behavior based on the principle of regularization. With regularization (weight decay is L2 regularization), there is always a tension between the prediction-error term and the regularization term in the loss. At some point in a long training run, the regularization term starts to dominate and the network, given enough time and noise, finds a different solution with a much lower regularization loss. The same is true for increasing the number of parameters, since that also increases the regularization loss: at some point, adding more parameters makes the regularization loss dominate the error loss, and the network latches onto another solution with a lower regularization loss and a lower overall loss. In other words, regularization provides the pressure that, given enough time and noise in the system, produces a solution that uses fewer effective parameters. I am assuming that fewer effective parameters lead to better generalization.

nauy

Thanks Yannic! Started reading this paper but forgot about it, very handy to have it here.

CristianGarcia

I saw a similar idea in the 2017 paper "Train longer, generalize better: closing the generalization gap in large batch training of neural networks" by Elad Hoffer, Itay Hubara and Daniel Soudry. Since then, I always try to run the network longer than the overfitting point, and in many cases it works.

drhilm

OMG, I'm not crazy! This happened to me when I forgot to take down a model for a month.

Yaketycast

This is a fascinating phenomenon, thanks for sharing and walking through the paper.

AdobadoFantastico

Thank you! Very cool! I have observed the grokking phenomenon and was really puzzled that I didn't see anybody talking about it beyond "well, you can try training your neural network longer and see what happens".

FromBeaverton

Great walkthrough of the paper, Yannic! You covered some details that weren't clear to me initially!

bingochipspass

It would have been interesting if they had shown the evolution of the sum of the squared weights during training.
Maybe the sum of squared weights gets progressively smaller, and this helps the model move from the memorized solution toward the "optimal" solution to the problem. It could also be that the L2 penalty progressively "destroys" the memorized representation, adding noise until the model finds this optimal solution.
Very interesting paper and video anyway!
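(For anyone who wants to try it, a minimal way to log that quantity in PyTorch, assuming a standard nn.Module model and an existing training loop:)

```python
import torch

def weight_norm_sq(model: torch.nn.Module) -> float:
    """Sum of squared weights, the quantity suggested above; log it e.g. once per epoch."""
    return sum(float(p.detach().pow(2).sum()) for p in model.parameters())

# inside the training loop:
#     history.append(weight_norm_sq(model))
```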

Chocapic_

Yannic, thanks for bringing this to us. I agree with your heuristic on the local minimum vs. global minimum. To me, the phenomenon in this paper is related to the "oracle property" of the Lasso in the traditional statistical learning literature: for a suitably penalized algorithm (Lasso, or the equivalent weight decay), the estimator is almost sure to find exactly the correct non-zero parameters (i.e., the global minimum) as well as their correct coefficients. It would be interesting if someone studied this further.
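(A toy illustration of that sparse-recovery intuition with scikit-learn's Lasso; the data, noise level and alpha below are arbitrary choices, not anything from the paper:)

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.normal(size=(n, d))
true_coef = np.zeros(d)
true_coef[:3] = [2.0, -1.5, 1.0]              # only three truly non-zero parameters
y = X @ true_coef + 0.1 * rng.normal(size=n)

model = Lasso(alpha=0.05).fit(X, y)
print(np.flatnonzero(model.coef_))            # ideally recovers exactly the indices 0, 1, 2
```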

SeagleLiu

Very cool phenomenon. Thanks for highlighting this paper.

piratepartyftw

Thank you for the wonderful review of this interesting paper.

guruprasadsomasundaram

Just wanted to say thank you, I learned a lot.

AliMoeeny

Thanks for the amazing video. I have seen this while working with piecewise linear functions; somehow leaky ReLU works much better than ReLU or other activation functions. Wouldn't have known this paper exists if you hadn't posted it.

jasdeepsinghgrover