NFNets: High-Performance Large-Scale Image Recognition Without Normalization (ML Paper Explained)

#nfnets #deepmind #machinelearning

Batch Normalization is a core component of modern deep learning. It enables training at higher batch sizes, prevents mean shift, provides implicit regularization, and allows networks to reach higher performance than they otherwise would. However, BatchNorm also has disadvantages, such as its dependence on batch size and its computational overhead, especially in distributed settings. Normalizer-Free Networks, developed at DeepMind, are a class of CNNs that achieve state-of-the-art classification accuracy on ImageNet without batch normalization. This is achieved by using adaptive gradient clipping (AGC), combined with a number of improvements to the general network architecture. The resulting networks train faster, are more accurate, and provide better transfer learning performance. Code is provided in JAX.
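The core AGC rule is compact enough to sketch. Below is a minimal JAX version, assuming a plain (out, in) weight layout; the function names and the unit-wise axis handling are simplifications here, not the released deepmind-research implementation.

import jax
import jax.numpy as jnp

def unitwise_norm(x):
    # Norm per output unit: a row norm for a 2D (out, in) weight matrix,
    # the full norm for biases and scalars. Conv layouts would need their
    # own axis handling; this sketch keeps it simple.
    if x.ndim <= 1:
        return jnp.linalg.norm(x)
    return jnp.sqrt(jnp.sum(x * x, axis=tuple(range(1, x.ndim)), keepdims=True))

def adaptive_clip(grad, param, clipping=0.01, eps=1e-3):
    g_norm = unitwise_norm(grad)
    p_norm = jnp.maximum(unitwise_norm(param), eps)  # eps keeps zero-init weights trainable
    max_norm = clipping * p_norm
    # Rescale only those units whose gradient-to-weight norm ratio exceeds the threshold.
    scale = jnp.where(g_norm > max_norm, max_norm / jnp.maximum(g_norm, 1e-6), 1.0)
    return grad * scale

def agc(grads, params, clipping=0.01):
    # Apply the rule leaf by leaf over matching gradient/parameter pytrees.
    return jax.tree_util.tree_map(lambda g, p: adaptive_clip(g, p, clipping), grads, params)

In a training step this would run on the gradients right before the optimizer update; the clipping threshold is a hyperparameter, and the paper finds that smaller values are needed at larger batch sizes.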

OUTLINE:
0:00 - Intro & Overview
2:40 - What's the problem with BatchNorm?
11:00 - Paper Contribution Overview
13:30 - Beneficial properties of BatchNorm
15:30 - Previous work: NF-ResNets
18:15 - Adaptive Gradient Clipping
21:40 - AGC and large batch size
23:30 - AGC induces implicit dependence between training samples
28:30 - Are BatchNorm's problems solved?
30:00 - Network architecture improvements
31:10 - Comparison to EfficientNet
33:00 - Conclusion & Comments

ERRATA (from Lucas Beyer): "I believe you missed the main concern with 'batch cheating'. It's for losses that act on the full batch, as opposed to on each sample individually.
For example, triplet in FaceNet or n-pairs in CLIP. BN allows for 'shortcut' solution to loss. See also BatchReNorm paper."
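To make this cross-sample dependence concrete, here is a tiny numeric sketch (a toy setup with the identity as the 'embedding network', not anything from the paper): with batch statistics in the forward pass, a sample's normalized output changes when a different sample in the batch changes, and that is exactly the channel a batch-level loss such as triplet or n-pairs can exploit.

import jax.numpy as jnp

def batch_norm(x, eps=1e-5):
    # Normalize each feature using statistics computed across the batch.
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / jnp.sqrt(var + eps)

batch_a = jnp.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
batch_b = batch_a.at[1].set(jnp.array([30.0, 40.0]))  # change only sample 1

print(batch_norm(batch_a)[0])  # sample 0's normalized features...
print(batch_norm(batch_b)[0])  # ...come out different, although sample 0 itself is unchanged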

Abstract:
Batch normalization is a key component of most image classification models, but it has many undesirable properties stemming from its dependence on the batch size and interactions between examples. Although recent work has succeeded in training deep ResNets without normalization layers, these models do not match the test accuracies of the best batch-normalized networks, and are often unstable for large learning rates or strong data augmentations. In this work, we develop an adaptive gradient clipping technique which overcomes these instabilities, and design a significantly improved class of Normalizer-Free ResNets. Our smaller models match the test accuracy of an EfficientNet-B7 on ImageNet while being up to 8.7x faster to train, and our largest models attain a new state-of-the-art top-1 accuracy of 86.5%. In addition, Normalizer-Free models attain significantly better performance than their batch-normalized counterparts when finetuning on ImageNet after large-scale pre-training on a dataset of 300 million labeled images, with our best models obtaining an accuracy of 89.2%. Our code is available at https://github.com/deepmind/deepmind-research/tree/master/nfnets

Authors: Andrew Brock, Soham De, Samuel L. Smith, Karen Simonyan

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
Comments

ERRATA (from Lucas Beyer): "I believe you missed the main concern with 'batch cheating'. It's for losses that act on the full batch, as opposed to on each sample individually.
For example, triplet in FaceNet or n-pairs in CLIP. BN allows for 'shortcut' solution to loss. See also BatchReNorm paper."

YannicKilcher

I applaud the inclusion of a negative results appendix, and hopefully in the future it will become a standard or even required section (by conference/journal).

GeekProdigyGuy

I liked that you compared this to Speed-running.

billykotsos

DeepMind: new SOTA model
Google: you must have used tensorflow right?
DeepMind: tensor what?

LouisChiaki

My take on why they did the architecture search on top of AGC is that they wanted to be fair to architectures optimised for BatchNorm. BatchNorm came out in 2015, and since then all the architectures, whether designed by grad students or found by NAS, have had BN as one of their core ingredients. This is especially true for the EfficientNet they are comparing against.
Now they have designed the AGC trick, which fixes the problems of BN through a different mechanism; just plugging it in place of the BN layer in architectures optimised with BN wouldn't be very fair to it, right? So they basically developed a BN replacement and squashed a few years of NAS research into one, and did it with a more realistic metric (actual training speed vs. theoretical FLOPs), which I think is pretty good.

PhucLe-qsnx

Dude is telling us how to beat a SOTA paper on YouTube, EPIC; we would love to cite the channel.

mrigankanath

"Don't come at me math people" - Yannic 2021

channel-sudi

The BEST deep learning channel on YouTube. Congrats!

CarlosGarcia-hsyg

Love your work, as always.
You sometimes make me think about things more deeply than I originally did, or bring a new perspective to certain concepts.

tho

Thanks for these videos man, you help a lot with your explanations and your way of navigating the paper!

rufus

I would go as far as to say: batch normalisation exposes the model's inability to accommodate the complexity coming from the variance of the data. It currently compensates for the fact that research has not yet figured out why Batch Normalization is necessary at all for a given model and training data.
Of course, the same argument can be made against the data being unfit for the network. But here we have no choice but to accept the data as is.

The same goes for preprocessing, like transforming audio time signals to the frequency domain first.

fak-yo-to

It isn't clipping, it's rescaling (it's dividing by magnitude/lambda). If you have a batch of 4k with one bad sample and you rescale, you're going to down-weight all the other items in the batch along with the problematic sample. Consider the limit of all data in one batch with a bad sample (correctly detected): all you would do is repeatedly take small steps in the bad direction.

andytroo
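A toy numeric illustration of the concern above (made-up numbers, not from the paper or the video): rescaling the averaged gradient because one sample inflates its norm shrinks every sample's contribution, whereas clipping per sample before averaging only tames the bad one.

import jax.numpy as jnp

good = jnp.tile(jnp.array([[1.0, 0.0]]), (4095, 1))  # 4095 well-behaved per-sample gradients
bad = jnp.array([[0.0, 1e6]])                         # one pathological sample
batch = jnp.concatenate([good, bad])                  # shape (4096, 2)

avg = batch.mean(axis=0)                              # ~[1.0, 244]: the bad direction dominates
rescaled = avg / jnp.linalg.norm(avg)                 # global rescale: a small step, still mostly "bad"

norms = jnp.linalg.norm(batch, axis=1, keepdims=True)
clipped = (batch * jnp.minimum(1.0, 1.0 / jnp.maximum(norms, 1e-12))).mean(axis=0)

print(rescaled)  # ~[0.004, 1.000] -> points along the bad sample
print(clipped)   # ~[1.000, 0.000] -> the good signal survives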

And you still see some people telling you to go for TensorFlow because, well, you know, PyTorch is just for hardcore research. Machine learning is still in its building phase, and much of the knowledge and many of the skills considered solid today will be dismissed tomorrow. It's a time for research and improvement, not certainties.

rickrunner

I feel like every second day there is a new SOTA in everything.

MsFearco

You really are a fantastic teacher. Thank you, again!

dr.mikeybee

Could you do a video on transformers for time series data?
Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting

or

Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting

navidhakimi

Great for undergraduates, thank you!

benlee

The sum is finite, so you can swap it round. You’re good. -Math person

opx-tech
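For reference, assuming the comment refers to exchanging the gradient (or an inner sum) with the finite sum over the batch, the identity being used is simply

\nabla_\theta \left( \frac{1}{N} \sum_{i=1}^{N} L_i(\theta) \right) = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta L_i(\theta),

which always holds for a finite sum of differentiable per-sample losses.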

Thank you! Great explanation. Keep up the good work :)

sourabmangrulkar

Excellent walk through! Great points on the gradient clipping, it should definitely be done before averaging. I wonder if replacing the clipping by the median gradient would have a similar effect...

TheEbbemonster
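Sketching the two aggregation rules touched on above in JAX (the toy model, the loss, and the coordinate-wise "median of per-example gradients" reading of the comment are assumptions, not something from the paper):

import jax
import jax.numpy as jnp

def loss_fn(w, x, y):
    # Squared error of a linear model on a single example.
    return jnp.mean((x @ w - y) ** 2)

def clip_by_norm(g, max_norm=1.0):
    # Rescale a gradient pytree so its global norm is at most max_norm.
    n = jnp.sqrt(sum(jnp.vdot(p, p) for p in jax.tree_util.tree_leaves(g)))
    scale = jnp.minimum(1.0, max_norm / jnp.maximum(n, 1e-12))
    return jax.tree_util.tree_map(lambda p: p * scale, g)

# Per-example gradients: vmap over the batch dimension of x and y.
per_example_grads = jax.vmap(jax.grad(loss_fn), in_axes=(None, 0, 0))

def clipped_then_averaged(w, xs, ys, max_norm=1.0):
    grads = per_example_grads(w, xs, ys)                       # leading axis = batch
    grads = jax.vmap(lambda g: clip_by_norm(g, max_norm))(grads)
    return jax.tree_util.tree_map(lambda g: g.mean(axis=0), grads)

def median_aggregated(w, xs, ys):
    grads = per_example_grads(w, xs, ys)
    return jax.tree_util.tree_map(lambda g: jnp.median(g, axis=0), grads)

w = jnp.zeros(3)
xs, ys = jnp.ones((8, 3)), jnp.arange(8.0)
print(clipped_then_averaged(w, xs, ys))
print(median_aggregated(w, xs, ys))

Clipping each per-example gradient before averaging bounds how much any single sample can move the update, which is the property the comment is after; a coordinate-wise median goes further and ignores outliers entirely.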