NFNets: High-Performance Large-Scale Image Recognition Without Normalization (ML Paper Explained)

#nfnets #deepmind #machinelearning

Batch Normalization is a core component of modern deep learning. It enables training at higher batch sizes, prevents mean shift, provides implicit regularization, and allows networks to reach higher performance than they otherwise would. However, BatchNorm also has disadvantages, such as its dependence on batch size and its computational overhead, especially in distributed settings. Normalizer-Free Networks, developed at DeepMind, are a class of CNNs that achieve state-of-the-art classification accuracy on ImageNet without batch normalization. This is achieved by using adaptive gradient clipping (AGC), combined with a number of improvements to the general network architecture. The resulting networks train faster, are more accurate, and provide better transfer learning performance. Code is provided in JAX.
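The core AGC rule is compact enough to sketch. Below is a minimal JAX version, assuming a plain (out, in) weight layout; the function names and the unit-wise axis handling are simplifications here, not the released deepmind-research implementation.

import jax
import jax.numpy as jnp

def unitwise_norm(x):
    # Norm per output unit: a row norm for a 2D (out, in) weight matrix,
    # the full norm for biases and scalars. Conv layouts would need their
    # own axis handling; this sketch keeps it simple.
    if x.ndim <= 1:
        return jnp.linalg.norm(x)
    return jnp.sqrt(jnp.sum(x * x, axis=tuple(range(1, x.ndim)), keepdims=True))

def adaptive_clip(grad, param, clipping=0.01, eps=1e-3):
    g_norm = unitwise_norm(grad)
    p_norm = jnp.maximum(unitwise_norm(param), eps)  # eps keeps zero-init weights trainable
    max_norm = clipping * p_norm
    # Rescale only those units whose gradient-to-weight norm ratio exceeds the threshold.
    scale = jnp.where(g_norm > max_norm, max_norm / jnp.maximum(g_norm, 1e-6), 1.0)
    return grad * scale

def agc(grads, params, clipping=0.01):
    # Apply the rule leaf by leaf over matching gradient/parameter pytrees.
    return jax.tree_util.tree_map(lambda g, p: adaptive_clip(g, p, clipping), grads, params)

In a training step this would run on the gradients right before the optimizer update; the clipping threshold is a hyperparameter, and the paper finds that smaller values are needed at larger batch sizes.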

OUTLINE:
0:00 - Intro & Overview
2:40 - What's the problem with BatchNorm?
11:00 - Paper Contribution Overview
13:30 - Beneficial properties of BatchNorm
15:30 - Previous work: NF-ResNets
18:15 - Adaptive Gradient Clipping
21:40 - AGC and large batch size
23:30 - AGC induces implicit dependence between training samples
28:30 - Are BatchNorm's problems solved?
30:00 - Network architecture improvements
31:10 - Comparison to EfficientNet
33:00 - Conclusion & Comments

ERRATA (from Lucas Beyer): "I believe you missed the main concern with 'batch cheating'. It's for losses that act on the full batch, as opposed to on each sample individually.
For example, triplet in FaceNet or n-pairs in CLIP. BN allows for 'shortcut' solution to loss. See also BatchReNorm paper."
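To make this cross-sample dependence concrete, here is a tiny numeric sketch (a toy setup with the identity as the 'embedding network', not anything from the paper): with batch statistics in the forward pass, a sample's normalized output changes when a different sample in the batch changes, and that is exactly the channel a batch-level loss such as triplet or n-pairs can exploit.

import jax.numpy as jnp

def batch_norm(x, eps=1e-5):
    # Normalize each feature using statistics computed across the batch.
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / jnp.sqrt(var + eps)

batch_a = jnp.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
batch_b = batch_a.at[1].set(jnp.array([30.0, 40.0]))  # change only sample 1

print(batch_norm(batch_a)[0])  # sample 0's normalized features...
print(batch_norm(batch_b)[0])  # ...come out different, although sample 0 itself is unchanged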

Abstract:
Batch normalization is a key component of most image classification models, but it has many undesirable properties stemming from its dependence on the batch size and interactions between examples. Although recent work has succeeded in training deep ResNets without normalization layers, these models do not match the test accuracies of the best batch-normalized networks, and are often unstable for large learning rates or strong data augmentations. In this work, we develop an adaptive gradient clipping technique which overcomes these instabilities, and design a significantly improved class of Normalizer-Free ResNets. Our smaller models match the test accuracy of an EfficientNet-B7 on ImageNet while being up to 8.7x faster to train, and our largest models attain a new state-of-the-art top-1 accuracy of 86.5%. In addition, Normalizer-Free models attain significantly better performance than their batch-normalized counterparts when finetuning on ImageNet after large-scale pre-training on a dataset of 300 million labeled images, with our best models obtaining an accuracy of 89.2%. Our code is available at https://github.com/deepmind/deepmind-research/tree/master/nfnets

Authors: Andrew Brock, Soham De, Samuel L. Smith, Karen Simonyan

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
Comments

ERRATA (from Lucas Beyer): "I believe you missed the main concern with 'batch cheating'. It's for losses that act on the full batch, as opposed to on each sample individually.
For example, triplet in FaceNet or n-pairs in CLIP. BN allows for 'shortcut' solution to loss. See also BatchReNorm paper."

YannicKilcher

I applaud the inclusion of a negative results appendix, and hopefully in the future it will become a standard or even required section (by conference/journal).

GeekProdigyGuy

I liked that you compared this to Speed-running.

billykotsos

DeepMind: new SOTA model
Google: you must have used tensorflow right?
DeepMind: tensor what?

LouisChiaki

My take on why they did the architecture search on top of AGC is that they wanted to be fair to architectures optimised for BatchNorm. BatchNorm came out in 2015, and since then all the architectures, whether designed by grad students or found by NAS, have had BN as one of their core ingredients. This is especially true for the EfficientNet they are comparing against.
Now they have designed the AGC trick, which fixes the problems of BN through a different mechanism; just plugging it in place of the BN layer in architectures optimised with BN wouldn't be very fair to it, right? So they basically developed a BN replacement and squashed a few years of NAS research into one, and did it with a more realistic metric (actual training speed vs. theoretical FLOPs), which I think is pretty good.

PhucLe-qsnx

Dude is telling us how to beat a SOTA paper on YouTube, EPIC; we would love to cite the channel.

mrigankanath

"Don't come at me math people" - Yannic 2021

channel-sudi

The BEST deep learning channel on YouTube. Congrats!

CarlosGarcia-hsyg

Love your work, as always.
You sometimes make me think about things more deeply than I originally did, or bring a new perspective to certain concepts.

tho

Thanks for these videos man, you help a lot with your explanations and your way of navigating the paper!

rufus

I would go as far as to say: batch normalisation exposes the model's inability to accommodate the complexity coming from the variance of the data. It currently compensates for the fact that research has not yet figured out why Batch Normalization is necessary at all for a given model and training data.
Of course, the same argument can be made against the data being unfit for the network. But here we have no choice but to accept the data as is.

The same goes for preprocessing, like transforming audio time signals to the frequency domain first.

fak-yo-to

It isn't clipping, it's rescaling (it's dividing by magnitude/lambda). If you have a batch of 4k with one bad sample and you rescale, you're going to down-weight all the other items in the batch along with the problematic sample. Consider the limit of all data in one batch with a bad sample (correctly detected): all you would do is repeatedly take small steps in the bad direction.

andytroo
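A toy numeric illustration of the concern above (made-up numbers, not from the paper or the video): rescaling the averaged gradient because one sample inflates its norm shrinks every sample's contribution, whereas clipping per sample before averaging only tames the bad one.

import jax.numpy as jnp

good = jnp.tile(jnp.array([[1.0, 0.0]]), (4095, 1))  # 4095 well-behaved per-sample gradients
bad = jnp.array([[0.0, 1e6]])                         # one pathological sample
batch = jnp.concatenate([good, bad])                  # shape (4096, 2)

avg = batch.mean(axis=0)                              # ~[1.0, 244]: the bad direction dominates
rescaled = avg / jnp.linalg.norm(avg)                 # global rescale: a small step, still mostly "bad"

norms = jnp.linalg.norm(batch, axis=1, keepdims=True)
clipped = (batch * jnp.minimum(1.0, 1.0 / jnp.maximum(norms, 1e-12))).mean(axis=0)

print(rescaled)  # ~[0.004, 1.000] -> points along the bad sample
print(clipped)   # ~[1.000, 0.000] -> the good signal survives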

And you still see some people telling you to go for TensorFlow because, well, you know, PyTorch is just for hardcore research. Machine learning is still in its building phase, and much of the knowledge and many of the skills considered solid today will be dismissed tomorrow. It's a time for research and improvement, not certainties.

rickrunner

I feel like every second day there is a new SOTA in everything.

MsFearco

You really are a fantastic teacher. Thank you, again!

dr.mikeybee

Could you do a video on transformers for time series data?
Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting

or

Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting

navidhakimi

Great for undergraduates, thank you!

benlee

The sum is finite, so you can swap it round. You’re good. -Math person

opx-tech
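For reference, assuming the comment refers to exchanging the gradient (or an inner sum) with the finite sum over the batch, the identity being used is simply

\nabla_\theta \left( \frac{1}{N} \sum_{i=1}^{N} L_i(\theta) \right) = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta L_i(\theta),

which always holds for a finite sum of differentiable per-sample losses.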

Thank you! Great explanation. Keep up the good work :)

sourabmangrulkar

Excellent walk through! Great points on the gradient clipping, it should definitely be done before averaging. I wonder if replacing the clipping by the median gradient would have a similar effect...

TheEbbemonster
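Sketching the two aggregation rules touched on above in JAX (the toy model, the loss, and the coordinate-wise "median of per-example gradients" reading of the comment are assumptions, not something from the paper):

import jax
import jax.numpy as jnp

def loss_fn(w, x, y):
    # Squared error of a linear model on a single example.
    return jnp.mean((x @ w - y) ** 2)

def clip_by_norm(g, max_norm=1.0):
    # Rescale a gradient pytree so its global norm is at most max_norm.
    n = jnp.sqrt(sum(jnp.vdot(p, p) for p in jax.tree_util.tree_leaves(g)))
    scale = jnp.minimum(1.0, max_norm / jnp.maximum(n, 1e-12))
    return jax.tree_util.tree_map(lambda p: p * scale, g)

# Per-example gradients: vmap over the batch dimension of x and y.
per_example_grads = jax.vmap(jax.grad(loss_fn), in_axes=(None, 0, 0))

def clipped_then_averaged(w, xs, ys, max_norm=1.0):
    grads = per_example_grads(w, xs, ys)                       # leading axis = batch
    grads = jax.vmap(lambda g: clip_by_norm(g, max_norm))(grads)
    return jax.tree_util.tree_map(lambda g: g.mean(axis=0), grads)

def median_aggregated(w, xs, ys):
    grads = per_example_grads(w, xs, ys)
    return jax.tree_util.tree_map(lambda g: jnp.median(g, axis=0), grads)

w = jnp.zeros(3)
xs, ys = jnp.ones((8, 3)), jnp.arange(8.0)
print(clipped_then_averaged(w, xs, ys))
print(median_aggregated(w, xs, ys))

Clipping each per-example gradient before averaging bounds how much any single sample can move the update, which is the property the comment is after; a coordinate-wise median goes further and ignores outliers entirely.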