Group Normalization (Paper Explained)

The dirty little secret of Batch Normalization is its intrinsic dependence on the training batch size. Group Normalization attempts to achieve the benefits of normalization without batch statistics and, most importantly, without sacrificing performance compared to Batch Normalization.

Abstract:
Batch Normalization (BN) is a milestone technique in the development of deep learning, enabling various networks to train. However, normalizing along the batch dimension introduces problems --- BN's error increases rapidly when the batch size becomes smaller, caused by inaccurate batch statistics estimation. This limits BN's usage for training larger models and transferring features to computer vision tasks including detection, segmentation, and video, which require small batches constrained by memory consumption. In this paper, we present Group Normalization (GN) as a simple alternative to BN. GN divides the channels into groups and computes within each group the mean and variance for normalization. GN's computation is independent of batch sizes, and its accuracy is stable in a wide range of batch sizes. On ResNet-50 trained in ImageNet, GN has 10.6% lower error than its BN counterpart when using a batch size of 2; when using typical batch sizes, GN is comparably good with BN and outperforms other normalization variants. Moreover, GN can be naturally transferred from pre-training to fine-tuning. GN can outperform its BN-based counterparts for object detection and segmentation in COCO, and for video classification in Kinetics, showing that GN can effectively replace the powerful BN in a variety of tasks. GN can be easily implemented by a few lines of code in modern libraries.
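The abstract's claim that GN "can be easily implemented by a few lines of code" is easy to make concrete. Below is a minimal PyTorch-style sketch of the computation described above (per-sample statistics over each group of C/G channels), not the paper's own reference code; the learnable per-channel scale and shift of the full method are left out.

```python
import torch

def group_norm(x, num_groups, eps=1e-5):
    """Normalize (N, C, H, W) activations with per-sample, per-group statistics."""
    N, C, H, W = x.shape
    xg = x.view(N, num_groups, C // num_groups, H, W)
    # Mean and variance are computed over each group's channels and all spatial
    # positions, independently for every sample, so no batch statistics are involved.
    mean = xg.mean(dim=(2, 3, 4), keepdim=True)
    var = xg.var(dim=(2, 3, 4), unbiased=False, keepdim=True)
    xg = (xg - mean) / torch.sqrt(var + eps)
    return xg.view(N, C, H, W)

# The built-in torch.nn.GroupNorm(num_groups, num_channels) adds the learnable
# per-channel affine parameters on top of this.
```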

Authors: Yuxin Wu, Kaiming He

Links:
Comments

Thanks for the video!

Note: normalizing *isn't* making the data more Gaussian, it's just transforming it to have a mean of 0 and an SD of 1. Gaussian data is often normalized and represented this way too, but the normalization doesn't make your data any more Gaussian. Normalization does not change the inherent distributional shape of the data, just the mean and SD. For example, if your data was right-tailed in one dimension, it would remain right-tailed (and non-Gaussian-looking); it would just have a mean and SD of 0 and 1, respectively.
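A quick way to see this comment's point: standardizing a skewed sample fixes its mean and standard deviation but leaves its shape (here, its skewness) untouched. A small NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(size=100_000)            # strongly right-tailed data

z = (x - x.mean()) / x.std()                 # "normalize" to mean 0, SD 1

skew = lambda a: np.mean(((a - a.mean()) / a.std()) ** 3)
print(z.mean(), z.std())                     # ~0 and ~1
print(skew(x), skew(z))                      # both ~2: shape unchanged, still non-Gaussian
```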

hillosand

Nice explanation of BN at the beginning! Glad you kept it simple and did not use mythical "internal covariate shift" terminology. ;-)

bluelng

Thanks for taking the time to walk us through this so slowly. Much appreciated.

mkamp

Thank gosh you mentioned the other way of thinking about batch norm @ 13:00. I thought I'd misunderstood batch norm the whole time. Like always, top notch content :)

rbain

The visualization/explanation of batch norm was really helpful to understand how it works in a CNN! Thanks :)

bimds

Thanks man. Perfect illustration to understand the difference between batch norm and layer norm.

vandanaschannel

Definitely one of the best NN explanation videos I've seen.

IBMua

How does GroupNorm compare with the modern version of BatchNorm that keeps running, momentum-averaged statistics? It sounds like those running averages should fix the problem of small batch sizes on their own.
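For what it's worth (a hedged observation about the standard formulation, not something addressed in the video): in frameworks like PyTorch the running, momentum-averaged statistics are only used at inference time, while training still normalizes with the current mini-batch's mean and variance, so a tiny batch still gives noisy training-time statistics. A small sketch of that behaviour:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(8, momentum=0.1)
x = torch.randn(2, 8, 4, 4)        # a small batch of 2

bn.train()
y_train = bn(x)                    # normalized with this batch's own (noisy) statistics;
                                   # the running averages are only *updated* here

bn.eval()
y_eval = bn(x)                     # only now are the running averages actually used

print(torch.allclose(y_train, y_eval))   # False: train and eval see different statistics
```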

makgaiduk

Well, I know what I’m adding to my model later. Thank you for the clear explanation.

johnkilbride

Speaking of normalization, I was wondering about the intuition behind LayerNorm in Transformer models. Usually it is applied after the concatenation and projection of the multi-head self-attention output, but wouldn't it make sense to apply it to each head separately to get more fine-grained normalization statistics?
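To make the two options in this comment concrete (a sketch under simplifying assumptions, ignoring residual connections and pre- vs. post-norm placement): the usual setup applies one LayerNorm over the full model dimension after the heads are concatenated and projected, while the hypothetical per-head variant would normalize each head's slice on its own.

```python
import torch
import torch.nn as nn

batch, seq, d_model, n_heads = 4, 10, 512, 8
d_head = d_model // n_heads

heads = torch.randn(batch, seq, n_heads, d_head)   # per-head self-attention outputs
out_proj = nn.Linear(d_model, d_model)

# Usual placement: concatenate the heads, project, then one LayerNorm over d_model.
ln_full = nn.LayerNorm(d_model)
y_standard = ln_full(out_proj(heads.reshape(batch, seq, d_model)))

# Hypothetical per-head variant from the comment: normalize each head's d_head
# features separately before concatenation and projection.
ln_head = nn.LayerNorm(d_head)
y_per_head = out_proj(ln_head(heads).reshape(batch, seq, d_model))
```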

DerUltraGamerr

Thanks for the videos. You do a great job of going over the details of papers and summarizing the key points.

MiroslawHorbal

What I don't quite get here, looking at the paper and the PyTorch implementation, is that the batch axis remains unused. As far as making the point that you can get enough samples to compute meaningful statistics without aggregating over the batch at all, it's an interesting experiment. But it does show a trade-off: if you have a batch that's bigger than 1, wouldn't you at least want the option to compute your statistics over the batch as well? Intuitively, that seems like it would bring you closer to the optimal big-batch-statistics behavior. Am I missing something here?
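The distinction this comment draws can be written down directly: GN reduces over (channels in the group, H, W) per sample, while the suggested variant would also reduce over the batch axis, which reintroduces the batch-size dependence GN is trying to avoid. A sketch; the over_batch flag is purely illustrative and is not part of the paper or of torch.nn.GroupNorm.

```python
import torch

def group_stats(x, num_groups, over_batch=False):
    """Mean/variance used for group normalization of (N, C, H, W) activations."""
    N, C, H, W = x.shape
    xg = x.view(N, num_groups, C // num_groups, H, W)
    # GN as in the paper: statistics per sample and per group.
    dims = (2, 3, 4)
    if over_batch:
        # Hypothetical variant from the comment: also pool over the batch axis,
        # which brings back a dependence on batch size.
        dims = (0, 2, 3, 4)
    return (xg.mean(dim=dims, keepdim=True),
            xg.var(dim=dims, unbiased=False, keepdim=True))
```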

eelcohoogendoorn

Good review, Yannic. It helps me a lot to understand the papers faster.

grayleafmorgo

"I usually don't believe the experiments that you see in single paper." LOL

reginaphalange

To be sure I am understanding everything correctly: If you are training a fully connected NN (MLP) with only 1 channel, then Layer Norm = Instance Norm = Group Norm, correct?
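One piece of this is easy to check numerically: for fully connected activations of shape (N, C), GroupNorm with a single group and LayerNorm over the C features compute the same thing (with the default identity affine parameters). A small check, not from the video:

```python
import torch
import torch.nn as nn

N, C = 4, 16
x = torch.randn(N, C)                              # MLP activations: no spatial dimensions

ln = nn.LayerNorm(C)                               # normalize each sample over its C features
gn = nn.GroupNorm(num_groups=1, num_channels=C)    # one group containing all channels

print(torch.allclose(ln(x), gn(x), atol=1e-6))     # True with the default affine parameters
```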

proreduction

Thanks a lot Yannic!! keep the videos coming

fahdciwan

Thank you so much! This explanation is literally what I needed 🙏🏽🤝🏽

carlosnacher

At 15:13 you said you calculate the mean over 3 channels, but the picture looks like it has 6 different channels. Does the picture not represent the 3 channels?

moonryu

I usually even go with a batch size of 1 when processing videos 😉

(with my brain)

dermitdembrot

It seems that the method is motivated by the fact that there might be a few correlated or similar channels, but there is no effort to figure out which channels should be grouped together before normalizing them together. I'm surprised that this effect has been replicated across multiple efforts, as you mention.

indraneilpaul