Weight Standardization (Paper Explained)

It's common for neural networks to include activation normalization such as BatchNorm or GroupNorm. This paper extends normalization to the weights of the network as well. This surprisingly simple change leads to a boost in performance and, combined with GroupNorm, new state-of-the-art results.

Abstract:
In this paper, we propose Weight Standardization (WS) to accelerate deep network training. WS is targeted at the micro-batch training setting where each GPU typically has only 1-2 images for training. The micro-batch training setting is hard because small batch sizes are not enough for training networks with Batch Normalization (BN), while other normalization methods that do not rely on batch knowledge still have difficulty matching the performances of BN in large-batch training. Our WS ends this problem because when used with Group Normalization and trained with 1 image/GPU, WS is able to match or outperform the performances of BN trained with large batch sizes with only 2 more lines of code. In micro-batch training, WS significantly outperforms other normalization methods. WS achieves these superior results by standardizing the weights in the convolutional layers, which we show is able to smooth the loss landscape by reducing the Lipschitz constants of the loss and the gradients. The effectiveness of WS is verified on many tasks, including image classification, object detection, instance segmentation, video recognition, semantic segmentation, and point cloud recognition. The code is available here: this https URL.

Authors: Siyuan Qiao, Huiyu Wang, Chenxi Liu, Wei Shen, Alan Yuille
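
For intuition, the core operation is tiny: before each forward pass, every convolutional filter is re-centered to zero mean and rescaled to unit standard deviation over its fan-in, and an activation normalizer such as GroupNorm is still applied to the outputs. Below is a minimal PyTorch sketch of that idea (my own illustration under those assumptions, not the authors' released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d that standardizes each filter's weights before every forward pass."""

    def forward(self, x):
        w = self.weight  # shape: (out_channels, in_channels, kH, kW)
        # Re-center and rescale each filter over its fan-in (in_channels * kH * kW).
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5  # small epsilon for numerical stability
        w_hat = (w - mean) / std
        return F.conv2d(x, w_hat, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# As in the paper, WS is meant to be paired with an activation normalizer such as GroupNorm.
block = nn.Sequential(
    WSConv2d(3, 64, kernel_size=3, padding=1),
    nn.GroupNorm(num_groups=32, num_channels=64),
    nn.ReLU(),
)
out = block(torch.randn(1, 3, 32, 32))  # -> shape (1, 64, 32, 32)
```

The mean subtraction and the division by the standard deviation are essentially the "2 more lines of code" the abstract refers to.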

Comments

1:00 Main results
2:00 Why batch norm is suboptimal
5:20 Weight Standardization Method
12:55 Backpropagation through WS
16:05 Theory
16:30 Ablations
18:10 Conclusion

YannicKilcher

Woah, gosh, you're churning out videos at such a furious rate that it's difficult to keep up. Nonetheless, I hope to emulate you in terms of such consistency. Keep it up 🙏🙏

arnavdas

A pretty simple idea hidden under piles of math. Thank you for the explanation!

Carbon-XII

At 1:43: Mask R-CNN is not recurrent; the R stands for region (region-based CNN).

HoriaCristescu

Hi Yannic,

I was thinking about what you said at 12:10. The weights become large, which introduces instability or variance, so this method should help achieve convergence faster: if we keep training the model, the weights will eventually settle into a stable region.

Just a point, let me know what you think.

RohitKumarSingh

Considered leaving a comment. Nice video!

CristianGarcia

I'm slightly triggered because this type of technical neural network research only addresses CNNs, as if they were the only NN architecture. Feedforward NNs and RNNs need love too.

Chrnalis

Great videos. Thanks for your efforts.

nikre

Great video, Yannic. Re-centering and rescaling the weights reduces otherwise large (overfitted) weights. This acts as regularization and should improve performance even when applied without Group Normalization. Thoughts?

MuditBachhawatIn

Did you implement a faster paper-processing net in your biological neural net?

impolitevegan

Thank you for the interesting video! I'm curious about your experience with Weight Standardization. In the video you said you would give it a try and that you think it will bring something. Since you posted this video some time ago, I would like to know your feedback on the gains, if there were any.

azinjahedi

What happened to L2 weight regularization? Nobody uses it anymore. This looks like an evolved version of L2 regularization.

Great paper though. I will definitely use it in my projects.

herp_derpingson

Great video, thank you. However, I have a question: if the weights need to have zero mean, isn't it easier to initialize them with zero mean and then force the gradient to have zero mean as well? We would get to keep the entire current workflow except for a tweak to the initializer and optimizer, no?
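
A minimal sketch of that alternative (purely illustrative, not from the paper): give each filter a zero-mean initialization and register a gradient hook that re-centers the gradient, so a plain SGD step keeps the per-filter means at zero. Note that this only mimics the re-centering half of WS; the division by the per-filter standard deviation would still have to happen somewhere.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)

# Hypothetical: re-center each filter at initialization.
with torch.no_grad():
    conv.weight -= conv.weight.mean(dim=(1, 2, 3), keepdim=True)

# Hypothetical: re-center each filter's gradient so updates preserve the zero mean.
def zero_mean_grad(grad):
    return grad - grad.mean(dim=(1, 2, 3), keepdim=True)

conv.weight.register_hook(zero_mean_grad)

# One step of plain SGD leaves every filter's mean at (numerically) zero.
opt = torch.optim.SGD(conv.parameters(), lr=0.1)
conv(torch.randn(8, 3, 32, 32)).sum().backward()
opt.step()
print(conv.weight.mean(dim=(1, 2, 3)).abs().max())  # ~0
```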

iejtstr

Tiny, inconsequential correction: the R in R-CNN is Regression, not Recurrent. Edit: I'm wrong, see comments. D'oh.

JackofSome