Layer Normalization - EXPLAINED (in Transformer Neural Networks)

Let's talk about Layer Normalization in Transformer Neural Networks!


TIMESTAMPS
0:00 Transformer Encoder Overview
0:56 "Add & Norm": Transformer Encoder Deep Dive
5:13 Layer Normalization: What & why
7:33 Layer Normalization: Working out the math by hand
12:10 Final Coded Class
Comments

As per my understanding and from the LayerNorm code in PyTorch, in NLP for an input of size [N, T, Embed], statistics are computed using only the Embed dim, and layer norm is applied to each token in each batch. But for vision, with an input of size [N, C, H, W], statistics are computed using the [C, H, W] dimensions.

vib
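The behaviour described in the comment above can be sketched in NumPy (shapes and the eps value here are illustrative, not taken from the video): for [N, T, Embed] inputs the statistics use only the embedding dimension, while for [N, C, H, W] inputs they use all of [C, H, W].

```python
import numpy as np

def layer_norm(x, axes, eps=1e-5):
    # Normalize over the given trailing axes, keeping dims for broadcasting
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# NLP-style input [N, T, Embed]: statistics per token, over the embedding dim only
nlp = np.random.randn(2, 4, 8)
nlp_out = layer_norm(nlp, axes=(-1,))

# Vision-style input [N, C, H, W]: statistics over all of [C, H, W]
img = np.random.randn(2, 3, 5, 5)
img_out = layer_norm(img, axes=(-3, -2, -1))
```

After normalization, each token's embedding (in the NLP case) and each image (in the vision case) has mean roughly 0 and variance roughly 1 over the normalized axes.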

I really like your voice and delivery. It’s quite reassuring, which is nice when the subject of the videos can be pretty complicated.

jbca

Thank you so much for providing solid examples and calculations to explain these concepts. I have seen these concepts elsewhere but couldn't make sense of them until I saw how you computed the values. Great video!

Hangglide

6:56 Actually, you would use the variance instead of the standard deviation, so in the LayerNorm formula it should be sigma^2 in the divisor. Also, the epsilon is missing to prevent division by zero.

velvet_husky
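As the comment above notes, implementations add a small epsilon inside the square root to avoid dividing by zero when a vector has zero variance. A minimal NumPy sketch of the standard formula y = gamma * (x - mu) / sqrt(sigma^2 + eps) + beta (the input values and eps here are illustrative):

```python
import numpy as np

def layer_norm_1d(x, gamma=1.0, beta=0.0, eps=1e-5):
    # y = gamma * (x - mu) / sqrt(sigma^2 + eps) + beta
    mu = x.mean()
    var = x.var()  # sigma^2 (population variance)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.array([0.2, 0.1, 0.3])
y = layer_norm_1d(x)
```

With gamma=1 and beta=0, the output has mean 0 and standard deviation close to 1 (slightly below 1 because of eps).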

Another great video - I like the structure you use of summarising the concept and then diving into the implementation. The code really helps bring it together as some others have commented. I look forward to seeing more of this series and would love to see a longer video of you deploying a transformer on some dummy data (perhaps you already have one - still going through the content)!

superghettoindian

10:30 When layer normalizations are "computed across the layer and also the batch", are the means and standard deviations computed as if the batch boundaries weren't there? Does that mean there is a different learnable gamma and beta parameter for each word?

tdk-in

Very clear and sound explanation of a complex concept. Thumbs up for the hard work!

yangkewen

Nice series on transformers. Really liked it. Btw, interesting design choice to use a landscape layout of the transformer architecture during the intro :D

saahilnayyer

The Python example really helped to solidify my understanding.

wryltxw

A bit late, but can you tell me if my understanding is correct? The output from the attention head might have skewed means because of the large spread of the values, which could easily lead to wrong predictions early on. The gradients might then keep adjusting to correct these predictions, which could cause incorrect weight updates and eventually exploding or vanishing gradients. To correct this, we sort of reset the mean and then fine-tune it according to the matrix?

anishkrishnan

Great! Since you’re covering transformer components, I would love to see TransformerXL and RelativePositionalEmbedding concepts explained in the upcoming videos! ☺️

_.hello._.world_

Where can I find the reason why we need to calculate the mean and standard deviation over the parameter shapes? In PyTorch, they just calculate over the last dimension, the hidden size.

yanlu
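The PyTorch behaviour the comment above mentions (nn.LayerNorm(hidden_size) normalizes over the last dimension only, with one learnable gamma and beta per hidden feature) can be sketched in NumPy as follows; the class name and shapes are illustrative, not from the video:

```python
import numpy as np

class LayerNormLastDim:
    """Sketch of layer norm over only the last (hidden) dimension,
    mirroring PyTorch's default nn.LayerNorm(hidden_size) behaviour."""
    def __init__(self, hidden_size, eps=1e-5):
        self.gamma = np.ones(hidden_size)   # learnable scale, one per feature
        self.beta = np.zeros(hidden_size)   # learnable shift, one per feature
        self.eps = eps

    def __call__(self, x):
        # Statistics over the hidden dim only: one mean/var per token
        mu = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return self.gamma * (x - mu) / np.sqrt(var + self.eps) + self.beta

ln = LayerNormLastDim(8)
out = ln(np.random.randn(3, 5, 8))  # [batch, seq_len, hidden]
```

Because gamma and beta have shape [hidden_size], they broadcast over the batch and sequence dimensions, so every token shares the same learnable parameters.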

Thanks Man,
you deserve more than a like

Ibrahimnada

Amazing, your explanations are so clear. Can this help with exploding gradients?

fenglema

The diagram is great, but you should also explain how the code corresponds to the diagram.

shubhamgattani

If our batch has 2 sequences of 2 words x 3 embedding size, say:
[[1, 2, 3], [4, 5, 6]] and [[1, 3, 5], [2, 4, 6]]
For layer normalization,
is mu_1 = mean(1, 2, 3, 1, 3, 5)
and mu_2 = mean(4, 5, 6, 2, 4, 6)?

Just wanted to clarify. Keep up the great work, brother. Like the small bite sized videos.

luvsuneja
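For what it's worth, under the convention PyTorch uses for NLP inputs (statistics over the embedding dimension only), each word gets its own mean and batch elements are never mixed. A small NumPy sketch with the example numbers from the comment above:

```python
import numpy as np

batch = np.array([[[1, 2, 3], [4, 5, 6]],
                  [[1, 3, 5], [2, 4, 6]]], dtype=float)  # [batch=2, words=2, embed=3]

# Per-token means: one mean per word, computed over the embedding dim only
means = batch.mean(axis=-1)
# means == [[2., 5.], [3., 4.]] -- no mixing across batch elements
```

So mean(1, 2, 3) = 2 for the first word of the first sequence, mean(4, 5, 6) = 5 for its second word, and so on; the two sequences in the batch never share statistics.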

Great video! I find these very informative, so please keep them going! Question on the output dimensions though. In your transformer overview video the big diagram shows that after the layer normalization, you have a matrix of shape [batch_size, sequence_len, dmodel] (in the video 30x50x512 I believe.) However here you end up with an output matrix (out) of [sequence_len, batch_size, dmodel] (5x3x8). Do we need to reshape these output matrices again to [batch_size, sequence_len, dmodel], or am I missing something? Thanks again for all the informative content!

TD-vijx

Have an actual question this time! While trying to understand the differences between layer and batch normalization, I was wondering whether it’s also accurate to say you are normalising across the features of a vector when normalising the activation function - since each layer is a matrix multiply across all features of a row, would normalising across activation functions be similar to normalising across the features?

In the same thread, can/should layer and batch normalization be run concurrently? If not, are there reasons to choose one over the other?

superghettoindian

Beautifully done video, but isn't layer normalization essentially a batch normalization layer?

misterx

Is PyTorch better for NLP tasks than TensorFlow?

saurabhnirwan