Layer Normalization - EXPLAINED (in Transformer Neural Networks)

Let's talk about Layer Normalization in Transformer Neural Networks!


TIMESTAMPS
0:00 Transformer Encoder Overview
0:56 "Add & Norm": Transformer Encoder Deep Dive
5:13 Layer Normalization: What & why
7:33 Layer Normalization: Working out the math by hand
12:10 Final Coded Class
Comments

As per my understanding and from the LayerNorm code in PyTorch, in NLP for an input of size [N, T, Embed], statistics are computed using only the Embed dim, and layer norm is applied to each token in each batch. But for vision, with an input of size [N, C, H, W], statistics are computed using the [C, H, W] dimensions.

vib
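The behaviour described in the comment above can be sketched in NumPy (shapes and the eps value here are illustrative, not taken from the video): for [N, T, Embed] inputs the statistics use only the embedding dimension, while for [N, C, H, W] inputs they use all of [C, H, W].

```python
import numpy as np

def layer_norm(x, axes, eps=1e-5):
    # Normalize over the given trailing axes, keeping dims for broadcasting
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# NLP-style input [N, T, Embed]: statistics per token, over the embedding dim only
nlp = np.random.randn(2, 4, 8)
nlp_out = layer_norm(nlp, axes=(-1,))

# Vision-style input [N, C, H, W]: statistics over all of [C, H, W]
img = np.random.randn(2, 3, 5, 5)
img_out = layer_norm(img, axes=(-3, -2, -1))
```

After normalization, each token's embedding (in the NLP case) and each image (in the vision case) has mean roughly 0 and variance roughly 1 over the normalized axes.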

I really like your voice and delivery. It’s quite reassuring, which is nice when the subject of the videos can be pretty complicated.

jbca

Thank you so much for providing solid examples and calculations to explain these concepts. I have seen these concepts elsewhere but couldn't make sense of them until I saw how you computed the values. Great video!

Hangglide

6:56 Actually, you would use the variance instead of the standard deviation, so in the LayerNorm formula it should be sigma^2 in the divisor. Also, the epsilon is missing to prevent division by zero.

velvet_husky
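As the comment above notes, implementations add a small epsilon inside the square root to avoid dividing by zero when a vector has zero variance. A minimal NumPy sketch of the standard formula y = gamma * (x - mu) / sqrt(sigma^2 + eps) + beta (the input values and eps here are illustrative):

```python
import numpy as np

def layer_norm_1d(x, gamma=1.0, beta=0.0, eps=1e-5):
    # y = gamma * (x - mu) / sqrt(sigma^2 + eps) + beta
    mu = x.mean()
    var = x.var()  # sigma^2 (population variance)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.array([0.2, 0.1, 0.3])
y = layer_norm_1d(x)
```

With gamma=1 and beta=0, the output has mean 0 and standard deviation close to 1 (slightly below 1 because of eps).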

Another great video - I like the structure you use of summarising the concept and then diving into the implementation. The code really helps bring it together as some others have commented. I look forward to seeing more of this series and would love to see a longer video of you deploying a transformer on some dummy data (perhaps you already have one - still going through the content)!

superghettoindian

10:30 When layer normalizations are "computed across the layer and also the batch", are the means and standard deviations computed as if the batch boundaries weren't there? Does that mean there is a different learnable gamma and beta parameter for each word?

tdk-in

Very clear and sound explanation of a complex concept. Thumbs up for the hard work!

yangkewen

Nice series on transformers. Really liked it. Btw, interesting design choice to use a landscape layout of the transformer architecture during the intro :D

saahilnayyer

The Python example really helped to solidify my understanding.

wryltxw

A bit late, but can you tell me if my understanding is correct? The output from the attention head might have skewed means because of the large spread of the values, which could easily lead to wrong predictions early on. The gradients might then keep adjusting to correct these predictions, which could cause incorrect weight updates and eventually exploding or vanishing gradients. To correct this, we sort of reset the mean and then fine-tune it according to the matrix?

anishkrishnan

Great! Since you’re covering transformer components, I would love to see TransformerXL and RelativePositionalEmbedding concepts explained in the upcoming videos! ☺️

_.hello._.world_

Where can I find the reason why we need to calculate the mean and standard deviation over the parameter shapes? In PyTorch, they just calculate over the last dimension, the hidden size.

yanlu
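The PyTorch behaviour the comment above mentions (nn.LayerNorm(hidden_size) normalizes over the last dimension only, with one learnable gamma and beta per hidden feature) can be sketched in NumPy as follows; the class name and shapes are illustrative, not from the video:

```python
import numpy as np

class LayerNormLastDim:
    """Sketch of layer norm over only the last (hidden) dimension,
    mirroring PyTorch's default nn.LayerNorm(hidden_size) behaviour."""
    def __init__(self, hidden_size, eps=1e-5):
        self.gamma = np.ones(hidden_size)   # learnable scale, one per feature
        self.beta = np.zeros(hidden_size)   # learnable shift, one per feature
        self.eps = eps

    def __call__(self, x):
        # Statistics over the hidden dim only: one mean/var per token
        mu = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return self.gamma * (x - mu) / np.sqrt(var + self.eps) + self.beta

ln = LayerNormLastDim(8)
out = ln(np.random.randn(3, 5, 8))  # [batch, seq_len, hidden]
```

Because gamma and beta have shape [hidden_size], they broadcast over the batch and sequence dimensions, so every token shares the same learnable parameters.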

Thanks Man,
you deserve more than a like

Ibrahimnada

Amazing, your explanations are so clear. Can this help with exploding gradients?

fenglema

The diagram is great, but you should also explain how the code corresponds to the diagram.

shubhamgattani

If our batch has 2 sequences of 2 words x 3 embedding size, say:
[[1, 2, 3], [4, 5, 6]] and [[1, 3, 5], [2, 4, 6]]
For layer normalization,
is mu_1 = mean(1, 2, 3, 1, 3, 5)
and mu_2 = mean(4, 5, 6, 2, 4, 6)?

Just wanted to clarify. Keep up the great work, brother. Like the small bite sized videos.

luvsuneja
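For what it's worth, under the convention PyTorch uses for NLP inputs (statistics over the embedding dimension only), each word gets its own mean and batch elements are never mixed. A small NumPy sketch with the example numbers from the comment above:

```python
import numpy as np

batch = np.array([[[1, 2, 3], [4, 5, 6]],
                  [[1, 3, 5], [2, 4, 6]]], dtype=float)  # [batch=2, words=2, embed=3]

# Per-token means: one mean per word, computed over the embedding dim only
means = batch.mean(axis=-1)
# means == [[2., 5.], [3., 4.]] -- no mixing across batch elements
```

So mean(1, 2, 3) = 2 for the first word of the first sequence, mean(4, 5, 6) = 5 for its second word, and so on; the two sequences in the batch never share statistics.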

Great video! I find these very informative, so please keep them going! Question on the output dimensions though. In your transformer overview video the big diagram shows that after the layer normalization, you have a matrix of shape [batch_size, sequence_len, dmodel] (in the video 30x50x512 I believe.) However here you end up with an output matrix (out) of [sequence_len, batch_size, dmodel] (5x3x8). Do we need to reshape these output matrices again to [batch_size, sequence_len, dmodel], or am I missing something? Thanks again for all the informative content!

TD-vijx

Have an actual question this time! While trying to understand the differences between layer and batch normalization, I was wondering whether it’s also accurate to say you are normalising across the features of a vector when normalising the activation function - since each layer is a matrix multiply across all features of a row, would normalising across activation functions be similar to normalising across the features?

In the same thread, can/should layer and batch normalization be run concurrently? If not, are there reasons to choose one over the other?

superghettoindian

Beautifully done video, but isn't layer normalization essentially a batch normalization layer?

misterx

Is PyTorch better for NLP tasks than TensorFlow?

saurabhnirwan