Live - Transformers In-Depth Architecture Understanding - Attention Is All You Need

All Credits To Jay Alammar

Please donate through GPay UPI ID if you want to support the channel.

Please join my channel as a member to get additional benefits like Data Science materials, members-only live streams, and more.

Please also subscribe to my other channel.

Connect with me here:
Comments

@40:00 Why do we consider 64? It is based on how many attention heads you want to apply. We use an embedding size of 512 for each word and want to apply 8 self-attention heads; therefore each head uses (512/8 =) 64-dimensional Q, K, and V vectors. That way, when we concatenate all the attention heads afterwards, we get back the same 512-dimensional word embedding, which is the input to the feed-forward layer.

Now, for instance, if you want 16 attention heads, you can use 32-dimensional Q, K, and V vectors. In my opinion, the initial word embedding size and the number of attention heads are hyperparameters.
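
A minimal sketch of this arithmetic (values taken from the video; the variable names are only illustrative):

```python
# Head-dimension arithmetic described in the comment above.
d_model = 512                 # embedding size per word
n_heads = 8                   # number of attention heads
d_k = d_model // n_heads      # 512 / 8 = 64 -> per-head Q, K, V dimension
assert d_k == 64

# Concatenating the 8 heads (64 dims each) restores the 512-dim width
# expected by the feed-forward layer.
assert n_heads * d_k == d_model

# With 16 heads instead, each head would use 32-dim Q, K, V vectors.
assert d_model // 16 == 32
```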

mohammadmasum

Krish is a hard-working person, not for himself but for our country, in the best way he can... We need more people like him in our country.

story_teller_

For anyone having a doubt at 40:00 as to why we take the square root of 64: as per the research, it was shown to be the best way to keep the gradients stable! Also, note that the value 64, which is the size of the Query, Key and Value vectors, is itself a hyperparameter which was found to work best. Hope this helps.
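
For illustration only, a minimal NumPy sketch of scaled dot-product attention with d_k = 64 (the function name and toy shapes are my own, not from the video):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (seq_len, d_k), e.g. d_k = 64."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # divide by sqrt(64) = 8
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of value vectors

# Toy usage: 5 tokens, one 64-dim head
Q = np.random.randn(5, 64)
K = np.random.randn(5, 64)
V = np.random.randn(5, 64)
out = scaled_dot_product_attention(Q, K, V)         # shape (5, 64)
```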

suddhasatwaAtGoogle

This might help the guy who asked why we take the square root, and also other aspirants:

The scores get scaled down by dividing by the square root of the dimension of the query and key vectors. This allows for more stable gradients, as multiplying values can have exploding effects.
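
A quick numerical check of that "exploding" effect, assuming random unit-variance vectors and d_k = 64 as in the video (my own illustration):

```python
import numpy as np

# Dot products of random 64-dim vectors have a standard deviation of about
# sqrt(64) = 8; dividing by sqrt(d_k) brings them back to roughly 1.
rng = np.random.default_rng(0)
d_k = 64
q = rng.standard_normal((10_000, d_k))
k = rng.standard_normal((10_000, d_k))

raw = (q * k).sum(axis=1)        # unscaled scores: std ~ 8
scaled = raw / np.sqrt(d_k)      # scaled scores:   std ~ 1
print(raw.std(), scaled.std())
```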

roshankumargupta

You can skim through all the YouTube videos explaining transformers, but nobody comes close to this video.
Thank you Sir🙏🙏🙏

anusikhpanda

Thanks for explaining Jay's blog. To add to the explanation at 39:30: the reason for using sqrt(dk) is to prevent the problem of vanishing gradients, as mentioned in the paper. Since we apply softmax to Q*K, if these matrices have a high dimension the product will contain large values, which get pushed close to 1 after the softmax and hence lead to very small gradient updates.
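
A small demonstration of that softmax saturation (my own example, not from the video or the blog):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5, -1.0])
print(softmax(scores))       # reasonably spread distribution
print(softmax(8 * scores))   # larger scores -> nearly one-hot output

# When the softmax output saturates like this, its gradient with respect to
# the scores is almost zero, so scaling Q*K by 1/sqrt(dk) keeps the scores
# in a range where the softmax still has a useful gradient.
```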

harshitjain

Krish, I really see the honesty in you, man; a lot of humility, a very humble person. At the beginning of this video you gave credit several times to Jay, who created an amazing blog on Transformers. I really liked that. Stay like that.

ss-dytw

I cannot express enough appreciation for your videos, especially the NLP and deep learning related topics! They are extremely helpful and so easy to understand from scratch! Thank you very much!

dandyyu

Thanks for your fantastic LLM/Transformer series content, and I admire your positive attitude and support for the authors of these wonderful articles! 👏

nim-cast

I am very new to the world of AI and was looking for easy videos to teach me about the different models. I cannot believe I stayed totally enthralled by this video for as long as you taught. You are a very good teacher. Thank you for publishing this video for free. Thanks to Jay as well for simplifying such a complex topic.

shrikanyaghatak

For those getting confused about the 8 heads: all the words go to all the heads; it's not one word per head. The X matrix remains the same; only the W matrices change in multi-head attention.
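
A minimal NumPy sketch of that point, assuming the 512-dim / 8-head setup from the video (variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 512, 8
d_k = d_model // n_heads                       # 64

X = rng.standard_normal((seq_len, d_model))    # the SAME X feeds every head

heads = []
for h in range(n_heads):
    # Only the projection matrices W differ from head to head.
    W_q = rng.standard_normal((d_model, d_k))
    W_k = rng.standard_normal((d_model, d_k))
    W_v = rng.standard_normal((d_model, d_k))
    # All 5 words (all rows of X) are projected in every head.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # each of shape (5, 64)
    heads.append((Q, K, V))
```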

faezakamran

I really admire you now, simply because you give credit to those who deserve it at the beginning of the video.

That attitude will make you a great leader. All the best!!

prasad

Excellent blog from Jay. Thanks, Krish, for introducing this blog on your channel!!

sarrae

Sir, please release the video on BERT. Eagerly waiting for it.

jeeveshkataria

Every time I get confused or distracted while listening to the Transformers explanation, I have to watch the video again; this is my third time watching it, and now I understand it better.

shanthan.

A million tons of appreciation for making this video. Thank you so much for your amazing work.

akhilgangavarapu

@31:45 If my understanding is correct, the reason we have 64 is that we divide 512 into 8 equal heads. Since we compute dot products to get the attention values, taking the dot product over the full 512-dimensional embedding would not only be computationally expensive but would also give us just one relation between the words. Taking advantage of parallel computation, we divide 512 into 8 equal parts; this is why we call it multi-head attention. This way it is computationally fast and we also get 8 different relations between the words. (FYI, attention is basically a relation between the words.) Anyway, good work on explaining the architecture, Krish.
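
To illustrate the "8 different relations" idea, here is a rough sketch that batches all heads at once (my own example; the einsum-based batching is just one possible implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 512, 8
d_k = d_model // n_heads                            # 64

# Project once, then split the 512 dims into 8 heads of 64 dims each.
X = rng.standard_normal((seq_len, d_model))
Q = (X @ rng.standard_normal((d_model, d_model))).reshape(seq_len, n_heads, d_k)
K = (X @ rng.standard_normal((d_model, d_model))).reshape(seq_len, n_heads, d_k)

# All heads computed in one batched operation: 8 separate (5 x 5) score maps,
# i.e. 8 different "relations" between the words instead of a single one.
scores = np.einsum('qhd,khd->hqk', Q, K) / np.sqrt(d_k)
print(scores.shape)                                  # (8, 5, 5)
```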

junaidiqbal

Great session, Krish. Because of the research paper, I understood things very easily and clearly.

hiteshyerekar

Really nice, sir; looking forward to the BERT implementation 😊

MuhammadShahzad-dxje

You are a really good teacher who always checks whether your audience got the concept or not. Also, I appreciate your patience and the way you rephrase things to give a better explanation.

apppurchaser