Live - Transformers In-Depth Architecture Understanding - Attention Is All You Need

All Credits To Jay Alammar

Please donate through GPay UPI ID if you want to support the channel.

Please join my channel as a member to get additional benefits like Data Science materials, members-only live streams, and more.

Please also subscribe to my other channel.

Connect with me here:
Comments

@40:00 Why do we consider 64? It is based on how many attention heads you want to apply. We use an embedding size of 512 for each word and want to apply 8 self-attention heads; therefore each head uses (512/8 =) 64-dimensional Q, K, and V vectors. That way, when we concatenate all the attention heads afterwards, we get back the same 512-dimensional word embedding, which is the input to the feed-forward layer.

Now, for instance, if you want 16 attention heads, you can use 32-dimensional Q, K, and V vectors. In my opinion, the initial word embedding size and the number of attention heads are hyperparameters.
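
A minimal sketch of this arithmetic (values taken from the video; the variable names are only illustrative):

```python
# Head-dimension arithmetic described in the comment above.
d_model = 512                 # embedding size per word
n_heads = 8                   # number of attention heads
d_k = d_model // n_heads      # 512 / 8 = 64 -> per-head Q, K, V dimension
assert d_k == 64

# Concatenating the 8 heads (64 dims each) restores the 512-dim width
# expected by the feed-forward layer.
assert n_heads * d_k == d_model

# With 16 heads instead, each head would use 32-dim Q, K, V vectors.
assert d_model // 16 == 32
```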

mohammadmasum

Krish is a hard-working person, not for himself but for our country, in the best way he can... We need more people like him in our country.

story_teller_

For anyone having a doubt at 40:00 as to why we take the square root of 64: as per the research, it was shown to be the best way to keep the gradients stable! Also, note that the value 64, which is the size of the Query, Key and Value vectors, is itself a hyperparameter which was found to work best. Hope this helps.
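
For illustration only, a minimal NumPy sketch of scaled dot-product attention with d_k = 64 (the function name and toy shapes are my own, not from the video):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (seq_len, d_k), e.g. d_k = 64."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # divide by sqrt(64) = 8
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of value vectors

# Toy usage: 5 tokens, one 64-dim head
Q = np.random.randn(5, 64)
K = np.random.randn(5, 64)
V = np.random.randn(5, 64)
out = scaled_dot_product_attention(Q, K, V)         # shape (5, 64)
```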

suddhasatwaAtGoogle

This might help the guy who asked why we take the square root, and also other aspirants:

The scores get scaled down by dividing by the square root of the dimension of the query and key vectors. This allows for more stable gradients, as multiplying values can have exploding effects.
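
A quick numerical check of that "exploding" effect, assuming random unit-variance vectors and d_k = 64 as in the video (my own illustration):

```python
import numpy as np

# Dot products of random 64-dim vectors have a standard deviation of about
# sqrt(64) = 8; dividing by sqrt(d_k) brings them back to roughly 1.
rng = np.random.default_rng(0)
d_k = 64
q = rng.standard_normal((10_000, d_k))
k = rng.standard_normal((10_000, d_k))

raw = (q * k).sum(axis=1)        # unscaled scores: std ~ 8
scaled = raw / np.sqrt(d_k)      # scaled scores:   std ~ 1
print(raw.std(), scaled.std())
```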

roshankumargupta

You can skim through all the YouTube videos explaining transformers, but nobody comes close to this video.
Thank you Sir🙏🙏🙏

anusikhpanda

Thanks for explaining Jay's blog. To add to the explanation at 39:30: the reason for using sqrt(dk) is to prevent the problem of vanishing gradients, as mentioned in the paper. Since we apply softmax to Q*K, if these matrices have a high dimension the product will contain large values, which get pushed close to 1 after the softmax and hence lead to very small gradient updates.
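
A small demonstration of that softmax saturation (my own example, not from the video or the blog):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5, -1.0])
print(softmax(scores))       # reasonably spread distribution
print(softmax(8 * scores))   # larger scores -> nearly one-hot output

# When the softmax output saturates like this, its gradient with respect to
# the scores is almost zero, so scaling Q*K by 1/sqrt(dk) keeps the scores
# in a range where the softmax still has a useful gradient.
```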

harshitjain

Krish, I really see the honesty in you, man; a lot of humility, a very humble person. At the beginning of this video you gave credit several times to Jay, who created an amazing blog on Transformers. I really liked that. Stay like that.

ss-dytw

I cannot express enough appreciation for your videos, especially the NLP and deep learning related topics! They are extremely helpful and so easy to understand from scratch! Thank you very much!

dandyyu

Thanks for your fantastic LLM/Transformer series content, and I admire your positive attitude and support for the authors of these wonderful articles! 👏

nim-cast

I am very new to the world of AI and was looking for easy videos to teach me about the different models. I cannot believe I stayed totally enthralled by this video for as long as you taught. You are a very good teacher. Thank you for publishing this video for free. Thanks to Jay as well for simplifying such a complex topic.

shrikanyaghatak

For those getting confused about the 8 heads: all the words go to all the heads; it's not one word per head. The X matrix remains the same; only the W matrices change in multi-head attention.
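
A minimal NumPy sketch of that point, assuming the 512-dim / 8-head setup from the video (variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 512, 8
d_k = d_model // n_heads                       # 64

X = rng.standard_normal((seq_len, d_model))    # the SAME X feeds every head

heads = []
for h in range(n_heads):
    # Only the projection matrices W differ from head to head.
    W_q = rng.standard_normal((d_model, d_k))
    W_k = rng.standard_normal((d_model, d_k))
    W_v = rng.standard_normal((d_model, d_k))
    # All 5 words (all rows of X) are projected in every head.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # each of shape (5, 64)
    heads.append((Q, K, V))
```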

faezakamran

I really admire you now, simply because you give credit to those who deserve it at the beginning of the video.

That attitude will make you a great leader. All the best!!

prasad

Excellent blog from Jay. Thanks, Krish, for introducing this blog on your channel!!

sarrae

Sir, please release the video on BERT. Eagerly waiting for it.

jeeveshkataria

Every time I get confused or distracted while listening to the Transformers explanation, I have to watch the video again; this is my third time watching it, and now I understand it better.

shanthan.

A million tons of appreciation for making this video. Thank you so much for your amazing work.

akhilgangavarapu

@31:45 If my understanding is correct, the reason we have 64 is that we divide 512 into 8 equal heads. Since we compute dot products to get the attention values, taking the dot product over the full 512-dimensional embedding would not only be computationally expensive but would also give us just one relation between the words. Taking advantage of parallel computation, we divide 512 into 8 equal parts; this is why we call it multi-head attention. This way it is computationally fast and we also get 8 different relations between the words. (FYI, attention is basically a relation between the words.) Anyway, good work on explaining the architecture, Krish.
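
To illustrate the "8 different relations" idea, here is a rough sketch that batches all heads at once (my own example; the einsum-based batching is just one possible implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 512, 8
d_k = d_model // n_heads                            # 64

# Project once, then split the 512 dims into 8 heads of 64 dims each.
X = rng.standard_normal((seq_len, d_model))
Q = (X @ rng.standard_normal((d_model, d_model))).reshape(seq_len, n_heads, d_k)
K = (X @ rng.standard_normal((d_model, d_model))).reshape(seq_len, n_heads, d_k)

# All heads computed in one batched operation: 8 separate (5 x 5) score maps,
# i.e. 8 different "relations" between the words instead of a single one.
scores = np.einsum('qhd,khd->hqk', Q, K) / np.sqrt(d_k)
print(scores.shape)                                  # (8, 5, 5)
```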

junaidiqbal

Great session, Krish. Because of the research paper, I understood things very easily and clearly.

hiteshyerekar

Really nice, sir; looking forward to the BERT implementation 😊

MuhammadShahzad-dxje

You are a really good teacher who always checks whether your audience got the concept or not. Also, I appreciate your patience and the way you rephrase things to give a better explanation.

apppurchaser