Transformer - Part 8 - Decoder (3): Encoder-decoder self-attention

This is the third video about the transformer decoder and the final video introducing the transformer architecture. Here we mainly learn about the encoder-decoder multi-head self-attention layer, used to incorporate information from the encoder into the decoder. It should be noted that this layer is also commonly known as the cross-attention layer.
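For readers who want to see the computation spelled out, here is a minimal single-head NumPy sketch of the encoder-decoder (cross-) attention step described above. It is illustrative only: the variable names and sizes are assumptions, not taken from the video, and multi-head splitting, the residual connection, and layer normalization are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states, Wq, Wk, Wv):
    """Single-head encoder-decoder (cross-) attention.

    decoder_states: (n_d, d_model) -- one row per target-side position
    encoder_states: (n_e, d_model) -- one row per source-side position
    Wq, Wk, Wv:     (d_model, d_k) -- learned projection matrices
    """
    Q = decoder_states @ Wq                  # queries come from the decoder, (n_d, d_k)
    K = encoder_states @ Wk                  # keys come from the encoder,    (n_e, d_k)
    V = encoder_states @ Wv                  # values come from the encoder,  (n_e, d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # similarity of each target position to each source position, (n_d, n_e)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V                       # source information routed to each target position, (n_d, d_k)

# Tiny example with made-up sizes
rng = np.random.default_rng(0)
d_model, d_k, n_e, n_d = 8, 4, 5, 3
enc = rng.normal(size=(n_e, d_model))
dec = rng.normal(size=(n_d, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(cross_attention(dec, enc, Wq, Wk, Wv).shape)  # (3, 4)
```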

Comments

Very few people know these concepts well enough to give detailed explanations with formulae. Thanks a ton. I had a lot of questions, and this video helped resolve them.

subusrable

Best YouTube video explaining the Transformer ever!

SungheeYun

Undoubtedly, these 8 videos best explain transformers. I tried other videos and tutorials, but you are the best.

shaifulchowdhury

Beautifully explained, thank you. Transformers are so simple yet powerful.

notanape

I had been struggling to understand the size mismatch between the encoder and decoder; your video made it clear. Others usually skip this part. Thanks, sir.

AI_Life_Journey

These videos are wonderful, thank you for putting in the work. Everything was communicated so clearly and thoroughly.

My interpretation of the attention mechanism is that the result of the similarity (weight) matrix multiplied by the value matrix gives us an offset vector, which we then add to the value and normalize to get a contextualized vector. It's interesting that in the decoder we derive this offset from a value vector in the source language, add it to the target words, and it is still somehow meaningful. I presume that it is the final linear layer which ensures that this resulting normalized output vector maps coherently to a discrete word in the target language.

If we can do this across languages, I wonder if this can be done across modalities.

ryanhewitt
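The "offset" interpretation in the comment above corresponds to the Add & Norm step that follows each attention layer. The sketch below is a simplified illustration under assumed names and shapes (the real layer normalization also has learned scale and bias parameters):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Simplified layer normalization: zero mean and unit variance per position.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

rng = np.random.default_rng(1)
n_d, d_model = 3, 8
x = rng.normal(size=(n_d, d_model))                  # decoder representations of the target tokens
attention_output = rng.normal(size=(n_d, d_model))   # the "offset" built from the source-side values

contextualized = layer_norm(x + attention_output)    # Add & Norm: offset added to the target tokens, then normalized
print(contextualized.shape)  # (3, 8)
```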

Great, I regret not seeing your class earlier; many tutorials say little about the decoder part.

zimingzhang

Thanks a lot, teacher. You made many things clear for me 🙏🏽❤️

cedricmanouan

Thank you professor for this amazing series on the transformer!

nappingyiyi

Your videos are both precise and very educational, many thanks!

wawa

These are such clear explanations, thanks so much.

violinplayer

Thanks a lot, this is the only complete course about transformers that I have found. One question: why K = [q1 q2 ... q_(nE)] and not K = [k1 ...] (or is it a typo?)

antonisnesios

Dear Lennart, that was awesome. Could you please make a tutorial in Python as well? :)

TechSuperGirl

Thank you for your work, these are incredible videos. But there is one thing I didn't understand: during the training phase, the entire correctly translated sentence is given as input to the decoder, and to prevent the transformer from "cheating", masked self-attention is used. How many times does this step happen? Because if it only happened once, then the hidden words would not be usable during training. During the training phase, does backpropagation occur after each step, and does the mask then move, hiding fewer words?

nomecognome-fw
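The masking question above comes up often. The sketch below is a generic illustration (not the presenter's answer) of the causal mask used in the decoder's masked self-attention: it is built once, added to the score matrix, and lets the whole target sentence be processed in a single forward pass, with one prediction per position; the mask does not move between steps.

```python
import numpy as np

n_d = 5  # number of target tokens in the training sentence

# Causal (look-ahead) mask: entry (i, j) is 0 if position i may attend to
# position j (i.e. j <= i) and -inf otherwise, so the softmax gives it zero weight.
mask = np.triu(np.full((n_d, n_d), -np.inf), k=1)
print(mask)
# [[  0. -inf -inf -inf -inf]
#  [  0.   0. -inf -inf -inf]
#  [  0.   0.   0. -inf -inf]
#  [  0.   0.   0.   0. -inf]
#  [  0.   0.   0.   0.   0.]]
# The mask is added to the (n_d, n_d) attention scores before the softmax, so a
# single forward pass yields a prediction for every prefix at once and one
# backpropagation step updates them all.
```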

Thanks for the great lecture! One thing I'd like to ask: why do you still call it "self-attention" when information from the encoder and decoder is combined? Wouldn't just "attention" or even "cross-attention" make more sense here? If not, what is the "self" in self-attention, and what is not self-attention?

paulvoigtlaender

In this encoder-decoder architecture, I wanted to understand: if we have N encoders stacked together (one after the other), is Encoder_1 feeding Decoder_1, or is Encoder_N feeding Decoder_1?

mrinalde

After training the model, when we give an unknown source sentence to the model, how does it predict or decode the words?

akhileshbisht
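As a general illustration of the inference question above (hedged: `encode`, `decode_step`, `bos_id`, and `eos_id` are hypothetical placeholders, not names from the video), decoding is typically autoregressive: the encoder runs once on the unknown source sentence, and the decoder is then called repeatedly, each time fed the tokens it has generated so far.

```python
def greedy_decode(src_tokens, encode, decode_step, bos_id, eos_id, max_len=50):
    """Greedy autoregressive decoding sketch with hypothetical model callables."""
    memory = encode(src_tokens)      # encoder output, computed once
    output = [bos_id]                # start with the start-of-sequence token
    for _ in range(max_len):
        next_id = decode_step(output, memory)  # most probable next token given the output so far
        output.append(next_id)
        if next_id == eos_id:        # stop once end-of-sequence is produced
            break
    return output[1:]                # drop the start-of-sequence token
```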

Thank you for this video. During the calculations in the encoder-decoder attention layer, are the matrices Wq, Wk, Wv specific to that layer and learned only in this layer? Also, am I correct in understanding that the decoder's masked self-attention and the encoder-decoder attention act as essentially different layers, with different sets of W matrices?

КонстантинДемьянов-лп

Many thanks, professor. However, I am not sure whether we should use transpose(K) * Q or Q * transpose(K). Suppose that Q.shape = (nd, d) and K.shape = (ne, d); I think we should use Q * transpose(K) to produce an output with shape (nd, ne).

chenqu
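The shape reasoning in the comment above checks out under a row-wise convention (each query or key stored as a row, which is an assumption about the notation rather than the video's choice):

```python
import numpy as np

n_d, n_e, d = 3, 5, 4    # target length, source length, query/key dimension
Q = np.zeros((n_d, d))   # one query per decoder position (rows)
K = np.zeros((n_e, d))   # one key per encoder position (rows)

scores = Q @ K.T         # (n_d, d) @ (d, n_e) -> (n_d, n_e)
print(scores.shape)      # (3, 5): one similarity score per (target, source) pair
# With a column-wise convention (vectors stored as columns, Q of shape (d, n_d),
# K of shape (d, n_e)), the same scores appear as transpose(K) @ Q with shape
# (n_e, n_d); the two results are transposes of each other.
```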