torch.nn.TransformerDecoderLayer - Part 2 - Embedding, First Multi-Head attention and Normalization

Comments

The first multi-head attention is masked, and the second is normal multi-head attention. I think you forgot to mention that the first one is masked: the decoder behaves autoregressively, feeding its own outputs back in as inputs at inference time. Since training has to mimic this inference/testing behaviour, the first attention is attention with a causal mask (see the sketch after this comment).

Great videos overall!

kushagrasharma
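
As a minimal sketch of the masking the comment describes, assuming a recent PyTorch where generate_square_subsequent_mask is exposed as a static method of nn.Transformer (the sizes below are arbitrary, chosen only for illustration):

import torch
import torch.nn as nn

# Hypothetical sizes for illustration
d_model, nhead, seq_len, batch = 16, 4, 5, 2

decoder_layer = nn.TransformerDecoderLayer(d_model, nhead)

# Decoder input (tgt) and encoder output (memory);
# the default layout is (seq_len, batch, d_model)
tgt = torch.rand(seq_len, batch, d_model)
memory = torch.rand(seq_len, batch, d_model)

# Causal mask: position i may only attend to positions <= i.
# The upper triangle is -inf, which zeroes those attention
# weights after the softmax.
tgt_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
print(tgt_mask)
# tensor([[0., -inf, -inf, -inf, -inf],
#         [0.,   0., -inf, -inf, -inf],
#         ...

# tgt_mask applies only to the first (self-)attention; the second
# (cross-)attention over memory is left unmasked here.
out = decoder_layer(tgt, memory, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([5, 2, 16])

This matches the commenter's point: only the decoder's self-attention gets the causal mask, so during training each position sees only earlier positions, mimicking autoregressive generation, while the cross-attention over the encoder output stays unmasked.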