torch.nn.TransformerDecoderLayer - Part 2 - Embedding, First Multi-Head attention and Normalization

Comments

The first multi-head attention is masked, and the second is normal multi-head attention. I think you forgot to mention that the first one is masked: the decoder behaves autoregressively, feeding its own outputs back in as inputs at inference time. Since training has to mimic this inference/testing behaviour, the first attention is attention with a causal mask (see the sketch after this comment).

Great videos overall!

kushagrasharma
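
As a minimal sketch of the masking the comment describes, assuming a recent PyTorch where generate_square_subsequent_mask is exposed as a static method of nn.Transformer (the sizes below are arbitrary, chosen only for illustration):

import torch
import torch.nn as nn

# Hypothetical sizes for illustration
d_model, nhead, seq_len, batch = 16, 4, 5, 2

decoder_layer = nn.TransformerDecoderLayer(d_model, nhead)

# Decoder input (tgt) and encoder output (memory);
# the default layout is (seq_len, batch, d_model)
tgt = torch.rand(seq_len, batch, d_model)
memory = torch.rand(seq_len, batch, d_model)

# Causal mask: position i may only attend to positions <= i.
# The upper triangle is -inf, which zeroes those attention
# weights after the softmax.
tgt_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
print(tgt_mask)
# tensor([[0., -inf, -inf, -inf, -inf],
#         [0.,   0., -inf, -inf, -inf],
#         ...

# tgt_mask applies only to the first (self-)attention; the second
# (cross-)attention over memory is left unmasked here.
out = decoder_layer(tgt, memory, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([5, 2, 16])

This matches the commenter's point: only the decoder's self-attention gets the causal mask, so during training each position sees only earlier positions, mimicking autoregressive generation, while the cross-attention over the encoder output stays unmasked.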