PyTorch Transformers from Scratch (Attention Is All You Need)

In this video we read the original Transformer paper, "Attention Is All You Need", and implement it from scratch!

Attention is all you need paper:

A good blogpost on Transformers:

GitHub Repository:

OUTLINE:
0:00 - Introduction
0:54 - Paper Review
11:20 - Attention Mechanism
27:00 - TransformerBlock
32:18 - Encoder
38:20 - DecoderBlock
42:00 - Decoder
46:55 - Putting it together to form The Transformer
52:45 - A Small Example
54:25 - Fixing Errors
56:44 - Ending
Comments

Here's the outline for the video:
0:00 - Introduction
0:54 - Paper Review
11:20 - Attention Mechanism
27:00 - TransformerBlock
32:18 - Encoder
38:20 - DecoderBlock
42:00 - Decoder
46:55 - Forming The Transformer
52:45 - A Small Example
54:25 - Fixing Errors
56:44 - Ending

AladdinPersson

Attention is not all we need, this video is all we need

rhronsky

I haven't found a tutorial this detail-oriented. Now I am completely able to understand the Transformer and the attention mechanism. Great work, thank you 😊

pratikhmanas

I watched 3 Transformer videos before this one and thought I would never understand it. Love the way you explained such a complicated topic.

bhargav

This is undoubtedly one of the best transformer implementation videos I have seen. Thanks for posting such good content. Looking forward to seeing some more paper implementation videos.

nishantraj

In the original paper each head should have separate weights, but in your code all heads share the same weights. Here are two steps to fix it:
1. In __init__: self.queries = nn.Linear(self.embed_size, self.embed_size, bias=False) (and the same for the key and value weights)
2. In forward: put "queries = self.queries(queries)" above "queries = queries.reshape(...)" (and the same for keys and values)
Great video btw
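
For reference, here is a minimal runnable sketch of the SelfAttention module with both steps above applied. This is an illustration that assumes the module structure and einsum formulation from the video, not the repository's exact code:

import torch
import torch.nn as nn


class SelfAttention(nn.Module):
    # Sketch of the fix: full embed_size x embed_size projections, applied
    # *before* splitting into heads, so each head gets its own slice of the weights.
    def __init__(self, embed_size, heads):
        super().__init__()
        assert embed_size % heads == 0, "embed_size must be divisible by heads"
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        # Step 1: full-size projections instead of per-head (head_dim x head_dim) ones
        self.values = nn.Linear(embed_size, embed_size, bias=False)
        self.keys = nn.Linear(embed_size, embed_size, bias=False)
        self.queries = nn.Linear(embed_size, embed_size, bias=False)
        self.fc_out = nn.Linear(embed_size, embed_size)

    def forward(self, values, keys, queries, mask):
        N = queries.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], queries.shape[1]

        # Step 2: apply the projections *before* reshaping into heads
        values = self.values(values)
        keys = self.keys(keys)
        queries = self.queries(queries)

        # Split the embedding into heads: (N, seq_len, heads, head_dim)
        values = values.reshape(N, value_len, self.heads, self.head_dim)
        keys = keys.reshape(N, key_len, self.heads, self.head_dim)
        queries = queries.reshape(N, query_len, self.heads, self.head_dim)

        # Scaled dot-product attention, same einsum formulation as in the video
        energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])
        if mask is not None:
            energy = energy.masked_fill(mask == 0, float("-1e20"))
        attention = torch.softmax(energy / (self.embed_size ** 0.5), dim=3)

        out = torch.einsum("nhql,nlhd->nqhd", [attention, values])
        out = out.reshape(N, query_len, self.heads * self.head_dim)
        return self.fc_out(out)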

sehbanomer

Great tutorial! Thanks Aladdin <3. Just for anyone who is wondering, there should be a values = self.values(values) call (and the same for key and query) at 17:05, otherwise we end up with no trainable parameters in the attention block. Thanks!

gautamvashishtha

You're an absolute saint. I don't know if I can even put into words the amount of respect and appreciation I have for you, man! Thank you!

cc-tojn

I have been struggling to implement and understand custom transformer code from various sources. This was perhaps one of the best tutorials.

niranjansitapure

Making something sophisticated this easy and clear is what I call magic. Aladdin, you are truly the magician.

mykytahordia

45:59: WHOA! slow that down! Pause a sec, be emphatic if we're going to change something back up there

ScriptureFirst

This is cool. It would be helpful to have a section highlighting which dimensions need to change if you use a dataset of a different size or want to change the input length, i.e., keeping the architecture constant but noting how it can be used flexibly.
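
As a rough pointer in that direction, here is a hypothetical instantiation of the Transformer class built in the video (the parameter names follow the video's code; the values are placeholders). Only the first four arguments are tied to the dataset; the rest is architecture and can be kept constant:

# Hypothetical values; assumes the Transformer class from the video / linked GitHub repo.
model = Transformer(
    src_vocab_size=10_000,   # dataset-dependent: size of the source vocabulary
    trg_vocab_size=10_000,   # dataset-dependent: size of the target vocabulary
    src_pad_idx=0,           # dataset-dependent: padding index in the source
    trg_pad_idx=0,           # dataset-dependent: padding index in the target
    embed_size=256,          # model width; must be divisible by heads
    num_layers=6,            # number of encoder/decoder blocks
    forward_expansion=4,     # hidden-size multiplier in the feed-forward sublayer
    heads=8,                 # number of attention heads
    dropout=0.1,
    device="cpu",
    max_length=100,          # longest sequence the positional embedding supports
)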

chefmemesupreme

Hey Aladdin,
Really amazing videos brother!
This was the first video of yours that I stumbled upon and I fell in love with your channel. <3

Okay so my question is related to SelfAttention:
(same could be applied to Attention in general)
Let's say I have a batch of two padded sentences:
c = [[3, 9, 5, 2, 0],
     [4, 5, 6, 0, 0]]
0 indicates padding, obviously.
We are performing self-attention, passing the mask too.

The following outputs are from running the code in your video, using the exact variable names:

1. Output of energy:
tensor([[[[-3.6146e-02, -1.0844e-01, -6.0244e-02, -2.4098e-02, -1.0000e+20],
[-1.0844e-01, -3.2532e-01, -1.8073e-01, -7.2293e-02, -1.0000e+20],
[-6.0244e-02, -1.8073e-01, -1.0041e-01, -4.0163e-02, -1.0000e+20],
[-2.4098e-02, -7.2293e-02, -4.0163e-02, -1.6065e-02, -1.0000e+20],
[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, -1.0000e+20]]],


[[[-6.4260e-02, -8.0325e-02, -9.6390e-02, -1.0000e+20, -1.0000e+20],
[-8.0325e-02, -1.0041e-01, -1.2049e-01, -1.0000e+20, -1.0000e+20],
[-9.6390e-02, -1.2049e-01, -1.4459e-01, -1.0000e+20, -1.0000e+20],
[0.0000e+00, 0.0000e+00, 0.0000e+00, -1.0000e+20, -1.0000e+20],
[0.0000e+00, 0.0000e+00, 0.0000e+00, -1.0000e+20, -1.0000e+20]]]],
grad_fn=<MaskedFillBackward0>)

This output makes sense.
The last column of the first energy matrix is set to '-inf' (here -1e20), as we wanted, since the last word is padding.
Similarly, the last and second-to-last columns of the second energy matrix are set to '-inf'.

Another observation:
The last row of the first energy matrix is 0, and the last two rows of the second energy matrix are 0.
This makes sense, since those query words are padding, so their values are 0.


2. Output of attention: (softmax of energy with scaling factor)
tensor([[[[0.2552, 0.2374, 0.2491, 0.2583, 0.0000],
[0.2651, 0.2134, 0.2466, 0.2749, 0.0000],
[0.2586, 0.2292, 0.2484, 0.2638, 0.0000],
[0.2535, 0.2416, 0.2494, 0.2555, 0.0000],
[0.2500, 0.2500, 0.2500, 0.2500, 0.0000]]],


[[[0.3387, 0.3333, 0.3280, 0.0000, 0.0000],
[0.3400, 0.3333, 0.3267, 0.0000, 0.0000],
[0.3414, 0.3333, 0.3253, 0.0000, 0.0000],
[0.3333, 0.3333, 0.3333, 0.0000, 0.0000],
[0.3333, 0.3333, 0.3333, 0.0000, 0.0000]]]],
grad_fn=<SoftmaxBackward>)

This output also makes sense: after passing through softmax, the padded columns become 0.

3. Output of out:
tensor([[[2.8252],
[2.7250],
[2.7913],
[2.8424],
[2.8771]],

[[3.0221],
[3.0204],
[3.0188],
[3.0286],
[3.0286]]], grad_fn=<UnsafeViewBackward>)

In this output we are getting contextual embeddings for all the words of both sentences, including the padded words!
To address this, should we not apply the query mask again over the output to remove the activations of the padded words?
I have read a lot of implementations and none seem to address this. Can you please give me your two cents?
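
One way to do what the question suggests would be to multiply the output by a query-side padding mask. A minimal sketch (the function name and shapes are assumptions for illustration; whether this is needed in practice often depends on whether later stages, e.g. a loss with ignore_index, already ignore the padded positions):

import torch

def mask_padded_queries(out, query_pad_mask):
    # out: (N, query_len, embed_size); query_pad_mask: (N, query_len) with
    # 1 for real tokens and 0 for padding. Broadcasting zeroes the padded rows.
    return out * query_pad_mask.unsqueeze(-1)

# Toy check with the batch from the comment above (2 sentences, 5 tokens each):
out = torch.randn(2, 5, 4)
query_pad_mask = torch.tensor([[1, 1, 1, 1, 0],
                               [1, 1, 1, 0, 0]])
print(mask_padded_queries(out, query_pad_mask)[0, -1])  # all zeros for the padded word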

SahilKhose

For the dropout in your code, for example in DecoderBlock.forward, I think it should be:
query = self.norm(x + self.dropout(attention))

instead of:
query = self.dropout(self.norm(attention + x))

Here is the paper quote:
"We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized."

nicknguyen

This is one of the best paper-to-code explanation videos I've watched in a long time! Congrats, Aladdin dude!

TheClaxterix

Many thanks to you for this impressive tutorial, amazing job and outstanding explanation, and also thanks for sharing all these resources in the description.

FawzyBasily

Wow, the best Transformer tutorial I've seen.

marksaroufim

Yeah, agreed, this was an extremely difficult architecture to implement, with a LOT of moving parts, but this has to be the best walkthrough out there. Sure, there are certain things, like the src_mask unsqueeze, that were a little tricky to visualize, but even barring that, you broke it down quite well! Thank you for this! I'm so glad that we have all of this implemented in HF/PT, haha.
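
For anyone else who found the src_mask unsqueeze hard to visualize, a small runnable sketch (the pad index 0 and the example batch are assumptions):

import torch

# Mark non-pad tokens, then add two singleton dimensions so the mask
# broadcasts against the (N, heads, query_len, key_len) energy tensor.
src = torch.tensor([[3, 9, 5, 2, 0],
                    [4, 5, 6, 0, 0]])
src_pad_idx = 0
src_mask = (src != src_pad_idx).unsqueeze(1).unsqueeze(2)
print(src_mask.shape)  # torch.Size([2, 1, 1, 5])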

bingochipspass

Great explanation, much more helpful than theory-only explanations.

张晨雨-qj

It's an extremely useful video for researchers trying to implement code from papers. Do make a series implementing other machine learning methods described in other papers as well.
Please also make a video using this model on an actual NLP task, such as translation.

flamingflamingo