PyTorch Transformers from Scratch (Attention Is All You Need)

In this video we read the original Transformer paper, "Attention Is All You Need", and implement it from scratch!

Attention is all you need paper:

A good blogpost on Transformers:

GitHub Repository:

OUTLINE:
0:00 - Introduction
0:54 - Paper Review
11:20 - Attention Mechanism
27:00 - TransformerBlock
32:18 - Encoder
38:20 - DecoderBlock
42:00 - Decoder
46:55 - Putting it together to form The Transformer
52:45 - A Small Example
54:25 - Fixing Errors
56:44 - Ending
Comments

Here's the outline for the video:
0:00 - Introduction
0:54 - Paper Review
11:20 - Attention Mechanism
27:00 - TransformerBlock
32:18 - Encoder
38:20 - DecoderBlock
42:00 - Decoder
46:55 - Forming The Transformer
52:45 - A Small Example
54:25 - Fixing Errors
56:44 - Ending

AladdinPersson

Attention is not all we need, this video is all we need

rhronsky

I haven't found a tutorial this detail-oriented. Now I am completely able to understand the Transformer and the attention mechanism. Great work, thank you 😊

pratikhmanas

I watched 3 Transformer videos before this one and thought I would never understand it. Love the way you explained such a complicated topic.

bhargav

This is undoubtedly one of the best transformer implementation videos I have seen. Thanks for posting such good content. Looking forward to seeing some more paper implementation videos.

nishantraj

In the original paper each head should have separate weights, but in your code all heads share the same weights. Here are two steps to fix it:
1. In __init__: self.queries = nn.Linear(self.embed_size, self.embed_size, bias=False) (and the same for the key and value weights)
2. In forward: put "queries = self.queries(queries)" above "queries = queries.reshape(...)" (and the same for keys and values)
Great video btw
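
For reference, here is a minimal runnable sketch of the SelfAttention module with both steps above applied. This is an illustration that assumes the module structure and einsum formulation from the video, not the repository's exact code:

import torch
import torch.nn as nn


class SelfAttention(nn.Module):
    # Sketch of the fix: full embed_size x embed_size projections, applied
    # *before* splitting into heads, so each head gets its own slice of the weights.
    def __init__(self, embed_size, heads):
        super().__init__()
        assert embed_size % heads == 0, "embed_size must be divisible by heads"
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        # Step 1: full-size projections instead of per-head (head_dim x head_dim) ones
        self.values = nn.Linear(embed_size, embed_size, bias=False)
        self.keys = nn.Linear(embed_size, embed_size, bias=False)
        self.queries = nn.Linear(embed_size, embed_size, bias=False)
        self.fc_out = nn.Linear(embed_size, embed_size)

    def forward(self, values, keys, queries, mask):
        N = queries.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], queries.shape[1]

        # Step 2: apply the projections *before* reshaping into heads
        values = self.values(values)
        keys = self.keys(keys)
        queries = self.queries(queries)

        # Split the embedding into heads: (N, seq_len, heads, head_dim)
        values = values.reshape(N, value_len, self.heads, self.head_dim)
        keys = keys.reshape(N, key_len, self.heads, self.head_dim)
        queries = queries.reshape(N, query_len, self.heads, self.head_dim)

        # Scaled dot-product attention, same einsum formulation as in the video
        energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])
        if mask is not None:
            energy = energy.masked_fill(mask == 0, float("-1e20"))
        attention = torch.softmax(energy / (self.embed_size ** 0.5), dim=3)

        out = torch.einsum("nhql,nlhd->nqhd", [attention, values])
        out = out.reshape(N, query_len, self.heads * self.head_dim)
        return self.fc_out(out)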

sehbanomer

Great tutorial! Thanks Aladdin <3. Just for anyone who is wondering, there should be a values = self.values(values) call (and the same for key and query) at 17:05, otherwise we end up with no trainable parameters in the attention block. Thanks!

gautamvashishtha

You're an absolute saint. I don't know if I can even put into words the amount of respect and appreciation I have for you, man! Thank you!

cc-tojn

I have been struggling to implement and understand custom transformer code from various sources. This was perhaps one of the best tutorials.

niranjansitapure

Making something sophisticated this easy and clear is what I call magic. Aladdin, you are truly the magician.

mykytahordia

45:59: WHOA! slow that down! Pause a sec, be emphatic if we're going to change something back up there

ScriptureFirst

This is cool. It would be helpful to have a section highlighting which dimensions need to change if you use a dataset of a different size or want to change the input length, i.e., keeping the architecture constant but noting how it can be used flexibly.
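
As a rough pointer in that direction, here is a hypothetical instantiation of the Transformer class built in the video (the parameter names follow the video's code; the values are placeholders). Only the first four arguments are tied to the dataset; the rest is architecture and can be kept constant:

# Hypothetical values; assumes the Transformer class from the video / linked GitHub repo.
model = Transformer(
    src_vocab_size=10_000,   # dataset-dependent: size of the source vocabulary
    trg_vocab_size=10_000,   # dataset-dependent: size of the target vocabulary
    src_pad_idx=0,           # dataset-dependent: padding index in the source
    trg_pad_idx=0,           # dataset-dependent: padding index in the target
    embed_size=256,          # model width; must be divisible by heads
    num_layers=6,            # number of encoder/decoder blocks
    forward_expansion=4,     # hidden-size multiplier in the feed-forward sublayer
    heads=8,                 # number of attention heads
    dropout=0.1,
    device="cpu",
    max_length=100,          # longest sequence the positional embedding supports
)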

chefmemesupreme

Hey Aladdin,
Really amazing videos brother!
This was the first video of yours that I stumbled upon and I fell in love with your channel. <3

Okay so my question is related to SelfAttention:
(same could be applied to Attention in general)
Let's say I have a batch of two padded sentences:
c = [[3, 9, 5, 2, 0],
     [4, 5, 6, 0, 0]]
0 indicates padding, obviously.
We are performing self-attention, passing the mask too.

The following outputs are from running the code in your video, using the exact variable names:

1. Output of energy:
tensor([[[[-3.6146e-02, -1.0844e-01, -6.0244e-02, -2.4098e-02, -1.0000e+20],
[-1.0844e-01, -3.2532e-01, -1.8073e-01, -7.2293e-02, -1.0000e+20],
[-6.0244e-02, -1.8073e-01, -1.0041e-01, -4.0163e-02, -1.0000e+20],
[-2.4098e-02, -7.2293e-02, -4.0163e-02, -1.6065e-02, -1.0000e+20],
[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, -1.0000e+20]]],


[[[-6.4260e-02, -8.0325e-02, -9.6390e-02, -1.0000e+20, -1.0000e+20],
[-8.0325e-02, -1.0041e-01, -1.2049e-01, -1.0000e+20, -1.0000e+20],
[-9.6390e-02, -1.2049e-01, -1.4459e-01, -1.0000e+20, -1.0000e+20],
[0.0000e+00, 0.0000e+00, 0.0000e+00, -1.0000e+20, -1.0000e+20],
[0.0000e+00, 0.0000e+00, 0.0000e+00, -1.0000e+20, -1.0000e+20]]]],
grad_fn=<MaskedFillBackward0>)

This output makes sense.
The last column of the first energy matrix is set to '-inf' (here -1e20), as we wanted, since the last word is padding.
Similarly, the last and second-to-last columns of the second energy matrix are set to '-inf'.

Another observation:
The last row of the first energy matrix is 0, and the last two rows of the second energy matrix are 0.
This makes sense, since those query words are padding, so their values are 0.


2. Output of attention: (softmax of energy with scaling factor)
tensor([[[[0.2552, 0.2374, 0.2491, 0.2583, 0.0000],
[0.2651, 0.2134, 0.2466, 0.2749, 0.0000],
[0.2586, 0.2292, 0.2484, 0.2638, 0.0000],
[0.2535, 0.2416, 0.2494, 0.2555, 0.0000],
[0.2500, 0.2500, 0.2500, 0.2500, 0.0000]]],


[[[0.3387, 0.3333, 0.3280, 0.0000, 0.0000],
[0.3400, 0.3333, 0.3267, 0.0000, 0.0000],
[0.3414, 0.3333, 0.3253, 0.0000, 0.0000],
[0.3333, 0.3333, 0.3333, 0.0000, 0.0000],
[0.3333, 0.3333, 0.3333, 0.0000, 0.0000]]]],
grad_fn=<SoftmaxBackward>)

This output also makes sense: after passing through softmax, the padded columns become 0.

3. Output of out:
tensor([[[2.8252],
[2.7250],
[2.7913],
[2.8424],
[2.8771]],

[[3.0221],
[3.0204],
[3.0188],
[3.0286],
[3.0286]]], grad_fn=<UnsafeViewBackward>)

In this output we are getting contextual embeddings for all the words of both sentences, including the padded words!
To address this, should we not apply the query mask again over the output to remove the activations of the padded words?
I have read a lot of implementations and none seem to address this. Can you please give me your two cents?
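
One way to do what the question suggests would be to multiply the output by a query-side padding mask. A minimal sketch (the function name and shapes are assumptions for illustration; whether this is needed in practice often depends on whether later stages, e.g. a loss with ignore_index, already ignore the padded positions):

import torch

def mask_padded_queries(out, query_pad_mask):
    # out: (N, query_len, embed_size); query_pad_mask: (N, query_len) with
    # 1 for real tokens and 0 for padding. Broadcasting zeroes the padded rows.
    return out * query_pad_mask.unsqueeze(-1)

# Toy check with the batch from the comment above (2 sentences, 5 tokens each):
out = torch.randn(2, 5, 4)
query_pad_mask = torch.tensor([[1, 1, 1, 1, 0],
                               [1, 1, 1, 0, 0]])
print(mask_padded_queries(out, query_pad_mask)[0, -1])  # all zeros for the padded word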

SahilKhose

For the dropout in your code, for example in DecoderBlock.forward, I think it should be:
query = self.norm(x + self.dropout(attention))

instead of:
query = self.dropout(self.norm(attention + x))

Here is the paper quote:
"We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized."

nicknguyen

This is one of the best paper-to-code explanation videos I've watched in a long time! Congrats, Aladdin dude!

TheClaxterix

Many thanks to you for this impressive tutorial, amazing job and outstanding explanation, and also thanks for sharing all these resources in the description.

FawzyBasily

Wow, the best Transformer tutorial I've seen.

marksaroufim

Yeah, agreed, this was an extremely difficult architecture to implement, with a LOT of moving parts, but this has to be the best walkthrough out there. Sure, there are certain things, like the src_mask unsqueeze, that were a little tricky to visualize, but even barring that, you broke it down quite well! Thank you for this! I'm so glad that we have all of this implemented in HF/PT, haha.
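
For anyone else who found the src_mask unsqueeze hard to visualize, a small runnable sketch (the pad index 0 and the example batch are assumptions):

import torch

# Mark non-pad tokens, then add two singleton dimensions so the mask
# broadcasts against the (N, heads, query_len, key_len) energy tensor.
src = torch.tensor([[3, 9, 5, 2, 0],
                    [4, 5, 6, 0, 0]])
src_pad_idx = 0
src_mask = (src != src_pad_idx).unsqueeze(1).unsqueeze(2)
print(src_mask.shape)  # torch.Size([2, 1, 1, 5])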

bingochipspass

Great explanation, much more helpful than theory-only explanations.

张晨雨-qj

It's an extremely useful video for researchers trying to implement code from papers. Do make a series implementing other machine learning methods described in other papers as well.
Please also make a video using this model on an actual NLP task, such as translation.

flamingflamingo