Transformer Encoder in 100 lines of code!

ABOUT ME

RESOURCES

PLAYLISTS FROM MY CHANNEL

MATH COURSES (7 day free trial)

OTHER RELATED COURSES (7 day free trial)

TIMESTAMPS
0:00 What we will cover
0:53 Introducing Colab
1:24 Word Embeddings and d_model
3:00 What are Attention heads?
3:59 What is Dropout?
4:59 Why batch data?
7:46 How do sentences go into the transformer?
9:03 Why feed forward layers in the transformer?
9:44 Why repeat Encoder layers?
11:00 The “Encoder” Class, nn.Module, nn.Sequential
14:38 The “EncoderLayer” Class
17:45 What is Attention: Query, Key, Value vectors
20:03 What is Attention: Matrix Transpose in PyTorch
21:17 What is Attention: Scaling
23:09 What is Attention: Masking
24:53 What is Attention: Softmax
25:42 What is Attention: Value Tensors
26:22 CRUX OF VIDEO: “MultiHeadAttention” Class
36:27 Returning the flow back to “EncoderLayer” Class
37:12 Layer Normalization
43:17 Returning the flow back to “EncoderLayer” Class
43:44 Feed Forward Layers
44:24 Why Activation Functions?
46:03 Finish the Flow of Encoder
48:03 Conclusion & Decoder for next video
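
Before the comments, here is a rough orientation for the flow the timestamps walk through: self-attention, add & norm, feed forward, add & norm, repeated N times. This is only a minimal sketch; names like EncoderLayer, d_model, and ffn_hidden are illustrative, and it uses PyTorch's built-in nn.MultiheadAttention rather than the hand-written class built in the video.

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Minimal sketch of one encoder layer: attention -> add & norm -> FFN -> add & norm."""
    def __init__(self, d_model=512, num_heads=8, ffn_hidden=2048, drop_prob=0.1):
        super().__init__()
        # built-in multi-head attention stands in for the hand-written class in the video
        self.attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(drop_prob)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ffn_hidden),
            nn.ReLU(),
            nn.Dropout(drop_prob),
            nn.Linear(ffn_hidden, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(drop_prob)

    def forward(self, x):
        residual = x
        x, _ = self.attention(x, x, x)                 # self-attention: query = key = value = x
        x = self.norm1(residual + self.dropout1(x))    # add & norm
        residual = x
        x = self.ffn(x)                                # position-wise feed forward
        x = self.norm2(residual + self.dropout2(x))    # add & norm
        return x

# Stacking identical layers gives the full encoder
encoder = nn.Sequential(*[EncoderLayer() for _ in range(5)])
out = encoder(torch.randn(30, 200, 512))   # (batch, max_sequence_length, d_model)
print(out.shape)                           # torch.Size([30, 200, 512])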
Comments

If you think I deserve it, please consider hitting the like button and subscribe for more content like this :)

CodeEmporium

Best video out there for encoders, especially for beginners!

michellekelly-eejj

Next level video *especially* because of the dimensions laid out and giving intuition for things like k.transpose(-1, -2). Likely the best resource out right now!! Thanks for all your work!

sushantmehta
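
To make the k.transpose(-1, -2) intuition from the comment above concrete, here is a tiny shape walk-through of scaled dot-product attention. The dimensions (batch 30, heads 8, sequence length 200, head size 64) follow the video, but the snippet itself is just an illustrative sketch.

import math
import torch
import torch.nn.functional as F

q = torch.randn(30, 8, 200, 64)   # (batch, heads, seq_len, head_dim)
k = torch.randn(30, 8, 200, 64)
v = torch.randn(30, 8, 200, 64)

# transpose only the last two dims, so k becomes (30, 8, 64, 200)
scores = q @ k.transpose(-1, -2) / math.sqrt(q.size(-1))   # (30, 8, 200, 200)
attention = F.softmax(scores, dim=-1)                       # each row sums to 1
out = attention @ v                                          # (30, 8, 200, 64)
print(scores.shape, out.shape)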

Best video on encoders. Backtracking through the encoder concept like a top-down approach is really amazing and makes it easy to understand.

shubhamgattani

This is the best explanation I have gone through

surajgorai

This is the most detailed Transformer video, THANK YOU!
I have one question: the values tensor is [30, 8, 200, 64]; before we reshape it, shouldn't we permute it first? Like:
values = values.permute(0, 2, 1, 3).reshape(batch_size, max_sequence_length, self.num_heads * self.head_dim)

AnthonyY-oq
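
On the permute question above: merging the heads back together does usually involve a permute before the reshape, so that each token's 8 head outputs sit next to each other before being flattened into a 512-dim vector. A hedged sketch of that step (variable names are illustrative):

import torch

batch_size, num_heads, max_sequence_length, head_dim = 30, 8, 200, 64
values = torch.randn(batch_size, num_heads, max_sequence_length, head_dim)

# bring the sequence dimension before the head dimension: (30, 200, 8, 64)
values = values.permute(0, 2, 1, 3).contiguous()
# now the 8 * 64 = 512 head outputs of each token are adjacent and can be merged
values = values.reshape(batch_size, max_sequence_length, num_heads * head_dim)
print(values.shape)   # torch.Size([30, 200, 512])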

Superb and so love these classes! Will watch all of them one by one

jingcheng

It's really helpful that you are going through all the sizes of the various vectors and matrices.

wryltxw

Immense amount of effort put into video. Really appreciate the explanation especially keeping in mind the PyTorch aspect for beginners. Showing details like tensor dimensions throughout the code is just next level. Keep these videos coming.

aamirbadershah

bro... i love how u dive deep into explanations. You're a very good teacher holy shit

moseslee

I watched the entire series and it gave me a deeper understanding of how all of this works. Very well done!!!! Takes a real master to take a complex topic and break it down in such a consumable way. I do have one question: what is the point of the permute? Can we not specify the shape we want in the reshape call?

danielbrooks
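
Regarding the question above about skipping the permute: reshape alone can produce the target shape, but it only regroups elements in memory order, so without the permute each token's concatenated vector would mix values from different sequence positions instead of stacking that token's heads. A small illustration (not from the video):

import torch

x = torch.arange(2 * 2 * 3).reshape(1, 2, 2, 3)     # (batch=1, heads=2, seq_len=2, head_dim=3)

wrong = x.reshape(1, 2, 6)                           # just regroups memory order
right = x.permute(0, 2, 1, 3).reshape(1, 2, 6)       # concatenates both heads per token

print(wrong[0, 0])   # tensor([0, 1, 2, 3, 4, 5])  -> head 0's outputs for tokens 0 and 1 mixed together
print(right[0, 0])   # tensor([0, 1, 2, 6, 7, 8])  -> head 0 and head 1 outputs for token 0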

You are awesome. The way you teach is incredible.

ulmwfue

This video was really informative. Thank you for all the detailed explanations!

seyedmatintavakoliafshari

@CodeEmporium
The transformer series is awesome!
It is very informative.
I have one comment: it is usually recommended to perform dropout before normalization layers. This is because normalization layers may undo dropout effects by re-scaling the input. By performing dropout before normalization, we ensure that the inputs to the normalization layer are still diverse and have different scales.

salemibrahim
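
For concreteness, the ordering suggested in the comment above (dropout applied to the sub-layer output, then add & norm, which is also what the original Transformer paper describes) looks roughly like this. A minimal sketch, not the video's exact code:

import torch
import torch.nn as nn

d_model = 512
dropout = nn.Dropout(p=0.1)
norm = nn.LayerNorm(d_model)

def sublayer_connection(x, sublayer):
    # dropout hits the sub-layer output *before* the residual add and the LayerNorm
    return norm(x + dropout(sublayer(x)))

x = torch.randn(30, 200, d_model)
out = sublayer_connection(x, nn.Linear(d_model, d_model))
print(out.shape)   # torch.Size([30, 200, 512])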

Thank you, I am going through all your videos. Great work!

pierrelebreton

Very clear, useful and helpful explanation! Thank you!

gigabytechanz

Appreciate your work! As someone else mentioned, hope you can do an implementation of training the network for a few iterations.

KurtGr

Hi Ajay. I think we need to make a small change in the forward() function of the encoder class. We should be doing `x_residual = x.clone() # or x_residual = x[:]` instead of `x_residual = x`. This will ensure that x_residual contains a copy of the original x and is not affected by any changes made to x.

prashantlawhatre
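
A small aside on the clone() suggestion above: whether the copy matters depends on whether x is later modified in place or simply rebound to a new tensor by out-of-place ops. A minimal illustration (not from the video):

import torch

x = torch.ones(3)
residual = x          # residual and x point to the same tensor
x = x + 1             # out-of-place op rebinds x; residual still holds the old values
print(residual)       # tensor([1., 1., 1.])

x = torch.ones(3)
residual = x
x += 1                # in-place op mutates the shared tensor; residual changes too
print(residual)       # tensor([2., 2., 2.]) -- here residual = x.clone() would preserve the original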

Awesome content as always! Are you planning to demonstrate training the encoder in the next video? For example, on a Wikipedia data sample or something like that?

TransalpDave

Thanks for the great series. Would be very helpful if you'd attach the Colab.

chenmargalit