Blowing up Transformer Decoder architecture

TIMESTAMPS
0:00 Introduction
2:00 What is the Encoder doing?
3:30 Text Processing
5:05 Why are we batching data?
6:03 Position Encoding
6:34 Query, Key and Value Tensors
7:57 Masked Multi Head Self Attention
15:30 Residual Connections
17:47 Multi Head Cross Attention
21:25 Finishing up the Decoder Layer
22:17 Training the Transformer
24:33 Inference for the Transformer
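
A minimal PyTorch sketch of the decoder-layer steps listed above (masked self-attention with residual connections, cross-attention over the encoder output, and a feed-forward block). The module names, dimensions, and the use of nn.MultiheadAttention are illustrative assumptions, not the presenter's actual code:

import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    # Masked self-attention -> cross-attention -> feed-forward,
    # each followed by a residual connection and layer normalization.
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, y, enc_out, causal_mask):
        a, _ = self.self_attn(y, y, y, attn_mask=causal_mask)   # masked multi-head self-attention
        y = self.norm1(y + a)                                   # Add & Norm 1 (residual connection)
        a, _ = self.cross_attn(y, enc_out, enc_out)             # queries from decoder, keys/values from encoder
        y = self.norm2(y + a)                                   # Add & Norm 2
        y = self.norm3(y + self.ff(y))                          # feed-forward, then Add & Norm 3
        return y

layer = DecoderLayer()
tgt = torch.randn(2, 5, 512)                                    # (batch, target length, d_model)
enc = torch.randn(2, 7, 512)                                    # encoder output: (batch, source length, d_model)
mask = torch.triu(torch.ones(5, 5, dtype=torch.bool), diagonal=1)
print(layer(tgt, enc, mask).shape)                              # torch.Size([2, 5, 512])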
Comments

I've been closely following the Transformer playlist, which has greatly helped in my comprehension of the Transformer Architecture. Your excellent work is evident, and I can truly appreciate the dedication you've shown in simplifying complex concepts. Your approach of deconstructing intricate ideas into manageable steps is truly praiseworthy. I also find it highly valuable how you begin each video with an overview of the entire architecture and contextualize the current steps within it. Your efforts are genuinely commendable, and I'm sincerely grateful for your contributions. Thank you.

ahmadfaraz

Mind BLOWING.. lucky enough to find your lectures.

SarvaniChinthapalli

Your drawing skill is actually amazing!

JoeChang

Man you're a pure treasure! Keep up this outstanding work! 🙏🏼

galileo

Best drawing to explain this concept 👏🏼👏🏼👏🏼

MapumbaPaulus

You are really great at articulation. Thank you 😇

MaheshKumar-bnq

Truly amazing video. I have read the original paper, but this video definitely helped me understand it better, especially the way you visualize the whole architecture.

limbenny

Can you explain in another video some examples of the Q, K and V vectors? It is still confusing to me what they represent.

jonfe
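
A rough illustration of the question above: each token embedding is passed through three learned linear layers to produce its query, key, and value vectors; queries are compared with keys to get attention weights, and the output is a weighted sum of values. The dimensions below are assumptions for a single unbatched sentence:

import torch
import torch.nn as nn

d_model, seq_len = 512, 6
x = torch.randn(seq_len, d_model)              # token embeddings for one sentence

W_q = nn.Linear(d_model, d_model, bias=False)  # learned projection producing queries
W_k = nn.Linear(d_model, d_model, bias=False)  # learned projection producing keys
W_v = nn.Linear(d_model, d_model, bias=False)  # learned projection producing values

Q, K, V = W_q(x), W_k(x), W_v(x)               # each is (seq_len, d_model)
scores = Q @ K.T / d_model ** 0.5              # (seq_len, seq_len): how much each query matches each key
weights = scores.softmax(dim=-1)               # each row sums to 1
out = weights @ V                              # each output row is a weighted mix of value vectors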

Great video !!! Clear explanation about dimensions and the whole process.

lathashreeh

Will you make a video on transformers using a vision transformer + transformer decoder for image captioning?

tiffanyk

Thank you! Your video taught me a lot.

任晶-lo

Illustrating your explanations with code actually provides much deeper insights. Thanks, man! Quick note on this video: I was wondering why you haven't included the "output embeddings" in your sketch of the decoder?

nicolasdr
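
On the output-embedding question above: in the original architecture the target tokens are first mapped through an embedding layer plus positional encodings before entering the first decoder layer. A small sketch under assumed sizes (a learned positional embedding is used here for brevity; the original paper uses sinusoidal encodings):

import torch
import torch.nn as nn

vocab_size, d_model, max_len = 3000, 512, 100
tok_emb = nn.Embedding(vocab_size, d_model)    # the "output embedding" for target tokens
pos_emb = nn.Embedding(max_len, d_model)       # positional information (learned here for brevity)

tgt_ids = torch.tensor([[1, 5, 17, 42]])       # hypothetical target token ids: (batch=1, tgt_len=4)
positions = torch.arange(tgt_ids.size(1)).unsqueeze(0)
decoder_in = tok_emb(tgt_ids) + pos_emb(positions)   # (1, 4, 512), fed to the first decoder layer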

This is Awesome!!!!
thank you so much for the

lakshman

7:00 I feel as though the implementations that just repeat the Q, K, V matrices are making a mistake, mostly because the purpose of multi-head attention is to learn different attentions, right? In the attention blocks, the linear layers / learnable parameters are at the beginning for each of Q, K and V, then one big one after the heads are concatenated, so without the individual ones at the beginning (I'm assuming each initialized to random values) I believe the multiple heads would be useless. Thoughts or corrections?

philipbutler
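
For context on the comment above, a sketch of one common arrangement: separate learned projections for Q, K and V at the start, reshaped so that each head works with its own distinct parameter slice (rather than a repeated copy), followed by the single large output projection after the heads are concatenated. All sizes are illustrative assumptions:

import torch
import torch.nn as nn

batch, seq_len, d_model, num_heads = 2, 10, 512, 8
head_dim = d_model // num_heads
x = torch.randn(batch, seq_len, d_model)

# Learned projections at the start, one each for Q, K and V.
q_proj = nn.Linear(d_model, d_model)
k_proj = nn.Linear(d_model, d_model)
v_proj = nn.Linear(d_model, d_model)
out_proj = nn.Linear(d_model, d_model)         # the single large linear after concatenating heads

def split_heads(t):
    # (batch, seq, d_model) -> (batch, heads, seq, head_dim): each head gets its own slice of parameters
    return t.reshape(batch, seq_len, num_heads, head_dim).transpose(1, 2)

q, k, v = split_heads(q_proj(x)), split_heads(k_proj(x)), split_heads(v_proj(x))
attn = (q @ k.transpose(-2, -1) / head_dim ** 0.5).softmax(dim=-1)   # (batch, heads, seq, seq)
heads = (attn @ v).transpose(1, 2).reshape(batch, seq_len, d_model)  # concatenate heads
y = out_proj(heads)                                                  # (batch, seq, d_model)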

One thing I don't understand is that at 20:35, the matrix obtained by multiplying the cross-attention matrix, derived from the encoder, with the v matrix is said to represent one English word per row. But the q part of the cross-attention matrix comes from the Kannada sentences in the masked attention, so shouldn't each row of the resulting matrix correspond to a Kannada word?

jackwoo
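
A shape-level sketch of the cross-attention step discussed above: queries come from the decoder (target/Kannada side) and keys/values from the encoder output (source/English side), so the result has one row per target position, each row being a weighted mixture of source value vectors. Lengths are illustrative assumptions:

import torch

d_model, src_len, tgt_len = 512, 7, 5           # e.g. 7 English tokens, 5 Kannada tokens
q = torch.randn(tgt_len, d_model)               # queries: output of the masked self-attention (target side)
k = torch.randn(src_len, d_model)               # keys: encoder output (source side)
v = torch.randn(src_len, d_model)               # values: encoder output (source side)

weights = (q @ k.T / d_model ** 0.5).softmax(dim=-1)   # (tgt_len, src_len): target rows, source columns
out = weights @ v                                       # (tgt_len, d_model): one row per target position
print(out.shape)                                        # torch.Size([5, 512])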

Great work Ajay, Can you share the diagram link which you have showed in the video?

sandhyas

While we have yet to translate the sentence into Kannada, how can we pass it to the decoder??

AbdulRahman-tjwc
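
Regarding the question above: during training the known target sentence is used with teacher forcing; it is shifted right behind a start token, and a causal mask keeps each position from seeing future tokens. A sketch with hypothetical token ids:

import torch

START, END = 1, 2                                  # hypothetical special token ids
target_ids = torch.tensor([5, 17, 42, 8, END])     # hypothetical Kannada token ids for one sentence

decoder_input = torch.cat([torch.tensor([START]), target_ids[:-1]])   # target shifted right
labels = target_ids                                                    # what the decoder must predict

T = decoder_input.size(0)
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # blocks attention to future positions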

Thank you for all the videos about the transformer. Although I understood the architecture, I still don't know what to use as the decoder input (embedded target) and mask for the TEST phase.

sarahgh
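
One way the question above is usually handled at test time: start the decoder input with only the start token, predict the next token, append it, and repeat until an end token; the causal mask simply grows with the generated sequence. The model interface and token ids below are assumptions:

import torch

# Greedy decoding sketch. `model(src_ids, tgt_ids)` is assumed to return logits of
# shape (tgt_len, vocab_size); start_id / end_id are assumed special token ids.
def greedy_decode(model, src_ids, start_id, end_id, max_len=50):
    tgt = torch.tensor([start_id])                  # begin with only the start token
    for _ in range(max_len):
        logits = model(src_ids, tgt)                # the decoder only sees tokens generated so far
        next_id = logits[-1].argmax().item()        # pick the most likely next token
        tgt = torch.cat([tgt, torch.tensor([next_id])])
        if next_id == end_id:
            break
    return tgt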

Great work indeed. It helped clear up a lot of things, especially the part where softmax is used for the decoder output. So the first row will output the first word of the target language. But in scenarios where two source words correspond to one target-language word, how does softmax handle that? Can you please help me figure this out?

hajrawaheed
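
On the softmax question above: the softmax at the decoder output is applied independently to each row (each target position) over the target vocabulary, so every target position gets its own probability distribution regardless of how many source words it attended to. A small sketch with assumed sizes:

import torch
import torch.nn as nn

d_model, vocab_size, tgt_len = 512, 3000, 5
decoder_out = torch.randn(tgt_len, d_model)        # one row per target position

to_vocab = nn.Linear(d_model, vocab_size)          # final projection to vocabulary logits
probs = to_vocab(decoder_out).softmax(dim=-1)      # softmax applied per row, over the vocabulary
print(probs.shape, probs.sum(dim=-1))              # torch.Size([5, 3000]); each row sums to 1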

At the end of the decoder block, isn't there supposed to be another "Add & Norm" operation as in the architecture? Did he miss it?

supremachine
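
On the last question: the original architecture does apply a third Add & Norm after the feed-forward block inside each decoder layer, as in this minimal fragment (shapes are assumed):

import torch
import torch.nn as nn

d_model = 512
ff = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
norm3 = nn.LayerNorm(d_model)

y = torch.randn(2, 5, d_model)    # output of the cross-attention sub-layer
y = norm3(y + ff(y))              # the third residual connection + layer norm in the decoder layer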