Self-attention in deep learning (transformers) - Part 1

Self-attention in deep learning (transformers)

Self-attention is very commonly used in deep learning these days. For example, it is one of the main building blocks of the Transformer architecture ("Attention Is All You Need"), which is fast becoming the go-to deep learning architecture for many problems in both computer vision and natural language processing. Additionally, well-known models such as BERT, GPT, XLM, and the Performer use some variation of the Transformer, which in turn is built from self-attention layers.

So this video is about understanding a simplified version of the attention mechanism in deep learning.
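
As a rough sketch (not code from the video), the simplified, weight-free self-attention discussed here can be written in a few lines of numpy; the names X, W and Y and the sizes are illustrative only:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 10))     # 4 input embeddings x1..x4, 10 features each (illustrative)

scores = X @ X.T                 # raw attention scores: all pairwise dot products, shape (4, 4)

weights = np.exp(scores)
weights = weights / weights.sum(axis=1, keepdims=True)   # row-wise softmax: each row sums to 1

Y = weights @ X                  # each output is a weighted mix of all inputs, shape (4, 10)
print(Y.shape)                   # (4, 10)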

Note: This is part 1 in the series of videos about Transformers.

Comments

Best video on self-attention I've seen so far

StratosFair

Very good. After this video, I'm starting to understand self-attention.

patrickctrf

Thanks for the tutorial! I didn't get two things, though:
Why is the matrix W* not a symmetric matrix (the dot product is a commutative operation)?
And why is the matrix W* not normalised?

seraf_in_._._._._._._._._._._
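
On the symmetry and normalisation question above: in the simplified, weight-free formulation the raw score matrix of plain dot products is indeed symmetric; it is the row-wise softmax normalisation (and, in full Transformer attention, the separate trainable query/key projections) that makes the final weight matrix non-symmetric. A quick numpy check with made-up values, not the video's numbers:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 10))

scores = X @ X.T                                         # plain pairwise dot products
print(np.allclose(scores, scores.T))                     # True: the raw score matrix is symmetric

weights = np.exp(scores)
weights = weights / weights.sum(axis=1, keepdims=True)   # row-wise softmax
print(np.allclose(weights, weights.T))                   # generally False: each row is normalised on its own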

Good video, but the background music is really distracting. I wish you could remove it.

kanaipathak

I think starting with a high-level intuition would be really helpful.

wryltxw

Thanks for a good tutorial.
The dot product of the x_1 and x_2 vectors looks wrong: it should be around 0.41 instead of 0.21. Hope this helps you correct your amazing tutorial on self-attention.

LidoList

I have a question. After embedding we still have the same inputs x1->x4, and say each has dimension 1x10, i.e. 10 features each, so W* is 4x4, right? My question is: X is 4x10 and X^T is 10x4, so how do we compute the product W . X^T when the dimensions are (4x4) . (10x4)? Or am I missing something?

M_Nagy_
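
On the shapes in the question above: in the standard simplified formulation the output is computed as W . X rather than W . X^T, so the dimensions line up. A minimal numpy check with the same illustrative sizes (4 inputs, 10 features each); this is a sketch, not the video's exact computation:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 10))   # 4 embedded inputs, 10 features each

W = X @ X.T                    # (4, 10) @ (10, 4) -> (4, 4) score matrix
Y = W @ X                      # (4, 4)  @ (4, 10) -> (4, 10) outputs
print(W.shape, Y.shape)        # (4, 4) (4, 10)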

Before computers, people used to work through a text by hand: compile a list of every individual word in the text along with its number of occurrences, its frequency.
Then compile a concordance dictionary: a KEY and an ENTRY for each individual word, where the body of the ENTRY is the set of left- and/or right-contexts the keyword has been observed in. Usually the left context is enough.
Now, if two KEYs can appear in the same context, they have something in common: they share meaning by distributional semantics. And the more frequently they are observed to do that, the stronger the sharing.
==> And the longer the shared context, the closer the two words resemble each other.
==> This works great for words that are neither too frequent (a, the, some, is, ...) nor too infrequent. There will be a ton of words that appear only once, so the longer the text the better; tiny toy texts like (only) the Old Testament or the Qur'an are just too short, though still useful pedagogically.
==> You need astronomically long texts, like the whole internet that GPT-3 uses.

This approach works great for synthetic text generation, and it's good because both the dictionary and the Python code can be studied, in great contrast to trained neural networks, which amount to black boxes: you can't see what's inside and don't really know what's happening.
==> Whatever *is* happening in the neural network, something analogous is happening in the n-gram model,
and you can see, analyze, debug and improve the dictionary/Python, but that's not so easy for the NN.

rdurian
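
A minimal Python sketch of the concordance idea described in the comment above; the toy text, tokenisation, and context length are all made up for illustration:

from collections import Counter, defaultdict

text = "the cat sat on the mat and the dog sat on the rug"
words = text.split()

freq = Counter(words)                      # frequency of every individual word

left_context = defaultdict(set)            # concordance: KEY word -> set of left-contexts
n = 2                                      # context length (illustrative)
for i, key in enumerate(words):
    left_context[key].add(tuple(words[max(0, i - n):i]))

# two KEYs sharing a left-context share some meaning by distributional semantics
shared = left_context["mat"] & left_context["rug"]
print(freq["the"], shared)                 # 4 {('on', 'the')}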

Three years ago, the dimensionality of the embeddings was around 100, as you say. Today, in 2024, the smallest you can expect is in the thousands.

sairaj

How does multiplying the input matrix with W* give information about context?

c.nbhaskar

22 years ago I saw the clear need for this, but I was only working with simple backprop. Oh well, too old, too late... LOL

NoferTrunions

Please add a link to the next part of your series to this video's description!
Otherwise, many thanks for this wonderful video.

Peebuttnutter

Order doesn't matter in dot products: x1.x2 = x2.x1.

pinakigupta

Shouldn't w12 and w21 be the same, due to the basic maths behind dot products?

solvinglife

While finding W12, should we multiply each pair of elements one by one and add them together?
For example, 0.06 * 0.60 + 0.86 * 0.34 + ... =
In that case, shouldn't we get 0.4325 instead of 0.21? How do we get 0.21? Can you please clarify?

mertolcaman
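
On the arithmetic in the question above: yes, w12 is the element-wise product of the two vectors summed up, i.e. an ordinary dot product. A tiny check with made-up values loosely based on the numbers in the comment (not the exact vectors from the video, so the result matches neither 0.21 nor 0.4325):

import numpy as np

x1 = np.array([0.06, 0.86, 0.12])          # made-up example values
x2 = np.array([0.60, 0.34, 0.50])

w12 = sum(a * b for a, b in zip(x1, x2))   # multiply element by element, then add up
print(w12, np.dot(x1, x2))                 # both give the same number (here about 0.388)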

This is the best; it's a shame you didn't continue with trainable attention.

martian._

Why put background music in videos? It's a curse for good educational videos!
Many educational creators have already ditched this curse.

soch.original

Bro, get your math correct first. I always wonder why newbies make YouTube videos. Please stop it; you make it tougher to find an understandable video on YouTube.

saurabhmahra