Self-attention in deep learning (transformers) - Part 1

Self-attention in deep learning (transformers)

Self-attention is very commonly used in deep learning these days. For example, it is one of the main building blocks of the Transformer architecture ("Attention Is All You Need"), which is fast becoming the go-to deep learning architecture for many problems in both computer vision and natural language processing. Additionally, well-known models such as BERT, GPT, XLM, and the Performer use some variation of the Transformer, which in turn is built from self-attention layers.

So this video is about understanding a simplified version of the attention mechanism in deep learning.
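
As a rough sketch (not code from the video), the simplified, weight-free self-attention discussed here can be written in a few lines of numpy; the names X, W and Y and the sizes are illustrative only:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 10))     # 4 input embeddings x1..x4, 10 features each (illustrative)

scores = X @ X.T                 # raw attention scores: all pairwise dot products, shape (4, 4)

weights = np.exp(scores)
weights = weights / weights.sum(axis=1, keepdims=True)   # row-wise softmax: each row sums to 1

Y = weights @ X                  # each output is a weighted mix of all inputs, shape (4, 10)
print(Y.shape)                   # (4, 10)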

Note: This is part 1 in the series of videos about Transformers.

Comments

Best video on self-attention I've seen so far

StratosFair

Very good. After this video, I'm starting to understand self-attention.

patrickctrf

Thanks for the tutorial! I didn't get two things, though:
Why is the matrix W* not a symmetric matrix (the dot product is a commutative operation)?
And why is the matrix W* not normalised?

seraf_in_._._._._._._._._._._
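
On the symmetry and normalisation question above: in the simplified, weight-free formulation the raw score matrix of plain dot products is indeed symmetric; it is the row-wise softmax normalisation (and, in full Transformer attention, the separate trainable query/key projections) that makes the final weight matrix non-symmetric. A quick numpy check with made-up values, not the video's numbers:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 10))

scores = X @ X.T                                         # plain pairwise dot products
print(np.allclose(scores, scores.T))                     # True: the raw score matrix is symmetric

weights = np.exp(scores)
weights = weights / weights.sum(axis=1, keepdims=True)   # row-wise softmax
print(np.allclose(weights, weights.T))                   # generally False: each row is normalised on its own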

Good video, but the background music is really distracting. I wish you could remove it.

kanaipathak

I think starting with a high-level intuition would be really helpful.

wryltxw

Thanks for a good tutorial.
The dot product of the x_1 and x_2 vectors looks wrong: it should be around 0.41 instead of 0.21. Hope this helps you correct your amazing tutorial on self-attention.

LidoList

I have a question. After embedding we still have the same inputs x1->x4, and say each has dimension 1x10, i.e. 10 features each, so W* is 4x4, right? My question is: X is 4x10 and X^T is 10x4, so how do we compute the product W . X^T when the dimensions are (4x4) . (10x4)? Or am I missing something?

M_Nagy_
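
On the shapes in the question above: in the standard simplified formulation the output is computed as W . X rather than W . X^T, so the dimensions line up. A minimal numpy check with the same illustrative sizes (4 inputs, 10 features each); this is a sketch, not the video's exact computation:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 10))   # 4 embedded inputs, 10 features each

W = X @ X.T                    # (4, 10) @ (10, 4) -> (4, 4) score matrix
Y = W @ X                      # (4, 4)  @ (4, 10) -> (4, 10) outputs
print(W.shape, Y.shape)        # (4, 4) (4, 10)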

Before computers, people used to work through a text by hand: compile a list of every individual word in the text along with its number of occurrences, its frequency.
Then compile a concordance dictionary: a KEY and an ENTRY for each individual word, where the body of the ENTRY is the set of left- and/or right-contexts the keyword has been observed in. Usually the left context is enough.
Now, if two KEYs can appear in the same context, they have something in common: they share meaning by distributional semantics. And the more frequently they are observed to do that, the stronger the sharing.
==> And the longer the shared context, the closer the two words resemble each other.
==> This works great for words that are neither too frequent (a, the, some, is, ...) nor too infrequent. There will be a ton of words that appear only once, so the longer the text the better; tiny toy texts like (only) the Old Testament or the Qur'an are just too short, though still useful pedagogically.
==> You need astronomically long texts, like the whole internet that GPT-3 uses.

This approach works great for synthetic text generation, and it's good because both the dictionary and the Python code can be studied, in great contrast to trained neural networks, which amount to black boxes: you can't see what's inside and don't really know what's happening.
==> Whatever *is* happening in the neural network, something analogous is happening in the n-gram model,
and you can see, analyze, debug and improve the dictionary/Python, but that's not so easy for the NN.

rdurian
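
A minimal Python sketch of the concordance idea described in the comment above; the toy text, tokenisation, and context length are all made up for illustration:

from collections import Counter, defaultdict

text = "the cat sat on the mat and the dog sat on the rug"
words = text.split()

freq = Counter(words)                      # frequency of every individual word

left_context = defaultdict(set)            # concordance: KEY word -> set of left-contexts
n = 2                                      # context length (illustrative)
for i, key in enumerate(words):
    left_context[key].add(tuple(words[max(0, i - n):i]))

# two KEYs sharing a left-context share some meaning by distributional semantics
shared = left_context["mat"] & left_context["rug"]
print(freq["the"], shared)                 # 4 {('on', 'the')}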

Three years ago, the dimensionality of the embeddings was around 100, as you say. Today, in 2024, the smallest you can expect is in the thousands.

sairaj

How does multiplying the input matrix with W* give information about context?

c.nbhaskar

22 years ago I saw the clear need for this, but I was only working with simple backprop. Oh well, too old, too late... LOL

NoferTrunions

Please add a link to the next part of your series to this video's description!
Otherwise, many thanks for this wonderful video.

Peebuttnutter

Order doesn't matter in dot products: x1.x2 = x2.x1.

pinakigupta

Shouldn't w12 and w21 be the same, due to the basic maths behind dot products?

solvinglife

While finding W12, should we multiply each pair of elements one by one and add them together?
For example, 0.06 * 0.60 + 0.86 * 0.34 + ... =
In that case, shouldn't we get 0.4325 instead of 0.21? How do we get 0.21? Can you please clarify?

mertolcaman
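
On the arithmetic in the question above: yes, w12 is the element-wise product of the two vectors summed up, i.e. an ordinary dot product. A tiny check with made-up values loosely based on the numbers in the comment (not the exact vectors from the video, so the result matches neither 0.21 nor 0.4325):

import numpy as np

x1 = np.array([0.06, 0.86, 0.12])          # made-up example values
x2 = np.array([0.60, 0.34, 0.50])

w12 = sum(a * b for a, b in zip(x1, x2))   # multiply element by element, then add up
print(w12, np.dot(x1, x2))                 # both give the same number (here about 0.388)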

This is the best; it's a shame you didn't continue with trainable attention.

martian._

Why put background music in videos? It's a curse for good educational videos!
Many educational creators have already ditched this curse.

soch.original

Bro, get your math correct first. I always wonder why newbies make YouTube videos. Please stop it; you make it tougher to find an understandable video on YouTube.

saurabhmahra