Attention in transformers, visually explained | Chapter 6, Deep Learning

Demystifying attention, the key mechanism inside transformers and LLMs.
An equally valuable form of support is to simply share the videos.

Demystifying self-attention, multiple heads, and cross-attention.

And yes, at 22:00 (and elsewhere), "breaks" is a typo.

------------------

Here are a few other relevant resources

Build a GPT from scratch, by Andrej Karpathy

If you want a conceptual understanding of language models from the ground up, @vcubingx just started a short series of videos on the topic:

If you're interested in the herculean task of interpreting what these large networks might actually be doing, the Transformer Circuits posts by Anthropic are great. In particular, it was only after reading one of these that I started thinking of the combination of the value and output matrices as a single low-rank map from the embedding space to itself, which, at least in my mind, made things much clearer than other sources. (A small numerical sketch of that low-rank view follows this list of resources.)

Site with exercises related to ML programming and GPTs

An early paper on how directions in embedding spaces have meaning:
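
To make the low-rank view of the value and output matrices mentioned above concrete, here is a minimal NumPy sketch; the sizes are made up for illustration and are not taken from the video or from any particular model.

    import numpy as np

    # Illustrative sizes only: a small embedding dimension and a smaller head dimension.
    d_embed, d_head = 512, 64
    rng = np.random.default_rng(0)

    W_V = rng.normal(size=(d_head, d_embed))   # value matrix: embedding space -> head space
    W_O = rng.normal(size=(d_embed, d_head))   # output matrix: head space -> embedding space

    # Composed, they map the embedding space to itself, but the rank of the product
    # can never exceed d_head, which is the sense in which the pair acts as one low-rank map.
    W_OV = W_O @ W_V
    print(W_OV.shape)                   # (512, 512)
    print(np.linalg.matrix_rank(W_OV))  # 64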

------------------

Timestamps:
0:00 - Recap on embeddings
1:39 - Motivating examples
4:29 - The attention pattern
11:08 - Masking
12:42 - Context size
13:10 - Values
15:44 - Counting parameters
18:21 - Cross-attention
19:19 - Multiple heads
22:16 - The output matrix
23:19 - Going deeper
24:54 - Ending

------------------

These animations are largely made using a custom Python library, manim. See the FAQ comments here:

All code for specific videos is visible here:

The music is by Vincent Rubinetti.

------------------

3blue1brown is a channel about animating math, in all senses of the word animate. If you're reading the bottom of a video description, I'm guessing you're more interested than the average viewer in lessons here. It would mean a lot to me if you chose to stay up to date on new ones, either by subscribing here on YouTube or otherwise following on whichever platform below you check most regularly.

------------------

Comments

A few added notes based on common comments I see.

Concerning masked self-attention, several people ask about cases where it feels like later words should update the meaning of earlier words, as in languages where adjectives follow nouns. The model can always put the richest meaning into the last token (e.g. early nouns getting baked into later adjectives). For example, @victorlevoso8984 noted below how empirical evidence suggests the meaning of a sentence often gets baked into the embedding of the punctuation mark at its end. Keep in mind that the model doesn't have to conceptualize things the way we humans do, and in all likelihood doesn't, so I wouldn't over-index on the motivating example given in this video.
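
For a concrete picture of the masking step, here is a minimal NumPy sketch (sizes and details are made up for illustration, not taken from the video): entries of the attention pattern that would let a later token inform an earlier one are set to negative infinity before the softmax, so they become zero afterward.

    import numpy as np

    def masked_attention_pattern(Q, K):
        """Causal ("masked") attention pattern: token i may only attend to tokens j <= i.

        Rows index queries here; the video draws the transposed (column) layout.
        """
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)               # every query dotted with every key
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)      # later tokens can't inform earlier ones
        scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
        weights = np.exp(scores)
        return weights / weights.sum(axis=-1, keepdims=True)

    rng = np.random.default_rng(0)
    Q = rng.normal(size=(5, 8))  # 5 tokens, 8-dimensional query/key space (made-up sizes)
    K = rng.normal(size=(5, 8))
    print(np.round(masked_attention_pattern(Q, K), 2))  # each row sums to 1, upper triangle is 0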

Also, one thing I should have called out more explicitly is how I personally like to think of vectors like embeddings, keys, queries, etc. as columns, and as a convention display them this way, but other sources, including the Attention is All You Need paper, may present them organized in a row-by-row fashion. This is relevant to parsing the equation shown at 10:29, where the expression from the paper that looks like Q K^T would, by the conventions of this video, instead look like K^T Q.
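
A quick numerical check of that last point, as a sketch with made-up sizes: stacking the query and key vectors as rows and computing Q K^T gives exactly the transpose of the grid produced by the column convention, K^T Q.

    import numpy as np

    rng = np.random.default_rng(1)
    n_tokens, d_k = 4, 6                  # made-up sizes for illustration

    # Row convention (as in the "Attention Is All You Need" paper): one query/key per row.
    Q_rows = rng.normal(size=(n_tokens, d_k))
    K_rows = rng.normal(size=(n_tokens, d_k))
    pattern_rows = Q_rows @ K_rows.T

    # Column convention (as in this video): the same vectors stored as columns.
    Q_cols, K_cols = Q_rows.T, K_rows.T
    pattern_cols = K_cols.T @ Q_cols

    # The same grid of dot products, just transposed.
    print(np.allclose(pattern_rows, pattern_cols.T))  # True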

3blue1brown

I'm a university lecturer with a PhD in AI, and I cannot compete with the quality of this work. Videos like this put the entire higher education system to shame. Fantastic! ❤️

philrod

I've got to say - "Attention Is All You Need" is an incredible title for a research paper.

Steamrick

Are you kidding me? ONE WEEK FOR 2 MASTERPIECES?!
Thank you so much!

actualBIAS

3b1b is the only content producer whose videos I prepare for by first making coffee, then upvoting, and only then hitting the play button.

sriramsrinivasan

As a graduating PhD student working in Natural Language Processing, I still found that video to be extremely beneficial. Awesome!

hailking

How I wish this video had been available when the "Attention Is All You Need" paper first came out. It was really hard to visualize by simply reading the paper; I read it multiple times but could not figure out what it was trying to do.

Subsequently, Jay Alammar posted a blog post called "The Illustrated Transformer". That was a huge help for me back then. But this video raises the illustration to an entirely different level.

Great job! I'm sure many undergraduates and hobbyists studying machine learning will benefit greatly.

QuantAI-kpxt

Attention existed before the 2017 paper "Attention Is All You Need".

The main contribution was that attention was... all you needed for sequence processing (you didn't need recurrence). Self-attention specifically was novel though.

Henry-fvbc

Geez Grant, I spent thousands of dollars on a very good deep learning executive certification from Carnegie Mellon, and your series here is better than their math slides. This series is really turning out great.

DataRae-AIEngineer

I cannot stress enough what a tour de force this is. It's probably one of the best math classes ever done anywhere in the world, at any time.

You're the best in the game and an inspiration to many. Thank you so, so much, Grant; you're doing God's work here.

MatheusC

As a Master's student in Data Science and AI, I never really understood how attention worked. Thank you for making this video!

muelleer

There are people … all over the world … like me … who really, really, really appreciate you. I cannot thank you enough for taking the time to share your knowledge and help others to understand this technology much more deeply. Seriously, kudos and sincerest thanks. ❤

JustGrowers

Just wow, the educational value of this video is incredible.
There are so many highly relevant and original ideas here for explaining abstract concepts and drastically simplifying comprehension.
I'm so thankful that you've made this content available to everyone for free.
I absolutely love it!!

Otomega

As the director of video content for a major educational publisher, I can say this is some of the best educational content I've ever seen. Your content gives me ideas about how to shape the future of undergraduate-level STEM videos. A true legend and inspiration in this space. Thank you for the meticulously outstanding work that you do.

JonyBetancourt

You not only put out some of the best content on YouTube but also give constant shout-outs to other content creators that you admire. You are the GOAT, 3Blue1Brown.

michaelthompson

If I could write poetry about how much I appreciate and learn from your videos, I would, but I'm not a poet. Thanks to everyone who worked on these videos.

glizzy

The fact that this is freely available on YT is insane. Thanks for all the amazing work throughout the years.

fluffybunny

I'm a Computer Science student currently working with a Transformer for my master's thesis, and this video is absolute gold to me. I think this is the best explanation video I've ever seen. Holy shit, it is so clear and insightful. I'm so looking forward to the third video of the series!!!! The first one was absolutely amazing too. Thank you sooo much for this genius piece of work!!!!

annachester

Grant is all you need.

This was probably the tenth video or podcast I've seen on the subject, and only now do I understand the underlying motivation for each of its components.

bola

Thank you for the mention, Grant!

For those who relate to the pain of wanting more practice problems for Machine Learning, I hear you.

I’ve created coding problems (run against test cases in your browser!) & quizzes covering the core ML concepts.

Check out the resource Grant mentioned (linked in the description) or just click on my channel!

gptLearningHub