Transformers explained | The architecture behind LLMs

All you need to know about the transformer architecture: How to structure the inputs, attention (Queries, Keys, Values), positional embeddings, residual connections. Bonus: an overview of the difference between Recurrent Neural Networks (RNNs) and transformers.
9:19 Order of multiplication should be the opposite: x1(vector) * Wq(matrix) = q1(vector). Otherwise we do not get the 1x3 dimensionality at the end. Sorry for messing up the animation!
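
For anyone who wants to check the shapes from the correction above, here is a minimal NumPy sketch (the 4-dimensional embedding and the 3-dimensional query/key/value size are made-up toy values, not the exact numbers from the video):

import numpy as np

rng = np.random.default_rng(0)
d_model, d_qkv = 4, 3                      # toy sizes, for illustration only
x1 = rng.normal(size=(1, d_model))         # one token embedding as a 1 x 4 row vector
Wq = rng.normal(size=(d_model, d_qkv))     # query projection matrix, 4 x 3

q1 = x1 @ Wq                               # vector times matrix, as in the correction
print(q1.shape)                            # (1, 3)

# The same projections applied to a whole toy sequence give scaled dot-product attention:
seq_len = 5
X = rng.normal(size=(seq_len, d_model))    # one row per token
Wk = rng.normal(size=(d_model, d_qkv))
Wv = rng.normal(size=(d_model, d_qkv))
Q, K, V = X @ Wq, X @ Wk, X @ Wv           # queries, keys, values
scores = Q @ K.T / np.sqrt(d_qkv)          # seq_len x seq_len attention scores
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
output = weights @ V                       # weighted sum of values
print(output.shape)                        # (5, 3)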

Outline:
00:00 Transformers explained
00:47 Text inputs
02:29 Image inputs
03:57 Next word prediction / Classification
06:08 The transformer layer: 1. MLP sublayer
06:47 2. Attention explained
07:57 Attention vs. self-attention
08:35 Queries, Keys, Values
09:19 Order of multiplication should be the opposite: x1(vector) * Wq(matrix) = q1(vector).
11:26 Multi-head attention
13:04 Attention scales quadratically
13:53 Positional embeddings
15:11 Residual connections and Normalization Layers
17:09 Masked Language Modelling
17:59 Difference to RNNs

Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏
Dres. Trost GbR, Siltax, Vignesh Valliappan, @Mutual_Information, Kshitij

📄 Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." Advances in neural information processing systems 30 (2017).

▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

🔗 Links:

#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research
Music 🎵 : Sunset n Beachz - Ofshane
Video editing: Nils Trost
Comments

Thanks for the explanation. At 9:19: shouldn't the order of multiplication be the opposite here? E.g. x1 (vector) * Wq (matrix) = q1 (vector). Otherwise I don't understand how we get the 1x3 dimensionality at the end.

YuraCCC

Understood about 10%, but I like these videos and intuitively feel their usefulness.

Thomas-gk

Thanks, you helped so much with explaining Transformers to my PhD advisors <3

phiphi

BEST of BEST explanation: 1) visually, 2) intuitively, 3) by numerical examples. And your English is easier for foreigners to listen to than a native speaker's.

heejuneAhn

Had to go back and rewatch a section after I realized I'd been spacing out staring at the coffee bean's reactions.

uwisplaya

Great Video!! Nice improvement over the original

DatNgo-ukft

Thanks so much for this video. I’ve gone through a number of videos on transformers and this is much easier to grasp and understand for a non-data scientist like myself.

Clammer

Letitia, you're awesome and I look forward to learning more from you.

darylallen

You know how to explain things. This one is not easy: I can see the amount of work that went into this video, and it was a lot. I hope that your career takes you where you deserve.

DaveJ

I think I had at least 10 aha moments watching this, and I've watched many videos on these topics. Incredible job, thank you!

mccartym

Absolute banger of a video. Wish I had seen this when I was learning about transformers in uni last year :-)

l.suurmeijer

What a wonderful video! Thank you so much for sharing it!

manuelafernandesblancorodr

This is a very well-made explanation. I hadn't known that the feedforward layers only received one token at a time. Thanks for clearing that up for me! 😁
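
To make that point about the feedforward sublayer concrete, here is a small NumPy sketch (toy sizes, random weights): because the same two weight matrices are applied to every position independently, feeding the tokens one at a time gives exactly the same result as feeding the whole sequence.

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_hidden = 5, 8, 16          # toy sizes, for illustration only
X = rng.normal(size=(seq_len, d_model))        # one embedding per token
W1 = rng.normal(size=(d_model, d_hidden))
W2 = rng.normal(size=(d_hidden, d_model))

def mlp(h):                                    # position-wise feedforward sublayer
    return np.maximum(h @ W1, 0) @ W2          # ReLU between the two linear layers

whole_sequence = mlp(X)                                # all tokens at once
token_by_token = np.stack([mlp(x) for x in X])         # one token at a time
print(np.allclose(whole_sequence, token_by_token))     # True: identical results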

xxlvulkann

As far as I am aware, word embeddings have changed from legacy static embeddings like Word2Vec/GloVe (as in the famous queen = woman + king - man metaphor) to BPE & unigram subword tokenization. This change gave me quite a headache, as most papers do not mention any details of their "word embedding". Perhaps, Letitia, you can make a video to clarify this a bit for us.
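
If you want to inspect modern subword tokenization yourself, here is a quick sketch (assuming the Hugging Face transformers package is installed; the gpt2 checkpoint ships a byte-level BPE tokenizer, and the exact splits depend on its learned merges):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")        # byte-level BPE tokenizer
print(tok.tokenize("unbelievably transformable"))  # rarer words get split into subword pieces
print(tok("Hello world")["input_ids"])             # the ids that are then mapped to embeddings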

tildarusso

Tomorrow I have my thesis evaluation and I was thinking about watching this video again, but the YouTube algorithm suggested it without me searching for anything. Thank u, YouTube algo..
😅❤🔥

rahulrajpvrd

Time is quadratic, but memory is linear -- see the FlashAttention paper.
But the number of parameters is constant -- that's the magic!
Thanks for the excellent videos! 👍
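
A small NumPy sketch of that point (toy sizes, single head, no softmax): the attention score matrix grows quadratically with sequence length, while the learned projection matrices, and therefore the parameter count, do not depend on sequence length at all. FlashAttention, as the comment notes, keeps memory linear by never materializing the full score matrix.

import numpy as np

rng = np.random.default_rng(0)
d_model = 8                                            # toy embedding size
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
n_params = sum(W.size for W in (Wq, Wk, Wv))           # parameter count: independent of seq_len

for seq_len in (16, 32, 64):
    X = rng.normal(size=(seq_len, d_model))            # one embedding per token
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d_model)  # seq_len x seq_len score matrix
    print(seq_len, scores.shape, n_params)             # scores grow as seq_len^2, params stay 192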

davidespinosa

Thank you very much for the very clear explanations and detailed analysis of the transformer architecture. You're truly the 3blue1brown of machine learning!

cosmic_reef_

One of the best videos on transformers that I have ever watched. Views 📈

abhishek-tandon

Best didactic explanation of Transformers so far. Thank you for sharing it.

jcneto

Thank you for the video! Maybe an explanation on the Mamba Architecture next?

SamehSyedAjmal