How DeepSeek Rewrote the Transformer [MLA]


MLA/DeepSeek Poster at 17:12 (Free shipping for a limited time with code DEEPSEEK):

Limited edition MLA Poster and Signed Book:

Imaginary Numbers book is back in stock!

Juan Benet, Ross Hanson, Yan Babitski, AJ Englehardt, Alvin Khaled, Eduardo Barraza, Hitoshi Yamauchi, Jaewon Jung, Mrgoodlight, Shinichi Hayashi, Sid Sarasvati, Dominic Beaumont, Shannon Prater, Ubiquity Ventures, Matias Forti, Brian Henry, Tim Palade, Petar Vecutin, Nicolas baumann, Jason Singh, Robert Riley, vornska, Barry Silverman, Jake Ehrlich

References

Technical Notes

2. I claim a couple of times that MLA allows DeepSeek to generate tokens more than 6x faster than a vanilla transformer. The DeepSeek-V2 paper claims a slightly less than 6x throughput improvement with MLA, but since the V3/R1 architecture is heavier, we expect a larger lift, which is why I claim “more than 6x faster than a vanilla transformer.” In reality it’s probably significantly more than 6x for the V3/R1 architecture.
3. In all attention patterns and walkthroughs, we’re ignoring the |beginning of sentence| token. “The American flag is red, white, and” actually maps to 10 tokens if we include this starting token, and many attention patterns do assign high values to this token.
4. We’re ignoring bias terms in the matrix equations.
5. We’re ignoring positional embeddings. These are fascinating. See the DeepSeek papers and RoPE (rotary position embeddings).
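As a rough illustration of note 5, here is a minimal NumPy sketch of plain RoPE applied to a single per-head query or key vector. This is the standard interleaved-pair formulation, not the decoupled-RoPE variant DeepSeek uses inside MLA, and the function name and sizes are just illustrative assumptions.

```python
import numpy as np

def apply_rope(x, position, base=10000.0):
    # Rotate consecutive dimension pairs (2i, 2i+1) of one query/key vector by
    # position * base**(-2i/d); relative positions then show up in the q.k dot products.
    d = x.shape[-1]                              # per-head dimension, must be even
    freqs = base ** (-2.0 * np.arange(d // 2) / d)
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

# e.g. a 64-dimensional per-head query vector at token position 3
q_rotated = apply_rope(np.random.randn(64), position=3)
```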
Comments

As an AI researcher who has already read the DeepSeek papers, I found this a fantastic video explanation. Please make more videos along these lines!

PaulScotti

This is by far THE best AI deep dive I've ever seen. I actually understand now not just how the attention architecture works, but how and why DeepSeek's changes to the architecture result in such an incredible performance AND efficiency improvement.

MirorRflction

It's one of those ideas that looks so, so obvious once you see it, yet it is not at all! Very nice work from DeepSeek.

khoiduongminh

In ML, it's easy to overuse tools like latent space encodings when simpler methods would be more appropriate, but what the DeepSeek team cooked up here is, as Welch says, a truly elegant approach to the problem at hand. Thanks for another great video!

largemoney

Honestly, the best visual explanations for KV Caching, and really good explanations for other points as well.
I can see a lot of effort. Good work 👏

RevolutionofTime

The way you represent these large matrices as heat maps rather than grids of numbers helps a lot with reducing the overhead of processing the visuals. You're killing it with this series.

lakshaymd

6:52 A small additional insight:
I just want to point out that in the multi-head attention mechanism we usually *don't have a fully separate weight matrix* for each head to get K, Q, and V. Instead, one large projection is applied and its output is split into smaller per-head K, Q, and V chunks for computing the dot products. That's why the embedding length 'd' should be divisible by the total number of heads: in GPT-2 that is 768 % 12 == 0, and for DeepSeek R1 it is 7168 % 128 == 0. So even if, per 13:00, K and Q come from the same combined weight matrix, each head only processes its own [num_tokens, d/num_heads]-sized slice of the projected K and Q; otherwise each head would learn roughly the same thing, because the full K and Q would be identical for all of them.

The idea of splitting the projections across different heads stems from the fact that each word can carry different concepts depending on its context, i.e. the words surrounding it. For example, 'bank' can be either a financial institution or a river bank, so if the words around 'bank' are related to finance, a specific head will produce higher dot products, and that section of the final output will have high values after we concatenate the outputs from all heads back into a [num_tokens, d] matrix, where d is 768 in GPT-2 and 7168 in DeepSeek R1.
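A minimal NumPy sketch of the single-projection head split described above; the sizes follow the GPT-2 numbers in the comment (d = 768, 12 heads of size 64), and all variable names are just illustrative.

```python
import numpy as np

num_tokens, d_model, num_heads = 9, 768, 12
d_head = d_model // num_heads              # 64; requires 768 % 12 == 0

X = np.random.randn(num_tokens, d_model)   # token embeddings
W_q = np.random.randn(d_model, d_model)    # one big query projection, not 12 separate ones
W_k = np.random.randn(d_model, d_model)

# Project once, then reshape so each head sees its own 64-dimensional slice.
Q = (X @ W_q).reshape(num_tokens, num_heads, d_head).transpose(1, 0, 2)  # [heads, tokens, d_head]
K = (X @ W_k).reshape(num_tokens, num_heads, d_head).transpose(1, 0, 2)

# Per-head attention scores: each head gets different scores because it
# works with a different slice of the projected Q and K.
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # [heads, tokens, tokens]
```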

Thank you for such an amazing video btw !

shny

Amazing work diving into the details so clearly. Bravo, Stephen!

ArtOfTheProblem

12:50 extra clarification: "Performance" in the table refers to the "quality" of the output, not the traditional meaning of "speed", which is basically the same once you ignore memory-loading time.

Danji_Coppersmoke

I'm so grateful that in-depth creators like you exist on YouTube. I find detailed videos that don't cut technical details infinitely more valuable than over-simplified flashy ones. Your unique papercraft visuals and amazing animations on top of that are just the cherry on top.

oliverlong

Proof that DeepSeek is not a Chinese knock-off but a genuine algorithmic improvement: reducing the KV cache by a factor of 57 (1.76 orders of magnitude), with clever math removing redundant matrix multiplications and providing an elegant solution.

sandeepvk

Sam Altman said that DeepSeek did not create anything new, but from your explanation they actually improved the previous system a great deal. And not only that, they made everything public.

nickf

LLMs feel like one of those things I will never understand, even with videos from the best creators like Welch Labs and 3b1b. The decision tree series was amazing and really helped me think about those models, but nothing has made LLMs make any sense whatsoever, lol. I kinda get neural networks, but to go from those to LLMs there are so many scales, with so many layers of layers, and I don't have the intuition to move from one scale to another.

LoganMarcosSchmidt

It's hard to imagine how much effort the author put into making these videos. Thank you very much!

guozhou-pz

Simply an outstanding explanation of both existing KV cache design and the improvements to it brought by the DeepSeek team. Bravo.

johnathancorgan

I'm just kind of sitting here nodding my head without really comprehending anything but this is so fascinating...

FutureAIDev

Holy crap… clever linear algebra just simplified the architecture immensely…

gingeral

*Great video. Wow*
This is the first video that *successfully explains* how Multi-head Latent Attention (MLA) *actually* works with the necessary mathematical details. The matrix shapes are super helpful in explaining the details. Thanks for making this exceptional video.

vishalmishra

5:25 The number in the denominator is not the dimension of the tokens' embeddings but the per-head dimension of the query vectors, here 64.
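A tiny NumPy sketch of the scaling the comment refers to: the attention scores are divided by the square root of the per-head query/key dimension (64 here), not the full embedding dimension; the shapes and 9-token count are just illustrative.

```python
import numpy as np

d_head = 64                         # per-head query/key dimension
Q = np.random.randn(9, d_head)      # 9 tokens' per-head queries
K = np.random.randn(9, d_head)

scores = Q @ K.T / np.sqrt(d_head)  # divide by sqrt(64), not sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
```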

andrea-mjce

This video has the best explanation of Attention/Transformers I have ever seen

ahmedsaed