How DeepSeek Rewrote the Transformer [MLA]


MLA/DeepSeek Poster at 17:12 (Free shipping for a limited time with code DEEPSEEK):

Limited edition MLA Poster and Signed Book:

Imaginary Numbers book is back in stock!

Juan Benet, Ross Hanson, Yan Babitski, AJ Englehardt, Alvin Khaled, Eduardo Barraza, Hitoshi Yamauchi, Jaewon Jung, Mrgoodlight, Shinichi Hayashi, Sid Sarasvati, Dominic Beaumont, Shannon Prater, Ubiquity Ventures, Matias Forti, Brian Henry, Tim Palade, Petar Vecutin, Nicolas baumann, Jason Singh, Robert Riley, vornska, Barry Silverman, Jake Ehrlich

References

Technical Notes

2. I claim a couple of times that MLA allows DeepSeek to generate tokens more than 6x faster than a vanilla transformer. The DeepSeek-V2 paper claims a slightly less than 6x throughput improvement with MLA, but since the V3/R1 architecture is heavier, we expect a larger lift, which is why I claim “more than 6x faster than a vanilla transformer.” In reality it’s probably significantly more than 6x for the V3/R1 architecture.
3. In all attention patterns and walkthroughs, we’re ignoring the |beginning of sentence| token. “The American flag is red, white, and” actually maps to 10 tokens if we include this starting token, and many attention patterns do assign high values to this token.
4. We’re ignoring bias terms in the matrix equations.
5. We’re ignoring positional embeddings. These are fascinating. See the DeepSeek papers and RoPE (rotary position embeddings).
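As a rough illustration of note 5, here is a minimal NumPy sketch of plain RoPE applied to a single per-head query or key vector. This is the standard interleaved-pair formulation, not the decoupled-RoPE variant DeepSeek uses inside MLA, and the function name and sizes are just illustrative assumptions.

```python
import numpy as np

def apply_rope(x, position, base=10000.0):
    # Rotate consecutive dimension pairs (2i, 2i+1) of one query/key vector by
    # position * base**(-2i/d); relative positions then show up in the q.k dot products.
    d = x.shape[-1]                              # per-head dimension, must be even
    freqs = base ** (-2.0 * np.arange(d // 2) / d)
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

# e.g. a 64-dimensional per-head query vector at token position 3
q_rotated = apply_rope(np.random.randn(64), position=3)
```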
Comments

As an AI researcher who has already read the DeepSeek papers, I found this a fantastic video explanation. Please make more videos along these lines!

PaulScotti

This is by far THE best AI deep dive I've ever seen. I actually understand now not just how the attention architecture works, but how and why DeepSeek's changes to the architecture result in such an incredible performance AND efficiency improvement.

MirorRflction

It's one of those ideas that looks so, so obvious once you see it, yet it is not at all! Very nice work from DeepSeek.

khoiduongminh

In ML, it's easy to overuse tools like latent space encodings when simpler methods would be more appropriate, but what the DeepSeek team cooked up here is, as Welch says, a truly elegant approach to the problem at hand. Thanks for another great video!

largemoney

Honestly, the best visual explanations for KV Caching, and really good explanations for other points as well.
I can see a lot of effort. Good work 👏

RevolutionofTime

The way you represent these large matrices as heat maps rather than grids of numbers helps a lot with reducing the overhead of processing the visuals. You're killing it with this series.

lakshaymd

6:52 A small additional insight:
I just want to point out that in the multi-head attention mechanism we usually *don't have a fully separate weight matrix* for each head to get K, Q, and V. Instead, one large projection is applied and its output is split into smaller per-head K, Q, and V chunks for computing the dot products. That's why the embedding length 'd' should be divisible by the total number of heads: in GPT-2 that is 768 % 12 == 0, and for DeepSeek R1 it is 7168 % 128 == 0. So even if, per 13:00, K and Q come from the same combined weight matrix, each head only processes its own [num_tokens, d/num_heads]-sized slice of the projected K and Q; otherwise each head would learn roughly the same thing, because the full K and Q would be identical for all of them.

The idea of splitting the projections across different heads stems from the fact that each word can carry different concepts depending on its context, i.e. the words surrounding it. For example, 'bank' can be either a financial institution or a river bank, so if the words around 'bank' are related to finance, a specific head will produce higher dot products, and that section of the final output will have high values after we concatenate the outputs from all heads back into a [num_tokens, d] matrix, where d is 768 in GPT-2 and 7168 in DeepSeek R1.
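A minimal NumPy sketch of the single-projection head split described above; the sizes follow the GPT-2 numbers in the comment (d = 768, 12 heads of size 64), and all variable names are just illustrative.

```python
import numpy as np

num_tokens, d_model, num_heads = 9, 768, 12
d_head = d_model // num_heads              # 64; requires 768 % 12 == 0

X = np.random.randn(num_tokens, d_model)   # token embeddings
W_q = np.random.randn(d_model, d_model)    # one big query projection, not 12 separate ones
W_k = np.random.randn(d_model, d_model)

# Project once, then reshape so each head sees its own 64-dimensional slice.
Q = (X @ W_q).reshape(num_tokens, num_heads, d_head).transpose(1, 0, 2)  # [heads, tokens, d_head]
K = (X @ W_k).reshape(num_tokens, num_heads, d_head).transpose(1, 0, 2)

# Per-head attention scores: each head gets different scores because it
# works with a different slice of the projected Q and K.
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # [heads, tokens, tokens]
```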

Thank you for such an amazing video btw !

shny

Amazing work diving into the details so clearly. Bravo, Stephen!

ArtOfTheProblem

12:50 extra clarification: "Performance" in the table refers to the "quality" of the output, not the traditional meaning of "speed", which is basically the same once you ignore memory-loading time.

Danji_Coppersmoke

I'm so grateful that in-depth creators like you exist on YouTube. I find detailed videos that don't cut technical details infinitely more valuable than over-simplified flashy ones. Your unique papercraft visuals and amazing animations on top of that are just the cherry on top.

oliverlong

Proof that DeepSeek is not a Chinese knock-off but a genuine algorithmic improvement: reducing the KV cache by a factor of 57 (1.76 orders of magnitude), with clever math removing redundant matrix multiplications and providing an elegant solution.

sandeepvk

Sam Altman said that DeepSeek did not create anything new, but from your explanation they actually improved the previous system a great deal. And not only that, they made everything public.

nickf

LLMs feel like one of those things I will never understand, even with videos from the best creators like Welch Labs and 3b1b. The decision tree series was amazing and really helped me think about those models, but nothing has made LLMs make any sense whatsoever, lol. I kinda get neural networks, but to go from those to LLMs there are so many scales, with so many layers of layers, and I don't have the intuition to move from one scale to another.

LoganMarcosSchmidt

It's hard to imagine how much effort the author put into making these videos. Thank you very much!

guozhou-pz

Simply an outstanding explanation of both existing KV cache design and the improvements to it brought by the DeepSeek team. Bravo.

johnathancorgan

I'm just kind of sitting here nodding my head without really comprehending anything but this is so fascinating...

FutureAIDev

Holy crap… clever linear algebra just simplified the architecture immensely…

gingeral

*Great video. Wow*
This is the first video that *successfully explains* how Multi-head Latent Attention (MLA) *actually* works with the necessary mathematical details. The matrix shapes are super helpful in explaining the details. Thanks for making this exceptional video.

vishalmishra

5:25 The number in the denominator is not the dimension of the tokens' embeddings but the per-head dimension of the query vectors, here 64.
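A tiny NumPy sketch of the scaling the comment refers to: the attention scores are divided by the square root of the per-head query/key dimension (64 here), not the full embedding dimension; the shapes and 9-token count are just illustrative.

```python
import numpy as np

d_head = 64                         # per-head query/key dimension
Q = np.random.randn(9, d_head)      # 9 tokens' per-head queries
K = np.random.randn(9, d_head)

scores = Q @ K.T / np.sqrt(d_head)  # divide by sqrt(64), not sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
```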

andrea-mjce

This video has the best explanation of Attention/Transformers I have ever seen

ahmedsaed