The KV Cache: Memory Usage in Transformers


The KV cache is what takes up the bulk of the GPU memory during inference for large language models like GPT-4. Learn about how the KV cache works in this video!
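
As a rough illustration of the mechanism the video covers, here is a minimal, framework-free sketch of one decoding step with a KV cache (toy dimensions and made-up helper names, not the video's code): the newest token's query, key, and value are computed, the key and value are appended to the cache, and attention runs over all cached positions instead of recomputing keys and values for the whole prefix.

```python
# Minimal single-head KV-cache sketch with toy dimensions (illustrative only).
import numpy as np

d_k = 4  # toy head dimension

def attention_step(x_t, k_cache, v_cache, W_q, W_k, W_v):
    """One decode step: append this token's k/v to the cache, then attend
    over every cached position."""
    q = x_t @ W_q                       # query for the new token only
    k_cache.append(x_t @ W_k)           # new key goes into the cache
    v_cache.append(x_t @ W_v)           # new value goes into the cache
    K = np.stack(k_cache)               # (seq_len, d_k), reused each step
    V = np.stack(v_cache)
    scores = (q @ K.T) / np.sqrt(d_k)   # scores against all cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()            # softmax over cached positions
    return weights @ V                  # attention output for the new token

rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d_k, d_k)) for _ in range(3))
k_cache, v_cache = [], []
for _ in range(5):                      # pretend we decode 5 tokens
    x_t = rng.normal(size=d_k)          # embedding of the newest token only
    out = attention_step(x_t, k_cache, v_cache, W_q, W_k, W_v)
print(len(k_cache), "cached keys; output shape:", out.shape)
```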

0:00 - Introduction
1:15 - Review of self-attention
4:07 - How the KV cache works
5:55 - Memory usage and example

Further reading:
Comments

Thanks a ton for this crisp and precise explanation of why we use caching in transformers.

mohitlamba

You explained KV cache so well in an easy to understand way.

michaelnguyen

This is so clear! Thanks for the explanation!

TL-fesi

This is the best video for kv cache! Thx

jow

This was a beautiful, simple video. Great job. Feeding the YouTube algo.

mamotivated

Really great video! Funny that I searched for "transformer kv cache" on Google and your video was uploaded only 8 hours ago.

alexandretl

Amazing. Some of my colleagues work on KV cache, and this video was a great introduction to the topic. Thank you!

forrest-forrest

A really concise explanation. Thanks a lot.

shashank

Great video. Now I understand the importance of 'time to first token'. I like the short ones that are to the point on a topic. Learning in smaller chunks works well for me. Thanks!

voncolborn

Awesome explanation! Looking forward to more videos.

zifencai

For running an LLM locally, a batch size of 1 is enough and would reduce the KV cache to just 1.4 GB in the OPT-30B example.

PMX
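
For reference, a back-of-the-envelope version of those numbers; the layer count, hidden size, precision, sequence length, and batch size below are assumptions about the example, not figures quoted from the video:

```python
# Rough KV-cache sizing under an assumed OPT-30B-style configuration.
n_layers = 48        # assumed decoder depth
d_model = 7168       # assumed hidden size
bytes_fp16 = 2       # fp16 storage
seq_len = 1024       # assumed context length
batch_size = 128     # assumed batch size in the 180 GB example

per_token = 2 * n_layers * d_model * bytes_fp16   # one K and one V per layer
per_seq = per_token * seq_len
total = per_seq * batch_size

print(f"{per_token / 1e6:.2f} MB per token")                   # ~1.38 MB
print(f"{per_seq / 1e9:.2f} GB per sequence (batch size 1)")   # ~1.41 GB
print(f"{total / 1e9:.1f} GB for the full batch")              # ~180 GB
```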

Excellent video, great high level overview!

yuanhu

I hear that vLLM is an optimization for the KV cache; it uses continuous batching and PagedAttention.

huiwei-edip

8:00 I believe the industry jargon for this first computation is "prefill".

RyanLynch

Thanks, simplifying to one layer and one head makes things very clear. Now suppose we have n_layers and n_heads, so we have the memory usage per token like M = n_layers * d_embed (d_embed = d_k * n_heads). But what about the compute per token? Some online references claim that C = n_layers * d_embed² and I'm very surprised that this formula does not depend on the past tokens (context + already generated): I mean that the 2nd layer expects the embedding vector x2 output by the 1st layer to compute its KV cache, and x2 depends on past tokens (see the attention formula). What do you think?

vxsgmqk
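
A rough sketch of the two per-token cost terms the comment above mixes, using the same simplifications it makes (heads folded into d_embed, constants dropped); the values and the split between terms are illustrative assumptions, not figures from the video:

```python
# Per-token cost model under the comment's simplifications (illustrative values only).
n_layers, d_embed, n_ctx = 48, 7168, 1024   # hypothetical model / context size

# Memory the KV cache grows by for each generated token (one K and one V per layer):
kv_elems_per_token = 2 * n_layers * d_embed

# Compute per generated token, split into two terms:
proj_flops = n_layers * d_embed ** 2      # Q/K/V/output projections: independent of context
attn_flops = n_layers * n_ctx * d_embed   # scores + weighted sum over the cache: grows with context

print(f"KV cache growth: {kv_elems_per_token:,} values per token")
print(f"projection term: {proj_flops:,} (context-independent)")
print(f"attention-over-cache term: {attn_flops:,} (scales with context length)")
```

Under this split, the n_layers * d_embed² projection term really is context-independent; what grows with the past tokens is the separate attention-over-cache term, which that approximation drops when the context length is much smaller than d_embed.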

*Video Summary: The KV Cache: Memory Usage in Transformers*

- *Introduction*
- Discusses the memory limitations of Transformer models, especially during text generation.

- *Review of Self-Attention*
- Explains the self-attention mechanism in Transformers.
- Highlights how query, key, and value vectors are generated.

- *How the KV Cache Works*
- Introduces the concept of the KV (Key-Value) cache.
- Explains that the KV cache stores previous context to avoid redundant calculations.

- *Memory Usage and Example*
- Provides an equation for calculating the memory usage of the KV cache.
- Gives an example with a 30 billion parameter model, showing that the KV cache can take up to 180 GB.

- *Latency Considerations*
- Discusses the latency difference between processing the prompt and subsequent tokens due to the KV cache.

The video provides an in-depth look at the KV cache, a crucial component that significantly impacts the memory usage and efficiency of Transformer models. It explains how the KV cache works, its role in self-attention, and its implications for memory usage and latency.

wolpumba

Great Video! Btw can we use KV Cache during training?

peoplepeople

Thank you so so much! I have one question: why is the logit generation for the already existing prompt necessary? I want to understand how the prediction of a new token is directly related to the already generated logits. I hope my question makes sense.

Again, thank you so much, your videos are the best explanations on youtube!

SinanAkkoyun

Great video. Do you have a notebook implementing the KV cache? It would be really helpful.
Memory optimization is one of the key solutions for on-device deployment. Keep posting insightful optimizations.

senapatiashok

Is my understanding correct: in your example, "chill" has already been generated, and you are demonstrating the preparation work after you got "chill" and before generating the token that follows "chill"?

bnglr