The KV Cache: Memory Usage in Transformers


The KV cache is what takes up the bulk of the GPU memory during inference for large language models like GPT-4. Learn about how the KV cache works in this video!
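
As a rough illustration of the mechanism the video covers, here is a minimal, framework-free sketch of one decoding step with a KV cache (toy dimensions and made-up helper names, not the video's code): the newest token's query, key, and value are computed, the key and value are appended to the cache, and attention runs over all cached positions instead of recomputing keys and values for the whole prefix.

```python
# Minimal single-head KV-cache sketch with toy dimensions (illustrative only).
import numpy as np

d_k = 4  # toy head dimension

def attention_step(x_t, k_cache, v_cache, W_q, W_k, W_v):
    """One decode step: append this token's k/v to the cache, then attend
    over every cached position."""
    q = x_t @ W_q                       # query for the new token only
    k_cache.append(x_t @ W_k)           # new key goes into the cache
    v_cache.append(x_t @ W_v)           # new value goes into the cache
    K = np.stack(k_cache)               # (seq_len, d_k), reused each step
    V = np.stack(v_cache)
    scores = (q @ K.T) / np.sqrt(d_k)   # scores against all cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()            # softmax over cached positions
    return weights @ V                  # attention output for the new token

rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d_k, d_k)) for _ in range(3))
k_cache, v_cache = [], []
for _ in range(5):                      # pretend we decode 5 tokens
    x_t = rng.normal(size=d_k)          # embedding of the newest token only
    out = attention_step(x_t, k_cache, v_cache, W_q, W_k, W_v)
print(len(k_cache), "cached keys; output shape:", out.shape)
```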

0:00 - Introduction
1:15 - Review of self-attention
4:07 - How the KV cache works
5:55 - Memory usage and example

Further reading:
Comments

Thanks a ton for this crisp and precise explanation of why we use caching in transformers.

mohitlamba

You explained KV cache so well in an easy to understand way.

michaelnguyen

This is so clear! Thanks for the explanation!

TL-fesi

This is the best video for kv cache! Thx

jow

This was a beautiful, simple video. Great job. Feeding the YouTube algo.

mamotivated

Really great video! Funny that I searched for "transformer kv cache" on Google and your video was uploaded only 8 hours ago.

alexandretl

Amazing. Some of my colleagues work on KV cache, and this video was a great introduction to the topic. Thank you!

forrest-forrest

A really concise explanation. Thanks a lot.

shashank

Great video. Now I understand the importance of 'time to first token'. I like the short ones that are to the point on a topic. Learning in smaller chunks works well for me. Thanks!

voncolborn

Awesome explanation! Looking forward to more videos.

zifencai

For running an LLM locally, a batch size of 1 is enough and would reduce the KV cache to just 1.4 GB in the OPT-30B example.

PMX
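
For reference, a back-of-the-envelope version of those numbers; the layer count, hidden size, precision, sequence length, and batch size below are assumptions about the example, not figures quoted from the video:

```python
# Rough KV-cache sizing under an assumed OPT-30B-style configuration.
n_layers = 48        # assumed decoder depth
d_model = 7168       # assumed hidden size
bytes_fp16 = 2       # fp16 storage
seq_len = 1024       # assumed context length
batch_size = 128     # assumed batch size in the 180 GB example

per_token = 2 * n_layers * d_model * bytes_fp16   # one K and one V per layer
per_seq = per_token * seq_len
total = per_seq * batch_size

print(f"{per_token / 1e6:.2f} MB per token")                   # ~1.38 MB
print(f"{per_seq / 1e9:.2f} GB per sequence (batch size 1)")   # ~1.41 GB
print(f"{total / 1e9:.1f} GB for the full batch")              # ~180 GB
```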

Excellent video, great high level overview!

yuanhu

I hear that vLLM is an optimization for the KV cache; it uses continuous batching and PagedAttention.

huiwei-edip

8:00 I believe the industry jargon for this first computation is "prefill".

RyanLynch

Thanks, simplifying to one layer and one head makes things very clear. Now suppose we have n_layers and n_heads, so we have the memory usage per token like M = n_layers * d_embed (d_embed = d_k * n_heads). But what about the compute per token? Some online references claim that C = n_layers * d_embed² and I'm very surprised that this formula does not depend on the past tokens (context + already generated): I mean that the 2nd layer expects the embedding vector x2 output by the 1st layer to compute its KV cache, and x2 depends on past tokens (see the attention formula). What do you think?

vxsgmqk
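
A rough sketch of the two per-token cost terms the comment above mixes, using the same simplifications it makes (heads folded into d_embed, constants dropped); the values and the split between terms are illustrative assumptions, not figures from the video:

```python
# Per-token cost model under the comment's simplifications (illustrative values only).
n_layers, d_embed, n_ctx = 48, 7168, 1024   # hypothetical model / context size

# Memory the KV cache grows by for each generated token (one K and one V per layer):
kv_elems_per_token = 2 * n_layers * d_embed

# Compute per generated token, split into two terms:
proj_flops = n_layers * d_embed ** 2      # Q/K/V/output projections: independent of context
attn_flops = n_layers * n_ctx * d_embed   # scores + weighted sum over the cache: grows with context

print(f"KV cache growth: {kv_elems_per_token:,} values per token")
print(f"projection term: {proj_flops:,} (context-independent)")
print(f"attention-over-cache term: {attn_flops:,} (scales with context length)")
```

Under this split, the n_layers * d_embed² projection term really is context-independent; what grows with the past tokens is the separate attention-over-cache term, which that approximation drops when the context length is much smaller than d_embed.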

*Video Summary: The KV Cache: Memory Usage in Transformers*

- *Introduction*
- Discusses the memory limitations of Transformer models, especially during text generation.

- *Review of Self-Attention*
- Explains the self-attention mechanism in Transformers.
- Highlights how query, key, and value vectors are generated.

- *How the KV Cache Works*
- Introduces the concept of the KV (Key-Value) cache.
- Explains that the KV cache stores previous context to avoid redundant calculations.

- *Memory Usage and Example*
- Provides an equation for calculating the memory usage of the KV cache.
- Gives an example with a 30 billion parameter model, showing that the KV cache can take up to 180 GB.

- *Latency Considerations*
- Discusses the latency difference between processing the prompt and subsequent tokens due to the KV cache.

The video provides an in-depth look at the KV cache, a crucial component that significantly impacts the memory usage and efficiency of Transformer models. It explains how the KV cache works, its role in self-attention, and its implications for memory usage and latency.

wolpumba

Great Video! Btw can we use KV Cache during training?

peoplepeople

Thank you so so much! I have one question: why is the logit generation for the already existing prompt necessary? I want to understand how the prediction of a new token is directly related to the already generated logits. I hope my question makes sense.

Again, thank you so much, your videos are the best explanations on youtube!

SinanAkkoyun

Great video. Do you have a notebook implementing the KV cache? It would be really helpful.
Memory optimization is one of the key solutions for on-device deployment. Keep posting insightful optimizations.

senapatiashok

Is my understanding correct: in your example, "chill" has already been generated, and you are demonstrating the preparation work after you got "chill" and before generating the token that follows "chill"?

bnglr