Speculative Decoding: When Two LLMs are Faster than One

Speculative decoding (or speculative sampling) is a new technique where a smaller LLM (the draft model) generates the easier tokens, which are then verified by a larger one (the target model). This makes generation faster without sacrificing accuracy.

0:00 - Introduction
1:00 - Main Ideas
2:27 - Algorithm
4:48 - Rejection Sampling
7:52 - Why sample (q(x) - p(x))+
10:55 - Visualization and Results
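
For readers who want to see the algorithm end to end, here is a minimal, runnable sketch of one speculative-sampling round. The toy draft_probs / target_probs functions, the vocabulary size, and K = 3 are invented stand-ins for real models; only the accept/reject rule and the residual distribution norm((q(x) - p(x))+) follow the scheme described in the video (p = draft, q = target).

```python
# Minimal sketch of one speculative-sampling round (p = draft, q = target).
# draft_probs / target_probs are toy stand-ins so the loop runs end to end;
# a real setup would call a small and a large LM here.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8   # toy vocabulary size
K = 3       # number of tokens the draft model proposes per round

def draft_probs(prefix):
    """Toy draft model: next-token distribution given the prefix."""
    logits = np.ones(VOCAB)
    logits[(prefix[-1] + 1) % VOCAB] += 2.0
    return np.exp(logits) / np.exp(logits).sum()

def target_probs(prefix):
    """Toy target model: similar to the draft but sharper."""
    logits = np.ones(VOCAB)
    logits[(prefix[-1] + 1) % VOCAB] += 3.0
    return np.exp(logits) / np.exp(logits).sum()

def speculative_round(prefix):
    # 1) Draft model proposes K tokens autoregressively (cheap).
    ctx, drafted, p_dists = list(prefix), [], []
    for _ in range(K):
        p = draft_probs(ctx)
        x = int(rng.choice(VOCAB, p=p))
        drafted.append(x)
        p_dists.append(p)
        ctx.append(x)
    # 2) Target model scores positions 0..K; in a real system this is one
    #    forward pass over [prefix + drafted tokens].
    q_dists = [target_probs(list(prefix) + drafted[:i]) for i in range(K + 1)]
    # 3) Accept/reject each drafted token left to right.
    out = list(prefix)
    for i, x in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, q[x] / p[x]):
            out.append(x)                       # accepted
        else:
            residual = np.maximum(q - p, 0.0)   # resample from norm((q - p)+)
            residual /= residual.sum()
            out.append(int(rng.choice(VOCAB, p=residual)))
            return out                          # stop at the first rejection
    # 4) All K accepted: take one extra "free" token from the target.
    out.append(int(rng.choice(VOCAB, p=q_dists[K])))
    return out

seq = [0]
for _ in range(5):
    seq = speculative_round(seq)
print(seq)   # each round adds between 1 and K + 1 tokens
```

In a real system, step 2 is a single forward pass of the large model over the drafted tokens, which is where the speedup comes from.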

Comments

Super clear explanation of speculative decoding. I have been working with this for a while, but this clarified some of my questions.

dev..-

I recently wanted to brush up on this because it's been a while since I read the paper. I browsed a few tutorials/blog posts, and it's funny how many people wrote about this without understanding it. You clearly did understand it, and you do a great job of breaking it down. Thank you very much.

oc

Really well done, a light bulb went on in my brain when you showed the table! Thank you, keep it up!

kaenovama

Really informative! One thing that I don't understand is how the LLM knows the previous tokens' probability distributions in a single pass? I thought decoder LLMs only output the new token's probability distribution.

decycle
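
On the single-pass question above: a decoder-only transformer returns logits for every position in one forward pass, not just for the last token, so the target model can score all K drafted tokens at once. A small check of this, assuming the Hugging Face transformers library ("gpt2" is just an example checkpoint):

```python
# One forward pass of a causal LM yields a next-token distribution at EVERY
# position, not only the last one. Requires `torch` and `transformers`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # example checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The quick brown fox", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits                       # (1, seq_len, vocab_size)
probs = logits.softmax(dim=-1)

# probs[0, t] is the distribution for the token AFTER position t, so running
# the target once over [prefix + K draft tokens] yields q(x) at every drafted
# position in a single pass.
print(probs.shape)
```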

What an amazing explanation! Thank you so much

rajmankad

Very easy to understand. Thanks so much.

甘楽-uv

Thank you for the video. When I first heard this idea in February I was wondering how it made sense, because I was picturing a large K; now, seeing that the recommended K is about 3, I understand how most of the output will be the same.

einsteinsapples

I haven't read the paper yet, but my understanding is that we sample from q(x) - p(x) because we want the most surprising token, one that the draft model does not anticipate. That should maximize the entropy, but then there should be a log in the equation. Anyway, I've got to read the paper to understand the math.

shairuno
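
For reference, here is the standard correctness argument from the speculative-sampling papers, written in the video's notation (p = draft, q = target): the residual norm((q(x) - p(x))+) is exactly what makes the accepted-or-resampled token come out distributed as q.

```latex
% Accept a drafted token x ~ p(x) with probability min(1, q(x)/p(x));
% on rejection, resample from the normalized residual (q(x) - p(x))^+.
\begin{aligned}
P(X = x)
  &= p(x)\,\min\!\left(1, \frac{q(x)}{p(x)}\right)
   + \left(1 - \sum_{x'} \min\bigl(p(x'), q(x')\bigr)\right)
     \frac{\bigl(q(x) - p(x)\bigr)^{+}}{\sum_{x'} \bigl(q(x') - p(x')\bigr)^{+}} \\
  &= \min\bigl(p(x), q(x)\bigr) + \bigl(q(x) - p(x)\bigr)^{+} = q(x).
\end{aligned}
% The second line uses 1 - \sum_{x'} \min(p(x'), q(x')) = \sum_{x'} (q(x') - p(x'))^{+}:
% the rejection probability equals the total residual mass, so the
% normalization cancels.
```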

Love your video, thanks!
If I had to give one request/critique, it'd be that I wish there were some slides in here similar to Samuel Albanie's videos: information-dense recaps that could be lifted out of the presentation and put into our notes (or into a PowerPoint for a paper club, or something).

_gunna

Hi, thanks for the great content. I have a question. Let's say during speculative decoding (with a vocabulary of only 4 tokens) we get to a stage where the draft model's next-token distribution is [0.35, 0.3, 0.15, 0.2] and the target model's is [0.4, 0.5, 0.05, 0.05].

So now the probability of token 1 in the draft is 0.35 and the probability of token 1 in the target is 0.4. What will the speculative algorithm do? If it picks token 1 from the vocabulary, can we still say that we decode the exact same tokens that the larger model would decode? Thanks

anshumansinha
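
A quick numeric check of the example above under the acceptance rule (reading it as a 4-token vocabulary, with p the draft distribution and q the target):

```python
# Acceptance probabilities and residual distribution for the example above.
import numpy as np

p = np.array([0.35, 0.30, 0.15, 0.20])   # draft model's next-token distribution
q = np.array([0.40, 0.50, 0.05, 0.05])   # target model's next-token distribution

accept = np.minimum(1.0, q / p)           # per-token acceptance probability
residual = np.maximum(q - p, 0.0)
residual /= residual.sum()                # norm((q - p)+), used only on rejection

print(np.round(accept, 3))    # [1.    1.    0.333 0.25 ]
print(np.round(residual, 3))  # [0.2 0.8 0.  0. ]
```

Since q(token 1) = 0.4 >= p(token 1) = 0.35, a drafted token 1 is always accepted. More generally, per the papers the guarantee is that the output is distributed exactly as the target model's distribution; it coincides with token-for-token identical output only in the greedy (argmax) case.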

Thanks for creating this amazing video! I’m wondering if you could open source the slides as well?

jackzhang

Google and DeepMind doing the Spiderman meme 😅

kylewilliams

Thank you for the explanations and the visuals. Does speculative decoding work with beam search? I understand that for LLMs we generally just do greedy decoding in one pass, but for translation models like Whisper the performance increases significantly if we use beam search. I've even seen the official Hugging Face post discussing how speculative decoding improves Whisper-large inference speed by 2x, but to be honest, for non-English audio data, Whisper is barely usable with greedy decoding...

mingzhou

Thanks for sharing. I am wondering how the target model checks the tokens generated by the draft model and produces the probability distribution q(x) for each of them?

laulinky

Thanks for this! I've been enjoying your videos! Do you think you could do a review/explanation of Flash-Decoding by Tri Dao? I have been reading the PyTorch blog post but I don't really understand it.

waynelau

Why does the target model running with K new tokens spend almost the same computation as with just 1 new token? I know the K new tokens can be computed in parallel in a single forward pass, but self-attention with K new tokens does need more work than with 1 token (assuming a KV cache is used), doesn't it?

feixyzliu

Interesting. Curious whether we can use multiple different fine-tuned small models to do the same task along with a bigger model.

saiashwalkaligotla

Thanks! But doesn't the Google paper define Mq as the draft model, i.e. flip the definitions?

ariellubonja

Made very simple, but one more variable is choosing the right draft model. If one chooses a draft model whose distribution is too far from the larger model's, that's also a problem.

Basant

This is great! Is there any chance you could demonstrate something like this in code?

domenvake