Speculative Decoding: When Two LLMs are Faster than One
Speculative decoding (or speculative sampling) is a technique in which a smaller LLM (the draft model) generates the easier tokens, which are then verified in parallel by a larger one (the target model). This makes generation faster without sacrificing accuracy, because the output distribution matches what the target model alone would produce.
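To make the draft-then-verify loop concrete, here is a minimal sketch of one decoding step in Python. The model interfaces (`draft_logits_fn`, `target_logits_fn`) are hypothetical stand-ins, and the notation (p for the draft distribution, q for the target, matching the 7:52 chapter title) is an assumption for illustration, not code from the video.

```python
# Sketch of one speculative-decoding step. Assumes two hypothetical callables:
#   draft_logits_fn(tokens)            -> logits over the vocab for the next token
#   target_logits_fn(prefix, drafted)  -> list of k+1 logit vectors, one per position
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def speculative_step(prefix, draft_logits_fn, target_logits_fn, k=4, rng=None):
    """Draft k tokens with the small model, verify them with the large model,
    and return the accepted tokens plus one corrected/bonus token."""
    rng = rng or np.random.default_rng()

    # 1) Draft: sample k tokens autoregressively from the small model p.
    drafted, draft_probs = [], []
    ctx = list(prefix)
    for _ in range(k):
        p = softmax(draft_logits_fn(ctx))          # p(. | ctx), shape (vocab,)
        x = rng.choice(len(p), p=p)
        drafted.append(int(x))
        draft_probs.append(p)
        ctx.append(int(x))

    # 2) Verify: the target model q scores all k+1 positions in one forward pass.
    target_probs = [softmax(l) for l in target_logits_fn(prefix, drafted)]

    # 3) Accept/reject each drafted token left to right.
    accepted = []
    for i, x in enumerate(drafted):
        p, q = draft_probs[i], target_probs[i]
        if rng.random() < min(1.0, q[x] / p[x]):   # accept with prob min(1, q(x)/p(x))
            accepted.append(x)
        else:
            # Rejected: resample from the residual (q - p)+, renormalized,
            # which keeps the overall output distribution equal to q.
            residual = np.maximum(q - p, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            return accepted                        # stop at the first rejection

    # 4) All k drafts accepted: take one bonus token from the target's last position.
    accepted.append(int(rng.choice(len(target_probs[k]), p=target_probs[k])))
    return accepted
```

The speedup comes from step 2: the target model checks k drafted tokens in a single forward pass instead of k sequential ones, and every accepted token costs only a cheap draft-model call.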
0:00 - Introduction
1:00 - Main Ideas
2:27 - Algorithm
4:48 - Rejection Sampling
7:52 - Why sample (q(x) - p(x))+
10:55 - Visualization and Results
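As a companion to the 4:48 and 7:52 chapters, the accept/reject rule can be written out as follows (assuming, per the chapter title's notation, that p is the draft distribution and q is the target):

```latex
% Accept a drafted token x ~ p with probability min(1, q(x)/p(x));
% on rejection, resample from the normalized positive residual.
\[
x \sim p(\cdot), \qquad
P(\text{accept } x) = \min\!\left(1, \frac{q(x)}{p(x)}\right), \qquad
x' \sim \frac{\bigl(q(\cdot) - p(\cdot)\bigr)_+}{\sum_{y}\bigl(q(y) - p(y)\bigr)_+}.
\]
% The two cases combine to reproduce the target distribution exactly:
\[
P(\text{output} = x)
= p(x)\,\min\!\left(1, \frac{q(x)}{p(x)}\right)
+ \Bigl(1 - \sum_{y} \min\bigl(p(y), q(y)\bigr)\Bigr)
  \frac{\bigl(q(x) - p(x)\bigr)_+}{\sum_{y}\bigl(q(y) - p(y)\bigr)_+}
= q(x).
\]
```

The first term equals min(p(x), q(x)) and the second equals (q(x) - p(x))+, which sum to q(x), so verification never biases the output; this is the identity behind the "Why sample (q(x) - p(x))+" chapter.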
Speculative Decoding: When Two LLMs are Faster than One
Deep Dive: Optimizing LLM inference
How to make LLMs fast: KV Caching, Speculative Decoding, and Multi-Query Attention | Cursor Team
What is Speculative Sampling? | Boosting LLM inference speed
LLM Inference - Self Speculative Decoding
Speculative Decoding Explained
Accelerating Inference with Staged Speculative Decoding — Ben Spector | 2023 Hertz Summer Workshop
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
Differential Compute in LLMs - System 1 vs. System 2, ReACT, Speculative Decoding
What is Speculative Sampling? How does Speculative Sampling Accelerate LLM Inference
The KV Cache: Memory Usage in Transformers
Lecture 22: Hacker's Guide to Speculative Decoding in VLLM
How Medusa Works
What is Speculative Sampling?
Efficient LLM Inference (vLLM KV Cache, Flash Decoding & Lookahead Decoding)
episode 1: Speculative Decoding
Accelerating LLM Inference: Medusa's Uglier Sisters (WITH CODE)
Accelerating LLM Inference with vLLM
Fast Inference from Transformers via Speculative Decoding
Understanding the LLM Inference Workload - Mark Moyou, NVIDIA
Online Speculative Decoding
[short] Online Speculative Decoding
Accelerated LLM Inference with Anyscale | Ray Summit 2024
Fast Inference from Transformers via Speculative Decoding