E07 | Fast LLM Serving with vLLM and PagedAttention

Fast LLM Serving with vLLM and PagedAttention (SOSP'23)

Abstract: LLMs promise to fundamentally change how we use AI across all industries. However, actually serving these models is challenging and can be surprisingly slow even on expensive hardware. To address this problem, we are developing vLLM, an open-source library for fast LLM inference and serving. vLLM utilizes PagedAttention, our new attention algorithm that effectively manages attention keys and values. vLLM equipped with PagedAttention achieves up to 24x higher throughput than HuggingFace Transformers, without requiring any model architecture changes. vLLM has been developed at UC Berkeley and deployed for Chatbot Arena and Vicuna Demo for the past 5 months. In this talk, we will discuss the motivation, features, and implementation of vLLM in depth, and present our future plans.
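As a rough sketch of the serving interface described above (this is not the talk's own code; the model name, sampling settings, and block_size value are illustrative assumptions, and the exact API may differ across vLLM versions):

    # Offline batched inference with vLLM; PagedAttention manages the KV cache internally.
    from vllm import LLM, SamplingParams

    prompts = [
        "Explain PagedAttention in one sentence.",
        "Why is LLM serving memory-bound?",
    ]

    # Sampling settings below are placeholders.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    # block_size controls how many tokens each KV-cache block holds (commonly 16);
    # it is exposed as an engine argument in recent vLLM releases.
    llm = LLM(model="facebook/opt-125m", block_size=16)

    # Requests are batched and KV-cache blocks are allocated on demand.
    outputs = llm.generate(prompts, sampling_params)
    for out in outputs:
        print(out.prompt, "->", out.outputs[0].text)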

Bio: Zhuohan Li is a CS PhD student at UC Berkeley, where he is advised by Professor Ion Stoica. He is interested in designing and building efficient machine-learning systems. Recently, he has been focusing on the training and serving of large models, specifically LLMs. His projects include Alpa, AlpaServe, Vicuna, and vLLM (PagedAttention). He completed his BS at Peking University and has interned at Microsoft Research, Anyscale, and Google Brain.
Comments

Thanks for sharing. It's educational for me.
One question: is the block size (16/32) related to the warp size (half-warp/warp)? I'm wondering about the reasoning behind how the block size for the KV cache is chosen.

ginsongsong

Thanks for sharing!
Is it possible to turn on automatic subtitles (with translation)?

shabdanbatyrkulov

Is there any implementation that works with Azure?

chenghao

Is there a version in English?

maciejgawinecki