Fast LLM Serving with vLLM and PagedAttention

LLMs promise to fundamentally change how we use AI across all industries. However, actually serving these models is challenging and can be surprisingly slow even on expensive hardware. To address this problem, we are developing vLLM, an open-source library for fast LLM inference and serving. vLLM utilizes PagedAttention, our new attention algorithm that efficiently manages the memory for attention keys and values. Equipped with PagedAttention, vLLM achieves up to 24x higher throughput than HuggingFace Transformers without requiring any model architecture changes. vLLM has been developed at UC Berkeley and deployed for Chatbot Arena and the Vicuna Demo for the past three months. In this talk, we will discuss the motivation, features, and implementation of vLLM in depth, and present our future plans.
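
For readers who want to try it, here is a minimal sketch of offline inference with vLLM's Python API; the model name and sampling settings are illustrative, not taken from the talk.

```python
# Minimal vLLM offline-inference sketch; model and settings are illustrative.
from vllm import LLM, SamplingParams

# vLLM manages the KV cache internally with PagedAttention.
llm = LLM(model="lmsys/vicuna-7b-v1.5")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

prompts = [
    "Explain how paged virtual memory works in one paragraph.",
    "Summarize the benefits of batching requests in LLM serving.",
]

# generate() batches the prompts and returns one RequestOutput per prompt.
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```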

About Anyscale
---
Anyscale is the AI Application Platform for developing, running, and scaling AI.

If you're interested in a managed Ray service, check out:

About Ray
---
Ray is the most popular open source framework for scaling and productionizing AI workloads. From Generative AI and LLMs to computer vision, Ray powers the world’s most ambitious AI workloads.

#llm #machinelearning #ray #deeplearning #distributedsystems #python #genai
Comments

Beautiful adaptation of the fundamental ideas of paging, reference counting, and copy-on-write. 👌

hemanthsethuram

Full circle: dynamic memory management and garbage collection. Great talk!

dinoscheidt

Such an elegant idea and amazingly clear explanation!

simonguo

It seems like there would be a performance increase for beam search as well? (That is, in addition to the memory savings it gets.) Would be great to see some benchmarks for that!

mshonle

Interesting talk. I'm curious about the underlying implementation of the KV block sharing: you have a copy-on-write mechanism, but how does it avoid a dirty-read condition where both requests read that the ref count is 2 and both copy the block simultaneously?

vaporeon
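
On the copy-on-write question above: a minimal, generic sketch (not vLLM's actual code) of making the ref-count check and block copy a single atomic step, so two requests cannot both observe a count of 2 and copy concurrently. A scheduler that serializes all block-table updates in one thread gives the same guarantee without an explicit lock.

```python
# Generic copy-on-write sketch over ref-counted KV blocks (not vLLM internals).
import threading
from dataclasses import dataclass

@dataclass
class Block:
    data: bytearray
    ref_count: int = 1

class BlockManager:
    def __init__(self) -> None:
        self._lock = threading.Lock()

    def writable_block(self, block: Block) -> Block:
        """Return a block this request may write to, copying it if shared."""
        with self._lock:                  # check-and-copy is atomic under the lock
            if block.ref_count == 1:      # sole owner: safe to write in place
                return block
            block.ref_count -= 1          # shared: drop our reference...
            return Block(data=bytearray(block.data))  # ...and write a private copy
```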

I think the last question asked was about the impact on latency.

alankhor

How do we calculate the memory used by the KV cache in PagedAttention? For example, for an input of 500 tokens and an output of 1000 tokens.

Karthikprath
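
A rough worked example for the sizing question above, assuming a 13B-parameter LLaMA-style model in fp16 (40 layers, hidden size 5120); the per-token cost is model-specific, and models with grouped-query attention cache fewer KV heads.

```python
# Back-of-the-envelope KV-cache sizing; the model dimensions are an assumption.
num_layers  = 40      # transformer layers (13B LLaMA-style)
hidden_size = 5120    # = num_heads * head_dim
dtype_bytes = 2       # fp16 / bf16

# Each token stores one key and one value vector per layer.
bytes_per_token = 2 * num_layers * hidden_size * dtype_bytes   # ~800 KB

input_tokens, output_tokens = 500, 1000
total_tokens = input_tokens + output_tokens                    # 1500 tokens

kv_cache_bytes = total_tokens * bytes_per_token
print(f"{kv_cache_bytes / 1e9:.2f} GB per sequence")           # ~1.23 GB

# With PagedAttention, memory is allocated in fixed-size blocks (e.g. 16
# tokens), so at most one partially filled block per sequence is wasted
# instead of pre-reserving space for the maximum possible output length.
block_size = 16
num_blocks = -(-total_tokens // block_size)                    # ceil -> 94 blocks
```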