Inference, Serving, PagedAttention and vLLM
GPT-4 Summary: Dive into the future of Large Language Model (LLM) serving with our live event on vLLM, the open-source inference engine designed to change how we serve and run inference on LLMs. We'll start with a clear explanation of the basics of inference and serving, setting the stage for an in-depth look at vLLM and its PagedAttention algorithm. The event shows how vLLM overcomes the memory bottlenecks of KV-cache management to deliver fast, efficient, and cost-effective LLM serving. Expect a detailed walkthrough of vLLM's system components, a live demo complete with code, and a forward-looking discussion of vLLM's place in the 2024 AI Engineering workflow. Whether you're wrestling with the serving load and fine-tuning challenges of current LLMs or looking for scalable serving solutions, this is a must-watch to stay ahead in AI and machine learning.
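For context on what the live demo is likely to cover, here is a minimal sketch of batch inference with vLLM's offline Python API. The model name, prompts, sampling values, and memory fraction are illustrative assumptions, not details taken from the event.

```python
# Minimal sketch of offline inference with vLLM.
# Assumes `pip install vllm` and a CUDA-capable GPU; the model below is an
# example choice, not necessarily the one used in the live demo.
from vllm import LLM, SamplingParams

prompts = [
    "Explain PagedAttention in one sentence.",
    "Why is the KV cache a memory bottleneck for LLM serving?",
]

# Sampling settings for generation (values are illustrative).
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# vLLM manages the KV cache in fixed-size blocks (PagedAttention), so many
# requests can be batched without reserving large contiguous memory per request.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.90)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)
```

For serving rather than offline batch inference, vLLM also ships an OpenAI-compatible HTTP server (started with `python -m vllm.entrypoints.openai.api_server --model <model>`), which is the usual route to production deployments.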
Have a question for a speaker? Drop them here:
Speakers:
Dr. Greg, Co-Founder & CEO
The Wiz, Co-Founder & CTO
Join our community to start building, shipping, and sharing with us today!
Apply for our next AI Engineering Bootcamp on Maven today!
How'd we do? Share your feedback and suggestions for future events.
Inference, Serving, PagedAttention and vLLM
Fast LLM Serving with vLLM and PagedAttention
Go Production: ⚡️ Super FAST LLM (API) Serving with vLLM !!!
Accelerating LLM Inference with vLLM
What is vLLM & How do I Serve Llama 3.1 With It?
vLLM and PagedAttention are the best for fast Large Language Model (LLM) inference | Let's see WHY
The KV Cache: Memory Usage in Transformers
Exploring the fastest open source LLM for inferencing and serving | VLLM
vLLM: Fast & Affordable LLM Serving with PagedAttention | UC Berkeley's Open-Source Library
vLLM - Turbo Charge your LLM Inference
Boost Your AI Predictions: Maximize Speed with vLLM Library for Large Language Model Inference
E07 | Fast LLM Serving with vLLM and PagedAttention
VLLM: A widely used inference and serving engine for LLMs
Efficient LLM Inference (vLLM KV Cache, Flash Decoding & Lookahead Decoding)
Accelerate Big Model Inference: How Does it Work?
vLLM Faster LLM Inference || Gemma-2B and Camel-5B
Exploring the Latency/Throughput & Cost Space for LLM Inference // Timothée Lacroix // CTO Mist...
Efficient Memory Management for Large Language Model Serving with PagedAttention
Is SGLang Better than vLLM? Serving Llama 3.1 with SGLang
vllm-project/vllm - Gource visualisation
The State of vLLM | Ray Summit 2024
VLLM: Rocket Engine Of LLM Inference, Speeding Up Inference By 24X
vLLM: Virtual LLM
Speculative Decoding: When Two LLMs are Faster than One