Accelerating LLM Inference with vLLM
vLLM is an open-source, high-performance engine for LLM inference and serving developed at UC Berkeley. It has been widely adopted across the industry, with 12K+ GitHub stars and 150+ contributors worldwide, and since its initial release the vLLM team has improved performance by more than 10x. This session will cover various topics in LLM inference performance, including paged attention and continuous batching. We will then focus on new innovations in vLLM and the technical challenges behind them, including speculative decoding, prefix caching, disaggregated prefill, and multi-accelerator support. The session will conclude with industry case studies of vLLM and future roadmap plans.

Takeaways: vLLM is an open-source engine for LLM inference and serving that provides state-of-the-art performance and an accelerator-agnostic design. By focusing on production-readiness and extensibility, vLLM's design choices have led to new system insights and rapid community adoption.
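For context, the sketch below shows vLLM's offline inference API with a couple of the features mentioned above enabled. It is illustrative only, not taken from the talk: the model name is a placeholder, and the exact flag names (such as enable_prefix_caching and tensor_parallel_size) may vary between vLLM versions.

# Minimal vLLM usage sketch (assumptions: flag names valid for recent vLLM releases;
# model is a placeholder you would replace with your own).
from vllm import LLM, SamplingParams

# Paged attention and continuous batching are handled internally by the engine.
llm = LLM(
    model="facebook/opt-125m",      # example model; substitute any supported checkpoint
    enable_prefix_caching=True,     # reuse KV-cache blocks for shared prompt prefixes
    tensor_parallel_size=1,         # set >1 to shard the model across multiple accelerators
)

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
outputs = llm.generate(["What is paged attention?"], sampling_params)
print(outputs[0].outputs[0].text)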
Talk By: Cade Daniel, Software Engineer, Anyscale ; Zhuohan Li, PhD student, UC Berkeley / vLLM