vLLM Office Hours - Distributed Inference with vLLM - January 23, 2025

In this session, we explored the motivation for distributed inference, delving into vLLM's architecture and how GPU parallelism can enhance performance. We discussed the challenges of serving large models, introduced the concept of tensor parallelism, and examined the benefits and trade-offs of leveraging multiple GPUs for inference. We also highlighted profiling tools for analyzing kernel performance and overhead, along with the potential challenges of adopting a disaggregated approach with separate nodes for prefill and decoding.
During the open discussion, we addressed various community questions, including practical applications of tensor parallelism in real-world scenarios, the impact of distributed inference on latency and throughput, and strategies for optimizing multi-GPU setups.
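To make the tensor parallelism discussed in the session concrete, here is a minimal sketch (not taken from the session itself) of offline inference with vLLM sharded across multiple GPUs via the tensor_parallel_size argument; the model name, GPU count, and prompt are placeholders chosen for illustration.

# Minimal sketch: single-node tensor-parallel inference with vLLM.
# Assumes a node with 4 GPUs; model and prompt are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model
    tensor_parallel_size=4,  # shard the model's weights across 4 GPUs
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
for out in outputs:
    print(out.outputs[0].text)

The same sharding applies when serving over HTTP, e.g. "vllm serve <model> --tensor-parallel-size 4"; scaling beyond a single node typically combines tensor parallelism within a node with pipeline parallelism across nodes.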
[vLLM Office Hours #27] Intro to llm-d for Distributed LLM Inference
vLLM Office Hours - DeepSeek and vLLM - February 27, 2025
vLLM Office Hours #22 - Intro to vLLM V1 - March 27, 2025
What is vLLM? Efficient AI Inference for Large Language Models
Accelerating LLM Inference with vLLM
vLLM Office Hours - Disaggregated Prefill and KV Cache Storage in vLLM - November 14, 2024
vLLM Office Hours - Speculative Decoding in vLLM - October 3, 2024
Distributed LLM inferencing across virtual machines using vLLM and Ray
Run A Local LLM Across Multiple Computers! (vLLM Distributed Inference)
What is vLLM & How do I Serve Llama 3.1 With It?
vLLM Office Hours - vLLM’s 2024 Wrapped and 2025 Vision - December 19, 2024
Unlock LLM Speed: VLLM Crushes the Competition!
Inference with vLLM on Aurora
Real Demo: Distributed vLLM Inference (zero exposure) – By Super Protocol...
[Ray Meetup] Ray + vLLM in Action: Lessons from Pinterest and Large Scale Distributed Inference
The Evolution of Multi-GPU Inference in vLLM | Ray Summit 2024
vLLM Office Hours - FP8 Quantization Deep Dive - July 9, 2024
Distributed Inference with Multi-Machine & Multi-GPU Setup | Deploying Large Models via vLLM & ...
Scalable and Efficient LLM Serving With the VLLM Production Stack - Junchen Jiang & Yue Zhu
Optimizing LLM Inference with AWS Trainium, Ray, vLLM, and Anyscale
Next-gen distributed LLM inference 🌐 #ai
Unlocking vLLM The Future of Open Source Inference Servers
vLLM Office Hours - vLLM Project Update and Open Discussion - January 09, 2025