Databricks' vLLM Optimization for Cost-Effective LLM Inference | Ray Summit 2024
At Ray Summit 2024, Megha Agarwal from Databricks (MosaicML) presents their team's work on enhancing vLLM for improved LLM inference performance. The talk focuses on Databricks' efforts to achieve industry-leading cost and performance in their LLM serving product.
Agarwal delves into the challenges of optimizing vLLM, focusing on operations that block the GPU during decoding steps. These operations, which include sampling, input tensor preparation, and output processing, can account for about 30% of each decoding step at large batch sizes for 70B+ models.
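To put the roughly 30% figure in perspective, here is a back-of-the-envelope sketch. It is not taken from the talk. It assumes the blocking work is purely serial with GPU compute and that some fraction of it can be hidden or removed. The function name `max_step_speedup` and the parameter `fraction_removed` are hypothetical, chosen only for illustration.

```python
def max_step_speedup(blocking_fraction: float, fraction_removed: float = 1.0) -> float:
    """Upper bound on per-step speedup if part of the blocking time is eliminated.

    blocking_fraction: share of each decoding step spent in GPU-blocking work
                       (about 0.30 in the description above).
    fraction_removed:  share of that blocking work that is hidden or removed
                       (an assumption; 1.0 means all of it).
    """
    if not 0.0 <= blocking_fraction < 1.0:
        raise ValueError("blocking_fraction must be in [0, 1)")
    if not 0.0 <= fraction_removed <= 1.0:
        raise ValueError("fraction_removed must be in [0, 1]")
    # Step time is normalized to 1.0; whatever is not removed still has to run.
    remaining_time = 1.0 - blocking_fraction * fraction_removed
    return 1.0 / remaining_time


if __name__ == "__main__":
    # Removing all of a 30% blocking share bounds the speedup at 1 / 0.7.
    print(f"all removed:  {max_step_speedup(0.30):.2f}x")
    # Removing only half of it gives 1 / 0.85.
    print(f"half removed: {max_step_speedup(0.30, 0.5):.2f}x")
```

Under these assumptions, the per-step speedup is capped at about 1.43x. Real gains depend on how much of the blocking work can actually be overlapped with GPU execution.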
The presentation covers Databricks' solutions to reduce GPU idle time and accelerate quantization using custom kernels. Agarwal shares insights from their experience as vLLM developers, discussing future optimization areas and offering best practices for benchmarking.
This session provides valuable information for organizations and developers working on large-scale LLM deployment, offering practical strategies to enhance inference efficiency and reduce costs.