Accelerated LLM Inference with Anyscale | Ray Summit 2024
At Ray Summit 2024, Anyscale Co-Founder and CTO Philipp Moritz, along with Cody Yu, presents Anyscale's new enterprise and production LLM features, as well as the team's contributions to open-source inference engines.
In this talk, Moritz and Yu detail how the Anyscale team has collaborated with the vLLM open-source project, highlighting key advancements such as FP8 support, chunked prefill, multi-step decoding, and speculative decoding. They explain how these optimizations have delivered significant performance gains in vLLM, roughly doubling throughput and improving latency. The presentation also covers Anyscale-specific enhancements, including custom kernels, batch inference optimizations, and accelerated large-model loading for autoscaling deployments.
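Of the techniques mentioned, speculative decoding is easy to illustrate in miniature: a cheap draft model proposes several tokens ahead, and the expensive target model verifies them, accepting the longest agreeing prefix plus one corrected token. The sketch below is a toy illustration only, assuming stand-in `draft_next`/`target_next` functions; it is not vLLM's actual API or Anyscale's implementation.

```python
def target_next(prefix):
    # Stands in for the expensive "target" model: toy deterministic rule,
    # next token = (last + 1) mod 10.
    return (prefix[-1] + 1) % 10

def draft_next(prefix):
    # Stands in for the cheap "draft" model: agrees with the target
    # everywhere except after token 4, where it guesses wrong.
    last = prefix[-1]
    return 0 if last == 4 else (last + 1) % 10

def speculative_step(prefix, k):
    """One speculative decoding step: draft k tokens, then verify.

    Returns the tokens emitted this step: the accepted draft prefix
    plus one token from the target model (a correction or a bonus token).
    """
    # 1) Draft phase: autoregressively propose k tokens with the cheap model.
    draft, cur = [], list(prefix)
    for _ in range(k):
        t = draft_next(cur)
        draft.append(t)
        cur.append(t)

    # 2) Verify phase: the target model checks every drafted position
    #    (in a real engine this is a single batched forward pass) and
    #    accepts the longest prefix it agrees with.
    accepted, cur = [], list(prefix)
    for t in draft:
        if target_next(cur) != t:
            break
        accepted.append(t)
        cur.append(t)

    # 3) Always emit one target-model token, so every step makes progress
    #    even when the very first draft token is rejected.
    accepted.append(target_next(cur))
    return accepted

# With prefix [2] and k=3, the draft proposes [3, 4, 0]; the target
# accepts [3, 4], rejects 0, and emits the correction 5.
print(speculative_step([2], 3))
```

The payoff is that when the draft model is usually right, several tokens are committed per expensive target-model pass instead of one.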
This breakout session is a must-watch for anyone looking to gain insights into the latest techniques for improving LLM inference efficiency and scalability.