Enabling Cost-Efficient LLM Serving with Ray Serve

Ray Serve is the cheapest and easiest way to deploy LLMs, and has served billions of tokens in Anyscale Endpoints. This talk discusses how Ray Serve reduces cost via fine-grained autoscaling, continuous batching, and model parallel inference, as well as the work we've done to make it easy to deploy any Hugging Face model with these optimizations.
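To illustrate why continuous batching matters, here is a hypothetical simulation (not Ray Serve or vLLM code) contrasting static batching, where a whole batch waits for its slowest request, with continuous batching, where finished sequences are replaced immediately from the queue. Request lengths are measured in decode steps; all names are illustrative.

```python
def static_batching_steps(requests, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(requests), batch_size):
        steps += max(requests[i:i + batch_size])
    return steps


def continuous_batching_steps(requests, batch_size):
    """Continuous batching: finished sequences free their slot right away."""
    queue = list(requests)
    active = []
    steps = 0
    while queue or active:
        # Admit queued requests into any free batch slots.
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))
        steps += 1
        # One decode step for every active sequence; drop finished ones.
        active = [r - 1 for r in active if r - 1 > 0]
    return steps


# Two long requests mixed with short ones: continuous batching avoids
# idling GPU slots behind the stragglers.
requests = [8, 1, 1, 1, 8, 1, 1, 1]
print(static_batching_steps(requests, 4))      # -> 16
print(continuous_batching_steps(requests, 4))  # -> 9
```

In the toy run above, continuous batching finishes the same workload in 9 decode steps instead of 16, because short requests stop blocking slots behind the two 8-step requests.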

Takeaways:

• Learn how Ray Serve saves costs by using fewer GPUs with fine-grained autoscaling and by integrating with libraries like vLLM to maximize GPU utilization.
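The core of the cost saving is scaling replicas to the observed load rather than provisioning for the peak. Below is a simplified sketch of that decision, assuming a policy of one replica per fixed number of in-flight requests, clamped to a min/max range. This is illustrative logic, not Ray Serve's actual autoscaler; Ray Serve exposes comparable knobs (minimum and maximum replica counts and a per-replica request target) through its deployment autoscaling configuration.

```python
import math


def desired_replicas(ongoing_requests, target_per_replica,
                     min_replicas, max_replicas):
    """Pick a replica count so each replica handles roughly
    target_per_replica in-flight requests, within [min, max]."""
    raw = math.ceil(ongoing_requests / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))


# Traffic spike: 120 in-flight requests at 16 per replica -> 8 replicas.
print(desired_replicas(120, 16, min_replicas=0, max_replicas=16))  # -> 8
# Idle: with min_replicas=0, scale to zero and hold no GPUs at all.
print(desired_replicas(0, 16, min_replicas=0, max_replicas=16))    # -> 0
```

Allowing the minimum to be zero is what lets an infrequently used model release its GPUs entirely between bursts of traffic.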

About Anyscale
---
Anyscale is the AI Application Platform for developing, running, and scaling AI.

If you're interested in a managed Ray service, check out:

About Ray
---
Ray is the most popular open source framework for scaling and productionizing AI workloads. From Generative AI and LLMs to computer vision, Ray powers the world’s most ambitious AI workloads.

#llm #machinelearning #ray #deeplearning #distributedsystems #python #genai
Comments
---

It should be noted that since this talk, Anyscale has deprecated Ray LLM and now recommends vLLM.

elephantum