Enabling Cost-Efficient LLM Serving with Ray Serve

Ray Serve is the cheapest and easiest way to deploy LLMs, and has served billions of tokens in Anyscale Endpoints. This talk discusses how Ray Serve reduces cost via fine-grained autoscaling, continuous batching, and model parallel inference, as well as the work we've done to make it easy to deploy any Hugging Face model with these optimizations.
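To illustrate why continuous batching matters, here is a hypothetical simulation (not Ray Serve or vLLM code) contrasting static batching, where a whole batch waits for its slowest request, with continuous batching, where finished sequences are replaced immediately from the queue. Request lengths are measured in decode steps; all names are illustrative.

```python
def static_batching_steps(requests, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(requests), batch_size):
        steps += max(requests[i:i + batch_size])
    return steps


def continuous_batching_steps(requests, batch_size):
    """Continuous batching: finished sequences free their slot right away."""
    queue = list(requests)
    active = []
    steps = 0
    while queue or active:
        # Admit queued requests into any free batch slots.
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))
        steps += 1
        # One decode step for every active sequence; drop finished ones.
        active = [r - 1 for r in active if r - 1 > 0]
    return steps


# Two long requests mixed with short ones: continuous batching avoids
# idling GPU slots behind the stragglers.
requests = [8, 1, 1, 1, 8, 1, 1, 1]
print(static_batching_steps(requests, 4))      # -> 16
print(continuous_batching_steps(requests, 4))  # -> 9
```

In the toy run above, continuous batching finishes the same workload in 9 decode steps instead of 16, because short requests stop blocking slots behind the two 8-step requests.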

Takeaways:

• Learn how Ray Serve saves costs by using fewer GPUs with fine-grained autoscaling and by integrating with libraries like vLLM to maximize GPU utilization.
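The core of the cost saving is scaling replicas to the observed load rather than provisioning for the peak. Below is a simplified sketch of that decision, assuming a policy of one replica per fixed number of in-flight requests, clamped to a min/max range. This is illustrative logic, not Ray Serve's actual autoscaler; Ray Serve exposes comparable knobs (minimum and maximum replica counts and a per-replica request target) through its deployment autoscaling configuration.

```python
import math


def desired_replicas(ongoing_requests, target_per_replica,
                     min_replicas, max_replicas):
    """Pick a replica count so each replica handles roughly
    target_per_replica in-flight requests, within [min, max]."""
    raw = math.ceil(ongoing_requests / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))


# Traffic spike: 120 in-flight requests at 16 per replica -> 8 replicas.
print(desired_replicas(120, 16, min_replicas=0, max_replicas=16))  # -> 8
# Idle: with min_replicas=0, scale to zero and hold no GPUs at all.
print(desired_replicas(0, 16, min_replicas=0, max_replicas=16))    # -> 0
```

Allowing the minimum to be zero is what lets an infrequently used model release its GPUs entirely between bursts of traffic.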

About Anyscale
---
Anyscale is the AI Application Platform for developing, running, and scaling AI.

If you're interested in a managed Ray service, check out:

About Ray
---
Ray is the most popular open source framework for scaling and productionizing AI workloads. From Generative AI and LLMs to computer vision, Ray powers the world’s most ambitious AI workloads.

#llm #machinelearning #ray #deeplearning #distributedsystems #python #genai
Comments
---

It should be noted that since this talk, Anyscale has deprecated Ray LLM and now recommends vLLM.

elephantum