Scaling LLM Inference: AWS Inferentia Meets Ray Serve on EKS | Ray Summit 2024

The race for efficient, scalable AI inference is on, and AWS is at the forefront with innovative solutions. This session showcases how to achieve high-performance, cost-effective inference for large language models like Llama 2 and Mistral-7B using Ray Serve and AWS Inferentia on Amazon EKS.
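For context, here is a minimal sketch of what single-node inference on Inferentia looks like with the AWS Neuron SDK's transformers-neuronx library. The checkpoint path, tensor-parallel degree, and sequence length are illustrative assumptions, not details from the session.

```python
# Sketch: compile and sample from Llama 2 on Inferentia2 with transformers-neuronx.
# The checkpoint path and parameters below are illustrative assumptions.
import torch
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling

model_dir = "/models/llama-2-7b"  # hypothetical local checkpoint path

tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Shard the model across 2 NeuronCores in fp16 and compile it for the device.
model = LlamaForSampling.from_pretrained(model_dir, tp_degree=2, amp="f16")
model.to_neuron()

prompt = "Explain Ray Serve in one sentence."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
with torch.inference_mode():
    output = model.sample(input_ids, sequence_length=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```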

Vara Bonthu and Ratnopam Chakrabarti will guide you through the intricacies of building a scalable inference infrastructure that bypasses GPU availability constraints. They'll demonstrate how the synergy between Ray Serve, AWS Neuron SDK, and Karpenter autoscaler on Amazon EKS creates a powerful, flexible environment for AI workloads. Attendees will explore strategies for optimizing costs while maintaining high performance, opening new possibilities for deploying and scaling advanced language models in production environments.
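As a rough illustration of how those pieces fit together, the hedged sketch below wraps a Neuron-compiled model in a Ray Serve deployment that requests Ray's "neuron_cores" resource. The replica counts and the load_neuron_model helper are hypothetical; on EKS, Karpenter would provision Inferentia (inf2) nodes when pending replicas request NeuronCores that no existing node can satisfy.

```python
# Sketch: a Ray Serve deployment targeting Inferentia via Ray's "neuron_cores"
# resource. Replica counts are illustrative; load_neuron_model is a hypothetical
# helper that compiles/loads the model as in the earlier sketch.
from ray import serve
from starlette.requests import Request


@serve.deployment(
    ray_actor_options={"resources": {"neuron_cores": 2}},  # schedule onto Inferentia
    autoscaling_config={"min_replicas": 1, "max_replicas": 4},
)
class LlamaService:
    def __init__(self):
        # Load and compile the Neuron model once per replica.
        self.model, self.tokenizer = load_neuron_model("/models/llama-2-7b")

    async def __call__(self, request: Request) -> str:
        prompt = (await request.json())["prompt"]
        input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids
        output = self.model.sample(input_ids, sequence_length=256)
        return self.tokenizer.decode(output[0], skip_special_tokens=True)


app = LlamaService.bind()
# Deploy with serve.run(app), typically wired up through a RayService resource on EKS.
```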
