
How We Power the Largest AI Deployments on the Planet: Running Virtual Clusters at Scale - Brandon Jacobs, CoreWeave & Lukas Gentele, Loft Labs

Running and managing a large number of Kubernetes clusters on bare metal poses significant challenges, from security to GPU provisioning to scalability. Specialized cloud provider CoreWeave experienced these first-hand, operating 3,000+ Kubernetes clusters on top of 5,000 bare metal nodes with a massive number of GPUs to power modern AI applications at scale. In this session, we'll dive into these challenges and how CoreWeave partnered with Loft Labs, the maintainers of vcluster, to create a serverless Kubernetes experience for numerous companies running AI workloads at scale. The session demonstrates the pitfalls, design choices and architectural challenges the teams have dealt with over the course of 3 years while evolving CoreWeave's serverless Kubernetes offering, including:
- Secure isolation of tenants on a shared infrastructure
- Challenges in achieving 10-second autoscaling
- On-demand cluster and compute provisioning for tenants
- Day 2 operations and managing a fleet of clusters at scale