Scaling Ray Train to 10K Kubernetes Nodes on GKE | Ray Summit 2024
In the race to train larger and more complex AI models, the ability to scale efficiently across massive compute clusters is paramount. This session unveils Google's groundbreaking achievement in scaling Ray Train and Ray Data to an unprecedented 10,000-node Kubernetes cluster on Google Kubernetes Engine (GKE).
Andrew Sy Kim and Saikat Roychowdhury will take you on a deep dive into the architecture and innovations that made this feat possible. They'll explore the synergy between KubeRay and key Kubernetes enhancements, revealing how these tools work in concert to manage enormous distributed workloads. The talk will also spotlight advancements in GKE's GCS Fuse CSI driver, demonstrating its crucial role in enabling distributed checkpointing and dataset processing at scale. Attendees will gain valuable insights into pushing the boundaries of distributed machine learning infrastructure, applicable to both modest and massive deployments.