Maximizing GPU Utilization Over Multi-Cluster: Challenges and Solutions for Cloud-Native AI Platform


Maximizing GPU Utilization Over Multi-Cluster: Challenges and Solutions for Cloud-Native AI Platform - William Wang & Hongcai Ren, Huawei

With the increasing use of Kubernetes for AI/ML workloads, many companies build their cloud-native AI platforms across multiple Kubernetes clusters that span data centers and a diverse range of GPU types. However, managing such a large-scale, heterogeneous GPU environment presents critical challenges, including resource fragmentation, operational costs, and scheduling workloads across different resources. This talk will explore how these challenges are addressed using Karmada and Volcano, which together enable multi-cluster batch job management alongside other types of workloads. This talk will cover:

• Intelligent GPU workload scheduling across multiple clusters
• Cluster failover support for seamless workload migration to clusters with available resources
• Two-level scheduling consistency and efficiency, both in-cluster and across clusters
• Balancing utilization and QoS when sharing resources among workloads with different priorities
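As an illustrative sketch of the pattern the talk describes (not taken from the talk itself), a GPU batch job can be defined as a Volcano Job and distributed to member clusters with a Karmada PropagationPolicy. All names, images, replica counts, and cluster names below are hypothetical placeholders:

```yaml
# A Volcano batch job handled by the volcano scheduler inside a member cluster.
# Job name, image, and GPU counts are hypothetical examples.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: train-demo
spec:
  schedulerName: volcano
  minAvailable: 2          # gang scheduling: start only when 2 pods can run
  tasks:
    - replicas: 2
      name: worker
      template:
        spec:
          containers:
            - name: trainer
              image: example/trainer:latest
              resources:
                limits:
                  nvidia.com/gpu: 1
          restartPolicy: Never
---
# A Karmada PropagationPolicy that propagates the job to member clusters;
# cluster names are placeholders for clusters with available GPU capacity.
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: train-demo-policy
spec:
  resourceSelectors:
    - apiVersion: batch.volcano.sh/v1alpha1
      kind: Job
      name: train-demo
  placement:
    clusterAffinity:
      clusterNames:
        - cluster-a
        - cluster-b
```

In this arrangement Karmada handles the cross-cluster placement decision (the first scheduling level), while Volcano performs gang scheduling of the job's pods inside whichever member cluster receives it (the second level).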