Maximizing GPU Utilization Over Multi-Cluster: Challenges and Solutions for Cloud-Native AI Platform


Maximizing GPU Utilization Over Multi-Cluster: Challenges and Solutions for Cloud-Native AI Platform - William Wang & Hongcai Ren, Huawei

With the increasing use of Kubernetes for AI/ML workloads, many companies build their cloud-native AI platforms across multiple Kubernetes clusters that span data centers and a diverse range of GPU types. However, managing such a large-scale, heterogeneous GPU environment presents critical challenges, including resource fragmentation, operational costs, and scheduling workloads across different resources. This talk will explore how these challenges are addressed using Karmada and Volcano, which together enable multi-cluster batch job management alongside other types of workloads. This talk will cover:

• Intelligent GPU workload scheduling across multiple clusters
• Cluster failover support for seamless workload migration to clusters with available resources
• Two-level scheduling consistency and efficiency, both in-cluster and across clusters
• Balancing utilization and QoS when sharing resources among workloads with different priorities
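As an illustrative sketch of the pattern the talk describes (not taken from the talk itself), a GPU batch job can be defined as a Volcano Job and distributed to member clusters with a Karmada PropagationPolicy. All names, images, replica counts, and cluster names below are hypothetical placeholders:

```yaml
# A Volcano batch job handled by the volcano scheduler inside a member cluster.
# Job name, image, and GPU counts are hypothetical examples.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: train-demo
spec:
  schedulerName: volcano
  minAvailable: 2          # gang scheduling: start only when 2 pods can run
  tasks:
    - replicas: 2
      name: worker
      template:
        spec:
          containers:
            - name: trainer
              image: example/trainer:latest
              resources:
                limits:
                  nvidia.com/gpu: 1
          restartPolicy: Never
---
# A Karmada PropagationPolicy that propagates the job to member clusters;
# cluster names are placeholders for clusters with available GPU capacity.
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: train-demo-policy
spec:
  resourceSelectors:
    - apiVersion: batch.volcano.sh/v1alpha1
      kind: Job
      name: train-demo
  placement:
    clusterAffinity:
      clusterNames:
        - cluster-a
        - cluster-b
```

In this arrangement Karmada handles the cross-cluster placement decision (the first scheduling level), while Volcano performs gang scheduling of the job's pods inside whichever member cluster receives it (the second level).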