Enabling Fault Tolerance for GPU Accelerated AI Workloads in Kubernetes - A. Singh & A. Paithankar

Enabling Fault Tolerance for GPU Accelerated AI Workloads in Kubernetes - Arpit Singh & Abhijit Paithankar, NVIDIA

In Kubernetes-based ML platforms, job failures from hardware errors such as GPU malfunctions, network disruptions, ECC errors, and OOM events pose significant challenges. These failures cause resource underutilization, wasted engineering time, and high operational costs, and they often require users to resubmit jobs. Current AI/ML frameworks lack adequate fault tolerance strategies, typically requiring manual intervention and causing delays before jobs can resume. This talk explores fault tolerance strategies including naive job restarts on failure, job restarts with hot spares, and job restarts that replace faulty nodes. We discuss how to achieve fault propagation by leveraging node and pod conditions, and we address gaps in fault discovery and error propagation in the existing Kubernetes ecosystem. The talk also covers ways to enhance components like the node-problem-detector and introduces new elements to close the gaps in fault detection, propagation, reaction, and remediation.

CNCF [Cloud Native Computing Foundation]
