Scaling Distributed Machine Learning leveraging VMware Bitfusion on Kubernetes with NVIDIA GPUs

VMware Bitfusion extends the power of VMware vSphere’s virtualization technology to GPUs. It helps enterprises disaggregate GPU compute and dynamically attach GPUs anywhere in the datacenter, just like attaching storage. Bitfusion also allows arbitrary fractions of a GPU to be allocated, so more users can be supported during the test and development phases.

Distributed machine learning across multiple nodes can be used effectively for training. This video demonstrates that GPUs can be shared across jobs with minimal loss of performance. VMware Bitfusion makes distributed training scalable across physical resources, so that GPU capacity is no longer bounded by a single host.

The solution showcases the benefits of combining best-in-class infrastructure, provided by NVIDIA GPUs and the VMware SDDC with Bitfusion, to run distributed machine learning across a scalable infrastructure. It demonstrates the value of distributed machine learning with Horovod on the vSphere platform. VMware Essential PKS was successfully deployed with NVIDIA vComputeServer for vGPU and with PVRDMA to provide a high-performance, container-based machine learning platform. Running machine learning benchmarks across multiple worker pods with independent GPUs, the solution showed excellent scalability while leveraging PVRDMA. vMotion testing validated that, even under heavy load, virtual machines using vGPUs and PVRDMA can be migrated successfully. This capability provides flexibility and improves the availability of high-performance machine learning environments.