Distributed Machine Learning with Horovod on VMware vSphere with NVIDIA GPUs and PVRDMA
The goal of this solution is to showcase distributed machine learning that uses hardware spread across multiple servers in the datacenter. Horovod with PyTorch serves as the machine learning platform, running on vSphere 7 based virtualized infrastructure. The entire ImageNet-based dataset is trained over many epochs (full passes over the data) to create a robust, generalized image recognition model. The solution uses GPU-based acceleration for training and 100 Gbps RoCE networking for PVRDMA. Training scalability from 1 to 4 nodes is measured, comparing TCP/IP and PVRDMA performance.
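One common way to summarize a 1-to-4-node scalability comparison like this is as speedup and scaling efficiency relative to the single-node run. The sketch below is a minimal illustration, not the method used in the video: the helper name `scaling_report` and the use of throughput (for example, images per second) as the input metric are assumptions, and the numbers in the usage lines are placeholders, not measured results.

```python
def scaling_report(throughput_by_nodes):
    """Compute speedup and scaling efficiency per node count.

    throughput_by_nodes: dict mapping node count -> training throughput
    (any consistent unit, e.g. images/sec). Hypothetical helper for illustration.
    Returns a dict mapping node count -> (speedup, efficiency), both relative
    to the smallest node count in the input.
    """
    if not throughput_by_nodes:
        raise ValueError("need at least one measurement")

    base_nodes = min(throughput_by_nodes)
    base_throughput = throughput_by_nodes[base_nodes]
    if base_throughput <= 0:
        raise ValueError("baseline throughput must be positive")

    report = {}
    for nodes in sorted(throughput_by_nodes):
        # Speedup: how much faster than the baseline run.
        speedup = throughput_by_nodes[nodes] / base_throughput
        # Ideal (linear) speedup grows in proportion to the node count.
        ideal = nodes / base_nodes
        # Efficiency: fraction of the ideal speedup actually achieved.
        report[nodes] = (speedup, speedup / ideal)
    return report


if __name__ == "__main__":
    # Placeholder throughputs for illustration only, NOT results from the video.
    placeholder = {1: 100.0, 2: 200.0, 4: 300.0}
    for nodes, (speedup, efficiency) in scaling_report(placeholder).items():
        print(f"{nodes} node(s): speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Running the same helper once on TCP/IP measurements and once on PVRDMA measurements would put the two transports on a comparable footing, since efficiency is normalized to each transport's own single-node baseline.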