Distributed deep learning best practices

Dr. David R. Pugh, Visualization Scientist, Visualization
Dr. Glendon Holst, Visualization Scientist, Visualization
Dr. Mohsin Ahmed Shaikh, Computational Scientist, KAUST Supercomputing Lab

With the increasing size and complexity of both Deep Learning (DL) models and datasets, the computational cost of training these models can be non-trivial, ranging from a few tens of hours to several days. By exploiting the data parallelism inherent in the training process of DL models, we can distribute training across multiple GPUs on a single node or on multiple nodes of Ibex. We discuss and demonstrate the use of Horovod, a scalable distributed training framework, to train DL models on multiple GPUs on Ibex. Horovod integrates with TensorFlow 1 & 2, PyTorch, and MXNet. We also discuss some caveats to watch for when using Horovod for large mini-batch training.
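To make the idea concrete, below is a minimal sketch of data-parallel training with Horovod's PyTorch API. The model, dataset, and hyperparameters are placeholders rather than anything shown in the session; the point is the Horovod pattern: one process per GPU, an optimizer wrapped so gradients are averaged across workers, initial state broadcast from rank 0, and the learning rate scaled by the number of workers, which is the usual mitigation for the large mini-batch caveat mentioned above.

import torch
import torch.nn as nn
import torch.utils.data.distributed
import horovod.torch as hvd

hvd.init()                                   # start Horovod (one process per GPU)
torch.cuda.set_device(hvd.local_rank())      # pin each process to its own GPU

model = nn.Linear(784, 10).cuda()            # placeholder model

# Large mini-batch caveat: the effective batch size grows with the number of
# workers, so the learning rate is commonly scaled by hvd.size(), often
# together with a warmup schedule.
base_lr = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr * hvd.size())

# Wrap the optimizer so gradients are averaged across workers via allreduce.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Ensure every worker starts from the same weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# Each worker trains on its own shard of the data (placeholder random data here).
dataset = torch.utils.data.TensorDataset(torch.randn(1024, 784),
                                         torch.randint(0, 10, (1024,)))
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)

loss_fn = nn.CrossEntropyLoss()
for epoch in range(2):
    sampler.set_epoch(epoch)                 # reshuffle shards each epoch
    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    if hvd.rank() == 0:
        print(f"epoch {epoch} done, last loss {loss.item():.4f}")

Such a script could then be launched on, say, four GPUs with horovodrun -np 4 python train.py, or through the cluster scheduler on Ibex; the exact launch command depends on the site's Slurm and MPI configuration.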