Distributed Deep Learning with Horovod on Ray - Travis Addair, Uber

Horovod is an open source framework created to make distributed training of deep neural networks fast and easy for TensorFlow, PyTorch, and MXNet models. Horovod's API makes it easy to take an existing training script and scale it to run on hundreds of GPUs, but provisioning a Horovod job at that scale can be a challenge for users who lack access to HPC systems preconfigured with tools like MPI. The newly introduced Elastic Horovod API adds fault tolerance and auto-scaling capabilities, but requires further infrastructure scaffolding to configure. In this talk, you will learn how Horovod on Ray can be used to easily provision large distributed Horovod jobs and to take advantage of Ray's auto-scaling and fault tolerance with Elastic Horovod out of the box. With Ray Tune integration, Horovod can also accelerate your time-constrained hyperparameter search jobs. Finally, we'll show how Ray and Horovod are helping to define the future of machine learning workflows at scale.
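For context, here is a minimal sketch of what launching a Horovod job on Ray looks like with the horovod.ray.RayExecutor API. The PyTorch model and training loop are placeholder assumptions for illustration, not code from the talk:

```python
import ray
import torch
import horovod.torch as hvd
from horovod.ray import RayExecutor


def train_fn():
    # Each Ray worker executes this function as one Horovod rank.
    hvd.init()
    model = torch.nn.Linear(10, 1)  # toy model, assumption for this sketch
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
    # Wrap the optimizer so gradients are averaged across all workers.
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters()
    )
    # Start all ranks from the same initial weights.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    for _ in range(10):
        optimizer.zero_grad()
        loss = model(torch.randn(32, 10)).pow(2).mean()
        loss.backward()
        optimizer.step()
    return float(loss)


ray.init()  # starts a local cluster; pass address="auto" to join an existing one
settings = RayExecutor.create_settings(timeout_s=30)
executor = RayExecutor(settings, num_workers=2, use_gpu=False)  # use_gpu=True on a GPU cluster
executor.start()
results = executor.run(train_fn)  # one return value per worker
executor.shutdown()
```

The elastic variant, ElasticRayExecutor in the same module, follows the same start/run pattern while letting Ray add or remove workers mid-run; exact names and signatures may differ across Horovod versions.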
Comments

illeatmyhat: Have you done a comparison against RaySGD?