Tackling Challenges of Distributed Deep Learning with Open Source Solutions

Deep learning has had an enormous impact in a variety of domains. However, with model and data sizes growing at a rapid pace, scaling out deep learning training has become essential for practical use.
In this talk, you will learn about the challenges and various solutions for distributed deep learning.
We will first cover some of the common patterns used to scale out deep learning training.
We will then describe some of the challenges with distributed deep learning in practice:
Infrastructure and hardware management: spending too much time managing clusters, resources, and the scheduling/placement of jobs or processes.
Developer iteration speed: too much overhead to go from small-scale local ML development to large-scale training, and difficulty running distributed training jobs in a notebook/interactive environment.
Integration with open source software: scaling out training while still being able to leverage open source tools such as MLflow, PyTorch Lightning, and Hugging Face.
Managing large-scale training data: efficiently ingesting large amounts of training data into a distributed machine learning model.
Cloud compute costs: leveraging cheaper spot instances without having to restart training when a node is preempted, and switching between cloud providers to reduce costs without rewriting code.
Then, we will share the merits of the open source ML ecosystem for distributed deep learning. In particular, we will introduce Ray Train, an open source library built on the Ray distributed execution framework, and show how its integrations with other open source libraries (PyTorch, Hugging Face, MLflow, etc.) alleviate the pain points above.
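To make this concrete, below is a minimal sketch of distributed data-parallel PyTorch training with Ray Train's TorchTrainer. It assumes a recent Ray 2.x release; helper names such as prepare_model, prepare_data_loader, and train.report have shifted between Ray versions, and the toy model and synthetic data are placeholders, so treat it as illustrative rather than the exact code shown in the talk or demo.

```python
# Minimal Ray Train sketch (assumes a recent Ray 2.x API; module paths and
# helper names have changed across versions, so check the docs for your release).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, prepare_model, prepare_data_loader


def train_loop_per_worker(config):
    # Toy model and synthetic data; a real job would load its own dataset.
    model = nn.Linear(10, 1)
    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
    loader = DataLoader(dataset, batch_size=config["batch_size"])

    # Ray Train wraps the model in DistributedDataParallel and attaches a
    # DistributedSampler to the data loader, so the training loop stays unchanged.
    model = prepare_model(model)
    loader = prepare_data_loader(loader)

    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    loss_fn = nn.MSELoss()

    for epoch in range(config["epochs"]):
        for x, y in loader:
            loss = loss_fn(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Report metrics back to the driver; logging integrations (e.g. MLflow)
        # can pick these up.
        train.report({"epoch": epoch, "loss": loss.item()})


if __name__ == "__main__":
    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"batch_size": 64, "lr": 1e-3, "epochs": 2},
        # Scaling out is a matter of changing num_workers / use_gpu here.
        scaling_config=ScalingConfig(num_workers=4, use_gpu=False),
    )
    result = trainer.fit()
    print(result.metrics)
```

Fault tolerance and experiment tracking are configured alongside this, for example via Ray Train's run configuration (failure handling with checkpointing for spot-instance preemption, and an MLflow logger callback); the exact options depend on the Ray version in use.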
We will conclude with a live demo showing large-scale distributed training using these open source tools.