Distributed Training with PyTorch: complete tutorial with cloud infrastructure and code

A complete tutorial on how to train a model on multiple GPUs or multiple servers.
I first describe the difference between Data Parallelism and Model Parallelism. Later, I explain the concept of gradient accumulation (including all the maths behind it). Then we get to the practical tutorial: first we create a cluster on Paperspace with two servers (each having two GPUs), and then we train a model in a distributed manner on the cluster.
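For readers who want the gist in code, here is a minimal sketch of gradient accumulation (model, dataloader, loss_fn, optimizer, and accumulation_steps are placeholder names, not the exact code from the video): gradients from several micro-batches are summed locally before a single optimizer step, emulating a larger effective batch size.

```python
# Gradient accumulation: sum gradients over several micro-batches, then take
# one optimizer step (effective batch = micro-batch size * accumulation_steps).
accumulation_steps = 4

for step, (inputs, targets) in enumerate(dataloader):
    outputs = model(inputs)
    # Scale the loss so the accumulated gradient equals the gradient of the
    # mean loss over the whole effective batch.
    loss = loss_fn(outputs, targets) / accumulation_steps
    loss.backward()  # gradients add up in param.grad across micro-batches

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # one weight update per effective batch
        optimizer.zero_grad()  # reset the accumulated gradients
```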
We will explore the collective communication primitives Broadcast, Reduce, and All-Reduce, and the algorithms behind them.
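As a quick reference, these primitives map directly onto torch.distributed calls. The snippet below is a hedged sketch: it assumes the process group has already been initialized (for example by torchrun) and uses small CPU tensors for illustration (with the nccl backend they would need to live on the local GPU).

```python
import torch
import torch.distributed as dist

rank = dist.get_rank()

# Broadcast: rank 0 sends its tensor to every other rank (e.g. initial weights).
params = torch.arange(4.0) if rank == 0 else torch.zeros(4)
dist.broadcast(params, src=0)

# Reduce: sum every rank's tensor into the destination rank only.
grad = torch.ones(4) * rank
dist.reduce(grad, dst=0, op=dist.ReduceOp.SUM)

# All-Reduce: sum every rank's tensor and leave the result on all ranks.
# This is the operation DDP uses to average gradients during backprop.
grad = torch.ones(4) * rank
dist.all_reduce(grad, op=dist.ReduceOp.SUM)
```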
I also provide a template showing how to integrate DistributedDataParallel into your existing training loop.
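The sketch below shows the typical shape of such a template, not the exact script from the video; MyModel, train_dataset, num_epochs, and loss_fn are placeholders for your own code.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def main():
    dist.init_process_group(backend="nccl")      # torchrun provides the env vars
    local_rank = int(os.environ["LOCAL_RANK"])   # GPU index on this node
    torch.cuda.set_device(local_rank)

    model = MyModel().cuda(local_rank)           # placeholder model
    model = DDP(model, device_ids=[local_rank])  # wrap: gradients get all-reduced

    sampler = DistributedSampler(train_dataset)  # each rank sees its own shard
    loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)                 # reshuffle shards every epoch
        for inputs, targets in loader:
            inputs = inputs.cuda(local_rank)
            targets = targets.cuda(local_rank)
            loss = loss_fn(model(inputs), targets)
            optimizer.zero_grad()
            loss.backward()                      # DDP all-reduces gradients here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

On each node, the launch looks something like `torchrun --nproc_per_node=2 --nnodes=2 --node_rank=<0 or 1> --rdzv_endpoint=<master_ip>:29500 train.py`; torchrun then sets RANK (the global rank across all nodes) and LOCAL_RANK (the rank within one node) as environment variables.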
In the last part of the video we review advanced topics, like bucketing and computation-communication overlap during backpropagation.
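Both of these, together with the no_sync context from the chapter list, surface in the DistributedDataParallel API. The fragment below is a rough sketch that reuses the placeholder names from the template above (model starts as an unwrapped module; accumulation_steps is assumed).

```python
# Bucketing: DDP groups parameter gradients into buckets (bucket_cap_mb, about
# 25 MB by default) so the all-reduce of one finished bucket can overlap with
# the backprop computation of the layers that are still pending.
model = DDP(model, device_ids=[local_rank], bucket_cap_mb=25)

# no_sync: when combining DDP with gradient accumulation, skip the gradient
# all-reduce on intermediate micro-batches and synchronize only on the last one.
for step, (inputs, targets) in enumerate(loader):
    inputs, targets = inputs.cuda(local_rank), targets.cuda(local_rank)
    if (step + 1) % accumulation_steps != 0:
        with model.no_sync():   # forward + backward with no communication
            loss = loss_fn(model(inputs), targets) / accumulation_steps
            loss.backward()
    else:
        loss = loss_fn(model(inputs), targets) / accumulation_steps
        loss.backward()         # this backward triggers the all-reduce
        optimizer.step()
        optimizer.zero_grad()
```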

Chapters
00:00:00 - Introduction
00:02:43 - What is distributed training?
00:04:44 - Data Parallelism vs Model Parallelism
00:06:25 - Gradient accumulation
00:19:38 - Distributed Data Parallel
00:26:24 - Collective Communication Primitives
00:28:39 - Broadcast operator
00:30:28 - Reduce operator
00:32:39 - All-Reduce
00:33:20 - Failover
00:36:14 - Creating the cluster (Paperspace)
00:49:00 - Distributed Training with TorchRun
00:54:57 - LOCAL RANK vs GLOBAL RANK
00:56:05 - Code walkthrough
01:06:47 - No_Sync context
01:08:48 - Computation-Communication overlap
01:10:50 - Bucketing
01:12:11 - Conclusion
Comments

This is the best video about Torch distributed I have ever seen. Thanks for making this video!

thinhon

I really love your videos. You have a natural talent for simplifying logic and code, in the same capacity as Andrej.

abdallahbashir

This is the second video I've watched from this channel, after "quantization", and frankly I wanted to express my gratitude for your work: it is very easy to follow, and the level of abstraction makes it tenable to understand the concepts holistically.

КириллКлимушин

Great video, thanks for creating this. I have used DDP quite a lot, but seeing the visualizations for communication overlap helped me build a very good mental model.
Would love to see more content around distributed training: DeepSpeed ZeRO, Megatron DP + TP + PP.

chiragjn

Starting to watch my third video on this channel, after transformer from scratch and quantization. Thank you for the great content, and also for the code and notes to look back on.

amishasomaiya

That's an amazing resource! It's great to see you sharing such detailed information on a complex topic. Your effort to explain everything clearly will really help others understand and apply these concepts. Keep up the great work!

rachadlakis

Thank you for the tutorial. It is really helpful for learning beyond the PyTorch documentation.

normxu

Great introduction. Love the pace of the class and the balance of breadth vs depth.

jiankunli

Super high quality lecture. You have a gift of teaching, man. Thank you!

karanacharya

Dang. Never thought learning DDP would be this easy. More great content from Umar. Looking forward to FSDP.

tharunbhaskar

Amazing video. An ideal example of how a video lecture should be.

pulkitnijhawan

Absolutely amazing! You made these concepts so accessible!

thuanncats

Incredible content, Umar! Well done! 🎉

Maximos

Umar hits the sweet spot (Goldilocks zone) by balancing theory and practice 😄😄😄😄😄

nithinma

Amazing content! Thanks for sharing.

cken

It's amazing. Thank you, sir.

tribunetech

The video was very interesting and useful. Please make a similar video on DeepSpeed functionality, and in general on how to train large models (for example, LLaMA SFT) on distributed multi-server systems where the GPUs sit in different machines.

МихаилЮрков-тэ

You deserve many more likes and subscribers!

nova

Thank you so much for this amazing video. It is really informative.

prajolshrestha

Thank you very much for your wonderful video. Could you make a video on how to use the Accelerate library with DDP?

huu-lc