I explain Fully Sharded Data Parallel (FSDP) and pipeline parallelism in 3D with Vision Pro

Build intuition about how scaling massive LLMs works. I cover two techniques for making LLM training fast, Fully Sharded Data Parallel (FSDP) and pipeline parallelism, visualized in 3D with the Vision Pro. I'm excited to see how AR can help teach complex ideas.

It has been a long-time dream of mine to show, conceptually, how I visualize these systems.
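To make the core FSDP idea concrete, here is a minimal single-process sketch in plain NumPy (not the real PyTorch FSDP API): each "rank" stores only a shard of a layer's flattened parameters, and the full weight is rebuilt by an all-gather (simulated here as a concatenation) just before the forward pass, then discarded.

```python
# Conceptual FSDP sketch in plain NumPy. Hypothetical single-process
# simulation: "ranks" are list indices and the all-gather is a simple
# concatenation, not a real collective over GPUs.
import numpy as np

WORLD_SIZE = 2  # simulated number of GPUs
rng = np.random.default_rng(0)

# A full weight matrix for one layer (4 in-features, 4 out-features).
full_weight = rng.standard_normal((4, 4))

# 1. Shard: each rank keeps only its slice of the flattened parameters,
#    so steady-state memory per rank is 1/WORLD_SIZE of the layer.
shards = np.split(full_weight.reshape(-1), WORLD_SIZE)

# 2. All-gather just before the forward pass: every rank temporarily
#    reconstructs the full weight, uses it, then frees it.
gathered = np.concatenate(shards).reshape(4, 4)

x = rng.standard_normal((3, 4))  # a batch of 3 activations
y = x @ gathered                 # forward pass with the rebuilt weight

assert np.allclose(gathered, full_weight)
print(y.shape)  # (3, 4)
```

The same gather-use-free cycle repeats per layer in the backward pass, which is why peak memory is bounded by one layer's full weights plus everything else sharded.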

Chapters:

00:00 Introduction
01:02 Two machines each with 2 GPUs
01:37 Transformer model blocks
02:02 Forward pass
02:10 Backward pass
02:43 Fully Sharded Data Parallel introduction
02:51 Layer sharding
03:30 Weight concat
05:25 Memory upper bound
05:58 Why more GPUs speed up training
07:23 Shard across nodes (machines)
09:20 Sharding a block across nodes
10:14 Another way of seeing sharding
11:30 Understand interconnect bottleneck
12:00 Hybrid sharding
15:00 Pipeline parallelism
16:04 Forward pass in pipeline parallelism
16:10 Intuition around pipeline parallelism
16:50 Future directions on pipeline parallelism
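The pipeline-parallelism chapters above can be sketched with a toy GPipe-style schedule (illustrative only, no real framework): the batch is split into micro-batches so that a later stage can start working as soon as the first micro-batch arrives, instead of waiting for the whole batch.

```python
# Toy GPipe-style forward schedule. Stage s processes micro-batch m at
# tick s + m, so the forward "fill" takes STAGES + MICRO - 1 ticks
# instead of STAGES * MICRO fully serialized steps.
STAGES = 2  # pipeline stages (e.g. one per GPU)
MICRO = 4   # micro-batches per global batch

schedule = {}  # tick -> list of (stage, micro-batch) active at that tick
for m in range(MICRO):
    for s in range(STAGES):
        schedule.setdefault(s + m, []).append((s, m))

total_ticks = max(schedule) + 1
print(total_ticks)  # 5 = STAGES + MICRO - 1
for tick in sorted(schedule):
    print(tick, schedule[tick])
```

More micro-batches shrink the idle "bubble" at the start and end of the pipeline, which is the intuition behind the speedup discussed in the video.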
Comments:

i literally searched for a walk through like this for months. big big thanks.

shinagawaintelligencecoltd

Great explanation. Loved the enthusiasm.

deependu__

This is really cool! Please do more of this, even if it's a bit buggy.

Immurement

This is excellent! Very clear explanation with the help of the Vision Pro! I do have one question: FSDP seems pretty similar to model parallelism, so how do they coordinate with each other when we enable both at the same time?

qiansun

Is sharding usually done at the layer level? Meaning, will the distributed GPUs all have all the layers but only a piece of each layer? Or is it done like a pipeline, where one GPU feeds the output of a layer to another one?

juancolmenares
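The question above gets at exactly the distinction between the two techniques in the video. A toy sketch of the two parameter layouts (illustrative lists only, no real framework) may help: under FSDP every GPU holds a shard of every layer, while under pipeline parallelism each GPU holds whole layers and forwards activations downstream.

```python
# Toy contrast of the two layouts for 4 layers across 2 GPUs.
# Names like "layer0[shard 0]" are purely illustrative.
LAYERS = ["layer0", "layer1", "layer2", "layer3"]
GPUS = 2

# FSDP: every GPU owns a slice of *every* layer's parameters.
fsdp = {g: [f"{l}[shard {g}]" for l in LAYERS] for g in range(GPUS)}

# Pipeline: each GPU owns *whole* consecutive layers.
pipeline = {
    g: LAYERS[g * len(LAYERS) // GPUS:(g + 1) * len(LAYERS) // GPUS]
    for g in range(GPUS)
}

print(fsdp[0])      # ['layer0[shard 0]', 'layer1[shard 0]', 'layer2[shard 0]', 'layer3[shard 0]']
print(pipeline[0])  # ['layer0', 'layer1']
```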

Hey, really awesome video! But can you please bring this to the iPad or a board? Thanks a lot.

codewithyouml