Episode 3: From PyTorch to PyTorch Lightning


This video covers the magic of PyTorch Lightning! We convert the pure PyTorch classification model we created in the previous episode to PyTorch Lightning, which makes the latest AI best practices trivial to adopt. We go over training on single and multiple GPUs, logging and saving models, and much more!

William Falcon is an AI Ph.D. researcher at NYU and the creator and founder of PyTorch Lightning.

Chapters:
00:00 Introduction to PyTorch Lightning
00:38 Install PyTorch Lightning
01:03 5 main components of a Lightning Module
01:47 Defining a model
04:05 Optimizer
05:20 The Training Loop
07:26 Loading and preparing data
09:10 Running training experiments
16:04 Training on a GPU
17:35 Logging and saving models
23:02 Validation loop
31:48 Multi GPU training
Comments

Great tutorial!

One thing I noticed for anyone working through this in 2022: the accuracy won't show up on the progress bar using the method in the tutorial.

To get it to work, you need to remove the progress bar pbar variable from the return statement and instead call "self.log("accuracy", acc, prog_bar=True)" inside the training_step function.

paulmathew
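
For reference, a minimal sketch of what that fix looks like in a post-1.0 training_step; the model, loss, and metric name here are placeholders, not the exact code from the video:

    import pytorch_lightning as pl
    import torch.nn.functional as F

    class LitClassifier(pl.LightningModule):
        # __init__, forward, configure_optimizers as in the episode

        def training_step(self, batch, batch_idx):
            x, y = batch
            logits = self(x)
            loss = F.cross_entropy(logits, y)
            acc = (logits.argmax(dim=1) == y).float().mean()
            # self.log replaces the old {"progress_bar": pbar} return dict
            self.log("accuracy", acc, prog_bar=True)
            return loss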

My vote for the order of feature coolness: #1- trivial multi-GPU, #2- flexible TensorBoard logging (I'm logging a bunch of metrics), #3- accumulate_grad_batches, #4- resume_from_checkpoint, #5- hparams logging in TensorBoard (especially useful when I keep tweaking parameters in the middle of a day-long run, then resume), #6- warmup learning rate with optimizer_step.

johngrabner
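
For anyone mapping that list onto code, a rough sketch of those Trainer flags as they looked in 1.x-era Lightning; flag names have moved around between versions (e.g. resume_from_checkpoint later became trainer.fit(ckpt_path=...)), and the checkpoint path is a placeholder:

    from pytorch_lightning import Trainer

    trainer = Trainer(
        gpus=2,                              # 1: trivial multi-GPU
        accumulate_grad_batches=4,           # 3: simulate a 4x larger batch
        resume_from_checkpoint="last.ckpt",  # 4: pick a day-long run back up
    )
    trainer.fit(model)   # model: the LightningModule from the episode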

@20:15 is my favorite part of the video. Alfredo is so freaking honest at that moment, love it.

adityassrana

This was very helpful for reorganising my pytorch-lightning 0.7/0.8 code into the latest version. Thanks guys, waiting for more.

timothydell

Awesome, it's so easy to implement distributed training across nodes along with custom hooks!! 😉

mayankbhaskar
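
For reference, multi-node distributed training in current Lightning is roughly the sketch below; argument names assume a recent release (older versions used gpus= and accelerator="ddp" instead):

    from pytorch_lightning import Trainer

    trainer = Trainer(
        accelerator="gpu",
        devices=4,       # GPUs per node
        num_nodes=2,
        strategy="ddp",  # one process per GPU across both nodes
    )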

When will you publish the next video? This is amazing

asiffaisal

Both of these guys have something confusing going on with their back wall, amazing.

michelangelo

Thanks for this video. If you can cover callbacks, that will be interesting learning. The progress bar always overwrites the previous metrics; if you can cover printing the metrics for each epoch separately, it will be of great help.

KSK
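
Callbacks aren't covered in this episode, but a minimal sketch of one that prints the logged metrics once per epoch (so the progress bar can't overwrite them) might look like this; hook signatures have shifted slightly between Lightning versions:

    from pytorch_lightning.callbacks import Callback

    class PrintMetricsCallback(Callback):
        def on_train_epoch_end(self, trainer, pl_module):
            # trainer.callback_metrics holds everything passed to self.log
            metrics = {k: float(v) for k, v in trainer.callback_metrics.items()}
            print(f"epoch {trainer.current_epoch}: {metrics}")

    # usage: Trainer(callbacks=[PrintMetricsCallback()])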

The return dict of the training_step [21:00]: unfortunately, the docs don't provide a lot of info about this point.

osamansr
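
For anyone cross-referencing the docs, the dict return shown around 21:00 is the pre-1.0 API; from memory it looked roughly like this before self.log replaced it:

    import torch.nn.functional as F

    # inside the LightningModule; pre-1.0 Lightning routed the "log" and
    # "progress_bar" keys to the logger and the progress bar, respectively
    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        acc = (logits.argmax(dim=1) == y).float().mean()
        return {"loss": loss,
                "log": {"train_loss": loss},
                "progress_bar": {"train_acc": acc}}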

Great job, you both!

Your 'setup' method has a typo; it should be:
"train_data = datasets.MNIST('da ..."
instead of:
"datasets = datasets.MNIST('da ..."

But it gives me an error:
'TypeError: setup() takes 1 positional argument but 2 were given'

ramisketcher
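
That TypeError comes from Lightning calling the hook as setup(stage); a sketch of a signature that accepts it, with "data" as a placeholder root directory since the original path is truncated above:

    from torch.utils.data import random_split
    from torchvision import datasets, transforms

    # inside the LightningModule
    def setup(self, stage=None):
        # Lightning passes a stage argument ("fit", "test", ...), hence
        # the "takes 1 positional argument but 2 were given" error
        train_data = datasets.MNIST("data", train=True, download=True,
                                    transform=transforms.ToTensor())
        self.train_set, self.val_set = random_split(train_data, [55000, 5000])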

Wonderful video, and I will start using it. Will the next episode do a VAE, CycleGAN, and hooks at least? 😃

Feel free to ignore this part if you think it is too much.
I hope we will do world models, pixel-level classifiers, CycleGAN, Transformers, LSTMs, heatmaps, hooks, upsampling, GPT, BERT, music generation, and more, because these are the basics today.
Colab doesn't run the world model (the truck-backing one) in anime. I am not sure we have something that makes Colab run it.
We should do more on self-supervised learning and energy-based models.

jonathansum

Hey guys, this is a great video, and I am really looking forward to simplifying my PyTorch pipeline with some of this code. There are just two issues I am running into:
1. When using acc = accuracy(logits, y), Lightning complains about non-normalized predictions. What would you propose for this specific task? A lot of people just use a softmax layer at the end and add a log-likelihood loss.
2. When I define my train and val dataset split in my train_dataloader function by assigning self.train and self.val, and then just use a DataLoader on self.val in my val_dataloader, I receive an error saying that my object has no attribute val, so I assume the call order is different?

Great introduction apart from these minor things though, keep up the good work.
Cheers, Nico

nicolasmandel
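
On both points, a hedged sketch: softmax the logits before the metric, and move the split into setup(), since the validation sanity check calls val_dataloader() before train_dataloader() ever runs, which would explain the missing attribute. Here accuracy is the functional metric used in the video:

    import torch
    import torch.nn.functional as F
    from torch.utils.data import DataLoader
    from pytorch_lightning.metrics.functional import accuracy  # torchmetrics.functional in newer versions

    # inside the LightningModule
    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        preds = torch.softmax(logits, dim=1)  # normalized, so accuracy() accepts them
        self.log("acc", accuracy(preds, y), prog_bar=True)
        return loss

    def val_dataloader(self):
        # self.val_set should be created in setup(), which runs before
        # any of the dataloader hooks are called
        return DataLoader(self.val_set, batch_size=32)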

Accuracy as demonstrated in the video is deprecated as of now. I think you now have to use `torchmetrics` and `self.log(prog_bar=True)` to obtain the effect demonstrated in the vid. Correct me if I'm wrong?

MateuszModrzejewski
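
That matches my reading of the changelog; a compact sketch with torchmetrics (10 classes as a placeholder for MNIST; recent torchmetrics versions also require the task argument):

    import pytorch_lightning as pl
    import torch.nn.functional as F
    import torchmetrics

    class LitClassifier(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.train_acc = torchmetrics.Accuracy(task="multiclass", num_classes=10)

        def training_step(self, batch, batch_idx):
            x, y = batch
            logits = self(x)          # forward() omitted for brevity
            loss = F.cross_entropy(logits, y)
            self.train_acc(logits, y)  # update the running metric
            self.log("train_acc", self.train_acc, prog_bar=True)
            return loss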

I struggled a bit to get this working on my current setup. Looking through the API, I figured out that setup() also requires you to pass in a stage argument. Might be good to add an overlay or something to the video pointing that out? Really looking forward to trying this out on a multi-GPU setup once I get my cooling situation under control.

xOoOverflw

I got this error:

MisconfigurationException: No `train_dataloader()` method defined. Lightning `Trainer` expects as minimum a `training_step()`, `train_dataloader()` and `configure_optimizers()` to be defined.

Any idea why?

rameshprakash
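
That exception means the Trainer could not find the three required hooks on the model it was given; often the cause is a typo in a hook name or the method living outside the LightningModule. A minimal sketch that satisfies it ("data" and the hyperparameters are placeholders):

    import torch
    from torch import nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms
    import pytorch_lightning as pl

    class MinimalModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(28 * 28, 10)

        def training_step(self, batch, batch_idx):
            x, y = batch
            logits = self.layer(x.view(x.size(0), -1))
            return nn.functional.cross_entropy(logits, y)

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=1e-2)

        def train_dataloader(self):
            ds = datasets.MNIST("data", train=True, download=True,
                                transform=transforms.ToTensor())
            return DataLoader(ds, batch_size=32)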

Hi William, Alfredo, thank you for this introductory tutorial! I just wanted to point out something. I followed along with my own version of the code and noticed that calling the training portion of the data "train" may cause some issues (you instantiate self.train and self.val in the setup hook): the LightningModule invokes self.train() at a certain point, but in your example that had become a Subset :)

ga
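
A tiny illustration of that clash, outside Lightning entirely: assigning any attribute named train shadows nn.Module.train():

    import torch.nn as nn

    module = nn.Linear(4, 2)
    print(module.train)    # <bound method Module.train of Linear(...)>
    module.train = "oops"  # instance attribute now shadows the method
    # module.train()       # would raise TypeError: 'str' object is not callable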

I would like to see more explanations of why certain functions inside the model are chosen and the implications of the numbers chosen for them, i.e. why use Linear vs Conv2d. Also, I don't quite understand the second linear transformation, which goes from 64 to 64; in most tutorials the output is usually greater than the input? Thanks for making these videos. I'm new to machine learning and trying to apply these concepts to unstructured binary data using PyTorch.

xOoOverflw

I wish you guys had finished that "train_loss/val_loss" array setup for plotting later. Love the videos!

ulugbekdjuraev
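
A minimal sketch of one way to finish that idea: keep plain Python lists on the module, append in each step, and plot them after trainer.fit() returns (forward and the rest of the module are omitted):

    import pytorch_lightning as pl
    import torch.nn.functional as F

    class LitClassifier(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.train_losses = []   # one entry per training batch
            self.val_losses = []

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss = F.cross_entropy(self(x), y)
            self.train_losses.append(loss.item())
            return loss

        def validation_step(self, batch, batch_idx):
            x, y = batch
            self.val_losses.append(F.cross_entropy(self(x), y).item())

    # after trainer.fit(model), plot model.train_losses / model.val_losses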

Nice video! Just in case I misunderstood: when using multi-GPU, do I still need to specify the number of GPUs and nodes in the code after specifying them in the SLURM script? Which specification will PL choose when the two are different?

anniezhi
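
My understanding, hedged: the two should match, since Lightning reads the SLURM environment to wire up DDP and the Trainer arguments tell it what to expect. A sketch for a 1.x-era setup (resource counts are placeholders):

    # In the SLURM script:
    #   #SBATCH --nodes=2
    #   #SBATCH --ntasks-per-node=4   # one task per GPU
    #   #SBATCH --gres=gpu:4
    #
    # The Trainer should request the same resources:
    from pytorch_lightning import Trainer

    trainer = Trainer(gpus=4, num_nodes=2, accelerator="ddp")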

You should do more advanced tutorials to really show off the features

rahuldeora