Converting from PyTorch to PyTorch Lightning

In this video, William Falcon refactors a PyTorch VAE into PyTorch Lightning. As the video makes clear, this was an honest attempt at refactoring an unfamiliar repository without prior knowledge of it. Even so, the full conversion took under 45 minutes.

This video is meant to show all the details and issues you might run into while converting a model.

The original VAE is here:

The refactored Lightning VAE is here:

00:00 - Intro
00:55 - Why you need PyTorch Lightning (even though PyTorch is already simple)
01:51 - Advantages of 16-bit precision
02:27 - Tour of the PyTorch Lightning repo
03:28 - Finding the "magic" (i.e., the core training-loop code)
07:47 - training_step
10:34 - train_dataloader
12:09 - configure_optimizers
12:54 - training_step vs forward
14:44 - validation_step
23:55 - dataloaders passed into .fit() vs inside LightningModule
26:38 - how to structure forward
29:26 - validation_epoch_end
30:52 - Using tensorboard (or any other logger)
33:59 - automatic model checkpointing
34:44 - how to add all Trainer args to Argparse automatically
35:56 - single-GPU training
38:22 - multi-GPU training
39:32 - 16-bit precision training
40:41 - summary
Comments

I’ve been writing my own train and validation loops, logging, etc., but your refactoring video touched on so many things we do over and over in every model. Thanks for the awesome library and a very useful video showing its benefits.

katnoria

Great tutorial, looking forward to the next one. I'm currently struggling to convert a complex training loop from StyleGAN2.

bogabrain

Really good, I can definitely see the benefit of using Lightning. I've been refactoring my code as I watch. It's not very often I find a YouTube video that I've rewatched as many times as this one. One thing that I may have missed, or that you may have accidentally edited out or skipped: as part of the refactor of the data loaders, you seem to have switched to passing the args to the constructor via the hparams parameter. At 25:04 you mention that "we're going to generalise it (args) in a second". Then you move on to refactoring "forward", and at 32:18 you're still using args, but when you return from tensorboard at 33:06 you're using hparams.
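For reference, the hparams pattern mentioned here means collecting the parsed argparse arguments into one object and handing that whole object to the module's constructor, instead of reading a loose `args` variable. A stdlib-only sketch of the idea; the argument names and the `Model` class are made up for illustration:

```python
from argparse import ArgumentParser

class Model:
    # sketch: a LightningModule of that era took the whole hparams object
    def __init__(self, hparams):
        self.hparams = hparams
        # hyperparameters come from the one shared object
        self.batch_size = hparams.batch_size

parser = ArgumentParser()
parser.add_argument("--batch_size", type=int, default=32)
parser.add_argument("--learning_rate", type=float, default=1e-3)
args = parser.parse_args([])  # empty list so the sketch runs anywhere

model = Model(args)
```

The payoff is that checkpointing and logging can serialize `self.hparams` in one place, rather than chasing individual constructor arguments.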

davewaterworth

I would love a video just on loggers and a maybe a complex NLP example.

SudarshanSrinivasan

Heard the TensorFlow comment and thought "oh, maybe this video is super old"... nope 😅. I do that kind of debugging in TensorFlow 2 all the time.

JackofSome

The refactored Lightning VAE (latest version) cannot automatically download MNIST even with download=True in train_dataloader; an error is raised from val_dataloader. Is this related to the sanity check? I have to download it myself or move the dataloader outside (as you did at the beginning).

CD-kdem

Great video! Could you please clarify how to get a speedup from multi-GPU? Judging from the tqdm line, the training time per epoch gets larger in the multi-GPU case; compare e.g. 37:54 and 38:50.

renatabbyazov

If you're using "cross_entropy_with_logits", shouldn't you remove the "sigmoid" as well?
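The commenter has a point: in PyTorch, `binary_cross_entropy_with_logits` applies the sigmoid internally, so feeding it already-sigmoided outputs squashes twice and changes the loss. A small check in plain PyTorch, independent of the video's code (the example values are arbitrary):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[-1.2, 0.3, 2.0]])
targets = torch.tensor([[0.0, 1.0, 1.0]])

# the *_with_logits version applies sigmoid internally...
with_logits = F.binary_cross_entropy_with_logits(logits, targets)
manual = F.binary_cross_entropy(torch.sigmoid(logits), targets)

# ...so applying sigmoid before calling it squashes twice and gives a different loss
double_sigmoid = F.binary_cross_entropy_with_logits(torch.sigmoid(logits), targets)
```

Besides correctness, the `_with_logits` form is also the numerically stable one, which is why it exists at all.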

not_a_human_being

I am a fan of PyTorch Lightning. However, I can't find any information about train/eval modes when using batchnorm and dropout. Does PyTorch Lightning handle batchnorm and dropout management automatically? Do I need to call model.train() and model.eval() when I use batchnorm and dropout in PyTorch Lightning?
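For context: Lightning switches the module to eval mode (and disables gradients) around `validation_step` and back to train mode afterwards, so manual `model.train()`/`model.eval()` calls are normally unnecessary. What those modes actually change is plain PyTorch behaviour; dropout makes it easy to see:

```python
import torch
from torch import nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()        # training mode: roughly half the activations are zeroed
train_out = drop(x)

drop.eval()         # eval mode: dropout becomes the identity
eval_out = drop(x)
```

Batchnorm behaves analogously: `train()` updates running statistics per batch, `eval()` freezes and uses them.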

aykutcayir

Please do a video on handling vocab and multi-task learning

thak

I'm a bit busy at the moment, but I'm planning to move my SR project from PyTorch to PL soon. I've been looking at the code and examples, and it seems very straightforward. Just out of curiosity, is there support for multiple losses in the dictionary, or only 'loss'? I test multiple loss functions and usually log them individually to track their behavior. I can change that, but I'd just like to know whether that's an option. Thanks!
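On multiple losses: the key that drives the backward pass is 'loss', but nothing stops you from computing several terms, logging each one individually, and returning their sum as the trainable loss. A sketch of just the loss bookkeeping in plain PyTorch; the term names, the 0.5 weight, and the commented log calls are illustrative:

```python
import torch
import torch.nn.functional as F

def compute_losses(pred, target):
    # several loss terms tracked individually, one combined term for backward
    l1 = F.l1_loss(pred, target)
    mse = F.mse_loss(pred, target)
    total = l1 + 0.5 * mse
    # inside a training_step you would log the parts separately, e.g.
    # self.log("l1", l1); self.log("mse", mse)
    return {"loss": total, "l1": l1, "mse": mse}
```

Only the combined tensor under "loss" is backpropagated; the extra entries exist purely for monitoring.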

Phobos

I'm too lazy to read more right now. One quick question: how well does the multi-GPU support handle batchnorm?

menglin

Hey, just started using Lightning. I like that you don't have to faff about with .to(device). Just a bit scary that it handles all the optimizer stuff.
This tutorial was really helpful. I think if you did a series on different models it would gain a lot of traction, especially if this becomes the de facto standard for NeurIPS.
I had a couple of questions.
Why do you not call loss.item()? I'm hoping Lightning deals with this efficiently. I notice that when I do use .item() it doesn't work.
I was also wondering what your process is with the validation, train, and (test) loaders. Do you just do a split, not shuffle validation, do early stopping with validation, and keep a separate set just for .test() at the end?
Thanks a lot.
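On the .item() question: Lightning needs the loss as a tensor still attached to the autograd graph so it can call backward on it, whereas `.item()` returns a plain Python float with no graph, which is why returning it fails. Plain PyTorch shows the difference (the toy values are arbitrary):

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
loss = (w - 1.0) ** 2   # a tensor attached to the autograd graph

loss.backward()          # works: gradients flow back to w
assert w.grad is not None

detached = loss.item()   # a plain Python float: no graph, nothing to backprop
```

So return the tensor from training_step and use `.item()` (or `self.log`) only for reporting.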

jordieclive

Why should the shuffle be turned off in the validation dataloader?

MrErkout

4:10 Do you have a video that covers more complicated models? My model is an extremely complicated multi-file, multi-package model. I hit one error about "device=self.device", which I fixed with the ugly hack of hardcoding "device='cuda:0'" (which only works if you have a CUDA GPU). I tried both to_torchscript() and to_onnx() on my model and both fail.

ThinkTank

How do we pass in multiple GPU indices, i.e. if we want to use specific GPUs like gpus = [0, 2, 3, 8], using only the 1st, 3rd, 4th, and 9th GPU? How do we do that?
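For what it's worth, the Trainer in the Lightning version shown in the video accepted exactly that kind of index list; a config sketch (newer releases renamed the argument, so check your installed version's docs):

```python
# pass explicit device indices rather than a count
trainer = Trainer(gpus=[0, 2, 3, 8])   # 1st, 3rd, 4th and 9th GPU
# equivalently, as a string: Trainer(gpus="0,2,3,8")
```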

imflash

I am unable to run on TPUs in both Kaggle and Colab... in Kaggle it isn't using the TPU.

varunsai

Can I technically skip defining a forward method and just use self.encode and self.decode directly? Someone assist, please!

Darkev

Has the return value of the "training_step" method changed? I see in the documentation that you now return just the loss, not a dictionary :)

riccardmenoli

Where do I set the number of epochs, since there's only Trainer and fit()?
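For reference, the epoch count is a Trainer argument (`max_epochs` exists across Lightning versions); a config sketch with an illustrative value:

```python
# epoch count is a Trainer argument, not a loop you write yourself
trainer = Trainer(max_epochs=10)
trainer.fit(model)
```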

faraway