Lesson 18: Deep Learning Foundations to Stable Diffusion

We continue by implementing the OneCycleLR scheduler from PyTorch, which adjusts the learning rate and momentum during training. We also discuss how to improve the architecture of a neural network by making it deeper and wider, introducing ResNets and the concept of residual connections. Finally, we explore various ResNet architectures from the PyTorch Image Models (timm) library and experiment with data augmentation techniques, such as random erasing and test time augmentation.
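
The scheduler piece of this can be tried outside the course notebooks; below is a minimal sketch of driving PyTorch's OneCycleLR directly, which warms the learning rate up and anneals it back down while cycling momentum the opposite way. The model, data, and hyperparameter values are placeholders, not the ones from the lesson.

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import OneCycleLR

# Placeholder model and batch; the lesson uses its own Learner and Fashion-MNIST data.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
xb = torch.randn(64, 1, 28, 28)
yb = torch.randint(0, 10, (64,))

epochs, steps_per_epoch = 5, 100
opt = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# OneCycleLR adjusts both the learning rate and the momentum over the whole run.
sched = OneCycleLR(opt, max_lr=0.4, epochs=epochs, steps_per_epoch=steps_per_epoch)

for epoch in range(epochs):
    for step in range(steps_per_epoch):
        loss = nn.functional.cross_entropy(model(xb), yb)
        loss.backward()
        opt.step()
        opt.zero_grad()
        sched.step()  # stepped once per batch, not once per epoch
```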

0:00:00 - Accelerated SGD done in Excel
0:01:35 - Basic SGD
0:10:56 - Momentum
0:15:37 - RMSProp
0:16:35 - Adam
0:20:11 - Adam with annealing tab
0:23:02 - Learning Rate Annealing in PyTorch
0:26:34 - How PyTorch's optimizers work
0:32:44 - How schedulers work
0:34:32 - Plotting learning rates from a scheduler
0:36:36 - Creating a scheduler callback
0:40:03 - Training with Cosine Annealing
0:42:18 - 1-Cycle learning rate
0:48:26 - HasLearnCB - passing learn as parameter
0:51:01 - Changes from last week, /compare in GitHub
0:52:40 - fastcore’s patch to the Learner with lr_find
0:55:11 - New fit() parameters
0:56:38 - ResNets
1:17:44 - Training the ResNet
1:21:17 - ResNets from timm
1:23:48 - Going wider
1:26:02 - Pooling
1:31:15 - Reducing the number of parameters and megaFLOPS
1:35:34 - Training for longer
1:38:06 - Data Augmentation
1:45:56 - Test Time Augmentation
1:49:22 - Random Erasing
1:55:55 - Random Copying
1:58:52 - Ensembling
2:00:54 - Wrap-up and homework

Many thanks to Francisco Mussari for timestamps and transcription.
Comments

Bam. This lesson is dynamite. So much depth in just one lesson. ❤

mkamp

Around 1:58:00 (random copy): to truly preserve the existing distribution, we could not only copy the patch from a to b, but also copy what was at b before the copy back to a, i.e. swap the two patches.

mkamp
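
A minimal sketch of the swap idea from the comment above, assuming a plain PyTorch batch of images; the function name and patch size are illustrative, not from the lesson's notebook:

```python
import torch

def rand_swap_(x, pct=0.2):
    """In-place: swap two randomly chosen patches in a batch of images (N, C, H, W)."""
    _, _, h, w = x.shape
    ph, pw = int(h * pct), int(w * pct)
    # Top-left corners of the two patches (they may overlap; a fuller version
    # would resample until they are disjoint).
    ya, xa = torch.randint(0, h - ph, (1,)).item(), torch.randint(0, w - pw, (1,)).item()
    yb, xb = torch.randint(0, h - ph, (1,)).item(), torch.randint(0, w - pw, (1,)).item()
    a = x[:, :, ya:ya + ph, xa:xa + pw].clone()
    b = x[:, :, yb:yb + ph, xb:xb + pw].clone()
    x[:, :, ya:ya + ph, xa:xa + pw] = b
    x[:, :, yb:yb + ph, xb:xb + pw] = a
    return x

imgs = torch.randn(16, 1, 28, 28)
rand_swap_(imgs)
```

Because the two patches are exchanged rather than overwritten, the multiset of pixel values in each image is unchanged (as long as the patches don't overlap).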

The random replacement doesn't need to use slices/patches; it could "swap" individual pixels, which is even easier to implement.

seanriley
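
A sketch of that per-pixel variant (again illustrative, not from the lesson): pick a random subset of locations and permute their values among themselves, so the image's pixel distribution is preserved exactly.

```python
import torch

def rand_pixel_shuffle_(x, pct=0.1):
    """In-place: shuffle a random subset of pixel locations within each image (N, C, H, W)."""
    n, c, h, w = x.shape
    flat = x.view(n, c, h * w)            # view shares storage with x
    k = int(h * w * pct)
    idx = torch.randperm(h * w)[:k]       # locations to shuffle (same for the whole batch)
    perm = idx[torch.randperm(k)]         # a permutation of those same locations
    flat[:, :, idx] = flat[:, :, perm]    # every value stays inside its image
    return x

imgs = torch.randn(16, 1, 28, 28)
rand_pixel_shuffle_(imgs)
```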

Jeremy's comment about Twitter not existing is quite apt. It's now X.

alexkelly

Around 1:36:00: batchnorm scales the activations, so the activations are scaled both by the layer weights and by batchnorm's gamma. Does regularizing the weights of the linear modules become ineffective if the model learns to increase gamma instead? And it would, because there is only one gamma parameter per module but many weight parameters, so the penalty on gamma has little impact on the loss? Is that what Jeremy explains? And would the same be true for LayerNorm?

mkamp
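
Not an answer from the lesson, but related to the question above: one common way this interaction is handled in practice is to exclude the norm layers' gamma/beta (and biases) from weight decay via optimizer parameter groups. A rough sketch with a placeholder model:

```python
from torch import nn, optim

model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU(),
    nn.Flatten(), nn.Linear(8 * 28 * 28, 10),
)

decay, no_decay = [], []
for module in model.modules():
    for name, p in module.named_parameters(recurse=False):
        # Norm affine params (gamma/beta) and biases are left unregularized;
        # only the conv/linear weight matrices get weight decay.
        if isinstance(module, nn.BatchNorm2d) or name == "bias":
            no_decay.append(p)
        else:
            decay.append(p)

opt = optim.SGD(
    [{"params": decay, "weight_decay": 1e-2},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=0.1, momentum=0.9,
)
```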

Just before you went into copying, I was sitting here thinking you could do a random shuffle to maintain the distribution.

It may not matter, but the distribution still changes when you delete pixels and fill them with copies: after all, there are now more of the ones you copied.

(I should write this on the forums, but for now I'll write it here lest I forget.)

JensNyborg