Building makemore Part 3: Activations & Gradients, BatchNorm

We dive into some of the internals of MLPs with multiple layers and scrutinize the statistics of the forward pass activations, the backward pass gradients, and some of the pitfalls when they are improperly scaled. We also look at the typical diagnostic tools and visualizations you'd want to use to understand the health of your deep network. We learn why training deep neural nets can be fragile and introduce the first modern innovation that made doing so much easier: Batch Normalization. Residual connections and the Adam optimizer remain notable todos for a later video.
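
To make the core idea concrete, here is a minimal sketch (my own, not the video's code) of the training-time batch normalization forward pass for a 2D (batch, features) input; at inference time the batch statistics are replaced by running estimates, as discussed in the video:

```python
import torch

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # normalize each feature over the batch dimension, then scale and shift
    mean = x.mean(0, keepdim=True)              # (1, features) batch mean
    var = x.var(0, keepdim=True)                # (1, features) batch variance
    xhat = (x - mean) / torch.sqrt(var + eps)   # roughly unit gaussian per feature
    return gamma * xhat + beta                  # learnable scale and shift

x = torch.randn(32, 200) * 5 + 3                # badly scaled pre-activations
out = batchnorm_forward(x, torch.ones(200), torch.zeros(200))
print(out.mean().item(), out.std().item())      # roughly 0 and 1
```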

Links:

Useful links:

Exercises:
- E01: I did not get around to seeing what happens when you initialize all weights and biases to zero. Try this and train the neural net. You might think either that 1) the network trains just fine or 2) the network doesn't train at all, but actually it is 3) the network trains but only partially, and achieves a pretty bad final performance. Inspect the gradients and activations to figure out what is happening and why the network is only partially training, and what part is being trained exactly.
- E02: BatchNorm, unlike other normalization layers like LayerNorm/GroupNorm etc. has the big advantage that after training, the batchnorm gamma/beta can be "folded into" the weights of the preceding Linear layers, effectively erasing the need to forward it at test time. Set up a small 3-layer MLP with batchnorms, train the network, then "fold" the batchnorm gamma/beta into the preceding Linear layer's W,b by creating a new W2, b2 and erasing the batch norm. Verify that this gives the same forward pass during inference, i.e. we see that the batchnorm is there just for stabilizing the training, and can be thrown out after training is done! pretty cool. (A rough sketch of the fold follows below.)
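
For reference, a minimal sketch of the fold for a single Linear + BatchNorm pair (my own, not a full solution to the exercise), assuming the layer computes x @ W + b and the batchnorm uses its running statistics at inference:

```python
import torch

def fold_batchnorm(W, b, gamma, beta, running_mean, running_var, eps=1e-5):
    # at inference batchnorm applies a fixed per-feature scale and shift,
    # which can be absorbed into the preceding Linear layer's W and b
    scale = gamma / torch.sqrt(running_var + eps)
    W2 = W * scale                          # (fan_in, fan_out) * (fan_out,) scales each column
    b2 = (b - running_mean) * scale + beta
    return W2, b2

# verify equivalence on random parameters
fan_in, fan_out, eps = 10, 20, 1e-5
W, b = torch.randn(fan_in, fan_out), torch.randn(fan_out)
gamma, beta = torch.randn(fan_out), torch.randn(fan_out)
running_mean, running_var = torch.randn(fan_out), torch.rand(fan_out) + 0.1

x = torch.randn(32, fan_in)
out_bn = gamma * ((x @ W + b) - running_mean) / torch.sqrt(running_var + eps) + beta
W2, b2 = fold_batchnorm(W, b, gamma, beta, running_mean, running_var, eps)
print(torch.allclose(out_bn, x @ W2 + b2, atol=1e-5))  # True
```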

Chapters:
00:00:00 intro
00:01:22 starter code
00:04:19 fixing the initial loss
00:12:59 fixing the saturated tanh
00:27:53 calculating the init scale: “Kaiming init”
00:40:40 batch normalization
01:03:07 batch normalization: summary
01:04:50 real example: resnet50 walkthrough
01:14:10 summary of the lecture
01:18:35 just kidding: part2: PyTorch-ifying the code
01:26:51 viz #1: forward pass activations statistics
01:30:54 viz #2: backward pass gradient statistics
01:32:07 the fully linear case of no non-linearities
01:36:15 viz #3: parameter activation and gradient statistics
01:39:55 viz #4: update:data ratio over time
01:46:04 bringing back batchnorm, looking at the visualizations
01:51:34 summary of the lecture for real this time

Comments

Andrej, as a third year PhD student this video series has given me so much more understanding of the systems I take for granted. You're doing incredible work here!

mileseverett

1:30:10 The 5/3 gain for tanh comes from the average value of tanh^2(x) where x is distributed as a Gaussian, i.e.

integrate (tanh x)^2*exp(-x^2/2)/sqrt(2*pi) from -inf to inf ~= 0.39

The square root of this value is how much the tanh shrinks the standard deviation of the incoming variable: 0.39 ** .5 ~= 0.63 ~= 3/5, so the exact gain is 1/0.63 ~= 1.59 and 5/3 is just an approximation of it.
We then multiply by the gain to keep the output variance at 1.
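
You can check this numerically with a quick Monte Carlo estimate (my own sketch in PyTorch, not from the video):

```python
import torch

torch.manual_seed(42)
x = torch.randn(10_000_000)            # x ~ N(0, 1)
var_out = torch.tanh(x).pow(2).mean()  # E[tanh(x)^2], since E[tanh(x)] = 0
print(var_out.item())                  # ~0.39, the output variance
print(var_out.sqrt().item())           # ~0.63, the factor by which tanh shrinks the std
print((1 / var_out.sqrt()).item())     # ~1.59, which the 5/3 ~= 1.67 gain approximates
```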

leopetrini

This has to be the best hands-on coding tutorial for these small yet super-important deep learning fundamentals online. Absolutely great job!

Erosis

Andrej, you have a wonderful gift for educating others. I’m a self-learner of NNs and it’s a painful process, but you seriously help ease that suffering… much appreciated! Ty

parentx

Every time another Andrej Karpathy video drops, it's like Christmas for me. This video series has helped me develop genuine intuition about how neural networks work. I hope you continue to put these out; it's making a massive impact on making these "black box" technologies accessible to anyone and everyone!

nkhuang

This video totally opened my mind about a subject I've been obsessing over for more than a year. Truly an amazing video; thanking you here feels like the bare minimum I can do.

joaovitormeyer

This series is definitely the clearest presentation of ML concepts I have seen, and the teaching generalizes so well. I'll be using this step-by-step, intuition-building approach for all complicated things from now on. It's nice that the approach also gives me confidence that I can understand things given enough time. I truly appreciate you doing this.

hlinc

Turns out this should be the way to teach machine learning: a combination of theory references and actual coding. Thank you Andrej!

khuongtranhoang

I like that not even the smallest detail is pulled out of thin air; everything is completely explained.

chanep

Minute by minute, this course is giving us master-level knowledge. We're being molded into experts without even attending a world-class university! 🚀

dreamwalker

This video series is exceptional. The clarity and practicality of the tutorials are unmatched by any other course. Thank you for the invaluable help to all practitioners of deep learning!

IchibanKanobee

The quality of these lectures is off the charts. This channel is a gold mine! Andrej, thank you, thank you very much for these lectures.

knmd

To put BatchNorm into perspective: I am going through Geoffrey Hinton's 2012 lecture notes on the bag of tricks for mini-batch gradient descent, from around the time AlexNet was first published. Hinton was saying there was no single best learning method for gradient descent with mini-batches. Well, here it is: BatchNorm. Hinton: "Just think of how much better neural nets will work, once we've got this sorted out". We are living in that future :)

hungrydeal

Thank you Mr. Karpathy. I am in love with your teaching. Given how accomplished and experienced you are, you are still teaching with such grace. I dream about sitting in one of your classes and learning from you, and I hope to meet you one day. May you be blessed with good health. Lots of love and respect.

ragibshahriyear

So many small things to scrutinize, and he points them out one by one, step by step, from problem to solution; it's just amazing. Love your work Andrej. You are amazing.

enchanted_swiftie

You have so much depth in your knowledge,
yet you manage to explain complex concepts with such incredible didactic skill.
This is someone who truly understands his field. Andrej, thank you so much, and even more for the humility with which you do it.
You explain how libraries and languages like Python and PyTorch work and dive into the WHYs of why things happen.
This is absolutely priceless.

vitorzucher

I think your videos are the only ones I've come across that actually explain why you have a validation split: for the developers/data scientists to check and optimise the parameters/distribution. The ability to stop and replay is invaluable for me. Thank you so much for these fantastic videos.

pwdrhrn

This whole series is absolutely amazing. Thank you very much Andrej! Being able to code along with you, improving a system as my own knowledge improves, is fantastic.

project-hq

Thank you, Andrej, amazing content! As a beginner in deep learning and even in programming, I find most materials out there are either pure theory or pure API application, and they rarely go this deep into the details. Your videos cover not just the knowledge of this field, but also so many empirical insights that come from working on actual projects. Just fantastic! Please make more of these lessons!

scottsun

My mind is totally blown by the level of detail I am getting. It feels like an Ivy League-level course, with the content so meticulously covered.

swarajnanda