Optimization for Deep Learning (Momentum, RMSprop, AdaGrad, Adam)

Here we cover six optimization schemes for deep neural networks: stochastic gradient descent (SGD), SGD with momentum, SGD with Nesterov momentum, RMSprop, AdaGrad and Adam.
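The six update rules covered in the video can be sketched in a few lines of NumPy. The toy quadratic loss and all hyperparameter values below are illustrative assumptions, not settings taken from the video — just a minimal sketch of each rule's core update.

```python
# Minimal sketches of the six update rules on a toy ill-conditioned
# quadratic loss f(w) = 0.5 * w^T A w, whose gradient is A @ w.
import numpy as np

A = np.diag([1.0, 10.0])          # ill-conditioned quadratic
grad = lambda w: A @ w            # analytic gradient of f

def sgd(w, lr=0.1, steps=100):
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def momentum(w, lr=0.01, mu=0.9, steps=100):
    v = np.zeros_like(w)
    for _ in range(steps):
        v = mu * v - lr * grad(w)            # velocity accumulates gradients
        w = w + v
    return w

def nesterov(w, lr=0.01, mu=0.9, steps=100):
    v = np.zeros_like(w)
    for _ in range(steps):
        v = mu * v - lr * grad(w + mu * v)   # gradient at the lookahead point
        w = w + v
    return w

def adagrad(w, lr=0.5, eps=1e-8, steps=100):
    s = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        s = s + g * g                        # sum of squared gradients
        w = w - lr * g / (np.sqrt(s) + eps)
    return w

def rmsprop(w, lr=0.05, beta=0.9, eps=1e-8, steps=100):
    s = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        s = beta * s + (1 - beta) * g * g    # exponential moving average
        w = w - lr * g / (np.sqrt(s) + eps)
    return w

def adam(w, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=100):
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g            # first-moment estimate
        v = b2 * v + (1 - b2) * g * g        # second-moment estimate
        m_hat = m / (1 - b1 ** t)            # bias correction
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w0 = np.array([3.0, 3.0])
for name, opt in [("SGD", sgd), ("momentum", momentum), ("Nesterov", nesterov),
                  ("AdaGrad", adagrad), ("RMSprop", rmsprop), ("Adam", adam)]:
    print(name, opt(w0.copy()))
```

Note how AdaGrad's denominator only grows (its step size decays monotonically), while RMSprop and Adam use an exponential moving average that can recover, which is the key difference between them.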

Chapters
---------------
Introduction 00:00
Brief refresher 00:27
Stochastic gradient descent (SGD) 03:16
SGD with momentum 05:01
SGD with Nesterov momentum 07:02
AdaGrad 09:46
RMSprop 12:20
Adam 13:23
SGD vs Adam 15:03
Comments

As a junior AI developer, this was the best tutorial on Adam and other optimizers I've ever seen. Simply explained, but not so simple as to be a useless overview.

Thanks

HojjatMonzavi

The best explanation I've seen till now!

rhugvedchaudhari

Most mind blowing thing in this video was what Cauchy did in 1847.

zhang_han

Good work, man, it's the best explanation I have ever seen. Thank you so much for your work.

EFCK

That's a great video with clear explanations in such a short time. Thanks a lot.

saqibsarwarkhan

Very good explanation!

15:03 Arguably, I would say that it's not the responsibility of the optimization algorithm to ensure good generalization. I feel like it would be more fair to judge optimizers only on their fit of the training data, and leave the responsibility of generalization out of their benchmark. In your example, I think it would be the responsibility of model architecture design to get rid of this sharp minimum (by having dropout, fewer parameters, etc...), rather than the responsibility of Adam not to fall inside of it.

tempetedecafe

Clearly explained indeed! Great video!

dongthinh

Keep doing the awesome work, you deserve more subs.

Justin-zwhx

Fantastic video and graphics. Please find time to make more. Subscribed 👍

markr

Thank you, this is really well put together and presented!

TheTimtimtimtam

Why didn't you explain the (1-\beta_1) term?

MikeSieko

Nesterov is silly. You have the gradient g(w(t)) because the weight w computed the neuron's activation in the forward pass and so contributed to the loss. You don't have the gradient g(w(t)+pV(t)), because at this fictive weight position no inference was run, so you have no information about what the loss contribution at that position would have been. It's PURE NONSENSE. But it only costs a few more calculations without doing much damage, so no one really seems to complain about it.
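For context on the point raised above: a minimal sketch, assuming the reparameterized (Sutskever-style) form that common frameworks use for Nesterov momentum, in which only g(w(t)) — the gradient at the weights actually used in the forward pass — is ever evaluated; the lookahead is folded into the step algebraically rather than requiring a second inference.

```python
# Reparameterized Nesterov momentum: needs only g(w(t)), the gradient
# at the current weights; no evaluation at the shifted point w + mu*v.
import numpy as np

def nesterov_step(w, v, g, lr=0.01, mu=0.9):
    """One update step; g is the gradient evaluated at w itself."""
    v = mu * v + g                # velocity accumulates raw gradients
    w = w - lr * (g + mu * v)     # lookahead folded into the step
    return w, v
```

The hyperparameter values are illustrative assumptions; the point is only that this algebraic form is equivalent to the lookahead view up to a change of variables.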

donmiguel

Really? I didn't know SGD generalized better than Adam.

wishIKnewHowToLove