AdaGrad Explained in Detail with Animations | Optimizers in Deep Learning Part 4

Adaptive Gradient Algorithm (AdaGrad) is an algorithm for gradient-based optimization. The learning rate is adapted component-wise for each parameter by accumulating the squared magnitudes of its past gradients, so frequently updated parameters take smaller steps while rarely updated (sparse) parameters take larger ones.
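
For reference, a minimal NumPy sketch of the per-parameter update AdaGrad performs; the names (params, grads, cache, lr, eps) are illustrative and not taken from the video:

```python
# Minimal AdaGrad step: every parameter keeps its own accumulator of
# squared gradients and divides its step size by the square root of it.
import numpy as np

def adagrad_step(params, grads, cache, lr=0.01, eps=1e-8):
    cache = cache + grads ** 2                        # running sum of squared gradients
    params = params - lr * grads / (np.sqrt(cache) + eps)
    return params, cache

# Two parameters with very different raw gradient magnitudes
params = np.array([1.0, 1.0])
cache = np.zeros_like(params)
grads = np.array([10.0, 0.1])
params, cache = adagrad_step(params, grads, cache)
print(params)   # both moved by roughly lr, despite the 100x gap in raw gradients
```

Because the accumulator only ever grows, the effective learning rate keeps shrinking over time, which is AdaGrad's well-known drawback.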

============================
Do you want to learn from me?
============================

📱 Grow with us:

👍If you find this video helpful, consider giving it a thumbs up and subscribing for more educational videos on data science!

💭Share your thoughts, experiences, or questions in the comments below. I love hearing from you!

⌚Time Stamps⌚

00:00 - Intro
00:15 - Adaptive Gradient Introduction
03:42 - Elongated Bowl Problem
07:22 - Visual Representation
09:42 - How do Optimizers behave?
17:22 - Mathematical Intuition
24:20 - Disadvantage
26:16 - Outro
Comments

Firstly, thanks for your sessions.
The answer to your question: at first, the update to the normal (non-sparse) feature, which is "b" in your case, is more significant than the update to "m". But once b reaches its optimum value, its weight change is near zero, whereas m has not yet reached its optimum. So m keeps trying to reach its optimum and continues to change, even with small weight updates, and thus takes a long time but still converges.

ManbirSinghMago
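
A quick toy run (made-up numbers, plain gradient descent rather than the exact setup from the video) that reproduces the behaviour described in the comment above: with a sparse input feature, the intercept b settles near its optimum within a few dozen steps, while the slope m keeps crawling toward its value long after. This is the kind of lag that AdaGrad's per-parameter learning rates are meant to reduce.

```python
import numpy as np

# Sparse feature: only 5 of 200 samples are non-zero
x = np.zeros(200)
x[::40] = 1.0
y = 3.0 * x + 2.0                  # target line: slope m = 3, intercept b = 2

m, b, lr = 0.0, 0.0, 0.1
for step in range(1001):
    err = (m * x + b) - y
    grad_m = 2 * np.mean(err * x)  # tiny on average, since x is mostly zero
    grad_b = 2 * np.mean(err)      # large until b is roughly right
    m -= lr * grad_m
    b -= lr * grad_b
    if step in (0, 50, 200, 500, 1000):
        # b is already near 2 by step 50; m is still far from 3 and keeps moving
        print(f"step {step:4d}   m = {m:4.2f}   b = {b:4.2f}")
```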

The answer to your question (why the descent is fastest at first along the direction where the input feature is not sparse) can be understood from the graph itself: at first the path descends in the direction of the non-sparse feature until it reaches the lowest point achievable in that direction. From there it has to descend in the direction it has not yet moved, in our case the sparse-feature direction, because only now are those weight updates noticeable, the other direction having already reached its lowest point.

satyamgawade

Brilliantly explained, sir. I do not see such deep concepts explained this well on any other YouTube channel or in textbooks... 🙂

SidIndian

Sir, your attention to detail 🛐🛐. I got a satisfactory reason for every change.

TNSR

3:22 Revising concepts for placements from IIT KGP with your playlist, and suddenly you took the example of IIT. This made me jovial, sir. Thanks for your help!

Shivam_kgp

Thanks for the amazing content, sir... your explanations are so good that we have no chance of forgetting the concept in our entire life 😅

ramsu

17:06: The answer is that at first, since b is the normal (non-sparse) parameter, the algorithm reduces the error mainly through b, eventually reaching the optimal value for b. After reaching b's optimum, it starts working toward the optimal value of the other parameter.

innocentgamer

Sir, thanks a lot for providing such education for free to us Indians. You are helping us so much. May God bless you, sir. ♥️♥️ Much love.

alroygama

A simple explanation for the movement in the m direction that sir mentioned: as the curve is updated, the b value dominates the update process. Finally, as seen in the graph, b reaches almost zero, so the b updates become negligible, making b effectively sparser than m. Now the same thing that happened to m happens to b, but even more aggressively, leaving m as the only parameter still updating.

sheiphanshaijan

17:04 At first b was dominating and the step size was large, but as we start taking smaller steps, the gradient of m, which was already small, starts to be noticed and begins participating in finding the optimal minimum. I think that's why it starts moving in that direction. Please correct me if I'm wrong.

Sara-fpzw

So detailed and nicely explained. Man I logged in just to subscribe !!

SleepeJobs

Nitish sir... Thank you so much for the awesome deep learning videos ♥️ By the way, a request: could you please make a 3-4 hour video on time series at a beginner/entry level? Or even compiling your earlier videos into one would work. I really want to learn time series from you, and my interview is in the third week of August.

raj

Answer: What we are observing here is the relative update of b and m in their respective directions. Assume that both b and m need some delta change from their initial values during the training process. b is able to achieve that delta update in a few initial epochs and attains the right value, but m is a slower learner, so it takes some more epochs to reach that stage.

radhikawadhawan

Sir, you are a really, really great teacher; moreover, your content is premium.

saduddinshaikh-rt

You're Love. Thank you for existing.

arskas

Answer: the value of b reaches the optimal solution, so its derivatives become near zero, which stops the updates along the b axis. Now, relatively, the derivatives of w become large, so training picks up speed along the w axis.

Bhudeep

A student from IIT watching your video, thank you!!! 03:22

Kumar-clk

Thanks, Nitish, for this easy explanation.

narendraparmar

At 7:26, b is on the x axis and m is on the y axis; changing b changes the loss more. It should be the opposite.

shubhamkumar-nwui

17:00 Because the value of b had already reached a point where its gradient became zero. In other words, the optimal value of b was near, so b started updating by smaller amounts and eventually saturated.

AbhinavSharma-yflz