L3 Policy Gradients and Advantage Estimation (Foundations of Deep RL Series)

Lecture 3 of a 6-lecture series on the Foundations of Deep RL
Topic: Policy Gradients and Advantage Estimation
Instructor: Pieter Abbeel

Comments

Really like the lecture, both for the depth it covers and for the clear explanations!

xudong

Awesome lecture. Thanks a lot for providing such great content.

mohamedebbed

Thank you very much, Pieter. The advantage makes sense now, after you described the baseline.
So it's a way to re-center the returns around zero, so we get both negatives and positives. That's a lot better than just increasing all terms and hoping that the actions_softmax and multiple replays will eventually tame those probabilities.

IgorAherne
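
A minimal sketch of that re-centering idea in code, assuming Monte Carlo returns and a simple batch-mean baseline (the names log_probs and returns are illustrative, not from the lecture):

    import torch

    def pg_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
        """Policy-gradient surrogate loss with a batch-mean baseline.

        log_probs: log pi_theta(a_t | s_t) at each sampled step, shape [N]
        returns:   Monte Carlo return from each step onward, shape [N]
        """
        # Subtracting a baseline re-centers the returns around zero, so
        # better-than-average actions are pushed up and worse-than-average
        # actions are pushed down, instead of every term being increased.
        advantages = returns - returns.mean()
        advantages = advantages / (advantages.std() + 1e-8)  # optional rescaling
        # Negative sign because optimizers minimize; the advantages are
        # treated as constants (detach), only log_probs carry the gradient.
        return -(log_probs * advantages.detach()).mean()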

At 29:20, in the Vanilla Policy Gradient algorithm, why do we calculate the advantage estimate before refitting the baseline? Surely we would want the advantage to be calculated after refitting, so that it is based on the most recent samples and not on a fit from a whole rollout set ago?

LoganDunbar
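
For reference, a sketch of one iteration in the order shown on the slide, with illustrative names and a linear least-squares baseline standing in for whatever the lecture uses. One common justification for this order (an editorial reading, not a quote from the video) is that the advantages then use a baseline that was not fit on the current batch, so the baseline cannot partially absorb the very returns it is being subtracted from:

    import numpy as np

    def vpg_iteration(states, log_prob_grads, returns, baseline_w):
        """One vanilla policy gradient iteration, in the slide's order.

        states:         [N, d] state features
        log_prob_grads: [N, p] grad of log pi(a_t|s_t) w.r.t. policy params
        returns:        [N]    Monte Carlo reward-to-go for each step
        baseline_w:     [d]    linear baseline fit on the PREVIOUS batch
        """
        # 1. Advantages use the baseline as it was before this batch.
        advantages = returns - states @ baseline_w
        # 2. Only afterwards refit the baseline to the new returns.
        new_baseline_w, *_ = np.linalg.lstsq(states, returns, rcond=None)
        # 3. Policy gradient estimate built from step 1's advantages.
        policy_grad = (log_prob_grads * advantages[:, None]).mean(axis=0)
        return policy_grad, new_baseline_w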

Thank you for this awesome lecture. Professor, it would be great if you could post videos discussing the latest advances in RL by going through some papers. Your views and perspectives would be valuable for the field.

htetnaing

When Pieter says "This lecture is going to be pretty mathematical", it is going to be pretty epic.

saihemanthiitkgp

Thanks for the nice lecture and the clear explanation. This video made things very intuitive for me.

hongkyulee

Great lecture! It covers all the relevant literature with clear explanations.

binhtruong

37:49 I am lost on how Qhat is estimated. Is it coming from the phi-network, or is it calculated using Monte Carlo? I see the variations in the note, but I just want to be sure.

chandermatrubhutam
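
For the Monte Carlo variant mentioned in the note, a small sketch of a reward-to-go estimate of Qhat for a single trajectory (illustrative, not the lecture's code):

    import numpy as np

    def reward_to_go(rewards: np.ndarray, gamma: float = 0.99) -> np.ndarray:
        """Monte Carlo Qhat: discounted sum of rewards from each step onward."""
        q_hat = np.zeros(len(rewards))
        running = 0.0
        # Scan the trajectory backwards so each step reuses the tail sum.
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            q_hat[t] = running
        return q_hat

    # Example: reward_to_go(np.array([1.0, 0.0, 2.0]), gamma=0.5) -> [1.5, 1.0, 2.0]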

At 19:40 (Slide 27): why does H become H - 1?

KrischerBoy
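
One plausible reading (hedged, since the slide itself isn't quoted here): a horizon-H trajectory contains the actions a_0 through a_{H-1}, so expanding log P(tau; theta) gives sums that run up to H - 1:

    \log P(\tau;\theta)
      = \log\Big[ P(s_0) \prod_{t=0}^{H-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t) \Big]
      = \log P(s_0) + \sum_{t=0}^{H-1} \log \pi_\theta(a_t \mid s_t) + \sum_{t=0}^{H-1} \log P(s_{t+1} \mid s_t, a_t)

Only the middle sum depends on theta, so the gradient keeps just \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t).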

@9:05 Does that really result in weighting by the probability? It seems like it's just multiplying by 1, so if we weren't weighting before, how can we be doing so now?

JohnSmith-hexg
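
The step around there is presumably the likelihood-ratio trick; multiplying and dividing by P(\tau;\theta) is indeed just multiplying by 1, but it is what turns the plain sum over trajectories into an expectation under the policy, which can then be estimated from sampled rollouts:

    \nabla_\theta U(\theta)
      = \nabla_\theta \sum_{\tau} P(\tau;\theta)\, R(\tau)
      = \sum_{\tau} \nabla_\theta P(\tau;\theta)\, R(\tau)
      = \sum_{\tau} P(\tau;\theta)\, \frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}\, R(\tau)
      = \sum_{\tau} P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)\, R(\tau)
      = \mathbb{E}_{\tau \sim P(\cdot;\theta)}\big[\nabla_\theta \log P(\tau;\theta)\, R(\tau)\big]

Nothing changes numerically; what changes is that the trajectory probability now appears as a sampling weight rather than an explicit sum.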

34:20 When I use bootstrapping to train my value net, its outputs almost always explode and don't seem to get smaller the closer the state is to the final state. It just becomes a positive feedback loop that increases the value net's output. Is that a common problem? How can I fix it? Great video btw, love it :D

timkellermann
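
Not from the lecture, but a common way to break that feedback loop when bootstrapping a value net is to hold the bootstrap target fixed during the regression step (detach / stop-gradient, or a slowly updated target copy), use gamma < 1, and zero the bootstrap at terminal states so values near the end of an episode are anchored to real rewards. A minimal PyTorch-style sketch with illustrative names:

    import torch
    import torch.nn.functional as F

    def td_value_loss(values, rewards, next_values, dones, gamma=0.99):
        """TD(0) regression loss for a value net on a batch of transitions.

        values:      V(s_t)  predicted by the net (requires grad), shape [N]
        next_values: V(s_t+1) predicted by the net, shape [N]
        rewards, dones: [N] tensors; dones is 1.0 at terminal transitions
        """
        # detach(): treat the bootstrap target as a constant, otherwise the
        # net chases its own moving predictions, which can blow up.
        # (1 - dones): no bootstrap past a terminal state.
        targets = rewards + gamma * (1.0 - dones) * next_values.detach()
        return F.mse_loss(values, targets)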

The dog meme was glorious; now I'll never forget that it also works for discontinuous functions 😂

fuma

In the baseline approach, can we learn a conditional trajectory probability instead, to limit the sample space of P(tau; theta)?

BruinChang

Thanks for your explanation. I'd like to ask: if the action space is continuous, how can you compute ∂ log(π(a|s))?

ZephyrineFreiberg
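
For a continuous action space the policy is usually a parameterized density, commonly a diagonal Gaussian whose mean and log-std are output by the network, and log pi(a|s) is the log-density evaluated at the sampled action; autodiff then gives its gradient with respect to the parameters. A sketch under that Gaussian assumption (not something stated at this point in the video):

    import torch
    from torch.distributions import Normal

    def gaussian_log_prob(mean, log_std, action):
        """log pi(a|s) for a diagonal Gaussian policy.

        mean, log_std: network outputs for state s, shape [action_dim]
        action:        sampled action, shape [action_dim]
        """
        dist = Normal(mean, log_std.exp())
        # Independent coordinates multiply, so their log-densities add.
        return dist.log_prob(action).sum(-1)

    # Backpropagating through this scalar gives the grad-log-pi term
    # needed by the policy gradient.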

It's like you did all the lectures in a single day and uploaded them in segments. Kudos to you.

sarai

If my action space is [-1, 1], then how can I take the log of the policy? It just gives a math error.

Himanshu-xeek
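
An editorial guess at the math error: the log should be taken of the policy's density at the action (which is always positive), never of the action value itself (which can be negative). For actions bounded in [-1, 1], one common construction is a tanh-squashed Gaussian, sketched below with illustrative names:

    import torch
    from torch.distributions import Normal

    def squashed_gaussian_log_prob(mean, log_std, action, eps=1e-6):
        """log pi(a|s) for a = tanh(u), u ~ Normal(mean, exp(log_std)).

        Actions live in (-1, 1); the log is of the (positive) density,
        so negative action values cause no math error.
        """
        # Invert the squashing to recover the pre-tanh sample u.
        u = torch.atanh(action.clamp(-1 + eps, 1 - eps))
        base = Normal(mean, log_std.exp()).log_prob(u).sum(-1)
        # Change-of-variables correction for the tanh squashing.
        correction = torch.log(1.0 - action.pow(2) + eps).sum(-1)
        return base - correction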

Discounting shrinks rewards and thus reduces variance. But why does function approximation also help?

徐超-yv
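
A worked equation may help here (my summary of the standard n-step advantage argument, not a transcript of the lecture): a learned value function V_phi lets us replace the long, noisy tail of the Monte Carlo return with a single bootstrap term, so fewer random rewards are summed and the variance drops, at the cost of bias wherever V_phi is inaccurate:

    \hat{A}^{(n)}_t
      = \sum_{l=0}^{n-1} \gamma^{l} r_{t+l} \;+\; \gamma^{n} V_\phi(s_{t+n}) \;-\; V_\phi(s_t)

The first sum is the only part that depends on sampled rewards; the shorter it is, the lower the variance of the estimate.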

The sound is not good; I am quite disappointed.

luisortiz