L3 Policy Gradients and Advantage Estimation (Foundations of Deep RL Series)

Lecture 3 of a 6-lecture series on the Foundations of Deep RL
Topic: Policy Gradients and Advantage Estimation
Instructor: Pieter Abbeel

Comments

Really like the lecture, both for the depth it covers and for the clear explanations!

xudong

Awesome lecture. Thanks a lot for providing such great content.

mohamedebbed

Thank you very much, Pieter. The advantage makes sense now, after you described the baseline.
So it's a way to re-center the returns around zero, so we get both negatives and positives. That's a lot better than just increasing all terms and hoping that the actions_softmax and multiple replays will eventually tame those probabilities.

IgorAherne
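
A minimal sketch of that re-centering idea in code, assuming Monte Carlo returns and a simple batch-mean baseline (the names log_probs and returns are illustrative, not from the lecture):

    import torch

    def pg_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
        """Policy-gradient surrogate loss with a batch-mean baseline.

        log_probs: log pi_theta(a_t | s_t) at each sampled step, shape [N]
        returns:   Monte Carlo return from each step onward, shape [N]
        """
        # Subtracting a baseline re-centers the returns around zero, so
        # better-than-average actions are pushed up and worse-than-average
        # actions are pushed down, instead of every term being increased.
        advantages = returns - returns.mean()
        advantages = advantages / (advantages.std() + 1e-8)  # optional rescaling
        # Negative sign because optimizers minimize; the advantages are
        # treated as constants (detach), only log_probs carry the gradient.
        return -(log_probs * advantages.detach()).mean()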

At 29:20, in the Vanilla Policy Gradient algorithm, why do we calculate the advantage estimate before refitting the baseline? Surely we would want the advantage to be calculated after refitting, so that it is based on the most recent samples and not on a fit from a whole rollout set ago?

LoganDunbar
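
For reference, a sketch of one iteration in the order shown on the slide, with illustrative names and a linear least-squares baseline standing in for whatever the lecture uses. One common justification for this order (an editorial reading, not a quote from the video) is that the advantages then use a baseline that was not fit on the current batch, so the baseline cannot partially absorb the very returns it is being subtracted from:

    import numpy as np

    def vpg_iteration(states, log_prob_grads, returns, baseline_w):
        """One vanilla policy gradient iteration, in the slide's order.

        states:         [N, d] state features
        log_prob_grads: [N, p] grad of log pi(a_t|s_t) w.r.t. policy params
        returns:        [N]    Monte Carlo reward-to-go for each step
        baseline_w:     [d]    linear baseline fit on the PREVIOUS batch
        """
        # 1. Advantages use the baseline as it was before this batch.
        advantages = returns - states @ baseline_w
        # 2. Only afterwards refit the baseline to the new returns.
        new_baseline_w, *_ = np.linalg.lstsq(states, returns, rcond=None)
        # 3. Policy gradient estimate built from step 1's advantages.
        policy_grad = (log_prob_grads * advantages[:, None]).mean(axis=0)
        return policy_grad, new_baseline_w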

Thank you for this awesome lecture. Professor, it would be great if you could post videos discussing the latest advances in RL by going through some papers. Your views and perspectives would be valuable for the field.

htetnaing

When Pieter says "This lecture is going to be pretty mathematical", it is going to be pretty epic.

saihemanthiitkgp

Thanks for the nice lecture and the clear explanation. This video made things very intuitive for me.

hongkyulee

Great lecture! It covers all the relevant literature with clear explanations.

binhtruong

37:49 I am lost on how Qhat is estimated. Is it coming from the phi-network, or is it calculated using Monte Carlo? I see the variations in the note, but I just want to be sure.

chandermatrubhutam
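
For the Monte Carlo variant mentioned in the note, a small sketch of a reward-to-go estimate of Qhat for a single trajectory (illustrative, not the lecture's code):

    import numpy as np

    def reward_to_go(rewards: np.ndarray, gamma: float = 0.99) -> np.ndarray:
        """Monte Carlo Qhat: discounted sum of rewards from each step onward."""
        q_hat = np.zeros(len(rewards))
        running = 0.0
        # Scan the trajectory backwards so each step reuses the tail sum.
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            q_hat[t] = running
        return q_hat

    # Example: reward_to_go(np.array([1.0, 0.0, 2.0]), gamma=0.5) -> [1.5, 1.0, 2.0]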

At 19:40 (Slide 27): why does H become H - 1?

KrischerBoy
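
One plausible reading (hedged, since the slide itself isn't quoted here): a horizon-H trajectory contains the actions a_0 through a_{H-1}, so expanding log P(tau; theta) gives sums that run up to H - 1:

    \log P(\tau;\theta)
      = \log\Big[ P(s_0) \prod_{t=0}^{H-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t) \Big]
      = \log P(s_0) + \sum_{t=0}^{H-1} \log \pi_\theta(a_t \mid s_t) + \sum_{t=0}^{H-1} \log P(s_{t+1} \mid s_t, a_t)

Only the middle sum depends on theta, so the gradient keeps just \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t).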

@9:05 Does that really result in weighting by the probability? It seems like it's just multiplying by 1, so if we weren't weighting before, how can we be doing so now?

JohnSmith-hexg
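
The step around there is presumably the likelihood-ratio trick; multiplying and dividing by P(\tau;\theta) is indeed just multiplying by 1, but it is what turns the plain sum over trajectories into an expectation under the policy, which can then be estimated from sampled rollouts:

    \nabla_\theta U(\theta)
      = \nabla_\theta \sum_{\tau} P(\tau;\theta)\, R(\tau)
      = \sum_{\tau} \nabla_\theta P(\tau;\theta)\, R(\tau)
      = \sum_{\tau} P(\tau;\theta)\, \frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}\, R(\tau)
      = \sum_{\tau} P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)\, R(\tau)
      = \mathbb{E}_{\tau \sim P(\cdot;\theta)}\big[\nabla_\theta \log P(\tau;\theta)\, R(\tau)\big]

Nothing changes numerically; what changes is that the trajectory probability now appears as a sampling weight rather than an explicit sum.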

34:20 When I use bootstrapping to train my value net, its outputs almost always explode and don't seem to get smaller the closer the state is to the final state. It just becomes a positive feedback loop that increases the value net's output. Is that a common problem? How can I fix it? Great video btw, love it :D

timkellermann
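
Not from the lecture, but a common way to break that feedback loop when bootstrapping a value net is to hold the bootstrap target fixed during the regression step (detach / stop-gradient, or a slowly updated target copy), use gamma < 1, and zero the bootstrap at terminal states so values near the end of an episode are anchored to real rewards. A minimal PyTorch-style sketch with illustrative names:

    import torch
    import torch.nn.functional as F

    def td_value_loss(values, rewards, next_values, dones, gamma=0.99):
        """TD(0) regression loss for a value net on a batch of transitions.

        values:      V(s_t)  predicted by the net (requires grad), shape [N]
        next_values: V(s_t+1) predicted by the net, shape [N]
        rewards, dones: [N] tensors; dones is 1.0 at terminal transitions
        """
        # detach(): treat the bootstrap target as a constant, otherwise the
        # net chases its own moving predictions, which can blow up.
        # (1 - dones): no bootstrap past a terminal state.
        targets = rewards + gamma * (1.0 - dones) * next_values.detach()
        return F.mse_loss(values, targets)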

The dog meme was glorious; now I'll never forget that it also works for discontinuous functions 😂

fuma

In the baseline approach, can we learn a conditional trajectory probability instead, to limit the sample space of P(tau; theta)?

BruinChang

Thanks for your explanation. I'd like to ask: if the action space is continuous, how can you compute ∂ log(π(a|s))?

ZephyrineFreiberg
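
For a continuous action space the policy is usually a parameterized density, commonly a diagonal Gaussian whose mean and log-std are output by the network, and log pi(a|s) is the log-density evaluated at the sampled action; autodiff then gives its gradient with respect to the parameters. A sketch under that Gaussian assumption (not something stated at this point in the video):

    import torch
    from torch.distributions import Normal

    def gaussian_log_prob(mean, log_std, action):
        """log pi(a|s) for a diagonal Gaussian policy.

        mean, log_std: network outputs for state s, shape [action_dim]
        action:        sampled action, shape [action_dim]
        """
        dist = Normal(mean, log_std.exp())
        # Independent coordinates multiply, so their log-densities add.
        return dist.log_prob(action).sum(-1)

    # Backpropagating through this scalar gives the grad-log-pi term
    # needed by the policy gradient.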

It's like you did all the lectures in a single day and uploaded them in segments. Kudos to you.

sarai

If my action space is [-1, 1], then how can I take the log of the policy? It just gives a math error.

Himanshu-xeek
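
An editorial guess at the math error: the log should be taken of the policy's density at the action (which is always positive), never of the action value itself (which can be negative). For actions bounded in [-1, 1], one common construction is a tanh-squashed Gaussian, sketched below with illustrative names:

    import torch
    from torch.distributions import Normal

    def squashed_gaussian_log_prob(mean, log_std, action, eps=1e-6):
        """log pi(a|s) for a = tanh(u), u ~ Normal(mean, exp(log_std)).

        Actions live in (-1, 1); the log is of the (positive) density,
        so negative action values cause no math error.
        """
        # Invert the squashing to recover the pre-tanh sample u.
        u = torch.atanh(action.clamp(-1 + eps, 1 - eps))
        base = Normal(mean, log_std.exp()).log_prob(u).sum(-1)
        # Change-of-variables correction for the tanh squashing.
        correction = torch.log(1.0 - action.pow(2) + eps).sum(-1)
        return base - correction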

Discounting shrinks rewards and thus reduces variance. But why does function approximation also help?

徐超-yv
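
A worked equation may help here (my summary of the standard n-step advantage argument, not a transcript of the lecture): a learned value function V_phi lets us replace the long, noisy tail of the Monte Carlo return with a single bootstrap term, so fewer random rewards are summed and the variance drops, at the cost of bias wherever V_phi is inaccurate:

    \hat{A}^{(n)}_t
      = \sum_{l=0}^{n-1} \gamma^{l} r_{t+l} \;+\; \gamma^{n} V_\phi(s_{t+n}) \;-\; V_\phi(s_t)

The first sum is the only part that depends on sampled rewards; the shorter it is, the lower the variance of the estimate.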

The sound is not good; I am quite disappointed.

luisortiz