RL Course by David Silver - Lecture 7: Policy Gradient Methods

#Reinforcement Learning Course by David Silver# Lecture 7: Policy Gradient Methods (updated video thanks to: John Assael)

Comments

People who feel like quitting at this stage: relax, take a break, watch this video over and over again, and read Sutton and Barto. Do everything but don't quit. You are amongst the 10% who came this far.

akshatgarg

Oh, I can't concentrate without seeing David.

alexanderyau

This is what I call commitment: David Silver explored a "not showing his face" policy, received less reward, and then switched back to the previous lectures' optimal policy.
Nothing like learning from this "one stream of data called life."

xicocaio

For those confused:
- Whenever he speaks of the u vector, he's talking about the theta vector (the slides don't match).
- At 1:02:00 he's talking about slide 4.
- At 1:16:35 he says Vhat, but the slides show Vv.
- He refers to Q in the Natural Policy Gradient section, which is actually Gtheta in the slides.
- At 1:30:30 the slide should be slide 41 (the last slide), not the Natural Actor-Critic slide.

JonnyHuman

Ahhh... where did you go, David? I loved your moderated gesturing.

saltcheese

This course should be called: "But wait, there's an even better algorithm!"

michaellin

And it turns out that this is still the best course for learning RL, even after 6 years.

krishnanjanareddy

3:24 Introduction
26:39 Finite Difference Policy Gradient
33:38 Monte-Carlo Policy Gradient
52:55 Actor-Critic Policy Gradient

NganVu

1:30 Outline

3:25 Policy-Based Reinforcement Learning
7:40 Value-Based and Policy-Based RL
10:15 Advantages of Policy Based RL
14:10 Example: Rock-Paper-Scissors
16:00 Example: Aliased Gridworld

20:45 Policy Objective Function
23:55 Policy Optimization
26:40 Policy Gradient
28:30 Computing Gradients by Finite Differences
30:30 Training AIBO to Walk by Finite Difference Policy Gradient
33:40 Score Function
36:45 Softmax Policy
39:28 Gaussian Policy
41:30 One-Step MDPs

46:35 Policy Gradient Theorem
48:30 Monte-Carlo Policy Gradient (REINFORCE)
51:05 Puck World Example

53:00 Reducing Variance Using a Critic
56:00 Estimating the Action-Value Function
57:10 Action-Value Actor-Critic
1:05:04 Bias in Actor-Critic Algorithms
1:05:30 Compatible Function Approximation
1:06:00 Proof of Compatible Function Approximation Theorem
1:06:33 Reducing Variance using a Baseline
1:12:05 Estimating the Advantage Function
1:17:00 Critics at Different Time-Scales
1:18:30 Actors at Different Time-Scales
1:21:38 Policy Gradient with Eligibility Traces

1:23:50 Alternative Policy Gradient Directions
1:26:08 Natural Policy Gradient
1:30:05 Natural Actor-Critic
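
To make the 48:30 Monte-Carlo Policy Gradient (REINFORCE) entry in the outline above concrete, here is a minimal sketch in Python, assuming a tabular softmax policy and a toy corridor MDP; the environment, the policy/step helpers, and the hyperparameters are illustrative assumptions, not code from the lecture.

import numpy as np

n_states, n_actions, gamma, alpha = 5, 2, 0.99, 0.05
rng = np.random.default_rng(0)
theta = np.zeros((n_states, n_actions))   # tabular parameters: one preference per (state, action)

def policy(s):
    # pi_theta(. | s): softmax over the action preferences of state s
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

def step(s, a):
    # Toy corridor: action 1 moves right, action 0 moves left; reward 1 at the right end.
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = s_next == n_states - 1
    return s_next, (1.0 if done else 0.0), done

for episode in range(2000):
    s, done, trajectory = 0, False, []
    while not done:                        # sample one episode with the current policy
        a = rng.choice(n_actions, p=policy(s))
        s_next, r, done = step(s, a)
        trajectory.append((s, a, r))
        s = s_next
    G = 0.0
    for s, a, r in reversed(trajectory):   # accumulate the return G_t backwards
        G = r + gamma * G
        grad_log = -policy(s)              # grad of log softmax w.r.t. theta[s] is one_hot(a) - pi
        grad_log[a] += 1.0
        theta[s] += alpha * G * grad_log   # REINFORCE update: theta += alpha * grad log pi * G_t

print(policy(0))                           # should end up strongly preferring action 1 (move right)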

yasseraziz

I have to listen repeatedly because I could not concentrate without seeing him. I have to imagine what he was trying to show through his gestures. This is a gold-standard lecture for RL. Thank you, Professor David Silver.

finarwa

Damn. It was a lot easier understanding it with gestures.

WuuD

It would have been great if it were possible to recreate David in this lecture from his voice using some combination of RL frameworks.

georgegvishiani

Starts at 1:25.
Actor critic at 52:55.

chrisanderson

Unfortunately, the slides do not match what is being said. It's a pity they don't seem to put much effort into these videos. David is surely one of the best people to learn RL from.

MrCmon

I am not sure exactly how this video was created, but the right slide is often not displayed (especially near the end, but elsewhere as well). It is probably better to download the slides for the lecture and find your own way through them while listening to the audio.

liamroche

It is unfortunate that exactly this episode is without David on screen. It is again quite a complex topic, and David jumping and running around and pointing out the relevant parts makes it much easier to digest.

florentinrieger

Just to make sure: at 36:22, the purpose of the likelihood ratio trick is that the gradient of the objective function gets converted back into an expectation? Just as David said at 44:33, "... that's the whole point of using the likelihood ratio trick".
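
For reference, the identity the trick relies on (as on the score-function and one-step-MDP slides) is

\nabla_\theta \pi_\theta(s, a) = \pi_\theta(s, a) \, \nabla_\theta \log \pi_\theta(s, a)

so that for the one-step MDP

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(s, a) \, r \right]

which is again an expectation under the policy and can therefore be estimated by sampling.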

helinw

This lecture was immensely difficult to follow owing to David's absence and the mismatch of the slides.

akarshrastogi

“No matter how ridiculous the odds may seem, within us resides the power to overcome these challenges and achieve something beautiful. That one day we look back at where we started, and be amazed by how far we’ve come.” -Technoblade

I started this series a month ago during summer break. I even did the Easy21 assignment, and now I have finally learned what I wanted when I started this series, i.e. the Actor-Critic method. Time to do some gymnasium envs.

OmPrakash-vtvr

It took me a while to realize that the policy function pi(s, a) is alternately used as the probability of taking a certain action in state s and as the action proper (a notation overload that comes from the Sutton book). I think a specific notation for each instance would avoid a lot of confusion.
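
As far as I recall, the slides pin the notation down as a probability,

\pi_\theta(s, a) = \mathbb{P}\left[ a \mid s, \theta \right]

and the action actually executed at time t is then a separate sample drawn from that distribution, A_t \sim \pi_\theta(S_t, \cdot).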

jorgelarangeira