RL Course by David Silver - Lecture 4: Model-Free Prediction

#Reinforcement Learning Course by David Silver# Lecture 4: Model-Free Prediction

Comments

0:53 Outline
2:10 Introduction

5:06 Monte Carlo Learning
9:20 First Visit MC Policy Evaluation
14:55 Every Visit MC Policy Evaluation
16:23 Blackjack example
26:30 Incremental Mean
29:00 Incremental MC Updates

34:00 Temporal Difference (TD) Learning
35:45 MC vs TD
39:50 Driving Home Example

44:56 Advantages and Disadvantages of MC vs. TD
53:35 Random Walk Example
58:04 Batch MC and TD
58:45 AB Example
1:01:33 Certainty Equivalence

1:03:32 Markov Property - Advantages and Disadvantages of MC vs. TD
1:04:50 Monte Carlo Backup
1:07:45 Temporal Difference Backup
1:08:14 Dynamic Programming Backup
1:09:10 Bootstrapping and Sampling
1:10:50 Unified View of Reinforcement Learning

1:15:50 TD(lambda) and n-Step Prediction
1:17:29 n-Step Return
1:20:22 Large Random Walk Example
1:22:53 Averaging n-Step Return
1:23:55 lambda-return
1:28:52 Forward-view TD(lambda)
1:30:30 Backward view TD(lambda) and Eligibility Trace
1:33:40 TD(lambda) and TD(0)
1:34:40 TD(lambda) and MC
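
For anyone following along with the 9:20 segment of the outline above, here is a minimal Python sketch of first-visit MC policy evaluation; sample_episode is a hypothetical stand-in for whatever environment-plus-policy simulator is being evaluated, and the (state, reward) episode format is an assumption of this sketch.

from collections import defaultdict

def first_visit_mc(sample_episode, num_episodes, gamma=1.0):
    # Estimate V(s) for a fixed policy by averaging the return observed
    # on the first visit to each state in each episode.
    # Assumption of this sketch: sample_episode() returns a list of
    # (state, reward) pairs, where reward is received on leaving that state.
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)
    for _ in range(num_episodes):
        episode = sample_episode()
        # Walk the episode backwards to compute the return from every step.
        G = 0.0
        returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, reward = episode[t]
            G = reward + gamma * G
            returns[t] = G
        # Only the first occurrence of each state contributes.
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state not in seen:
                seen.add(state)
                returns_sum[state] += returns[t]
                returns_count[state] += 1
                V[state] = returns_sum[state] / returns_count[state]
    return V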

yasseraziz

The questions from the students are of very high quality, and they are one of the many reasons this lecture series is particularly great.

appendix

2:03 Introduction
5:04 Monte-Carlo Learning
33:56 Temporal-Difference Learning
1:23:53 TD(lambda)

NganVu

When you realize every lecture corresponds to a chapter in Sutton and Barto's "Reinforcement Learning: An Introduction"

scienceofart

I love how he relates the form of the incremental mean to the meaning of RL updates.
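
As a small illustration of that point (a sketch, not the lecturer's code): the incremental mean mu_k = mu_{k-1} + (1/k)(x_k - mu_{k-1}) has exactly the "old estimate plus step size times error" shape of the incremental MC update V(S_t) <- V(S_t) + alpha * (G_t - V(S_t)).

def incremental_mean(xs):
    # mu_k = mu_{k-1} + (1/k) * (x_k - mu_{k-1}):
    # new estimate = old estimate + step size * (target - old estimate).
    mu = 0.0
    for k, x in enumerate(xs, start=1):
        mu += (1.0 / k) * (x - mu)
    return mu

def incremental_mc_update(v, G, alpha=0.1):
    # Same shape with a constant step size, as in incremental MC:
    # V(S_t) <- V(S_t) + alpha * (G_t - V(S_t)).
    return v + alpha * (G - v)

assert abs(incremental_mean([1.0, 2.0, 3.0, 4.0]) - 2.5) < 1e-12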

azerotrlz

The lecture 💯.
The questions the students were asking 💯.
My enjoyment of the whole thing 💯.

ikechukwuokerenwogba

38:29 what a great example to explain how TD is different from MC

saminchowdhury

I think the reason looking one step into the future is better than using past predictions is that you can treat the next step as if it were the last one: then it would be the terminating state and the game would be over. We already know the current state didn't end the episode, so only a future state can, and that's why we always look ahead toward the terminating state.
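
To make the "one step into the future" idea concrete, here is a minimal sketch of the two update targets (V is assumed to be something like a defaultdict(float) of value estimates): MC waits for the full return G_t, while TD(0) bootstraps from the current estimate of the next state.

def mc_update(V, state, G, alpha=0.1):
    # MC target: the actual return G_t, known only after the episode ends.
    V[state] += alpha * (G - V[state])

def td0_update(V, state, reward, next_state, gamma=1.0, alpha=0.1, done=False):
    # TD(0) target: one real reward plus the bootstrapped estimate of the
    # next state (zero if the next state is terminal).
    target = reward + (0.0 if done else gamma * V[next_state])
    V[state] += alpha * (target - V[state])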

yuxinzhang

The backup diagrams have made everything much clearer.

achyuthvishwamithra

Love the example demonstrating the difference between TD and MC!!

testxy

haha.. "You don't need to wait til you die to update your value function.. "

joshash

I have to say, David Silver is slightly smarter than me.

SunSon

Thanks for the good lecture. It really helps me a lot.
I have a suggestion for improving it: English subtitles. They would make the lecture more accessible to hearing-impaired viewers and non-English speakers.

nightfall

These lectures are sooo helpful! Thank you very much for posting. They are really good :).

tacobellas

At 1:27:47, David explains why we use the geometric λ weighting by saying it is memoryless, so "you can do TD(λ) for the same cost as TD(0)"... but I don't see how! TD(0) merely looks one step ahead, whereas TD(λ) has to look at all future steps (or, in the backward view, TD(0) merely updates the current state, while TD(λ) updates all previous states).
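
For what it's worth, here is a sketch of the backward view with an accumulating eligibility trace (V and E assumed to be defaultdict(float)); each step computes a single one-step TD error and then only decays and updates the states whose trace is still alive, so nothing ever looks forward, which is the point of the geometric (memoryless) weighting.

from collections import defaultdict

def td_lambda_backward_step(V, E, state, reward, next_state,
                            gamma=1.0, lam=0.9, alpha=0.1, done=False):
    # One online step of backward-view TD(lambda).
    # V: value estimates, E: eligibility traces (both defaultdict(float)).
    # A single one-step TD error for this transition...
    delta = reward + (0.0 if done else gamma * V[next_state]) - V[state]
    # ...is broadcast to every state in proportion to its eligibility.
    E[state] += 1.0                      # accumulating trace
    for s in list(E):
        V[s] += alpha * delta * E[s]
        E[s] *= gamma * lam              # traces decay geometrically
    # With lam = 0 all traces but the current state's vanish immediately,
    # and this reduces to the ordinary TD(0) update.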

ErwinDSouza

Another meaty lecture!
This is pure treasure

billykotsos

I have watched Lecture 4 four times, and this is the clearest one. For non-English speakers, language is really an obstacle to understanding this lecture. Oh, my poor English; I only got a 6.5 in IELTS Listening.

alexanderyau

24:52 I think the professor's answer to this question is a bit misleading. The Sutton & Barto book, where the figure comes from, clearly states that the dealer follows a fixed strategy: stick on any sum of 17 or greater, and hit otherwise.
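
For reference, the two fixed policies in the book's blackjack example are simple enough to write down in a couple of lines; a sketch:

def dealer_policy(dealer_sum):
    # The dealer's fixed rule in the book's example: stick on 17 or more.
    return "STICK" if dealer_sum >= 17 else "HIT"

def player_policy(player_sum):
    # The policy being evaluated in the example: stick only on 20 or 21.
    return "STICK" if player_sum >= 20 else "HIT"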

qingfangliu

43:35 A student asked about the goal and the actions in the driving-home example. I have read the book this example comes from, and here is my take on the question:

The actions come from a policy determined by the person. In this case, the policy is getting home by driving a car along particular roads. The person could use other policies to get home, such as walking or driving along other roads.

The goal of Monte Carlo or Temporal Difference learning is to estimate how good this policy is. Remember that the policy involves driving along particular roads, and the example shows just one sample update of the algorithms. To actually see how good the policy is, he needs to take the same route every day, obtain more data, and keep updating the estimates.
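
As a concrete illustration of that last point, here is a small sketch whose state names loosely mirror the book's figure but whose numbers are made up; TD nudges each prediction toward the next one as the drive unfolds, whereas MC would wait until arriving home and push every earlier prediction toward the time that actually remained.

# Hypothetical leg times (minutes) and current time-to-go predictions.
trajectory = [("leaving office",  5, 30),   # (state, minutes this leg, predicted time to go)
              ("reach car",      15, 35),
              ("exit highway",   10, 15),
              ("home street",     3,  3)]

alpha = 1.0                  # full step, just to show the direction of each update
V = {s: pred for s, _, pred in trajectory}
V["home"] = 0.0              # no time left once home

# TD: after each leg, move that state's prediction toward
# (time this leg took) + (the next state's current prediction).
for i, (s, leg, _) in enumerate(trajectory):
    next_s = trajectory[i + 1][0] if i + 1 < len(trajectory) else "home"
    V[s] += alpha * (leg + V[next_s] - V[s])

# MC would instead wait for the actual outcome (5 + 15 + 10 + 3 = 33 minutes
# from the office) and move every earlier prediction toward the time that
# actually remained from that state.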

danielc

Thank you for these lectures. They are fantastic.

mind-set