RL Course by David Silver - Lecture 4: Model-Free Prediction

#Reinforcement Learning Course by David Silver# Lecture 4: Model-Free Prediction

Comments

0:53 Outline
2:10 Introduction

5:06 Monte Carlo Learning
9:20 First Visit MC Policy Evaluation
14:55 Every Visit MC Policy Evaluation
16:23 Blackjack example
26:30 Incremental Mean
29:00 Incremental MC Updates

34:00 Temporal Difference (TD) Learning
35:45 MC vs TD
39:50 Driving Home Example

44:56 Advantages and Disadvantages of MC vs. TD
53:35 Random Walk Example
58:04 Batch MC and TD
58:45 AB Example
1:01:33 Certainty Equivalence

1:03:32 Markov Property - Advantages and Disadvantages of MC vs. TD
1:04:50 Monte Carlo Backup
1:07:45 Temporal Difference Backup
1:08:14 Dynamic Programming Backup
1:09:10 Bootstrapping and Sampling
1:10:50 Unified View of Reinforcement Learning

1:15:50 TD(lambda) and n-Step Prediction
1:17:29 n-Step Return
1:20:22 Large Random Walk Example
1:22:53 Averaging n-Step Return
1:23:55 lambda-return
1:28:52 Forward-view TD(lambda)
1:30:30 Backward view TD(lambda) and Eligibility Trace
1:33:40 TD(lambda) and TD(0)
1:34:40 TD(lambda) and MC
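
For anyone following along with the 9:20 segment of the outline above, here is a minimal Python sketch of first-visit MC policy evaluation; sample_episode is a hypothetical stand-in for whatever environment-plus-policy simulator is being evaluated, and the (state, reward) episode format is an assumption of this sketch.

from collections import defaultdict

def first_visit_mc(sample_episode, num_episodes, gamma=1.0):
    # Estimate V(s) for a fixed policy by averaging the return observed
    # on the first visit to each state in each episode.
    # Assumption of this sketch: sample_episode() returns a list of
    # (state, reward) pairs, where reward is received on leaving that state.
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)
    for _ in range(num_episodes):
        episode = sample_episode()
        # Walk the episode backwards to compute the return from every step.
        G = 0.0
        returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, reward = episode[t]
            G = reward + gamma * G
            returns[t] = G
        # Only the first occurrence of each state contributes.
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state not in seen:
                seen.add(state)
                returns_sum[state] += returns[t]
                returns_count[state] += 1
                V[state] = returns_sum[state] / returns_count[state]
    return V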

yasseraziz

The questions from the students are of very high quality, and they are one of the many reasons this lecture series is particularly great.

appendix

2:03 Introduction
5:04 Monte-Carlo Learning
33:56 Temporal-Difference Learning
1:23:53 TD(lambda)

NganVu

When you realize every lecture corresponds to a chapter in Sutton and Barto's "Reinforcement Learning: An Introduction"

scienceofart

I love how he relates the form of the incremental mean to the meaning of RL updates.
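
As a small illustration of that point (a sketch, not the lecturer's code): the incremental mean mu_k = mu_{k-1} + (1/k)(x_k - mu_{k-1}) has exactly the "old estimate plus step size times error" shape of the incremental MC update V(S_t) <- V(S_t) + alpha * (G_t - V(S_t)).

def incremental_mean(xs):
    # mu_k = mu_{k-1} + (1/k) * (x_k - mu_{k-1}):
    # new estimate = old estimate + step size * (target - old estimate).
    mu = 0.0
    for k, x in enumerate(xs, start=1):
        mu += (1.0 / k) * (x - mu)
    return mu

def incremental_mc_update(v, G, alpha=0.1):
    # Same shape with a constant step size, as in incremental MC:
    # V(S_t) <- V(S_t) + alpha * (G_t - V(S_t)).
    return v + alpha * (G - v)

assert abs(incremental_mean([1.0, 2.0, 3.0, 4.0]) - 2.5) < 1e-12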

azerotrlz

The lecture 💯.
The questions the students were asking 💯.
My enjoyment of the whole thing 💯.

ikechukwuokerenwogba

38:29 what a great example to explain how TD is different from MC

saminchowdhury

I think the reason looking one step into the future is better than using past predictions is that you can treat the next step as if it were the last one: then it would be the terminating state and the game would be over. We already know the current state didn't end the episode, so only a future state can, and that's why we always look ahead toward the terminating state.
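
To make the "one step into the future" idea concrete, here is a minimal sketch of the two update targets (V is assumed to be something like a defaultdict(float) of value estimates): MC waits for the full return G_t, while TD(0) bootstraps from the current estimate of the next state.

def mc_update(V, state, G, alpha=0.1):
    # MC target: the actual return G_t, known only after the episode ends.
    V[state] += alpha * (G - V[state])

def td0_update(V, state, reward, next_state, gamma=1.0, alpha=0.1, done=False):
    # TD(0) target: one real reward plus the bootstrapped estimate of the
    # next state (zero if the next state is terminal).
    target = reward + (0.0 if done else gamma * V[next_state])
    V[state] += alpha * (target - V[state])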

yuxinzhang

The backup diagrams have made everything much clearer.

achyuthvishwamithra

Love the example demonstrating the difference between TD and MC!!

testxy

haha.. "You don't need to wait til you die to update your value function.. "

joshash

I have to say, David Silver is slightly smarter than me.

SunSon

Thanks for the good lecture. It really helps me a lot.
I have a suggestion for improving it: English subtitles. They would make the lecture more accessible to hearing-impaired viewers and non-English speakers.

nightfall

These lectures are sooo helpful! Thank you very much for posting. They are really good :).

tacobellas

At 1:27:47, David explains why we use the geometric λ weighting by saying it is memoryless, so "you can do TD(λ) for the same cost as TD(0)"... but I don't see how! TD(0) merely looks one step ahead, whereas TD(λ) has to look at all future steps (or, in the backward view, TD(0) merely updates the current state, while TD(λ) updates all previous states).
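
For what it's worth, here is a sketch of the backward view with an accumulating eligibility trace (V and E assumed to be defaultdict(float)); each step computes a single one-step TD error and then only decays and updates the states whose trace is still alive, so nothing ever looks forward, which is the point of the geometric (memoryless) weighting.

from collections import defaultdict

def td_lambda_backward_step(V, E, state, reward, next_state,
                            gamma=1.0, lam=0.9, alpha=0.1, done=False):
    # One online step of backward-view TD(lambda).
    # V: value estimates, E: eligibility traces (both defaultdict(float)).
    # A single one-step TD error for this transition...
    delta = reward + (0.0 if done else gamma * V[next_state]) - V[state]
    # ...is broadcast to every state in proportion to its eligibility.
    E[state] += 1.0                      # accumulating trace
    for s in list(E):
        V[s] += alpha * delta * E[s]
        E[s] *= gamma * lam              # traces decay geometrically
    # With lam = 0 all traces but the current state's vanish immediately,
    # and this reduces to the ordinary TD(0) update.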

ErwinDSouza

Another meaty lecture!
This is pure treasure

billykotsos

I have watched Lecture 4 four times, and this is the clearest one. For non-English speakers, language is really an obstacle to understanding this lecture. Oh, my poor English; I only got a 6.5 in IELTS Listening.

alexanderyau

24:52 I think the professor's answer to this question is a bit misleading. The Sutton & Barto book, where the figure comes from, clearly states that the dealer follows a fixed strategy: stick on any sum of 17 or greater, and hit otherwise.
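
For reference, the two fixed policies in the book's blackjack example are simple enough to write down in a couple of lines; a sketch:

def dealer_policy(dealer_sum):
    # The dealer's fixed rule in the book's example: stick on 17 or more.
    return "STICK" if dealer_sum >= 17 else "HIT"

def player_policy(player_sum):
    # The policy being evaluated in the example: stick only on 20 or 21.
    return "STICK" if player_sum >= 20 else "HIT"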

qingfangliu

43:35 A student asked about the goal and the actions in the driving-home example. I have read the book this example comes from, and here is my take on the question:

The actions come from a policy determined by the person. In this case, the policy is getting home by driving a car along particular roads. The person could use other policies to get home, such as walking or driving along other roads.

The goal of Monte Carlo or Temporal Difference learning is to estimate how good this policy is. Remember that the policy involves driving along particular roads, and the example shows just one sample update of the algorithms. To actually see how good the policy is, he needs to take the same route every day, obtain more data, and keep updating the estimates.
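
As a concrete illustration of that last point, here is a small sketch whose state names loosely mirror the book's figure but whose numbers are made up; TD nudges each prediction toward the next one as the drive unfolds, whereas MC would wait until arriving home and push every earlier prediction toward the time that actually remained.

# Hypothetical leg times (minutes) and current time-to-go predictions.
trajectory = [("leaving office",  5, 30),   # (state, minutes this leg, predicted time to go)
              ("reach car",      15, 35),
              ("exit highway",   10, 15),
              ("home street",     3,  3)]

alpha = 1.0                  # full step, just to show the direction of each update
V = {s: pred for s, _, pred in trajectory}
V["home"] = 0.0              # no time left once home

# TD: after each leg, move that state's prediction toward
# (time this leg took) + (the next state's current prediction).
for i, (s, leg, _) in enumerate(trajectory):
    next_s = trajectory[i + 1][0] if i + 1 < len(trajectory) else "home"
    V[s] += alpha * (leg + V[next_s] - V[s])

# MC would instead wait for the actual outcome (5 + 15 + 10 + 3 = 33 minutes
# from the office) and move every earlier prediction toward the time that
# actually remained from that state.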

danielc

Thank you for these lectures. They are fantastic.

mind-set