DeepMind x UCL RL Lecture Series - MDPs and Dynamic Programming [3/13]

Research Scientist Diana Borsa explains how to solve MDPs with dynamic programming to extract accurate predictions and good control policies.

Comments

Thanks Diana! For everyone else, I got the answers below for the questions at around 43:10.

For pi = a1, I get that S0 has value -5.233 and the other two both have -4.767. We expect the others to be higher, since their first reward is most likely to be 0 and there is the gamma discount.
For pi uniformly random, we get the same answer because the problem is symmetric.
For a myopic gamma, we get -0.9 for S0 and -0.1 for the others.

Let me know if you got something different!

swazza
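For readers who want to reproduce numbers like these: the exact values depend on the lecture's 3-state MDP, which isn't reproduced here, but under a fixed policy the Bellman expectation equation is linear and can be solved directly. A minimal NumPy sketch, with P and r as placeholders for the policy-induced transition matrix and expected immediate rewards (not the actual MDP from the exercise):

import numpy as np

# Exact policy evaluation: solve v = r + gamma * P v, i.e. (I - gamma * P) v = r.
gamma = 0.9                       # discount factor (assumed)
P = np.array([[0.0, 0.5, 0.5],    # P[i, j] = Pr(next state j | state i) under the fixed policy (placeholder)
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
r = np.array([-1.0, -0.5, -0.5])  # expected immediate reward per state under the policy (placeholder)

v = np.linalg.solve(np.eye(3) - gamma * P, r)
print(v)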

About the exercise at 43:00, I get v = [-5.291; -4.710; -4.710] in both cases (conceptually, always taking a1 and the uniform policy create the same situation).
For gamma = 1 I get [-inf; -inf; -inf], which I think makes sense: accumulating these rewards forever, the value of each state tends to -inf. (I used NumPy for the calculations.)

matteorocco

Thank you for providing this lecture.

About 11:20: should Statement (1) not be false, since A_t is not specified on the right-hand side?

OakCMC

Excellent video, thank you for making this available. At 11:27, should this not be false, since A_t is not specified (to be A_t = a) on the RHS?

rogiervdw

Great explanation! It was also very important for me to see the connections between the different topics in the material, to get a much better grasp of them, and this is the first explanation where I actually found that.

davidlearnforus

Thanks! :) That was a very good lecture for me.

WilliamCorrias

Question regarding the slide at 33:45: can you explain why, in the second line of the recursive derivation of v_pi(s), you replace the conditioning on pi with conditioning on A_t ~ pi(S_t)?

atallcosts
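For what it's worth, assuming the slide uses the standard definition of v_pi (this is a reading of that step, not a transcript of the slide), the two conditionings describe the same expectation via the law of total expectation over the action:

v_pi(s) = E_pi[ G_t | S_t = s ] = sum_a pi(a | s) * E[ G_t | S_t = s, A_t = a ]

Writing A_t ~ pi(S_t) under the expectation is just shorthand for that outer average over actions drawn from the policy.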

Unfortunately this lecture is neither a continuation of nor an introduction to the previous ones in this series :(... I can only agree with others about the lower quality of the explanations here... sorry.

marcin.sobocinski

Hi Diana, regarding policy iteration: just wondering if there's any proof that you can't get stuck in a loop. In other words, how do we know that we can't reach a pi for which pi_new (obtained by greedification on the evaluated policy) satisfies pi_new == pi, yet pi != pi*?

swazza
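A sketch of the standard argument (textbook reasoning under gamma < 1, not something specific to this lecture): suppose greedification leaves the policy unchanged, pi_new = greedy(v_pi) = pi. Then for every state s,

v_pi(s) = max_a sum_{s', r} p(s', r | s, a) * [ r + gamma * v_pi(s') ],

so v_pi satisfies the Bellman optimality equation. That equation has a unique solution, so v_pi = v* and pi is optimal. Together with the policy improvement theorem (each greedification step weakly increases the values, and there are finitely many deterministic policies), this rules out getting stuck at a non-optimal fixed point.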

It seems that the definition of average reward (for the return) is different from the one in Sutton & Barto. Is there a link between these two definitions?

DFM-bn

Fantastic lecture! Thank you very much.

intuitivej

Can anyone explain why, at 59:30, the answer is -1.75 but the grid is showing -1.7? I checked the book and it says that all 4 actions should be considered even if the cell is on the edge of the grid (the action would just bounce back to the same state with a -1 reward). When I work it out I also get -1.75, and the book's errata don't list any mistake here. Why isn't the grid showing -1.8, if it is rounded to 2 significant figures (as the book states)?

josefbajada
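If the grid at 59:30 is the 4x4 gridworld from Sutton & Barto's Example 4.1 (an assumption here, since the slide isn't reproduced), a quick sketch of the first two sweeps of iterative policy evaluation does give -1.75 for the cells next to a terminal corner, which suggests the figure's -1.7 is truncated rather than rounded:

import numpy as np

# 4x4 gridworld: terminal states in two opposite corners, reward -1 on every
# transition, equiprobable random policy, gamma = 1; actions that would leave
# the grid bounce back to the same cell.
N = 4
terminal = {(0, 0), (N - 1, N - 1)}
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right

v = np.zeros((N, N))
for sweep in range(2):                        # two synchronous sweeps (k = 2)
    v_new = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if (i, j) in terminal:
                continue
            total = 0.0
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if not (0 <= ni < N and 0 <= nj < N):
                    ni, nj = i, j             # bounce back off the edge
                total += 0.25 * (-1.0 + v[ni, nj])
            v_new[i, j] = total
    v = v_new

print(v)   # cells adjacent to a terminal corner come out as -1.75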

1:22:48 Mistake? I think it should be v* here, not v^pi

vslaykovsky

The quality/content of this lecture is much lower than that of the previous two (by van Hasselt). This one essentially skims through the Sutton & Barto book without providing any additional info/insight.

JumpDiffusion

At 1:06:00, what is the notation here? Is the "2" a "q"? Then what is lambda?

vslaykovsky

How does the robot end up in the 'high' state from the 'low' state when the action is 'search'? A bit puzzled...

nitind
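If this is the recycling-robot example from Sutton & Barto (Example 3.3; an assumption, since the slide isn't reproduced here), the 'low' to 'high' transition under 'search' is the case where the battery runs out and the robot has to be rescued and recharged, which comes with a reward of -3. A sketch of those dynamics, with placeholder parameter values:

# dynamics[(state, action)] = list of (probability, next_state, reward)
alpha, beta = 0.9, 0.6            # placeholder values for the example's parameters
r_search, r_wait = 2.0, 1.0       # placeholder values

dynamics = {
    ('high', 'search'):   [(alpha, 'high', r_search), (1 - alpha, 'low', r_search)],
    ('high', 'wait'):     [(1.0, 'high', r_wait)],
    ('low',  'search'):   [(beta, 'low', r_search), (1 - beta, 'high', -3.0)],  # battery dies: rescued and recharged
    ('low',  'wait'):     [(1.0, 'low', r_wait)],
    ('low',  'recharge'): [(1.0, 'high', 0.0)],
}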

The first two lectures were great. This one, not so much; the quality took a dive here. Seriously, you posted an example and did not solve it! I actually paused the video to solve it, and you just gave a totally new solution approach, and still didn't solve it. I would have appreciated an explanation of the solution rather than a recounting of Bellman's life story.

I have been studying this out of self-motivation, in my free time, for 3 weeks now. I saw the Bellman equations a month ago. This whole time, I couldn't understand how to calculate the value, with this recursive v(s') sitting inside v(s), and the s_(t+1). I thought: this is it, the moment I've waited for. She will explain it now.

Only to get an empty slide... I understand now from the matrix equation what happens, but I still don't know how to do it manually. I can't describe my disappointment.

lordleoo
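For anyone stuck on the same point: you never unroll the recursion by hand. You either solve the linear system from the matrix slide, or start from v = 0 and keep re-applying the Bellman expectation update until the values stop changing. A tiny made-up two-state example (not from the lecture):

import numpy as np

# Made-up 2-state chain: from either state you move to the other one with
# probability 1 and get reward -1; gamma = 0.9.
gamma = 0.9
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
r = np.array([-1.0, -1.0])

# "Manual" way: iterate v <- r + gamma * P v from zero until it converges.
v = np.zeros(2)
for _ in range(200):
    v = r + gamma * P @ v
print(v)                                            # both entries approach -10

# Matrix way: solve (I - gamma * P) v = r in one shot; same answer.
print(np.linalg.solve(np.eye(2) - gamma * P, r))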

At 15:40, how can I calculate r(s, a, s') from P(r, s' | s, a)? I can't see how to construct this function. Isn't it only an expectation that can be obtained from a probability? And even if r(s, a, s') is an expectation, E[R | s, a, s'], how can I get P(r | s, a, s') from P(r, s' | s, a)?

sidang_
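Assuming the standard setup (this is the usual identity, not a transcript of the slide): yes, r(s, a, s') is defined as an expectation, and the conditional reward distribution comes from the joint by normalizing with the marginal over rewards:

r(s, a, s') = E[R | s, a, s'] = sum_r r * P(r | s, a, s'),  where
P(r | s, a, s') = P(r, s' | s, a) / P(s' | s, a)  and  P(s' | s, a) = sum_r P(r, s' | s, a).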

Incorrect quizzes, reading off the slides, mistakes in reading the slides... Is this the DeepMind reality?

primawerefox

Hmm, not that great. A better way to make this clear would be to write down the pseudocode (1 + 3 loops) involved in the updates.

kd
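For readers who want that pseudocode: one way to read the "1 + 3 loops" is an outer convergence loop plus nested loops over states, actions, and (next state, reward) outcomes. A minimal sketch of iterative policy evaluation along those lines (the interfaces for the dynamics p and the policy pi are assumptions, not the lecture's code):

def policy_evaluation(states, actions, p, pi, gamma, theta=1e-6):
    # In-place iterative policy evaluation.
    # p(s, a)  -> iterable of (next_state, reward, probability)   [assumed interface]
    # pi(a, s) -> probability of taking action a in state s       [assumed interface]
    V = {s: 0.0 for s in states}
    while True:                                   # loop 0: sweep until values stop changing
        delta = 0.0
        for s in states:                          # loop 1: over states
            new_v = 0.0
            for a in actions:                     # loop 2: over actions
                for s2, r, prob in p(s, a):       # loop 3: over (next state, reward) outcomes
                    new_v += pi(a, s) * prob * (r + gamma * V[s2])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            return V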