DeepMind x UCL RL Lecture Series - MDPs and Dynamic Programming [3/13]

Research Scientist Diana Borsa explains how to solve MDPs with dynamic programming to extract accurate predictions and good control policies.

Comments

Thanks Diana! For everyone else, I got the answers below for the questions at around 43:10.

For pi = a1, I get that S0 has value -5.233 and the other two both have -4.767. We expect the others to be higher, since their first reward is most likely to be 0 and there is the gamma discount.
For pi uniformly random, we get the same answer because the problem is symmetric.
For a myopic gamma, we get -0.9 for S0 and -0.1 for the others.

Let me know if you got something different!

swazza
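For readers who want to reproduce numbers like these: the exact values depend on the lecture's 3-state MDP, which isn't reproduced here, but under a fixed policy the Bellman expectation equation is linear and can be solved directly. A minimal NumPy sketch, with P and r as placeholders for the policy-induced transition matrix and expected immediate rewards (not the actual MDP from the exercise):

import numpy as np

# Exact policy evaluation: solve v = r + gamma * P v, i.e. (I - gamma * P) v = r.
gamma = 0.9                       # discount factor (assumed)
P = np.array([[0.0, 0.5, 0.5],    # P[i, j] = Pr(next state j | state i) under the fixed policy (placeholder)
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
r = np.array([-1.0, -0.5, -0.5])  # expected immediate reward per state under the policy (placeholder)

v = np.linalg.solve(np.eye(3) - gamma * P, r)
print(v)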

About the exercise at 43:00, I get v = [-5.291; -4.710; -4.710] in both cases (conceptually, always taking a1 and the uniform policy create the same situation).
For gamma = 1 I get [-inf; -inf; -inf], which I think makes sense: accumulating these rewards forever, the value of each state tends to -inf. (I used NumPy for the calculations.)

matteorocco

Thank you for providing this lecture.

About 11:20: should Statement (1) not be false, since A_t is not specified on the right-hand side?

OakCMC

Excellent video, thank you for making this available. At 11:27, should this not be false, since A_t is not specified (to be A_t = a) on the RHS?

rogiervdw

Great explanation! It was also very important for me to see the connections between the different topics in the material, to get a much better grasp of them, and this is the first explanation where I actually found that.

davidlearnforus

Thanks! :) That was a very good lecture for me.

WilliamCorrias

Question regarding the slide at 33:45: can you explain why, in the second line of the recursive derivation of v_pi(s), you replace the conditioning on pi with conditioning on A_t ~ pi(S_t)?

atallcosts
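For what it's worth, assuming the slide uses the standard definition of v_pi (this is a reading of that step, not a transcript of the slide), the two conditionings describe the same expectation via the law of total expectation over the action:

v_pi(s) = E_pi[ G_t | S_t = s ] = sum_a pi(a | s) * E[ G_t | S_t = s, A_t = a ]

Writing A_t ~ pi(S_t) under the expectation is just shorthand for that outer average over actions drawn from the policy.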

Unfortunately this lecture is neither a continuation of nor an introduction to the previous ones in this series :(... I can only agree with others about the lower quality of the explanations here... sorry.

marcin.sobocinski

Hi Diana, regarding policy iteration: just wondering if there's any proof that you can't get stuck in a loop. In other words, how do we know that we can't reach a pi for which pi_new (obtained by greedification on the evaluated policy) satisfies pi_new == pi, yet pi != pi*?

swazza
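A sketch of the standard argument (textbook reasoning under gamma < 1, not something specific to this lecture): suppose greedification leaves the policy unchanged, pi_new = greedy(v_pi) = pi. Then for every state s,

v_pi(s) = max_a sum_{s', r} p(s', r | s, a) * [ r + gamma * v_pi(s') ],

so v_pi satisfies the Bellman optimality equation. That equation has a unique solution, so v_pi = v* and pi is optimal. Together with the policy improvement theorem (each greedification step weakly increases the values, and there are finitely many deterministic policies), this rules out getting stuck at a non-optimal fixed point.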

It seems that the definition of average reward (for the return) is different from the one in Sutton & Barto. Is there a link between these two definitions?

DFM-bn

Fantastic lecture! Thank you very much.

intuitivej

Can anyone explain why, at 59:30, the answer is -1.75 but the grid is showing -1.7? I checked the book and it says that all 4 actions should be considered even if the cell is on the edge of the grid (the action would just bounce back to the same state with a -1 reward). When I work it out I also get -1.75, and the book's errata don't list any mistake here. Why isn't the grid showing -1.8, if it is rounded to 2 significant figures (as the book states)?

josefbajada
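If the grid at 59:30 is the 4x4 gridworld from Sutton & Barto's Example 4.1 (an assumption here, since the slide isn't reproduced), a quick sketch of the first two sweeps of iterative policy evaluation does give -1.75 for the cells next to a terminal corner, which suggests the figure's -1.7 is truncated rather than rounded:

import numpy as np

# 4x4 gridworld: terminal states in two opposite corners, reward -1 on every
# transition, equiprobable random policy, gamma = 1; actions that would leave
# the grid bounce back to the same cell.
N = 4
terminal = {(0, 0), (N - 1, N - 1)}
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right

v = np.zeros((N, N))
for sweep in range(2):                        # two synchronous sweeps (k = 2)
    v_new = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if (i, j) in terminal:
                continue
            total = 0.0
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if not (0 <= ni < N and 0 <= nj < N):
                    ni, nj = i, j             # bounce back off the edge
                total += 0.25 * (-1.0 + v[ni, nj])
            v_new[i, j] = total
    v = v_new

print(v)   # cells adjacent to a terminal corner come out as -1.75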

1:22:48 Mistake? I think it should be v* here, not v^pi

vslaykovsky

The quality/content of this lecture is much lower than that of the previous two (by van Hasselt). This one essentially skims through the Sutton & Barto book without providing any additional info/insight.

JumpDiffusion

At 1:06:00, what is the notation here? Is the "2" a "q"? Then what is lambda?

vslaykovsky

How does the robot end up in the 'high' state from the 'low' state when the action is 'search'? A bit puzzled...

nitind
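If this is the recycling-robot example from Sutton & Barto (Example 3.3; an assumption, since the slide isn't reproduced here), the 'low' to 'high' transition under 'search' is the case where the battery runs out and the robot has to be rescued and recharged, which comes with a reward of -3. A sketch of those dynamics, with placeholder parameter values:

# dynamics[(state, action)] = list of (probability, next_state, reward)
alpha, beta = 0.9, 0.6            # placeholder values for the example's parameters
r_search, r_wait = 2.0, 1.0       # placeholder values

dynamics = {
    ('high', 'search'):   [(alpha, 'high', r_search), (1 - alpha, 'low', r_search)],
    ('high', 'wait'):     [(1.0, 'high', r_wait)],
    ('low',  'search'):   [(beta, 'low', r_search), (1 - beta, 'high', -3.0)],  # battery dies: rescued and recharged
    ('low',  'wait'):     [(1.0, 'low', r_wait)],
    ('low',  'recharge'): [(1.0, 'high', 0.0)],
}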

The first two lectures were great. This one, not so much; the quality took a dive here. Seriously, you posted an example and did not solve it! I actually paused the video to solve it, and you just gave a totally new solution approach, and still didn't solve it. I would have appreciated an explanation of the solution rather than a recounting of Bellman's life story.

I have been studying this out of self-motivation, in my free time, for 3 weeks now. I saw the Bellman equations a month ago. This whole time, I couldn't understand how to calculate the value, with this recursive v(s') sitting inside v(s), and the s_(t+1). I thought: this is it, the moment I've waited for. She will explain it now.

Only to get an empty slide... I understand now from the matrix equation what happens, but I still don't know how to do it manually. I can't describe my disappointment.

lordleoo
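For anyone stuck on the same point: you never unroll the recursion by hand. You either solve the linear system from the matrix slide, or start from v = 0 and keep re-applying the Bellman expectation update until the values stop changing. A tiny made-up two-state example (not from the lecture):

import numpy as np

# Made-up 2-state chain: from either state you move to the other one with
# probability 1 and get reward -1; gamma = 0.9.
gamma = 0.9
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
r = np.array([-1.0, -1.0])

# "Manual" way: iterate v <- r + gamma * P v from zero until it converges.
v = np.zeros(2)
for _ in range(200):
    v = r + gamma * P @ v
print(v)                                            # both entries approach -10

# Matrix way: solve (I - gamma * P) v = r in one shot; same answer.
print(np.linalg.solve(np.eye(2) - gamma * P, r))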

At 15:40, how can I calculate r(s, a, s') from P(r, s' | s, a)? I can't see how to construct this function. Isn't it only an expectation that can be obtained from a probability? And even if r(s, a, s') is an expectation, E[R | s, a, s'], how can I get P(r | s, a, s') from P(r, s' | s, a)?

sidang_
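Assuming the standard setup (this is the usual identity, not a transcript of the slide): yes, r(s, a, s') is defined as an expectation, and the conditional reward distribution comes from the joint by normalizing with the marginal over rewards:

r(s, a, s') = E[R | s, a, s'] = sum_r r * P(r | s, a, s'),  where
P(r | s, a, s') = P(r, s' | s, a) / P(s' | s, a)  and  P(s' | s, a) = sum_r P(r, s' | s, a).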

Incorrect quizzes, reading off the slides, mistakes in reading the slides... Is this the DeepMind reality?

primawerefox

Hmm, not that great. A better way to make this clear would be to write down the pseudocode (1 + 3 loops) involved in the updates.

kd
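For readers who want that pseudocode: one way to read the "1 + 3 loops" is an outer convergence loop plus nested loops over states, actions, and (next state, reward) outcomes. A minimal sketch of iterative policy evaluation along those lines (the interfaces for the dynamics p and the policy pi are assumptions, not the lecture's code):

def policy_evaluation(states, actions, p, pi, gamma, theta=1e-6):
    # In-place iterative policy evaluation.
    # p(s, a)  -> iterable of (next_state, reward, probability)   [assumed interface]
    # pi(a, s) -> probability of taking action a in state s       [assumed interface]
    V = {s: 0.0 for s in states}
    while True:                                   # loop 0: sweep until values stop changing
        delta = 0.0
        for s in states:                          # loop 1: over states
            new_v = 0.0
            for a in actions:                     # loop 2: over actions
                for s2, r, prob in p(s, a):       # loop 3: over (next state, reward) outcomes
                    new_v += pi(a, s) * prob * (r + gamma * V[s2])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            return V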