Function Approximation | Reinforcement Learning Part 5

Here, we learn about Function Approximation, a broad class of methods for learning in state spaces far too large for our previous methods to handle. This is part five of a six-part series on Reinforcement Learning.

SOURCES

[1] R. Sutton and A. Barto. Reinforcement Learning: An Introduction (2nd ed.). MIT Press, 2018.

SOURCE NOTES

This video covers topics from chapters 9, 10, and 11 of [1], with only a light treatment of chapter 11. [2] includes a lecture on Function Approximation, which was a helpful secondary source.

TIMESTAMPS
0:00 Intro
0:25 Large State Spaces and Generalization
1:55 On-Policy Evaluation
4:31 How do we select w?
6:46 How do we choose our target U?
9:27 A Linear Value Function
10:34 1000-State Random Walk
12:51 On-Policy Control with FA
14:26 The Mountain Car Task
19:30 Off-Policy Methods with FA

NOTES

[1] In the Mountain Car Task, I left out a hyperparameter to tune: lambda. It sets the length scale of the evenly spaced proto-points, i.e., how near or far any given evaluation point is considered to be from each of them. If lambda is very high, the prototypical points are effectively treated as very close to every evaluation point, so they do a poor job of discriminating different values across the state space. If lambda is too low, each prototypical point shares information only within a tiny region surrounding it.
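
For concreteness, here is a minimal sketch of how lambda could enter normalized radial basis features. This is an illustration only, under my own naming and the usual Gaussian form, not the exact code used in the video:

import numpy as np

def rbf_features(state, proto_points, lam):
    # state: shape (2,), e.g. (position, velocity) in Mountain Car
    # proto_points: shape (N, 2), the evenly spaced prototypical points
    # lam: length scale -- a large lam makes every proto-point respond to
    #      almost every state (poor discrimination); a tiny lam makes each
    #      proto-point respond only to states right next to it
    sq_dists = np.sum((proto_points - state) ** 2, axis=1)
    activations = np.exp(-sq_dists / (2.0 * lam ** 2))
    return activations / activations.sum()  # normalized so the features sum to 1

def v_hat(state, w, proto_points, lam):
    # linear value estimate: v(s, w) = w . x(s)
    return np.dot(w, rbf_features(state, proto_points, lam))

For example, a 35 x 35 grid of proto-points over (position, velocity) would give 1225 features, with lambda controlling how much neighbouring features overlap.
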
COMMENTS

"Who needs theorems when you've got hopes?" - words to live by.

Tehom

That animation updating the estimates and showing the path the ball -- err "Car" -- took was spectacular. Great work as always!

mCoding

Thanks Duane, loving these videos. They're a big help for my group of undergrads who are interested in getting into RL research!

JetzYT

You're really criminally underrated, should have hundreds of thousands of views at least

mightymonke

Your explanations are brilliant, thanks for making these videos

Lukas-wmdy

Deadly triad reminds me of CAP theorem from databases. "You can only keep two of consistency, availability, and partition tolerance." (Consistency = data is consistent between partitions, availability = data is able to be retrieved, partition tolerance = partitions that aren't turned off being able to uphold consistency and availability, even when another partition becomes unavailable.)
- Function approximation = availability bc we're able to have a model that we can pass inputs to and it generalizes across all input-output mappings,
- Off-policy training = partition tolerance bc the exploration policy and (attempted) optimal policy diverge and they're essentially two partitions that are trying to maintain coherent communication/maintain modularity so they can communicate,
- Bootstrapping = consistency bc we're trying to shortcut what our expected value is but we have to trade off exploring every possible path with sampling to get a good enough but not perfect expected value.

Admittedly I feel like I'm stretching a bit here but I feel like it fits somehow and I just haven't found the exact way yet. It feels like there has to be a foundation for them to be stable on and when all three of the deadly triad are present it's like that spiderman meme where they're pointing to each other for who will be the stable foundation. If none of those three are the foundation, then what is? 🤔 It feels like the only way is to invert flow and try to predict what oneself will do/predict how to scaffold intermediate reward, rather than try to calculate the final answer (and rather than having an algorithm that has an exploration policy based on its ability to predict the final reward rather than intermediate. I may be misunderstanding this though based on what you said about MCTS vs Q-learning(?). I'm not sure if the predicting how to scaffold part is equivalent to Q-learning. I'm still learning sorry haha.). I think that's pretty much what predictive coding in the brain is. Not sure how to break it down correctly into subproblem though so that we can do "while surprise.exists(): build". Maybe one thing is that humans have more punctuated phases of adjusting value and policy in wake and sleep. Wake = learn about environment, REM = learn about one's brain itself. Curious if anyone has any thoughts on the CAP theorem comparison or any of the other stuff.

Thanks so much for the great video(s)! They help me learn a lot and help really get to the essence of the concepts. And are really clear and concise. And are entertaining.

joeystenbeck

Change the title to RL with DJ featuring Lake Moraine. 😂😂😂😂 . The green screen is actually really useful. Once again, grateful for these videos. You are making content that can be binge watched with a notebook 😂😂😂😂

siddharthbisht

Great playlist! It would have been cool to include the time each training took

neithane

amazing series. really appreciate your work!

RANDOMGUY-wzur

These videos are great! I really did not like the formatting of Barto and Sutton (e.g., definitions in the middle of paragraphs), but you've done an awesome job of extracting and presenting the most valuable concepts

rr

Your videos are so fricking good! Thank you for such quality content on YT; many of us appreciate it. I'm sure the channel will blow up in the future!!

glowish

Just a bit of "surfing" on a very broad topic (like just mentioning the "deadly triad" without any hints as to how to deal with it) ;), but the mountain car animation is just wonderful! Thank you for the code! It's always a pleasure to watch such well-prepared videos :D

marcin.sobocinski

Thank you for the great series!
BTW, changing the background to completely dark makes it easier to concentrate on the content

BohdanMushkevych

Looking forward to the part 6 video. Any idea when it will be out?

aptxkok

>tfw irl all data is spread in multiple excel files throughout the company with no structure whatsoever.

iiiiii-wh

Thanks a lot for the great content. May I know when the final video will be released?

letadangkhoa

If I may suggest a future video topic, how about a deep dive into Mercer's theorem and how it applies to support vector machines?

TheElementFive

I am waiting for your policy gradient video to use in my class! Are you going to release it any time soon?👀🙏

imanmossavat

Thank you very much for sharing this amazing content. I have a question: I think the obvious choices for features in the mountain car example are distance and velocity. I don't understand why you (or the book, which used tile coding) chose to use normalized radial basis functions to convert these 2 features into 1225 (35²) features. My understanding of function approximation was that its main goal is to shrink a huge state space into a smaller one. I get the impression that this solution expands the state space.

bonettimauricio

Sir, can you provide the code for these classes? The theory is really great, but I am having trouble with the implementation.
One more playlist, please

noobtopro