Lecture 24: Advantage Actor-Critic. Trust Regions. Proximal Policy Optimization.

Lecture series "Advanced Machine Learning for Physics, Science, and Artificial Scientific Discovery".

Comments

Beginning at 36:23, the average E(b(s) d ln pi) is stated to be 0. From previous lectures I understood that this average is zero when the baseline is constant and the average is taken over the whole trajectory. However, when b depends on s_t, I don't see why it is still zero. Can you explain how this term vanishes in that case?

maxwellsdaemon
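
One standard way to see why the term vanishes (a sketch of the usual argument, not the lecturer's own reply): take the average in two steps, first over the action a_t with the state s_t held fixed, then over s_t. With s_t fixed, b(s_t) is just a constant and can be pulled out of the inner average, which leaves sum_a pi(a|s_t) ∇ ln pi(a|s_t) = ∇ sum_a pi(a|s_t) = ∇ 1 = 0 (here ∇ is the parameter gradient written "d" in the comment). So the inner average is zero for every state, and averaging zero over states gives zero. This only requires that b does not depend on a_t. The check below assumes a softmax policy over three discrete actions; the state names, logits, and baseline values are made up for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(logits, action):
    # Gradient of log softmax(logits)[action] with respect to each logit:
    # d/d logit_k = 1[k == action] - pi_k
    pi = softmax(logits)
    return [(1.0 if k == action else 0.0) - pi[k] for k in range(len(logits))]

def expected_baseline_term(logits, baseline):
    # Exact E_{a ~ pi(.|s)}[ b(s) * grad log pi(a|s) ], summing over all actions.
    pi = softmax(logits)
    total = [0.0] * len(logits)
    for a, p in enumerate(pi):
        g = grad_log_pi(logits, a)
        for k in range(len(logits)):
            total[k] += p * baseline * g[k]
    return total

# Hypothetical states: (policy logits, state-dependent baseline b(s)).
states = {
    "s0": ([0.3, -1.2, 2.0], 5.0),
    "s1": ([1.0, 1.0, -0.5], -3.7),
}

results = {name: expected_baseline_term(logits, b)
           for name, (logits, b) in states.items()}

for name, vec in results.items():
    # Every component is zero up to floating-point rounding.
    print(name, vec)
```

Because the inner sum is zero separately for each state, the result does not depend on how often each state is visited, so the baseline may vary freely with s_t.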