Value Iteration and Policy Iteration - Model Based Reinforcement Learning Method - Machine Learning


Model Based Reinforcement Learning

In a model-based reinforcement learning algorithm, the environment is modelled as a Markov Decision Process (MDP) with the following elements:

* A set of states
* A set of actions available in each state
* Transition probability function: the probability of moving from the current state s_t to the next state s_{t+1} under action a
* Reward function: the reward received on the transition from the current state s_t to the next state s_{t+1} under action a
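
As a rough illustration (not code from the video), such an MDP could be represented in Python as a table of transitions. The two-state environment, its transition probabilities, and its rewards below are made up purely for illustration:

```python
# A toy MDP with two states and two actions, invented for illustration only.
# model[state][action] is a list of (probability, next_state, reward) tuples.
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]

model = {
    "s0": {
        "stay": [(1.0, "s0", 0.0)],
        "move": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    },
    "s1": {
        "stay": [(1.0, "s1", 2.0)],
        "move": [(0.8, "s0", 0.0), (0.2, "s1", 2.0)],
    },
}
```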

There are two common approaches to finding the optimal policy using the recursive relation of the Bellman equation:

1. Value Iteration: In this method, the optimal policy is obtained by iteratively computing the optimal state value function V(s) for each state until it converges. The policy is not computed explicitly during the iterations; instead, the optimal state value function is updated by choosing, for each state, the action that maximizes the Q value.

Algorithm
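
A minimal value iteration sketch in Python, assuming the toy `model`, `STATES`, and `ACTIONS` defined above (the function name and parameters are illustrative, not the exact algorithm shown in the video). It repeatedly applies the Bellman optimality backup V(s) <- max_a sum_{s'} P(s'|s,a) [R(s,a,s') + gamma * V(s')]:

```python
def value_iteration(model, states, actions, gamma=0.9, theta=1e-6):
    """Iteratively apply the Bellman optimality backup until V(s) converges."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Q value of each action under the current value estimates
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in model[s][a])
                 for a in actions]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:   # stop when the largest update is small enough
            break
    # Extract a greedy policy from the converged value function
    policy = {
        s: max(actions,
               key=lambda a: sum(p * (r + gamma * V[s2])
                                 for p, s2, r in model[s][a]))
        for s in states
    }
    return V, policy
```

For example, `V, policy = value_iteration(model, STATES, ACTIONS)` returns the converged values and the greedy policy for the toy MDP.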

2. Policy Iteration: In this method, we start with a baseline policy and improve it iteratively to obtain the optimal policy. Each iteration has two steps:

1. Policy evaluation: in this step we evaluate the value function of the current policy.
2. Policy improvement: in this step the policy is improved by selecting, in each state, the action that maximizes the Q value.

Algorithm
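
A minimal policy iteration sketch in Python, again assuming the toy `model`, `STATES`, and `ACTIONS` from above; it is only an illustration of the evaluate-then-improve loop, not the exact code from the video:

```python
def policy_iteration(model, states, actions, gamma=0.9, theta=1e-6):
    """Alternate policy evaluation and greedy policy improvement."""
    policy = {s: actions[0] for s in states}   # arbitrary baseline policy
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: iterative backup under the fixed current policy
        while True:
            delta = 0.0
            for s in states:
                v = sum(p * (r + gamma * V[s2])
                        for p, s2, r in model[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the evaluated V
        stable = True
        for s in states:
            best = max(actions,
                       key=lambda a: sum(p * (r + gamma * V[s2])
                                         for p, s2, r in model[s][a]))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:   # the policy no longer changes, so it has converged
            return V, policy
```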

Shortcomings of Value Iteration and Policy Iteration Methods

1. These methods are computationally feasible only for small, finite Markov Decision Processes, i.e., a small number of time steps and a small number of states.

2. These methods cannot be used for games or processes where the model of the environment, i.e., the Markov Decision Process, is not known beforehand. In such cases, instead of the model we are given only a simulation of the environment, and the only way to collect information about the environment is by interacting with it.

If the model of the environment is not known, then model-free reinforcement learning techniques can be used.

Monte Carlo Method

This is a model-free reinforcement learning technique. It can be used when we are given only a simulation model of the environment and the only way to collect information about the environment is by interacting with it.

This method works along the lines of the policy iteration method. There are two steps:

1. Policy evaluation: in this step, the estimate of the action value function (Q value) for each state-action pair (s, a) under a given policy is computed by averaging the sampled returns that originate from (s, a) over time. Given sufficient time, this procedure can construct precise estimates of Q(s, a) for all state-action pairs.

2. Policy improvement: the improved policy is obtained by acting greedily with respect to Q; for a given state s, the new policy selects the action that maximizes Q(s, a) obtained in step 1.

Monte Carlo Method Algorithm
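
A minimal Monte Carlo control sketch in Python, assuming the toy `model`, `STATES`, and `ACTIONS` from above. The helper `run_episode`, the exploring-start choice, and the truncation at `max_steps` (the toy MDP is not episodic) are all illustrative assumptions, not the exact algorithm from the video:

```python
import random
from collections import defaultdict

def run_episode(model, policy, start_state, start_action, max_steps=50):
    """Simulate one episode from (start_state, start_action), then follow the
    policy. Returns a list of (state, action, reward) triples."""
    trajectory = []
    s, a = start_state, start_action
    for _ in range(max_steps):
        outcomes = model[s][a]
        prob, s2, r = random.choices(outcomes,
                                     weights=[p for p, _, _ in outcomes])[0]
        trajectory.append((s, a, r))
        s, a = s2, policy[s2]
    return trajectory

def monte_carlo_control(model, states, actions, episodes=5000, gamma=0.9):
    """Monte Carlo control with exploring starts: average sampled returns to
    estimate Q(s, a), then improve the policy greedily with respect to Q."""
    Q = defaultdict(float)
    counts = defaultdict(int)
    policy = {s: random.choice(actions) for s in states}
    for _ in range(episodes):
        # Exploring start: random (state, action) so every pair gets sampled
        s0, a0 = random.choice(states), random.choice(actions)
        trajectory = run_episode(model, policy, s0, a0)
        G = 0.0
        for s, a, r in reversed(trajectory):
            G = r + gamma * G
            # Policy evaluation: incremental average of the sampled returns
            counts[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]
        # Policy improvement: act greedily with respect to the current Q
        for s in states:
            policy[s] = max(actions, key=lambda a: Q[(s, a)])
    return Q, policy
```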

Shortcomings of Monte Carlo Method

The Monte Carlo method has several shortcomings:
1. It is feasible only for games with a small number of states, actions, and steps.
2. Samples are used inefficiently, since a long trajectory improves only one state-action pair.
3. The procedure may spend too much time evaluating suboptimal policies.
4. It works for episodic problems only.
Comments

Hi Dr. Porwal, the video content jumps back to the beginning at the 1:50 mark. Just thought I'd let you (and viewers) know. Thanks for creating this video.

ManojRajagopalan

Thank you for the video. At 7:35, if I'm not mistaken, instead of gamma V*(s') you might have meant gamma V^pi(s').

amyrs

Hello sir, thanks for the video.
I would like to know if you have the MATLAB code for value iteration and policy iteration, because I really need it.

adilkasbaoui

Hi Dr. Porwal, towards the end, you mention that policy iteration is done till the computed policy \pi converges to the optimal policy \pi^*. We do not know the latter and are trying to compute it. My inference is that we need to iterate till there is no change in the computed policy from one iteration to the next and that this could be a local optimum and not necessarily a global optimum. Is this correct? Thanks, once again, for creating this video.

ManojRajagopalan

You are just reading the PPT. How is that teaching?

chrisnolan