Value Iteration and Policy Iteration - Model Based Reinforcement Learning Method - Machine Learning


Model Based Reinforcement Learning

In a model-based reinforcement learning algorithm, the environment is modelled as a Markov Decision Process (MDP) with the following elements:

* A set of states
* A set of actions available in each state
* Transition probability function: the probability of moving from the current state s_t to the next state s_{t+1} under action a
* Reward function: the reward received on the transition from the current state s_t to the next state s_{t+1} under action a
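
As a rough illustration (not code from the video), such an MDP could be represented in Python as a table of transitions. The two-state environment, its transition probabilities, and its rewards below are made up purely for illustration:

```python
# A toy MDP with two states and two actions, invented for illustration only.
# model[state][action] is a list of (probability, next_state, reward) tuples.
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]

model = {
    "s0": {
        "stay": [(1.0, "s0", 0.0)],
        "move": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    },
    "s1": {
        "stay": [(1.0, "s1", 2.0)],
        "move": [(0.8, "s0", 0.0), (0.2, "s1", 2.0)],
    },
}
```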

There are two common approaches to finding the optimal policy using the recursive relation of the Bellman equation:

1. Value Iteration: In this method, the optimal policy is obtained by iteratively computing the optimal state value function V(s) for each state until it converges. The policy is not computed explicitly during the iterations; instead, the optimal state value function is updated by choosing, for each state, the action that maximizes the Q value.

Algorithm
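
A minimal value iteration sketch in Python, assuming the toy `model`, `STATES`, and `ACTIONS` defined above (the function name and parameters are illustrative, not the exact algorithm shown in the video). It repeatedly applies the Bellman optimality backup V(s) <- max_a sum_{s'} P(s'|s,a) [R(s,a,s') + gamma * V(s')]:

```python
def value_iteration(model, states, actions, gamma=0.9, theta=1e-6):
    """Iteratively apply the Bellman optimality backup until V(s) converges."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Q value of each action under the current value estimates
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in model[s][a])
                 for a in actions]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:   # stop when the largest update is small enough
            break
    # Extract a greedy policy from the converged value function
    policy = {
        s: max(actions,
               key=lambda a: sum(p * (r + gamma * V[s2])
                                 for p, s2, r in model[s][a]))
        for s in states
    }
    return V, policy
```

For example, `V, policy = value_iteration(model, STATES, ACTIONS)` returns the converged values and the greedy policy for the toy MDP.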

2. Policy Iteration: In this method, we start with a baseline policy and improve it iteratively to obtain the optimal policy. Each iteration has two steps:

1. Policy evaluation: in this step we evaluate the value function of the current policy.
2. Policy improvement: in this step the policy is improved by selecting, in each state, the action that maximizes the Q value.

Algorithm
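
A minimal policy iteration sketch in Python, again assuming the toy `model`, `STATES`, and `ACTIONS` from above; it is only an illustration of the evaluate-then-improve loop, not the exact code from the video:

```python
def policy_iteration(model, states, actions, gamma=0.9, theta=1e-6):
    """Alternate policy evaluation and greedy policy improvement."""
    policy = {s: actions[0] for s in states}   # arbitrary baseline policy
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: iterative backup under the fixed current policy
        while True:
            delta = 0.0
            for s in states:
                v = sum(p * (r + gamma * V[s2])
                        for p, s2, r in model[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the evaluated V
        stable = True
        for s in states:
            best = max(actions,
                       key=lambda a: sum(p * (r + gamma * V[s2])
                                         for p, s2, r in model[s][a]))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:   # the policy no longer changes, so it has converged
            return V, policy
```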

Shortcomings of Value Iteration and Policy Iteration Methods

1. These methods are computationally feasible only for small, finite Markov Decision Processes, i.e., a small number of time steps and a small number of states.

2. These methods cannot be used for games or processes where the model of the environment, i.e., the Markov Decision Process, is not known beforehand. In such cases, instead of the model we are given only a simulation of the environment, and the only way to collect information about the environment is by interacting with it.

If the model of the environment is not known, then model-free reinforcement learning techniques can be used.

Monte Carlo Method

This is a model-free reinforcement learning technique. It can be used when we are given only a simulation model of the environment and the only way to collect information about the environment is by interacting with it.

This method works along the lines of the policy iteration method. There are two steps:

1. Policy evaluation: in this step, the estimate of the action value function (Q value) for each state-action pair (s, a) under a given policy is computed by averaging the sampled returns that originate from (s, a) over time. Given sufficient time, this procedure can construct precise estimates of Q(s, a) for all state-action pairs.

2. Policy improvement: the improved policy is obtained by acting greedily with respect to Q; for a given state s, the new policy selects the action that maximizes Q(s, a) obtained in step 1.

Monte Carlo Method Algorithm
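
A minimal Monte Carlo control sketch in Python, assuming the toy `model`, `STATES`, and `ACTIONS` from above. The helper `run_episode`, the exploring-start choice, and the truncation at `max_steps` (the toy MDP is not episodic) are all illustrative assumptions, not the exact algorithm from the video:

```python
import random
from collections import defaultdict

def run_episode(model, policy, start_state, start_action, max_steps=50):
    """Simulate one episode from (start_state, start_action), then follow the
    policy. Returns a list of (state, action, reward) triples."""
    trajectory = []
    s, a = start_state, start_action
    for _ in range(max_steps):
        outcomes = model[s][a]
        prob, s2, r = random.choices(outcomes,
                                     weights=[p for p, _, _ in outcomes])[0]
        trajectory.append((s, a, r))
        s, a = s2, policy[s2]
    return trajectory

def monte_carlo_control(model, states, actions, episodes=5000, gamma=0.9):
    """Monte Carlo control with exploring starts: average sampled returns to
    estimate Q(s, a), then improve the policy greedily with respect to Q."""
    Q = defaultdict(float)
    counts = defaultdict(int)
    policy = {s: random.choice(actions) for s in states}
    for _ in range(episodes):
        # Exploring start: random (state, action) so every pair gets sampled
        s0, a0 = random.choice(states), random.choice(actions)
        trajectory = run_episode(model, policy, s0, a0)
        G = 0.0
        for s, a, r in reversed(trajectory):
            G = r + gamma * G
            # Policy evaluation: incremental average of the sampled returns
            counts[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]
        # Policy improvement: act greedily with respect to the current Q
        for s in states:
            policy[s] = max(actions, key=lambda a: Q[(s, a)])
    return Q, policy
```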

Shortcomings of Monte Carlo Method

The Monte Carlo method has several shortcomings:
1. It is feasible only for games with a small number of states, actions, and steps.
2. Samples are used inefficiently, since a long trajectory improves only one state-action pair.
3. The procedure may spend too much time evaluating suboptimal policies.
4. It works for episodic problems only.
Comments

Hi Dr. Porwal, the video content jumps back to the beginning at the 1:50 mark. Just thought I'd let you (and viewers) know. Thanks for creating this video.

ManojRajagopalan

Thank you for the video. At 7:35, if I'm not mistaken, instead of gamma V*(s') you might have meant gamma V^pi(s').

amyrs

Hello sir, thanks for the video.
I would like to know if you have the MATLAB code for value iteration and policy iteration, because I really need it.

adilkasbaoui

Hi Dr. Porwal, towards the end, you mention that policy iteration is done till the computed policy \pi converges to the optimal policy \pi^*. We do not know the latter and are trying to compute it. My inference is that we need to iterate till there is no change in the computed policy from one iteration to the next and that this could be a local optimum and not necessarily a global optimum. Is this correct? Thanks, once again, for creating this video.

ManojRajagopalan

You are just reading the PPT. How is that teaching?

chrisnolan