RLVS 2021 - Day 3 - Regret bounds of model-based reinforcement learning

Speaker: Mengdi Wang
Chair: Sébastien Gerchinovitz

Abstract. We discuss some recent results on model-based methods for online reinforcement learning (RL). The goal of online RL is to adaptively explore an unknown environment and learn to act with provable regret bounds. In particular, we focus on finite-horizon episodic RL where the unknown transition law belongs to a generic family of models. We propose a model-based "value-targeted regression" RL algorithm based on the optimism principle: in each episode, the algorithm constructs the set of models that are "consistent" with the data collected so far. Consistency is measured by the total squared error the model incurs when predicting the values, as determined by the latest value estimate, of the observed next states. The next value function is then chosen by solving an optimistic planning problem over the constructed set of models. We derive a regret bound for an arbitrary family of transition models, using the Eluder dimension proposed by Russo & Van Roy (2014).
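As a sketch of the consistency criterion described in the abstract (the notation below is assumed for illustration, not taken verbatim from the talk): the loss is a least-squares error on value predictions, and the confidence set collects the models that nearly minimize it.

% Assumed notation: \mathcal{P} is the model family, V_j the value
% estimate used in episode j, (s_h^j, a_h^j, s_{h+1}^j) the transition
% observed at step h of episode j, and \beta_k a confidence width.
\[
  L_k(P') \;=\; \sum_{j < k} \sum_{h=1}^{H}
    \Bigl( \mathbb{E}_{s' \sim P'(\cdot \mid s_h^j, a_h^j)}\bigl[V_j(s')\bigr]
           \;-\; V_j\bigl(s_{h+1}^j\bigr) \Bigr)^{2}
\]
\[
  \mathcal{P}_k \;=\; \Bigl\{ P' \in \mathcal{P} \;:\;
    L_k(P') \,\le\, \min_{P'' \in \mathcal{P}} L_k(P'') + \beta_k \Bigr\}
\]

Under this reading, the next value function comes from optimistic planning, V_k = max_{P' in \mathcal{P}_k} max_\pi V^\pi_{P'}, and the Eluder dimension of \mathcal{P} controls how quickly the sets \mathcal{P}_k shrink, which is what drives the regret bound.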

---------------
Reinforcement Learning Virtual School (March and April 2021).

Organized by the Toulouse AI institute ANITI, with the support of IRT Saint Exupéry, ISAE-SUPAERO, LAAS-CNRS, and Université Fédérale Toulouse Midi-Pyrénées.

Officially sponsored by DeepMind.

We also thank Institut de Mathématiques de Toulouse - Université Toulouse III - Paul Sabatier, and Toulouse School of Economics - Université Toulouse I - Capitole.

This virtual event took place in March and April 2021. It gathered 1500 registered participants from 43 different countries.