Regularizing Trajectory Optimization with Denoising Autoencoders (Paper Explained)

Can you plan with a learned model of the world? Yes, but there's a catch: The better your planning algorithm is, the more the errors of your world model will hurt you! This paper solves this problem by regularizing the planning algorithm to stay in high probability regions, given its experience.

Abstract:
Trajectory optimization using a learned model of the environment is one of the core elements of model-based reinforcement learning. This procedure often suffers from exploiting inaccuracies of the learned model. We propose to regularize trajectory optimization by means of a denoising autoencoder that is trained on the same trajectories as the model of the environment. We show that the proposed regularization leads to improved planning with both gradient-based and gradient-free optimizers. We also demonstrate that using regularized trajectory optimization leads to rapid initial learning in a set of popular motor control tasks, which suggests that the proposed approach can be a useful tool for improving sample efficiency.
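
As a rough illustration (not the authors' exact formulation), here is a minimal PyTorch sketch of gradient-based planning with a DAE penalty. It assumes a trained dynamics_model, a DAE dae trained with Gaussian corruption of scale sigma on the same (state, action) data, a differentiable reward_fn, and an initial state tensor s0; all names and hyperparameters are placeholders. The sketch relies on the standard result that a well-trained DAE g approximately satisfies g(x) - x = sigma^2 * grad log p(x).

```python
import torch

def plan_with_dae_regularization(dynamics_model, dae, reward_fn, s0,
                                 action_dim, horizon=20, iters=100,
                                 lr=0.05, eps=1e-2, sigma=0.1):
    """Gradient-based trajectory optimization regularized by a denoising
    autoencoder (DAE) trained on the same (state, action) data as the
    learned dynamics model. The planner maximizes predicted return plus
    eps * log p(trajectory), using the DAE score estimate
    grad log p(x) ~= (dae(x) - x) / sigma**2."""
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)

    for _ in range(iters):
        opt.zero_grad()
        s, ret, xs = s0, 0.0, []
        for a in actions:                      # roll out the learned model
            xs.append(torch.cat([s, a]))       # (state, action) pair
            s = dynamics_model(s, a)           # predicted next state
            ret = ret + reward_fn(s, a)        # predicted reward
        x = torch.stack(xs)                    # (horizon, state_dim + action_dim)

        with torch.no_grad():                  # score estimate from the DAE;
            score = (dae(x) - x) / sigma**2    # no gradient through the DAE itself

        # Surrogate whose gradient w.r.t. x is exactly eps * score, i.e. it
        # pushes the planned trajectory toward high-probability regions.
        reg = eps * (score * x).sum()

        (-(ret + reg)).backward()
        opt.step()

    return actions.detach()
```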

Authors: Rinu Boney, Norman Di Palo, Mathias Berglund, Alexander Ilin, Juho Kannala, Antti Rasmus, Harri Valpola

Links:
Comments

You absolutely smashed this review, nice one Yannic! I know Harri will love it. Can't wait to get our interview with him uploaded to the Street Talk channel.

machinelearningdojo

Seems like this approach would pair quite well with the "Planning to Explore" paper! Great stuff, thanks for sharing

jakevikoren

Hello sir, is there any possibility you could teach me to write a DAE for my project?

qw

Where can I find the link for the interview with Harri Valpola?

remix

I don't see where this would be used over other methods. Are we waiting for a good exploration factor to be discovered which can be added to the equation?
What if we go the opposite route and subtract the confidence factor to encourage exploration?

rishikaushik

Perhaps phasing the regularization out over time would help - the assumption being that the model becomes more accurate and better able to generalise, so the planner should be allowed to explore more.

jeremykothe
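
The simplest version of that would be an annealing schedule on the regularization weight; a hypothetical sketch (eps0, decay, and eps_min are arbitrary, and eps is the regularization weight from the sketch above):

```python
def annealed_eps(episode, eps0=1e-2, decay=0.97, eps_min=1e-4):
    """Hypothetical schedule: shrink the DAE regularization weight as the
    dynamics model sees more data, letting the planner stray further from
    past experience in later episodes."""
    return max(eps0 * decay ** episode, eps_min)
```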

Could an approach like this work for automatic PID tuning for quadcopters? How much processing power does it require?

tiagotiagot

This might be a silly question, but in the case of offline learning (using existing data for a process you want to model), can you still learn a world model (e.g. any DNN that can create a latent representation of it offline) and use it later to sample trajectories? Does this mean this approach has two different DNNs, one for sampling and one for regularizing, or is the DAE the same model? Sorry if this was obvious in the video.

theodorosgalanos

In (5), why do you need the gradient with respect to the action? Do we not just take the action with the highest reward?

pragmascrypt
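
One thing a sketch may clarify: the planner optimizes a whole sequence of continuous actions, so there is no finite set of actions to pick the best from. Equation (5) does that optimization with gradients; the gradient-free alternative the paper also evaluates is roughly the cross-entropy method below (illustrative only; the DAE regularization would be folded into score_fn).

```python
import numpy as np

def cem_plan(score_fn, horizon, action_dim, iters=10, pop_size=500, n_elite=50):
    """Cross-entropy method over action sequences: sample candidates, keep
    the best-scoring ones, refit the sampling distribution, repeat.
    score_fn(actions) should return predicted return plus any
    regularization (e.g. the DAE log-probability term)."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        candidates = mean + std * np.random.randn(pop_size, horizon, action_dim)
        scores = np.array([score_fn(c) for c in candidates])
        elites = candidates[np.argsort(scores)[-n_elite:]]  # top-scoring sequences
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean  # the first action of this sequence would be executed
```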

Couldn't you also use the inverse of the probabilities to find states/trajectories that you haven't visited yet and use them to explore? When this works, it would also not suffer from retrospective novelty as described in the Planning to Explore paper.

timz
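
Mechanically, flipping the regularizer into an exploration bonus would be a small change: reward trajectories the DAE reconstructs poorly instead of penalizing them. A speculative sketch (this is the opposite of what the paper does; dae and sigma are the same placeholders as above):

```python
import torch

def novelty_bonus(dae, x, sigma=0.1):
    """Speculative: reward trajectories the DAE reconstructs poorly, i.e.
    ones unlike anything in the collected data. This is the opposite sign
    of the paper's regularizer, which penalizes such trajectories."""
    with torch.no_grad():
        residual = dae(x) - x                    # large where p(x) is low
    return (residual.norm(dim=-1) ** 2).sum() / sigma**2
```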

Can't you just have the world model output an uncertainty metric that's high in unexplored areas, and have the trajectory optimizer take that uncertainty into account by treating less certain areas as higher cost? Then the trajectory will tend to avoid areas that haven't been explored.

jrkirby
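
That is roughly what ensemble-based model-based RL methods do with model disagreement. A minimal sketch of such an uncertainty penalty, where models is a hypothetical ensemble of learned dynamics models (not part of the paper's proposed regularizer):

```python
import torch

def uncertainty_penalty(models, s, a):
    """Disagreement among an ensemble of learned dynamics models as an
    uncertainty estimate; a planner can add this as extra cost so it
    avoids state-action regions the models were not trained on."""
    preds = torch.stack([m(s, a) for m in models])  # (n_models, state_dim)
    return preds.var(dim=0).sum()                   # high where models disagree
```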

Hmm, I would be surprised if the DAE had the usual AE architecture, bottleneck included, and DAE(xtilde) - xtilde still gave useful results at all. I guess they use something like a "residual" network where the input xtilde is added to the output, so that it basically only has to predict the difference itself.

dermitdembrot
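
For reference, a DAE with that kind of skip connection, where the network only has to predict the denoising correction, is easy to write down. A generic PyTorch sketch, not necessarily the authors' architecture:

```python
import torch
import torch.nn as nn

class ResidualDAE(nn.Module):
    """Denoising autoencoder with a skip connection: the network outputs a
    correction that is added back to the (noisy) input, so it only has to
    learn the denoising residual rather than reconstruct x from scratch."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x_noisy):
        return x_noisy + self.net(x_noisy)       # predict only the difference

def dae_loss(dae, x, sigma=0.1):
    """Standard denoising objective: corrupt x with Gaussian noise and
    reconstruct the clean x."""
    x_noisy = x + sigma * torch.randn_like(x)
    return ((dae(x_noisy) - x) ** 2).mean()
```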