Decision Transformer: Reinforcement Learning via Sequence Modeling (Research Paper Explained)

#decisiontransformer #reinforcementlearning #transformer

Proper credit assignment over long time spans is a fundamental problem in reinforcement learning. Even methods designed to combat this problem, such as TD-learning, quickly reach their limits when rewards are sparse or noisy. This paper reframes offline reinforcement learning as a pure sequence modeling problem, with actions sampled conditioned on the given history and the desired future return. This allows the authors to use recent advances in Transformer-based sequence modeling and achieve competitive results on offline RL benchmarks.
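As a rough illustration of that framing, here is a minimal sketch of how a trajectory can be turned into the (return-to-go, state, action) sequence the model is trained on; the function names and shapes are illustrative, not taken from the paper's code.

```python
# Minimal sketch of the trajectory representation described above: each timestep
# contributes (return-to-go, state, action), and the model is trained to predict
# the action autoregressively. Names and shapes are illustrative only.
from typing import List, Tuple

def returns_to_go(rewards: List[float]) -> List[float]:
    """Undiscounted sum of future rewards from each timestep onward."""
    rtg, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

def to_sequence(states: List, actions: List, rewards: List[float]) -> List[Tuple]:
    """Interleave (return-to-go, state, action) triples for sequence modeling."""
    rtg = returns_to_go(rewards)
    return [(rtg[t], states[t], actions[t]) for t in range(len(states))]

# A 3-step episode with rewards [0, 0, 1] yields returns-to-go [1, 1, 1].
print(to_sequence(["s0", "s1", "s2"], ["a0", "a1", "a2"], [0.0, 0.0, 1.0]))
```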

OUTLINE:
0:00 - Intro & Overview
4:15 - Offline Reinforcement Learning
10:10 - Transformers in RL
14:25 - Value Functions and Temporal Difference Learning
20:25 - Sequence Modeling and Reward-to-go
27:20 - Why this is ideal for offline RL
31:30 - The context length problem
34:35 - Toy example: Shortest path from random walks
41:00 - Discount factors
45:50 - Experimental Results
49:25 - Do you need to know the best possible reward?
52:15 - Key-to-door toy experiment
56:00 - Comments & Conclusion

Abstract:
We present a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling. Unlike prior approaches to RL that fit value functions or compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return. Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.
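To make the conditioning idea concrete, below is a hedged sketch of the evaluation loop the abstract describes: prompt the model with a target return, act, and subtract the rewards actually received. `model.predict_action` and the `env` interface are hypothetical stand-ins, not the authors' API.

```python
# Sketch of the conditioning loop described in the abstract: prompt the model
# with a desired target return, then keep subtracting the rewards actually
# received. `model.predict_action` and `env` are hypothetical stand-ins.
def rollout(model, env, target_return: float, context_len: int = 20) -> float:
    state = env.reset()
    rtg, states, actions = [target_return], [state], []
    total, done = 0.0, False
    while not done:
        # Condition on the most recent window of (return-to-go, state, action).
        action = model.predict_action(
            rtg[-context_len:], states[-context_len:], actions[-context_len:]
        )
        state, reward, done = env.step(action)
        total += reward
        actions.append(action)
        states.append(state)
        rtg.append(rtg[-1] - reward)  # return still left to be collected
    return total
```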

Authors: Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch

Links:

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
Comments

“this must just have gotten in here by *accident*” right...

hongyihuang

General intelligence can be achieved by maximizing the Schmids that are Hubed.

sofia.eris.bauhaus

I think there's a misunderstanding in the example at 32:00. The issue with a limited context length is the following: if the action that would lead to reward R depends on an action you took a long time ago, then your limited-context policy can't know which action it should take. However, that problem is *not* solved by Q-learning/dynamic programming. If you use a policy with limited context, then even with Q-learning you cannot learn which action to take, for precisely the same reason: the input to your policy network doesn't carry enough information. I think this is a problem of partial observability, not of RL credit assignment.

On the other hand, if your issue is the time separation between a critical action and the reward, as opposed to the observation context length, then the approach in the paper is fine, as I think they are using *returns* (i.e. rewards accumulated over the whole episode). I found your example a bit vague, but I think your criticism was more about the first point.

GuillermoValleCosmos

“I realize some of you youngsters don't know what an LSTM actually is” - oh boy, am I getting old now?

DennisBakhuis

Scary to think that there might be “youngsters” watching these videos who do not know what an LSTM is. I love living in a time with this pace of innovation.

seanohara

Developing an entire literature around game theory, Monte Carlo tree search, and domain-specific methods: 🤢
Throwing a transformer on it: 😎

mgostIH

The fact that conditioning on the past works better probably means the problem is non-Markovian with respect to the state representation initially chosen for the task. Conditioning on past states and rewards (and actions, why not) enriches the state and allows the model to better discriminate the best action. It is limited in terms of context size, but much richer than classic RL, where the system is assumed to be Markovian and a single state is all you get.
Also, credit assignment happens whatever the size of the context, because the reward gets propagated backwards in time as the agent encounters states that are close enough.
If that were not the case, more classic RL models would fare even worse than this one, because they only update a single state-action value, rather than this rich (and smoothed) state representation.
It is because value = current reward + future reward that the reward is progressively propagated back. (You maximize non-discounted rewards by defining a value function with discounted future rewards, so the series converges in the infinite-horizon case.)

Also interesting: in the planning-as-inference literature, you also condition on the "optimality" of your action, similarly to conditioning on the reward, although there the value of the reward does not matter, only that it is the optimal trajectory.
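For reference, the recursion mentioned above in standard textbook notation (not specific to this paper); the geometric bound is what makes the infinite-horizon series converge for gamma below one:

```latex
% "value = current reward + discounted future value"
V^{\pi}(s) = \mathbb{E}_{\pi}\big[\, r_{t+1} + \gamma\, V^{\pi}(s_{t+1}) \mid s_t = s \,\big],
\qquad 0 \le \gamma < 1 .
% With bounded rewards |r_t| \le r_{\max}, the discounted sum is bounded:
|V^{\pi}(s)| \le \sum_{k=0}^{\infty} \gamma^{k}\, r_{\max} = \frac{r_{\max}}{1-\gamma}.
```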

JTMoustache

I love how AI researchers from different firms work together 😊

menzithesonofhopehlope

You threw in Schmidhuber's 2019 paper, but it's also interesting to note how this approach goes back to Hutter 2005 with General Reinforcement Learning as Solomonoff Induction + Utility Theory.

dylancope

I would imagine that by discount factor they were referring to gamma; since Q-learning is a TD(0) algorithm, there is no lambda to tune. One good intuition for the meaning/purpose of a discount factor is that it acts as a proxy for the likelihood your agent will survive to reach a future reward. It's more about tuning how far back credit assignment can look, which affects how stable the learning process is.
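For context, the standard one-step (TD(0)) Q-learning update the comment refers to; only gamma appears, while lambda would only enter through eligibility traces in the TD(lambda) family:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \Big[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \Big]
```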

pjbontrager

Schmidhuber often gets into my folders as the earliest dated file - I don't know how I keep screwing up, but it's good to hear I'm not the only one.

scottmiller

Perhaps this was already pointed out, and I apologise for sounding overly rigorous!

At 17:50 you start describing the fundamental intuition of temporal-difference learning by saying "Q^{\pi}(s) = r + Q^{\pi}(s')". Which is great, but that's the value function (V(s)), not the state-action value function (Q(s, a)), which also takes an action in its function signature. For the purpose of your explanation it doesn't really matter.

But I'll leave this comment here just in case. Keep up the amazing work.
And congrats on your recent graduation, Dr. Kilcher :D
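For readers following along, the textbook distinction the comment points at, with s' drawn from the transition dynamics:

```latex
% State-value function: expectation over the policy's action choice.
V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s),\; s' \sim P(\cdot \mid s, a)}
             \big[\, r(s, a) + \gamma\, V^{\pi}(s') \,\big]
% State-action value function: the first action is given, not sampled.
Q^{\pi}(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}
               \big[\, r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')} Q^{\pi}(s', a') \,\big]
```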

DanielHernandez-rnrp

Thanks for the video, it's very helpful. Ultimately I think this architecture design is awkward (requiring the expected reward to predict an action) and that we're just trying to explain something that doesn't make too much sense. The Transformer can output a probability distribution over a vocabulary, and in this respect it is perfectly suitable for the RL setting, where we need a probability distribution over actions. The problem lies in other aspects, in particular the many **layers** of Transformers, which make them unstable to train in the RL setting (as pointed out in Parisotto's paper). I see this paper as an early attempt to put the Transformer into RL, and this, if successful, would be an AGI prototype. We're very close to having an AGI 😀

CyberneticOrganism

Reward discounting is equivalent to one minus the per-turn probability of leaving the game before it is finished. You can have cyclic behavior whenever the rewards cannot be expressed as a potential; it's not quite the same thing as stability of the reward. Without reward discounting you may see things like asymptotic explosions, which are not cyclic.
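A short derivation of the equivalence stated in the first sentence, assuming the per-step continuation event is independent of the rewards: if the episode continues each step with probability gamma, then

```latex
\mathbb{E}\Big[ \sum_{t \ge 0} r_t \,\mathbf{1}\{\text{episode still active at } t\} \Big]
  = \sum_{t \ge 0} \Pr(\text{active at } t)\, r_t
  = \sum_{t \ge 0} \gamma^{t}\, r_t .
```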

scottmiller

As someone from an NLP background, where Transformers are prevalent, I should point out that the concern about limited span is partially addressed by lines of research on extending how far back Transformers can look.

dojutsu

Hi, great video as always.

I have a problem with the term "offline RL": not every policy learning algorithm is reinforcement learning.
The main problem that RL tries to solve is not reward assignment but the exploration vs. exploitation tradeoff.

If there is no exploration, it is not RL.

oshri

It feels like the notion of “reward” was confused with “return”. The discount factor is just gamma, and lambda is just for the return.
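For reference, the standard definitions behind that distinction: r is the one-step reward, G_t the return, and lambda only shows up in the lambda-return used by TD(lambda):

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}
\qquad\text{(return: discounted sum of future rewards)}
% n-step return and \lambda-return used by TD(\lambda):
G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^{n} V(s_{t+n}),
\qquad
G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} G_t^{(n)} .
```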

binjianxin

Thanks for the video, it is very helpful. At 36:50 the toy example seems to be correct, though. During generation it uses the prior knowledge from its experience to make a decision, so it has -3 from the yellow graph and -1 from the blue graph. During generation it does not care about the real reward it gets (which is -2); what matters, and what is shown in the figure, is the expected reward from prior knowledge.
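To make that concrete, here is a toy sketch in the spirit of the video's shortest-path example: random walks are relabeled with returns-to-go (reward -1 per step), and at generation time actions follow the best return observed in the data rather than the reward actually obtained later. The graph and numbers are illustrative, not the paper's exact setup.

```python
# Toy sketch: random walks on a small graph, reward -1 per step until the goal.
# Training data stores, for each (node, next_node) edge, the best return-to-go
# ever observed; generation then follows those stored expectations.
import random
from collections import defaultdict

graph = {0: [1, 2], 1: [3], 2: [4], 4: [3], 3: []}  # node 3 is the goal
best_rtg = defaultdict(lambda: float("-inf"))       # (node, next) -> best observed return-to-go

for _ in range(1000):                               # collect random-walk trajectories
    node, path = 0, []
    while node != 3:
        nxt = random.choice(graph[node])
        path.append((node, nxt))
        node = nxt
    for t, (s, a) in enumerate(path):               # return-to-go = -(steps remaining)
        best_rtg[(s, a)] = max(best_rtg[(s, a)], -(len(path) - t))

node = 0
while node != 3:                                    # greedy generation on stored returns
    node = max(graph[node], key=lambda nxt: best_rtg[(node, nxt)])
    print("move to node", node)
```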

BehnamManeshgar

Couldn't you simply do a hybrid version where you pass in Q-values at the start of the context window, or something?

Kram