Divide-and-Conquer Monte Carlo Tree Search For Goal-Directed Planning (Paper Explained)

When AI makes a plan, it usually does so step by step, forward in time. But it is often beneficial to define intermediate goals that divide a large problem into easier sub-problems. This paper proposes a generalization of MCTS that searches not for the best next actions to take, but for the best way to recursively sub-divide the problem into sub-problems so tiny that each can be solved in a single step.

Abstract:
Standard planners for sequential decision making (including Monte Carlo planning, tree search, dynamic programming, etc.) are constrained by an implicit sequential planning assumption: the order in which a plan is constructed is the same order in which it is executed. We consider alternatives to this assumption for the class of goal-directed Reinforcement Learning (RL) problems. Instead of an environment transition model, we assume an imperfect, goal-directed policy. This low-level policy can be improved by a plan, consisting of an appropriate sequence of sub-goals that guide it from the start to the goal state. We propose a planning algorithm, Divide-and-Conquer Monte Carlo Tree Search (DC-MCTS), for approximating the optimal plan by means of proposing intermediate sub-goals which hierarchically partition the initial tasks into simpler ones that are then solved independently and recursively. The algorithm critically makes use of a learned sub-goal proposal for finding appropriate partition trees of new tasks based on prior experience. Different strategies for learning sub-goal proposals give rise to different planning strategies that strictly generalize sequential planning. We show that this algorithmic flexibility over planning order leads to improved results in navigation tasks in grid-worlds as well as in challenging continuous control environments.
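To make the idea concrete, here is a minimal sketch of the recursion at the heart of the approach. This is a simplification in Python, not the authors' implementation: the function names, the greedy depth-first control flow, and the two callables are assumptions.

    def plan(start, goal, low_level_solves, propose_subgoals, depth=0, max_depth=8):
        """Recursively partition the task start -> goal into sub-tasks.

        low_level_solves(a, b): True if the imperfect low-level policy is
            judged able to reach b from a directly (one "step" of the plan).
        propose_subgoals(a, b): candidate intermediate states, best first.
        Returns a list of states [start, ..., goal], or None on failure.
        """
        if low_level_solves(start, goal):
            return [start, goal]              # base case: one-step sub-problem
        if depth >= max_depth:
            return None                       # give up on this branch
        for mid in propose_subgoals(start, goal):
            left = plan(start, mid, low_level_solves, propose_subgoals,
                        depth + 1, max_depth)
            if left is None:
                continue
            right = plan(mid, goal, low_level_solves, propose_subgoals,
                         depth + 1, max_depth)
            if right is not None:
                return left + right[1:]       # splice, dropping duplicated mid
        return None

DC-MCTS proper replaces this greedy recursion with a best-first tree search over possible partitions, using learned networks to propose sub-goals and to estimate the value of each split.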

Authors: Giambattista Parascandolo, Lars Buesing, Josh Merel, Leonard Hasenclever, John Aslanides, Jessica B. Hamrick, Nicolas Heess, Alexander Neitz, Theophane Weber

Comments

Thank you so much, you're the best! May God reward you!

friedrichwaterson

thanks mate, you saved my engineering thesis with that

karolszymczyk

The method seems to rely on the SELECT heuristic, i.e. it will perform well in environments that can be judged 'intuitively'. I wonder how it would perform in hard mazes with deceptive traps (e.g. where physical distance becomes a bad indicator of step-distance). I would expect the advantage to go away, since the guessing power of a neural net might be much more limited in such situations. So the interesting question is: how well can a neural net 'guess' a good subdivision state? This might depend heavily on how the problem is fed into the net, etc. (Disclaimer: I have not yet read the paper.)

bluelng

This is an interesting technique. The idea of trading off deep tree searches for wide ones is appealing.
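To make the trade-off concrete (toy numbers, not from the paper): a step-by-step planner must search to depth T for a goal T steps away, while always splitting at the midpoint gives a recursion depth of only about log2(T), at the cost of a much larger branching factor over candidate sub-goals.

    import math

    T = 256                                                  # hypothetical plan length
    print("sequential search depth:", T)                     # 256
    print("midpoint-split depth:", math.ceil(math.log2(T)))  # 8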

As it stands, I don't see it having much practical significance. It depends on having a deterministic, simulatable environment. It assumes a discrete state space. It requires perfect knowledge of the environment. It needs a cheap test of whether a state is valid or not. And it has only been tested on toy problems with a couple hundred states. Real problems frequently have more states than the number of atoms in the universe, and this kind of approach would probably just break down there.

Despite all that, there still may be some merit to the idea. You might be able to change the training so this still works even with absurdly large state spaces. Perhaps if you did that, you could beat AlphaGo with less training time. But what's the point in beating an AI at a game when the AI already dominates humans?

Often you can reformulate problems to make some of the assumptions it needs hold true, or reformulate the algorithm to work when they don't. That could make the algorithm applicable to some real-world practical problems. I suppose the question is: do we have problems where we'd like to use MCTS, but it's too slow because the tree depth to the goal is too great? Perhaps it's a bit like a hammer in search of a nail, but I guess it's nice to be clever enough at making hammers that we can beat any nail that comes our way.

I have a strong suspicion that in practical cases this should only be a fallback for when regular MCTS does not work: at every layer of the DC-MCTS search tree, before subdividing the problem further, check whether plain MCTS can solve it quickly. If so, there's no need to go deeper.
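Sketched out, that hybrid might look like this (hypothetical helpers passed in as callables; nothing here is from the paper):

    def solve(start, goal, try_mcts, select_subgoal, depth=0, max_depth=6):
        """Fall back to subdivision only when plain MCTS fails.

        try_mcts(a, b): a plan (list of states) or None, under a small budget.
        select_subgoal(a, b): one candidate intermediate state.
        """
        direct = try_mcts(start, goal)        # cheap attempt with vanilla MCTS
        if direct is not None or depth >= max_depth:
            return direct                     # solved directly, or out of depth
        mid = select_subgoal(start, goal)     # otherwise split and recurse
        left = solve(start, mid, try_mcts, select_subgoal, depth + 1, max_depth)
        right = solve(mid, goal, try_mcts, select_subgoal, depth + 1, max_depth)
        if left is None or right is None:
            return None
        return left + right[1:]               # splice, dropping the duplicate mid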

I see a lot of algorithmic freedom in how to train a model to efficiently and effectively serve as the SELECT heuristic. This will probably require a lot of future research before the technique is broadly effective, and probably every problem it is applied to will need its own twist. I think it's important to note: you don't necessarily even need to use neural nets as the model here. You could use random forests, or k-nearest neighbors, or any other model that works. In fact, some of these simpler models might be more effective with fewer training samples, which would make them a better fit.
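For instance, a k-nearest-neighbors regressor could stand in for the sub-goal proposer (a toy sketch with made-up features; the data layout is an assumption):

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    # Training data harvested from previously solved tasks: inputs are
    # concatenated (start, goal) coordinates, targets are the midpoints
    # of plans that actually worked.
    X = np.random.rand(100, 4)   # stand-in for concat(start_xy, goal_xy)
    y = np.random.rand(100, 2)   # stand-in for the observed good sub-goal

    proposer = KNeighborsRegressor(n_neighbors=5).fit(X, y)

    def propose_subgoal(start_xy, goal_xy):
        query = np.concatenate([start_xy, goal_xy]).reshape(1, -1)
        return proposer.predict(query)[0]  # average sub-goal of 5 nearest tasks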

Ultimately, the big question over whether this technique is worth pursuing is whether it can be made effective on problems with enormous, high-dimensional state spaces, because most of the really hard problems fall into that category. If it can be made to work there, it could have great potential.

jrkirby

Question: given enough time, can MCTS (Monte Carlo Tree Search) find the best solution?

The problem with MCTS is that it chooses the child node with the highest probability of containing a solution. As long as those probabilities don't change, MCTS will keep choosing the same node, no matter how many iterations you perform. That means some leaves (terminal nodes) are effectively unreachable, and if the best solution happens to be in such a leaf, MCTS will never find it.
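For reference, the standard UCT selection rule adds an exploration bonus that grows for rarely visited children, which is what keeps every leaf reachable in the limit (a generic UCB1 sketch, not specific to this paper):

    import math

    def uct_child(children, c=1.4):
        """Pick a child by UCB1: mean value plus an exploration bonus.
        Each child is assumed to carry .visits and .value_sum fields."""
        parent_visits = sum(ch.visits for ch in children)
        def score(ch):
            if ch.visits == 0:
                return float("inf")   # unvisited children are tried first
            return (ch.value_sum / ch.visits
                    + c * math.sqrt(math.log(parent_visits) / ch.visits))
        return max(children, key=score)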

hcm

Maybe one day, when we have large enough GPUs, it would be interesting to backpropagate through the TRAVERSE procedure (9:35). That might give the SELECT function enough gradients to train for more complex tasks. IDK, the whole procedure looks differentiable to me, if I didn't miss something.
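A toy illustration of what a differentiable relaxation of the select step might look like (a sketch under my own assumptions, not from the paper): replace the hard argmax over candidate sub-goals with a softmax, so gradients reach the scoring network.

    import torch

    logits = torch.randn(5, requires_grad=True)  # scores for 5 candidate sub-goals
    weights = torch.softmax(logits, dim=0)       # soft selection instead of argmax
    subgoal_values = torch.randn(5)              # stand-in for downstream values
    expected_value = (weights * subgoal_values).sum()
    expected_value.backward()                    # gradients flow back to the scorer
    print(logits.grad)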

herp_derpingson