SupSup: Supermasks in Superposition (Paper Explained)

Supermasks are binary masks of a randomly initialized neural network that result in the masked network performing well on a particular task. This paper considers the problem of (sequential) Lifelong Learning and trains one Supermask per Task, while keeping the randomly initialized base network constant. By minimizing the output entropy, the system can automatically derive the Task ID of a data point at inference time and distinguish up to 2500 tasks.
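To make the setup concrete, here is a minimal, hypothetical PyTorch-style sketch (not the authors' code; the layer sizes and the function name masked_forward are made up) of what "one supermask per task over a fixed random network" means: the weights are drawn once and frozen, and each task contributes only a binary mask.

import torch

torch.manual_seed(0)
weight = torch.randn(10, 784)   # fixed, randomly initialized weights; never trained

# One binary (0/1) mask per task. In the paper these masks are learned;
# here they are random placeholders just to illustrate the shapes involved.
masks = [torch.randint(0, 2, (10, 784)).float() for _ in range(3)]

def masked_forward(x, task_id):
    # The effective subnetwork for a task is the fixed weights times its mask.
    return x @ (weight * masks[task_id]).T

x = torch.randn(1, 784)
logits = masked_forward(x, task_id=1)   # run the input through task 1's subnetwork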

OUTLINE:
0:00 - Intro & Overview
1:20 - Catastrophic Forgetting
5:20 - Supermasks
9:35 - Lifelong Learning using Supermasks
11:15 - Inference Time Task Discrimination by Entropy
15:05 - Mask Superpositions
24:20 - Proof-of-Concept, Task Given at Inference
30:15 - Binary Maximum Entropy Search
32:00 - Task Not Given at Inference
37:15 - Task Not Given at Training
41:35 - Ablations
45:05 - Superfluous Neurons
51:10 - Task Selection by Detecting Outliers
57:40 - Encoding Masks in Hopfield Networks
59:40 - Conclusion

Abstract:
We present the Supermasks in Superposition (SupSup) model, capable of sequentially learning thousands of tasks without catastrophic forgetting. Our approach uses a randomly initialized, fixed base network and for each task finds a subnetwork (supermask) that achieves good performance. If task identity is given at test time, the correct subnetwork can be retrieved with minimal memory usage. If not provided, SupSup can infer the task using gradient-based optimization to find a linear superposition of learned supermasks which minimizes the output entropy. In practice we find that a single gradient step is often sufficient to identify the correct mask, even among 2500 tasks. We also showcase two promising extensions. First, SupSup models can be trained entirely without task identity information, as they may detect when they are uncertain about new data and allocate an additional supermask for the new training distribution. Finally the entire, growing set of supermasks can be stored in a constant-sized reservoir by implicitly storing them as attractors in a fixed-sized Hopfield network.
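The task-inference step described in the abstract can be sketched roughly as follows (assuming weight, masks, and the input x from the snippet above; this is one reading of the one-gradient-step rule, not the authors' implementation): mix all stored supermasks with coefficients alpha, compute the output entropy, and pick the mask whose coefficient would decrease the entropy the fastest.

# Rough sketch of entropy-based task inference (uses weight, masks, x from above).
alphas = torch.full((len(masks),), 1.0 / len(masks), requires_grad=True)

# Linear superposition of all stored supermasks, weighted by alpha.
mixed_mask = sum(a * m for a, m in zip(alphas, masks))
logits = x @ (weight * mixed_mask).T
probs = torch.softmax(logits, dim=-1)
entropy = -(probs * torch.log(probs + 1e-8)).sum()

# One gradient step on alpha: the mask whose coefficient would reduce the
# entropy the fastest (most negative gradient) is taken as the inferred task ID.
entropy.backward()
inferred_task = int(torch.argmin(alphas.grad))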

Authors: Mitchell Wortsman, Vivek Ramanujan, Rosanne Liu, Aniruddha Kembhavi, Mohammad Rastegari, Jason Yosinski, Ali Farhadi

Comments:

If you are interested in more literature on catastrophic forgetting, I would recommend "Overcoming catastrophic forgetting in neural networks", which uses elastic weight consolidation. Pretty interesting paper by DeepMind.

herp_derpingson

This channel should easily have 1M subs. Many other coding channels just copy a GitHub repo and try to teach from it; these are proper theoretical lectures for newbies as well as more experienced viewers. Explanation with critical commentary really helps people develop their own reasoning about Deep Learning.

davidelesci

Wow, you are really fast, and you even share insights across different domains! Love it.

manojb

Any clues about the connection with dropout? As far as I understand, the idea behind dropout is to produce random masks in order to train a potentially exponential number of sub-networks and to prevent co-dependencies and therefore overfitting... I am surprised the term "dropout" does not even appear in the paper.
Thanks Yannic for the great job!

adamantidus

40:00 I wonder if a learned task embedding could be used to tell when a novel task is being presented?
In long-term goal planning (e.g. an RL agent) there will be a sequence of sub-tasks. Similar to word2vec and SAFE's Instruction2vec, a task2vec could help Network Architecture Search (or, in this paper's case, it could help create the mask for the network prior). Attention over the context around each sub-task (the context being the other sub-tasks) could produce this embedding. The resulting space could then be used as a measure of similarity between tasks, or paired with another network which decodes individual vectors into neural networks.

rbain

Full power to you, Yannic :)
11:57 - A silly question: what if the number of classes varies across the networks? How would one select which distribution to follow then?

sayakpaul

Really loving your channel. Learning so much.

couragefox

I want to see these results on a dataset other than MNIST

siyn

But how is this actually different from learning multiple networks? Each mask has just as many entries as a full copy of the network's weights, and we may as well just replace each mask entry with the product of itself and its associated "random" weight in the underlying NN, and then set all weights of the underlying network to 1. Clearly nothing was achieved overall?

seraphim

26:12 Not much, what's supsup with you?

rbain

Could you make more vids on the Blender chatbot? No one else is, and I'm thirsty for a good chatbot like that. Also, will you be making videos on Google's Meena when it comes out?

CMatt

Couldn't you deform/mutate MNIST according to some parameter, like they do with e.g. old-school captchas? Wouldn't that remove the global information parity between tasks?

jeremykothe

Why does every paper sound like AGI is solved?

dmitrysamoylenko

It might not be that big a deal if multiple networks are pretty confident about a task. That might simply mean that very similar things are being decided, making each result plausible, right?
With the split ImageNet setting, might you not just as well go all the way and effectively train a binary classifier per task? I'm not sure whether that would be simpler than training an n-way classifier in one go. If you've got a thousand classes, it might well be a huge problem that each task will have *vastly* more negative than positive examples. But otherwise that might work?
And together with the heuristic of adding new supermasks each time, it might effectively end up being agnostic to the number of labels you ask it to produce? Although I guess if you want each thing to be one-hot-identified, you'd still somehow have to add new final output weights as you add classes, and I'm not sure that's easy or reasonable to do.

Kram

Can those kinds of masks be used to achieve an ensemble effect within the same model, i.e. find two or more distinct supermasks for the same task? Does anyone know whether there are publications about that?

deepblender

The task may be easy, but if the authors have used shallow NNs then it's fine; the results and conclusions will still hold.

dippatel

I know who I am. And you're learning who you are :D

jeremykothe