SupSup: Supermasks in Superposition (Paper Explained)

Supermasks are binary masks of a randomly initialized neural network that result in the masked network performing well on a particular task. This paper considers the problem of (sequential) Lifelong Learning and trains one Supermask per Task, while keeping the randomly initialized base network constant. By minimizing the output entropy, the system can automatically derive the Task ID of a data point at inference time and distinguish up to 2500 tasks.
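To make the setup concrete, here is a minimal, hypothetical PyTorch-style sketch (not the authors' code; the layer sizes and the function name masked_forward are made up) of what "one supermask per task over a fixed random network" means: the weights are drawn once and frozen, and each task contributes only a binary mask.

import torch

torch.manual_seed(0)
weight = torch.randn(10, 784)   # fixed, randomly initialized weights; never trained

# One binary (0/1) mask per task. In the paper these masks are learned;
# here they are random placeholders just to illustrate the shapes involved.
masks = [torch.randint(0, 2, (10, 784)).float() for _ in range(3)]

def masked_forward(x, task_id):
    # The effective subnetwork for a task is the fixed weights times its mask.
    return x @ (weight * masks[task_id]).T

x = torch.randn(1, 784)
logits = masked_forward(x, task_id=1)   # run the input through task 1's subnetwork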

OUTLINE:
0:00 - Intro & Overview
1:20 - Catastrophic Forgetting
5:20 - Supermasks
9:35 - Lifelong Learning using Supermasks
11:15 - Inference Time Task Discrimination by Entropy
15:05 - Mask Superpositions
24:20 - Proof-of-Concept, Task Given at Inference
30:15 - Binary Maximum Entropy Search
32:00 - Task Not Given at Inference
37:15 - Task Not Given at Training
41:35 - Ablations
45:05 - Superfluous Neurons
51:10 - Task Selection by Detecting Outliers
57:40 - Encoding Masks in Hopfield Networks
59:40 - Conclusion

Abstract:
We present the Supermasks in Superposition (SupSup) model, capable of sequentially learning thousands of tasks without catastrophic forgetting. Our approach uses a randomly initialized, fixed base network and for each task finds a subnetwork (supermask) that achieves good performance. If task identity is given at test time, the correct subnetwork can be retrieved with minimal memory usage. If not provided, SupSup can infer the task using gradient-based optimization to find a linear superposition of learned supermasks which minimizes the output entropy. In practice we find that a single gradient step is often sufficient to identify the correct mask, even among 2500 tasks. We also showcase two promising extensions. First, SupSup models can be trained entirely without task identity information, as they may detect when they are uncertain about new data and allocate an additional supermask for the new training distribution. Finally the entire, growing set of supermasks can be stored in a constant-sized reservoir by implicitly storing them as attractors in a fixed-sized Hopfield network.
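The task-inference step described in the abstract can be sketched roughly as follows (assuming weight, masks, and the input x from the snippet above; this is one reading of the one-gradient-step rule, not the authors' implementation): mix all stored supermasks with coefficients alpha, compute the output entropy, and pick the mask whose coefficient would decrease the entropy the fastest.

# Rough sketch of entropy-based task inference (uses weight, masks, x from above).
alphas = torch.full((len(masks),), 1.0 / len(masks), requires_grad=True)

# Linear superposition of all stored supermasks, weighted by alpha.
mixed_mask = sum(a * m for a, m in zip(alphas, masks))
logits = x @ (weight * mixed_mask).T
probs = torch.softmax(logits, dim=-1)
entropy = -(probs * torch.log(probs + 1e-8)).sum()

# One gradient step on alpha: the mask whose coefficient would reduce the
# entropy the fastest (most negative gradient) is taken as the inferred task ID.
entropy.backward()
inferred_task = int(torch.argmin(alphas.grad))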

Authors: Mitchell Wortsman, Vivek Ramanujan, Rosanne Liu, Aniruddha Kembhavi, Mohammad Rastegari, Jason Yosinski, Ali Farhadi

Comments:

If you are interested in more literature on catastrophic forgetting, I would recommend "Overcoming catastrophic forgetting in neural networks", which uses elastic weight consolidation. Pretty interesting paper by DeepMind.

herp_derpingson

This channel should easily have 1M subs. Many other coding channels just copy a GitHub repo and try to teach from it; these are proper theoretical lectures for newbies as well as more experienced viewers. Explanation with critical commentary really helps people develop their own reasoning about Deep Learning.

davidelesci

Wow, you are really fast, and you even share insights across different domains! Love it.

manojb

Any clues about the connection with dropout? As far as I understand, the idea behind dropout is to produce random masks in order to train a potentially exponential number of sub-networks and to prevent co-dependencies and therefore overfitting... I am surprised the term "dropout" does not even appear in the paper.
Thanks Yannic for the great job!

adamantidus

40:00 I wonder if a learned task embedding could be used to tell when a novel task is being presented?
In long-term goal planning (e.g. an RL agent) there will be a sequence of sub-tasks. Similar to word2vec and SAFE's Instruction2vec, a task2vec could help Network Architecture Search (or, in this paper's case, it could help create the mask for the network prior). Attention over the context around each sub-task (the context being the other sub-tasks) could produce this embedding. The resulting space could then be used as a measure of similarity between tasks, or paired with another network which decodes individual vectors into neural networks.

rbain

Full power to you, Yannic :)
11:57 - A silly question: what if the number of classes varies across the networks? How would one select which distribution to follow then?

sayakpaul

Really loving your channel. Learning so much.

couragefox

I want to see these results on a dataset other than MNIST

siyn

But how is this actually different from learning multiple networks? Each mask has just as many entries as a full copy of the network's weights, and we may as well just replace each mask entry with the product of itself and its associated "random" weight in the underlying NN, and then set all weights of the underlying network to 1. Clearly nothing was achieved overall?

seraphim

26:12 Not much, what's supsup with you?

rbain

Could you make more vids on the Blender chatbot? No one else is, and I'm thirsty for a good chatbot like that. Also, will you be making videos on Google's Meena when it comes out?

CMatt

Couldn't you deform/mutate MNIST according to some parameter, like they do with e.g. old-school captchas? Wouldn't that remove the global information parity between tasks?

jeremykothe

Why does every paper sound like AGI is solved?

dmitrysamoylenko

It might not be that big a deal if multiple networks are pretty confident about a task. That might simply mean that very similar things are being decided, making each result plausible, right?
With the split ImageNet setting, might you not just as well go all the way and effectively train a binary classifier per task? I'm not sure whether that would be simpler than training an n-way classifier in one go. If you've got a thousand classes, it might well be a huge problem that each task will have *vastly* more negative than positive examples. But otherwise that might work?
And together with the heuristic of adding new supermasks each time, it might effectively end up being agnostic to the number of labels you ask it to produce? Although I guess if you want each thing to be one-hot-identified, you'd still somehow have to add new final output weights as you add classes, and I'm not sure that's easy or reasonable to do.

Kram

Can those kinds of masks be used to achieve an ensemble effect within the same model, i.e. find two or more distinct supermasks for the same task? Does anyone know whether there are publications about that?

deepblender

The task may be easy, but if the authors have used shallow NNs then it's fine; the results and conclusions will still hold.

dippatel

I know who I am. And you're learning who you are :D

jeremykothe