Learning Forever, Backprop Is Insufficient

#ai #ml

Continual Learning, or lifelong learning, is becoming more popular in Machine Learning (ML). This research paper discusses plasticity decay and why standard backpropagation is insufficient for continual learning. The inherent non-stationarity of many problems, especially in Reinforcement Learning (RL), makes them difficult to learn continually. Continual Backpropagation (CBP) is proposed as a solution.

Outline:
0:00 - Overview
2:00 - Paper Intro
2:53 - Problems & Environments
8:11 - Plasticity Decay Experiments
11:45 - Continual Backprop Explained
15:54 - Continual Backprop Experiments
22:00 - Extra Interesting Experiments
25:34 - Summary

Abstract:
The Backprop algorithm for learning in neural networks utilizes two mechanisms: first, stochastic gradient descent and second, initialization with small random weights, where the latter is essential to the effectiveness of the former. We show that in continual learning setups, Backprop performs well initially, but over time its performance degrades. Stochastic gradient descent alone is insufficient to learn continually; the initial randomness enables only initial learning but not continual learning. To the best of our knowledge, ours is the first result showing this degradation in Backprop's ability to learn. To address this issue, we propose an algorithm that continually injects random features alongside gradient descent using a new generate-and-test process. We call this the Continual Backprop algorithm. We show that, unlike Backprop, Continual Backprop is able to continually adapt in both supervised and reinforcement learning problems. We expect that as continual learning becomes more common in future applications, a method like Continual Backprop will be essential where the advantages of random initialization are present throughout learning.
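To make the generate-and-test idea in the abstract concrete, here is a minimal NumPy sketch of a per-layer reinitialization step run after the usual gradient update. The contribution-utility formula, decay rate, maturity threshold, and initialization scale are simplified assumptions for illustration, not the authors' reference implementation.

```python
import numpy as np

def cbp_reinit_step(W_in, W_out, h, utility, age, frac,
                    rho=1e-4, decay=0.99, maturity=100, init_std=0.1, rng=None):
    """Simplified generate-and-test step of Continual Backprop for one hidden layer.

    W_in:    (n_inputs, n_hidden) incoming weights of the layer
    W_out:   (n_hidden, n_outputs) outgoing weights of the layer
    h:       (n_hidden,) current activations of the layer
    utility: (n_hidden,) running contribution utility, updated in place
    age:     (n_hidden,) steps since each unit was (re)initialized, updated in place
    frac:    carried-over fractional number of units to replace
    Returns the updated fractional count.
    """
    rng = rng or np.random.default_rng()
    age += 1

    # Contribution utility (simplified): activation magnitude times outgoing
    # weight mass, tracked as an exponential running average.
    contribution = np.abs(h) * np.abs(W_out).sum(axis=1)
    utility[:] = decay * utility + (1.0 - decay) * contribution

    # Only units older than the maturity threshold are eligible for replacement.
    eligible = np.flatnonzero(age > maturity)
    frac += rho * eligible.size          # replacements accumulate as a fraction per step
    n_replace = int(frac)
    frac -= n_replace
    if n_replace == 0:
        return frac

    # Reset the lowest-utility eligible units: new random incoming weights,
    # zero outgoing weights (so the rest of the network is undisturbed).
    worst = eligible[np.argsort(utility[eligible])[:n_replace]]
    W_in[:, worst] = rng.normal(0.0, init_std, size=(W_in.shape[0], worst.size))
    W_out[worst, :] = 0.0
    utility[worst] = 0.0
    age[worst] = 0
    return frac
```

In training, one would call this once per step for each hidden layer, carrying `frac` between calls so that fractional replacement counts accumulate over time.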
Comments

Reminds me of creating sparse neural networks that remove neurons that don't contribute much, in order to make inference more efficient on computationally limited devices.

craftmechanics

Thank you for breaking these down. Some day I'll learn math notation, but for now it just feels like my brain is imploding.
Putting this into terms I can understand as a programmer, or just a layman, really helps me understand the underlying functions and adaptations these papers present 🙌

JakeDownsWuzHere

I'm sad that YouTube took so long to recommend your channel to me.
Great videos, thanks!

siarez

Thanks for the great video! Some thoughts on the paper:

1. The cited DeepMind paper on overcoming catastrophic forgetting (Kirkpatrick et al., 2017) has a different approach for identifying "useful" parts of the network, which relies on Fisher information—their approach looked at the information carried by weights, but you could look at the Fisher information carried by neurons' output values instead. This would give an alternative to the "contribution utility" used in this paper. I'd be interested to see a comparison here, since it seems like Fisher information is a more direct/accurate way of determining utility (it also might be best to combine the two methods). Mathematically, it should just mean taking a running average of the squared loss gradients at the neuron, instead of using the product of the mean-adjusted output with the outgoing weights. This also means that low-utility neurons would be ones that consistently have small gradients, so resetting them makes a lot of sense (as little learning is likely taking place there).

2. Regarding the Figure 19 experiments (23:00), I really wish we could see how the saturation and gradient magnitudes correlate with a neuron's utility (and for CBP its age). Intuitively, I would guess that despite having a lower mean gradient magnitude than L2, CBP's distribution probably has an upper tail, with a small number of recently-reset neurons seeing larger gradients as they drive adaptation. Maybe what's harming the non-CBP approaches is the abundance of low-utility, low-gradient neurons, which are not contributing much and also unable to adapt. In CBP, these would just get reset, restoring them to have better gradients, so we likely wouldn't see as many low-utility and low-gradient neurons.

3. I wonder if this paper's technique would help in training GANs? It seems like adaptability would be a big benefit in that space, as the generator and discriminator are being trained against each other while both are constantly evolving. Does anyone know if something similar to this has been explored?

Disclaimer: I'm not an expert :)

shirogane_
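Regarding point 1 in the comment above: a rough sketch of what that Fisher-information-style utility could look like, i.e. an exponential running average of the squared loss gradient at each hidden unit's output. The names and decay constant are illustrative assumptions; `grad_h` would come from whatever autodiff framework is in use.

```python
import numpy as np

def fisher_style_utility(utility, grad_h, decay=0.99):
    """Update a per-unit utility as a running average of squared gradients.

    utility: (n_hidden,) running utility estimate, updated in place
    grad_h:  (n_hidden,) dLoss/d(hidden activation) for the current step
    Units that consistently receive small gradients end up with low utility
    and would be the ones selected for reinitialization.
    """
    utility[:] = decay * utility + (1.0 - decay) * grad_h ** 2
    return utility
```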

Great overview! Keep up the good work <3

levanhieu

Thanks for the overview!

This approach reminds me a lot of dropout. The idea seems elegant and straightforward. You'd have to pick an optimizer that does not decay its learning rate to zero, but that is easy to do.

I’m curious how an approach like this could identify neurons that are important for other tasks in a multi-task setting. If we train it, for example, to play Go, and then switch to playing Chess, can we distinguish between neurons that were important for Go and neurons that weren’t important for anything? (Probably difficult to do, but maybe there is a way. Or we resort to curriculum learning and toss in some Go examples from time to time)

I’m a bit sad about the title. The alteration is not to backprop, and it has more to do with long-term learning than continuity. I’d have called it Backprop with Re-initialization or similar. Decidedly less sexy, but more to-the-point.

timallanwheeler

Many thanks for the overview, a very thorough explanation! A quick and very specific question after seeing the algorithm, regarding the replacement rate ρ (rho): why do you think it is so small in most of the experiments? From what I saw, the biggest layer they use has 2,000 units/features/nodes (whatever!), and even assuming that at a given time all of those are eligible, ρ = 0.001 gives you just 2 units, but sometimes they use even lower rates like 10^-4 (0.0001). That would give 0.2 units, which makes no sense to me. Am I missing anything here? On the other hand, some aspects of this paper (this somehow 'selective' dropout) bear some similarities to a paper I saw a while ago: Jung et al. 2021, "Continual Learning with Node-Importance based Adaptive Group Sparse Regularization". Thanks!

inyi
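On the replacement-rate question above: one way such small values of ρ can still be meaningful is if the number of units to replace accumulates across steps, so a fractional 0.2 units per step becomes roughly one whole unit every five steps. A tiny sketch of that accumulator reading (an assumption about the bookkeeping, not taken from the paper's code):

```python
# A 2,000-unit layer with replacement rate rho = 1e-4 gives 0.2 units per step;
# accumulating the fraction and replacing a unit whenever the accumulator
# crosses 1 yields about one replacement every 5 steps, 20 over 100 steps.
n_units, rho = 2000, 1e-4
acc, replaced = 0.0, 0
for step in range(100):
    acc += rho * n_units        # += 0.2 each step
    while acc >= 1.0:
        acc -= 1.0
        replaced += 1
print(replaced)                 # -> 20
```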

Hey Eldan, just found your content and I really appreciate your insight on these somewhat hidden topics. I'm curious what you think are some of the most important skills for aspiring machine learning researchers to develop. Thanks for the content!

patrickl

One question: resetting all the output weights of the neurons r to zero should cause them to "die". Their signal no longer contributes to the output signal, and thus the derivative of the error calculated by the backprop algorithm creates no training signal for the weights of those neurons. Consequently, do the output weights of the neurons r stay zero forever, leaving them "dead"?

If that's true, all the algorithm does is prevent overfitting.

markusweber
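On the "dead neurons" question above, a quick gradient check on a hypothetical one-hidden-layer network (not the paper's setup): with the outgoing weight set to zero, backprop still gives that weight a nonzero gradient (the unit's activation times the output error), while the unit's incoming weights receive zero gradient only until the outgoing weight moves away from zero, so a reset unit is not necessarily permanently dead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny net: 3 inputs -> 4 tanh hidden units -> 1 linear output.
# Unit 0 plays the role of a freshly reset unit: random incoming
# weights, outgoing weight set to zero.
x = rng.normal(size=3)
W_in = rng.normal(size=(3, 4))
w_out = rng.normal(size=4)
w_out[0] = 0.0

h = np.tanh(W_in.T @ x)               # hidden activations
y = w_out @ h                         # scalar prediction
err = y - 1.0                         # d(0.5 * (y - target)^2) / dy with target = 1

grad_w_out = err * h                               # gradient for the outgoing weights
grad_W_in = np.outer(x, err * w_out * (1 - h**2))  # gradient for the incoming weights

print(grad_w_out[0])     # generally nonzero: the zeroed outgoing weight can grow back
print(grad_W_in[:, 0])   # all zeros while w_out[0] == 0
```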

Man, I love your videos. Okay, maybe I'm interpreting this incorrectly or my takeaways are wrong, as I'm not an ML researcher/practitioner, but my understanding is that this is akin to the "garbage in / garbage out" problem that is pervasive in software development. I'm curious if results could be improved with signal-processing functions applied to the inputs. For instance, in the slippery-ant problem, if I were building a multi-pedal robot I would be inclined to use a movement controller that adjusts servo output based on some variable such as how the terrain is impacting the pitch or yaw of the robot. In a sense, that's a signal-processing function feeding the vector back to the AI controlling it: the ground is rough so we are moving at vector X, or the ground is treacherous so we are moving at vector vX, where v is a velocity coefficient given the terrain adjustments. So essentially we are normalizing our inputs into a single domain, which is the vector of the robot relative to its destination. Not sure if that made any sense. Thanks again!

josgraha

Could you interview the authors of the paper? Maybe show some code and simulations.

AmCanTech

Interesting paper. The video is quite long relative to the benefit, for me personally, and I would think that this is the case for most people. I know that a more concise version would be more time-consuming to produce, but if you want to grow your subscriber base, I would aim for 10-minute videos.

aethermass

Can you please make a video on HSIC Bottleneck training? It's faster than Backprop and even Rprop.

Sun.Protector

What kind of math is being used in the paper?

-delilahlin-

I would love to see the experiments rerun with a more complicated dataset like ImageNet. The approach looks promising, but I am curious whether it holds up on less simple problems and how it performs against L2 on those more difficult tasks.

Markste-in

Wait, so continual/lifelong learning means you CHANGE the neural network over time and don't hard-code its weights??

StochasticCockatoo

Why don't they just reset the ones that are less used? That is probably how our network works. I forget everything, so I reset even the ones that are used.

dancar

Oh wow the inverse of transfer learning

maxim_ml

Trying to force a square peg into a round hole. The whole thing is completely messed up; patching backpropagation seems inherently flawed. Look up HTM theory: extremely fast learning of spatio-temporal patterns after the model sees them only 2-3 times, and absolutely no catastrophic forgetting, since neurons specialise over time and only active neurons learn from their input using Hebbian learning. It's an algorithm inspired by the human neocortex, which is just so much better than modern deep learning in every way.

seanjhardy