Yann LeCun - Self-Supervised Learning: The Dark Matter of Intelligence (FAIR Blog Post Explained)

#selfsupervisedlearning #yannlecun #facebookai

Deep Learning systems can achieve remarkable, even super-human performance through supervised learning on large, labeled datasets. However, there are two problems: First, collecting ever more labeled data is expensive in both time and money. Second, these deep neural networks perform well on the task they were trained for, but cannot easily generalize to other, related tasks, or they need large amounts of data to do so. In this blog post, Yann LeCun and Ishan Misra of Facebook AI Research (FAIR) describe the current state of Self-Supervised Learning (SSL) and argue that it is the next step in the development of AI, one that uses fewer labels and can transfer knowledge faster than current systems. As a promising direction, they suggest building non-contrastive latent-variable predictive models, similar to VAEs, but ones that also provide high-quality latent representations for downstream tasks.
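To make the "predict hidden parts from observed parts" idea concrete, here is a minimal sketch of a masked-prediction pretext task; the network, masking ratio, and reconstruction loss below are illustrative assumptions, not the specific method from the blog post.

```python
# Minimal sketch of a masked-prediction pretext task (illustrative only;
# the architecture, masking ratio, and loss are assumptions, not FAIR's method).
import torch
import torch.nn as nn

class MaskedPredictor(nn.Module):
    def __init__(self, dim=32, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, x_masked):
        return self.net(x_masked)

def ssl_step(model, x, mask_ratio=0.25):
    # Hide a random subset of the input and predict it from the visible part.
    mask = (torch.rand_like(x) < mask_ratio).float()
    x_visible = x * (1.0 - mask)
    pred = model(x_visible)
    # Reconstruction loss only on the hidden entries.
    return ((pred - x) ** 2 * mask).sum() / mask.sum().clamp(min=1.0)

model = MaskedPredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(16, 32)          # a batch of unlabeled data
opt.zero_grad()
loss = ssl_step(model, x)
loss.backward()
opt.step()
```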

OUTLINE:
0:00 - Intro & Overview
1:15 - Supervised Learning, Self-Supervised Learning, and Common Sense
7:35 - Predicting Hidden Parts from Observed Parts
17:50 - Self-Supervised Learning for Language vs Vision
26:50 - Energy-Based Models
30:15 - Joint-Embedding Models
35:45 - Contrastive Methods
43:45 - Latent-Variable Predictive Models and GANs
55:00 - Summary & Conclusion

ERRATA:
- The difference between loss and energy: Energy is for inference, loss is for training (see the sketch after this list).
- The R(z) term is a regularizer that restricts the capacity of the latent variable. I think I said both of those things, but never together.
- The way I explain why BERT is contrastive is wrong. I haven't figured out why just yet, though :)
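To illustrate the first two points together, here is a hedged sketch of a latent-variable predictive model in which the energy is minimized over the latent z at inference time, R(z) limits the capacity of z, and the loss used to train the parameters is a separate quantity; the tiny decoder and the L1 form of R(z) are assumptions made only for illustration.

```python
# Sketch of "energy for inference, loss for training" with a regularized
# latent variable z. The decoder and the L1 form of R(z) are assumptions.
import torch
import torch.nn as nn

class Decoder(nn.Module):
    # Tiny decoder Dec(x, z): concatenate x and z, map to y-space.
    def __init__(self, x_dim=16, z_dim=8, y_dim=16):
        super().__init__()
        self.net = nn.Linear(x_dim + z_dim, y_dim)

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1))

def energy(x, y, z, dec, reg_weight=0.1):
    # E(x, y, z) = ||y - Dec(x, z)||^2 + R(z), with R(z) = lambda * ||z||_1
    # standing in for a regularizer that limits the capacity of z.
    return ((y - dec(x, z)) ** 2).sum() + reg_weight * z.abs().sum()

def infer_z(x, y, dec, steps=50, lr=0.1):
    # Inference: minimize the energy over the latent z, decoder held fixed.
    z = torch.zeros(8, requires_grad=True)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        energy(x, y, z, dec).backward()
        opt.step()
    return z.detach()

dec = Decoder()
x, y = torch.randn(16), torch.randn(16)
z_star = infer_z(x, y, dec)
dec.zero_grad()                        # clear grads accumulated during inference
loss = energy(x, y, z_star, dec)       # training *loss* for the decoder params
loss.backward()
```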

Video approved by Antonio.

Abstract:
We believe that self-supervised learning (SSL) is one of the most promising ways to build such background knowledge and approximate a form of common sense in AI systems.

Authors: Yann LeCun, Ishan Misra

Links:

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
Comments

ERRATA:
- The difference between loss and energy: Energy is for inference, loss is for training.
- The R(z) term is a regularizer that restricts the capacity of the latent variable. I think I said both of those things, but never together.
- The way I explain why BERT is contrastive is wrong. I haven't figured out why just yet, though :)

OUTLINE:
0:00 - Intro & Overview
1:15 - Supervised Learning, Self-Supervised Learning, and Common Sense
7:35 - Predicting Hidden Parts from Observed Parts
17:50 - Self-Supervised Learning for Language vs Vision
26:50 - Energy-Based Models
30:15 - Joint-Embedding Models
35:45 - Contrastive Methods
43:45 - Latent-Variable Predictive Models and GANs
55:00 - Summary & Conclusion

YannicKilcher

Every DNN with loss = some_distance(y, pred) is indeed an energy-based model, as you said. But not every energy-based model has the form loss = some_distance(y, pred) where pred = f(x) is an explicit part of the model. So by "energy-based model", Yann means a generalization of the traditional formulation, one that escapes the problem of multiple valid y for a single x. The blog post should make this distinction clearer.
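A hedged sketch of the distinction drawn above, with made-up toy networks: a predictive model can only give low loss to y near its single prediction f(x), while a joint energy E(x, y) scores the pair directly and can in principle assign low energy to several different y for the same x.

```python
# Toy contrast: loss = some_distance(y, f(x)) vs. a joint energy E(x, y).
# Both networks are arbitrary illustrations.
import torch
import torch.nn as nn

f = nn.Linear(4, 2)                       # explicit predictor x -> one y_hat

def predictive_loss(x, y):
    # Can only be small for y close to the single prediction f(x).
    return ((y - f(x)) ** 2).sum()

E = nn.Sequential(nn.Linear(6, 16), nn.ReLU(), nn.Linear(16, 1))

def joint_energy(x, y):
    # Scores the pair (x, y) directly; nothing forces one low-energy y per x,
    # so several compatible y's can all receive low energy.
    return E(torch.cat([x, y], dim=-1)).squeeze(-1)

x = torch.randn(4)
y1, y2 = torch.randn(2), torch.randn(2)
print(predictive_loss(x, y1), joint_energy(x, y1), joint_energy(x, y2))
```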

nyoseesoyn

Thanks to you, I can watch YouTube and keep up with the research at the same time.

baskaisimkalmamisti

Congrats!! Yann LeCun sent me to your video.

falconeagle

We love us some content that doesn't chase SOTA. Thank you as always, Yannic!

sheggle

13:00 Am I the only one who found that question mark really satisfying?

sehbanomer

"energy-based method", as defined in the paper, is *extensionally* the same as machine learning using a loss function, but is *intensionally* different. The problem is that LeCun didn't describe it rigorously using the language of symmetry, though in his subconscious (and in the subconscious of every physicist who reads the paper), the "energy function" is intended to be "energy function that has good symmetries". I will explain.

## Feynman's unworldliness equation

Consider, for example, Feynman's "unworldliness equation" U = 0, where U = f^2 + g^2 + ... and each of f, g, ... is the left-hand side of a scalar law of nature written as f = 0, g = 0, and so on. This equation is of course entirely correct, but it is trivial.
However, this does not make every equation trivial. Some equations really are more substantial than others. What is the substance? It is *symmetry*, or invariance under transformations.
When Maxwell wrote down the Maxwell equations, he used 20 scalar equations. In 4-vector notation, there are just 2 equations. Why such a great simplification? It is not the trivial kind of simplification as in U = 0, but a deep simplification -- all equations written in 4-vector notation are necessarily invariant under Lorentz transforms. Because the proper "home" of Maxwell equations is a universe that is invariant under Lorentz transforms, it's no wonder that they are more elegant when in 4-vector notations.
Conversely, when you notice how elegant the equations are in 4-vector form, you realize that the universe should probably be invariant under Lorentz transforms.
Modern theoretical physics is basically a game of inventing new transforms, then constructing equations invariant under those transforms, then publishing them.

> So the “beautifully simple” law in Eq. (25.32) is equivalent to the whole series of equations that you originally wrote down. It is therefore absolutely obvious that a simple notation that just hides the complexity in the definitions of symbols is not real simplicity. It is just a trick. The beauty that appears in Eq. (25.32)—just from the fact that several equations are hidden within it—is no more than a trick. When you unwrap the whole thing, you get back where you were before.
> However, there is more to the simplicity of the laws of electromagnetism written in the form of Eq. (25.29). It means more, just as a theory of vector analysis means more. The fact that the electromagnetic equations can be written in a very particular notation which was designed for the four-dimensional geometry of the Lorentz transformations—in other words, as a vector equation in the four-space—means that it is invariant under the Lorentz transformations. It is because the Maxwell equations are invariant under those transformations that they can be written in a beautiful form.

(Feynman Lectures on Physics, Vol. 2, Section 25-6)
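As a worked one-liner for the triviality claim (my paraphrase, not a quote from the lectures):

```latex
% Pack every scalar law f_i = 0 of nature into a single "law" U = 0:
U \;=\; \sum_i f_i^{\,2} \;=\; f^2 + g^2 + \cdots \;\ge\; 0,
\qquad\text{so}\qquad
U = 0 \;\iff\; f_i = 0 \ \text{for every } i .
% The single equation is exactly as strong as the original list; the notation
% hides the complexity without adding any symmetry or structure.
```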

## Energy-based methods, from the POV of
### an ML scientist

Extensionally, any machine learning problem defined using an energy function is equivalent to one defined using a loss function. And conversely, any ML problem defined by a loss function is equivalent to one defined by an energy function.

Intensionally, if you start with any loss function, and find its equivalent energy function, you would almost certainly get an energy function with no good symmetry at all.

Energy-based methods are a principled way to convert symmetries of the problem into good priors over your neural network. Instead of using arbitrary loss functions constructed ad hoc, or perhaps meta-learning a loss function, we impose a prior over the space of loss functions that respect the symmetries. Writing down an energy that respects the symmetries is just an efficient, implicit way to impose that prior.
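A hedged toy example of "symmetry as a prior on the energy": if the problem is permutation-invariant, writing the energy as a sum over elements makes every function in the hypothesis class respect that symmetry by construction. The particular layers below are arbitrary assumptions.

```python
# Toy example: build permutation invariance directly into the energy.
# The layers are arbitrary; only the sum-pooling matters for the symmetry.
import torch
import torch.nn as nn

phi = nn.Linear(3, 16)        # per-element feature map
rho = nn.Linear(16, 1)        # readout on the pooled features

def invariant_energy(x_set):
    # x_set: (n_elements, 3). Pooling with a sum before the readout makes
    # E(x) identical for every permutation of the rows, by construction.
    return rho(phi(x_set).sum(dim=0)).squeeze(-1)

x = torch.randn(5, 3)
perm = torch.randperm(5)
# The two values agree (up to floating-point rounding of the sum).
print(invariant_energy(x), invariant_energy(x[perm]))
```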

### a physicist

Energy-based methods provide a principled way to write down equations that are invariant under physically relevant symmetries, such as translations (R^n), rotations (SO(n)), reflections (O(n)), volume-preserving maps (SL(n)), and so on. They also allow us to use gauge theory for ML.

Not only that, it also allows one to enforce only local interactions, by writing the energy as a sum of local interactions (such as E = x1 x2 + x2 x3 + x3 x4 + ...), bringing statistical mechanics and renormalization techniques to the table.

Not only does this allow you to import the greatest hits of modern physics and make ML as abstract as string theory, it also imposes good priors. A ML model for physical processes should probably only consider models that are invariant under the symmetries of nature, such as translation, rotation, reflection, etc.
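And a minimal sketch of the "energy as a sum of local interactions" point above, using the textbook 1-D Ising form as a stand-in:

```python
# 1-D Ising-style energy as a sum of local (nearest-neighbour) interactions:
# E(x) = -J * sum_i x_i * x_{i+1} - h * sum_i x_i   (textbook form).
def ising_energy(spins, J=1.0, h=0.0):
    interaction = -J * sum(a * b for a, b in zip(spins, spins[1:]))
    field = -h * sum(spins)
    return interaction + field

print(ising_energy([+1, +1, +1, +1]))   # aligned chain: -3.0 (low energy)
print(ising_energy([+1, -1, +1, -1]))   # alternating chain: +3.0 (high energy)
```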

### a mathematician

Energy-based methods are the Erlangen program for high-dimensional probability. All hail Felix Klein, the felicitous king of symmetry.

### a linguist

Extensional and intensional definitions often diverge, and it is more important to discover the intension and make it explicit than to focus on the extension and quibble. For example, extensionally, an "activation function" is *any* function of type R^n → R, but that is only the extensional definition. When you actually say "activation function", you mean any function of this type that has been profitably used in a neural network.

CosmiaNebula

There is a difference between energy functions and objective functions.
In physics, an energy is a scalar potential; the force field derived from it has zero curl everywhere, i.e., it is conservative (which matters, because otherwise the line integral around a closed path could be nonzero, violating conservation of energy).
In ML, there are objectives whose associated gradient fields are not conservative. The best-known example is the GAN objective, where the two players' simultaneous gradient updates do not follow the gradient of any single scalar function.
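A standard worked example of that non-conservativeness is the bilinear min-max game (not tied to any particular GAN architecture):

```latex
% Bilinear min-max game: min_x max_y f(x, y) = x y.
% Simultaneous gradient descent/ascent follows the vector field
v(x, y) \;=\; \bigl(-\partial_x f,\; +\partial_y f\bigr) \;=\; (-y,\; x),
\qquad
\partial_x v_2 - \partial_y v_1 \;=\; 1 - (-1) \;=\; 2 \;\neq\; 0 ,
% so v has non-zero curl, is not the gradient of any scalar "energy",
% and the dynamics rotate around the equilibrium rather than descending.
```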

lucathiede

Didn’t like the video thumbnail at first sight but the content is king! Subscribed!

CalvinJKu

Regarding the object permanence / gravity thing: they did experiments with cats, raising them in environments that had vertical stripes all over everything, effectively denying them the opportunity to see horizontal lines. When the cats matured and were put into more conventional, natural environments, they had no concept of the danger of heights or falling, because they couldn't perceive the ledges they were approaching.

CharlesVanNoland

Awesome video as always! And I completely agree; I feel like they are kind of trying to impose their own terminology on already existing concepts. But it was still an interesting read, and even better to hear your point of view on it!

WhatsAI

14:05 Regarding whether the third kind of masking could be used for NLP: if the word embedding is good, you could probably mask out a subset of the embedding dimensions.
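A hedged sketch of what that could look like: mask a random subset of embedding dimensions and train a small head to reconstruct them. The sizes, masking rate, and objective are all assumptions for illustration.

```python
# Toy sketch: mask a random subset of word-embedding *dimensions* and
# reconstruct them (sizes, masking rate, and objective are all assumptions).
import torch
import torch.nn as nn

vocab, dim = 1000, 64
embed = nn.Embedding(vocab, dim)
reconstruct = nn.Linear(dim, dim)

tokens = torch.randint(0, vocab, (8, 12))     # a batch of token ids
e = embed(tokens)                              # (8, 12, 64)
dim_mask = (torch.rand(dim) < 0.3).float()     # hide ~30% of the dimensions
pred = reconstruct(e * (1.0 - dim_mask))       # predict from the visible dims
loss = ((pred - e.detach()) ** 2 * dim_mask).mean()   # loss on hidden dims only
loss.backward()
```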

QuadraticPerplexity

Looking forward to a review of "Barlow Twins: Self-Supervised Learning via Redundancy Reduction", also from Yann's group. A BYOL-like method, but without momentum updates!

shengyaozhuang

Very helpful video. I was able to fill in many gaps in the post.

membershipyuji

I think "energy based model" more precisely is supposed to refer to models that output unnormalized scores as opposed to (log-) probabilities. LeCun has said that he doesn't like approaches that are specifically designed to output valid probabilities or approximations of probabilities (i.e. normalizing flows, traditional VAEs) when arguably some other non-probability based approach would work better. But confusingly he also seems to lump even probability based models into the EBM category when he feels like it.

norabelrose

Good topic to tackle at this time. I will enjoy watching the video.

emransaleh

If you could only watch 4 Yannic videos, this would be one of them.

StochasticCockatoo

52:20 AFAIK the latent variable _z_ and the "embedding" are actually sort of the same thing. (The embedding is just a realization of that random variable, I would say.) The confusion probably comes from the fact that there are different distributions over _z_ involved: _p(z)_ and _q(z|x)_ – the latter is what the encoder outputs, including the reparametrization trick.
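For concreteness, a minimal sketch of q(z|x) and the reparametrization trick (the encoder sizes are arbitrary; this is not any specific VAE): the deterministic mean is what is often reused as the downstream "embedding", while z is a realization of the latent variable.

```python
# Minimal sketch of q(z|x) with the reparametrization trick
# (encoder sizes are arbitrary; this is not any specific VAE).
import torch
import torch.nn as nn

enc_mu = nn.Linear(32, 8)
enc_logvar = nn.Linear(32, 8)

def sample_z(x):
    mu, logvar = enc_mu(x), enc_logvar(x)
    eps = torch.randn_like(mu)                  # noise, independent of x
    z = mu + torch.exp(0.5 * logvar) * eps      # a sample from q(z|x)
    return z, mu

x = torch.randn(4, 32)
z, mu = sample_z(x)
# z is a realization of the latent variable; the deterministic mean mu is
# what is typically reused as the "embedding" for downstream tasks.
```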

ocifka

Thank you. Informative and nicely explained.

brendawilliams

Great video, please keep them coming!
I actually didn't know you were German until you mentioned the eierlegende Wollmilchsau (the "egg-laying wool-milk-sow", i.e., a do-everything solution) :D

bsdjns