Beyond neural scaling laws – Paper Explained

„Beyond neural scaling laws: beating power law scaling via data pruning” paper explained with animations. You do not need to train your neural network on the entire dataset!

ERRATUM: See the pinned comment for which easy/hard examples are chosen.

Outline:
00:00 Stable Diffusion is a Latent Diffusion Model
01:43 NVIDIA (sponsor): Register for the GTC!
03:00 What are neural scaling laws? Power laws explained.
05:15 Exponential scaling in theory
07:40 What the theory predicts
09:50 Unsupervised data pruning with foundation models

Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏
Don Rosenthal, Dres. Trost GbR, Julián Salazar, Edvard Grødem, Vignesh Valliappan, Mutual Information, Mike Ton

▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

🔗 Links:

#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research

Video editing: Nils Trost
Comments

Great summary of the paper, thank you!

I've dived a bit deeper into it and I think the explanation of the theoretical setup in the video does not fully match the one in the paper.
What I got from the video:
1. We have a labeled (infinite) dataset
2. The teacher perceptron learns to label the training data
3. The student also learns on the training data but only for a few epochs
4. The margin is the difference between the point's distance to the teacher boundary and its distance to the student boundary

What I got from the paper:
1. We get (infinite) data points from a normal distribution
2. We initialize the teacher perceptron with a random weight vector and use it to label the data (i.e. the teacher is only used to generate synthetic labels)
3. The student learns from the labeled data
4. The margin is the distance from the point to the student boundary (the teacher is not involved here)

The results in Fig.1 assume the student is perfectly aligned with the teacher (i.e. the margin perfectly reflects the distance to the real class boundary), while in Fig.2 the authors show the effect of having a misaligned student.
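For concreteness, here is a minimal sketch of how I picture the paper's setup (the dimension, sample count, epoch count and pruning fraction are placeholders I made up):

import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 10_000                      # input dimension, number of examples (made-up sizes)

# 1. (Infinite) data points from a normal distribution
X = rng.standard_normal((n, d))

# 2. A teacher perceptron with a random weight vector only generates the labels
w_teacher = rng.standard_normal(d)
y = np.sign(X @ w_teacher)

# 3. The student learns from the labeled data (plain perceptron updates, a few epochs)
w_student = np.zeros(d)
for _ in range(5):
    for x_i, y_i in zip(X, y):
        if y_i * (x_i @ w_student) <= 0:
            w_student += y_i * x_i

# 4. The margin is the distance of a point to the *student* boundary
margins = np.abs(X @ w_student) / np.linalg.norm(w_student)

# Pruning by hardness then means keeping the small-margin examples,
# e.g. the hardest 20% (placeholder fraction)
hardest = np.argsort(margins)[: int(0.2 * n)]
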
Let me know your thoughts on this :)

amenezes

So awesome that you have NVIDIA as a sponsor xD

Neptutron

4:39 mann the diminishing returns be hitting real hard today💀

frommarkham

3:22 thanks for the knowledge🙏we gonna make it out the data center with this tutorial🗣🗣🗣🗣

frommarkham

The comparison of pruning strategies was very helpful to me. Thank you for summarizing the paper, and best wishes at the conference.

WilliamDye-willdye

These results feel intuitive and match what I've seen in practice. The math is nuts, though. :)

Erosis

I was going to read this paper, thanks for the nice explanation!

thipoktham

Super interesting. I thought they were going to map the entropy of the dataset, which is kind of what they imply: easy vs. hard is equivalent to novel vs. non-novel data in the data distribution.

joecincotta

Very similar to the idea of active learning.

lighterswang

11:40 So in the experiment, the authors selected the top 80% most difficult/hard examples from the clusters and did not include the bottom 20% of easy examples during training, because the initial dataset (ImageNet) is fairly large. Is my understanding correct?
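In code, I imagine the selection working roughly like the sketch below (the embedding step, cluster count, and keep fraction are my own guesses, not necessarily the paper's exact recipe):

import numpy as np
from sklearn.cluster import KMeans

def keep_hardest(embeddings, keep_frac=0.8, n_clusters=100):
    """Keep the hardest `keep_frac` of examples, where hardness is the distance
    of an embedding to its nearest k-means centroid (its cluster prototype)."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings)
    dist_to_prototype = np.linalg.norm(embeddings - km.cluster_centers_[km.labels_], axis=1)
    order = np.argsort(dist_to_prototype)      # easy (close to prototype) first
    n_keep = int(keep_frac * len(embeddings))
    return order[-n_keep:]                     # indices of the hardest examples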

Thanks for explaining.

mandarjoshi

Super nice explanation and reasoning! Thanks for the insight.

cipritom

Could this be useful for data augmentation?

For example: assuming I start with a certain size dataset and don't need to prune any examples, could/should I make more augmented copies of the more informative samples? Could I also test to see what kinds of augmentations are more or less useful?
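Something like this purely hypothetical sketch is what I have in mind: sample the harder examples more often, so a random-augmentation pipeline produces more augmented copies of them (the hardness scores and sampler choice are my own assumptions, not from the paper):

import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

def sampler_from_hardness(hardness: np.ndarray) -> WeightedRandomSampler:
    # Harder examples get proportionally larger sampling weights, so the training loop
    # ends up seeing more (randomly augmented) copies of the informative samples.
    weights = torch.as_tensor(hardness / hardness.sum(), dtype=torch.double)
    return WeightedRandomSampler(weights, num_samples=len(hardness), replacement=True)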

frenchmarty

Awesome video, love your explanations!

RfMac

Thanks for the great introduction to this topic

TheNettforce

Great content, very accessible. Thank you!

ScriptureFirst

The screenshot of the mathematics made me chuckle... in horror. Thanks, Letitia, for an excellent video!

flamboyanta

How does this compare to fine-tuning the same model on less data? How much data would be needed?

averma

How do you make those animations, like in the "Exponential scaling in theory" part? Which software do you use? I would really appreciate it if you could tell me :)

sonataarcfan

Isn't this just hard sample mining?

poketopa

Shouldn't you use density-based clustering?

Quaquaquaqua