Beyond neural scaling laws – Paper Explained

„Beyond neural scaling laws: beating power law scaling via data pruning” paper explained with animations. You do not need to train your neural network on the entire dataset!

ERRATUM: See the pinned comment for which easy/hard examples are chosen.

Outline:
00:00 Stable Diffusion is a Latent Diffusion Model
01:43 NVIDIA (sponsor): Register for the GTC!
03:00 What are neural scaling laws? Power laws explained.
05:15 Exponential scaling in theory
07:40 What the theory predicts
09:50 Unsupervised data pruning with foundation models

Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏
Don Rosenthal, Dres. Trost GbR, Julián Salazar, Edvard Grødem, Vignesh Valliappan, Mutual Information, Mike Ton

▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

🔗 Links:

#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research

Video editing: Nils Trost
Comments

Great summary of the paper, thank you!

I've dived a bit deeper into it and I think the explanation of the theoretical setup in the video does not fully match the one in the paper.
What I got from the video:
1. We have a labeled (infinite) dataset
2. The teacher perceptron learns to label the training data
3. The student also learns on the training data but only for a few epochs
4. The margin is the difference between the point's distance to the teacher boundary and its distance to the student boundary

What I got from the paper:
1. We get (infinite) data points from a normal distribution
2. We initialize the teacher perceptron with a random weight vector and use it to label the data (i.e. the teacher is only used to generate synthetic labels)
3. The student learns from the labeled data
4. The margin is the distance from the point to the student boundary (the teacher is not involved here)

The results in Fig.1 assume the student is perfectly aligned with the teacher (i.e. the margin perfectly reflects the distance to the real class boundary), while in Fig.2 the authors show the effect of having a misaligned student.
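For concreteness, here is a minimal sketch of how I picture the paper's setup (the dimension, sample count, epoch count and pruning fraction are placeholders I made up):

import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 10_000                      # input dimension, number of examples (made-up sizes)

# 1. (Infinite) data points from a normal distribution
X = rng.standard_normal((n, d))

# 2. A teacher perceptron with a random weight vector only generates the labels
w_teacher = rng.standard_normal(d)
y = np.sign(X @ w_teacher)

# 3. The student learns from the labeled data (plain perceptron updates, a few epochs)
w_student = np.zeros(d)
for _ in range(5):
    for x_i, y_i in zip(X, y):
        if y_i * (x_i @ w_student) <= 0:
            w_student += y_i * x_i

# 4. The margin is the distance of a point to the *student* boundary
margins = np.abs(X @ w_student) / np.linalg.norm(w_student)

# Pruning by hardness then means keeping the small-margin examples,
# e.g. the hardest 20% (placeholder fraction)
hardest = np.argsort(margins)[: int(0.2 * n)]
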
Let me know your thoughts on this :)

amenezes

So awesome that you have NVIDIA as a sponsor xD

Neptutron

4:39 mann the diminishing returns be hitting real hard today💀

frommarkham

3:22 thanks for the knowledge🙏we gonna make it out the data center with this tutorial🗣🗣🗣🗣

frommarkham

The comparison of pruning strategies was very helpful to me. Thank you for summarizing the paper, and best wishes at the conference.

WilliamDye-willdye

These results feel intuitive and match what I've seen in practice. The math is nuts, though. :)

Erosis

I was going to read this paper, thanks for the nice explanation!

thipoktham

Super interesting. I thought they were going to map the entropy of the dataset, which is kind of what they imply: easy vs. hard is equivalent to novel vs. non-novel data in the data distribution.

joecincotta

Very similar to the idea of active learning.

lighterswang

11:40 So in the experiment, the authors selected the top 80% most difficult/hard examples from the clusters and did not include the bottom 20% of easy examples during training, because the initial dataset (ImageNet) is fairly large. Is my understanding correct?
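In code, I imagine the selection working roughly like the sketch below (the embedding step, cluster count, and keep fraction are my own guesses, not necessarily the paper's exact recipe):

import numpy as np
from sklearn.cluster import KMeans

def keep_hardest(embeddings, keep_frac=0.8, n_clusters=100):
    """Keep the hardest `keep_frac` of examples, where hardness is the distance
    of an embedding to its nearest k-means centroid (its cluster prototype)."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings)
    dist_to_prototype = np.linalg.norm(embeddings - km.cluster_centers_[km.labels_], axis=1)
    order = np.argsort(dist_to_prototype)      # easy (close to prototype) first
    n_keep = int(keep_frac * len(embeddings))
    return order[-n_keep:]                     # indices of the hardest examples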

Thanks for explaining.

mandarjoshi

Super nice explanation and reasoning! Thanks for the insight.

cipritom

Could this be useful for data augmentation?

For example: assuming I start with a certain size dataset and don't need to prune any examples, could/should I make more augmented copies of the more informative samples? Could I also test to see what kinds of augmentations are more or less useful?
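Something like this purely hypothetical sketch is what I have in mind: sample the harder examples more often, so a random-augmentation pipeline produces more augmented copies of them (the hardness scores and sampler choice are my own assumptions, not from the paper):

import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

def sampler_from_hardness(hardness: np.ndarray) -> WeightedRandomSampler:
    # Harder examples get proportionally larger sampling weights, so the training loop
    # ends up seeing more (randomly augmented) copies of the informative samples.
    weights = torch.as_tensor(hardness / hardness.sum(), dtype=torch.double)
    return WeightedRandomSampler(weights, num_samples=len(hardness), replacement=True)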

frenchmarty

Awesome video, love your explanations!

RfMac

Thanks for the great introduction to this topic

TheNettforce

Great content, very accessible. Thank you!

ScriptureFirst

The screenshot of the mathematics made me chuckle... in horror. Thanks, Letitia, for an excellent video!

flamboyanta

How does this compare to fine-tuning the same model on less data? How much data would be needed?

averma

How do you make those animations, like in the "Exponential scaling in theory" part? Which software do you use? I would really appreciate it if you could tell me :)

sonataarcfan

Isn't this just hard sample mining?

poketopa

Shouldn't you use density-based clustering?

Quaquaquaqua