Scaling laws for large language models

The lecture presents the idea of scaling laws, which describe the relationship between model size (number of parameters), training dataset size (number of tokens), and the amount of compute available for training. At the end, I also introduce one of the weirdest phenomena in language model training: grokking.
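As a rough illustration of this kind of trade-off, here is a minimal Python sketch based on the Chinchilla-style rules of thumb (Hoffmann et al., 2022): training compute is roughly C ≈ 6·N·D FLOPs, and the compute-optimal token count is roughly D ≈ 20·N. The function name and the exact 20-tokens-per-parameter ratio are illustrative assumptions, not values taken from the lecture.

```python
def compute_optimal_allocation(flops_budget: float,
                               tokens_per_param: float = 20.0):
    """Split a FLOPs budget C between parameters N and tokens D.

    Assumes C = 6 * N * D (a common approximation for transformer
    training compute) and the Chinchilla-style heuristic
    D = tokens_per_param * N, which gives
    N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # Example: a 1e21 FLOPs budget suggests roughly a 2.9B-parameter
    # model trained on roughly 58B tokens under these assumptions.
    n, d = compute_optimal_allocation(1e21)
    print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```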