Train Large, Then Compress

This video explains a new study on the best way to use a limited compute budget when training models for Natural Language Processing tasks. The authors show that large models reach a lower error faster than smaller models, and that stopping training early with a large model achieves better performance than training a smaller model for longer. These larger models come with an inference bottleneck: they take longer to make predictions and cost more to store. The authors alleviate this bottleneck by showing that the larger models are robust to compression techniques like quantization and pruning! Thanks for watching, please subscribe!
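Below is a minimal PyTorch sketch of the "train large, then compress" recipe summarized above. It is not the authors' code: the toy model, the 60% sparsity level, and the int8 dynamic-quantization settings are placeholder choices used purely for illustration.

```python
# A minimal sketch, assuming PyTorch is available. The "large" model here is a
# stand-in for a briefly trained wide network; the 60% sparsity and int8
# settings are illustrative, not values from the paper.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Placeholder for a large model that was trained briefly and stopped early.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
)

# Pruning: zero out the 60% of weights with the smallest magnitude in each
# Linear layer, then make the pruning mask permanent.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.6)
        prune.remove(module, "weight")

# Quantization: store and apply the Linear weights in int8 at inference time.
compressed = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 1024)
print(compressed(x).shape)  # torch.Size([1, 1024])
```

Dynamic quantization only converts the Linear layers' weights to int8 for inference, which targets exactly the storage and prediction-latency bottleneck the video describes.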

Paper Links:

Comments

1:25 Common Practice vs. Optimal
2:25 Faster Convergence with Large Models
3:20 Function of Parameter Count
4:08 Gradient Accumulation
5:28 MNLI and SST-2
6:38 Larger Models are not Harder to Finetune
7:28 Inference Bottleneck of Large Models
9:30 Compressing Larger Models
12:03 Influence of Dataset Size
13:05 Connection to The Lottery Ticket Hypothesis

connor-shorten

Many thanks for this video, Connor! As always, the key ideas are explained briefly and clearly. I'm more on the CV side, but I found a bunch of fresh things here to try in practice. Very interesting and exciting! Thank you!

aclexvideo

Thank you for sharing such an amazing paper!

mjchiu

Thank you for your videos; you are doing an amazing job!

MartinFerianc

It's really cool that you make these videos :)

joshsmit

Funnily enough, in computational fluid mechanics there is a similar concept called the multigrid method. Essentially, you first run your simulation on a coarse grid, let the node values lightly converge, then extrapolate those node values to initialize a finer grid, so the more costly finer grid can converge faster. Occasionally the finer-grid results are compressed down to fit the coarser grid and allowed to converge slightly before swapping back to the finer grid again.

keenheat
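For readers curious about the coarse-to-fine idea in the comment above, here is a toy NumPy sketch of that initialization trick (not a full multigrid solver). The 1D Poisson problem, grid sizes, and sweep counts are arbitrary choices for illustration only.

```python
# A toy sketch of coarse-to-fine initialization, assuming NumPy. It solves
# -u'' = sin(pi * x) with zero boundary values; the grid sizes and sweep
# counts are arbitrary.
import numpy as np

def jacobi(u, f, h, sweeps):
    """Plain Jacobi sweeps for -u'' = f on a uniform grid with spacing h."""
    u = u.copy()  # do not modify the caller's array
    for _ in range(sweeps):
        u[1:-1] = 0.5 * (u[:-2] + u[2:] + h ** 2 * f[1:-1])
    return u

# Coarse grid: let the solution converge only lightly (cheap iterations).
n_coarse = 33
x_c = np.linspace(0.0, 1.0, n_coarse)
f_c = np.sin(np.pi * x_c)
u_c = jacobi(np.zeros(n_coarse), f_c, x_c[1] - x_c[0], sweeps=50)

# Fine grid: interpolate the coarse result as the initial guess, then refine.
n_fine = 2 * (n_coarse - 1) + 1
x_f = np.linspace(0.0, 1.0, n_fine)
f_f = np.sin(np.pi * x_f)
u_f = np.interp(x_f, x_c, u_c)  # coarse-to-fine initialization
u_f = jacobi(u_f, f_f, x_f[1] - x_f[0], sweeps=200)

# Compare against the analytic solution sin(pi * x) / pi^2.
exact = np.sin(np.pi * x_f) / np.pi ** 2
print("max error on fine grid:", np.abs(u_f - exact).max())
```

The coarse sweeps are cheap, so the expensive fine grid starts from a much better initial guess than zero.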

Very good overview. In deep learning it's a bit of a luxury to have a training set so large that you cannot even fit it with a billion-parameter model. So this technique wouldn't apply to my non-luxurious, smallish training setup, but it's a cool idea nevertheless.

citiblocsMaster

I wonder how this compares to binarized models.

snippletrap