Train Large, Then Compress

This video explains a new study on the best way to use a limited compute budget when training models for Natural Language Processing tasks. The authors show that large models reach a lower error faster than smaller models, and that stopping training early with a large model achieves better performance than training a smaller model for longer. These larger models come with an inference bottleneck: they take longer to make predictions and cost more to store. The authors alleviate this bottleneck by showing that the larger models are robust to compression techniques like quantization and pruning! Thanks for watching, please subscribe!
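Below is a minimal PyTorch sketch of the "train large, then compress" recipe summarized above. It is not the authors' code: the toy model, the 60% sparsity level, and the int8 dynamic-quantization settings are placeholder choices used purely for illustration.

```python
# A minimal sketch, assuming PyTorch is available. The "large" model here is a
# stand-in for a briefly trained wide network; the 60% sparsity and int8
# settings are illustrative, not values from the paper.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Placeholder for a large model that was trained briefly and stopped early.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
)

# Pruning: zero out the 60% of weights with the smallest magnitude in each
# Linear layer, then make the pruning mask permanent.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.6)
        prune.remove(module, "weight")

# Quantization: store and apply the Linear weights in int8 at inference time.
compressed = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 1024)
print(compressed(x).shape)  # torch.Size([1, 1024])
```

Dynamic quantization only converts the Linear layers' weights to int8 for inference, which targets exactly the storage and prediction-latency bottleneck the video describes.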

Paper Links:

Comments

1:25 Common Practice vs. Optimal
2:25 Faster Convergence with Large Models
3:20 Function of Parameter Count
4:08 Gradient Accumulation
5:28 MNLI and SST-2
6:38 Larger Models are not Harder to Finetune
7:28 Inference Bottleneck of Large Models
9:30 Compressing Larger Models
12:03 Influence of Dataset Size
13:05 Connection to The Lottery Ticket Hypothesis

connor-shorten

Many thanks for this video, Connor! As always, the key ideas are explained briefly and clearly. I'm more on the CV side, but I found a bunch of fresh things here to try in practice. Very interesting and exciting! Thank you!

aclexvideo

Thank you for sharing such an amazing paper!

mjchiu

Thank you for your videos; you are doing an amazing job!

MartinFerianc

It's really cool that you make these videos :)

joshsmit

Funnily enough, in computational fluid mechanics there is a similar concept called the multigrid method. Essentially, you first run your simulation on a coarse grid, let the node values lightly converge, then extrapolate those node values to initialize a finer grid, so the more costly finer grid can converge faster. Occasionally the finer-grid results are compressed down to fit the coarser grid and allowed to converge slightly before swapping back to the finer grid again.

keenheat
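For readers curious about the coarse-to-fine idea in the comment above, here is a toy NumPy sketch of that initialization trick (not a full multigrid solver). The 1D Poisson problem, grid sizes, and sweep counts are arbitrary choices for illustration only.

```python
# A toy sketch of coarse-to-fine initialization, assuming NumPy. It solves
# -u'' = sin(pi * x) with zero boundary values; the grid sizes and sweep
# counts are arbitrary.
import numpy as np

def jacobi(u, f, h, sweeps):
    """Plain Jacobi sweeps for -u'' = f on a uniform grid with spacing h."""
    u = u.copy()  # do not modify the caller's array
    for _ in range(sweeps):
        u[1:-1] = 0.5 * (u[:-2] + u[2:] + h ** 2 * f[1:-1])
    return u

# Coarse grid: let the solution converge only lightly (cheap iterations).
n_coarse = 33
x_c = np.linspace(0.0, 1.0, n_coarse)
f_c = np.sin(np.pi * x_c)
u_c = jacobi(np.zeros(n_coarse), f_c, x_c[1] - x_c[0], sweeps=50)

# Fine grid: interpolate the coarse result as the initial guess, then refine.
n_fine = 2 * (n_coarse - 1) + 1
x_f = np.linspace(0.0, 1.0, n_fine)
f_f = np.sin(np.pi * x_f)
u_f = np.interp(x_f, x_c, u_c)  # coarse-to-fine initialization
u_f = jacobi(u_f, f_f, x_f[1] - x_f[0], sweeps=200)

# Compare against the analytic solution sin(pi * x) / pi^2.
exact = np.sin(np.pi * x_f) / np.pi ** 2
print("max error on fine grid:", np.abs(u_f - exact).max())
```

The coarse sweeps are cheap, so the expensive fine grid starts from a much better initial guess than zero.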

Very good overview. In deep learning it's a bit of a luxury to have a training set so large that you cannot even fit it with a billion-parameter model. So this technique wouldn't apply to my non-luxurious, smallish training setup, but it's a cool idea nevertheless.

citiblocsMaster

I wonder how this compares to binarized models.

snippletrap