GaLore EXPLAINED: Memory-Efficient LLM Training by Gradient Low-Rank Projection

We explain GaLore, a new memory-efficient training technique that outperforms LoRA in accuracy and supports both pre-training and fine-tuning. Now you can train LLMs without running out of GPU memory! You can even pre-train a LLaMA-7B from scratch on a single 24 GB GPU (an NVIDIA RTX 4090, for example).

Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏
Dres. Trost GbR, Siltax, Vignesh Valliappan, Michael, Sunny Dhiana, Andy Ma

Outline:
00:00 Parameter-efficient Training
01:05 What is eating up GPU memory & LoRA recap
03:17 GaLore key idea
04:32 GaLore explained
08:43 Memory savings
09:38 Accuracy losses
10:23 Optimal T

▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
Join this channel to get access to perks:
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research

Video editing: Nils Trost

Music 🎵 : Bella Bella Beat - Nana Kwabena
Comments

I always thought you already had a Ph.D. until you mentioned your defense. Congrats on the 'high rank' decoration!

GeoffY

I never thought I'd say this, but I'm actually excited to learn about efficient training methods for deep learning models on consumer GPUs. Who knew running out of GPU memory could be so... enlightening?

Thanks for explaining LoRA and GaLore in a way that doesn't make my brain hurt (too much). Now, if you'll excuse me, I have some large language models to train, or at least try not to run out of GPU memory while doing so.

MechanicumMinds

Holy frick, what a perfectly concise video on GaLore!

There is a GitHub implementation of the research from the paper; they are currently working on a multi-GPU implementation. I too am curious how well things scale up to modern and larger LLMs, and I have a multi-GPU rig I want to test it out on.

AaronALAI
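
For reference, the repo mentioned above ships a drop-in optimizer. Here is a minimal usage sketch based on the galore-torch package's README at the time of writing; argument names may differ across versions, and the tiny model is purely illustrative:

```python
import torch.nn as nn
from galore_torch import GaLoreAdamW  # pip install galore-torch

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# GaLore is applied to the 2-D weight matrices; biases (and, in a real
# model, embeddings and norms) get plain AdamW treatment.
galore_params  = [p for p in model.parameters() if p.dim() == 2]
regular_params = [p for p in model.parameters() if p.dim() != 2]

optimizer = GaLoreAdamW(
    [
        {"params": regular_params},
        {"params": galore_params,
         "rank": 128,             # r: size of the projected gradient
         "update_proj_gap": 200,  # T: steps between SVD re-projections
         "scale": 0.25,           # extra LR scale for GaLore params
         "proj_type": "std"},
    ],
    lr=0.01,
)
```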

I love this paper. The solution is elegant.

HomunMage

I imagine they would do the SVD on the CPU, as it would take way too much RAM to do it on the GPU, or maybe they do it layer by layer and discard the intermediate buffers. Anyway, it seems like a great idea. ReLoRA achieved something similar, but it required a small amount of full pre-training first, and I'd expect slower convergence for ReLoRA: each time a new LoRA is initialized, the weights need to be moved around a lot, while with SVD they start roughly where they need to be, minimizing the error in the main matrix.

eruiluvatar
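
A minimal sketch of the layer-by-layer option speculated about above: the projector only ever needs the SVD of one weight-matrix-sized gradient at a time, not anything model-sized. Names and shapes are illustrative, not the authors' exact implementation:

```python
import torch

def compute_projector(grad: torch.Tensor, rank: int) -> torch.Tensor:
    """Return P (m x r): the top-r left singular vectors of the gradient."""
    U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
    return U[:, :rank]  # only recomputed every T steps

grad = torch.randn(4096, 4096)          # one layer's gradient (LLaMA-7B-ish)
P = compute_projector(grad, rank=128)   # 4096 x 128 projector
R = P.T @ grad                          # projected gradient: 128 x 4096
```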

Congrats on your PhD!

Could you use 'r' as a scheduled hyperparameter during pretraining as well? E.g., start pretraining with a low r and gradually increase it as more precision is needed? I don't think it would do that much, though, since gains are already very high at the start.

azertyQ

May I ask how you edit these videos? I'm inspired to do some YouTube videos myself and would love to follow some of the steps you've taken.

bhavulgauri

One important thing to note is that while this technique is more memory-efficient, it's also 17% slower in the setup the authors use. That's a pretty big deal, especially for pretraining.

TheRyulord

It is now official. You have a big brain and have papers to prove it.

Ben_D.

I always liked Galor(e), though I might be biased.

amitgalor

As far as I get it, I determine the gradients G and also a low-rank projection P. That projection lets me "shrink" the gradient matrix G I calculate at every step down to R before applying it to W. So I don't save anything while calculating the gradients, only while applying and storing them (as momentum and such)?

IxCIHAoX
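
That reading matches the video: the savings are in memory, not compute. A toy sketch of one GaLore-style step for a single m x n weight matrix, with simple momentum standing in for Adam's full state; illustrative code, not the official implementation:

```python
import torch

m, n, r, lr, beta = 4096, 4096, 128, 1e-3, 0.9
W = torch.randn(m, n)
P = torch.randn(m, r)        # projector from the periodic SVD of G
M = torch.zeros(r, n)        # momentum state: r x n instead of m x n

def galore_step(W, G, P, M):
    R = P.T @ G              # project the full gradient: m x n -> r x n
    M.mul_(beta).add_(R)     # stateful optimizer update in low-rank space
    W -= lr * (P @ M)        # project back up and apply to the full weights
    return W, M

G = torch.randn(m, n)        # full-rank gradient still comes from backprop
W, M = galore_step(W, G, P, M)
```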

Do I understand correctly that it can't work with quantization, and the model must fit in memory in 16-bit?

vladimirnadvornik

Could the information shed during the lossy compression be set aside outside of memory for later retrieval? Thinking out loud; not always helpful. But state mapping would be interesting both during the training process and after.

timothywcrane

The size of a LoRA (on disk) is a fraction of the size of the model it's applied to. GaLore loses this benefit by updating the weights directly.
If you have use for many fine-tunes of a single model, two GaLore fine-tunes will take more space than ten LoRAs (depending on rank, to be fair).
I assume they don't mention this very significant tradeoff, since you don't mention it in the video. That seems like a dishonest comparison, if that's the case.

ArnaldurBjarnason
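
A back-of-envelope check of the storage point above, assuming LLaMA-7B-like shapes (32 layers, hidden size 4096), bf16 storage, and rank-16 LoRA adapters on the q/v projections; all numbers are illustrative:

```python
full_ckpt_gb = 7e9 * 2 / 1e9                # one full checkpoint: ~14 GB
lora_params  = 32 * 2 * 16 * (4096 + 4096)  # r*(d_in+d_out) per adapted matrix
lora_mb      = lora_params * 2 / 1e6        # one adapter: ~17 MB
print(f"{full_ckpt_gb:.0f} GB vs {lora_mb:.0f} MB")
# two GaLore fine-tunes (~28 GB) dwarf ten LoRAs (~0.17 GB)
```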

I ran into trouble trying to fine-tune a Swin model with LoRA; that type of model isn't supported yet for LoRA. I wonder if it'll be the same for GaLore.

proterotype

One of the selling points of LoRA is being able to mix and match the A and B matrices from different fine-tuning runs, without having to keep the full model weights yourself if they are available elsewhere. Here it seems you have to save the entire model, so this is a big tradeoff compared to LoRA and its derivatives.

vladimirtchuiev

😘 LOL! Stay for the EPIC lipstick! M4D L0V3!

SUD

The authors say LoRA assumes low-rank weight updates, which is a bad idea since weight updates are not always low rank, and that low-rank gradients are a better alternative.
My question: isn't the only difference between a weight-update matrix and a gradient matrix the multiplication by the learning rate, i.e. weight update = learning rate * gradient?
So how come weight-update matrices are not always low rank, but gradient matrices are?
PS: congratulations on your defense :)

koiRitwikHai
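
On that last question, a sketch of the distinction as I read the paper's argument (a paraphrase, not a quote): with a stateful optimizer the weight update is not just the gradient times a learning rate, and even when it is, a sum of rotating low-rank matrices need not stay low rank:

```latex
% With Adam, the per-step update is a nonlinear function of the whole
% gradient history, not simply -\eta G_t:
\Delta W_t = -\eta\,\frac{\hat m_t}{\sqrt{\hat v_t}+\epsilon},
\qquad \hat m_t,\hat v_t \text{ accumulated from } G_1,\dots,G_t .
% And even under plain SGD, the cumulative update
W_T - W_0 = -\eta \sum_{t=1}^{T} G_t
% can reach high rank when the subspaces of the individually low-rank
% G_t rotate over training. LoRA pins the total update to one fixed
% rank-r factorization BA; GaLore only assumes each G_t is roughly
% low rank and refreshes its projector P every T steps.
```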