GaLore EXPLAINED: Memory-Efficient LLM Training by Gradient Low-Rank Projection

We explain GaLore, a new memory-efficient training technique that outperforms LoRA in accuracy and supports both pre-training and fine-tuning. Now you can train LLMs without running out of GPU memory! You can even pre-train a LLaMA-7B from scratch on a single 24 GB GPU (an NVIDIA RTX 4090, for example).

Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏
Dres. Trost GbR, Siltax, Vignesh Valliappan, Michael, Sunny Dhiana, Andy Ma

Outline:
00:00 Parameter-efficient Training
01:05 What is eating up GPU memory & LoRA recap
03:17 GaLore key idea
04:32 GaLore explained
08:43 Memory savings
09:38 Accuracy losses
10:23 Optimal T

▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
Join this channel to get access to perks:
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research

Video editing: Nils Trost

Music 🎵 : Bella Bella Beat - Nana Kwabena
Comments

I always thought you already had a Ph.D. until you mentioned your defense. Congrats on the 'high rank' decoration!

GeoffY

I never thought I'd say this, but I'm actually excited to learn about efficient training methods for deep learning models on consumer GPUs. Who knew running out of GPU memory could be so... enlightening?

Thanks for explaining LoRA and GaLore in a way that doesn't make my brain hurt (too much). Now, if you'll excuse me, I have some large language models to train, or at least try not to run out of GPU memory while doing so.

MechanicumMinds

Holy frick, what a perfectly concise video on GaLore!

There is a GitHub implementation of the research from the paper; they are currently working on a multi-GPU implementation. I too am curious how well things scale up to modern and larger LLMs, and I have a multi-GPU rig I want to test it out on.

AaronALAI
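
For reference, the repo mentioned above ships a drop-in optimizer. Here is a minimal usage sketch based on the galore-torch package's README at the time of writing; argument names may differ across versions, and the tiny model is purely illustrative:

```python
import torch.nn as nn
from galore_torch import GaLoreAdamW  # pip install galore-torch

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# GaLore is applied to the 2-D weight matrices; biases (and, in a real
# model, embeddings and norms) get plain AdamW treatment.
galore_params  = [p for p in model.parameters() if p.dim() == 2]
regular_params = [p for p in model.parameters() if p.dim() != 2]

optimizer = GaLoreAdamW(
    [
        {"params": regular_params},
        {"params": galore_params,
         "rank": 128,             # r: size of the projected gradient
         "update_proj_gap": 200,  # T: steps between SVD re-projections
         "scale": 0.25,           # extra LR scale for GaLore params
         "proj_type": "std"},
    ],
    lr=0.01,
)
```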

I love this paper. The solution is elegant.

HomunMage

I imagine they would do the SVD on the CPU, as it would take way too much RAM to do it on the GPU, or maybe they do it layer by layer and discard the intermediate buffers. Anyway, it seems like a great idea. ReLoRA achieved something similar, but it required a small amount of full pre-training first, and I'd expect slower convergence for ReLoRA: each time a new LoRA is initialized, the weights need to be moved around a lot, while with SVD they start roughly where they need to be, minimizing the error in the main matrix.

eruiluvatar
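
A minimal sketch of the layer-by-layer option speculated about above: the projector only ever needs the SVD of one weight-matrix-sized gradient at a time, not anything model-sized. Names and shapes are illustrative, not the authors' exact implementation:

```python
import torch

def compute_projector(grad: torch.Tensor, rank: int) -> torch.Tensor:
    """Return P (m x r): the top-r left singular vectors of the gradient."""
    U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
    return U[:, :rank]  # only recomputed every T steps

grad = torch.randn(4096, 4096)          # one layer's gradient (LLaMA-7B-ish)
P = compute_projector(grad, rank=128)   # 4096 x 128 projector
R = P.T @ grad                          # projected gradient: 128 x 4096
```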

Congrats on your PhD!

Could you use 'r' as a scheduled hyperparameter during pretraining as well? E.g., start pretraining with a low r and gradually increase it as more precision is needed? I don't think it would do that much, though, since gains are already very high at the start.

azertyQ

May I ask how you edit these videos? I'm inspired to do some YouTube videos myself and would love to follow some of the steps you've taken.

bhavulgauri

One important thing to note is that while this technique is more memory-efficient, it's also 17% slower in the setup the authors use. That's a pretty big deal, especially for pretraining.

TheRyulord

It is now official. You have a big brain and have papers to prove it.

Ben_D.

I always liked Galor(e), though I might be biased.

amitgalor

As far as I get it, I determine the gradients G and also a low-rank projection P. That projection lets me "shrink" the gradient matrix G I calculate at every step down to R before applying it to W. So I don't save anything while calculating the gradients, only while applying and storing them (as momentum and such)?

IxCIHAoX
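
That reading matches the video: the savings are in memory, not compute. A toy sketch of one GaLore-style step for a single m x n weight matrix, with simple momentum standing in for Adam's full state; illustrative code, not the official implementation:

```python
import torch

m, n, r, lr, beta = 4096, 4096, 128, 1e-3, 0.9
W = torch.randn(m, n)
P = torch.randn(m, r)        # projector from the periodic SVD of G
M = torch.zeros(r, n)        # momentum state: r x n instead of m x n

def galore_step(W, G, P, M):
    R = P.T @ G              # project the full gradient: m x n -> r x n
    M.mul_(beta).add_(R)     # stateful optimizer update in low-rank space
    W -= lr * (P @ M)        # project back up and apply to the full weights
    return W, M

G = torch.randn(m, n)        # full-rank gradient still comes from backprop
W, M = galore_step(W, G, P, M)
```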

Do I understand correctly that it can't work with quantization, and the model must fit in memory in 16-bit?

vladimirnadvornik

Could the information shed during the lossy compression be set aside outside of memory for later retrieval? Thinking out loud; not always helpful. But state mapping would be interesting both during the training process and after.

timothywcrane

The size of a LoRA (on disk) is a fraction of the size of the model it's applied to. GaLore loses this benefit by updating the weights directly.
If you have use for many fine-tunes of a single model, two GaLore fine-tunes will take more space than ten LoRAs (depending on rank, to be fair).
I assume they don't mention this very significant tradeoff, since you don't mention it in the video. That seems like a dishonest comparison, if that's the case.

ArnaldurBjarnason
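
A back-of-envelope check of the storage point above, assuming LLaMA-7B-like shapes (32 layers, hidden size 4096), bf16 storage, and rank-16 LoRA adapters on the q/v projections; all numbers are illustrative:

```python
full_ckpt_gb = 7e9 * 2 / 1e9                # one full checkpoint: ~14 GB
lora_params  = 32 * 2 * 16 * (4096 + 4096)  # r*(d_in+d_out) per adapted matrix
lora_mb      = lora_params * 2 / 1e6        # one adapter: ~17 MB
print(f"{full_ckpt_gb:.0f} GB vs {lora_mb:.0f} MB")
# two GaLore fine-tunes (~28 GB) dwarf ten LoRAs (~0.17 GB)
```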

I ran into trouble trying to fine-tune a Swin model with LoRA; that type of model isn't supported yet for LoRA. I wonder if it'll be the same for GaLore.

proterotype

One of the selling points of LoRA is being able to mix and match the A and B matrices from different fine-tuning runs, without having to keep the full model weights yourself if they are available elsewhere. Here it seems you have to save the entire model, so this is a big tradeoff compared to LoRA and its derivatives.

vladimirtchuiev

😘 LOL! Stay for the EPIC lipstick! M4D L0V3!

SUD

The authors say LoRA assumes low-rank weight updates, which is a bad idea since weight updates are not always low rank, and that low-rank gradients are a better alternative.
My question: isn't the only difference between a weight-update matrix and a gradient matrix the multiplication by the learning rate, i.e. weight update = learning rate * gradient?
So how come weight-update matrices are not always low rank, but gradient matrices are?
PS: congratulations on your defense :)

koiRitwikHai
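
On that last question, a sketch of the distinction as I read the paper's argument (a paraphrase, not a quote): with a stateful optimizer the weight update is not just the gradient times a learning rate, and even when it is, a sum of rotating low-rank matrices need not stay low rank:

```latex
% With Adam, the per-step update is a nonlinear function of the whole
% gradient history, not simply -\eta G_t:
\Delta W_t = -\eta\,\frac{\hat m_t}{\sqrt{\hat v_t}+\epsilon},
\qquad \hat m_t,\hat v_t \text{ accumulated from } G_1,\dots,G_t .
% And even under plain SGD, the cumulative update
W_T - W_0 = -\eta \sum_{t=1}^{T} G_t
% can reach high rank when the subspaces of the individually low-rank
% G_t rotate over training. LoRA pins the total update to one fixed
% rank-r factorization BA; GaLore only assumes each G_t is roughly
% low rank and refreshes its projector P every T steps.
```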