GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Large language models (LLMs) demand substantial GPU memory, making training impractical on a single consumer GPU: pre-training a 7-billion-parameter model requires roughly 58GB for the weights, gradients, and Adam optimizer states. The GaLore paper addresses this by projecting gradients into a low-rank space, shrinking the optimizer states so the model fits on a single GPU. Remarkably, this not only resolves the memory bottleneck but also outperforms parameter-efficient tuning methods such as LoRA.
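The core idea above can be sketched in a few lines of NumPy. This is a minimal, illustrative single step, not the paper's implementation: the projector P_t comes from the top-r left singular vectors of the gradient (GaLore refreshes it only every T steps), the Adam moments live in the small r×n space, and the update is projected back to full size. All variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weight matrix and its gradient (m x n), with target rank r << min(m, n).
m, n, r = 64, 32, 4
W = rng.standard_normal((m, n))
G = rng.standard_normal((m, n))

# Projector P (m x r): top-r left singular vectors of the gradient.
U, _, _ = np.linalg.svd(G, full_matrices=False)
P = U[:, :r]

# Project the gradient into the low-rank space: R = P^T G  (r x n).
R = P.T @ G

# One Adam-style update computed entirely in the r x n space, so the
# optimizer state costs r*n instead of m*n per weight matrix.
beta1, beta2, eps, lr = 0.9, 0.999, 1e-8, 1e-2
M = np.zeros_like(R)  # first moment
V = np.zeros_like(R)  # second moment
M = beta1 * M + (1 - beta1) * R
V = beta2 * V + (1 - beta2) * R**2
N_t = (M / (1 - beta1)) / (np.sqrt(V / (1 - beta2)) + eps)

# Project the low-rank update back to full size and apply it.
W_new = W - lr * (P @ N_t)
print(W_new.shape)
```

Note that the weight update `P @ N_t` has rank at most r, yet over many steps (with a periodically refreshed projector) the accumulated updates can be full-rank, which is why GaLore is not limited the way a fixed low-rank adapter is.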

Table of Contents:
00:00 Intro
02:17 LoRA
03:18 Limitations of LoRA
05:58 GaLore
18:18 Adam with GaLore
21:01 8-Bit Optimizers
22:50 LOMO
24:48 GaLore vs LoRA
26:20 Rank vs Perplexity
27:07 Results

Comments

"mr" is the size of Projector P_t I think. In the algorithm they calculate R_t = P_t.T G_t
Great video by the way! Thanks.

yashmandilwar

Your explanation is truly awesome! Keep making more, please!

savanthtadepalli

Excellent video! Would you recommend any resources that explain in depth the theorems they propose for low-rank gradients and their convergence?
Also, what tools do you use to create such cool animations?

HarishPrakash-oo