Full Fine-tuning with Fewer GPUs - GaLore, Optimizer Tricks, Adafactor


VIDEO RESOURCES:

TIMESTAMPS:
0:00 LLM Full fine-tuning with lower VRAM
0:37 Video Overview
4:02 Understanding Optimizers
6:17 Stochastic Gradient Descent (SGD)
7:53 AdamW Optimizer and VRAM requirements
9:31 AdamW 8-bit optimizer
11:03 Adafactor optimizer and memory requirements
14:28 GaLore - reducing gradient and optimizer VRAM
19:10 LoRA versus GaLore
19:49 Better and Faster GaLore via Subspace Descent
22:59 Layerwise gradient updates
26:17 Training Scripts
27:10 How gradient checkpointing works to reduce memory
40:30 AdamW Performance
41:14 AdamW 8-bit Performance
42:45 Adafactor with manual learning rate and schedule
44:10 Adafactor with default/auto learning rate
45:47 GaLore AdamW
50:22 GaLore AdamW with Subspace Descent
52:25 Using AdamW 8-bit and Adafactor with GaLore
53:14 Notebook demo of layerwise gradient updates
55:28 Running with LoRA
58:36 Inferencing and Pushing Models to Hub
1:00:00 Single GPU Recommendations
1:25:00 Multi-GPU Recommendations
1:03:25 Resources
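
The timestamps above cover GaLore and the various optimizer options; for a rough sense of how they fit together, here is a minimal sketch of full fine-tuning with GaLore via the Hugging Face Trainer. It assumes a recent transformers release with built-in GaLore support plus the `galore-torch` package; the model name, dataset and hyperparameters are placeholders rather than the ones used in the video.

```python
# Minimal sketch: full fine-tuning with the GaLore optimizer via the Hugging Face Trainer.
# Assumes a recent transformers release with GaLore support and `pip install galore-torch`.
# Model, dataset and hyperparameters are placeholders, not those from the video.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token            # Llama-style tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("imdb", split="train[:1%]")   # placeholder dataset

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="galore-full-ft",
    per_device_train_batch_size=1,
    gradient_checkpointing=True,           # recompute activations to save memory
    optim="galore_adamw",                  # or "galore_adamw_layerwise" for layer-wise updates
    optim_target_modules=["attn", "mlp"],  # project these modules' gradients onto a low-rank subspace
    learning_rate=1e-5,
    max_steps=100,
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```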
Comments

Very up to date! Includes GaLore, etc.

andpoul

Can you implement a few papers in PyTorch, like Grad-TTS and more?

imranullah

Hey Trelis! Can you help me set up a **multi-node, multi-GPU** training infra using RunPod? I figured this out using the community cloud option, where I can set a public IP for my pods and expose the TCP ports with the same internal and external port numbers. However, I'm not able to add a shared disk across my community pods to save checkpoints in case of node failure. I totally failed to set up communication between two different pods when I launched them in the secure cloud, but the secure cloud allows a network volume that can be shared across different pods.

Can you help me set up infra for a multi-node, multi-GPU setup in the secure cloud? In Paperspace this was easy, but I am not able to figure this out using RunPod. Any suggestions are welcome.

padmasrivaddiparthi
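
The networking question above is RunPod-specific, but as a generic sketch of what each node runs once connectivity and a shared volume are in place: a script launched on every node with torchrun, which initialises torch.distributed from the environment variables torchrun sets and writes checkpoints to shared storage. The model, paths and launch flags below are illustrative assumptions, not a tested RunPod recipe.

```python
# Generic multi-node sketch (not RunPod-specific). Launch on every node with torchrun, e.g.
#   torchrun --nnodes=2 --node_rank=<0 or 1> --nproc_per_node=<gpus per node> \
#            --master_addr=<node 0 IP> --master_port=29500 train.py
# The shared checkpoint path is a placeholder for a volume that all nodes can mount.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # reads RANK/WORLD_SIZE/MASTER_ADDR set by torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4096, 4096).cuda(local_rank)   # stand-in for the real model
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-5)

    # ... training loop over a DistributedSampler-backed DataLoader goes here ...

    # Rank 0 writes checkpoints to shared storage so training can resume after a node failure.
    if dist.get_rank() == 0:
        os.makedirs("/shared/checkpoints", exist_ok=True)
        torch.save(ddp_model.module.state_dict(), "/shared/checkpoints/latest.pt")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```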

Can we convert a full fine-tuned model to LoRA (SVD on the delta weights)?

VijayEranti
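
On the question above: the usual idea is to subtract the base weights from the fine-tuned weights and factor each delta with a truncated SVD to get LoRA-style A and B matrices. A minimal sketch, where the model names, rank and choice of layers are placeholder assumptions (this is not something covered in the video):

```python
# Sketch: extract LoRA-style factors from a full fine-tune via truncated SVD on the weight deltas.
# Model names and the rank are placeholders; only 2-D weight matrices are factored.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("base-model")          # placeholder
tuned = AutoModelForCausalLM.from_pretrained("fully-tuned-model")  # placeholder, same architecture
rank = 16

lora_factors = {}
for (name, w_base), (_, w_tuned) in zip(base.named_parameters(), tuned.named_parameters()):
    if w_base.ndim != 2:                       # skip biases, norms, etc.
        continue
    delta = (w_tuned - w_base).float()
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    # Keep the top-`rank` singular directions so that delta ≈ B @ A
    B = U[:, :rank] * S[:rank]                 # (out_features, rank)
    A = Vh[:rank, :]                           # (rank, in_features)
    lora_factors[name] = (A, B)
```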

Hi. Will this work for continued pretraining on textbooks for domain-specific adaptive learning? All I see on the internet are LoRA videos. I have seen your video on FFT and that's what I want for my use case.

mdrafatsiddiqui
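
Regarding continued pretraining: the training setup is the same full fine-tune, and the difference is mostly the data, which is raw domain text tokenized and packed into fixed-length blocks for the standard causal-LM objective. A rough sketch of that packing step, with a placeholder tokenizer and file paths:

```python
# Sketch of preparing raw domain text (e.g. textbooks) for continued pretraining:
# tokenize everything, concatenate, and split into fixed-length blocks with labels equal
# to input_ids (standard causal-LM objective). All names and paths are placeholders.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder
raw = load_dataset("text", data_files={"train": "textbooks/*.txt"})["train"]      # placeholder corpus
block_size = 2048

def tokenize(batch):
    return tokenizer(batch["text"])

def group_texts(examples):
    # Concatenate all token ids, then chop into block_size chunks, dropping the remainder.
    concatenated = sum(examples["input_ids"], [])
    total = (len(concatenated) // block_size) * block_size
    chunks = [concatenated[i:i + block_size] for i in range(0, total, block_size)]
    return {"input_ids": chunks, "labels": [list(c) for c in chunks]}

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)
lm_dataset = tokenized.map(group_texts, batched=True, remove_columns=tokenized.column_names)
# `lm_dataset` can then be passed as train_dataset to a Trainer setup like the one sketched above.
```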