How to Run Large AI Models from Hugging Face on a Single GPU without OOM

This demo shows how to run large AI models from #huggingface on a single GPU without an out-of-memory (OOM) error. Take an OPT-175B or BLOOM-176B parameter model: these large language models normally require a very powerful machine or a multi-GPU setup, but thanks to bitsandbytes, with just a few tweaks to your code you can run them on a single node.

In this tutorial, we load the 3-billion-parameter BLOOM model from Hugging Face and run #LLM inference on Google Colab (Tesla T4) without OOM.
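For reference, here is a minimal sketch of the approach (not the exact notebook code; the checkpoint name and prompt are just examples), assuming transformers, accelerate, and bitsandbytes are installed:

```python
# pip install bitsandbytes accelerate transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-3b"  # ~3B parameters, fits on a Colab T4 in 8-bit
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # let accelerate place the weights on the GPU
    load_in_8bit=True,   # bitsandbytes 8-bit quantization
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```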

This is brilliant! Kudos to the team.

Comments

It is really impressive! I didn't expect that it would be possible for me to host a huge model like BLOOM myself!

serta

Thanks for walking through the notebook and sharing the resources! Good job!

NoobMLDude

Woah, this is what I needed. Thank you!!

prathameshjadhav

Excellent video! I'd love to learn more and hopefully contribute to these feats of optimization someday.

samlaki

🎯 Key Takeaways for quick navigation:

00:00 🚀 *Running Large AI Models on a Single GPU*
- Exploring how to run large language models on a single GPU.
- Introducing the use of the "bitsandbytes" library for this purpose.
- Acknowledging the source of the content from Tim Dettmers.
01:11 🧮 *Quantization for Model Size Reduction*
- Explaining the concept of quantization in neural networks.
- Highlighting the importance of quantization for reducing model size.
- Emphasizing the use of 8-bit and 16-bit precision for quantization (see the worked example after this list).
04:11 🔧 *Setting Up Environment for Model Loading*
- Listing the steps to set up the environment for loading large models.
- Mentioning the installation of the required libraries (bitsandbytes, transformers, accelerate).
- Providing guidance on selecting the appropriate GPU hardware.
06:20 📦 *Loading Large Models with Ease*
- Demonstrating how to load a large language model with a single line of code.
- Showcasing the ability to load a 3 billion parameter model without RAM issues.
- Comparing the use of transformers' pipeline with manual model loading.
09:33 💾 *Quantization Without Performance Degradation*
- Highlighting the key benefit of quantization: reducing model size without performance degradation.
- Discussing memory savings achieved with quantization for large models.
- Illustrating how quantization allows hosting large models on single GPUs.
13:18 👏 *Acknowledgment and Conclusion*
- Expressing gratitude to Tim Dettmers and his team for simplifying the process.
- Recognizing the potential impact of this advancement on hosting AI models.
- Encouraging viewers to explore this opportunity and stay tuned for further research details.
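As a toy illustration of the quantization idea at 01:11 above, here is a minimal sketch of plain absmax 8-bit quantization (the actual LLM.int8() scheme in bitsandbytes is more sophisticated and additionally handles outlier features; the tensor values here are made up for illustration):

```python
# Toy absmax 8-bit quantization: map fp32 weights to int8 and back.
import torch

w = torch.tensor([0.12, -0.98, 0.45, 0.03])      # example fp32 weights

scale = 127.0 / w.abs().max()                    # largest magnitude maps to 127
w_int8 = torch.round(w * scale).to(torch.int8)   # 1 byte per weight instead of 4
w_dequant = w_int8.float() / scale               # approximate reconstruction

print(w_int8)     # tensor([  16, -127,   58,    4], dtype=torch.int8)
print(w_dequant)  # close to the original weights, at 1/4 the fp32 memory
```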

Made with HARPA AI

jonathanberry

You are a fantastic explainer, thank you!

robert

Thanks to Kalyan KS, who recommended this amazing video to me!

darshantank

There's a typo in the notebook you've linked: "bitsandbytes" is missing the "s" at the end, so pip can't find the package.
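(For anyone hitting the same error, the package name on PyPI is "bitsandbytes", with the trailing "s":)

```
pip install bitsandbytes
```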

EvanBurnetteMusic

I recently bought a 4070 Ti Super, which I want to use in tandem with my 2070 Super.

fontenbleau

Excellent. I looked at your Google Colab notebook, and I want to know whether the Nvidia V100 GPU is supported. The Colab notebook says, "Currently Turing and Ampere GPUs are supported." Volta is not listed, and the V100 is a Volta micro-architecture GPU. [update: V100 GPUs are mentioned in Table 1 of "8-bit Optimizers via Block-wise Quantization" by Dettmers et al.]

vtrandal

Can you please verify whether you can run the 175B BLOOM model? I see you run the 3B model, but I want to know if you have the 175B model working in Colab. Please help!

poxmeog

How do I run Chronos-Hermes 13B on a PC? What do I need?

smoklares

For human-like original text, do you prefer paraphrasing or generating text? Which model do you recommend?

fractalarbitrage

How would you recommend building a custom PC for running a local LLM?

thumperhunts

I think these don't work for fine-tuning large models 💔☹️

imranullah

How do I fine-tune an LLM in free Google Colab?

ElNinjaZeros

Do you know if anybody is working on InstructOPT, like InstructGPT?

knowledgelover

You are an amazing instructor, no doubt. But why don't you work on improving your English accent?

geekyprogrammer