How to Run Large AI Models from Hugging Face on a Single GPU without OOM

This demo shows how to run large AI models from #huggingface on a single GPU without an out-of-memory (OOM) error. Take an OPT-175B or BLOOM-176B parameter model: these large language models normally require a very powerful machine or a multi-GPU setup, but thanks to bitsandbytes, with just a few tweaks to your code you can run them on a single node.

In this tutorial, we load the 3-billion-parameter BLOOM model from Hugging Face and run #LLM inference on Google Colab (Tesla T4) without OOM.
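For reference, here is a minimal sketch of the approach (not the exact notebook code; the checkpoint name and prompt are just examples), assuming transformers, accelerate, and bitsandbytes are installed:

```python
# pip install bitsandbytes accelerate transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-3b"  # ~3B parameters, fits on a Colab T4 in 8-bit
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # let accelerate place the weights on the GPU
    load_in_8bit=True,   # bitsandbytes 8-bit quantization
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```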

This is brilliant! Kudos to the team.

Comments

It is really impressive! I didn't expect that it would be possible for me to host a huge model like BLOOM myself!

serta

Thanks for walking through the notebook and sharing the resources! Good job!

NoobMLDude

Woah, this is what I needed. Thank you!!

prathameshjadhav

Excellent video! I'd love to learn more and hopefully contribute to these feats of optimization someday.

samlaki

🎯 Key Takeaways for quick navigation:

00:00 🚀 *Running Large AI Models on a Single GPU*
- Exploring how to run large language models on a single GPU.
- Introducing the use of the "bitsandbytes" library for this purpose.
- Acknowledging the source of the content from Tim Dettmers.
01:11 🧮 *Quantization for Model Size Reduction*
- Explaining the concept of quantization in neural networks.
- Highlighting the importance of quantization for reducing model size.
- Emphasizing the use of 8-bit and 16-bit precision for quantization (see the worked example after this list).
04:11 🔧 *Setting Up Environment for Model Loading*
- Listing the steps to set up the environment for loading large models.
- Mentioning the installation of the required libraries (bitsandbytes, transformers, accelerate).
- Providing guidance on selecting the appropriate GPU hardware.
06:20 📦 *Loading Large Models with Ease*
- Demonstrating how to load a large language model with a single line of code.
- Showcasing the ability to load a 3 billion parameter model without RAM issues.
- Comparing the use of transformers' pipeline with manual model loading.
09:33 💾 *Quantization Without Performance Degradation*
- Highlighting the key benefit of quantization: reducing model size without performance degradation.
- Discussing memory savings achieved with quantization for large models.
- Illustrating how quantization allows hosting large models on single GPUs.
13:18 👏 *Acknowledgment and Conclusion*
- Expressing gratitude to Tim Dettmers and his team for simplifying the process.
- Recognizing the potential impact of this advancement on hosting AI models.
- Encouraging viewers to explore this opportunity and stay tuned for further research details.
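As a toy illustration of the quantization idea at 01:11 above, here is a minimal sketch of plain absmax 8-bit quantization (the actual LLM.int8() scheme in bitsandbytes is more sophisticated and additionally handles outlier features; the tensor values here are made up for illustration):

```python
# Toy absmax 8-bit quantization: map fp32 weights to int8 and back.
import torch

w = torch.tensor([0.12, -0.98, 0.45, 0.03])      # example fp32 weights

scale = 127.0 / w.abs().max()                    # largest magnitude maps to 127
w_int8 = torch.round(w * scale).to(torch.int8)   # 1 byte per weight instead of 4
w_dequant = w_int8.float() / scale               # approximate reconstruction

print(w_int8)     # tensor([  16, -127,   58,    4], dtype=torch.int8)
print(w_dequant)  # close to the original weights, at 1/4 the fp32 memory
```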

Made with HARPA AI

jonathanberry

You are a fantastic explainer, thank you!

robert

Thanks to Kalyan KS, who recommended this amazing video to me!

darshantank

There's a typo in the notebook you've linked: "bitsandbytes" is missing the "s" at the end, so pip can't find the package.
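(For anyone hitting the same error, the package name on PyPI is "bitsandbytes", with the trailing "s":)

```
pip install bitsandbytes
```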

EvanBurnetteMusic

I recently bought a 4070 Ti Super, which I want to use in tandem with my 2070 Super.

fontenbleau

Excellent. I looked at your Google Colab notebook, and I want to know whether the Nvidia V100 GPU is supported. The Colab notebook says, "Currently Turing and Ampere GPUs are supported." Volta is not listed, and the V100 is a Volta micro-architecture GPU. [update: V100 GPUs are mentioned in Table 1 of "8-bit Optimizers via Block-wise Quantization" by Dettmers et al.]

vtrandal

Can you please verify whether you can run the 175B BLOOM model? I see you run the 3B model, but I want to know if you have the 175B model working in Colab. Please help!

poxmeog

How do I run Chronos-Hermes 13B on a PC? What do I need?

smoklares

For human-like original text, do you prefer paraphrasing or generating text? Which model do you recommend?

fractalarbitrage

How would you recommend building a custom PC for running a local LLM?

thumperhunts

I think these don't work for fine-tuning large models 💔☹️

imranullah

How do I fine-tune an LLM in free Google Colab?

ElNinjaZeros

Do you know if anybody is working on InstructOPT, like InstructGPT?

knowledgelover

You are an amazing instructor, no doubt. But why don't you work on improving your English accent?

geekyprogrammer