Run Any 70B LLM Locally on Single 4GB GPU - AirLLM

This video is a hands-on, step-by-step tutorial showing how to install AirLLM locally and run Llama 3 8B or any 70B model on a single GPU with 4 GB of VRAM.

#airllm #vram
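For reference, a minimal usage sketch in the spirit of the AirLLM README (the model ID is illustrative, and class names/arguments may differ between AirLLM versions):

```python
# pip install airllm
from airllm import AutoModel

# The 70B checkpoint is downloaded once, then streamed layer by layer at
# inference time, so only roughly one layer needs to fit in the 4 GB of VRAM.
model = AutoModel.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")

input_tokens = model.tokenizer(
    ["What is the capital of France?"],
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=128,
    padding=False,
)

generation_output = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True,
)

print(model.tokenizer.decode(generation_output.sequences[0]))
```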

PLEASE FOLLOW ME:

RELATED VIDEOS:

All rights reserved © 2021 Fahd Mirza
Comments

What's the point of this? Is AirLLM meant for the case where you also don't have enough system RAM for a 70B model and inference speed is not important? I'm wondering because just about every LLM runtime nowadays can run models from RAM, and most support offloading layers so the GPU VRAM is used as much as possible, though performance declines rapidly the more layers stay in normal RAM.
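For comparison, a minimal sketch of the partial-offload setup this comment describes, using llama-cpp-python as one example runtime (the model path and layer count are placeholders, not recommendations):

```python
from llama_cpp import Llama

# Put only as many layers on the small GPU as its VRAM allows; the rest stay in system RAM.
llm = Llama(
    model_path="./models/llama-3-70b-instruct.Q4_K_M.gguf",  # placeholder GGUF path
    n_gpu_layers=8,   # layers offloaded to the GPU; tune this to the available VRAM
    n_ctx=2048,
)

out = llm("Q: Why is the sky blue? A:", max_tokens=64)
print(out["choices"][0]["text"])
```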

testales

Nice find!

Right now, running the model layer by layer like this is a trade-off, but for large models the trade-off is worth it. With a fully structured prompt you only need a single execution of the model to produce a complete output; as you'd expect, this is not useful for casual back-and-forth chat. Hence the idea of rehearsing the prompt on a smaller model before sending it to the very large one.

Some prompts can be highly complex. Building an app, for instance, is not a simple "build a game called Snake"; a whole brief is required. The initial prompt would be highly specific: the workforce required to design and build the application professionally, using a scrum/waterfall process, with development and testing, as well as full documentation and streamlining. The model may also need to use agents for checking the work, searching for information, and so on; hence the one-shot prompt. The output would then be not only the built and tested app, but the documentation, installers, etc.

So the system should be fully connected to your chains, agents, RAG, Open Interpreter, etc. I would suggest smaller models for those tasks and let the large model use them through an API. Deploying agents behind APIs and having them collaborate on a project takes real work, so picture a small home network with agents deployed on different stations: the full AI surface could be created using VMware or multiple Docker instances across a multi-machine setup, with your main AI pinging the "fat controller". (Often the super-large language models are just various configurations of LLMs, e.g. MoE/MoA, not "real" parameter counts.)

This layer-by-layer technique is pretty fast, but the problem is that you can watch the verbose per-layer output, which makes the response feel slower. Obviously it does not keep the model in memory, so it has to reload each time. That is actually more efficient: when I build apps I do the same thing, loading the model inside the function and not persisting it past the current operation. It seems slower but it isn't; caching causes crashes due to build-up from other residual operations. So with AirLLM you're only restricted by the size of a single layer.

Hence now you can run Grok locally! (Very important!) It's one thing to share data and open-source material with the public (the original RDF triples from Wikipedia could never be loaded with the technology of the period), but releasing a model that nobody can run locally just backfired, as people like me download it anyway, just in case the internet cable gets cut!
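A minimal sketch of the "rehearse the prompt on a small model, then commit to the 70B" workflow described above, reusing the AirLLM-style call from the description (model IDs and the prompt are illustrative):

```python
from airllm import AutoModel

PROMPT = ("Act as a full software team. Produce the design, code, tests, "
          "documentation and installer plan for a Snake game.")

def run(model_id: str, prompt: str, max_new_tokens: int = 256) -> str:
    # Load the model only for this call; nothing is persisted past the operation.
    model = AutoModel.from_pretrained(model_id)
    tokens = model.tokenizer([prompt], return_tensors="pt", return_attention_mask=False)
    out = model.generate(
        tokens["input_ids"].cuda(),
        max_new_tokens=max_new_tokens,
        use_cache=True,
        return_dict_in_generate=True,
    )
    return model.tokenizer.decode(out.sequences[0])

# 1) Iterate on the structured prompt cheaply on an 8B model.
draft = run("meta-llama/Meta-Llama-3-8B-Instruct", PROMPT, max_new_tokens=128)

# 2) Once the prompt is settled, pay the slow layer-by-layer cost once on the 70B model.
final = run("meta-llama/Meta-Llama-3-70B-Instruct", PROMPT)
print(final)
```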

xspydazx

How much RAM and hard-disk space does AirLLM need for a 70B model?
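For a rough sense of scale, a back-of-envelope estimate assuming an fp16 Llama-style 70B checkpoint with 80 transformer layers (approximate figures, not AirLLM's documented requirements):

```python
params = 70e9                        # parameter count
disk_fp16 = params * 2 / 1e9         # ~140 GB of fp16 weights on disk
layers = 80                          # transformer layers in Llama 2/3 70B
vram_per_layer = disk_fp16 / layers  # ~1.75 GB streamed into VRAM at a time

print(f"disk (fp16): ~{disk_fp16:.0f} GB, per-layer VRAM: ~{vram_per_layer:.2f} GB")
```

Disk space is the dominant requirement; with layer-by-layer streaming, neither the full model in RAM nor in VRAM is needed.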

QorQar

*Great video! AI advances so fast day by day that it's impossible to test every new release! This project is fantastic, because we're talking about giant models; scaling down, a 13B model could run almost perfectly with this technique on machines that otherwise can't run it. As someone who wants to see AIs run better on CPU alone, without a GPU, I can see the full power of this project!*

ramikanimperador

Can the same idea be run on integrated graphics?

QorQar

1050 Ti here. Makes things interesting, to say the least. Seeing some of the comments, I'll have to compare this against LM Studio's layer offloading, with and without PA, to see if there's a difference on my entry-level card.

timothywcrane

So do you think a GPU with 8 GB of VRAM, like an RTX 4060, is enough if we want to start running AI locally?

Laniakea

Fahd, could you make a tutorial on fine-tuning Command-R, since it's the only viable LLM for non-European languages? I couldn't find any YouTube tutorials on fine-tuning Cohere's Command-R!

sheikhakbar

Too slow to be useful. What is the use case?

pensiveintrovert

I have 4000 of VRAM. How do I run a model on that?

QorQar

I have Llama 3 70B running on an i9 CPU with 32 threads using Ollama, but this is an interesting alternative.

autoboto

Enabling layer batching, e.g. keeping around 20 layers in memory at a time, gives better results.

flimactu

Even without this method you can run a 13B LLM with 8 GB of VRAM. If that's not enough, you can split it across CPU and VRAM together.

ROKKor-hstg