Run Any 70B LLM Locally on Single 4GB GPU - AirLLM

This video is a hands-on, step-by-step tutorial showing how to install AirLLM locally and run Llama 3 8B or any 70B model on a single GPU with 4 GB of VRAM.

#airllm #vram
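For reference, a minimal usage sketch in the spirit of the AirLLM README (the model ID is illustrative, and class names/arguments may differ between AirLLM versions):

```python
# pip install airllm
from airllm import AutoModel

# The 70B checkpoint is downloaded once, then streamed layer by layer at
# inference time, so only roughly one layer needs to fit in the 4 GB of VRAM.
model = AutoModel.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")

input_tokens = model.tokenizer(
    ["What is the capital of France?"],
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=128,
    padding=False,
)

generation_output = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True,
)

print(model.tokenizer.decode(generation_output.sequences[0]))
```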

PLEASE FOLLOW ME:

RELATED VIDEOS:

All rights reserved © 2021 Fahd Mirza
Comments

What's the point of this? Is AirLLM meant for the case where you also don't have enough system RAM for a 70B model and inference speed is not important? I'm wondering because just about every LLM runtime nowadays can run models from RAM, and most support offloading layers so the GPU VRAM is used as much as possible, though performance declines rapidly the more layers stay in normal RAM.
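For comparison, a minimal sketch of the partial-offload setup this comment describes, using llama-cpp-python as one example runtime (the model path and layer count are placeholders, not recommendations):

```python
from llama_cpp import Llama

# Put only as many layers on the small GPU as its VRAM allows; the rest stay in system RAM.
llm = Llama(
    model_path="./models/llama-3-70b-instruct.Q4_K_M.gguf",  # placeholder GGUF path
    n_gpu_layers=8,   # layers offloaded to the GPU; tune this to the available VRAM
    n_ctx=2048,
)

out = llm("Q: Why is the sky blue? A:", max_tokens=64)
print(out["choices"][0]["text"])
```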

testales

Nice find!

Right now, running the model layer by layer like this is a trade-off, but for large models the trade-off is worth it. With a fully structured prompt you only need a single execution of the model to produce a complete output; as you'd expect, this is not useful for casual back-and-forth chat. Hence the idea of rehearsing the prompt on a smaller model before sending it to the very large one.

Some prompts can be highly complex. Building an app, for instance, is not a simple "build a game called Snake"; a whole brief is required. The initial prompt would be highly specific: the workforce required to design and build the application professionally, using a scrum/waterfall process, with development and testing, as well as full documentation and streamlining. The model may also need to use agents for checking the work, searching for information, and so on; hence the one-shot prompt. The output would then be not only the built and tested app, but the documentation, installers, etc.

So the system should be fully connected to your chains, agents, RAG, Open Interpreter, etc. I would suggest smaller models for those tasks and let the large model use them through an API. Deploying agents behind APIs and having them collaborate on a project takes real work, so picture a small home network with agents deployed on different stations: the full AI surface could be created using VMware or multiple Docker instances across a multi-machine setup, with your main AI pinging the "fat controller". (Often the super-large language models are just various configurations of LLMs, e.g. MoE/MoA, not "real" parameter counts.)

This layer-by-layer technique is pretty fast, but the problem is that you can watch the verbose per-layer output, which makes the response feel slower. Obviously it does not keep the model in memory, so it has to reload each time. That is actually more efficient: when I build apps I do the same thing, loading the model inside the function and not persisting it past the current operation. It seems slower but it isn't; caching causes crashes due to build-up from other residual operations. So with AirLLM you're only restricted by the size of a single layer.

Hence now you can run Grok locally! (Very important!) It's one thing to share data and open-source material with the public (the original RDF triples from Wikipedia could never be loaded with the technology of the period), but releasing a model that nobody can run locally just backfired, as people like me download it anyway, just in case the internet cable gets cut!
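A minimal sketch of the "rehearse the prompt on a small model, then commit to the 70B" workflow described above, reusing the AirLLM-style call from the description (model IDs and the prompt are illustrative):

```python
from airllm import AutoModel

PROMPT = ("Act as a full software team. Produce the design, code, tests, "
          "documentation and installer plan for a Snake game.")

def run(model_id: str, prompt: str, max_new_tokens: int = 256) -> str:
    # Load the model only for this call; nothing is persisted past the operation.
    model = AutoModel.from_pretrained(model_id)
    tokens = model.tokenizer([prompt], return_tensors="pt", return_attention_mask=False)
    out = model.generate(
        tokens["input_ids"].cuda(),
        max_new_tokens=max_new_tokens,
        use_cache=True,
        return_dict_in_generate=True,
    )
    return model.tokenizer.decode(out.sequences[0])

# 1) Iterate on the structured prompt cheaply on an 8B model.
draft = run("meta-llama/Meta-Llama-3-8B-Instruct", PROMPT, max_new_tokens=128)

# 2) Once the prompt is settled, pay the slow layer-by-layer cost once on the 70B model.
final = run("meta-llama/Meta-Llama-3-70B-Instruct", PROMPT)
print(final)
```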

xspydazx

How much RAM and hard-disk space does AirLLM need for a 70B model?
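For a rough sense of scale, a back-of-envelope estimate assuming an fp16 Llama-style 70B checkpoint with 80 transformer layers (approximate figures, not AirLLM's documented requirements):

```python
params = 70e9                        # parameter count
disk_fp16 = params * 2 / 1e9         # ~140 GB of fp16 weights on disk
layers = 80                          # transformer layers in Llama 2/3 70B
vram_per_layer = disk_fp16 / layers  # ~1.75 GB streamed into VRAM at a time

print(f"disk (fp16): ~{disk_fp16:.0f} GB, per-layer VRAM: ~{vram_per_layer:.2f} GB")
```

Disk space is the dominant requirement; with layer-by-layer streaming, neither the full model in RAM nor in VRAM is needed.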

QorQar

*Great video! AI advances so fast day by day that it's impossible to test every new release! This project is fantastic, because we're talking about giant models; scaling down, a 13B model could run almost perfectly with this technique on machines that otherwise can't run it. As someone who wants to see AIs run better on CPU alone, without a GPU, I can see the full power of this project!*

ramikanimperador

Can the same idea be run on integrated graphics?

QorQar

1050 Ti here. Makes things interesting, to say the least. Seeing some of the comments, I'll have to compare this against LM Studio's layer offloading, with and without PA, to see if there's a difference on my entry-level card.

timothywcrane

So do you think a GPU with 8 GB of VRAM, like an RTX 4060, is enough if we want to start running AI locally?

Laniakea

Fahd, could you make a tutorial on fine-tuning Command-R, since it's the only viable LLM for non-European languages? I couldn't find any YouTube tutorials on fine-tuning Cohere's Command-R!

sheikhakbar

Too slow to be useful. What is the use case?

pensiveintrovert

I have 4000 of VRAM. How do I run a model on that?

QorQar

I have Llama 3 70B running on an i9 CPU with 32 threads using Ollama, but this is an interesting alternative.

autoboto

Enabling layer batching, e.g. keeping around 20 layers in memory at a time, gives better results.

flimactu

Even without this method you can run a 13B LLM with 8 GB of VRAM. If that's not enough, you can split it across CPU and VRAM together.

ROKKor-hstg