It’s over…my new LLM Rig

This runs faster than a Thunderbolt eGPU

#machinelearning #llm #softwaredevelopment

CHAPTERS
0:00 Unboxing
1:14 Installing RTX 4090
2:00 Setting Up Power Supply
2:46 Assembling GPU Dock
5:49 Software Installation
7:13 Running LLMs
9:59 Testing Larger Models
12:25 Testing Stable Diffusion
Comments

If you are running a GGUF model, Ollama will split the work, putting as many layers as it can on the GPU and the rest on the CPU. It will run slower, but still faster than CPU only.
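
For illustration, a hedged sketch of steering that split through Ollama's HTTP API (assuming the default endpoint on localhost:11434; num_gpu is the option that controls how many layers get offloaded, and support for it can vary by version):

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Why is the sky blue?",
  "options": { "num_gpu": 20 }
}'

Setting num_gpu to 0 should force a CPU-only run.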

irrelevantdata

Some points here, Alex:
1. The power cable that splits into three plugs is a "dongle" (adapter): the 12VHPWR connector is too recent for most power supplies out there, so they supply an adapter fed by 3 or 4 of the 8-pin PCIe connectors.
2. The drivers for your GPU are provided by Nvidia themselves (just Google the Game Ready driver for the RTX 4090), as the AIB's (Gigabyte's) bundled drivers are outdated.
3. All modern GPUs (from 2010 onwards) are set to keep their fans at zero RPM below about 60 °C.

blackhorseteck

Mini PCs have revolutionized the boring PC market. The power they are able to squeeze inside these small boxes gives me hope for the future of computing.

blackhorseteck

A small piece of advice regarding Ollama: use --verbose.

Example: ollama run llama3.1:8b --verbose

Technically these commands, including what you ran, keep the model loaded. You have to unload it manually, or you can tell Ollama to unload the model shortly after you exit with /bye:

ollama run llama3.1:8b --verbose --keepalive 10s

--verbose will report the tokens per second generated.
--keepalive 10s will drop the model from memory after 10 seconds.
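
In the same spirit (again assuming the default API endpoint on localhost:11434), the Ollama FAQ notes you can also unload a model immediately by sending a request with keep_alive set to 0:

curl http://localhost:11434/api/generate -d '{"model": "llama3.1:8b", "keep_alive": 0}'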

serikazero

Chinese modders have transplanted the chip from an RTX 4090D onto a custom board (or a 3090 board) and soldered on 48 GB of memory. A real beast for an AI rig. However, I'm not sure about the warranty on such a Frankenstein card.

eternalnightmare

Serious question: why not just run off wall power directly if your UPS isn't big enough? This isn't a mission-critical server with important information; it doesn't need 24/7 operation during a power outage.

harryhall

Haven't seen anyone do a video using multiple video cards in parallel to run a large model. So that is my humble request. Love the content.

Krath

10:10 The GPU spikes while running Ollama could be due to:
Batch processing: The AI might be processing data in phases, causing short bursts of high GPU usage.
Resource optimization: The GPU is only used for certain tasks, leading to inconsistent usage.
Power management: The GPU adjusts its power consumption based on demand, resulting in spikes.
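
One way to watch those spikes outside of Ollama, as a minimal sketch (nvidia-smi ships with the NVIDIA driver):

nvidia-smi --query-gpu=utilization.gpu,power.draw,memory.used --format=csv -l 1

This prints GPU utilization, power draw, and VRAM use once per second, so you can see whether the bursts line up with token generation.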

TechGameDev

I experienced the same VRAM problems. I have a 32-thread i9 and 128 GB of system RAM; it runs the large models in slow motion, but it works. Small models run fast enough to use on the i9, but if a model fits in the GPU's 16 GB it's really fast, and enough to use as a service for a few clients. I'm using a mobile 4090, which is roughly a desktop 4080. For large models these days, it seems the Mac's unified RAM is the way to go: slower, but at least it runs and the wait is not too long.

autoboto

I think you meant 40 gigabits/s for Thunderbolt 4, not gigabytes.
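
For the arithmetic: Thunderbolt 4 carries 40 Gbit/s, and 40 Gbit/s ÷ 8 ≈ 5 GB/s, which is where the bits-versus-bytes mix-up comes from.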

monsterbeast

I've got a 1080 Ti from 2017 in a PC I built in 2016, overclocked to almost 5 GHz, 6 cores / 12 threads, with 64 GB of RAM.
I am currently running Llama 3.2 7B models lightning fast on this PC.

ldandco

I have a 3090 with 24 GB, and yes, you can run 13B models. Nice setup.

Heythisismychannel

Ollama is smart enough to use two GPUs simultaneously, so for that 40 GB LLM you really have to use two GPUs with 24 GB of VRAM each.
Once you go over GPU VRAM capacity, things spill into system RAM and through the CPU, which is terribly slow. At that point Apple Silicon Macs have the advantage of shared RAM, so something like a 64 GB Mac Studio "outperforms" a PC that lacks GPU VRAM.
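
Rough numbers behind that (an estimate, assuming a ~4-bit quantization at about half a byte per parameter): a 70B model needs roughly 70 × 0.5 ≈ 35 GB for the weights alone, plus KV cache and runtime overhead, which is why it overflows a single 24 GB card but can fit across two.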

TazzSmk

The white video cards are usually purchased by people building a "snow blind" PC - white case, white video card, white power supply, white cables, etc. These white video cards can be difficult to source and during periods of short supply they command a premium price with no other benefit than matching the color of the build.

Larger LLMs yield very poor performance when spilling over from the VRAM of the NVIDIA RTX 4090 into the 128 GB of RAM in my tower PC. I get much better performance running LLMs of up to 70 billion parameters on my MacBook Pro M3 Max. This is why I will be purchasing a Mac Studio M4 Ultra with maximum RAM installed when it is available.

gaiustacitus

I love this man, never bored watching.

jukiy

Thanks for the demo. Now I understand how the LLM works, especially the part about how it consumes power and memory. With this info, I can manage the usage properly.

albertjeremy

running "ollama ps" will show you how much of the model is loaded on system ram vs GPU ram. You want 2x4090s for enough VRAM to run a 70b at a good speed.

Tarbard

You can see additional stats on model performance, like tokens per second, by using the --verbose flag with ollama run.

So: ollama run llama3.1 --verbose

Love the videos!

jake-epwq

This was a crazy video!! One of your best!!!

itiswhatitis-yes

Also, ISTA-DASlab (on Hugging Face) managed to squeeze the original 140 GB Llama 70B model into 22 GB while keeping 90+% of its quality, so it can run on one 3090 card. They've also made it possible to run the 8B model on smartphones.
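
The arithmetic is roughly consistent (assuming an extreme ~2-2.5 bits-per-weight quantization): 70 × 10⁹ parameters × 2.5 bits ÷ 8 ≈ 22 GB, versus 70 × 10⁹ × 2 bytes ≈ 140 GB for the original FP16 weights.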

fontenbleau