Llama 3.1 405b LOCAL AI Home Server on 7995WX Threadripper and 4090

Is running a 405B LLM on your home server possible? YES! Today we give it a spin on our 7995WX, and we also test the 3090 and 4090 head to head with the 6400 MHz DDR5 in this system. Two fun topics as we check out performance and uncover a performance mystery.
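
For readers who want to try the headline experiment themselves, a minimal sketch using the Ollama Python client is below; the exact backend, model tag, and quantization used in the video are assumptions here.

```python
# Minimal sketch: prompting a locally served Llama 3.1 405B through the Ollama Python client.
# Assumptions: Ollama is installed and running, and the "llama3.1:405b" tag has been pulled;
# the serving stack and quantization actually used in the video may differ.
import time

import ollama

start = time.time()
response = ollama.chat(
    model="llama3.1:405b",  # ~230 GB at 4-bit, so most of it runs from system RAM, not VRAM
    messages=[{"role": "user", "content": "Summarize what a KV cache does in two sentences."}],
)
print(response["message"]["content"])
print(f"wall time: {time.time() - start:.1f} s")  # expect minutes, not seconds, for a 405B on CPU+RAM
```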

QUAD 3090 AI SERVER BUILD
(the sTRX4 mount fits SP3, and the retention kit comes with the CAPELLIX)

Chapters
0:00 Threadripper 7995WX AI Testing
1:00 Running 405B Local AI Server
5:56 Qwen 2.5 32b on 4090
8:34 Llama 3.2 3b on 4090
10:17 Llama 3.1 8b fp16 on 4090
12:15 Qwen 2.5 32b on 3090
14:32 Llama 3.1 8b on 3090
15:47 Llama 3.2 3b on 3090

Be sure to 👍✅Subscribe✅👍 for more content like this!

Please share this video to help spread the word and drop a comment below with your thoughts or questions. Thanks for watching!

Digital Spaceport Website

🛒Shop (Channel members get a 3% or 5% discount)

*****
As an Amazon Associate I earn from qualifying purchases.

When you click on links to various merchants on this site and make a purchase, this can result in this site earning a commission. Affiliate programs and affiliations include, but are not limited to, the eBay Partner Network.
*****
Comments

Great content!

My conclusion is that AI-at-home is profoundly I/O bound -- MMA VRAM capacity specifically. Pat Gelsinger cancelled Beast Lake (a.k.a. "Royal Core") because he doesn't think the world needs powerful CPUs anymore. A capable CPU & nominally clocked RAM are required but nothing spicy seems to be necessary.

For I/O capability, TRP is the most I can afford. Entry into TRP is a 24-core CPU. That works.

4 x 5090s (128 GB combined VRAM; NVIDIA's popular stack) seems to be a reasonable target for 2025-2026, albeit still insufficient for responsiveness with the larger models.

(1) engage an electrician to add 2 new 20A circuits & put two new UPSs in my home office
(2) buy/build a rig capable of holding 4 x GPUs -- 2 power supplies
(3) custom loop water cooling (external radiator?)
(4) get a second job & change my diet to Ramen-only to be able to pay for all of this
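
A rough back-of-the-envelope check of the bandwidth-bound point above: during token generation, throughput tops out at roughly memory bandwidth divided by the bytes of weights streamed per token. The figures in this sketch are illustrative assumptions, not measurements from the video.

```python
# Upper bound on token generation speed for a memory-bandwidth-bound system:
# tokens/s <= usable memory bandwidth / bytes of weights streamed per token (~ model size).
# The bandwidth and model-size numbers below are rough assumptions, not measurements.

def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

scenarios = {
    "405B q4 (~230 GB) from 8-channel DDR5-6400 (~400 GB/s peak)": (400.0, 230.0),
    "32B q4 (~19 GB) entirely in a 4090's VRAM (~1000 GB/s)": (1000.0, 19.0),
}

for name, (bw, size) in scenarios.items():
    print(f"{name}: <= {max_tokens_per_second(bw, size):.1f} tok/s")
```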

markldevine

We need to start rating AIs with a resource score that indicates how much the hardware to run them would cost.
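
One way such a resource score could be seeded is sketched below: estimate memory footprint from parameter count and quantization, then bucket it into a hardware class. The overhead factor and thresholds are arbitrary assumptions.

```python
# Toy "resource score" along the lines suggested above: estimate a model's memory
# footprint from parameter count and quantization, then bucket it by the hardware
# class that could hold it. The overhead factor and thresholds are rough assumptions.

def required_gb(params_billions: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    # weights plus ~20% for KV cache and activations at modest context (assumption)
    return params_billions * bits_per_weight / 8 * overhead

def hardware_class(gb: float) -> str:
    if gb <= 12:
        return "mainstream GPU (12 GB)"
    if gb <= 24:
        return "high-end consumer GPU (24 GB)"
    if gb <= 96:
        return "multi-GPU workstation"
    return "CPU plus huge system RAM, or a GPU cluster"

for name, params, bits in [("Llama 3.2 3B", 3, 4), ("Qwen 2.5 32B", 32, 4),
                           ("Llama 3.1 70B", 70, 4), ("Llama 3.1 405B", 405, 4)]:
    gb = required_gb(params, bits)
    print(f"{name}: ~{gb:.0f} GB -> {hardware_class(gb)}")
```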

HectorDiabolucus

This video shows clearly that you shouldn't run a 405B model at home. Even on a machine with the fastest CPU, the 405B model takes 20 minutes to answer. What you need at home is a graphics card with a lot of VRAM. I wish cards with 48GB of VRAM were manufactured and sold on older chipsets like the 3070 or 3080. And if it were possible, I'd love for graphics card companies to build and sell cards with 480GB of VRAM.

lietz

You should run a benchmark like STREAM TRIAD to see what your actual memory bandwidth is for vector math vs the theoretical bandwidth. In a recent test of a bunch of EPYC chips, many models had much lower TRIAD results than their theoretical memory bandwidth.

BTW, while it might not make for as entertaining videos, in general, if you use `llama-bench` as a standardized benchmark for all your different testing, I bet you'll get more useful/standardized results for your spreadsheet (I also recommend recording the build number for your tests).

For those interested, on a 24C EPYC 9274F (395 GB/s TRIAD vs 460.8 GB/s theoretical), with a Qwen2.5-32B q4_0 GGUF and llama-bench (b3985), CPU-only gives pp/tg of 24.7 t/s and 8 t/s. With a W7900 (ROCm) with a display attached, I get pp/tg of 656.5 t/s and 25.9 t/s. If I run with `-ngl 0` -- using the Navi31 (gfx1100) GPU for compute, but with the model loaded into system memory, not VRAM -- I am able to get pp/tg of 344.3 t/s and 7.6 t/s, which is actually not bad (e.g., even if you didn't have enough memory to load any layers onto the GPU, you'd still get a pretty huge compute benefit and barely any loss on MBW).
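
For anyone who wants a quick approximation of the TRIAD figure without building STREAM itself, here is a rough NumPy sketch; it will understate a tuned OpenMP STREAM result but still shows the gap between theoretical and achieved bandwidth.

```python
# Quick-and-dirty TRIAD-style bandwidth probe with NumPy, in the spirit of STREAM.
# Not the official STREAM binary: NumPy evaluates a = b + scalar*c in two passes, so
# this understates a tuned OpenMP result, but it still exposes the gap between
# theoretical and achieved bandwidth.
import time
import numpy as np

N = 100_000_000                      # ~0.8 GB per float64 array, far larger than any CPU cache
a = np.zeros(N)
b = np.random.rand(N)
c = np.random.rand(N)
scalar = 3.0

best = 0.0
for _ in range(5):
    t0 = time.perf_counter()
    np.multiply(c, scalar, out=a)    # a = scalar * c
    np.add(a, b, out=a)              # a = a + b  ->  TRIAD: a = b + scalar * c
    dt = time.perf_counter() - t0
    gb_moved = 3 * N * 8 / 1e9       # STREAM convention: count arrays a, b, c once each
    best = max(best, gb_moved / dt)

print(f"best TRIAD-style bandwidth: {best:.0f} GB/s")
```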

lhl

It would be interesting if you tested this with a 7900 XTX in combination with ROCm as well.

---David---

Re: why do some 5995WX benchmarks run faster than 7995WX?

Even though the turbo speed of the 7995WX is higher than the 5995WX's (5.1 GHz vs 4.5 GHz), the base clock of the 5995WX is 8% faster: 2.7 GHz vs 2.5 GHz for the 7995WX. This could mean the 5995WX benchmark runs faster than the 7995WX if both chips run the LLM threads at base clock instead of turbo.

walteralias

Can you do a 4090 with 70B and a longer context length, and also tokens per second for 70B at Q8 or above?

owens

Hello. Try your tests with SMT disabled.
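
On Linux, SMT can also be toggled at runtime instead of through the BIOS; a minimal sketch using the standard sysfs control file (kernel 4.19 or newer, root required) follows.

```python
# Linux exposes a runtime SMT toggle in sysfs (kernel 4.19+), so you can test with SMT
# off without a BIOS round trip. Requires root; verify the path exists on your distro.
from pathlib import Path

smt = Path("/sys/devices/system/cpu/smt/control")

print("SMT state before:", smt.read_text().strip())   # "on", "off", or "notsupported"
smt.write_text("off")                                  # run the benchmark, then write "on" to restore
print("SMT state after:", smt.read_text().strip())
```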

michaelbyrd

Great channel. Have you done any videos on creating a small GPU cluster? Obviously using 4 GPUs in a single chassis is better than four single-GPU systems in a cluster, but I am wondering where the tipping point is. For a hobbyist looking to upskill, it seems like learning to work with a cluster is worth the performance tradeoff, considering all deployments at scale are clusters.

ruzaroos

Maybe you should have a new SSD for the new chip and gpu and do a fresh install on it. Then compare the two setups.

codescholar

Yess I've wanted a video like this

init_yeah

If you have it enabled, try disabling IOMMU, and make sure CPU tuning is set to "Maximum Performance" in the BIOS. It would also be interesting to see results with and without hyper-threading enabled.
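
A quick way to confirm what the kernel actually ended up with after those BIOS changes is to read the standard sysfs and procfs entries; the sketch below treats zero IOMMU groups as a rough sign that the IOMMU is off.

```python
# Quick check of what the kernel actually sees after BIOS changes. Paths are standard
# Linux sysfs/procfs locations; "zero IOMMU groups" as a proxy for "IOMMU disabled" is
# a heuristic, not a guarantee.
from pathlib import Path

groups_dir = Path("/sys/kernel/iommu_groups")
groups = list(groups_dir.iterdir()) if groups_dir.exists() else []
print("IOMMU groups:", len(groups))          # typically 0 when the IOMMU is disabled

gov_file = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor")
if gov_file.exists():
    print("cpufreq governor:", gov_file.read_text().strip())  # "performance" pairs with a max-perf BIOS profile

print("kernel cmdline:", Path("/proc/cmdline").read_text().strip())  # look for iommu=... overrides
```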

ethanwaldo

I have a Lenovo ThinkPad T440p that is about 10 years old, running a GT 730M and shared integrated graphics (I have the i7 version). I am able to run an 8 billion parameter model with response times around 1 minute. I've also seen mini PCs on YouTube run 70 billion parameter models with shared graphics and very quick response times.

fizzyfizzgigcouple

With some super heavy optimization, all the possible strategies to make it more efficient, and some delulu spirit, we can make it work, buddy.

tedguy

Hello
Thank you so much for your time, and insights.
I am managing a team, and we are looking to build a server for ML.
Thank you for your previous video. May I ask where I can find the Threadripper configuration you are testing here? Is it a custom build?
Thanks

Vadinaka

You need to add an X570S board and a 3950X CPU with 128 GB of 3200 MHz RAM to your tests as a BASE. At 32 threads, AMD lost the balance of the CPUs chasing the marketing of SPEED. A big rig has a 600 hp diesel, but a sports car can have an 800 hp engine. The 800 hp car can't pull 40,000 lbs. Energy is energy; it's about digital leveraging balance. Add those buildable 3900X and 3950X systems.

GunMuse

You use bitsandbytes to quantize this model to int4, right? Why doesn't this PC use all of the RAM and VRAM?
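
If the setup uses GGUF quants under Ollama or llama.cpp (an assumption based on the model names in the chapters), bitsandbytes isn't involved; but for the int4-via-bitsandbytes path the question describes, the usual transformers pattern looks roughly like this, with an illustrative model ID:

```python
# Sketch of on-the-fly int4 quantization with bitsandbytes through transformers.
# Model ID, dtype, and settings are illustrative; if the video runs GGUF quants under
# Ollama/llama.cpp, bitsandbytes is not involved at all.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "meta-llama/Llama-3.1-8B-Instruct"   # small enough for one 24 GB card at 4-bit
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb,
    device_map="auto",   # only what the weights and cache need is allocated, which is why
)                        # a box never "uses all" of its RAM and VRAM for one small model

inputs = tok("Why does CPU offload slow down generation?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```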

thisrobotman

0:18 For the life of everything holy do not use a morph cut like this again or I will have nightmares of it in my dreams.

maxmustermann

Great video! Any recommendations for a small vision model on a 1080 Ti with 11GB? If not, what hardware would you say is the minimum required?
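
For an 11 GB card, a 7B-class vision-language model at 4-bit fits with plenty of headroom; below is a minimal sketch via the Ollama Python client, with the model tag and VRAM estimate as assumptions to verify.

```python
# A 7B-class vision-language model at 4-bit needs roughly 5 GB of VRAM, which leaves
# headroom on an 11 GB 1080 Ti. Sketch uses the Ollama Python client; the model tag
# and the VRAM estimate are assumptions to verify against the Ollama library.
import ollama

response = ollama.chat(
    model="llava:7b",
    messages=[{
        "role": "user",
        "content": "Describe what is in this image.",
        "images": ["./test.jpg"],   # local file path; the client handles the encoding
    }],
)
print(response["message"]["content"])
```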

MediaCreators

Massive amounts of CPU and RAM aren't going to do much when running a single model on a GPU. The GPU is still doing all the work.
The difference would be seen in multi-agent setups where there's a lot of orchestration involved across several models and GPUs, or when operating on the CPU itself and not the GPU.

go-dev-o