Llama 3.1 405b LOCAL AI Home Server on 7995WX Threadripper and 4090

Is running a 405B LLM on your home server possible? YES! Today we give it a spin on our 7995WX, and we also test the 3090 and 4090 head to head with the 6400 MHz DDR5 in this system. Two fun topics as we check out performance and uncover a performance mystery.
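
For readers who want to try the headline experiment themselves, a minimal sketch using the Ollama Python client is below; the exact backend, model tag, and quantization used in the video are assumptions here.

```python
# Minimal sketch: prompting a locally served Llama 3.1 405B through the Ollama Python client.
# Assumptions: Ollama is installed and running, and the "llama3.1:405b" tag has been pulled;
# the serving stack and quantization actually used in the video may differ.
import time

import ollama

start = time.time()
response = ollama.chat(
    model="llama3.1:405b",  # ~230 GB at 4-bit, so most of it runs from system RAM, not VRAM
    messages=[{"role": "user", "content": "Summarize what a KV cache does in two sentences."}],
)
print(response["message"]["content"])
print(f"wall time: {time.time() - start:.1f} s")  # expect minutes, not seconds, for a 405B on CPU+RAM
```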

QUAD 3090 AI SERVER BUILD
(the sTRX4 mount fits SP3, and the retention kit comes with the CAPELLIX)

Chapters
0:00 Threadripper 7995WX AI Testing
1:00 Running 405B Local AI Server
5:56 Qwen 2.5 32b on 4090
8:34 Llama 3.2 3b on 4090
10:17 Llama 3.1 8b fp16 on 4090
12:15 Qwen 2.5 32b on 3090
14:32 Llama 3.1 8b on 3090
15:47 Llama 3.2 3b on 3090

Be sure to 👍✅Subscribe✅👍 for more content like this!

Please share this video to help spread the word and drop a comment below with your thoughts or questions. Thanks for watching!

Digital Spaceport Website

🛒Shop (Channel members get a 3% or 5% discount)

*****
As an Amazon Associate I earn from qualifying purchases.

When you click on links to various merchants on this site and make a purchase, this can result in this site earning a commission. Affiliate programs and affiliations include, but are not limited to, the eBay Partner Network.
*****
Comments

Great content!

My conclusion is that AI-at-home is profoundly I/O bound -- MMA VRAM capacity specifically. Pat Gelsinger cancelled Beast Lake (a.k.a. "Royal Core") because he doesn't think the world needs powerful CPUs anymore. A capable CPU & nominally clocked RAM are required but nothing spicy seems to be necessary.

For I/O capability, TRP is the most I can afford. Entry into TRP is a 24-core CPU. That works.

4 x 5090s (128 GB combined VRAM; NVIDIA's popular stack) seems to be a reasonable target for 2025-2026, albeit still insufficient for responsiveness with the larger models.

(1) engage an electrician to add 2 new 20A circuits & put two new UPSs in my home office
(2) buy/build a rig capable of holding 4 x GPUs -- 2 power supplies
(3) custom loop water cooling (external radiator?)
(4) get a second job & change my diet to Ramen-only to be able to pay for all of this
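
A rough back-of-the-envelope check of the bandwidth-bound point above: during token generation, throughput tops out at roughly memory bandwidth divided by the bytes of weights streamed per token. The figures in this sketch are illustrative assumptions, not measurements from the video.

```python
# Upper bound on token generation speed for a memory-bandwidth-bound system:
# tokens/s <= usable memory bandwidth / bytes of weights streamed per token (~ model size).
# The bandwidth and model-size numbers below are rough assumptions, not measurements.

def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

scenarios = {
    "405B q4 (~230 GB) from 8-channel DDR5-6400 (~400 GB/s peak)": (400.0, 230.0),
    "32B q4 (~19 GB) entirely in a 4090's VRAM (~1000 GB/s)": (1000.0, 19.0),
}

for name, (bw, size) in scenarios.items():
    print(f"{name}: <= {max_tokens_per_second(bw, size):.1f} tok/s")
```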

markldevine

We need to start rating AIs with a resource score that indicates how much the hardware to run them would cost.
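
One way such a resource score could be seeded is sketched below: estimate memory footprint from parameter count and quantization, then bucket it into a hardware class. The overhead factor and thresholds are arbitrary assumptions.

```python
# Toy "resource score" along the lines suggested above: estimate a model's memory
# footprint from parameter count and quantization, then bucket it by the hardware
# class that could hold it. The overhead factor and thresholds are rough assumptions.

def required_gb(params_billions: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    # weights plus ~20% for KV cache and activations at modest context (assumption)
    return params_billions * bits_per_weight / 8 * overhead

def hardware_class(gb: float) -> str:
    if gb <= 12:
        return "mainstream GPU (12 GB)"
    if gb <= 24:
        return "high-end consumer GPU (24 GB)"
    if gb <= 96:
        return "multi-GPU workstation"
    return "CPU plus huge system RAM, or a GPU cluster"

for name, params, bits in [("Llama 3.2 3B", 3, 4), ("Qwen 2.5 32B", 32, 4),
                           ("Llama 3.1 70B", 70, 4), ("Llama 3.1 405B", 405, 4)]:
    gb = required_gb(params, bits)
    print(f"{name}: ~{gb:.0f} GB -> {hardware_class(gb)}")
```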

HectorDiabolucus

This video shows clearly that you shouldn't run a 405B model at home. Even on a machine with the fastest CPU, the 405B model takes 20 minutes to answer. What you need at home is a graphics card with a lot of VRAM. I wish cards with 48GB of VRAM were manufactured and sold on older chipsets like the 3070 or 3080. And if it were possible, I'd love for graphics card companies to build and sell cards with 480GB of VRAM.

lietz

You should run a benchmark like STREAM TRIAD to see what your actual memory bandwidth is for vector math vs the theoretical bandwidth. In a recent test of a bunch of EPYC chips, many models had much lower TRIAD results than their theoretical memory bandwidth.

BTW, while it might not make for as entertaining videos, in general, if you use `llama-bench` as a standardized benchmark for all your different testing, I bet you'll get more useful/standardized results for your spreadsheet (I also recommend recording the build number for your tests).

For those interested, on a 24C EPYC 9274F (395 GB/s TRIAD vs 460.8 GB/s theoretical), with a Qwen2.5-32B q4_0 GGUF and llama-bench (b3985), CPU-only gives pp/tg of 24.7 t/s and 8 t/s. With a W7900 (ROCm) with a display attached, I get pp/tg of 656.5 t/s and 25.9 t/s. If I run with `-ngl 0` -- using the Navi31 (gfx1100) GPU for compute, but with the model loaded into system memory, not VRAM -- I am able to get pp/tg of 344.3 t/s and 7.6 t/s, which is actually not bad (e.g., even if you didn't have enough memory to load any layers onto the GPU, you'd still get a pretty huge compute benefit and barely any loss on MBW).
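
For anyone who wants a quick approximation of the TRIAD figure without building STREAM itself, here is a rough NumPy sketch; it will understate a tuned OpenMP STREAM result but still shows the gap between theoretical and achieved bandwidth.

```python
# Quick-and-dirty TRIAD-style bandwidth probe with NumPy, in the spirit of STREAM.
# Not the official STREAM binary: NumPy evaluates a = b + scalar*c in two passes, so
# this understates a tuned OpenMP result, but it still exposes the gap between
# theoretical and achieved bandwidth.
import time
import numpy as np

N = 100_000_000                      # ~0.8 GB per float64 array, far larger than any CPU cache
a = np.zeros(N)
b = np.random.rand(N)
c = np.random.rand(N)
scalar = 3.0

best = 0.0
for _ in range(5):
    t0 = time.perf_counter()
    np.multiply(c, scalar, out=a)    # a = scalar * c
    np.add(a, b, out=a)              # a = a + b  ->  TRIAD: a = b + scalar * c
    dt = time.perf_counter() - t0
    gb_moved = 3 * N * 8 / 1e9       # STREAM convention: count arrays a, b, c once each
    best = max(best, gb_moved / dt)

print(f"best TRIAD-style bandwidth: {best:.0f} GB/s")
```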

lhl

It would be interesting if you tested this with a 7900 XTX in combination with ROCm as well.

---David---

Re: why do some 5995WX benchmarks run faster than 7995WX?

Even though the turbo speed of the 7995WX is higher than the 5995WX's (5.1 GHz vs 4.5 GHz), the base clock of the 5995WX is 8% faster: 2.7 GHz vs 2.5 GHz for the 7995WX. This could mean the 5995WX benchmark runs faster than the 7995WX if both chips run the LLM threads at base clock instead of turbo.

walteralias

Can you do a 4090 with 70B and a longer context length, and also tokens per second for 70B at Q8 or above?

owens

Hello. Try your tests with SMT disabled.
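
On Linux, SMT can also be toggled at runtime instead of through the BIOS; a minimal sketch using the standard sysfs control file (kernel 4.19 or newer, root required) follows.

```python
# Linux exposes a runtime SMT toggle in sysfs (kernel 4.19+), so you can test with SMT
# off without a BIOS round trip. Requires root; verify the path exists on your distro.
from pathlib import Path

smt = Path("/sys/devices/system/cpu/smt/control")

print("SMT state before:", smt.read_text().strip())   # "on", "off", or "notsupported"
smt.write_text("off")                                  # run the benchmark, then write "on" to restore
print("SMT state after:", smt.read_text().strip())
```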

michaelbyrd

Great channel. Have you done any videos on creating a small GPU cluster? Obviously using 4 GPUs in a single chassis is better than four single-GPU systems in a cluster, but I am wondering where the tipping point is. For a hobbyist looking to upskill, it seems like learning to work with a cluster is worth the performance tradeoff, considering all deployments at scale are clusters.

ruzaroos

Maybe you should have a new SSD for the new chip and gpu and do a fresh install on it. Then compare the two setups.

codescholar

Yess I've wanted a video like this

init_yeah

If you have it enabled, try disabling IOMMU, and make sure CPU tuning is set to "Maximum Performance" in the BIOS. It would also be interesting to see results with and without hyper-threading enabled.
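
A quick way to confirm what the kernel actually ended up with after those BIOS changes is to read the standard sysfs and procfs entries; the sketch below treats zero IOMMU groups as a rough sign that the IOMMU is off.

```python
# Quick check of what the kernel actually sees after BIOS changes. Paths are standard
# Linux sysfs/procfs locations; "zero IOMMU groups" as a proxy for "IOMMU disabled" is
# a heuristic, not a guarantee.
from pathlib import Path

groups_dir = Path("/sys/kernel/iommu_groups")
groups = list(groups_dir.iterdir()) if groups_dir.exists() else []
print("IOMMU groups:", len(groups))          # typically 0 when the IOMMU is disabled

gov_file = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor")
if gov_file.exists():
    print("cpufreq governor:", gov_file.read_text().strip())  # "performance" pairs with a max-perf BIOS profile

print("kernel cmdline:", Path("/proc/cmdline").read_text().strip())  # look for iommu=... overrides
```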

ethanwaldo

I have a Lenovo ThinkPad T440p that is about 10 years old, running a GT 730M and shared integrated graphics (I have the i7 version). I am able to run an 8 billion parameter model with response times around 1 minute. I've also seen mini PCs on YouTube run 70 billion parameter models with shared graphics and very quick response times.

fizzyfizzgigcouple

With some super heavy optimization, all the possible strategies to make it more efficient, and some delulu spirit, we can make it work, buddy.

tedguy

Hello
Thank you so much for your time, and insights.
I am managing a team, and we are looking to build a server for ML.
Thank you for your previous video. May I ask where I can find the Threadripper configuration you are testing here? Is it a custom build?
Thanks

Vadinaka

You need to add an X570S board and a 3950X CPU with 128 GB of 3200 MHz RAM to your tests as a BASE. At 32 threads, AMD lost the balance of the CPUs chasing the marketing of SPEED. A big rig has a 600 hp diesel, but a sports car can have an 800 hp engine. The 800 hp car can't pull 40,000 lbs. Energy is energy; it's about digital leveraging balance. Add those buildable 3900X and 3950X systems.

GunMuse

You use bitsandbytes to quantize this model to int4, right? Why doesn't this PC use all of the RAM and VRAM?
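
If the setup uses GGUF quants under Ollama or llama.cpp (an assumption based on the model names in the chapters), bitsandbytes isn't involved; but for the int4-via-bitsandbytes path the question describes, the usual transformers pattern looks roughly like this, with an illustrative model ID:

```python
# Sketch of on-the-fly int4 quantization with bitsandbytes through transformers.
# Model ID, dtype, and settings are illustrative; if the video runs GGUF quants under
# Ollama/llama.cpp, bitsandbytes is not involved at all.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "meta-llama/Llama-3.1-8B-Instruct"   # small enough for one 24 GB card at 4-bit
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb,
    device_map="auto",   # only what the weights and cache need is allocated, which is why
)                        # a box never "uses all" of its RAM and VRAM for one small model

inputs = tok("Why does CPU offload slow down generation?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```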

thisrobotman

0:18 For the life of everything holy do not use a morph cut like this again or I will have nightmares of it in my dreams.

maxmustermann

Great video! Any recommendations for a small vision model on a 1080 Ti with 11GB? If not, what hardware would you say is the minimum required?
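
For an 11 GB card, a 7B-class vision-language model at 4-bit fits with plenty of headroom; below is a minimal sketch via the Ollama Python client, with the model tag and VRAM estimate as assumptions to verify.

```python
# A 7B-class vision-language model at 4-bit needs roughly 5 GB of VRAM, which leaves
# headroom on an 11 GB 1080 Ti. Sketch uses the Ollama Python client; the model tag
# and the VRAM estimate are assumptions to verify against the Ollama library.
import ollama

response = ollama.chat(
    model="llava:7b",
    messages=[{
        "role": "user",
        "content": "Describe what is in this image.",
        "images": ["./test.jpg"],   # local file path; the client handles the encoding
    }],
)
print(response["message"]["content"])
```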

MediaCreators

Massive amounts of CPU and RAM aren't going to do much when running a single model on a GPU. The GPU is still doing all the work.
The difference would be seen in multi-agent setups where there's a lot of orchestration involved across several models and GPUs, or when operating on the CPU itself and not the GPU.

go-dev-o