Run Local LLMs on Hardware from $50 to $50,000 - We Test and Compare!


I’m doing something that’s never been done before: we’re going to run a ChatGPT-style large language model locally on a wide range of hardware, from a $50 Raspberry Pi 4 all the way up to a $50,000 Dell AI workstation with dual Nvidia 6000 Ada cards.

If you saw my last video, you know I caught some heat for using top-tier hardware and running everything in WSL on Windows. This time, I’m doing things differently. We’re starting small and budget-friendly—no Linux shenanigans—and testing out local inference on everything from a Pi to a high-performance mini PC, a gaming rig, an M2 Mac Pro, and, of course, that beast of a Dell workstation.

Along the way, I’ll show you how to install Ollama on Windows, and we’ll compare how well each machine can handle models like Llama 3.1, up to the monstrous 405-billion-parameter variant. Which system will shine? Which one will falter? Can a Raspberry Pi even handle a large language model at all? And what happens when we push the $50,000 workstation to its limit?
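
If you want to reproduce the comparison on your own hardware, a rough benchmark is easy to script. The sketch below is not from the video: it assumes a local Ollama install already serving on its default port (11434) with the model pulled, and it uses the eval_count and eval_duration fields that Ollama's /api/generate endpoint reports to compute tokens per second.

```python
# Minimal tokens-per-second probe against a local Ollama server.
# Assumes `ollama serve` is running on the default port and the model
# has already been pulled (e.g. `ollama pull llama3.1`).
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def tokens_per_second(model: str, prompt: str) -> float:
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    # eval_count = tokens generated, eval_duration = generation time in nanoseconds
    return result["eval_count"] / (result["eval_duration"] / 1e9)

if __name__ == "__main__":
    rate = tokens_per_second("llama3.1", "Why is the sky blue?")
    print(f"~{rate:.1f} tokens/s")
```

Running the same prompt and model tag on each machine gives a rough apples-to-apples comparison.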

If you’ve ever wondered what it takes to run a large language model locally, or just want to see how different hardware stacks up, this episode is for you! Be sure to stick around to the end for some surprising results.

💻 Hardware tested in this episode:

Raspberry Pi 4 (8GB RAM)
Orion Herk Mini PC (Ryzen 9 7940HS)
Desktop Gaming PC (Threadripper 3970X & Nvidia 4080)
Apple M2 Mac Pro
Dell Threadripper Workstation (96 cores & Nvidia 6000 Ada)

Check out Dave’s Attic for behind-the-scenes Q&A on episodes like this one!

Follow me on Facebook for daily updates!
Twitter: @davepl1968

Comments

I'm here for the moment when the Pi says: "I can't do that, Dave"

DataIsBeautifulOfficial

Dave, I appreciate your mindfulness of how valuable our time is and editing this vid down to a reasonable time frame.

LilaHikes

The main (and almost only) factor for speed is memory bandwidth. Every token is generated by pulling the entire model from RAM and doing a bit of math to it. An 8 GB model on a 12 GB RTX 3060 Ti with 6 memory channels (2 GB each) gets 448 GB/s, which works out to about 50 tokens/s (accounting for some overhead). That's why GPUs are so fast. If you have 2 channels of DDR4-3200 memory, you have 51.2 GB/s, so you'll get about 6 tokens/s, or around 1 token/s on a ~48 GB Llama 3 70B model with 4-bit quantization. DDR5 helps a lot, and so does having more than 2 channels. The CPU doesn't really matter. (Unless you're limited to 2933 MHz by a shoddy memory controller in a Ryzen 2600 and upgrade to a 5600X, getting a 22% boost by pushing your DDR4 to 3600 MHz.)

wozaiwodejia
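
The bandwidth arithmetic in the comment above is easy to sanity-check. The sketch below uses illustrative numbers taken from the comment, not measurements from the video: dividing memory bandwidth by model size gives a theoretical tokens-per-second ceiling.

```python
# Rough ceiling: every token streams the whole quantized model through memory
# once, so tokens/s can't exceed bandwidth divided by model size.
configs = [
    ("GPU with 448 GB/s, 8 GB model", 448.0, 8.0),
    ("Dual-channel DDR4-3200 (51.2 GB/s), 8 GB model", 51.2, 8.0),
    ("Dual-channel DDR4-3200 (51.2 GB/s), 48 GB 70B Q4 model", 51.2, 48.0),
]
for name, bandwidth_gbps, model_gb in configs:
    ceiling = bandwidth_gbps / model_gb  # theoretical upper bound, tokens/s
    print(f"{name}: ~{ceiling:.1f} tokens/s ceiling")
```

The real-world figures quoted in the comment (~50, ~6, ~1 tokens/s) sit a little below these ceilings because of overhead.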

I found this video both informative and entertaining! I chuckled when you mentioned that it made you sad to see that big boss PC struggling. Great video as always, Dave!

martyb

Thanks for updating and including budget-friendly options.

XTCD

Having failed to get the web server running in your previous WSL demo, I removed everything in frustration. Great to see it works equally well from the command line under Windows. I now have AI on my laptop (8 GB RAM, no GPU), something I never thought possible! Thanks for showing something for everyone.

chrisdulledge

Thanks Dave, I really appreciate the time you spend to make these videos for us. Really enjoy these geeky rabbitholes.

drelephanttube

Hey Dave, in your next LLM tutorial, can you give us a demo on how to connect external data sources to it? I'm struggling to wrap my brain around it.

Ultimatebubs

Superb content. Not many channels with this amount of quality in terms of delivery.

Madgod

11:00 - I believe you've been running the 8B model if you're pulling 3.1 latest. I could be wrong, but I believe latest defaults to the 8B flavor.

matt_b...

Dave, thank you for running those tests for us. While I am currently working with GPT through the web browser and looking forward to switching to the API, it is becoming more and more clear that the frameworks involved might hit hard limitations sooner rather than later, and running a local model will be my only option in the future. Seeing that it is feasible even today is very reassuring!

DJCatmom

Thanks Dave! Really appreciate your time and energy on this topic. I was playing with the previous video yesterday and thought, "man, I hope he does a little more on this"... and BAM, you did. THANK YOU!

speed

I very much believe that local LLMs are an answer to privacy concerns in the future. As long as a large group of open testers materializes, we can also try to remove bias as best we can.

EhdrianEh

I rather liked that you demonstrated with WSL, as I was able to follow along on my Ubuntu server.

seanwright

The Llama 3.2 1B and 3B models run surprisingly well using Ollama on my OrangePi 5+ with its 8-core RK3588 processor and 8 GB of RAM. Both models generate tokens at speeds that match or exceed normal human speech. I believe the additional cores make a big difference. I also want to test these models on the Radxa X4 (8 GB, N100 processor).

LanningRon
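
A quick way to put the "matches human speech" claim in perspective: conversational English runs around 150 words per minute, and English text averages very roughly 1.3 tokens per word with common BPE tokenizers. Both figures in the snippet below are rough rules of thumb, not measurements from the video.

```python
# Convert a typical speaking rate into tokens/s to see what "keeps up with
# speech" means for a small model on a single-board computer.
words_per_minute = 150   # rough conversational speaking rate
tokens_per_word = 1.3    # rough average for English BPE tokenizers
speech_rate = words_per_minute / 60 * tokens_per_word
print(f"Speech pace ≈ {speech_rate:.1f} tokens/s")  # about 3.3 tokens/s
```

So even a few tokens per second from a 1B or 3B model is enough to stay ahead of a listener.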

As someone who gave you "heat" in the last video, thank you for the follow-up!

nittany

I'm so glad you're doing a hardware comparison. I watched your previous video and wanted this immediately.

alastorclark

🎯 Key points for quick navigation:

00:00:00 *💡 Introduction & Overview*
- Introduction to testing LLMs on different hardware setups, ranging from $50 to $50,000,
- Motivation for addressing viewers' requests for more budget-friendly hardware and direct Windows installation.
00:00:43 *🐢 Running on Raspberry Pi 4*
- Attempt to run LLaMA on a Raspberry Pi 4 with 8 GB of RAM,
- Installed on Raspbian, demonstrated extremely slow performance, impractical for real-time use.
00:03:27 *🔄 Testing on Consumer Mini PC (Orion Herk)*
- Upgraded to a $676 Mini PC with a Ryzen 9 7940HS and Radeon 780M iGPU,
- Faster performance than the Raspberry Pi, but the model could not fit in GPU memory, so it relied on the CPU instead.
00:07:50 *🎮 Desktop Gaming PC with Nvidia 4080*
- Running the LLM on a 3970X Threadripper with Nvidia 4080 using WSL 2,
- GPU offloading enabled faster performance, similar to ChatGPT, demonstrating good use of available hardware.
00:09:42 *🍎 Mac Pro M2 Ultra Testing*
- Tested on Mac Pro with M2 Ultra and 128 GB unified memory,
- Model ran efficiently with GPU usage around 50%, producing rapid responses, demonstrating M2 Ultra’s suitability for LLMs.
00:10:51 *🚀 High-End 96-Core Threadripper & Nvidia 6000 Ada*
- Attempt to run a 405-billion-parameter model on an overclocked Threadripper with Nvidia 6000 Ada,
- Performance lagged significantly, highlighting that the largest models can struggle even on high-end workstation hardware.
00:13:12 *⚡ Efficient Model on High-End Hardware*
- Switching to a smaller, more efficient LLaMA 3.2 model on the high-end setup,
- Demonstrated much better performance, producing rapid answers in real-time, highlighting the importance of model size optimization.
00:14:33 *📢 Conclusion & Call to Action*
- Summary of testing LLMs on various hardware from low-end to high-end,
- Encouraged viewers to subscribe and check out more content, highlighting the educational and entertainment aspects of the video.

Made with HARPA AI

warezit

Hey Dave - at 11:00, with a sub-5 GB download there's no amount of quantization that could possibly fit 70B parameters. It's 100% the 8B model, probably at Q4_0 quantization, which is pretty aggressive and kinda lossy. You were running pretty much the smallest version possible.

Steamrick
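
The size argument in the comment above checks out with simple arithmetic: file size ≈ parameters × bits per weight ÷ 8. The sketch below uses ~4.5 effective bits per weight for 4-bit quantization to account for scale metadata; that figure is an approximation, not an exact spec.

```python
# Approximate model file sizes for different parameter counts and quantizations.
def approx_size_gb(params_billions: float, bits_per_weight: float) -> float:
    # params_billions * 1e9 weights * (bits/8) bytes, expressed in GB (1e9 bytes)
    return params_billions * bits_per_weight / 8

for params, bits, label in [
    (8, 4.5, "8B at ~4-bit"),
    (8, 16, "8B at FP16"),
    (70, 4.5, "70B at ~4-bit"),
    (405, 4.5, "405B at ~4-bit"),
]:
    print(f"{label}: ~{approx_size_gb(params, bits):.0f} GB")
```

A sub-5 GB download therefore has to be the 8B model; even at 4-bit, 70B needs roughly 40 GB and 405B well over 200 GB.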

I loved seeing how AI can bring super hardware to its knees. It instantly demonstrates why the AI cutting edge is moving to Blackwell and Rubin. Many thanks for this demo.

Billwzw