Run Local LLMs on Hardware from $50 to $50,000 - We Test and Compare!


I’m doing something that’s never been done before: we’re going to run a ChatGPT-style large language model locally on a wide range of hardware, from a $50 Raspberry Pi 4 all the way up to a $50,000 Dell AI workstation with dual Nvidia 6000 Ada cards.

If you saw my last video, you know I caught some heat for using top-tier hardware and running everything in WSL on Windows. This time, I’m doing things differently. We’re starting small and budget-friendly—no Linux shenanigans—and testing out local inference on everything from a Pi to a high-performance mini PC, a gaming rig, an M2 Mac Pro, and, of course, that beast of a Dell workstation.

Along the way, I’ll show you how to install Ollama on Windows, and we’ll compare how well each machine can handle models like Llama 3.1, up to the monstrous 405-billion-parameter variant. Which system will shine? Which one will falter? Can a Raspberry Pi even handle a large language model at all? And what happens when we push the $50,000 workstation to its limit?
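
If you want to reproduce the comparison on your own hardware, a rough benchmark is easy to script. The sketch below is not from the video: it assumes a local Ollama install already serving on its default port (11434) with the model pulled, and it uses the eval_count and eval_duration fields that Ollama's /api/generate endpoint reports to compute tokens per second.

```python
# Minimal tokens-per-second probe against a local Ollama server.
# Assumes `ollama serve` is running on the default port and the model
# has already been pulled (e.g. `ollama pull llama3.1`).
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def tokens_per_second(model: str, prompt: str) -> float:
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    # eval_count = tokens generated, eval_duration = generation time in nanoseconds
    return result["eval_count"] / (result["eval_duration"] / 1e9)

if __name__ == "__main__":
    rate = tokens_per_second("llama3.1", "Why is the sky blue?")
    print(f"~{rate:.1f} tokens/s")
```

Running the same prompt and model tag on each machine gives a rough apples-to-apples comparison.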

If you’ve ever wondered what it takes to run a large language model locally, or just want to see how different hardware stacks up, this episode is for you! Be sure to stick around to the end for some surprising results.

💻 Hardware tested in this episode:

Raspberry Pi 4 (8GB RAM)
Orion Herk Mini PC (Ryzen 9 7940HS)
Desktop Gaming PC (Threadripper 3970X & Nvidia 4080)
Apple M2 Mac Pro
Dell Threadripper Workstation (96 cores & Nvidia 6000 Ada)

Check out Dave’s Attic for behind-the-scenes Q&A on episodes like this one!

Follow me on Facebook for daily updates!
Twitter: @davepl1968

Comments

I'm here for the moment when the Pi says: "I can't do that, Dave"

DataIsBeautifulOfficial

Dave, I appreciate your mindfulness of how valuable our time is and editing this vid down to a reasonable time frame.

LilaHikes

The main (and almost only) factor for speed is memory bandwidth. Every token is generated by pulling the entire model from RAM and doing a bit of math to it. An 8 GB model on a 12 GB RTX 3060 Ti with 6 memory channels (2 GB each) gets 448 GB/s, which works out to about 50 tokens/s (accounting for some overhead). That's why GPUs are so fast. If you have 2 channels of DDR4-3200 memory, you have 51.2 GB/s, so you'll get about 6 tokens/s, or around 1 token/s on a ~48 GB Llama 3 70B model with 4-bit quantization. DDR5 helps a lot, and so does having more than 2 channels. The CPU doesn't really matter. (Unless you're limited to 2933 MHz by a shoddy memory controller in a Ryzen 2600 and upgrade to a 5600X, getting a 22% boost by pushing your DDR4 to 3600 MHz.)

wozaiwodejia
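
The bandwidth arithmetic in the comment above is easy to sanity-check. The sketch below uses illustrative numbers taken from the comment, not measurements from the video: dividing memory bandwidth by model size gives a theoretical tokens-per-second ceiling.

```python
# Rough ceiling: every token streams the whole quantized model through memory
# once, so tokens/s can't exceed bandwidth divided by model size.
configs = [
    ("GPU with 448 GB/s, 8 GB model", 448.0, 8.0),
    ("Dual-channel DDR4-3200 (51.2 GB/s), 8 GB model", 51.2, 8.0),
    ("Dual-channel DDR4-3200 (51.2 GB/s), 48 GB 70B Q4 model", 51.2, 48.0),
]
for name, bandwidth_gbps, model_gb in configs:
    ceiling = bandwidth_gbps / model_gb  # theoretical upper bound, tokens/s
    print(f"{name}: ~{ceiling:.1f} tokens/s ceiling")
```

The real-world figures quoted in the comment (~50, ~6, ~1 tokens/s) sit a little below these ceilings because of overhead.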

I found this video both informative and entertaining! I chuckled when you mentioned that it made you sad to see that big boss PC struggling. Great video as always, Dave!

martyb

Thanks for updating and including budget-friendly options.

XTCD

Having failed to get the web server running in your previous WSL demo, I removed everything in frustration. Great to see it works equally well from the command line under Windows. I now have AI on my laptop (8 GB RAM, no GPU), something I never thought possible! Thanks for showing something for everyone.

chrisdulledge

Thanks Dave, I really appreciate the time you spend to make these videos for us. Really enjoy these geeky rabbitholes.

drelephanttube

Hey Dave, in your next LLM tutorial, can you give us a demo on how to connect external data sources to it? I'm struggling to wrap my brain around it.

Ultimatebubs

Superb content. Not many channels with this amount of quality in terms of delivery.

Madgod

11:00 - I believe you've been running the 8B model if you're pulling 3.1 latest. I could be wrong, but I believe latest defaults to the 8B flavor.

matt_b...

Dave, thank you for running those tests for us. While I am currently working with GPT through the web browser and looking forward to switching to the API, it is becoming more and more clear that the frameworks involved might hit hard limitations sooner rather than later, and running a local model will be my only option in the future. Seeing that it is feasible even today is very reassuring!

DJCatmom

Thanks Dave! Really appreciate your time and energy on this topic. I was playing with the previous video yesterday and thought, "man, I hope he does a little more on this"... and BAM, you did. THANK YOU!

speed

I very much believe that local LLMs are an answer to privacy concerns in the future. As long as a large group of open testers materializes, we can also try to remove bias as best we can.

EhdrianEh

I rather liked that you demonstrated with WSL, as I was able to follow along on my Ubuntu server.

seanwright

The Llama 3.2 1B and 3B models run surprisingly well using Ollama on my OrangePi 5+ with its 8-core RK3588 processor and 8 GB of RAM. Both models generate tokens at speeds that match or exceed normal human speech. I believe the additional cores make a big difference. I also want to test these models on the Radxa X4 (8 GB, N100 processor).

LanningRon
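
A quick way to put the "matches human speech" claim in perspective: conversational English runs around 150 words per minute, and English text averages very roughly 1.3 tokens per word with common BPE tokenizers. Both figures in the snippet below are rough rules of thumb, not measurements from the video.

```python
# Convert a typical speaking rate into tokens/s to see what "keeps up with
# speech" means for a small model on a single-board computer.
words_per_minute = 150   # rough conversational speaking rate
tokens_per_word = 1.3    # rough average for English BPE tokenizers
speech_rate = words_per_minute / 60 * tokens_per_word
print(f"Speech pace ≈ {speech_rate:.1f} tokens/s")  # about 3.3 tokens/s
```

So even a few tokens per second from a 1B or 3B model is enough to stay ahead of a listener.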

As someone who gave you "heat" in the last video, thank you for the follow-up!

nittany

I'm so glad you're doing a hardware comparison. I watched your previous video and wanted this immediately.

alastorclark

🎯 Key points for quick navigation:

00:00:00 *💡 Introduction & Overview*
- Introduction to testing LLMs on different hardware setups, ranging from $50 to $50,000,
- Motivation for addressing viewers' requests for more budget-friendly hardware and direct Windows installation.
00:00:43 *🐢 Running on Raspberry Pi 4*
- Attempt to run LLaMA on a Raspberry Pi 4 with 8 GB of RAM,
- Installed on Raspbian, demonstrated extremely slow performance, impractical for real-time use.
00:03:27 *🔄 Testing on Consumer Mini PC (Orion Herk)*
- Upgraded to a $676 Mini PC with a Ryzen 9 7940HS and Radeon 780M iGPU,
- Faster performance than the Raspberry Pi, but the model could not fit in GPU memory, so it relied on the CPU instead.
00:07:50 *🎮 Desktop Gaming PC with Nvidia 4080*
- Running the LLM on a 3970X Threadripper with Nvidia 4080 using WSL 2,
- GPU offloading enabled faster performance, similar to ChatGPT, demonstrating good use of available hardware.
00:09:42 *🍎 Mac Pro M2 Ultra Testing*
- Tested on Mac Pro with M2 Ultra and 128 GB unified memory,
- Model ran efficiently with GPU usage around 50%, producing rapid responses, demonstrating M2 Ultra’s suitability for LLMs.
00:10:51 *🚀 High-End 96-Core Threadripper & Nvidia 6000 Ada*
- Attempt to run a 405-billion-parameter model on an overclocked Threadripper with Nvidia 6000 Ada,
- Performance lagged significantly, highlighting that the largest models can struggle even on high-end workstation hardware.
00:13:12 *⚡ Efficient Model on High-End Hardware*
- Switching to a smaller, more efficient LLaMA 3.2 model on the high-end setup,
- Demonstrated much better performance, producing rapid answers in real-time, highlighting the importance of model size optimization.
00:14:33 *📢 Conclusion & Call to Action*
- Summary of testing LLMs on various hardware from low-end to high-end,
- Encouraged viewers to subscribe and check out more content, highlighting the educational and entertainment aspects of the video.

Made with HARPA AI

warezit

Hey Dave - at 11:00, with a sub-5 GB download there's no amount of quantization that could possibly fit 70B parameters. It's 100% the 8B model, probably at Q4_0 quantization, which is pretty aggressive and kinda lossy. You were running pretty much the smallest version possible.

Steamrick
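
The size argument in the comment above checks out with simple arithmetic: file size ≈ parameters × bits per weight ÷ 8. The sketch below uses ~4.5 effective bits per weight for 4-bit quantization to account for scale metadata; that figure is an approximation, not an exact spec.

```python
# Approximate model file sizes for different parameter counts and quantizations.
def approx_size_gb(params_billions: float, bits_per_weight: float) -> float:
    # params_billions * 1e9 weights * (bits/8) bytes, expressed in GB (1e9 bytes)
    return params_billions * bits_per_weight / 8

for params, bits, label in [
    (8, 4.5, "8B at ~4-bit"),
    (8, 16, "8B at FP16"),
    (70, 4.5, "70B at ~4-bit"),
    (405, 4.5, "405B at ~4-bit"),
]:
    print(f"{label}: ~{approx_size_gb(params, bits):.0f} GB")
```

A sub-5 GB download therefore has to be the 8B model; even at 4-bit, 70B needs roughly 40 GB and 405B well over 200 GB.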

I loved seeing how AI can bring super hardware to its knees. It instantly demonstrates why the AI cutting edge is moving to Blackwell and Rubin. Many thanks for this demo.

Billwzw