Cheap mini runs a 70B LLM 🤯

I put 96GB of RAM in this tiny mini PC and ran Llama 70B LLM on it.
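One way to reproduce this kind of setup at home is llama-cpp-python with a 4-bit GGUF; a minimal sketch, where the file name, context size and thread count are illustrative assumptions rather than the exact configuration used in the video:

    # Minimal llama-cpp-python sketch for a ~40 GB Q4_K_M 70B GGUF in 96 GB of RAM.
    # File name and settings are illustrative assumptions, not the video's exact setup.
    from llama_cpp import Llama

    llm = Llama(
        model_path="Meta-Llama-3-70B-Instruct.Q4_K_M.gguf",  # hypothetical local file
        n_ctx=4096,    # keep the context modest so the KV cache stays small
        n_threads=8,   # tune to the number of performance cores
    )

    out = llm("Q: Why is memory bandwidth the bottleneck for local LLMs?\nA:",
              max_tokens=128)
    print(out["choices"][0]["text"])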
Chair: Doro S100 Chair - enjoy 6% OFF: YTBZIS

🛒 Gear Links 🛒

🎥 Related Videos 🎥

— — — — — — — — —

❤️ SUBSCRIBE TO MY YOUTUBE CHANNEL 📺

— — — — — — — — —

Join this channel to get access to perks:

— — — — — — — — —

#sihoo #sihoodoros100 #DefyGravityWithSihoo
#machinelearning #llm #softwaredevelopment
Comments

I wouldn't necessarily say that this PC can "run" a 70B model.
It can walk one for sure...

shapelessed

This is my favorite sub-sub-subgenre, because figuring out how to run LLMs on consumer equipment with fast & smart models is hard today. Gaming GPUs (too small) and Mac Studios (too expensive) are stopgap solutions. I think these will have huge application in business once Groq-like chips are available and we don't have to send most LLM requests to frontier models.

pythonlibrarian

I don't want to disappoint you, but I'm quite sure you would get the same 1.4 t/s running a 70B-parameter model purely on the CPU, and it would use half of the memory. So theoretically you would be able to run a 180B model (q4_K_M version) on the CPU. The thing is that on current PCs the limiting factor is not compute power but memory bandwidth, and since the iGPU and CPU use the same memory, you get very similar speeds. Make a follow-up video; maybe I'm wrong, and if so I'll be happy to learn that.
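Rough math behind that claim, as a sketch (the bandwidth and model-size figures below are assumed round numbers, not measurements): each generated token streams roughly the whole set of quantized weights through memory, so decode speed is capped at bandwidth divided by model size, regardless of whether the CPU or the iGPU does the multiplies.

    # Decode-speed ceiling: tokens/s ≈ memory bandwidth / bytes read per token.
    # Both figures below are illustrative assumptions.

    def est_tokens_per_s(model_gb: float, bandwidth_gb_s: float) -> float:
        """Each generated token streams roughly the whole quantized model from RAM."""
        return bandwidth_gb_s / model_gb

    llama70b_q4_gb = 40.0          # ~70B params at ~0.56 bytes/param for Q4_K_M
    dual_channel_ddr5_gb_s = 80.0  # assumed peak; real-world throughput is lower

    print(f"~{est_tokens_per_s(llama70b_q4_gb, dual_channel_ddr5_gb_s):.1f} t/s ceiling")
    # -> ~2.0 t/s ceiling, so the measured ~1.4 t/s is bandwidth-bound, and CPU vs.
    #    iGPU makes little difference when both sit on the same RAM.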

perelmanych

I wonder if it would be capable of running Mixtral 8x22B. Does anybody have experience with that? How fast would it be if it can run it?
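A back-of-the-envelope way to check, as a sketch (parameter counts and bytes-per-parameter are approximations): the full Q4_K_M weights have to fit in RAM, but because Mixtral is a mixture-of-experts model, only the roughly 39B active parameters per token limit decode speed.

    # Rough fit/speed check for Mixtral 8x22B (Q4_K_M) in 96 GB of RAM.
    # All numbers are approximations, not measurements.

    TOTAL_PARAMS_B  = 141    # all experts combined
    ACTIVE_PARAMS_B = 39     # ~2 of 8 experts routed per token
    BYTES_PER_PARAM = 0.56   # rough Q4_K_M average
    BANDWIDTH_GB_S  = 80.0   # assumed dual-channel DDR5 peak

    weights_gb = TOTAL_PARAMS_B * BYTES_PER_PARAM                      # ~79 GB: tight but plausible
    decode_tps = BANDWIDTH_GB_S / (ACTIVE_PARAMS_B * BYTES_PER_PARAM)  # only active experts stream per token

    print(f"weights ≈ {weights_gb:.0f} GB, decode ceiling ≈ {decode_tps:.1f} t/s")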

kiloabnehmen

Thank you. I learned a lot. With respect, we have truly vastly different ideas of what cheap means. U.S. $700-$800 total for this unit (~$900 after tax) is a whole lot to me. I get that it's cheaper than other new hardware by comparison.

tyanite

In 10 years, videos like this will be nostalgic

WonderSilverstrand

Yes, keep exploring these alternatives to running expensive GPU cards or Apple silicon

_RobertOnline_

Most hardware is still not designed for running AI. The average Joe won't buy a 192GB Mac to run LLMs, and a 4090 doesn't have enough VRAM to run most of them.

ps

Running Ollama with Phi-3.5 and multimodal models like minicpm-v on an Amazon DeepLens, basically a camera that Amazon sold to developers that is actually an Intel PC with 8GB of RAM and some Intel-optimized AI frameworks built in. Amazon discontinued the cloud-based parts of the DeepLens program, so these perfectly functional mini PCs are as cheap as $20 on eBay. I have 10. :)
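For anyone wanting to try the same thing, a minimal sketch with the Ollama Python client (assumes the Ollama server is running locally and the phi3.5 and minicpm-v tags have been pulled; the image path is just an example):

    # Minimal Ollama client sketch; assumes a local Ollama server with the
    # "phi3.5" and "minicpm-v" models already pulled.
    import ollama

    # Text-only chat with Phi-3.5.
    reply = ollama.chat(
        model="phi3.5",
        messages=[{"role": "user", "content": "Describe the DeepLens in one sentence."}],
    )
    print(reply["message"]["content"])

    # Vision query with minicpm-v: image paths ride along in the message.
    vision = ollama.chat(
        model="minicpm-v",
        messages=[{"role": "user", "content": "What is in this picture?",
                   "images": ["frame.jpg"]}],  # example path
    )
    print(vision["message"]["content"])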

rbus

I run a 103B model on a 4-slot RAM setup and also get about 3 t/s, and this machine is almost exactly half that with 2 slots.
Whether LLMs run at 1-20 t/s before they get a decent GPU is entirely dependent on memory bandwidth. The best machine for a 70B is actually a 256GB, 12-slot, dual-CPU Xeon from around 2015, which runs about $2000 total with eBay parts (90% of the cost is the motherboard and CPUs).
In other words, no GPU is required at all, just as many RAM slots (memory channels) as you can find.
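A quick sketch of how that scales (per-channel bandwidth figures are assumed nominal numbers, and slot count only roughly tracks channel count, which is what actually matters):

    # Decode-speed ceilings for a ~40 GB Q4 70B model at different memory configs.
    # Per-channel bandwidths are assumed nominal figures.

    MODEL_GB = 40.0
    configs_gb_s = {
        "2-channel DDR5-5600 (this mini PC)":   2 * 44.8,
        "8-channel DDR4-2400 (dual 2015 Xeon)": 8 * 19.2,
    }

    for name, bw in configs_gb_s.items():
        print(f"{name}: ~{bw:.0f} GB/s -> ceiling ~{bw / MODEL_GB:.1f} t/s")
    # -> ~2.2 t/s vs ~3.8 t/s: more memory channels, more tokens, no GPU involved.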

monoham

They need to start making GPUs with DDR slots. It would be slower for gaming but great for LLM and image generation

ryanchappell

Bro, these are the fucking videos we need. Why is everyone just talking and never making videos like this?

maxxflyer

Mini PCs are amazing!!! I got a SER8 last week with 96GB of memory and a 4TB NVMe, and it matches my old Threadripper 1950X in multicore but has more memory and storage, BLOWS it away in single core, and fits in the palm of my hand. I literally am in love with it now o.o

isbestlizard

Hmmm, besides RAM/VRAM size, its mostly RAM bandwidth for token-generation, which determines llama.cpp's speed (the 4090 has >1TB/s, the M2 Ultra has 800GB/s) . GPU-horsepower is mainly useful for (batched) prompt-processing and learning.

And for RAM-size, its not just the model! With the large-context models like llama-3.1, RAM-requirements totally explode, if you try to start the model with its default 128k token-limit.
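To put a number on that, a sketch using Llama 3.1 70B's published architecture (80 layers, 8 KV heads, head dimension 128) and assuming an uncompressed fp16 KV cache:

    # KV-cache size at Llama 3.1's default 128k context, fp16 cache assumed.

    LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128   # Llama 3.1 70B architecture
    BYTES_PER_ELEM = 2                        # fp16
    CTX = 128 * 1024                          # default 131072-token window

    per_token_bytes = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM  # K and V for every layer
    total_gb = per_token_bytes * CTX / 1e9

    print(f"~{per_token_bytes/1e3:.0f} KB per token -> ~{total_gb:.0f} GB of KV cache at 128k")
    # -> ~328 KB/token, so ~43 GB of cache on top of the ~40 GB of Q4 weights.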

But cool video, thanks!!!!

andikunar

Soooo excited to see you testing the new Lunar Lake Intel CPUs.

DunckingTest

It's only a dual-channel RAM machine. 50W to run a 7B model slowly versus 150W to run the same model at 6x the speed on a 4060 Ti works out to better efficiency per completion. And the 4060 Ti doesn't push full load or even close, so the power draw can be less.
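The per-completion energy math, as a sketch (the wattages are the numbers above; the absolute token rates are assumptions, only their 6x ratio matters):

    # Energy per generated token = watts / (tokens per second).

    def joules_per_token(watts: float, tokens_per_s: float) -> float:
        return watts / tokens_per_s

    mini_pc   = joules_per_token(50.0, 5.0)        # assumed ~5 t/s for a 7B locally
    rtx4060ti = joules_per_token(150.0, 5.0 * 6)   # same model, 6x faster at 150 W

    print(f"mini PC: {mini_pc:.1f} J/token, 4060 Ti: {rtx4060ti:.1f} J/token")
    # -> 10.0 vs 5.0 J/token: the faster box finishes sooner and uses less energy overall.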

jcdenton

1.43 t/s is kinda OK, but realistically it's not very useful.
I think a bang-for-buck option would be a couple of Tesla P40s to get something like 5 t/s. It won't look pretty, but if you chuck it in the garage or something, that's not a problem.

..

Thanks for testing this out. I've thought about testing this for myself using the Minisforum version of this mini PC.
There also seems to be another way of running LLMs, using the actual NPU of the Core Ultra CPUs instead of the Arc GPU, by running it via OpenVINO.
I would be very interested in some more testing on Linux + OpenVINO.
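For reference, a minimal sketch of that OpenVINO GenAI route (it assumes the model has already been exported to OpenVINO IR, e.g. with optimum-cli, and that the Intel NPU driver is installed; the folder name is hypothetical and the API can differ between versions):

    # OpenVINO GenAI sketch targeting the NPU; the model folder is a hypothetical
    # local export (e.g. produced by `optimum-cli export openvino ...`).
    import openvino_genai as ov_genai

    model_dir = "llama-3.1-8b-instruct-ov-int4"    # hypothetical IR folder
    pipe = ov_genai.LLMPipeline(model_dir, "NPU")  # use "GPU" to target the Arc iGPU instead

    print(pipe.generate("Explain memory bandwidth in one paragraph.",
                        max_new_tokens=128))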

ptrxwsmitt

You can always pick up a second-hand Tesla K80 and run it side by side with your 3090/4090 or other GPU. I have a 3090 and the Tesla K80; sure, it's old, but heck, I have 48 GB of VRAM to play with and things just run smoothly. Sure, I'm not going to break the 100-metre sprint record, but coming last in an Olympics out of 8 billion people is good enough for me.
Lots of alternative ways to leverage big-company clear-outs of servers which no longer have value to them but are of value to us consumers running AI on the smell of an oily rag.

Love the videos.

AnOldMansView

7:05 Always try to power up before screwing the cover of the device closed. On rare occasions you need to reseat those DIMMs.

yewhanlim