LocalAI LLM Single vs Multi GPU Testing: Scaling to 6x 4060 Ti 16GB GPUs

An edited version of a demo I put together for a conversation among friends about single vs. multiple GPUs when running LLMs locally. We walk through testing from a single GPU up to 6x 4060 Ti 16GB VRAM GPUs.

Machine Components: (These are affiliate-based links that help the channel if you purchase from them!)

Recorded and best viewed in 4K
Your results may vary due to hardware, software, model used, context size, weather, wallet, and more!
Comments

If your model fits on a single GPU (which your 13B Q_K_S model does), there's no benefit to running it across multiple GPUs. In fact, at best you'll see flat performance spreading the model across more than one GPU; generally there's a slight performance penalty from having to coordinate across GPUs during LLM inference. The primary benefit of multiple GPUs is running bigger models like a 70B or Mixtral 8x7B that don't fit on a single GPU, or running batched inference with vLLM.
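
Roughly, the split looks like this in code (a minimal sketch using llama-cpp-python and vLLM; the model paths/names and the even 6-way split below are just placeholder assumptions, not the exact setup from the video):

# Minimal sketch: one large GGUF spread across several GPUs with llama-cpp-python.
# The model path and the even 6-way split are illustrative placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-70b.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=-1,                   # offload all layers to the GPUs
    tensor_split=[1, 1, 1, 1, 1, 1],   # even split of layers across 6 cards
)
out = llm("Q: Why use multiple GPUs?\nA:", max_tokens=64)
print(out["choices"][0]["text"])

# Batched inference with vLLM uses tensor parallelism across the cards instead:
# from vllm import LLM, SamplingParams
# engine = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=6)
# print(engine.generate(["prompt one", "prompt two"], SamplingParams(max_tokens=64)))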

The smaller the model (7B/8B and below in particular), the more impact single-threaded CPU performance has on tokens-per-second speed. For a LLaMA-2 13B Q5_K_S model on an Intel i9-13900K + 4090, for example, I get 82 tokens per second:

llama_print_timings: eval time = 11228.67 ms / 921 runs ( 12.19 ms per token, 82.02 tokens per second)

On the same machine using a 3090, 71 t/s:
llama_print_timings: eval time = 8536.02 ms / 614 runs ( 13.90 ms per token, 71.93 tokens per second)
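
(The tokens-per-second figure is just the run count divided by the eval time; a quick check of the math from the lines above:)

# Sanity check of the llama.cpp timing math, numbers copied from the 4090 line above
eval_time_ms, runs = 11228.67, 921
ms_per_token = eval_time_ms / runs         # ~12.19 ms per token
tokens_per_second = 1000.0 / ms_per_token  # ~82.0 tokens per second
print(f"{ms_per_token:.2f} ms/token, {tokens_per_second:.2f} t/s")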

If you took one of those 4060 Ti cards and put it into a gaming PC with a current gen i7/i9 or Ryzen X3D CPU, you should see a big improvement in tokens per second.

hienngo

New sub, thanks! I was researching the 4060 Ti for a future purchase and your video popped up in my feed. Great content.

nlay

A reminder: you could get a GPU with lots of memory banks and upgrade the memory chips to create huge-capacity cards. You could, for instance, mod a 12GB RTX 3080 to have 24GB.

GraveUypo

Thank you for this contribution to the internet Brother! I have learned from this. Subbed, liked and commented to show support. Good stuff sir! This will help us all. <3

NevsTechBits

Question: Do you have a video going through the Kubernetes setup for using multiple GPUs? It would be helpful for those just starting out.

andre-le-bone-aparte

Just found your channel. Excellent content - another subscriber for you, sir!

andre-le-bone-aparte

Your test is very interesting. I use an old GTX 1060 with 6GB to run Ollama or LM Studio, and the most significant performance impact is the size of the LLM model. BTW, I've tried to find someone who has tested the RX 7900 XT using ROCm and did not find any on all of YouTube.

maxh-yanz

I was waiting for big models to be run on these :) How about 6x A770 16GB GPUs?

stuffinfinland

Hey RoboTF, my thinking is that 6x 16GB 4060 Tis for a total of 96GB of VRAM will allow you to run 130B-parameter models (Q4) and easily run any 70B model unquantised.
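
(For rough sizing of claims like this, a weights-only back-of-envelope helps; it ignores KV cache and per-GPU overhead, so real usage runs higher:)

# Weights-only VRAM back-of-envelope: params * bits_per_weight / 8.
# Ignores KV cache, activations, and per-GPU overhead, so treat it as a floor.
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    # params_billion * 1e9 params, each bits/8 bytes -> result in GB (1e9 bytes)
    return params_billion * bits_per_weight / 8

print(weights_gb(130, 4.5))  # ~73 GB  -> a 130B model at ~Q4
print(weights_gb(70, 16.0))  # ~140 GB -> a 70B model unquantised (FP16)
print(weights_gb(70, 4.5))   # ~39 GB  -> a 70B model at ~Q4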

jackinsights

Amazing video, it was really helpful! Keep it up!

animecharacter

I came here expecting to see some tests actually utilizing that combined VRAM, and now I'm left confused about what the point of this video was.
Why would anyone expect speed to be different running the exact same model with the exact same prompt size on 1 GPU versus 6 GPUs?

gaiusbc

Doesn't look like it split the workload well.
It could have sent an iteration simultaneously to each GPU, but it doesn't look like it does that.

jasonn

Thank you very much for this, it must have cost a small fortune

sixfree

Just found your channel, this is some amazing stuff. Please do post a video of your Kubernetes setup. I'm currently working on designing an AI-powered helpdesk with SIEM and just really getting in deep with the ML/AI areas... really love playing with all this.

alptraum

Thanks for the content. So two 16GB GPUs will act as 32GB for the models?

aminvand

This is probably due to extreme PCIe bandwidth bottlenecking. Putting a ton of GPUs together without the bandwidth to push the data to them and back yields no improvement.

GraveUypo

Just saw this, thanks for testing. I already have a 4090, but I'm definitely chasing the almighty VRAM for testing bigger models and running different things at the same time. What would you recommend for system RAM to run a 6x GPU setup like this?

rhadiem

I'm curious about my use case, which is coding (it needs a high enough quantization). What t/s would I get with Codestral 22B quantized to 6-bit with a moderate context size? 2x 4060 Ti 16GB should be enough for that and leave plenty of room for context. And secondly, what would be the speed penalty when going for 8-bit quantization instead of 6-bit? Around 33%, or am I wrong?
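
(A rough back-of-envelope for that last question, assuming token generation is memory-bandwidth-bound so speed scales inversely with bytes read per token; real GGUF quants carry some extra overhead, so these are approximate:)

# Back-of-envelope: if generation speed scales inversely with bytes per weight,
# moving from 6-bit to 8-bit reads 8/6 as much data per token.
weights_gb_q6 = 22 * 6 / 8   # ~16.5 GB of weights for a 22B model at 6-bit
weights_gb_q8 = 22 * 8 / 8   # ~22.0 GB at 8-bit
slowdown = 1 - 6 / 8         # ~25% fewer tokens/s at 8-bit vs 6-bit
print(weights_gb_q6, weights_gb_q8, f"{slowdown:.0%} slower")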

jeroenadamdevenijn

Two to three cards give better consistency, but it seems better to run them as individual nodes working on different problems, so the real test would be a more complex problem that has 6 parts, and load times are still part of the equation. Use some type of problem that your LLM has shown some fault with (not 95% correct, more like 67-75%) and run each part 3 times; the single GPU should then take about 6 times longer, but from what I have seen (other reports), 6 GPUs will run only about 5.5 times faster. Still, when problem solving that is a time-saver. Total memory is also a real consideration, along with whether you are running PCIe 4 vs 5. I've read that the slower your bus speed, the more advantageous it is to have a bigger-VRAM GPU.

tsclly

Hi, thanks for sharing, great video! Would you please also share the hardware list used for this test if you get a chance? I am very interested in how the GPUs are connected to the mainboard.

akirakudo