LocalAI LLM Single vs Multi GPU Testing: Scaling to 6x 4060 Ti 16GB GPUs

An edited version of a demo I put together for a conversation among friends about single vs. multiple GPUs when running LLMs locally. We walk through testing from a single GPU up to 6x 4060 Ti 16GB VRAM GPUs.

Machine Components: (These are affiliate-based links that help the channel if you purchase from them!)

Recorded and best viewed in 4K
Your results may vary due to hardware, software, model used, context size, weather, wallet, and more!
Comments

If your model fits on a single GPU (which your 13B Q_K_S model does), there's no benefit to running it across multiple GPUs. In fact, at best you'll see flat performance spreading the model across more than one GPU; generally there's a slight performance penalty from having to coordinate across GPUs during LLM inference. The primary benefit of multiple GPUs is running bigger models like a 70B or Mixtral 8x7B that don't fit on a single GPU, or running batched inference with vLLM.
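
Roughly, the split looks like this in code (a minimal sketch using llama-cpp-python and vLLM; the model paths/names and the even 6-way split below are just placeholder assumptions, not the exact setup from the video):

# Minimal sketch: one large GGUF spread across several GPUs with llama-cpp-python.
# The model path and the even 6-way split are illustrative placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-70b.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=-1,                   # offload all layers to the GPUs
    tensor_split=[1, 1, 1, 1, 1, 1],   # even split of layers across 6 cards
)
out = llm("Q: Why use multiple GPUs?\nA:", max_tokens=64)
print(out["choices"][0]["text"])

# Batched inference with vLLM uses tensor parallelism across the cards instead:
# from vllm import LLM, SamplingParams
# engine = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=6)
# print(engine.generate(["prompt one", "prompt two"], SamplingParams(max_tokens=64)))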

The smaller the model (7B/8B and below in particular), the more impact single-threaded CPU performance has on tokens-per-second speed. For a LLaMA-2 13B Q5_K_S model on an Intel i9-13900K + 4090, for example, I get 82 tokens per second:

llama_print_timings: eval time = 11228.67 ms / 921 runs ( 12.19 ms per token, 82.02 tokens per second)

On the same machine using a 3090, 71 t/s:
llama_print_timings: eval time = 8536.02 ms / 614 runs ( 13.90 ms per token, 71.93 tokens per second)
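
(The tokens-per-second figure is just the run count divided by the eval time; a quick check of the math from the lines above:)

# Sanity check of the llama.cpp timing math, numbers copied from the 4090 line above
eval_time_ms, runs = 11228.67, 921
ms_per_token = eval_time_ms / runs         # ~12.19 ms per token
tokens_per_second = 1000.0 / ms_per_token  # ~82.0 tokens per second
print(f"{ms_per_token:.2f} ms/token, {tokens_per_second:.2f} t/s")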

If you took one of those 4060 Ti cards and put it into a gaming PC with a current gen i7/i9 or Ryzen X3D CPU, you should see a big improvement in tokens per second.

hienngo

New sub, thanks! I was researching the 4060 Ti for a future purchase and your video popped up in my feed. Great content.

nlay

A reminder: you could get a GPU with lots of memory banks and upgrade the memory chips to create huge-capacity cards. You could, for instance, mod a 12GB RTX 3080 to have 24GB.

GraveUypo

Thank you for this contribution to the internet Brother! I have learned from this. Subbed, liked and commented to show support. Good stuff sir! This will help us all. <3

NevsTechBits

Question: Do you have a video going through the Kubernetes setup for using multiple GPUs? It would be helpful for those just starting out.

andre-le-bone-aparte

Just found your channel. Excellent content - another subscriber for you, sir!

andre-le-bone-aparte

Your test is very interesting. I use an old GTX 1060 with 6GB to run Ollama or LM Studio, and the most significant performance impact is the size of the LLM model. BTW, I've tried to find someone who has tested the RX 7900 XT using ROCm and did not find any on all of YouTube.

maxh-yanz

I was waiting for big models to be run on these :) How about 6x A770 16GB GPUs?

stuffinfinland

Hey RoboTF, my thinking is that 6x 16GB 4060 Tis for a total of 96GB of VRAM will allow you to run 130B-parameter models (Q4) and easily run any 70B model unquantised.
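
(For rough sizing of claims like this, a weights-only back-of-envelope helps; it ignores KV cache and per-GPU overhead, so real usage runs higher:)

# Weights-only VRAM back-of-envelope: params * bits_per_weight / 8.
# Ignores KV cache, activations, and per-GPU overhead, so treat it as a floor.
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    # params_billion * 1e9 params, each bits/8 bytes -> result in GB (1e9 bytes)
    return params_billion * bits_per_weight / 8

print(weights_gb(130, 4.5))  # ~73 GB  -> a 130B model at ~Q4
print(weights_gb(70, 16.0))  # ~140 GB -> a 70B model unquantised (FP16)
print(weights_gb(70, 4.5))   # ~39 GB  -> a 70B model at ~Q4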

jackinsights

Amazing video, it was really helpful! Keep it up!

animecharacter

I came here expecting to see some tests actually utilizing that combined VRAM, and now I'm left confused about what the point of this video was.
Why would anyone expect speed to be different running the exact same model with the exact same prompt size on 1 GPU versus 6 GPUs?

gaiusbc

Doesn't look like it split the workload well.
It could have sent an iteration simultaneously to each GPU, but it doesn't look like it does that.

jasonn

Thank you very much for this, it must have cost a small fortune

sixfree

Just found your channel, this is some amazing stuff. Please do post a video of your Kubernetes setup. I'm currently working on designing an AI-powered helpdesk with SIEM and just really getting in deep with the ML/AI areas... really love playing with all this.

alptraum

Thanks for the content. So two 16GB GPUs will act as 32GB for the models?

aminvand

This is probably due to extreme PCIe bandwidth bottlenecking. Putting a ton of GPUs together without the bandwidth to push the data to them and back yields no improvement.

GraveUypo

Just saw this, thanks for testing. I already have a 4090, but I'm definitely chasing the almighty VRAM for testing bigger models and running different things at the same time. What would you recommend for system RAM to run a 6x GPU setup like this?

rhadiem

I'm curious about my use case, which is coding (it needs a high enough quantization). What t/s would I get with Codestral 22B quantized to 6-bit with a moderate context size? 2x 4060 Ti 16GB should be enough for that and leave plenty of room for context. And secondly, what would be the speed penalty when going for 8-bit quantization instead of 6-bit? Around 33%, or am I wrong?
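
(A rough back-of-envelope for that last question, assuming token generation is memory-bandwidth-bound so speed scales inversely with bytes read per token; real GGUF quants carry some extra overhead, so these are approximate:)

# Back-of-envelope: if generation speed scales inversely with bytes per weight,
# moving from 6-bit to 8-bit reads 8/6 as much data per token.
weights_gb_q6 = 22 * 6 / 8   # ~16.5 GB of weights for a 22B model at 6-bit
weights_gb_q8 = 22 * 8 / 8   # ~22.0 GB at 8-bit
slowdown = 1 - 6 / 8         # ~25% fewer tokens/s at 8-bit vs 6-bit
print(weights_gb_q6, weights_gb_q8, f"{slowdown:.0%} slower")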

jeroenadamdevenijn

Two to three cards give better consistency, but it seems better to run them as individual nodes working on different problems, so the real test would be a more complex problem that has 6 parts, and load times are still part of the equation. Use some type of problem that your LLM has shown some fault with (not 95% correct, more like 67-75%) and run each part 3 times; the single GPU should then take about 6 times longer, but from what I have seen (other reports), 6 GPUs will run only about 5.5 times faster. Still, when problem solving that is a time-saver. Total memory is also a real consideration, along with whether you are running PCIe 4 vs 5. I've read that the slower your bus speed, the more advantageous it is to have a bigger-VRAM GPU.

tsclly

Hi, thanks for sharing, great video! Would you please also share the hardware list used for this test if you get a chance? I am very interested in how the GPUs are connected to the mainboard.

akirakudo