LocalAI LLM Testing: Llama 3.1 8B Q8 Showdown - M40 24GB vs 4060Ti 16GB vs A4500 20GB vs 3090 24GB

The 3090 24GB has joined the lab, plus the Llama 3.1 models were released this week.

Just a fun night in the lab, grab your favorite relaxation method and join in.

GPU Bench Node

Recorded and best viewed in 4K
Your results may vary due to hardware, software, model used, context size, weather, wallet, and more!
Comments

GREAT video... learned a lot. It's hard to find good AI benchmark videos on YouTube.

jksoftware

Undervolt the 3090 and it will give you basically the same performance at around 220-250 watts.
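For anyone who wants to verify the draw while the model runs, here is a minimal sketch using the NVML Python bindings (an assumption: the nvidia-ml-py package is installed and the 3090 is device index 0):

```python
# Minimal sketch: poll the card's power draw once per second.
# Assumes the nvidia-ml-py (pynvml) package and that the 3090 is GPU 0.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    while True:
        # nvmlDeviceGetPowerUsage reports milliwatts
        watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
        print(f"GPU power draw: {watts:.0f} W")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```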

ArtificialLife-GameOfficialAcc

Please test the 3060 12GB cards comprehensively! It would also be nice to hear your opinion on the best card combos for cost-to-performance.

makerspersona

This is great, and a very professional test environment. I was especially impressed by the ability to switch GPUs using the Kubernetes cluster.

hammadusmani

Great video. It is definitely nice to see a benchmark across different Nvidia boards, similar to runs I have done before. At the end of June, I bought parts and built a computer for AI development with a Ryzen 7 7800X3D ($339) and a 4060 Ti 16GB ($450). I bought it to begin local development while waiting on the RTX 5090, but it looks like that will be delayed for a while.
I've just been using LM Studio and AnythingLLM to run local LLMs for data analysis, plus many open-source Python projects for audio and image processing.

shawnvines

Superb! Could you make a tutorial on how to set up and implement everything needed (software-wise) to achieve what you did here?

hablalabiblia

Pro tip: you can reduce the power draw of the RTX 3090 by 90 watts via undervolting, with no speed reduction during LLM inference.
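A true undervolt needs a voltage/frequency curve editor (MSI Afterburner on Windows, for example), but a plain power cap captures much of the same saving. A minimal sketch with the NVML Python bindings, assuming root privileges and the 3090 at device index 0:

```python
# Minimal sketch: cap the 3090 at 260 W instead of its ~350 W default.
# Requires root/admin; assumes the nvidia-ml-py (pynvml) package.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# NVML takes power limits in milliwatts.
pynvml.nvmlDeviceSetPowerManagementLimit(handle, 260_000)

limit = pynvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000.0
print(f"New power limit: {limit:.0f} W")
pynvml.nvmlShutdown()
```

The same cap can be set from the command line with `nvidia-smi -pl 260`.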

Viewable

I have two Tesla P40s here, but I have been unsuccessful in my attempts to use both for my AI workloads; my Stable Diffusion training runs in particular are taking very long. Do you know how I could make them appear as one large GPU?
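They will not literally appear as one GPU, but llama.cpp-based runners can shear a model's layers across both cards so the combined 48 GB is usable. A minimal sketch with llama-cpp-python (assumptions: a CUDA build of the package, a placeholder model path, and illustrative split ratios):

```python
# Minimal sketch: split one GGUF model across two Tesla P40s.
# Assumes llama-cpp-python compiled with CUDA support.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3.1-8b-instruct-q8_0.gguf",  # placeholder path
    n_gpu_layers=-1,          # offload every layer to the GPUs
    tensor_split=[0.5, 0.5],  # even split across GPU 0 and GPU 1
    n_ctx=8192,
)

out = llm("Q: What is the capital of France? A:", max_tokens=16)
print(out["choices"][0]["text"])
```

Note this helps LLM inference specifically; Stable Diffusion training generally cannot pool two cards into one large device this way.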

Rewelife

Something I am really wondering about is the Radeon VII vs. the RX 6950 XT (to keep it inside the AMD family).
Getting things to work with ROCm is bothersome, and most of what is available for NVIDIA just refuses to run, but as long as only inference is involved it works well (I tried some tuning, with no success so far).

Would the massive HBM2 bandwidth be able to score any win against a more recent card with more capable compute? Or, if the HBM2 sees no win, how would it affect the scaling?
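For a rough sense of scale, using the bandwidth rule of thumb discussed further down the thread: the Radeon VII's HBM2 moves about 1,024 GB/s versus roughly 576 GB/s on the RX 6950 XT, so on a bandwidth-bound phase like token generation the older card has a roughly 1.8x higher ceiling, while prompt processing should still favor the 6950 XT's newer, faster compute.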

Vaasref

Exactly as requested: more useful videos. Thank you for your content.

I'm thinking about buying an H100 80GB because I want to run Mistral Large 2 so badly 😅

dllsmartphone

How many shrouds and fan sizes have you tried on Tesla GPUs? I want a quieter setup, which a larger fan could theoretically provide, but the shroud's funneling might itself be a source of noise, so I don't know what to buy for the best silence.

jcdenton

The read speed of an LLM (prompt eval tokens/s) depends mainly on the compute speed of the hardware (the number and frequency of its tensor cores and CUDA cores, plus the chip generation). The write speed of an LLM (eval tokens/s) depends mainly on the memory bandwidth (in GB/s) of the hardware and the chip generation.
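That rule of thumb gives a quick back-of-the-envelope ceiling: each generated token reads roughly the whole quantized model from VRAM, so eval tokens/s is bounded by bandwidth divided by model size. A minimal sketch for the cards in this video, using their published bandwidths and an ~8.5 GB Q8_0 file (both figures are approximations):

```python
# Rough upper bound: eval tokens/s <= memory bandwidth / bytes per token,
# where bytes per token is about the quantized model file size, since
# every weight is read once per generated token. Real numbers land below
# this ceiling, especially on older chips like the M40.
MODEL_GB = 8.5  # Llama 3.1 8B at Q8_0, approximate

cards = {                 # published memory bandwidth, GB/s
    "Tesla M40 24GB": 288,
    "4060 Ti 16GB": 288,
    "A4500 20GB": 640,
    "3090 24GB": 936,
}

for name, bandwidth in cards.items():
    print(f"{name}: <= {bandwidth / MODEL_GB:.0f} tok/s")
```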

Viewable

Can you please test the A770 16GB card? Thanks.

iheuzio

Very informative, thank you!
Detailed, to the point and exactly what we need to know.
I use a 3090 at home, a 4060 at work, and on my coding machine an old GTX 1080 Ti with 11GB. It does OK for Continue in VS Code, but it is slow.
Tell your wife "it's for science" in your best Doc-from-Back-to-the-Future voice.
Thank you again.

jackflash

Good video, very detailed. I like that you looked at all aspects: power, price, efficiency, etc.

I don't suppose you have an AMD card lying around to compare as well? :D

Flixerine

Me with a 4090 and a 1500W PSU, chuckling about your concern for burning the house down at 300W. :D I tripped a 15A breaker earlier this week and found out the outlet my microwave is plugged into is on the same circuit as my home office; I must have been pushing the GPU at the time. So glad I don't have insane power costs like Europe does. BTW, if you need a "the YouTubers asked for it, it's a business write-off" excuse for a 4090... you really need a 4090 for testing data comparisons, for the people. :D Thanks for all the tests. Would love to see a 4070 Ti in there too, to fight with the 4060.

rhadiem

The GGUF model file format is aimed primarily at CPU inference (though llama.cpp can offload GGUF layers to a GPU). For pure GPU inference, the GPTQ file format (.safetensors) is the usual choice. GGUF takes more space and loses some quality in exchange for being able to run on a CPU.
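For what it's worth, a minimal sketch of loading a GPTQ checkpoint straight onto the GPU with transformers (assumptions: the optimum and auto-gptq packages are installed alongside accelerate, and the repo id is a placeholder, not a specific published model):

```python
# Minimal sketch: run a GPTQ (.safetensors) model on GPU via transformers.
# Assumes transformers, accelerate, optimum, and auto-gptq are installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/Llama-3.1-8B-Instruct-GPTQ"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=8)[0]))
```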

Viewable

Hey, you make really interesting and comprehensive videos! Many thanks for that. Here is something I always ask myself, and I think maybe many others do too:
What exactly do you use to connect the GPUs? Your system looks like a mining rig. Is there any performance loss with these riser extensions versus a direct connection to a PCIe x16 slot?
Have you been able to test things like NVLink with your systems? Does it make sense to mix different GPU models, or does that create bottlenecks?
What do you think is important when choosing hardware to build such a system?

Sorry for all the questions. I just find the whole topic really exciting.

minagornas

Can someone explain in a nutshell what this is? Is it an AI language model like ChatGPT that runs entirely offline on my own computer?

krisiluttinen

Great job! Llama 3.1 is really much better, so I would encourage you to go on a quest: how to run the different flavors of 3.1 most efficiently on commodity hardware. IT projects around LLMs will explode, IMO, because the model family is good and a lot of companies cannot share their data with public clouds.

marekkroplewski