LocalAI LLM Testing: Llama 3.1 8B Q8 Showdown - M40 24GB vs 4060Ti 16GB vs A4500 20GB vs 3090 24GB

The 3090 24GB has joined the lab, plus the Llama 3.1 models were released this week.

Just a fun night in the lab, grab your favorite relaxation method and join in.

GPU Bench Node

Recorded and best viewed in 4K
Your results may vary due to hardware, software, model used, context size, weather, wallet, and more!
Comments

GREAT video... learned a lot. It's hard to find good AI benchmark videos on YouTube.

jksoftware

Undervolt the 3090 and it will give you basically the same performance at around 220-250 watts.
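For anyone who wants to verify the draw while the model runs, here is a minimal sketch using the NVML Python bindings (an assumption: the nvidia-ml-py package is installed and the 3090 is device index 0):

```python
# Minimal sketch: poll the card's power draw once per second.
# Assumes the nvidia-ml-py (pynvml) package and that the 3090 is GPU 0.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    while True:
        # nvmlDeviceGetPowerUsage reports milliwatts
        watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
        print(f"GPU power draw: {watts:.0f} W")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```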

ArtificialLife-GameOfficialAcc

Please test the 3060 12GB cards comprehensively! It would also be nice to hear your opinion on the best card combos for cost-to-performance.

makerspersona

This is great, and a very professional test environment. I was especially impressed by the ability to switch GPUs using the Kubernetes cluster.

hammadusmani

Great video. It is definitely nice to see a benchmark across different Nvidia boards, similar to runs I have done before. At the end of June, I bought parts and built a computer for AI development with a Ryzen 7 7800X3D ($339) and a 4060 Ti 16GB ($450). I bought it to begin local development while waiting on the RTX 5090, but it looks like that will be delayed for a while.
I've just been using LM Studio and AnythingLLM to run local LLMs for data analysis, plus many open-source Python projects for audio and image processing.

shawnvines

Superb! Could you make a tutorial on how to set up and implement everything needed (software-wise) to achieve what you did here?

hablalabiblia

Pro tip: you can reduce the power draw of the RTX 3090 by 90 watts via undervolting, with no speed reduction during LLM inference.
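A true undervolt needs a voltage/frequency curve editor (MSI Afterburner on Windows, for example), but a plain power cap captures much of the same saving. A minimal sketch with the NVML Python bindings, assuming root privileges and the 3090 at device index 0:

```python
# Minimal sketch: cap the 3090 at 260 W instead of its ~350 W default.
# Requires root/admin; assumes the nvidia-ml-py (pynvml) package.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# NVML takes power limits in milliwatts.
pynvml.nvmlDeviceSetPowerManagementLimit(handle, 260_000)

limit = pynvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000.0
print(f"New power limit: {limit:.0f} W")
pynvml.nvmlShutdown()
```

The same cap can be set from the command line with `nvidia-smi -pl 260`.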

Viewable

I have two Tesla P40s here, but I have been unsuccessful in my attempts to use both for my AI workloads; my Stable Diffusion training runs in particular are taking very long. Do you know how I could make them appear as one large GPU?
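They will not literally appear as one GPU, but llama.cpp-based runners can shear a model's layers across both cards so the combined 48 GB is usable. A minimal sketch with llama-cpp-python (assumptions: a CUDA build of the package, a placeholder model path, and illustrative split ratios):

```python
# Minimal sketch: split one GGUF model across two Tesla P40s.
# Assumes llama-cpp-python compiled with CUDA support.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3.1-8b-instruct-q8_0.gguf",  # placeholder path
    n_gpu_layers=-1,          # offload every layer to the GPUs
    tensor_split=[0.5, 0.5],  # even split across GPU 0 and GPU 1
    n_ctx=8192,
)

out = llm("Q: What is the capital of France? A:", max_tokens=16)
print(out["choices"][0]["text"])
```

Note this helps LLM inference specifically; Stable Diffusion training generally cannot pool two cards into one large device this way.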

Rewelife

Something I am really wondering about is the Radeon VII vs. the RX 6950 XT (to keep it inside the AMD family).
Getting things to work with ROCm is bothersome, and most of what is available for NVIDIA just refuses to run, but as long as only inference is involved it works well (I tried some tuning, with no success so far).

Would the massive HBM2 bandwidth be able to score any win against a more recent card with more capable compute? Or, if the HBM2 sees no win, how would it affect the scaling?
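For a rough sense of scale, using the bandwidth rule of thumb discussed further down the thread: the Radeon VII's HBM2 moves about 1,024 GB/s versus roughly 576 GB/s on the RX 6950 XT, so on a bandwidth-bound phase like token generation the older card has a roughly 1.8x higher ceiling, while prompt processing should still favor the 6950 XT's newer, faster compute.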

Vaasref

Exactly as requested: more useful videos. Thank you for your content.

I'm thinking about buying an H100 80GB because I want to run Mistral Large 2 so badly 😅

dllsmartphone

How many shrouds and fan sizes have you tried on Tesla GPUs? I want a quieter setup, which a larger fan could theoretically provide, but the shroud's funneling might itself be a source of noise, so I don't know what to buy for the best silence.

jcdenton

The read speed of an LLM (prompt eval tokens/s) depends mainly on the compute speed of the hardware (the number and frequency of its tensor cores and CUDA cores, plus the chip generation). The write speed of an LLM (eval tokens/s) depends mainly on the memory bandwidth (in GB/s) of the hardware and the chip generation.
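That rule of thumb gives a quick back-of-the-envelope ceiling: each generated token reads roughly the whole quantized model from VRAM, so eval tokens/s is bounded by bandwidth divided by model size. A minimal sketch for the cards in this video, using their published bandwidths and an ~8.5 GB Q8_0 file (both figures are approximations):

```python
# Rough upper bound: eval tokens/s <= memory bandwidth / bytes per token,
# where bytes per token is about the quantized model file size, since
# every weight is read once per generated token. Real numbers land below
# this ceiling, especially on older chips like the M40.
MODEL_GB = 8.5  # Llama 3.1 8B at Q8_0, approximate

cards = {                 # published memory bandwidth, GB/s
    "Tesla M40 24GB": 288,
    "4060 Ti 16GB": 288,
    "A4500 20GB": 640,
    "3090 24GB": 936,
}

for name, bandwidth in cards.items():
    print(f"{name}: <= {bandwidth / MODEL_GB:.0f} tok/s")
```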

Viewable

Can you please test the A770 16GB card? Thanks.

iheuzio

Very informative, thank you!
Detailed, to the point and exactly what we need to know.
I use a 3090 at home, a 4060 at work, and on my coding machine an old GTX 1080 Ti with 11GB. It does OK for Continue in VS Code, but it is slow.
Tell your wife "it's for science" in your best Doc-from-Back-to-the-Future voice.
Thank you again.

jackflash

Good video, very detailed. I like that you looked at all aspects: power, price, efficiency, etc.

I don't suppose you have an AMD card lying around to compare as well? :D

Flixerine

Me with a 4090 and a 1500W PSU, chuckling about your concern for burning the house down at 300W. :D I tripped a 15A breaker earlier this week and found out the outlet my microwave is plugged into is on the same circuit as my home office; I must have been pushing the GPU at the time. So glad I don't have insane power costs like Europe does. BTW, if you need a "the YouTubers asked for it, it's a business write-off" excuse for a 4090... you really need a 4090 for testing data comparisons, for the people. :D Thanks for all the tests. Would love to see a 4070 Ti in there too, to fight with the 4060.

rhadiem

The GGUF model file format is aimed primarily at CPU inference (though llama.cpp can offload GGUF layers to a GPU). For pure GPU inference, the GPTQ file format (.safetensors) is the usual choice. GGUF takes more space and loses some quality in exchange for being able to run on a CPU.
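For what it's worth, a minimal sketch of loading a GPTQ checkpoint straight onto the GPU with transformers (assumptions: the optimum and auto-gptq packages are installed alongside accelerate, and the repo id is a placeholder, not a specific published model):

```python
# Minimal sketch: run a GPTQ (.safetensors) model on GPU via transformers.
# Assumes transformers, accelerate, optimum, and auto-gptq are installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/Llama-3.1-8B-Instruct-GPTQ"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=8)[0]))
```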

Viewable

Hey, you make really interesting and comprehensive videos! Many thanks for that. Here is something I always ask myself, and I think maybe many others do too:
What exactly do you use to connect the GPUs? Your system looks like a mining rig. Is there any performance loss with these riser extensions versus a direct connection to a PCIe x16 slot?
Have you been able to test things like NVLink with your systems? Does it make sense to mix different GPU models, or does that create bottlenecks?
What do you think is important when choosing hardware to build such a system?

Sorry for all the questions. I just find the whole topic really exciting.

minagornas

Can someone explain in a nutshell what this is? Is it an AI language model like ChatGPT that runs entirely offline on my own computer?

krisiluttinen

Great job! Llama 3.1 is really much better, so I would encourage you to go on a quest: how to run the different flavors of 3.1 most efficiently on commodity hardware. IT projects around LLMs will explode, IMO, because the model family is good and a lot of companies cannot share their data with public clouds.

marekkroplewski