5. Comparing Quantizations of the Same Model - Ollama Course

Welcome back to the Ollama course! In this lesson, we dive into the fascinating world of AI model quantization. Using variations of the llama3.1 model, we explore how different quantization levels affect performance and output quality.

Through this video, you'll gain a deeper understanding of how to choose the right quantization for the way you use AI models, so you get the best performance and results for your specific needs. Don't forget to subscribe for more lessons in this free Ollama course!
Thanks for watching!
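
The workflow in the video boils down to pulling the same model at several quantization levels and asking each one the same questions. Here is a minimal sketch of that idea (the exact tag names are assumptions; check the llama3.1 page in the Ollama library for the tags that are actually published):

```
# Pull the same llama3.1 8B model at a few quantization levels.
# Tag names are assumptions; verify them on ollama.com/library/llama3.1.
for tag in 8b-instruct-q2_K 8b-instruct-q4_0 8b-instruct-q4_K_M 8b-instruct-fp16; do
  ollama pull "llama3.1:$tag"
done

# Ask each quantization the same question and compare the answers side by side.
for tag in 8b-instruct-q2_K 8b-instruct-q4_0 8b-instruct-q4_K_M 8b-instruct-fp16; do
  echo "=== llama3.1:$tag ==="
  ollama run "llama3.1:$tag" "Explain what a black hole is in two sentences."
done
```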

My Links 🔗

00:00 - Start with an example
00:24 - Introduction
00:56 - Lots of claims on the Discord
01:26 - Intro to the app
01:57 - Where to find the code
02:20 - Grab a few quantizations
02:57 - You should regularly pull the models again
03:30 - Back to the Black Hole answers
04:39 - The classic logic problem
05:35 - How about function calling
08:31 - How about for prompts with more reasoning
09:01 - Are those questions stupid?
09:30 - Which quant to use?
Comments

On the subject of quantization, you could have included how to choose among q4_0, q4_K, and q4_K_M. If one is better at the same quantization level, for example if q4_K is better than q4_0, why do we create both? Thank you for these videos; the format of the course, focusing on a single topic per session, is very nice!

romulopontual

I'm really in awe of how well you explain everything. I wish I had professors with your patience and teaching ability when I was at university. Anyway, thank you for the lesson. I already love Ollama, but your content is really making me see LLMs, and in this case quantizations, with different eyes.

To be honest, I used to think that anything lower than q8 for, let's say, a 7 or 8 billion parameter model, would be pretty much useless, but after experimenting with Llama 3.1, Mistral, and a few others, I think q4 is definitely the sweet spot for my needs. Llama 3.1-q4 retains a decent amount of reasoning capabilities and I can increase the context length to have it work better with whatever information I want to feed it on the spot.

Thanks again for the content. It's awesome!

tudor-octavian

Looking at quantized models is something I haven't explored yet for my home server. I have an HPE MicroServer that I can't put a graphics card into because it's physically too small, so I'm running CPU only. Now you've got me curious whether I can actually get faster speeds just by using a smaller quantized model. Thank you so much for making this content. You're absolutely amazing.

markjones

I think that q2 was the best in the scenario where you were basically searching for the "json" string, because it wasn't trying to "understand" what JSON is. It was just a word/string, and so it was always "caught".

wgabrys

Excellent video, Matt. Your take on the usefulness of benchmarks is smart.

fabriai

You changed my thinking on which quant to use! I'll experiment more with running the lowest quant that still gives acceptable answers. Thanks!

ReidKimball

Great tutorial as usual.

Btw, here is a Windows PowerShell command to update all the models if you have more than one already installed on Windows.

ollama ls | ForEach-Object {"{0}" -f($_ -split '\s+')} | Where-Object { $_ -notmatch 'failed' -and $_ -notmatch 'NAME' } | ForEach-Object {$model = $_; "Updating model $model"; ollama pull $model}

This one works both on Mac and Linux:

ollama ls | awk '{print $1}' | grep -v NAME | while read model; do echo "### updating $model ###"; ollama pull $model; done

I wrote those myself, but you can ask your favorite GPT for an explanation.

LuisYax

This is great. I'd love to see the difference between the quant levels as the length of the prompt increases. I find that the lower quants don't handle longer inputs very well. I'm not sure why that is.

ManjaroBlack

I agree with all the points made in the video and would just like to add my own experience for viewers looking for more details in the comments.

One important factor to consider when choosing quantization levels is the impact of hardware constraints. For example, I've been running the LLaMA 3.1 70B model, which fits in 48GB of VRAM at Q4 but with a limited context window. I found that running the 70B model at Q2 (which frees up memory for extending the context window) gave me better results than the 8B model with an extended context window. This balance between model size, quantization, and context window size can be crucial depending on your specific use case and hardware capabilities.

tylerlindsay
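
For anyone who wants to try the size/quantization/context tradeoff described in the comment above, here is a minimal sketch of requesting a larger context window through the Ollama API (the model tag and the 8192-token value are assumptions; use whatever your hardware can actually hold):

```
# Request a larger context window per call via the Ollama REST API.
# The tag and num_ctx value are examples only; adjust for your hardware.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:70b-instruct-q2_K",
  "prompt": "Summarize the following case notes: ...",
  "stream": false,
  "options": { "num_ctx": 8192 }
}'
```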

I found the evaluation interesting, and the conclusion wise: try with your own prompts and see what happens.
I would suggest extending the evaluation to many more conversation turns, because some models get lost later on despite doing well on the first reply.

Your evaluation made me curious to try different quants at temperature 0 and the same seed, to see if some of the quants end up with an identical output!

supercurioTube
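
If you want to try the fixed-seed comparison suggested above, this is a rough sketch against the Ollama API (the tags are assumptions, and jq is only used to print the response text; with temperature 0 and a fixed seed, each quant's answer should be repeatable, so any differences come from the quantization itself):

```
# Run the same prompt against two quantizations with temperature 0 and a fixed seed.
# Tags are examples only; jq simply extracts the response text for comparison.
for tag in 8b-instruct-q4_0 8b-instruct-q8_0; do
  echo "=== llama3.1:$tag ==="
  curl -s http://localhost:11434/api/generate -d '{
    "model": "llama3.1:'"$tag"'",
    "prompt": "List the planets of the solar system in order.",
    "stream": false,
    "options": { "temperature": 0, "seed": 42 }
  }' | jq -r .response
done
```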

Would love to see how to get it running in Proxmox with multiple GPUs. Lots of old articles out there

vulcand

Thanks Matt. Very well explained and informative. I employ tree of thoughts, obviously with multi-shot prompting, and those are normally very complex tasks where the model needs to pick up the hints in a clinical case description to help diagnose or optimize the treatment of the patient. I see that bigger-parameter models perform better, because they pick up more details and correlate the facts better. I noticed that Claude and Gemini are the kings there. What about quantization in this case? Any recommendation?

ISK_VAGR

@technovangelist Thanks, Matt. The experiment was surprising. All in all, it seems that higher quantization achieves better results (at least for function calling), which is counterintuitive to me. However, if this is statistically true, it's good news for local applications driven by a small LLM that calls external (but still local) services. In other words, it's promising for real-time on-prem automation!

As someone suggested in a comment, the balance between model size, quantization, and context window size seems to be crucial. I’d suggest dedicating a session to the context window size and its usage. I’m personally confused by the default length value in Ollama and how to set the desired window size.

Thanks for this course.
Giorgio

solyarisoftware
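
On the context-window question above: there are two common ways to set it in Ollama, either per session in the interactive prompt or baked into a named copy of the model with a Modelfile. A minimal sketch (the base tag and the 8192 value are examples only):

```
# Option 1: set the context window for a single interactive session.
#   ollama run llama3.1
#   >>> /set parameter num_ctx 8192
#
# Option 2: bake the context window into a named copy of the model.
# The base tag and the 8192 value are examples only.
cat > Modelfile <<'EOF'
FROM llama3.1:8b-instruct-q4_K_M
PARAMETER num_ctx 8192
EOF
ollama create llama3.1-8k -f Modelfile
ollama run llama3.1-8k
```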

For the black hole prompt, there are some faults in the physics wording of the Q2 response that had me rank it second, and the Q4 highest.

The bad English style cues of the FP16 response (two uses of "pull" in the opening sentence) had me rate it lowest when, in retrospect, it probably contains the least problematic physics wording of the three.

(Background: failed '90s astrophysics major)

I guess prompting is important. ;)

michaelmistaken

Rule of thumb:
Take the k-quants that fit into your GPU memory. Usually down to q3 the loss is really negligible.
If the model plus context doesn't fit into memory, use I-quants.

VinCarbone

What about with complex coding tasks such as refactoring a codebase?

sammcj

Another great video, thanks. Would you mind adding a link to the related YouTube video in each README file for the corresponding folder in your videoprojects repo?
It would make it a lot easier to find the video when browsing the repo.
Ta, keep up the good work.

AndyAinsworth

Would you (in general) prefer a model with more parameters at a smaller quant over a model with fewer parameters at a larger quant?

UnwalledGarden

Is there a way I can split a big model like llama3 (whichever variant) into two models at 50% of the size, so I can load it on an RPi 5 8GB without seeing a not-enough-memory error? Also, is there any way to find out what the model consists of, so I can strip out unwanted parts and keep only the parts I find useful?

JNET_Reloaded

I usually benchmark models with a simple programming question:

```
You are a software engineer experienced in C++: Write a trivial C++ program that follow this code-style:
Use modern C++20
Use the auto func(...) -> ret syntax, even for auto main()->int
Always open curly braces on new line: DONT auto main()->int{\n... (with no new line between int and '{'); but DO: auto main() ->int \n{\n... instead (with new line between int and '{').
Comment your code.
No explanation, no introduction, keep verbosity to the minimum, only code.

```

Even this simple question fails most of the time on any Q4 model I've tried in Ollama.
I hope to get better results with a higher-precision quantization, but I need to upgrade my computer for that.

escain
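
If you want to run a benchmark prompt like the one above across several quantizations in one go, a small loop is enough. A sketch (the tag names are assumptions, and prompt.txt is a hypothetical file holding the prompt text):

```
# Run the same benchmark prompt against several quantizations of llama3.1.
# Tag names are examples; prompt.txt is assumed to contain the prompt above.
for tag in 8b-instruct-q2_K 8b-instruct-q4_0 8b-instruct-q4_K_M 8b-instruct-q8_0; do
  echo "=== llama3.1:$tag ==="
  ollama run "llama3.1:$tag" "$(cat prompt.txt)"
done
```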