5. Comparing Quantizations of the Same Model - Ollama Course

Welcome back to the Ollama course! In this lesson, we dive into the fascinating world of AI model quantization. Using variations of the llama3.1 model, we explore how different quantization levels affect performance and output quality.

Through this video, you'll gain a deeper understanding of how to choose the right quantization for the way you use AI models, so you get the best performance and results for your specific needs. Don't forget to subscribe for more lessons in this free Ollama course!
Thanks for watching!
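
The workflow in the video boils down to pulling the same model at several quantization levels and asking each one the same questions. Here is a minimal sketch of that idea (the exact tag names are assumptions; check the llama3.1 page in the Ollama library for the tags that are actually published):

```
# Pull the same llama3.1 8B model at a few quantization levels.
# Tag names are assumptions; verify them on ollama.com/library/llama3.1.
for tag in 8b-instruct-q2_K 8b-instruct-q4_0 8b-instruct-q4_K_M 8b-instruct-fp16; do
  ollama pull "llama3.1:$tag"
done

# Ask each quantization the same question and compare the answers side by side.
for tag in 8b-instruct-q2_K 8b-instruct-q4_0 8b-instruct-q4_K_M 8b-instruct-fp16; do
  echo "=== llama3.1:$tag ==="
  ollama run "llama3.1:$tag" "Explain what a black hole is in two sentences."
done
```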

My Links 🔗

00:00 - Start with an example
00:24 - Introduction
00:56 - Lots of claims on the Discord
01:26 - Intro to the app
01:57 - Where to find the code
02:20 - Grab a few quantizations
02:57 - You should regularly pull the models again
03:30 - Back to the Black Hole answers
04:39 - The classic logic problem
05:35 - How about function calling
08:31 - How about for prompts with more reasoning
09:01 - Are those questions stupid?
09:30 - Which quant to use?
Comments

On the subject of quantization, you could have included how to choose among q4_0, q4_K, and q4_K_M. If one is better at the same quantization level, for example if q4_K is better than q4_0, why do we create both? Thank you for these videos; the format of the course, focusing on a single topic per session, is very nice!

romulopontual

I'm really in awe of how well you explain everything. I wish I had professors with your patience and teaching ability when I was at university. Anyway, thank you for the lesson. I already love Ollama, but your content is really making me see LLMs, and in this case quantizations, with different eyes.

To be honest, I used to think that anything lower than q8 for, let's say, a 7 or 8 billion parameter model, would be pretty much useless, but after experimenting with Llama 3.1, Mistral, and a few others, I think q4 is definitely the sweet spot for my needs. Llama 3.1-q4 retains a decent amount of reasoning capabilities and I can increase the context length to have it work better with whatever information I want to feed it on the spot.

Thanks again for the content. It's awesome!

tudor-octavian

Looking at quantized models is something I haven't explored yet for my home server. I have an HPE MicroServer that I can't put a graphics card into because it's physically too small, so I'm running CPU only. Now you've got me curious whether I can actually get faster speeds just by using a smaller quantized model. Thank you so much for making this content. You're absolutely amazing.

markjones

I think that q2 was the best in the scenario where you were basically searching for the "json" string, because it wasn't trying to "understand" what JSON is. It was just a word/string, and so it was always "caught".

wgabrys

Excellent video, Matt. Your take on the usefulness of benchmarks is smart.

fabriai

You changed my thinking on which quant to use! I'll experiment more with running the lowest quant that still gives acceptable answers. Thanks!

ReidKimball

Great tutorial as usual.

Btw, here is a Windows PowerShell command to update all the models if you have more than one already installed on Windows.

ollama ls | ForEach-Object {"{0}" -f($_ -split '\s+')} | Where-Object { $_ -notmatch 'failed' -and $_ -notmatch 'NAME' } | ForEach-Object {$model = $_; "Updating model $model"; ollama pull $model}

This one works both on Mac and Linux:

ollama ls | awk '{print $1}' | grep -v NAME | while read model; do echo "### updating $model ###"; ollama pull $model; done

I wrote those myself, but you can ask your favorite GPT for an explanation.

LuisYax

This is great. I'd love to see the difference between the quant levels as the length of the prompt increases. I find that the lower quants don't handle longer inputs very well. I'm not sure why that is.

ManjaroBlack

I agree with all the points made in the video and would just like to add my own experience for viewers looking for more details in the comments.

One important factor to consider when choosing quantization levels is the impact of hardware constraints. For example, I've been running the LLaMA 3.1 70B model, which fits in 48GB of VRAM at Q4 but with a limited context window. I found that running the 70B model at Q2 (which frees up memory for extending the context window) gave me better results than the 8B model with an extended context window. This balance between model size, quantization, and context window size can be crucial depending on your specific use case and hardware capabilities.

tylerlindsay
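
For anyone who wants to try the size/quantization/context tradeoff described in the comment above, here is a minimal sketch of requesting a larger context window through the Ollama API (the model tag and the 8192-token value are assumptions; use whatever your hardware can actually hold):

```
# Request a larger context window per call via the Ollama REST API.
# The tag and num_ctx value are examples only; adjust for your hardware.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:70b-instruct-q2_K",
  "prompt": "Summarize the following case notes: ...",
  "stream": false,
  "options": { "num_ctx": 8192 }
}'
```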

I found the evaluation interesting, and the conclusion wise: try with your own prompts and see what happens.
I would suggest extending the evaluation to many more conversation turns, because some models get lost later on despite doing well on the first reply.

Your evaluation made me curious to try different quants at temperature 0 and the same seed, to see if some of the quants end up with an identical output!

supercurioTube
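
If you want to try the fixed-seed comparison suggested above, this is a rough sketch against the Ollama API (the tags are assumptions, and jq is only used to print the response text; with temperature 0 and a fixed seed, each quant's answer should be repeatable, so any differences come from the quantization itself):

```
# Run the same prompt against two quantizations with temperature 0 and a fixed seed.
# Tags are examples only; jq simply extracts the response text for comparison.
for tag in 8b-instruct-q4_0 8b-instruct-q8_0; do
  echo "=== llama3.1:$tag ==="
  curl -s http://localhost:11434/api/generate -d '{
    "model": "llama3.1:'"$tag"'",
    "prompt": "List the planets of the solar system in order.",
    "stream": false,
    "options": { "temperature": 0, "seed": 42 }
  }' | jq -r .response
done
```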

Would love to see how to get it running in Proxmox with multiple GPUs. Lots of old articles out there

vulcand

Thanks Matt. Very well explained and informative. I employ tree of thoughts, obviously with multi-shot prompting, and those are normally very complex tasks where the model needs to pick up the hints in a clinical case description to help diagnose or optimize the treatment of the patient. I see that bigger-parameter models perform better, because they pick up more details and correlate the facts better. I noticed that Claude and Gemini are the kings there. What about quantization in this case? Any recommendation?

ISK_VAGR

@technovangelist Thanks, Matt. The experiment was surprising. All in all, it seems that higher quantization achieves better results (at least for function calling), which is counterintuitive to me. However, if this is statistically true, it's good news for local applications driven by a small LLM that calls external (but still local) services. In other words, it's promising for real-time on-prem automation!

As someone suggested in a comment, the balance between model size, quantization, and context window size seems to be crucial. I’d suggest dedicating a session to the context window size and its usage. I’m personally confused by the default length value in Ollama and how to set the desired window size.

Thanks for this course.
Giorgio

solyarisoftware
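
On the context-window question above: there are two common ways to set it in Ollama, either per session in the interactive prompt or baked into a named copy of the model with a Modelfile. A minimal sketch (the base tag and the 8192 value are examples only):

```
# Option 1: set the context window for a single interactive session.
#   ollama run llama3.1
#   >>> /set parameter num_ctx 8192
#
# Option 2: bake the context window into a named copy of the model.
# The base tag and the 8192 value are examples only.
cat > Modelfile <<'EOF'
FROM llama3.1:8b-instruct-q4_K_M
PARAMETER num_ctx 8192
EOF
ollama create llama3.1-8k -f Modelfile
ollama run llama3.1-8k
```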

For the black hole prompt, there are some faults in the physics wording of the Q2 response that had me rank it second, and the Q4 highest.

The bad English style cues of the FP16 response (two uses of "pull" in the opening sentence) had me rate it lowest when, in retrospect, it probably contains the least problematic physics wording of the three.

(Background: failed '90s astrophysics major)

I guess prompting is important. ;)

michaelmistaken

Rule of thumb:
Take the k-quants that fit into your GPU memory. Usually down to q3 the loss is really negligible.
If the model plus context doesn't fit into memory, use I-quants.

VinCarbone

What about with complex coding tasks such as refactoring a codebase?

sammcj

Another great video, thanks. Would you mind adding a link to the related YouTube video in each README file for the corresponding folder in your videoprojects repo?
It would make it a lot easier to find the video when browsing the repo.
Ta, keep up the good work.

AndyAinsworth

Would you (in general) prefer a model with more parameters at a smaller quant over a model with fewer parameters at a larger quant?

UnwalledGarden

Is there a way I can split a big model like llama3 (whichever variant) into two models at 50% of the size, so I can load it on an RPi 5 8GB without seeing a not-enough-memory error? Also, is there any way to find out what the model consists of, so I can strip out unwanted parts and keep only the parts I find useful?

JNET_Reloaded

I usually benchmark models with a simple programming question:

```
You are a software engineer experienced in C++: Write a trivial C++ program that follow this code-style:
Use modern C++20
Use the auto func(...) -> ret syntax, even for auto main()->int
Always open curly braces on new line: DONT auto main()->int{\n... (with no new line between int and '{'); but DO: auto main() ->int \n{\n... instead (with new line between int and '{').
Comment your code.
No explanation, no introduction, keep verbosity to the minimum, only code.

```

Even this simple question fails most of the time on any Q4 model I've tried in Ollama.
I hope to get better results with a higher-precision quantization, but I need to upgrade my computer for that.

escain
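
If you want to run a benchmark prompt like the one above across several quantizations in one go, a small loop is enough. A sketch (the tag names are assumptions, and prompt.txt is a hypothetical file holding the prompt text):

```
# Run the same benchmark prompt against several quantizations of llama3.1.
# Tag names are examples; prompt.txt is assumed to contain the prompt above.
for tag in 8b-instruct-q2_K 8b-instruct-q4_0 8b-instruct-q4_K_M 8b-instruct-q8_0; do
  echo "=== llama3.1:$tag ==="
  ollama run "llama3.1:$tag" "$(cat prompt.txt)"
done
```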