Quantization in LLM
If we can quantize a 16-bit floating point number down to 8 bits or 4 bits, why not go all the way to 1 bit? It turns out you can, but you'll see significant degradation in the output quality of the language model. One bit is not enough, so we need a little more than that.
You can't have a fraction of a bit, though. Quantization is about reducing the number of bits each floating point number uses so the model fits into the available memory of the device doing the computation. When you have billions of 16-bit floating point parameters, that's a lot of memory.
So we round each number to the nearest value representable with fewer bits. This reduces the precision of every parameter in the model but lets it fit into GPU memory, which is a scarce resource. Doing this allows for faster computation, lower hardware requirements, and less energy consumption.
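To make the rounding idea concrete, here is a minimal NumPy sketch of symmetric post-training quantization to 8-bit integers. The function names and the single per-tensor scale are assumptions for illustration, not the method of any particular library (real LLM quantizers such as GPTQ or AWQ are considerably more sophisticated).

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    # Choose a scale so the largest-magnitude weight maps to the int8 limit (127).
    scale = np.abs(weights).max() / 127.0 + 1e-12
    # Round to the nearest integer step and clip into the int8 range.
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximation of the original float weights.
    return q.astype(np.float32) * scale

# Example: 16-bit floats shrink to 8-bit ints at the cost of some precision.
w = np.random.randn(4, 4).astype(np.float16)
q, s = quantize_int8(w.astype(np.float32))
w_hat = dequantize_int8(q, s)
print("max rounding error:", np.abs(w.astype(np.float32) - w_hat).max())
```

Each weight now occupies half the memory of a 16-bit float, and the printed error shows how much precision the rounding cost.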
The term "1.58 bits" refers to using three states per weight: -1, 0, and 1. One bit gives two states (0 and 1); two bits give four, but we use only three of them and throw one away. Since log2(3) ≈ 1.58, the ternary scheme is called 1.58-bit. In reality the hardware still stores 2 bits per weight; the algorithm simply uses 3 of the 4 available states, and the 4th state is never used.
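And a sketch of the ternary case: each weight is snapped to -1, 0, or +1 with a single scale factor, which is roughly the spirit of the 1.58-bit BitNet work. The absmean scaling rule used here is an assumption for illustration only.

```python
import numpy as np

def quantize_ternary(weights: np.ndarray):
    # One scale for the whole tensor (assumed absmean rule for this sketch).
    scale = np.abs(weights).mean() + 1e-8
    # Round and clip so every weight lands in {-1, 0, +1}.
    q = np.clip(np.round(weights / scale), -1, 1).astype(np.int8)
    return q, scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_ternary(w)
print(np.unique(q))   # only values drawn from {-1, 0, 1}
print(q * s)          # dequantized approximation of w
```

Even though only three states are used, each weight is still stored in 2 bits on real hardware, which is where the gap between 1.58 "information" bits and 2 stored bits comes from.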
What is LLM quantization?
Which Quantization Method is Right for You? (GPTQ vs. GGUF vs. AWQ)
Quantization vs Pruning vs Distillation: Optimizing NNs for Inference
5. Comparing Quantizations of the Same Model - Ollama Course
Quantization in Deep Learning (LLMs)
Part 1-Road To Learn Finetuning LLM With Custom Data-Quantization,LoRA,QLoRA Indepth Intuition
LoRA explained (and a bit about precision and quantization)
Quantization in deep learning | Deep Learning Tutorial 49 (Tensorflow, Keras & Python)
What is LLM Quantization?
LLMs Quantization Crash Course for Beginners
LLaMa GPTQ 4-Bit Quantization. Billions of Parameters Made Smaller and Smarter. How Does it Work?
AWQ for LLM Quantization
Quantization explained with PyTorch - Post-Training Quantization, Quantization-Aware Training
How to Quantize an LLM with GGUF or AWQ
Quantization Explained in 60 Seconds #AI
Understanding: AI Model Quantization, GGML vs GPTQ!
Llama 1-bit quantization - why NVIDIA should be scared
Understanding 4bit Quantization: QLoRA explained (w/ Colab)
QLoRA - Efficient Finetuning of Quantized LLMs
New Tutorial on LLM Quantization w/ QLoRA, GPTQ and Llamacpp, LLama 2
LLM model quantization and how it impacts model performance
Quantization in LLM
Day 61/75 LLM Quantization | How Accuracy is maintained? | How FP32 and INT8 calculations same?
What is Quantization? - LLM Concepts ( EP - 3 ) #quantization #llm #ml #ai #artificialintelligence