Deep Dive: Quantizing Large Language Models, part 1

Quantization is an effective technique for compressing Large Language Models (LLMs) and accelerating their inference.
In this video, we discuss model quantization: first introducing what it is and building an intuition for rescaling and the problems it creates. We then introduce the different types of quantization: dynamic post-training quantization, static post-training quantization, and quantization-aware training. Finally, we start looking at and comparing actual quantization techniques: PyTorch, ZeroQuant, and bitsandbytes.
00:00 Introduction
02:05 What is quantization?
06:50 Rescaling weights and activations
08:17 The mapping function
12:38 Picking the input range
16:15 Getting rid of outliers
19:50 When can we apply quantization?
26:00 Dynamic post-training quantization with PyTorch
28:42 ZeroQuant
34:50 bitsandbytes
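The rescaling and mapping-function chapters above can be sketched in a few lines of Python. This is a minimal illustration of per-tensor affine int8 quantization, assuming the common "pick the input range from the tensor's min/max" scheme; the function names are hypothetical, not from the video or any library.

```python
def quantize_int8(values):
    """Map floats to int8 via an affine (scale + zero-point) scheme.

    The input range is taken as [min(values), max(values)]; note that a
    single large outlier stretches this range and wastes precision for
    the remaining values -- the motivation for outlier handling.
    """
    qmin, qmax = -128, 127
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin)          # float step per integer level
    zero_point = round(qmin - lo / scale)      # integer that represents 0.0
    q = [max(qmin, min(qmax, round(v / scale + zero_point))) for v in values]
    return q, scale, zero_point


def dequantize_int8(q, scale, zero_point):
    """Recover an approximation of the original floats."""
    return [(qi - zero_point) * scale for qi in q]


weights = [-1.0, 0.0, 0.5, 1.0]
q, scale, zp = quantize_int8(weights)
recovered = dequantize_int8(q, scale, zp)
# Each recovered value differs from the original by at most one scale step.
```

Each quantized value lands within one `scale` of the original, so the rounding error is bounded by the chosen input range divided by 255 levels.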
Deep Dive: Quantizing Large Language Models, part 2
LoRA explained (and a bit about precision and quantization)
Deep Dive on PyTorch Quantization - Chris Gottbrath
Quantization vs Pruning vs Distillation: Optimizing NNs for Inference
What is Retrieval-Augmented Generation (RAG)?
Quantization: Methods for Running Large Language Model (LLM) on your laptop
Day 26 : Fine-Tuning Large Language Models (LLMs) | LORA, QLORA & Quantization Explained
What is LLM Quantization?
Revolutionizing Large Language Models: OneBit's 1 Bit Quantization Breakthrough
Quantization in Deep Learning (LLMs)
LLMs Quantization Crash Course for Beginners
Should You Use Open Source Large Language Models?
SmoothQuant: Efficient & Accurate Quantization for Massive Language Models
What is quantization?
QLoRA: Efficient Finetuning of Quantized Large Language Models (Tim Dettmers)
LLaMa GPTQ 4-Bit Quantization. Billions of Parameters Made Smaller and Smarter. How Does it Work?
Dynamic Quantization with Unsloth: Shrinking a 20GB Model to 5GB Without Accuracy Loss!
AMD Explores DNN Quantization, from Theory to Practice (Preview)
How Quantization Makes AI Models Faster and More Efficient
Tim Dettmers | QLoRA: Efficient Finetuning of Quantized Large Language Models
QLoRA paper explained (Efficient Finetuning of Quantized LLMs)
All You Need To Know About Running LLMs Locally
Fine-tuning Large Language Models (LLMs) | w/ Example Code