Deep Dive: Quantizing Large Language Models, part 1

Quantization is an effective technique for compressing Large Language Models (LLMs) and accelerating their inference.
In this video, we discuss model quantization: first introducing what it is and building an intuition for rescaling and the problems it creates. We then introduce the different types of quantization: dynamic post-training quantization, static post-training quantization, and quantization-aware training. Finally, we start looking at and comparing actual quantization techniques: PyTorch, ZeroQuant, and bitsandbytes.
00:00 Introduction
02:05 What is quantization?
06:50 Rescaling weights and activations
08:17 The mapping function
12:38 Picking the input range
16:15 Getting rid of outliers
19:50 When can we apply quantization?
26:00 Dynamic post-training quantization with PyTorch
28:42 ZeroQuant
34:50 bitsandbytes
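The rescaling and mapping-function chapters above can be sketched in a few lines of Python. This is a minimal illustration of per-tensor affine int8 quantization, assuming the common "pick the input range from the tensor's min/max" scheme; the function names are hypothetical, not from the video or any library.

```python
def quantize_int8(values):
    """Map floats to int8 via an affine (scale + zero-point) scheme.

    The input range is taken as [min(values), max(values)]; note that a
    single large outlier stretches this range and wastes precision for
    the remaining values -- the motivation for outlier handling.
    """
    qmin, qmax = -128, 127
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin)          # float step per integer level
    zero_point = round(qmin - lo / scale)      # integer that represents 0.0
    q = [max(qmin, min(qmax, round(v / scale + zero_point))) for v in values]
    return q, scale, zero_point


def dequantize_int8(q, scale, zero_point):
    """Recover an approximation of the original floats."""
    return [(qi - zero_point) * scale for qi in q]


weights = [-1.0, 0.0, 0.5, 1.0]
q, scale, zp = quantize_int8(weights)
recovered = dequantize_int8(q, scale, zp)
# Each recovered value differs from the original by at most one scale step.
```

Each quantized value lands within one `scale` of the original, so the rounding error is bounded by the chosen input range divided by 255 levels.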
Deep Dive: Quantizing Large Language Models, part 2
LoRA explained (and a bit about precision and quantization)
Deep Dive on PyTorch Quantization - Chris Gottbrath
Quantization vs Pruning vs Distillation: Optimizing NNs for Inference
What is Retrieval-Augmented Generation (RAG)?
Quantization: Methods for Running Large Language Model (LLM) on your laptop
Day 26 : Fine-Tuning Large Language Models (LLMs) | LORA, QLORA & Quantization Explained
What is LLM Quantization?
Revolutionizing Large Language Models: OneBit's 1 Bit Quantization Breakthrough
Quantization in Deep Learning (LLMs)
LLMs Quantization Crash Course for Beginners
Should You Use Open Source Large Language Models?
SmoothQuant: Efficient & Accurate Quantization for Massive Language Models
What is quantization?
QLoRA: Efficient Finetuning of Quantized Large Language Models (Tim Dettmers)
LLaMa GPTQ 4-Bit Quantization. Billions of Parameters Made Smaller and Smarter. How Does it Work?
Dynamic Quantization with Unsloth: Shrinking a 20GB Model to 5GB Without Accuracy Loss!
AMD Explores DNN Quantization, from Theory to Practice (Preview)
How Quantization Makes AI Models Faster and More Efficient
Tim Dettmers | QLoRA: Efficient Finetuning of Quantized Large Language Models
QLoRA paper explained (Efficient Finetuning of Quantized LLMs)
All You Need To Know About Running LLMs Locally
Fine-tuning Large Language Models (LLMs) | w/ Example Code