Deep Dive: Quantizing Large Language Models, part 1

Quantization is an excellent technique to compress Large Language Models (LLMs) and accelerate their inference.

In this video, we discuss model quantization, first introducing what it is and building an intuition for rescaling weights and activations and the problems it creates. Then we introduce the different types of quantization: dynamic post-training quantization, static post-training quantization, and quantization-aware training. Finally, we start looking at and comparing actual quantization techniques: PyTorch dynamic quantization, ZeroQuant, and bitsandbytes.
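As a rough illustration of the mapping function and input-range chapters below, here is a minimal sketch of affine (asymmetric) int8 quantization with a min-max calibrated range. The function names and the toy tensor are assumptions for this example, not code from the video.

```python
import torch

def quantize_affine(x: torch.Tensor, num_bits: int = 8):
    """Map float values to int8 using a min-max calibrated range (affine scheme)."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1   # -128, 127
    x_min, x_max = x.min(), x.max()                                 # input range picked from the data
    scale = (x_max - x_min) / (qmax - qmin)                         # float step per integer step
    zero_point = int(qmin - torch.round(x_min / scale))             # integer that represents 0.0
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.int8)
    return q, scale.item(), zero_point

def dequantize_affine(q: torch.Tensor, scale: float, zero_point: int) -> torch.Tensor:
    """Recover approximate float values; the gap to the original is the quantization error."""
    return (q.to(torch.float32) - zero_point) * scale

# Toy example: a few weights with one outlier stretching the input range
w = torch.tensor([0.1, -0.2, 0.05, 3.0])
q, scale, zp = quantize_affine(w)
print(q, dequantize_affine(q, scale, zp))
```

Note how the single outlier (3.0) stretches the range and therefore the scale, which is exactly the precision problem the "Getting rid of outliers" chapter addresses.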

00:00 Introduction
02:05 What is quantization?
06:50 Rescaling weights and activations
08:17 The mapping function
12:38 Picking the input range
16:15 Getting rid of outliers
19:50 When can we apply quantization?
26:00 Dynamic post-training quantization with PyTorch
28:42 ZeroQuant
34:50 bitsandbytes
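To connect the last two chapters to code, below is a minimal sketch of (a) dynamic post-training quantization with PyTorch on a toy feed-forward block and (b) loading a model in 8-bit through bitsandbytes and transformers. The checkpoint name "facebook/opt-125m", the load_in_8bit flag, and device_map="auto" are assumptions for illustration; exact arguments vary across library versions.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

# (a) Dynamic post-training quantization: nn.Linear weights are stored in int8,
#     activations are quantized on the fly at inference time, no calibration set needed.
fp32_block = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
int8_block = quantize_dynamic(fp32_block, {nn.Linear}, dtype=torch.qint8)
print(int8_block)  # Linear layers are replaced by dynamically quantized equivalents

# (b) bitsandbytes 8-bit inference (LLM.int8()) through transformers;
#     requires a CUDA GPU plus the bitsandbytes and accelerate packages.
from transformers import AutoModelForCausalLM

bnb_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",    # assumed example checkpoint
    load_in_8bit=True,      # newer transformers versions express this via BitsAndBytesConfig
    device_map="auto",
)
```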
Comments

Great explanation! I have one question... Is it common practice to regularize the LLM cost function, e.g. with L2, to reduce the weight "outliers" during training?

Joe-nhfy

I have been wanting to understand quantization for a very long time. Thank you! Would you mind sharing the slides please? Thank you.

DED_Search

What is meant by a calibration dataset? Is it equivalent to an evaluation set?

monishostwal

Watching this at 1.25x speed. High-quality content as usual. Keep it up, Julien 💪

joaogalego

Is there any chance to get the slides? It's very well organized and presented. Thank you so much for your work✨🔥🔥

AI-Projects