Which Quantization Method is Right for You? (GPTQ vs. GGUF vs. AWQ)

In this tutorial, we will explore several methods for loading pre-quantized models, such as Zephyr 7B. We will cover the three common quantization methods: GPTQ, GGUF (formerly GGML), and AWQ.
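
For reference, loading a pre-quantized GPTQ model takes only a few lines with transformers. A minimal sketch, assuming the Hugging Face repo TheBloke/zephyr-7B-beta-GPTQ and an environment with transformers, optimum, and auto-gptq installed:

# Minimal sketch: load a pre-quantized GPTQ build of Zephyr 7B.
# The repo name below is an assumption; swap in the GPTQ repo you actually use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/zephyr-7B-beta-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Why is quantization useful?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))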

Timeline
0:00 Introduction
0:25 Loading Zephyr 7B
3:25 Quantization
7:42 Pre-quantized LLMs
8:42 GPTQ
10:29 GGUF
12:22 AWQ
14:46 Outro

Support my work:
👪 Join as a Channel Member:
/ @maartengrootendorst

I'm writing a book!

#datascience #machinelearning #ai
Comments

Amazing content! Most YouTube tutorials just try out the outputs of pre-made LLMs but rarely dive into this level of technical detail.

cken

Thanks Maarten! I was searching for how to quantize exactly this model, zephyr-7b-beta, and realized halfway through the video that you were using it!

utsavdavda

Thanks a lot for clarifying the main differences between quantization methods and also for sharing your code.

BitsNBytesAI

Thanks! I didn't want to feel too unproductive on Thanksgiving, but I also didn't want to commit to a full video series. You always release timely and great stuff!

jacehua

Thanks for including the Colab; I wasn't aware of AWQ before this video.
Would you consider making a video on the efficiency of each method, especially when using a GPU with a GGUF model?

wezfaas
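
(On the GPU-with-GGUF question above: GGUF files can offload layers to the GPU through llama-cpp-python. A minimal sketch, assuming a locally downloaded GGUF file and a llama-cpp-python build compiled with GPU support; the file name is hypothetical:)

# Minimal sketch: run a GGUF model with GPU offloading via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="zephyr-7b-beta.Q4_K_M.gguf",  # hypothetical local file
    n_gpu_layers=-1,  # offload all layers to the GPU; 0 keeps everything on CPU
    n_ctx=2048,       # context window size
)

out = llm("Why is quantization useful?", max_tokens=64)
print(out["choices"][0]["text"])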

I was struggling with quantization last weekend, so this is very timely! Thanks!

fzmnszt

Outstanding! (To put things in perspective: I've seen a LOT of praise for wrapping the obvious or marketing-only BS into lengthy videos and I'm not shy to speak my mind there too!)

gue

Thank you for the informative video. I now understand the huge mistake I made using GGUF when I had the VRAM to run primarily on the GPU.

JGKorny

Useful information and a well-made video.

sanjayojha

Thanks for the video! What I don't understand is that people always say AWQ is faster than GPTQ, but on my 3060 12GB, AWQ models are usually quite slow, around 3 t/s, while with GPTQ I can get 5 to 20 t/s.

silentwindstudio
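
(For comparisons like the ones asked about above, it is easy to measure throughput on your own hardware. A minimal sketch, assuming an AWQ repo such as TheBloke/zephyr-7B-beta-AWQ and autoawq installed; the same timing loop works for a GPTQ checkpoint:)

# Minimal sketch: measure tokens/second for a pre-quantized model.
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/zephyr-7B-beta-AWQ"  # assumption: swap in the repo you test
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Why is quantization useful?", return_tensors="pt").to(model.device)
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=128)
elapsed = time.time() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")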

Thanks, Maarten. I wish you could share some performance comparisons between the different methods; I have been trying to find some but couldn't. I do know that AWQ is better than GPTQ, but I would like to compare it to GGUF.

rubencontesti

Thank you for explaining the differences and sharing the code.

AmishaHSomaiya

Thanks for this video; it was a great explanation of the differences between the three methods. How's the support for AWQ now? Also, I would love it if you could make a video on how to deploy these quantized models to production.

maryamashraf

Great video and great comparisons. Could you also make a video on how to quantize a model yourself?

naseerfaheem

Really enjoyed this session! Any chance you could continue this by showing how to fine-tune these versions of the models?

radmilraychev

Really enjoyed your video; it was very informative. I just wanted to know: can fine-tuning be done on these pre-quantized models?

venushah

Brilliant video; you have a style that explains things nicely. Thank you. Sub'd.

If you are looking for ideas, I think an overview of what "weights, biases and parameters" mean for models would be great.

FamilyManMoving

Great content! I am wondering whether nowadays we should choose LLMs over BERT models for most tasks, or use them separately based on specific use cases. That could be an interesting topic to discuss!

yueyu

Thank you so much for the video. I would like to know which method is faster at inference time.

TheMrguiller

Great video. However, it's quite frustrating trying to run this code in production; the dependencies are never correct.

efexzium