LLaMa GPTQ 4-Bit Quantization. Billions of Parameters Made Smaller and Smarter. How Does it Work?

We dive deep into the world of GPTQ 4-bit quantization for large language models like LLaMa. We'll explore the mathematics behind quantization, emergent features, and the derivative and Hessian information that drives this powerful technique. We'll also demonstrate how to apply GPTQ 4-bit quantization with the GPTQ-for-LLaMa library. This video is a must-watch if you're curious about optimizing large language models while preserving emergent features. Join us as we unravel the mysteries of quantization and improve our understanding of how large language models work! Don't forget to like, subscribe, and tell us what you'd like to learn about next in the comments.
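
As a rough illustration of the basic scaling step covered in the video, here is a minimal Python sketch of absmax quantization of float weights to 8-bit integers and back. It is a simplified stand-in for the idea, not the actual GPTQ procedure:

```python
import numpy as np

# Minimal absmax scaling sketch: map float weights onto signed 8-bit
# integers with a single per-tensor scale, then dequantize and measure
# the round-trip (quantization) error.
weights = np.random.randn(4, 4).astype(np.float32)

scale = np.abs(weights).max() / 127.0               # one scale per tensor
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = q.astype(np.float32) * scale          # approximate reconstruction

print("max quantization error:", np.abs(weights - dequantized).max())
```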

#GPTQ4Bit #Quantization #LargeLanguageModels #NeuralNetworks #Optimization #EmergentFeatures #LlamaLibrary #DeepLearning #AI

0:00 Intro
0:33 What is quantization?
2:17 Derivatives and the Hessian
4:03 Emergent features
5:17 GPTQ 4-Bit quantization process
8:40 Using GPTQ-for-LLaMa
10:50 Outro

Comments

You really put a lot of your time and effort into these highly informative videos. Thank you so much

kaymcneely

Thanks for publishing this. I am glad someone is breaking it down, as I have been talking over people's heads quite a lot about this for the last three weeks.

nightwintertooth

This is exactly the level of explanation that I need: I can pick up on the key concepts and dive deeper in other ways at my own pace. Keep it up!

svb

I sincerely appreciate your willingness to share the results of your research and understanding!

beerreeb

Really nice. As someone who barely knows how matrices and such work, I found you made these quantization concepts easy to understand.

nacs

This is ridiculously well explained and easy to understand for someone only beginning to explore this rabbit hole. Whatever motivates you to keep making these videos, I hope it continues. I am going to go ahead and check out the rest of your library. I also hope you continue to explain concepts around the subject of these models. Thank you.

vishnunair

Fantastic explanation and great tutorial! Hoping this channel grows a lot in the future!

MaJetiGizzle

It's so sad you abandoned your channel. Your explanations are gems

alx

Dude keep these great videos up. We appreciate you

logan

Thanks for the clear and concise explanation, it was perfect.

quinn

Thanks for all the effort that went into making this video. Very informative indeed.

fahnub

What’s your view on bitsandbytes NF4 versus GPTQ for quantisation?

TrelisResearch
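
For readers unfamiliar with the NF4 option mentioned in the question above: bitsandbytes NF4 is normally enabled through the transformers BitsAndBytesConfig when a model is loaded. A minimal sketch, where the model id is only a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Placeholder model id; substitute whichever LLaMA-family checkpoint you use.
model_id = "my-org/llama-7b-hf"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize linear-layer weights to 4 bits on load
    bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.float16,  # do the matmuls in fp16
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
```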

Your intelligence is impressive as it compensates for my lack of understanding 😅, but thanks to your articulate explanations, I believe I'm grasping it. I'm grateful to you for imparting such incredible content.

redfield

Thank you so much for simplifying this to such an extent. Subscribed.

dhirajkumarsahu

Great video as always. Thanks for sharing your knowledge.

jonrross

Wow, really good explanation. The part about encoding the 16-bit float as an 8-bit integer by scaling is pretty intuitive, but the process of adding the error back so that small values are less likely to fail is mind-blowing. I didn't expect it to work, but if it's being implemented right now, it's because it does.

enmanuelsancho
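
The error propagation that this comment refers to can be sketched in a few lines. The version below is a simplified error-feedback illustration that spreads each weight's rounding error evenly over the weights not yet quantized; actual GPTQ weights that update using inverse-Hessian information, which is omitted here for brevity:

```python
import numpy as np

def quantize_row_with_error_feedback(row, scale):
    """Quantize one weight at a time, folding each rounding error into
    the weights that have not been quantized yet (simplified sketch)."""
    row = row.astype(np.float32).copy()
    quantized = np.empty_like(row)
    for i in range(len(row)):
        q = np.clip(np.round(row[i] / scale), -127, 127) * scale
        quantized[i] = q
        error = row[i] - q                      # rounding error of this weight
        remaining = len(row) - i - 1
        if remaining:
            row[i + 1:] += error / remaining    # spread the error forward
    return quantized

row = np.random.randn(8).astype(np.float32)
scale = np.abs(row).max() / 127.0
print(quantize_row_with_error_feedback(row, scale))
```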

This was a great explanation, thank you.

smellslikeupdog

Underrated channel. You, sir, deserve more subs.

P.S.: Could you do the same for GGML? And if you already have, a playlist covering GGML, GPTQ, LoRA, QLoRA, 4-bit vs 8-bit, and performance by parameter count (3B, 7B, etc.) would be nice to have. A lot of channels cover the model as a whole, but most of them never cover the process behind the models. Your video made it easy to follow and understand the basics behind LLM quantization. Keep it up.

yolo

Thanks. Where can I find the model 'lmsys_vicuna-7b-delta-v1.1' that you mentioned in your demonstration?

hoatran-lvrj

Great explanation.

Some questions:

When we are quantising and computing the quantisation loss, do we not need to supply some data for it to compute the loss against? If not, how exactly is this loss computed? (Surely we need some inputs and expected outputs to compute this loss; is this why all of the weight errors were 0 when you quantised?)

If we do, could this be interpreted as a form of post-training quantisation 'fine-tuning'? By this I mean that we could use domain data in the quantisation process to help preserve the emergent features in the model that are most useful for our specific domain dataset.

Thanks!

tomm
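
For context on the calibration question above: GPTQ-style quantizers do run a small calibration set through the model and measure, layer by layer, how much quantization changes that layer's outputs on those activations. Below is a minimal sketch of that reconstruction loss with illustrative names and a plain symmetric 4-bit grid, not the actual GPTQ weight update:

```python
import numpy as np

def layer_reconstruction_error(W, W_quant, X):
    """Squared error between the layer's outputs before and after
    quantization, evaluated on calibration activations X."""
    return float(np.linalg.norm(W @ X - W_quant @ X) ** 2)

W = np.random.randn(16, 32).astype(np.float32)    # layer weights
X = np.random.randn(32, 128).astype(np.float32)   # stand-in calibration activations

scale = np.abs(W).max() / 7.0                     # symmetric 4-bit grid: levels -7..7
W_q = np.clip(np.round(W / scale), -7, 7) * scale

print("reconstruction loss:", layer_reconstruction_error(W, W_q, X))
```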