LLaMa GPTQ 4-Bit Quantization. Billions of Parameters Made Smaller and Smarter. How Does it Work?

We dive deep into the world of GPTQ 4-bit quantization for large language models like LLaMa. We'll explore the mathematics behind quantization, emergent features, and the derivative and Hessian information that drives this powerful technique. We'll also demonstrate how to apply GPTQ 4-bit quantization with the GPTQ-for-LLaMa library. This video is a must-watch if you're curious about optimizing large language models while preserving emergent features. Join us as we unravel the mysteries of quantization and improve our understanding of how large language models work! Don't forget to like, subscribe, and tell us what you'd like to learn about next in the comments.
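
As a rough illustration of the basic scaling step covered in the video, here is a minimal Python sketch of absmax quantization of float weights to 8-bit integers and back. It is a simplified stand-in for the idea, not the actual GPTQ procedure:

```python
import numpy as np

# Minimal absmax scaling sketch: map float weights onto signed 8-bit
# integers with a single per-tensor scale, then dequantize and measure
# the round-trip (quantization) error.
weights = np.random.randn(4, 4).astype(np.float32)

scale = np.abs(weights).max() / 127.0               # one scale per tensor
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = q.astype(np.float32) * scale          # approximate reconstruction

print("max quantization error:", np.abs(weights - dequantized).max())
```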

#GPTQ4Bit #Quantization #LargeLanguageModels #NeuralNetworks #Optimization #EmergentFeatures #LlamaLibrary #DeepLearning #AI

0:00 Intro
0:33 What is quantization?
2:17 Derivatives and the Hessian
4:03 Emergent features
5:17 GPTQ 4-Bit quantization process
8:40 Using GPTQ-for-LLaMa
10:50 Outro

Comments

You really put a lot of your time and effort into these highly informative videos. Thank you so much

kaymcneely

Thanks for publishing this. I am glad someone is breaking it down, as I have been talking over people's heads quite a lot about this for the last three weeks.

nightwintertooth

This is exactly the level of explanation that I need: I can pick up on the key concepts and dive deeper in other ways at my own pace. Keep it up!

svb

I sincerely appreciate your willingness to share the results of your research and understanding!

beerreeb

Really nice. As someone who barely knows how matrices and such work, I found you made these quantization concepts easy to understand.

nacs

This is ridiculously well explained and easy to understand for someone only beginning to explore this rabbit hole. Whatever motivates you to keep making these videos, I hope it continues. I am going to go ahead and check out the rest of your library. I also hope you continue to explain concepts around the subject of these models. Thank you.

vishnunair

Fantastic explanation and great tutorial! Hoping this channel grows a lot in the future!

MaJetiGizzle

It's so sad you abandoned your channel. Your explanations are gems

alx

Dude keep these great videos up. We appreciate you

logan

Thanks for the clear and concise explanation, it was perfect.

quinn

Thanks for all the effort that went into making this video. Very informative indeed.

fahnub

What’s your view on bitsandbytes NF4 versus GPTQ for quantisation?

TrelisResearch
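
For readers unfamiliar with the NF4 option mentioned in the question above: bitsandbytes NF4 is normally enabled through the transformers BitsAndBytesConfig when a model is loaded. A minimal sketch, where the model id is only a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Placeholder model id; substitute whichever LLaMA-family checkpoint you use.
model_id = "my-org/llama-7b-hf"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize linear-layer weights to 4 bits on load
    bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.float16,  # do the matmuls in fp16
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
```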

Your intelligence is impressive as it compensates for my lack of understanding 😅, but thanks to your articulate explanations, I believe I'm grasping it. I'm grateful to you for imparting such incredible content.

redfield

Thank you so much for simplifying this to such an extent. Subscribed.

dhirajkumarsahu

Great video as always. Thanks for sharing your knowledge.

jonrross

Wow, really good explanation. The part about encoding the 16-bit float as an 8-bit integer by scaling is pretty intuitive, but the process of adding the error back so that small values are less likely to fail is mind-blowing. I didn't expect it to work, but if it's being implemented right now, it's because it does.

enmanuelsancho
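
The error propagation that this comment refers to can be sketched in a few lines. The version below is a simplified error-feedback illustration that spreads each weight's rounding error evenly over the weights not yet quantized; actual GPTQ weights that update using inverse-Hessian information, which is omitted here for brevity:

```python
import numpy as np

def quantize_row_with_error_feedback(row, scale):
    """Quantize one weight at a time, folding each rounding error into
    the weights that have not been quantized yet (simplified sketch)."""
    row = row.astype(np.float32).copy()
    quantized = np.empty_like(row)
    for i in range(len(row)):
        q = np.clip(np.round(row[i] / scale), -127, 127) * scale
        quantized[i] = q
        error = row[i] - q                      # rounding error of this weight
        remaining = len(row) - i - 1
        if remaining:
            row[i + 1:] += error / remaining    # spread the error forward
    return quantized

row = np.random.randn(8).astype(np.float32)
scale = np.abs(row).max() / 127.0
print(quantize_row_with_error_feedback(row, scale))
```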

This was a great explanation, thank you.

smellslikeupdog

Underrated channel. You, sir, deserve more subs.

P.S.: Could you do the same for GGML? And if you already have, a playlist covering GGML, GPTQ, LoRA, QLoRA, 4-bit vs 8-bit, and performance by parameter count (3B, 7B, etc.) would be nice to have. A lot of channels cover the model as a whole, but most of them never cover the process behind the models. Your video made it easy to follow and understand the basics behind LLM quantization. Keep it up.

yolo

Thanks. Where can I find the model 'lmsys_vicuna-7b-delta-v1.1' that you mentioned in your demonstration?

hoatran-lvrj

Great explanation.

Some questions:

When we are quantising and computing the quantisation loss, do we not need to supply some data for it to compute the loss against? If not, how exactly is this loss computed? (Surely we need some inputs and expected outputs to compute this loss; is this why all of the weight errors were 0 when you quantised?)

If we do, could this be interpreted as a form of post-training quantisation 'fine-tuning'? By this I mean that we could use domain data in the quantisation process to help preserve the emergent features in the model that are most useful for our specific domain dataset.

Thanks!

tomm
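
For context on the calibration question above: GPTQ-style quantizers do run a small calibration set through the model and measure, layer by layer, how much quantization changes that layer's outputs on those activations. Below is a minimal sketch of that reconstruction loss with illustrative names and a plain symmetric 4-bit grid, not the actual GPTQ weight update:

```python
import numpy as np

def layer_reconstruction_error(W, W_quant, X):
    """Squared error between the layer's outputs before and after
    quantization, evaluated on calibration activations X."""
    return float(np.linalg.norm(W @ X - W_quant @ X) ** 2)

W = np.random.randn(16, 32).astype(np.float32)    # layer weights
X = np.random.randn(32, 128).astype(np.float32)   # stand-in calibration activations

scale = np.abs(W).max() / 7.0                     # symmetric 4-bit grid: levels -7..7
W_q = np.clip(np.round(W / scale), -7, 7) * scale

print("reconstruction loss:", layer_reconstruction_error(W, W_q, X))
```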