🔥🚀 Inferencing on Mistral 7B LLM with 4-bit quantization 🚀 - In FREE Google Colab


🔥🚀 Inferencing on Mistral 7B with 4-bit quantization 🚀 | Large Language Models

I explain the BitsAndBytesConfig in detail

📌 Max system RAM used is only 4.5 GB, and

📌 Max GPU VRAM used is 5.9 GB (you can verify both with the snippet below)
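
A quick way to check those numbers yourself in a Colab session (a small sketch using `psutil` and `torch`; run it after loading the model):

```python
# Sketch: report current system RAM and peak GPU VRAM usage in a Colab session.
import psutil
import torch

ram_used_gb = psutil.virtual_memory().used / 1024**3
print(f"System RAM in use: {ram_used_gb:.1f} GB")

if torch.cuda.is_available():
    vram_peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak GPU VRAM allocated: {vram_peak_gb:.1f} GB")
```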

👉 The **`load_in_4bit` parameter** loads the model in 4-bit precision

This means that the model's weights are stored using 4 bits instead of the usual 32 (or 16) bits, while computation still happens in a higher-precision compute dtype. This can significantly reduce the memory footprint of the model: 4-bit weights take roughly 8x less memory than full 32-bit precision weights, and inference can also be faster because far less data has to move through GPU memory, though the exact speedup depends on the hardware and workload.

However, if you need the highest possible accuracy, then you may want to use full precision models.
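
For context, here is a minimal sketch of what such a 4-bit load typically looks like with `BitsAndBytesConfig` (the specific choices below, such as NF4 quantization and the bfloat16 compute dtype, are assumptions, not necessarily the exact settings used in the video):

```python
# Minimal sketch: load Mistral 7B in 4-bit with bitsandbytes via transformers.
# The quantization options below are assumptions, not the video's exact settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit precision
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the actual math in bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                      # place layers on the Colab GPU automatically
)

prompt = "[INST] Explain 4-bit quantization in one sentence. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

With a config like this, only the 4-bit weights sit in VRAM, which is what keeps the footprint small enough for the free Colab GPU.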

-------------------

🔥🐍 Check out my new Python Book, where I cover 350+ Python core fundamental concepts across 1300+ pages, needed for the daily real-life problems of a Python Engineer.

For each of the concepts, I discuss the 'under-the-hood' view of how the Python interpreter handles it.

-----------------

You can find me here:


Other Playlists you might like 👇

----------------------

#LLM #Largelanguagemodels #Llama2 #opensource #NLP #ArtificialIntelligence #datascience #langchain #llamaindex #vectorstore #textprocessing #deeplearning #deeplearningai #100daysofmlcode #neuralnetworks #datascience #generativeai #generativemodels #OpenAI #GPT #GPT3 #GPT4 #chatgpt
Comments

How to use your model in a LangChain agent? I used this but it says the llm value is not a valid dict:

agent = initialize_agent(tools,
                         model,
                         agent="zero-shot-react-description",
                         verbose=True,
                         handle_parsing_errors=True,
                         max_new_tokens=1000)
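
One possible cause (an assumption, not something confirmed here): `initialize_agent` expects a LangChain LLM object rather than a raw `transformers` model. A rough sketch of a fix along those lines, assuming `model`, `tokenizer`, and `tools` are already defined:

```python
# Sketch of a possible fix: wrap the Hugging Face model in a LangChain LLM first.
# Assumes `model`, `tokenizer`, and `tools` already exist in the session.
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
from langchain.agents import initialize_agent

hf_pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1000,                   # generation limit belongs on the pipeline
)
llm = HuggingFacePipeline(pipeline=hf_pipe)

agent = initialize_agent(
    tools,
    llm,                                   # a LangChain LLM, not the raw model
    agent="zero-shot-react-description",
    verbose=True,
    handle_parsing_errors=True,
)
```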

manueljan

Great video, sweet and simple. However, how can we control the max token limit? And do we have the option of separating our messages into a system message and a user message, just like in OpenAI?

efpvduj

Hi Sir,
Could you tell us your mic setup and how you make your videos with such clear quality? Thanks

saravanajogan

What is better: quantizing with "bitsandbytes" or doing it with llama.cpp GGUF? What is the difference?

javiergimenezmoya

hi, is there a simple change that can be made to the code to run inference in 8-bit?
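
If the notebook uses `BitsAndBytesConfig`, one simple change (a sketch, assuming the same `model_id` and loading code as in the 4-bit example above) would be to request 8-bit weights instead:

```python
# Sketch: 8-bit variant of the quantization config (assumes the same model_id as above).
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config_8bit = BitsAndBytesConfig(load_in_8bit=True)  # 8-bit weights via bitsandbytes

model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,  # e.g. "mistralai/Mistral-7B-Instruct-v0.1"
    quantization_config=bnb_config_8bit,
    device_map="auto",
)
```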

JavMend

Sir, any advice if I use the Japanese or Chinese language for RAG? Thanks

vinsmokearifka

Hello there, this is exactly what I was looking for. Could you please give resources or any tutorial where details of those functions are discussed?

My teammate gave me a Kaggle Notebook with the exact same code, and I am continuing to make that a conversational chatbot. But since I am brand new to this, I feel lost now.

gazzalifahim

Thanks for your tutorial. I have a question: how to generate output up to 32k tokens?

seinaimut

Great video, can you make a video on fine-tuning an LLM with the best method?

venkateshr

Loved your content buddy ❤. Can we keep this Google Colab instance running for free, and how can we expose this model as a REST API to use in hosted projects, not just locally?

thehkmalhotra

Hi, I got my token from Hugging Face but I don't know where I have to put it in Colab

tomasgarcia

Can you make a video on how to use an open-source LLM to query a structured database (SQL/pandas) for chat

anuvratshukla

Colab file not found, please give the notebook link

onesecondnanba

Can we do this type of quantization with any model?

xewhtwq