Faster LLM Inference: Speeding up Falcon 7b (with QLoRA adapter) Prediction Time

How can you speed up your LLM inference time?
In this video, we'll optimize the token generation time for our fine-tuned Falcon 7b model with QLoRA. We'll explore various model loading techniques and look into batch inference for faster predictions.
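
For reference, a minimal sketch of what loading the fine-tuned model could look like with transformers, bitsandbytes and peft; the adapter path is a placeholder, not the exact one from the video:

```python
# Sketch: load the Falcon 7b base model in 4-bit and attach a QLoRA adapter.
# ADAPTER is a placeholder path, not the adapter trained in the video.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

MODEL_NAME = "tiiuae/falcon-7b"
ADAPTER = "path/to/your-qlora-adapter"  # hypothetical local folder or Hub repo

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit (NF4)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # run the matmuls in bf16
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, ADAPTER)  # attach the LoRA weights
model.eval()
```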

00:00 - Introduction
01:26 - Google Colab Setup
03:58 - Training Config Baseline
07:06 - Loading in 4 Bit
08:26 - Loading in 8 Bit
10:25 - Batch Inference
12:00 - Lit-Parrot
16:57 - Conclusion
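
A minimal sketch of the batch-inference idea from the 10:25 chapter: tokenize several prompts together (with left padding, since Falcon is a decoder-only model) and call generate once for the whole batch. It assumes the `model` and `tokenizer` from the sketch above:

```python
# Sketch: generate for several prompts in one forward pass.
prompts = [
    "Question: What is QLoRA?\nAnswer:",
    "Question: Why quantize a model to 4-bit?\nAnswer:",
]

tokenizer.padding_side = "left"            # decoder-only models should pad on the left
tokenizer.pad_token = tokenizer.eos_token  # Falcon has no pad token by default

inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        pad_token_id=tokenizer.eos_token_id,
    )

for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```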

Turtle image by stockgiu

#chatgpt #gpt4 #llms #artificialintelligence #promptengineering #chatbot #transformers #python #pytorch
Comments

Man, this is the best channel in the LLM era

sherryhp

Thanks for taking the time to put this video together. Really informative, and it's helping me grasp these concepts.

arinco

Thank you for your great videos! Very informative and straight to the point!

thevadimb

You rule! Thanks so much for doing these videos on LLMs

paulinagc

Thank you for your effort, your videos are great and go straight into practical use. Great work!

hakikitosunpasa

Thank you for the great video. Is there a place for subscribers to ask questions if they join?

jdlovely

Hello, great video so far. Let me ask some questions here:
1. What should I do if my training loss is not decreasing consistently (sometimes up, sometimes down)?
2. How do I use multiple GPUs with QLoRA? I always get OOM with Falcon-40B, so I rented 2 GPUs from a cloud provider. Unfortunately, it only ran on 1 GPU.

IchSan-jxeg
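
On question 2, one way to fit Falcon-40B on two GPUs is to let accelerate shard the quantized model across them via device_map; a minimal sketch (the max_memory limits are placeholders for your cards):

```python
# Sketch: shard a 4-bit Falcon-40B across two GPUs.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b",
    quantization_config=bnb_config,
    device_map="auto",                    # let accelerate place layers on both GPUs
    max_memory={0: "40GiB", 1: "40GiB"},  # placeholder per-GPU limits
    trust_remote_code=True,
)
```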

Thank you so much for this great video. I have some doubts, and it would be great if you could help me understand:
1. The temperature parameter does not change the response at all. Did I do something terribly wrong?
2. The method presented in Lit-Parrot works with the base Falcon model, but how do we load the model that we trained using QLoRA?
Thanks again for such amazing content.

sohelshaikhh
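
On the temperature question: with the default greedy decoding, temperature has no visible effect; it only changes the output once sampling is enabled. A minimal sketch, assuming a model and tokenizer loaded as in the description above (the sampling values are placeholders):

```python
# Sketch: temperature only takes effect when do_sample=True.
inputs = tokenizer("Question: What is QLoRA?\nAnswer:", return_tensors="pt").to(model.device)

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,      # enable sampling so temperature is applied
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```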

Great video! Just curious, why do you have the -- NORMAL --, -- VISUAL --, -- INSERT -- indicators when you click on each code block? Is that some functionality from Google?

澤翰陳

Very informative. Now please make tutorials on LlamaIndex, as there is a lot of buzz around it.

dataflex

Hi Venelin, I just subscribed to your website. Can you tell me if there is a way to limit answers to adapter data only and not the base model?

ikjb

Thanks for the informative video!
One question: Why does loading in 8-bit take less time than 4-bit? Shouldn't it be the other way around, since the 8-bit format has higher precision?

untiteledx
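
For comparing the variants, a minimal sketch of a timing helper that measures generated tokens per second for whatever model is loaded (4-bit, 8-bit, or fp16). Note that a lower bit-width mainly saves memory; the quantized weights still go through extra (de)quantization kernels during generation, so it is not automatically faster:

```python
# Sketch: measure generated tokens per second for a loaded model/tokenizer,
# so the 4-bit and 8-bit variants can be compared on the same prompt.
import time
import torch

def tokens_per_second(model, tokenizer, prompt, max_new_tokens=128):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.inference_mode():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,
        )
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    generated = outputs.shape[1] - inputs["input_ids"].shape[1]
    return generated / elapsed
```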

Can you cover an open-source language model, for example an open-source LLaMA implementation, so a beginner can understand the actual implementation? Stanford Alpaca and the others are all tuned on top of an existing model, but is there anything like LLaMA to really understand from the ground up? I see GPT-Neo or nanoGPT, but they are not actual LLaMA implementations.

kishoretvk

Hi Venelin! Great work, but a quick question.
I'm confused.
How is load_in_8bit better than load_in_4bit and QLoRA bitsandbytes in terms of execution time?
Shouldn't loading in 4-bit be faster than loading in 8-bit?
Please clarify!

shivammittal

The content is really informative; it saved me almost a week :-). One quick question: I was trying to load Falcon 40b from Google Drive, but it showed me the error HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name':
Use `repo_type` argument if needed. Any suggestions? It would be a great help.

prashantjoshi
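
That HFValidationError usually means the string passed to from_pretrained is being treated as a Hub repo id rather than a local folder. When the weights live on Google Drive, mounting the drive and passing the directory path directly is one way around it; a minimal sketch (the path is a placeholder):

```python
# Sketch: load a model from a mounted Google Drive folder instead of the Hub.
from google.colab import drive
from transformers import AutoModelForCausalLM

drive.mount("/content/drive")

LOCAL_PATH = "/content/drive/MyDrive/falcon-40b"  # placeholder: folder with config.json + weights

model = AutoModelForCausalLM.from_pretrained(
    LOCAL_PATH,            # a local directory, not a repo id
    device_map="auto",
    trust_remote_code=True,
)
```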

How can we use Falcon 7b for summarization tasks?

vakkalagaddatarun
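
One way is to phrase summarization as an instruction prompt; a minimal sketch, assuming a model and tokenizer loaded as in the description above (the prompt template is an assumption, not the one from the video):

```python
# Sketch: instruction-style summarization prompt; the template is a placeholder.
text = "Paste the document you want to summarize here."

prompt = (
    "Summarize the following text in two sentences.\n\n"
    f"### Text:\n{text}\n\n"
    "### Summary:\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.3,     # keep summaries fairly deterministic
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```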

Thank you so much, your video is helping a lot with my tests. I have a doubt; maybe you know what is happening. I'm training locally on multiple RTX 3060 Ti GPUs, and I notice it uses all the memory on every GPU, but CUDA is not being used fully on all of them: CUDA hits 100% usage on one GPU while the others don't share the processing equally, with some of them at only 10%. What I did differently from you is remove max_steps to process the entire dataset I'm using, and increase per_device_train_batch_size to 3 and gradient_accumulation_steps to 16 because I'm using 4 GPUs. Do you have any tips?

odev

I think it might be useful to explain what the lines of code are actually doing.

wryltxw

I have seen slower inference times for 4-bit quantized models compared to 8-bit quantized ones; people complain about this with QLoRA.

jaivalani