Go Production: ⚡️ Super FAST LLM (API) Serving with vLLM!!!

vLLM is a fast and easy-to-use library for LLM inference and serving (a minimal usage sketch appears below the feature list).

vLLM is fast with:

State-of-the-art serving throughput
Efficient management of attention key and value memory with PagedAttention
Continuous batching of incoming requests
Optimized CUDA kernels
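
For reference, a minimal offline-inference sketch with vLLM's Python API; the model name and sampling values are illustrative placeholders, and any HuggingFace model that vLLM supports works the same way:

# Minimal offline-inference sketch; facebook/opt-125m is just a small placeholder model.
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "Continuous batching means that",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# vLLM pages the KV cache with PagedAttention and batches requests internally,
# so simply passing several prompts at once is enough to benefit from it.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
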
vLLM is flexible and easy to use with:

Seamless integration with popular HuggingFace models
High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
Tensor parallelism support for distributed inference
Streaming outputs
OpenAI-compatible API server (see the server sketch below)
vLLM seamlessly supports many HuggingFace models.
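
A hedged sketch of querying the OpenAI-compatible server from Python; the launch command, port, and flags can differ between vLLM versions, and the model name is again a placeholder:

# Query vLLM's OpenAI-compatible endpoint.
# Assumes the server was started separately, e.g. (exact flags vary by vLLM version):
#   python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m
# Multi-GPU serving would add something like --tensor-parallel-size 2.
# Requires: pip install requests
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",  # default host/port assumed
    json={
        "model": "facebook/opt-125m",  # must match the model the server loaded
        "prompt": "vLLM is fast because",
        "max_tokens": 64,
        "temperature": 0.7,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])

Because the endpoint mirrors the OpenAI completions API, the same request can be sent from curl, Postman, or the official OpenAI client pointed at the local base URL.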

❤️ If you want to support the channel ❤️
Support here:
Comments

my dude you're the superhero of these tutorials! I was just thinking about how I'm annoyed these LLMs take so long to respond. And bam, you posted this wonderful video! Thank you!

seanmurphy

I ended up here after not being able to get OpenLLM working due to various issues on my PC locally. This is awesome. I got it working in a few minutes with different models. Thank you! Subscribing!

khorLDW

great work man...
can't thank you enough. thanks again. great to see more Indian AI tech talent out there :D

sujantkumarkv

Thank you, your videos are becoming a daily thing for me

mohegyux

Thanks, you are the vLLM of this space
love the speed of your videos. Colab lets more of us learn with less $$

deabyam

Inference was the main bottleneck of LLMs, this is amazing, thank you so much 🤩🤩🤩. Please make a video on the PagedAttention algorithm 🤩🤩

shamaldesilva

Fantastic video! Just what I wanted to see.

gpsb

Amazing tutorial my friend! Was looking for this. It would have been more helpful if you could explain how to deploy LLMs served with the vLLM inference engine.

maazshaikh

Wow. Thank you for always bringing news to us ;)

MarceloLimaXP

Great information, really appreciate it 🎉🎉🎉.
If possible, can you show how we can add our own data (an Excel report) alongside the model so that the LLM can answer from our data too?

arjunm

mad props to the creator for breaking down vLLM and its advantages over traditional LLM serving

that PagedAttention tech sounds lit, giving it those crazy throughput numbers

but, not gonna lie, using Google Colab as a production environment? kinda sus

still, respect the hustle for making it accessible to peeps without fancy GPUs

mad respect for that grind

moondevonyt

amazing 🔥🔥, btw any idea how we can plug this into Gradio? that way sharing and access would be much easier

anki

Thanks for this wonderful video. I want to know: can we do RAG over the model with vLLM? Also, can we run vLLM in a Kubernetes cluster?

santoshshetty

Thanks for the video; the only problem is that I'm getting a "torch.cuda.OutOfMemoryError: CUDA out of memory" error when trying to serve the LLM in Google Colab. Is there a way to change the batch size using vLLM parameters?

cmeneseslob

Hello, in your opinion, what's better to use in production, TGI or vLLM?

aliissa

Awesome, man! Does it work with LangChain?

pavanpraneeth

Does it support quantised models? Is it supported in Oobabooga already? Quite an interesting topic - I hear ppl are using it in production often

alx

Awesome video! But while inferencing and calling the endpoint from Postman, it shows that the Jupyter notebook server is running, not the answer from the LLM (/v1/completions).

sakshatkatyarmal

So if I had to choose, one of the best LLMs from that selection would be Falcon?

ilianos

How does it compare to 4-bit quantization?

aozynoob